Table of contents
1. Introduction
2. Definition
3. Difference between BPTT and TBPTT
4. Why use TBPTT over BPTT?
5. Preparing Sequence Data for TBPTT
  5.1. Use Data As-Is
  5.2. Naive Data Split
  5.3. Domain-Specific Data Split
  5.4. Systematic Data Split
  5.5. Lean Heavily On Internal State With TBPTT
  5.6. Decouple Forward and Backward Sequence Length
6. FAQs
7. Key Takeaways

Truncated BPTT

Author: Soham Medewar

Introduction

In sequence prediction problems, recurrent neural networks can learn temporal dependencies that span several time steps.

Backpropagation Through Time (BPTT) is a version of the backpropagation algorithm used to train modern recurrent neural networks such as the Long Short-Term Memory (LSTM) network. Truncated Backpropagation Through Time (TBPTT) is a modified version of this approach that is more efficient on sequence prediction problems with very long sequences.

When training recurrent neural networks such as LSTMs with Truncated Backpropagation Through Time, choosing how many timesteps to use as input is an important configuration decision. In other words, you must decide how to break your very long input sequences into subsequences for the best efficiency.


Definition

In truncated backpropagation through time (TBPTT), the input is processed as fixed-length subsequences. In the forward pass, the hidden state of the previous subsequence is passed as input to the following subsequence. In the backward pass, however, the computed gradient values are dropped at the end of each subsequence. In standard backpropagation, the gradient values at time t' flow back to every earlier time step t with t < t'; in truncated backpropagation, if t' - t exceeds the subsequence length, the gradients do not flow from t' to t.
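
To make this concrete, below is a minimal sketch of TBPTT, assuming TensorFlow/Keras and synthetic regression data (the truncation length, layer sizes, and targets are illustrative placeholders, not from the article). The hidden state is carried forward across fixed-length subsequences, but the gradient tape covers only one subsequence at a time, so gradients stop at the subsequence boundaries.

import numpy as np
import tensorflow as tf

k = 25                                             # truncation (subsequence) length
cell = tf.keras.layers.LSTMCell(32)
rnn = tf.keras.layers.RNN(cell, return_state=True)
head = tf.keras.layers.Dense(1)
optimizer = tf.keras.optimizers.Adam()

# Synthetic data: 8 long sequences of 500 timesteps, 1 feature, 1 target per timestep.
x = np.random.rand(8, 500, 1).astype("float32")
y = np.random.rand(8, 500, 1).astype("float32")

state = [tf.zeros((8, 32)), tf.zeros((8, 32))]     # initial hidden and cell state
for start in range(0, 500, k):
    x_sub = x[:, start:start + k, :]               # one fixed-length subsequence
    y_sub = y[:, start + k - 1, :]                 # target at the subsequence's last step
    with tf.GradientTape() as tape:
        out, h, c = rnn(x_sub, initial_state=state)           # forward pass continues from prior state
        loss = tf.reduce_mean(tf.square(head(out) - y_sub))
    variables = rnn.trainable_variables + head.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    # The state is carried forward but detached, so gradients do not flow across the boundary.
    state = [tf.stop_gradient(h), tf.stop_gradient(c)]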

Difference between BPTT and TBPTT

A graphical comparison of BPTT and TBPTT highlights the difference.

In standard backpropagation through time (BPTT), the full history of activations and inputs from the forward pass (the flow of hidden/internal state) must be stored for use in the backpropagation step (the flow of gradients). This can be both computationally and memory intensive, especially for a character-level language model.

TBPTT, in contrast, reduces the number of timesteps used in the backward pass, estimating the gradient used to update the weights rather than computing it in full.

Why use TBPTT over BPTT?

Truncated Backpropagation Through Time (TBPTT) offers the computational benefits of BPTT while removing the need for a complete pass back through the whole data sequence at every step. However, truncation favors short-term dependencies: the gradient estimate of truncated BPTT is biased. It therefore does not benefit from the convergence guarantees of stochastic gradient theory.

Preparing Sequence Data for TBPTT

The number of timesteps utilized in the forward and backward passes of BPTT is determined by how you divide up your sequence data.

Use Data As-Is

If the number of timesteps in each sequence is small, such as tens or a few hundred, you can use the input sequences as-is. A practical limit of roughly 200 to 400 timesteps has been suggested for TBPTT. If your sequence data is at or below this range, you can reshape the sequence observations as the timesteps of the input data.

For example, if you had a collection of 100 univariate sequences with 25 timesteps, you could reshape it into 100 samples, 25 timesteps, and 1 feature, or [100, 25, 1].
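
As a quick, hedged illustration with synthetic data (the array contents are placeholders), the reshape could look like this:

import numpy as np

data = np.random.rand(100, 25)        # 100 univariate sequences of 25 timesteps each
X = data.reshape(100, 25, 1)          # [samples, timesteps, features]
print(X.shape)                        # (100, 25, 1)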

Naive Data Split

If your input sequences are large, such as hundreds of timesteps, you may need to divide them up into many contiguous subsequences.

This will require a stateful LSTM in Keras, with the internal state retained across subsequence inputs and reset only at the end of each true, full input sequence.

For example, if you had 100 input sequences of 50,000 timesteps, each could be split into 100 subsequences of 500 timesteps. Each input sequence would then yield 100 samples, for a total of 10,000 samples from the 100 original sequences. The input to Keras would have a dimensionality of 10,000 samples, 500 timesteps, and 1 feature, or [10000, 500, 1]. Care would be needed to preserve the state across each run of 100 subsequences and to reset the internal state, explicitly or implicitly, after every 100 samples.
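
Below is a minimal sketch of this setup, assuming the tf.keras stateful API, synthetic data, and an arbitrary small LSTM (the data, layer width, and targets are placeholders). Each original sequence is reshaped into [100, 500, 1], trained one subsequence at a time, and the internal state is reset only after the full original sequence has been seen.

import numpy as np
import tensorflow as tf

n_seq, seq_len, sub_len = 100, 50_000, 500
n_sub = seq_len // sub_len                          # 100 subsequences per original sequence

# A stateful LSTM needs a fixed batch size; here it processes one subsequence at a time.
inputs = tf.keras.Input(shape=(sub_len, 1), batch_size=1)
hidden = tf.keras.layers.LSTM(32, stateful=True)(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

for i in range(n_seq):
    seq = np.random.rand(seq_len, 1)                # stand-in for one real 50,000-step sequence
    X = seq.reshape(n_sub, sub_len, 1)              # [100, 500, 1] for this sequence
    y = np.random.rand(n_sub, 1)                    # one synthetic target per subsequence
    model.fit(X, y, batch_size=1, epochs=1, shuffle=False, verbose=0)
    model.reset_states()                            # reset only at the end of the true full sequence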

Domain-Specific Data Split

Knowing the proper number of timesteps to generate a useful estimate of the error gradient can be difficult.

The naïve technique lets us build a model quickly, but that model may not be optimal. Alternatively, we can use domain-specific knowledge of the problem to estimate the number of timesteps that will matter to the model. If the sequence problem is a time series regression, the autocorrelation and partial autocorrelation plots can help you decide on the number of timesteps to use.
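
For example, a short sketch using statsmodels on a synthetic series (the series and number of lags are placeholders for your own data):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# A synthetic seasonal series standing in for a real time series.
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(1000)

plot_acf(series, lags=60)     # strong correlation around lag 24 suggests a useful history length
plot_pacf(series, lags=60)
plt.show()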

Systematic Data Split

You can systematically examine a suite of possible subsequence lengths for your sequence prediction challenge rather than guessing at a reasonable number of timesteps.

You might do a grid search over each sub-sequence length and pick the arrangement that produces the best overall model.
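
A minimal sketch of such a search, assuming a synthetic univariate sequence, a small LSTM, and a hypothetical make_subsequences helper (the candidate lengths, model size, and epoch count are illustrative):

import numpy as np
import tensorflow as tf

def make_subsequences(seq, length):
    # Hypothetical helper: split a [timesteps, features] array into [samples, length, features].
    n = (len(seq) // length) * length
    return seq[:n].reshape(-1, length, seq.shape[-1])

seq = np.random.rand(10_000, 1).astype("float32")       # stand-in for one long sequence
tgt = np.random.rand(10_000, 1).astype("float32")

results = {}
for length in [50, 100, 200, 400]:                      # candidate subsequence lengths
    X = make_subsequences(seq, length)
    y = make_subsequences(tgt, length)[:, -1, :]        # predict the last step of each subsequence
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(length, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
    results[length] = min(history.history["val_loss"])

print(results, "best length:", min(results, key=results.get))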

Lean Heavily On Internal State With TBPTT

Each timestep of your sequence prediction problem may be reformulated as having one input and one output.

For example, if you had 100 sequences of 50 timesteps, each timestep would become a new sample. The original 100 sequences would become 5,000 samples. The three-dimensional input would be [5000, 1, 1], or 5,000 samples, 1 timestep, and 1 feature.

Again, this would require preserving the internal state across the timesteps of each sequence and resetting it at the end of each real sequence (every 50 samples).
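
As a sketch of this one-timestep-per-sample formulation (synthetic data, an arbitrary layer size, and the tf.keras stateful API assumed):

import numpy as np
import tensorflow as tf

# Each timestep becomes its own sample: 100 sequences x 50 timesteps -> 5,000 samples of shape [1, 1].
inputs = tf.keras.Input(shape=(1, 1), batch_size=1)
hidden = tf.keras.layers.LSTM(16, stateful=True)(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

sequences = np.random.rand(100, 50, 1)      # stand-in for 100 real sequences of 50 timesteps
targets = np.random.rand(100, 50, 1)
for seq, y in zip(sequences, targets):
    X = seq.reshape(50, 1, 1)               # 50 samples, 1 timestep, 1 feature
    model.fit(X, y, batch_size=1, epochs=1, shuffle=False, verbose=0)
    model.reset_states()                    # reset the internal state at the end of each real sequence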

Decouple Forward and Backward Sequence Length

The Keras deep learning library can be used to support a different number of timesteps for the forward and backward passes of Truncated Backpropagation Through Time.

In essence, the number of timesteps on the input sequences can be used to specify the k1 parameter (the forward-pass length), while the truncate_gradient argument on the LSTM layer can be used to specify the k2 parameter (the backward-pass length).

FAQs

  1. What is TBPTT?
    TBPTT, also called Truncated BPTT, is a modified version of backpropagation through time in which the backward pass is cut off after a fixed number of timesteps.
     
  2. What is backpropagation used for?
    Backpropagation is a technique for calculating derivatives efficiently. It is the learning technique artificial neural networks use to compute the gradient of the loss with respect to the weights for gradient descent.
     
  3. What is the BPTT algorithm?
    It is the application of the backpropagation training algorithm to recurrent neural networks applied to sequence data, such as time series. A recurrent neural network is shown one input per timestep and predicts one output. Conceptually, BPTT works by unrolling the network across all input timesteps.
     
  4. What is ARTBP?
    Anticipated Reweighted Truncated Backpropagation (ARTBP) is an unbiased approach that maintains the computational benefits of truncated BPTT. In the backpropagation equation, ARTBP uses varying truncation lengths and carefully adjusted compensating components.

Key Takeaways

In this article, we have seen the following topics:

  • Introduction to TBPTT
  • Difference between TBPTT and BPTT
  • Preparing sequence data for TBPTT


Happy Coding!
