Difference between BPTT and TBPTT
A graphical representation of BPTT and TBPTT is shown below:

In standard backpropagation through time (BPTT), the full history of activations and inputs from the forward pass (blue arrows above, representing hidden/internal state flow) must be stored for use in the backpropagation step (red arrows show gradient flow). This can be both computationally and memory intensive, especially for something like a character-level language model with very long sequences.
TBPTT, by contrast, limits the number of timesteps used on the backward pass, so it estimates, rather than fully computes, the gradient used to update the weights.
Why use TBPTT over BPTT?
Truncated Backpropagation Through Time (TBPTT) offers the computational benefits of BPTT while eliminating the need for a complete backward pass through the whole data sequence at every update. However, truncation favors short-term dependencies: the gradient estimate of truncated BPTT is biased. Therefore, it does not benefit from the convergence guarantees of stochastic gradient theory.
Preparing Sequence Data for TBPTT
The number of timesteps utilized in the forward and backward passes of BPTT is determined by how you divide up your sequence data.
Use Data As-Is
If the number of timesteps in each sequence is small, such as tens or a few hundred, you can use the input sequences as-is. TBPTT has been suggested to have a practical limit of around 200 to 400 timesteps. If your sequences fall within or below this range, you can reshape the sequence observations directly into the timesteps of the input data.
For example, if you had a collection of 100 univariate sequences with 25 timesteps, you could reshape it into 100 samples, 25 timesteps, and 1 feature, or [100, 25, 1].
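As a minimal sketch (assuming NumPy and the [samples, timesteps, features] layout expected by Keras recurrent layers, with made-up data), the reshape might look like this:

```python
import numpy as np

# Hypothetical dataset: 100 univariate sequences of 25 timesteps each,
# stored as a flat 2D array of shape (100, 25).
data = np.random.rand(100, 25)

# Reshape to the 3D layout Keras recurrent layers expect:
# [samples, timesteps, features] -> [100, 25, 1]
data = data.reshape((100, 25, 1))
print(data.shape)  # (100, 25, 1)
```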
Naive Data Split
If your input sequences are large, such as hundreds of timesteps, you may need to divide them up into many contiguous subsequences.
This will require a stateful LSTM in Keras, with the internal state retained across sub-sequence inputs and only reset at the end of a true, full input sequence.
If you had 100 input sequences with 50,000 timesteps, for example, each one might be broken into 100 subsequences of 500 timesteps. Each original sequence would then yield 100 samples, so the original 100 sequences become 10,000 samples. Keras' input would have a dimensionality of 10,000 samples, 500 timesteps, and 1 feature, or [10000, 500, 1]. Care would be needed to preserve the state across each run of 100 subsequences and to explicitly or implicitly reset the internal state after every 100 samples.
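A minimal sketch of this setup is shown below, using the tf.keras API. The layer size, batch size of 1, and random data are illustrative assumptions; the key points are the stateful LSTM and the explicit state reset at each true sequence boundary.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Illustrative data: 100 original sequences, each split into 100
# contiguous subsequences of 500 timesteps -> 10,000 samples total.
# Rows are ordered so the 100 subsequences of one sequence are contiguous.
X = np.random.rand(10000, 500, 1)
y = np.random.rand(10000, 1)

# A stateful LSTM keeps its internal state between batches, so the
# batch size is fixed; batch_size=1 keeps the example simple.
model = Sequential([
    LSTM(32, batch_input_shape=(1, 500, 1), stateful=True),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Manual training loop: carry state across the 100 subsequences of one
# original sequence, then reset at the true sequence boundary.
for epoch in range(3):
    for i in range(len(X)):
        model.train_on_batch(X[i:i + 1], y[i:i + 1])
        if (i + 1) % 100 == 0:  # end of an original 50,000-step sequence
            model.reset_states()
```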
Domain-Specific Data Split
Knowing the proper number of timesteps to generate a useful estimate of the error gradient can be difficult.
We can generate a model rapidly using the naïve technique, but the model may not be optimal. Alternatively, while learning the issue, we may utilize domain-specific knowledge to predict the number of timesteps that will be important to the model. If the sequence problem is a regression time series, looking at the autocorrelation and partial autocorrelation plots might help you decide on the number of timesteps to use.
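For instance, a hedged sketch using the statsmodels library (the file name and lag count are placeholders) shows how such plots can hint at how many past timesteps still carry signal:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Assumed: a univariate time series loaded as a pandas Series.
series = pd.read_csv("my_series.csv", index_col=0).squeeze("columns")

# Significant lags in these plots suggest how many past timesteps
# plausibly matter, and hence a reasonable subsequence length.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=50, ax=axes[0])
plot_pacf(series, lags=50, ax=axes[1])
plt.tight_layout()
plt.show()
```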
Systematic Data Split
You can systematically examine a suite of possible subsequence lengths for your sequence prediction challenge rather than guessing at a reasonable number of timesteps.
You might do a grid search over each sub-sequence length and pick the arrangement that produces the best overall model.
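A minimal sketch of such a search is shown below; the candidate lengths, the `make_subsequences` helper, the tiny model, and the crude next-value target are all illustrative assumptions to be adapted to your problem.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_subsequences(sequence, length):
    """Split one long univariate sequence into [samples, length, 1]."""
    n = len(sequence) // length
    return sequence[:n * length].reshape((n, length, 1))

# Hypothetical long univariate sequence.
long_seq = np.random.rand(10000)

results = {}
for length in [25, 50, 100, 200, 400]:  # candidate subsequence lengths
    X = make_subsequences(long_seq[:-1], length)
    y = long_seq[length::length][:len(X)].reshape(-1, 1)  # next-value target

    model = Sequential([LSTM(16, input_shape=(length, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
    results[length] = min(history.history["val_loss"])

best = min(results, key=results.get)
print(results, "best length:", best)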
Lean Heavily On Internal State With TBPTT
Each timestep of your sequence prediction problem may be reformulated as having one input and one output.
If you had 100 sequences of 50 timesteps, for example, each timestep would become a new sample. The original 100 samples would become 5,000 samples (100 sequences x 50 timesteps). The three-dimensional input would be [5000, 1, 1], or 5,000 samples, 1 timestep, and 1 feature.
Again, this would require preserving the internal state across timesteps and resetting it at the end of each real sequence (every 50 samples).
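A sketch of this one-timestep setup, following the illustrative sizes above, might look like the following (data and layer sizes are placeholders):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# 100 original sequences of 50 timesteps, flattened so every timestep
# is its own sample: [5000, 1, 1].
X = np.random.rand(5000, 1, 1)
y = np.random.rand(5000, 1)

model = Sequential([
    LSTM(16, batch_input_shape=(1, 1, 1), stateful=True),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# With one timestep per sample, all sequence memory lives in the
# LSTM's internal state; reset it at each true sequence boundary.
for i in range(len(X)):
    model.train_on_batch(X[i:i + 1], y[i:i + 1])
    if (i + 1) % 50 == 0:  # end of one original 50-timestep sequence
        model.reset_states()
```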
Decouple Forward and Backward Sequence Length
The Keras deep learning library used to support a decoupled number of timesteps for the forward and backward passes of Truncated Backpropagation Through Time.
In essence, the k1 parameter could be specified by the number of timesteps on the input sequences, while the k2 parameter could be specified by the truncate_gradient argument on the LSTM layer.
FAQs
- What is TBPTT?
TBPTT, also called Truncated BPTT, is a modified version of backpropagation through time that limits how many timesteps the backward pass covers.
- What is backpropagation used for?
Backpropagation is a technique for calculating derivatives quickly. Artificial neural networks use it to compute the gradient of the loss with respect to the weights, which is then used for gradient descent.
- What is the BPTT algorithm?
It is the application of the Backpropagation training algorithm to recurrent neural networks applied to sequence data, for example, time-series data. A recurrent neural network is shown one input each timestep and predicts one output. Conceptually, BPTT works by unrolling all input timesteps.
- What is ARTBP?
Anticipated Reweighted Truncated Backpropagation (ARTBP) is an unbiased approach that keeps the computational benefits of truncated BPTT. ARTBP uses varying truncation lengths together with carefully chosen compensation factors in the backpropagation equation.
Key Takeaways
In this article, we have seen the following topics:
- Introduction to TBPTT
- Difference between TBPTT and BPTT
- Preparing sequence data for TBPTT
Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning.
Happy Coding!