Table of contents
Need of LSTM
The Architecture of the LSTM Unit
Input gate
Forget gate
Output gate
How does LSTM work?
What is Bi-LSTM?
Working of Bi-LSTM
Bi-LSTM in Keras
Key Takeaways
Last Updated: Mar 27, 2024

Bidirectional LSTM

Author: Soham Medewar


To understand how Bi-LSTM works, we first need to understand the LSTM unit cell and the LSTM network. LSTM stands for long short-term memory. Hochreiter and Schmidhuber introduced LSTM networks in 1997. They are among the most commonly used recurrent neural networks.

Need of LSTM

Recurrent neural networks handle sequential data well, but it is often also necessary to remember information from earlier in the sequence. For example, “I will play cricket” and “I can play cricket” are two sentences with different meanings. The meaning depends on a single word, so the network must retain information about previous words. A simple RNN has no dedicated memory for this and struggles to retain such information over long sequences. LSTM was designed to solve this problem.

The Architecture of the LSTM Unit

The image below is the architecture of the LSTM unit.

The LSTM unit has three gates:

Input gate

First, the current input x(t) and the previous hidden state h(t-1) are passed into the input gate, i.e., the second sigmoid function. The sigmoid squashes the values to between 0 and 1, where 0 means not important and 1 means important. In parallel, the same current input and hidden state are passed through the tanh function. The output of the tanh function ranges from -1 to 1, which helps regulate the network. The outputs of the two activation functions are then multiplied point by point.

Forget gate

The forget gate decides which information needs to be kept for further processing and which can be discarded. The previous hidden state h(t-1) and the current input x(t) are passed through the sigmoid function, which produces values between 0 and 1. A value closer to 1 means the corresponding part of the previous cell state should be kept, and a value closer to 0 means it can be forgotten.

Output gate

The output gate decides the value of the next hidden state, which carries information about previous inputs. First, the current input and the previous hidden state are passed into the third sigmoid function. Then the newly updated cell state is passed through the tanh function. These two outputs are multiplied point by point, and the result determines which information the hidden state should carry. This hidden state is used for prediction.

Finally, the new cell state and the new hidden state are carried over to the next step.
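
The three gates described above are commonly written as the following set of equations. This is the standard textbook formulation rather than something taken from this article's figures, so the symbols are assumptions: σ is the sigmoid function, ⊙ is point-by-point multiplication, [h_{t-1}, x_t] is the previous hidden state h(t-1) concatenated with the current input x(t), and W and b are the learned weights and biases of each block.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
```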

To conclude, the forget gate determines which information from the prior steps is still relevant. The input gate decides what relevant information can be added from the current step, and the output gate determines the next hidden state.

How does LSTM work?

The Long Short-Term Memory architecture was motivated by an analysis of error flow in existing RNNs, which found that long time lags were inaccessible to existing architectures because the backpropagated error either blows up or decays exponentially.
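
To get a feel for why the error blows up or decays, here is a rough arithmetic sketch. It assumes, as a deliberate simplification, that backpropagating through each time step multiplies the error by the same scalar factor. The function name `backpropagated_scale` and the values 0.9, 1.1, and 100 are made up for illustration.

```python
def backpropagated_scale(factor, steps):
    """Scale of the error after `steps` time steps, assuming the same
    scalar `factor` is applied at every step (a simplification)."""
    return factor ** steps

steps = 100
decayed = backpropagated_scale(0.9, steps)   # factor slightly below 1
exploded = backpropagated_scale(1.1, steps)  # factor slightly above 1

print(f"factor 0.9 over {steps} steps: {decayed:.2e}")
print(f"factor 1.1 over {steps} steps: {exploded:.2e}")
```

Even a factor this close to 1 shrinks the error to almost nothing, or inflates it by several orders of magnitude, once the sequence is long enough.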

An LSTM layer is made up of memory blocks that are recurrently linked. These blocks can be thought of as a differentiable version of a digital computer's memory chips. Each one has recurrently connected memory cells as well as three multiplicative units – the input, output, and forget gates – that offer continuous analogs of the cells' write, read, and reset operations. 
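
The write, read, and reset analogy can be sketched as a single time step of one LSTM cell in plain Python. This is a minimal sketch of the standard formulation, not a production implementation: the helper names (`lstm_step`, `make_params`), the toy weights of 0.1, and the bias values are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W @ v + b for a list-of-lists matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of an LSTM cell (hypothetical helper).

    params maps a block name to (W, b), where W acts on [h_prev, x_t].
    """
    z = h_prev + x_t  # list concatenation: [h(t-1), x(t)]

    def block(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f = block("forget", sigmoid)             # reset: how much of c_prev to keep
    i = block("input", sigmoid)              # write: how much new content to store
    c_tilde = block("candidate", math.tanh)  # proposed new content in (-1, 1)
    o = block("output", sigmoid)             # read: how much of the cell to expose

    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

def make_params(hidden, inputs, biases):
    """Toy parameters: every weight is 0.1, each block gets the given bias."""
    return {name: ([[0.1] * (hidden + inputs) for _ in range(hidden)], [bias] * hidden)
            for name, bias in biases.items()}

# Forget gate pushed towards 1 and input gate towards 0:
# the cell should keep its old content almost unchanged.
params = make_params(hidden=2, inputs=1,
                     biases={"forget": 20.0, "input": -20.0,
                             "candidate": 0.0, "output": 0.0})
c_prev = [0.7, -0.3]
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.1, -0.2], c_prev=c_prev, params=params)
print("previous cell state:", c_prev)
print("new cell state:     ", c_t)
```

Because the gates are smooth values between 0 and 1 rather than hard switches, the whole step stays differentiable, which is what allows the network to learn when to write, read, and reset.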

What is Bi-LSTM?

Bidirectional LSTM networks function by presenting each training sequence forward and backward to two independent LSTM networks, both of which are coupled to the same output layer. This means that the Bi-LSTM contains comprehensive, sequential information about all points before and after each point in a particular sequence. 

In other words, rather than encoding the sequence in the forward direction only, we also encode it in the backward direction and concatenate the results of the forward and backward LSTM at each time step. The encoded representation of each word now captures information about the words before and after it.

Below is the basic architecture of Bi-LSTM.


Working of Bi-LSTM

Let us understand the working of Bi-LSTM using an example. Consider the sentence “I will swim today”. The below image represents the encoded representation of the sentence in the Bi-LSTM network.

In the forward LSTM, “I” is passed into the network at time t = 0, “will” at t = 1, “swim” at t = 2, and “today” at t = 3. In the backward LSTM, “today” is passed into the network at time t = 0, “swim” at t = 1, “will” at t = 2, and “I” at t = 3. In this way, the results of both the forward and backward LSTM are calculated at each time step.
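
The ordering in this example can be traced with a small plain-Python sketch. The function `run_direction` is a toy stand-in for an LSTM and is purely hypothetical: instead of computing a hidden state, it records which words the network has seen so far, which is enough to show what each direction knows at every position.

```python
def run_direction(tokens):
    """Toy stand-in for an LSTM: the 'state' after each step is simply
    the tuple of tokens seen so far (no real computation)."""
    states, seen = [], []
    for token in tokens:
        seen.append(token)
        states.append(tuple(seen))
    return states

sentence = ["I", "will", "swim", "today"]

forward = run_direction(sentence)
# Feed the reversed sentence, then flip the results back so that
# index k of `backward` lines up with word k of the original sentence.
backward = run_direction(sentence[::-1])[::-1]

# Pair the forward and backward views at each position.
combined = list(zip(forward, backward))

for word, (fwd, bwd) in zip(sentence, combined):
    print(f"{word!r}: forward saw {fwd}, backward saw {bwd}")
```

At the position of “swim”, for instance, the forward pass has seen “I will swim” while the backward pass has seen “today swim”, so the combined representation covers the whole sentence.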

Bi-LSTM in Keras

To implement Bi-LSTM in Keras, we need to import the Bidirectional class and the LSTM class provided by Keras.

First, let us understand the syntax of the LSTM layer. There is one mandatory argument in the LSTM layer, i.e., the number of LSTM units in a particular layer.
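
A minimal sketch of creating an LSTM layer is given here, assuming TensorFlow's bundled Keras is installed. The unit count of 64 and the dummy input shape (2 sequences, 4 time steps, 8 features) are arbitrary values chosen for illustration.

```python
import numpy as np
from tensorflow.keras.layers import LSTM

# The only required argument is the number of LSTM units (64 is arbitrary).
lstm_layer = LSTM(64)

# Dummy batch with hypothetical sizes: (batch, time steps, features).
x = np.random.rand(2, 4, 8).astype("float32")

# By default the layer returns only the last hidden state of each sequence.
output = lstm_layer(x)
print(output.shape)  # (2, 64)
```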


The LSTM layer accepts many other arguments, such as activation, recurrent_activation, use_bias, kernel_initializer, recurrent_initializer, and bias_initializer. All of these have default values, so you only need to specify the ones you want to change. (For more information, see the Keras documentation for the LSTM layer.)

Now, to implement the Bi-LSTM, we just need to wrap the LSTM layer inside the Bidirectional class.
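
A minimal sketch of the wrapping step follows, again assuming TensorFlow's bundled Keras. The small model around it (an Embedding layer, the Bi-LSTM, and a Dense output) and all of its sizes are assumptions added only to show the layer in context.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

# Wrapping an LSTM layer in Bidirectional is all that is needed for a Bi-LSTM.
bilstm_layer = Bidirectional(LSTM(32))

# Hypothetical toy model: vocabulary of 1000 tokens, 16-dimensional embeddings.
model = Sequential([
    Embedding(input_dim=1000, output_dim=16),
    bilstm_layer,
    Dense(1, activation="sigmoid"),
])

# Dummy batch of 2 sequences, each with 4 integer token ids.
token_ids = np.array([[1, 5, 9, 2], [7, 3, 4, 8]])
predictions = model(token_ids)
print(predictions.shape)  # (2, 1)
```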


The Bidirectional layer accepts other arguments such as merge_mode and backward_layer. (For more details, see the Keras documentation for the Bidirectional layer.)
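
The effect of merge_mode is easiest to see in the output shapes. The sketch below, assuming TensorFlow's bundled Keras, compares the default "concat" mode with "sum". The unit count of 16 and the dummy input shape are arbitrary.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Bidirectional

# Dummy batch with hypothetical sizes: (batch, time steps, features).
x = np.random.rand(2, 4, 8).astype("float32")

# return_sequences=True keeps the output of every time step.
concat_out = Bidirectional(LSTM(16, return_sequences=True), merge_mode="concat")(x)
sum_out = Bidirectional(LSTM(16, return_sequences=True), merge_mode="sum")(x)

shapes = {"concat": tuple(concat_out.shape), "sum": tuple(sum_out.shape)}
print(shapes)  # {'concat': (2, 4, 32), 'sum': (2, 4, 16)}
```

With "concat", the forward and backward outputs are placed side by side, so the last dimension doubles. With "sum", they are added element by element, so the size stays the same.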


1. What is the difference between GRU and LSTM?

A: GRU has two gates, i.e., the reset and update gates, whereas LSTM has three gates, i.e., the input, output, and forget gates. GRU is often preferred for smaller datasets, whereas LSTM is often preferred for larger ones.
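
One practical consequence of the gate count is the number of trainable parameters. The rough sketch below assumes the textbook formulation in which each gate and the candidate block has one weight matrix over [h(t-1), x(t)] and one bias vector. The function name `recurrent_params` and the sizes 64 and 32 are made up for illustration, and real libraries may differ slightly (for example, Keras's default GRU keeps an extra set of biases).

```python
def recurrent_params(blocks, units, input_dim):
    """Parameter count for `blocks` gate-like blocks, each with a weight
    matrix of shape (units, units + input_dim) and a bias of size units."""
    return blocks * (units * (units + input_dim) + units)

units, input_dim = 64, 32  # hypothetical layer sizes

lstm_params = recurrent_params(4, units, input_dim)  # 3 gates + candidate
gru_params = recurrent_params(3, units, input_dim)   # 2 gates + candidate

print("LSTM:", lstm_params)  # LSTM: 24832
print("GRU: ", gru_params)   # GRU:  18624
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.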

2. Why is Bi-LSTM better than LSTM?

A: At every time step, a standard LSTM computes only the forward result, so it sees only the preceding context. A Bi-LSTM computes the results of both the forward and backward LSTM at each time step, so it also sees the context that follows.

3. What are the limitations of Bi-LSTM?

A: Bi-LSTM takes more time to train than a normal LSTM network and requires more memory. Bi-LSTMs also overfit easily, and applying dropout to them is difficult.

4. What is a bidirectional layer?

A: Bidirectional recurrent neural networks (BRNNs) connect two hidden layers of opposite directions to the same output. With this form of generative deep learning, the output layer can simultaneously get information from past (backward) and future (forward) states.

Key Takeaways

In this article, we have covered the following topics:

  • Introduction and need of LSTM network
  • The architecture of the LSTM unit
  • Working of LSTM and Bi-LSTM
  • Implementation of Bi-LSTM in Keras

Happy Coding!