Code360 powered by Coding Ninjas
Table of contents
LSTM Architecture
Forget Gate
Input Gate
Output Gate
Key Takeaways
Last Updated: Mar 27, 2024

Long Short Term Memory (LSTM) Cells

Author Mayank Goyal


The Long Short Term Memory Network is an advanced Recurrent Neural Network, a sequential network that allows information to persist.

In a Recurrent Neural Network (RNN), the output from the previous step is fed as input to the current step. RNNs suffer from short-term memory: if a sequence is long enough, they have difficulty carrying information from earlier time steps to later ones. So if we are processing a paragraph of text to make predictions, an RNN may leave out important information from the beginning.

During backpropagation, recurrent neural networks suffer from the vanishing gradient problem. Gradients are the values used to update the weights of a neural network. The vanishing gradient problem occurs when the gradient shrinks, or becomes negligible, as it backpropagates through time. If a gradient value becomes extremely small, it contributes very little to learning.
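The shrinking effect described above can be illustrated with a minimal sketch. It assumes a toy scalar RNN of the form h(t) = tanh(w * h(t-1) + x(t)); the function name, the weight w = 0.5, and the constant inputs are hypothetical choices made only for illustration. The gradient of the final hidden state with respect to the initial one is a product of one factor per time step, and each factor here has magnitude below one.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Track |d h(t) / d h(0)| for a toy scalar RNN h(t) = tanh(w*h(t-1) + x(t))."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h(t) / d h(t-1) = w * (1 - tanh^2(.)) = w * (1 - h^2)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

# Hypothetical setting: recurrent weight 0.5, fifty identical inputs.
history = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print("after  1 step :", history[0])
print("after 10 steps:", history[9])
print("after 50 steps:", history[49])
```

Because every factor is at most 0.5 in magnitude in this setting, the gradient after 50 steps is far too small to influence the weight update for the earliest inputs.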

While watching a video, we remember the previous scene; while reading a book, we know what happened in the last chapter. Similarly, RNNs retain previous information and use it when processing the current input. The shortcoming of RNNs is that they cannot learn long-term dependencies because of the vanishing gradient problem. Long Short Term Memory is explicitly designed to overcome this long-term dependency problem.


LSTM Architecture

An LSTM works much like a Recurrent Neural Network cell. At a high level, a Long Short Term Memory cell consists of three parts. The first part decides whether the information coming from the previous timestamp should be remembered or is of no use and can be forgotten. In the second part, the cell learns new information from the input. In the third part, the cell passes the updated information from the current timestamp to the next timestamp.


These three parts of a Long Short Term Memory cell are called gates. The first part is the forget gate, the second is the input gate, and the last is the output gate. Like an RNN, an LSTM has a hidden state, where H(t-1) represents the hidden state of the previous timestamp and H(t) the hidden state of the current timestamp. In addition, an LSTM has a cell state, represented by C(t-1) and C(t) for the previous and current timestamps, respectively.



Let us understand how an LSTM works with the help of an example. Here we have two sentences separated by a full stop. The first sentence is "Mayank can play cricket," and the second is "Soham, on the other hand, cannot." In the first sentence we are clearly talking about Mayank, and as soon as we encounter the full stop (.), we start talking about Soham.

So, as we move from the first sentence to the second, our network should realize that we are no longer talking about Mayank. Now our subject is Soham. This is the job of the forget gate, which allows the network to forget the previous data. Let's understand the role played by each of these gates in the LSTM architecture in detail.

Forget Gate

In an LSTM cell, the first step is to decide whether to keep the information from the previous timestamp or forget it. We feed two inputs to the gate: X(t), the input at the current timestamp, and H(t-1), the previous hidden state. Each is multiplied by its weight matrix, the results are summed, and a bias is added. We pass the result through a sigmoid activation function, which produces a value between zero and one for each entry of the cell state. If the output for an entry is close to zero, that piece of information is forgotten; if it is close to one, the information is retained for future use.


The equation of the forget gate is given by:

  f(t) = σ(X(t) · U(f) + H(t-1) · W(f))

(The bias term is omitted here for brevity.)

Let's try to understand the equation. Here:

  • X(t): Input at the current timestamp.
  • U(f): Weight matrix associated with the input.
  • H(t-1): Hidden state of the last timestamp.
  • W(f): Weight matrix associated with the previous hidden state.

A sigmoid function is applied to this sum, so f(t) falls between zero and one. f(t) is then multiplied with the cell state of the previous timestamp. If f(t) is zero, the network forgets everything, and if f(t) is one, it forgets nothing. Moving back to our example, the first sentence talks about Mayank, and after the full stop the network encounters Soham. In an ideal case, the network will forget about Mayank.
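The forget gate described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names (sigmoid, affine, forget_gate), the bias b_f, and the hand-picked weights are assumptions, whereas a real network learns its weights during training. The weights are chosen so that the gate keeps the first cell-state entry and forgets the second.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(U, x, W, h, b):
    """Compute U x + W h + b with plain lists, one output per row."""
    return [
        sum(u * xi for u, xi in zip(U_row, x))
        + sum(w * hi for w, hi in zip(W_row, h))
        + b_i
        for U_row, W_row, b_i in zip(U, W, b)
    ]

def forget_gate(x_t, h_prev, U_f, W_f, b_f):
    """f(t): one value in (0, 1) per entry of the cell state."""
    return [sigmoid(z) for z in affine(U_f, x_t, W_f, h_prev, b_f)]

# Toy, hand-picked parameters (hypothetical; real weights are learned).
x_t = [1.0]                       # input at the current timestamp
h_prev = [0.0, 0.0]               # previous hidden state
U_f = [[10.0], [-10.0]]           # input weights
W_f = [[0.0, 0.0], [0.0, 0.0]]    # hidden-state weights
b_f = [0.0, 0.0]                  # bias

f_t = forget_gate(x_t, h_prev, U_f, W_f, b_f)
c_prev = [3.0, 3.0]               # previous cell state
kept = [f * c for f, c in zip(f_t, c_prev)]
print("f(t):", f_t)
print("retained cell state:", kept)
```

The first entry of f(t) is close to one, so the matching cell-state entry survives almost unchanged, while the second is close to zero and is almost entirely forgotten.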

Input Gate

Let's take another example.

"Mayank plays cricket well. He told me on the text that he has been practicing it for the past two years."

So, as we can see in both these sentences, we are talking about Mayank. However, both sentences give different kinds of information about Mayank. In the first sentence, we receive the information that Mayank plays cricket. The second sentence says Mayank uses the text medium and has been practicing cricket for two years.

Now think about which information in the second sentence is critical, given the context of the first. Is it that Mayank used text to tell us, or that he has been practicing? In this context, it doesn't matter whether he used text or any other communication medium to pass on the information. The fact that he has been practicing is the essential information, and this is what we want our model to remember. That is the task of the input gate.

The input gate adds useful information to the cell state. First, a sigmoid function, fed with H(t-1) and X(t) just like the forget gate, regulates which values are to be updated. Then, a tanh function creates a vector of candidate values from H(t-1) and X(t), each between minus one and plus one. Finally, we multiply the candidate vector by the regulated values to obtain the useful information, which is added to the cell state.
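The steps above can be sketched as follows. The parameter names (U_i, W_i, b_i for the sigmoid part, U_g, W_g, b_g for the tanh part), the function name input_gate_update, and all numeric values are hypothetical and chosen only for illustration. The forget-gate output f_t is passed in as a given so the example stays focused on the input gate.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(U, x, W, h, b):
    """Compute U x + W h + b with plain lists, one output per row."""
    return [
        sum(u * xi for u, xi in zip(U_row, x))
        + sum(w * hi for w, hi in zip(W_row, h))
        + b_i
        for U_row, W_row, b_i in zip(U, W, b)
    ]

def input_gate_update(x_t, h_prev, c_prev, f_t, p):
    """Return (i_t, g_t, c_t): regulated values, candidates, new cell state."""
    # Sigmoid part: how much of each candidate value to let in.
    i_t = [sigmoid(z) for z in affine(p["U_i"], x_t, p["W_i"], h_prev, p["b_i"])]
    # Tanh part: candidate values in (-1, 1).
    g_t = [math.tanh(z) for z in affine(p["U_g"], x_t, p["W_g"], h_prev, p["b_g"])]
    # Add the regulated candidates to what the forget gate retained.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    return i_t, g_t, c_t

zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {  # toy, hand-picked values (hypothetical; real weights are learned)
    "U_i": [[10.0], [-10.0]], "W_i": zeros, "b_i": [0.0, 0.0],
    "U_g": [[2.0], [2.0]], "W_g": zeros, "b_g": [0.0, 0.0],
}

i_t, g_t, c_t = input_gate_update(
    x_t=[1.0], h_prev=[0.0, 0.0], c_prev=[0.0, 0.0], f_t=[1.0, 1.0], p=params
)
print("i(t):", i_t)
print("candidates:", g_t)
print("new cell state:", c_t)
```

Both entries receive the same candidate value, but only the first is written to the cell state, because the sigmoid part is close to one for the first entry and close to zero for the second.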


Output Gate

This gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs and is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state through the tanh function. We multiply the tanh output by the sigmoid output to decide what information the hidden state should carry. The result is the new hidden state, which serves as the output of the current timestamp and is passed, together with the new cell state, to the next timestamp.
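Putting the three gates together, one time step of an LSTM cell might look like the sketch below. It follows the description in this article under stated assumptions: the function name lstm_step, the parameter layout (a dict mapping each gate name to its input weights, hidden-state weights, and bias), and the toy values are hypothetical, and all four parts share the same weights here only to keep the example short.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(U, x, W, h, b):
    """Compute U x + W h + b with plain lists, one output per row."""
    return [
        sum(u * xi for u, xi in zip(U_row, x))
        + sum(w * hi for w, hi in zip(W_row, h))
        + b_i
        for U_row, W_row, b_i in zip(U, W, b)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    def pre(name):
        U, W, b = params[name]
        return affine(U, x_t, W, h_prev, b)

    f_t = [sigmoid(z) for z in pre("f")]    # forget gate
    i_t = [sigmoid(z) for z in pre("i")]    # input gate (sigmoid part)
    g_t = [math.tanh(z) for z in pre("g")]  # candidate values (tanh part)
    o_t = [sigmoid(z) for z in pre("o")]    # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Output gate: sigmoid output times tanh of the updated cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy parameters shared by all four parts (hypothetical; real weights are
# learned and differ per gate).
shared = ([[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
params = {name: shared for name in ("f", "i", "g", "o")}

h, c = [0.0, 0.0], [0.0, 0.0]
states = []
for x in ([1.0], [0.5], [-0.5]):  # a short input sequence
    h, c = lstm_step(x, h, c, params)
    states.append((h, c))
    print("h(t):", h, "c(t):", c)
```

Each entry of the hidden state stays strictly between minus one and plus one, since it is the product of a sigmoid output and a tanh output. The loop shows how H(t) and C(t) from one timestamp become H(t-1) and C(t-1) for the next.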



LSTMs can solve tasks not solvable by previous learning algorithms for recurrent neural networks. The Recurrent Neural Network (RNN) is a neural sequence model that has achieved state-of-the-art performance on important tasks including language modeling, speech recognition, and machine translation. Since LSTMs effectively capture long-term temporal dependencies without suffering from the optimization hurdles that plague simple recurrent networks, they have been used to advance the state of the art for many complex problems. These include handwriting recognition and generation, language modeling and translation, acoustic modeling of speech, speech synthesis, protein secondary structure prediction, and analysis of audio and video data.


  1. What are some common problems with LSTM?
    LSTMs are prone to overfitting, and it is challenging to apply dropout to curb this issue.
  2. How many gates are there in LSTM?
    There are three gates in a Long Short Term Memory cell: a forget gate, an input gate, and an output gate.
  3. How does an LSTM network work?
    LSTMs use a series of gates that control how the data in a sequence enters, is stored in, and leaves the network. There are three gates in an LSTM: the forget gate, the input gate, and the output gate.
  4. What is LSTM suitable for?
    LSTM networks are well suited for classifying, processing, and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem encountered when training traditional RNNs.

Key Takeaways

Let us briefly recap the article.
First, we saw what an LSTM is and the limitations of RNNs that led to its development. Next, we saw the basic architecture of an LSTM and how its different gates work. Lastly, we saw some of the applications and limitations of LSTMs.


I hope you all liked this article.
Happy Learning, Ninjas!
