Code360 powered by Coding Ninjas
Last Updated: Mar 27, 2024

Encoder-Decoder Models



The encoder-decoder model is a way of using RNNs for sequence-to-sequence prediction problems. It was initially developed for machine translation, but it has also proven successful on related sequence-to-sequence prediction problems such as question answering. The architecture is relatively new, having been pioneered in 2014, yet it has already been adopted as the core technology inside Google's translate service.

For a better understanding of the structure of the encoder-decoder model, prior knowledge of RNN/LSTM/GRU is helpful.


Architecture Of Encoder-Decoder Model

The sequence-to-sequence model in common use consists of three parts: an encoder, an intermediate vector, and a decoder.

Encoder: It accepts a single element of the input sequence at each step, processes it, collects information about that element, and propagates it forward.

Intermediate vector: It is the final state obtained from the encoder part of the model. This vector contains information about the entire input sequence to help the decoder make accurate predictions.

Decoder: It generates the output sequence, predicting one output at each step.



Working Of Encoder-Decoder Model


The encoder is a stack of several recurrent units (LSTM or GRU cells for better performance). Each unit accepts a single element of the input sequence, collects information about that element, and propagates it forward. The outputs of the encoder are discarded, and only its internal states are used.

Encoder cells take only one element as input at a time, so if the input sequence has length x, the encoder takes x time steps to read the entire sequence.

  • Xt is the input at time step t.
  • ht is the internal state of the encoder cell at time step t.
  • Yt is the output of the encoder cell at time step t.

Let us consider an example of English-to-Hindi translation. Consider the English sentence "My name is Mayank Goyal." This sentence contains five words (My, name, is, Mayank, Goyal). Here,

X1 = 'My'

X2 = 'name'

X3 = 'is'

X4 = 'Mayank'

X5 = 'Goyal'

Therefore, the LSTM/GRU reads this sequence word by word in five time steps.

Each input (Xt) is represented as a vector using a word embedding, which converts each word into a fixed-length vector. The hidden state ht is computed using the formula:
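For a plain recurrent unit, this recurrence is commonly written as below. The weight names $W^{(hh)}$ and $W^{(hx)}$ and the activation $f$ (for example, tanh) are notation assumed here for illustration; LSTM and GRU cells follow the same idea but add gating on top.

$$
h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} X_t\right)
$$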

The final state, h5, contains the crux of the entire input sequence.
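As a rough illustration of the encoder reading the example sentence, the sketch below runs a plain tanh recurrence (not a full LSTM/GRU) over the five words. The sizes, the random weights and embeddings, and the names rnn_step, encode, W_hh, and W_hx are assumptions made for this example, not trained values.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(h_prev, x_t, W_hh, W_hx):
    """One recurrent step: h_t = tanh(W_hh @ h_prev + W_hx @ x_t)."""
    from_state = matvec(W_hh, h_prev)
    from_input = matvec(W_hx, x_t)
    return [math.tanh(a + b) for a, b in zip(from_state, from_input)]

def encode(tokens, embeddings, W_hh, W_hx, hidden_size):
    """Read the tokens one per time step and return every hidden state."""
    h = [0.0] * hidden_size              # h0: all-zero initial state
    states = []
    for token in tokens:                 # one time step per input element
        h = rnn_step(h, embeddings[token], W_hh, W_hx)
        states.append(h)
    return states

rng = random.Random(0)                   # fixed seed so the run is repeatable
hidden_size, embed_size = 4, 3
tokens = ["My", "name", "is", "Mayank", "Goyal"]

# Random stand-ins for learned parameters (illustrative only).
embeddings = {t: [rng.uniform(-1, 1) for _ in range(embed_size)] for t in tokens}
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
W_hx = [[rng.uniform(-0.5, 0.5) for _ in range(embed_size)] for _ in range(hidden_size)]

states = encode(tokens, embeddings, W_hh, W_hx, hidden_size)
context = states[-1]                     # h5, the state handed to the decoder
print(len(states), len(context))
```

Running it prints 5 4: five hidden states, one per word, each of length 4. The last of them plays the role of h5.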

Intermediate Vector

It is the final hidden state produced by the encoder part of the model. It is calculated using the formula above. This vector encapsulates the information from all input elements to help the decoder make accurate predictions. It serves as the initial hidden state of the decoder part of the model.

Decoder Part

Let us understand the working of the decoder during the training phase. Take the running example of translating "My name is Mayank Goyal" into Hindi. Just as the encoder reads its input word by word, the decoder produces the output sequence word by word. So we have to generate the output: मेरा नाम मयंक गोयल है। We add START_ and _END at the start and end of the sequence so that the decoder can recognize the beginning and the end of the sentence. After this change, our output sentence becomes START_ मेरा नाम मयंक गोयल है _END.
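The marked sentence is typically split into what the decoder is given and what it should predict at each step. The sketch below assumes the common teacher-forcing setup, in which the correct previous word is fed in during training; the helper name make_decoder_pairs is hypothetical.

```python
START, END = "START_", "_END"

def make_decoder_pairs(target_words):
    """Return (decoder_inputs, decoder_targets) for one training sentence."""
    sequence = [START] + list(target_words) + [END]
    decoder_inputs = sequence[:-1]       # START_ followed by every target word
    decoder_targets = sequence[1:]       # every target word followed by _END
    return decoder_inputs, decoder_targets

hindi = ["मेरा", "नाम", "मयंक", "गोयल", "है"]
inputs, targets = make_decoder_pairs(hindi)

# At each step the decoder sees `given` and is trained to predict `expected`.
for step, (given, expected) in enumerate(zip(inputs, targets), start=1):
    print(step, given, "->", expected)
```

The two lists are the same sentence shifted by one position, so START_ is paired with the first Hindi word and the last Hindi word is paired with _END.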

The decoder's outputs at each time step are:

y1 = मेरा

y2 = नाम

y3 = मयंक

y4 = गोयल

y5 = है

The initial state (h0) of the decoder is set to the final state of the encoder. We can think of the decoder as being trained to generate the output from the information collected by the encoder. Now, we input START_ so that the decoder generates the next word. Likewise, after the last word in the Hindi sentence, we make the decoder learn to predict _END. The decoder consists of a stack of several recurrent units, where each unit predicts an output yt at time step t. Each unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state. We compute the hidden state ht using the formula given below:
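In its simplest form, with no input term, this decoder recurrence is often written as below. The weight name $W^{(hh)}$ and the activation $f$ are assumed notation; when the previous word is also fed in as input (as with START_ above), a term for that input is added inside $f$.

$$
h_t = f\left(W^{(hh)} h_{t-1}\right)
$$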

As we can see, we are using the previous hidden state to compute the next one. The output yt at each time step t is calculated using the formula given below:
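Using the weight W(S) that the text refers to, a standard way to write this output is the following sketch, where the softmax is taken over the output vocabulary:

$$
y_t = \mathrm{softmax}\left(W^{(S)} h_t\right)
$$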

We calculate the output using the hidden state at the current time step and the corresponding weight W(S). We use softmax to create a probability vector that helps us determine the final output (e.g., the word in a question-answering problem). Finally, we calculate the loss on the predicted outputs from each time step, and the errors are backpropagated through time to update the model's parameters. The final states of the decoder are discarded once we have the output, since they are no longer needed.
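The output and loss computation for a single time step can be sketched as follows. The vocabulary, the hidden state, and the weights W_s are toy values chosen for illustration, and cross-entropy is assumed as the loss since the text does not name one. Backpropagation itself is not shown.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)                        # shift by the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(W_s, h_t):
    """Compute softmax(W_s @ h_t): one probability per vocabulary word."""
    scores = [sum(w * h for w, h in zip(row, h_t)) for row in W_s]
    return softmax(scores)

def cross_entropy(probs, target_index):
    """Loss for one time step: negative log-probability of the correct word."""
    return -math.log(probs[target_index])

vocab = ["मेरा", "नाम", "मयंक", "गोयल", "है", "_END"]
h_t = [0.2, -0.1, 0.4]                        # toy decoder hidden state
W_s = [                                       # toy weights, one row per vocabulary word
    [2.0, 0.0, 1.0],
    [0.1, 0.2, 0.0],
    [-0.5, 0.3, 0.1],
    [0.0, 1.0, 0.2],
    [0.3, 0.3, -0.4],
    [-1.0, 0.5, 0.0],
]

probs = output_distribution(W_s, h_t)
predicted = vocab[probs.index(max(probs))]    # most probable word at this step
loss = cross_entropy(probs, vocab.index("मेरा"))
print(predicted, loss)
```

During training, the per-step losses are summed (or averaged) over the whole output sentence before the parameters are updated.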

That is how the encoder-decoder model works. 


Google Translation

The encoder-decoder model reads an input sentence, captures its message and concepts, and then translates it into a second language. Google Translate is built upon an encoder-decoder structure.

Sentiment Analysis

Encoder-decoder models understand the meaning of the input sentence and output a sentiment score. The sentiment score is usually rated between -1 (negative) and 1 (positive), where 0 is neutral. It is used in call centers to analyze the client's emotions and reactions to specific keywords or company discounts.

Video/Image Captioning

These models generate a sentence describing an image. The image is fed in as the input, and the model outputs a sequence of words. This also works with videos.

Frequently Asked Questions

1. What does the encoder/decoder model do?

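Ans. An encoder-decoder model maps an input to an output sequence: the encoder reads the input and compresses it into an internal representation, and the decoder generates the output from that representation one element at a time. For example, it can generate a sentence describing an image: it receives the image as the input and outputs a sequence of words. This also works with videos.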

2. What is the difference between Autoencoder and encoder-decoder?

Ans. An encoder-decoder architecture has an encoder section that takes an input and maps it to a latent space. The decoder section takes that latent representation and maps it to an output, which can differ from the input (for example, a sentence in another language). An autoencoder is the special case in which the target is the input itself: it takes x as input and reconstructs x as the output.

3. What are encoders and decoders in deep learning?

Ans. In deep learning, the encoder is a network that reads the input sequence in its entirety and encodes it into a fixed-length internal representation. The decoder is a network that then uses this internal representation to output words until the end-of-sequence token is reached.
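The "output words until the end-of-sequence token is reached" loop can be sketched as below. The function toy_step and its lookup table are hypothetical stand-ins for a trained decoder step, which would run an LSTM/GRU cell and pick the most probable word; the max_steps guard is an added safety assumption in case _END is never produced.

```python
START, END = "START_", "_END"

def generate(step_fn, context, max_steps=20):
    """Emit one token per step until _END appears or max_steps is hit."""
    state, token, output = context, START, []
    for _ in range(max_steps):
        token, state = step_fn(token, state)  # feed back the previous prediction
        if token == END:
            break
        output.append(token)
    return output

# Hypothetical stand-in for a trained decoder: a fixed next-word table.
NEXT_WORD = {
    START: "मेरा", "मेरा": "नाम", "नाम": "मयंक",
    "मयंक": "गोयल", "गोयल": "है", "है": END,
}

def toy_step(prev_token, state):
    """Return (next_token, new_state); the state is passed through unchanged here."""
    return NEXT_WORD[prev_token], state

context = [0.0, 0.0, 0.0, 0.0]                # placeholder for the encoder's final state
print(generate(toy_step, context))
```

With this toy table, the loop produces the five Hindi words and stops when _END is predicted.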

Key Takeaways

Let us briefly recap the article.

First, we saw what a sequence-to-sequence model is and how the encoder and decoder are used. Moving on, we saw the architecture of the encoder-decoder model and how it works. Lastly, we saw some applications of the encoder-decoder model.


The above explanation covers the most straightforward encoder-decoder model, so we cannot expect it to perform well on complex tasks. The reason is that a single fixed-length vector encoding the whole input sequence cannot always capture all of its information.

That is the end of the article. I hope you all like it.

Happy Learning Ninjas!
