Code360 powered by Coding Ninjas X
Table of contents
Seq2Seq Model
Architecture Of Transformers
Encoder Block
Input Embedding
Multi-Head Attention
Feed Forward Network
Decoder Block
Masked Multi-Head Attention Part
Multi-Head Attention Block
Feed-Forward Unit
Key Takeaways
Last Updated: Mar 27, 2024

Transformer Network

Author Mayank Goyal


New deep learning models are being released at an increasing rate, and it can be challenging to keep up with all of the changes. However, one neural network model has proven particularly effective for common natural language processing tasks. The model is known as the Transformer, and it employs several techniques and mechanisms that I'll go through here.

Before diving into the Transformer itself, let's look at why we use it and where it comes from.

Seq2Seq Model

Transformers and what is known as a sequence-to-sequence architecture are described in the paper 'Attention Is All You Need.' A sequence-to-sequence (Seq2Seq) model is a neural network that converts one sequence of elements into another, such as the words in a sentence. Seq2Seq models excel at translation, which transforms a sequence of words in one language into a sequence of words in another. Long Short-Term Memory (LSTM)-based models are a frequent choice for this task. With sequence-dependent data, LSTM modules can give meaning to the sequence while remembering (or forgetting) the parts they find significant (or unimportant). Sentences, for example, are sequence-dependent because the order of the words is critical for comprehension. For this type of data, LSTMs are a natural choice.

An Encoder and a Decoder are included in Seq2Seq models. The Encoder converts the input sequence to a higher-dimensional space (n-dimensional vector). The Decoder receives the abstract vector and converts it into an output sequence. The output sequence could be in a different language, symbols, a replica of the input, or something else entirely.
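As a toy illustration of this idea (not the paper's actual architecture), here is a hypothetical "encoder" that squashes a sentence into a single n-dimensional vector; the vocabulary, embedding size, and random embeddings are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"ich": 0, "bin": 1, "ein": 2, "student": 3}  # hypothetical German vocabulary
d = 8                                                  # toy embedding dimension
E = rng.normal(size=(len(vocab), d))                   # randomly initialized embeddings

def encode(tokens):
    # Toy "encoder": average the word vectors into one n-dimensional
    # context vector that a decoder could then translate from.
    ids = [vocab[t] for t in tokens]
    return E[ids].mean(axis=0)

ctx = encode(["ich", "bin", "ein", "student"])
print(ctx.shape)  # (8,)
```

A real Seq2Seq encoder would use an LSTM rather than an average, but the shape of the interface is the same: a variable-length sequence in, one fixed-size vector out.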

Consider the Encoder and Decoder to be human interpreters who can each speak only two languages. Their first languages are their respective mother tongues, which differ (for example, German and French), while their second language is a made-up language they share. To translate the German sentence into French, the Encoder converts it into the fictitious language. Because the Decoder can read that fictitious language, it can translate from it into French. Together, the model (Encoder plus Decoder) can convert German to French! Assume that neither the Encoder nor the Decoder is fluent in the imaginary language at first; we give them (the model) a lot of practice to learn it.

A single LSTM for each Encoder and Decoder in the Seq2Seq architecture is a relatively simple choice.

However, it takes the same amount of time to train as a simple RNN, if not longer.

They process data sequentially, which limits utilization of GPUs, which are built for parallel processing.

What is the best way to parallelize sequential data? (I'll get back to you on this.)

For the time being, we're working on two issues.

  • Vanishing gradient
  • Slow training

We can solve the vanishing gradient issues by using an attention mechanism.



The attention mechanism examines an input sequence and determines which elements of the sequence are relevant at each stage. It may appear abstract, but let me explain with an example: When reading this text, your mind is always focused on the word you're reading, but it's also keeping track of the text's crucial keywords to provide context.

For a given sequence, an attention mechanism works similarly. Consider the human Encoder and Decoder in our example. Instead of just writing down the sentence's translation in the imagined language, the Encoder also writes down keywords relevant to the sentence's semantics. It delivers them to the Decoder along with the ordinary translation. Because the Decoder knows which portions of the sentence are significant and which key elements give the sentence context, those additional keywords make translation much more straightforward.

In other words, as the LSTM (Encoder) processes each input, the attention mechanism simultaneously considers several other inputs and decides which ones are relevant by assigning them different weights. The Decoder is then fed both the encoded sentence and the weights provided by the attention mechanism.


In 2017, a paper titled "Attention Is All You Need" was published, which describes the Transformer, an encoder-decoder design based on attention layers.

One significant distinction is that the input sequence can be passed in parallel, so the GPU can be fully utilized and training speed increased. The multi-headed attention layer also overcomes the vanishing gradient issue by a substantial margin. The paper demonstrates the Transformer on NMT (Neural Machine Translation).

So, both of our problems mentioned earlier have been remedied here.

For example, in a simple RNN translator, we input our sequence or sentence one word at a time: because every word depends on the previous word's hidden state, it is best processed one step at a time. That is not the case in the Transformer, where we may pass all of the words in a sentence simultaneously and determine their word embeddings in parallel.

Architecture Of Transformers


Encoder Block

Input Embedding

Computers do not understand language; instead, they deal with numbers, vectors, and matrices. As a result, we must translate our words into vectors. But how is this possible? This is where the idea of an Embedding Space comes in. It is like an open space or a dictionary where words with similar meanings are grouped or located next to one another. Each word in this embedding space is mapped and assigned a particular vector based on its meaning. In this way, we turn our words into vectors.

But we also have to deal with the fact that the same word can have a distinct meaning in different sentences. To tackle this problem, we use Positional Encoders: vectors that provide context based on a word's position in the sentence.

The context vector is the sum of the Word Embedding and the Positional Embedding.
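A minimal NumPy sketch of the sinusoidal positional encoding described in 'Attention Is All You Need'; the sentence length (6) and embedding size (16) are toy values:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal scheme from "Attention Is All You Need":
    # even dimensions use sine, odd dimensions use cosine,
    # with wavelengths forming a geometric progression.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

tokens = np.random.randn(6, 16)          # toy word embeddings for a 6-word sentence
x = tokens + positional_encoding(6, 16)  # context = word embedding + positional embedding
print(x.shape)  # (6, 16)
```

Because the encoding is a deterministic function of position, the model can distinguish "dog bites man" from "man bites dog" even though all words are fed in at once.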

So, now that our input is ready, it's time to move on to the encoder block.


Multi-Head Attention

Now comes the Transformer's essential essence, "Self Attention."

It determines how relevant each word is to the other words in the sentence, represented as an attention vector. For each word, we can build an attention vector that captures its contextual relationship with the other words in that sentence. The only issue is that each word tends to weight itself far more highly than its relationships with other words, even though we are more interested in how it interacts with other words. So, to compute the final attention vector for each word, we build multiple attention vectors per word and take a weighted average. Because it uses multiple attention vectors, this is called the Multi-Head Attention Block.
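Here is a minimal NumPy sketch of scaled dot-product attention with several heads; the sizes (5 words, d_model = 16, 4 heads) and the random weight matrices are placeholders, not trained values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each row of `weights` is one word's
    # attention vector over all the words in the sentence.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # 5 words, d_model = 16
num_heads, d_head = 4, 4
heads = []
for _ in range(num_heads):
    # Each head gets its own (random, untrained) query/key/value projections.
    Wq, Wk, Wv = (rng.normal(size=(16, d_head)) for _ in range(3))
    out, weights = attention(x @ Wq, x @ Wk, x @ Wv)
    heads.append(out)
# Concatenate the heads and project back to d_model.
multi = np.concatenate(heads, axis=-1) @ rng.normal(size=(16, 16))
print(multi.shape)  # (5, 16)
```

Each head can learn a different notion of relevance (syntax, coreference, etc.); concatenating them and projecting back gives the combined attention output per word.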

Feed Forward Network

The Feed Forward Neural Network is the next level. This is a simple feed-forward Neural Network applied to each attention vector. Its primary goal is to translate the attention vectors into a format that the following encoder or decoder layer can understand. Attention vectors are accepted "one at a time" by the Feed Forward Network. The best part is that, unlike with RNN, each of these attention vectors is independent of the others. As a result, parallelization can be used here, making a huge difference. We can now pass all of the words into the encoder block and obtain the set of Encoded Vectors for each word simultaneously.
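A sketch of the position-wise feed-forward network in NumPy; d_model = 16 and d_ff = 64 are toy sizes (the paper uses 512 and 2048), and the weights are random placeholders:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: the same two-layer network (with a ReLU in between)
    # is applied to every word's attention vector independently,
    # so all positions can be processed in parallel.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
x = rng.normal(size=(5, d_model))   # attention vectors for 5 words
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (5, 16)
```

Because no row of `x` depends on any other row here, this is exactly the kind of computation a GPU can parallelize, unlike an RNN's step-by-step recurrence.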

Decoder Block


For example, if we train a translator from English to French, we must provide an English statement and its translated French sentence for the model to learn. As a result, our English sentences go through the Encoder Block, whereas our French sentences go through the Decoder Block. First, we have the Embedding layer and Positional encoder component, which converts words into vectors, similar to what we saw in the Encoder section.

Masked Multi-Head Attention Part

It will proceed to the self-attention block, where attention vectors are constructed for each word in French phrases to show how closely related each word is. 

But this block is known as the Masked Multi-Head Attention Block. To see why, we must first understand how the learning mechanism works. First, the model takes the English sentence and predicts the next French word using the prior results, then compares its prediction to the actual French translation (which we fed into the decoder block). After comparing the two, it updates its matrix values. It learns in this manner over many repetitions.

The model may look at any word of the English sentence, but only at the preceding words of the French sentence. So, while executing the parallelized matrix operations, we make sure the matrix masks the words that appear later: their attention scores are set to negative infinity, so after the softmax their attention weights become 0 and the attention network cannot use them.
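This masking can be sketched in NumPy with a lower-triangular matrix; scores of future positions are set to a very large negative number so their softmax weights come out as 0:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))   # raw attention scores (toy values)
mask = np.tril(np.ones((seq_len, seq_len)))    # position i may see positions 0..i only
masked = np.where(mask == 1, scores, -1e9)     # "future" words get (approximately) -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax, row by row
print(np.round(weights, 2))                    # upper triangle is all zeros
```

During training this lets the decoder process all target positions in parallel while still behaving as if it generated the sentence left to right.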

Multi-Head Attention Block

The attention vectors from the preceding layer and the vectors from the Encoder Block are now fed into a new Multi-Head Attention Block. This is where the encoder block's results come into play (the output from the Encoder block is also clearly visible in the diagram), which is why this block is also called the Encoder-Decoder Attention Block.

Because each English and French sentence has a single vector for each word, this block is responsible for mapping English and French words and determining their relationship. As a result, this is where the primary English to French word mapping takes place.

Feed-Forward Unit

We'll feed each attention vector into a feed-forward unit, which will convert the output vectors into something that another decoder block or a linear layer can understand.


The linear layer is another feed-forward layer. It is used to expand the dimensions to the number of words in the French vocabulary.


The result is now sent through a Softmax Layer, which converts the input into a human-interpretable probability distribution over the vocabulary.

The resulting translated word is the one with the highest probability. The Transformer first generates initial representations, or embeddings, for each word (indicated by the unfilled circles). Then, using self-attention, it aggregates information from all of the other words, constructing a new representation per word that is informed by the entire context (depicted by the filled circles). This step is then repeated multiple times in parallel for all words, successively generating new representations.
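The final two layers (linear projection to vocabulary size, then softmax) can be sketched like this; the four-word French vocabulary, d_model, and random weights are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
french_vocab = ["je", "suis", "un", "étudiant"]   # hypothetical tiny vocabulary
d_model = 16
decoder_out = rng.normal(size=(d_model,))          # decoder output for one position
W = rng.normal(size=(d_model, len(french_vocab)))  # linear layer: d_model -> vocab size
logits = decoder_out @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax: a probability distribution
print(french_vocab[int(np.argmax(probs))])         # emit the highest-probability word
```

With trained weights, sampling or arg-maxing over this distribution at each step produces the translated sentence one word at a time.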

The Decoder works similarly, but it generates words one at a time, from left to right. It pays attention not only to the previously created words but also to the Encoder's final representations.

So that's how the Transformer works, and it's now the most advanced NLP technique. It uses a self-attention mechanism to achieve excellent results while also resolving the parallelization problem. Google even uses BERT, which employs a Transformer to pre-train models for common NLP applications.



  1. Do transformers use RNN?
    No. Transformers replace recurrence with an attention mechanism that processes all tokens simultaneously and calculates attention weights between them in successive layers.
  2. In NLP, what are the constraints of transformers?
    Attention can only deal with fixed-length text strings. The text must be split into a certain number of segments or chunks before being fed into the system as input, and this chunking causes context fragmentation.
  3. Why would a transformer be preferable to an RNN-based model?
    It's due to the path length. For a sequence of length n, a transformer can relate any two elements with O(1) sequential operations, whereas a recurrent neural network needs O(n) sequential operations.
  4. What is the function of a transformer in a neural network?
    The Transformer is a component found in many neural network architectures for processing sequential data, including plain language text, genomic sequences, acoustic signals, and time-series data. The most common use of transformer neural networks is in natural language processing.

Key Takeaways

Let us briefly recap the article.

Firstly, we had a brief discussion about transformers: why we use them, the shortcomings other models face, and how transformers help overcome them. Lastly, we saw the architecture of the Transformer and the importance of each of its layers.

That's the end of the article. I hope you all like it.

Happy Coding Ninjas!
