Table of contents
1. Introduction
2. Reason Behind Failure Of seq2seq models
3. Working
4. Algorithm
5. Implementation
6. FAQs
7. Key Takeaways
Last Updated: Mar 27, 2024

Attention Mechanism In Deep Learning

Author Mayank Goyal

Introduction

Attention is a powerful mechanism developed to enhance the performance of the encoder-decoder architecture on neural network-based machine translation tasks, and it is one of the most prominent mechanisms in the deep learning field. Even though the attention mechanism is now used in various problems such as image captioning, it was initially designed for Neural Machine Translation using seq2seq models.

The attention mechanism was developed to improve the performance of the encoder-decoder model for machine translation. The idea was to let the decoder flexibly utilize the most relevant parts of the input sequence through a weighted combination of all the encoded input vectors, with the most relevant vectors receiving the highest weights.

Reason Behind Failure Of seq2seq models

The seq2seq model is usually composed of an encoder-decoder architecture. The encoder processes the input sequence or sentence and encodes/compresses/summarizes the information into a context vector (also called the thought vector) of a fixed length. This context vector representation is expected to be a good summary of the entire input sequence. The decoder is then initialized with this context vector, using which it starts generating the transformed output.

A critical and apparent disadvantage of this fixed-length context vector design is the system's inability to remember long sequences: the model has often forgotten the earlier parts of a sequence by the time it has processed the whole of it. The attention mechanism was born to resolve this problem.

Working

When the model predicts an output word, it uses only the parts of the input where the most critical information is concentrated, rather than the complete sequence. Put another way, it pays attention to only a few input words.

Attention is a connection between the encoder and the decoder that transmits information from every encoder hidden state to the decoder. With this framework, the model can selectively focus on the valuable parts of the input sequence and learn the relationships between them. This makes it easier for the model to deal with long input sentences.

The goal is to keep the decoder the same and replace the sequential RNN/LSTM in the encoder with a bidirectional RNN/LSTM.
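As a minimal NumPy sketch of that idea (with hypothetical hidden-state values), the annotation for each word can be formed by concatenating the forward and backward hidden states of a bidirectional encoder, so every annotation summarizes both the preceding and the following context:

```python
from numpy import array, concatenate

# Hypothetical forward and backward hidden states for a three-word input,
# as a bidirectional RNN/LSTM encoder would produce (hidden size 2)
forward_states = array([[0.1, 0.2],
                        [0.3, 0.4],
                        [0.5, 0.6]])
backward_states = array([[0.6, 0.5],
                         [0.4, 0.3],
                         [0.2, 0.1]])

# The annotation for word j concatenates its forward and backward states
annotations = concatenate([forward_states, backward_states], axis=1)
print(annotations.shape)  # (3, 4)
```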

By considering a window of size Tx, we pay special attention to the words inside it (say four words x1, x2, x3, and x4). From these four words we generate a context vector c1 and provide it to the decoder as input. Sliding the window, we create the next context vector c2 in the same way. Each word in a window also has an attention weight (a1, a2, a3, a4), and the weights within one window sum to 1.

Similarly, we construct context vectors from different sets of words with different weight values. Because not all of the inputs contribute equally to a given output, the attention model computes a set of attention weights denoted a(i,1), a(i,2), ..., a(i,t). The context vector ci for the output word yi is then the weighted sum of the annotations hj:

ci = a(i,1)h1 + a(i,2)h2 + ... + a(i,t)ht

The attention weights are obtained by normalizing the output scores of a feed-forward neural network, described by a function that captures the alignment between the input at position j and the output at position i.
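This weighted sum can be illustrated with a small NumPy sketch; the annotations H and the alignment scores e below are made-up values standing in for what the encoder and the scoring network would produce:

```python
from numpy import array
from scipy.special import softmax

# Hypothetical annotations (encoder hidden states), one row per input word
H = array([[0.1, 0.4],
           [0.7, 0.2],
           [0.3, 0.9]])

# Hypothetical alignment scores e(i, j) for one output position i,
# normally produced by a small feed-forward network
e = array([2.0, 0.5, 1.0])

# Normalize the scores into attention weights a(i, j); they sum to 1
a = softmax(e)

# The context vector ci is the weighted sum of the annotations
c = a @ H
print(c.shape)  # (2,)
```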

Algorithm

By assigning a score to each word, attention places varying emphasis on different words. The context vector is then generated from the encoder hidden states by a weighted sum, using the softmax scores as weights.

An attention layer's algorithm can be broken down into five steps.

Step 1: Prepare the hidden states and score each encoder hidden state against the decoder state.

Step 2: Apply a softmax layer to all of the scores.

Step 3: Multiply each encoder hidden state by its softmax score.

Step 4: Add all of the alignment vectors together to form the context vector.

Step 5: Feed the decoder the context vector.
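The steps above can be sketched in NumPy (the encoder and decoder states here are hypothetical values, and the score is a simple dot product):

```python
from numpy import array
from scipy.special import softmax

# Step 1: prepare the hidden states (hypothetical values for illustration)
# and score each encoder state against the decoder state (dot product)
encoder_states = array([[1.0, 0.0],
                        [0.5, 0.5],
                        [0.0, 1.0]])  # one row per input word
decoder_state = array([1.0, 1.0])
scores = encoder_states @ decoder_state

# Step 2: apply a softmax layer to all of the scores
weights = softmax(scores)

# Step 3: multiply each encoder hidden state by its softmax score
alignment_vectors = weights[:, None] * encoder_states

# Step 4: add all of the alignment vectors together to get the context vector
context = alignment_vectors.sum(axis=0)

# Step 5: the context vector would now be fed to the decoder
print(context)  # [0.5 0.5]
```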

Implementation

This section will look at using Python's NumPy and SciPy modules to build the general attention mechanism.

For simplicity, we'll first compute the attention for the first word of a four-word sequence. Then the computation will be generalized to calculate the attention output for all four words in matrix form.

Let's begin by defining the word embeddings of the four different words for which we'll calculate attention. In practice, these word embeddings would be produced by an encoder, but we'll define them manually for this example.

from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

word_1 = array([1, 0, 1])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])


Now, we stack the word embeddings into a single array

words = array([word_1, word_2, word_3, word_4])


Next, we generate the weight matrices, which are multiplied by the word embeddings to produce the queries, keys, and values. Here these weight matrices are generated at random, but in practice they would have been learned during training.

random.seed(42) # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))


Generating the keys, queries and values

Q = words @ W_Q
K = words @ W_K
V = words @ W_V


Calculating the scores of the query vectors against all key vectors

scores = Q @ K.transpose()


The scores are then passed through a softmax to produce the weights. The score values are generally divided by the square root of the key vectors' dimensionality to keep the gradients stable.

weights = softmax(scores / K.shape[1] ** 0.5)


Finally, we obtain the attention values as a weighted sum of all four value vectors.

attention = weights @ V 
print(attention)


Output

[[9.92901880e-01 1.62883281e+00 6.35930933e-01]
 [1.22920950e-05 1.90485723e-05 6.75647734e-06]
 [3.95390534e-03 5.93279494e-03 1.97888961e-03]
 [3.08678210e-03 5.06988081e-03 1.98309872e-03]]

That’s the basic implementation of the attention mechanism.
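The whole computation can also be wrapped in a reusable helper (the name scaled_dot_product_attention is ours, not from the article). One note: this sketch applies the softmax row-wise with axis=-1, so each word's weights sum to 1, which is the standard formulation; the snippet above calls softmax without an axis argument, which normalizes over the entire score matrix at once, so the resulting numbers differ slightly.

```python
from numpy import array, random
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    # Score every query against every key, scale, softmax each row,
    # then take the weighted sum of the value vectors
    scores = Q @ K.transpose()
    weights = softmax(scores / K.shape[1] ** 0.5, axis=-1)
    return weights @ V

# Reusing the same embeddings and random weight matrices as above
random.seed(42)
words = array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

attention = scaled_dot_product_attention(words @ W_Q, words @ W_K, words @ W_V)
print(attention.shape)  # (4, 3)
```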

FAQs

  1. Why does the attention mechanism work?
    The attention mechanism has revolutionized how we build NLP models and is now a standard component of most modern architectures. It works because it allows the model to "remember" all the words in the input and to focus on specific words when formulating a response.
     
  2. How does the attention mechanism help resolve the issues of RNNs and LSTMs?
    Long sentences, especially those longer than the sentences in the training corpus, can be challenging for the neural network to cope with. Unlike the traditional Encoder-Decoder approach, the attention mechanism lets us look at all the hidden states of the encoder sequence when making predictions.
     
  3. What are the advantages of local attention?
    It is differentiable, like soft attention, and easier to implement and train. It takes less time to compute than global or soft attention. Local attention finds the best-matched point pt in the input sequence based on the decoder's current state.
     
  4. What problem does attention solve?
    It addresses the Encoder-Decoder model's limitation of encoding the entire input sequence into one fixed-length vector from which every output time step must be decoded. This difficulty is thought to be more prevalent when decoding long sequences.

Key Takeaways

Let us briefly summarize the article.

Firstly, we saw the basic definition of attention and the shortcomings that led to the development of the attention mechanism. Later, we saw the working and the algorithm of the attention mechanism. Lastly, we implemented a basic attention mechanism using NumPy and SciPy.

I hope you all like this article.

Happy Learning Ninjas!
