Last Updated: Mar 27, 2024

Hierarchical Attention Network

Author Rajkeshav



Text classification is one of the most remarkable tasks that humans perform with ease. Broadly speaking, Artificial Intelligence is the field that tries to build models with human-like intelligence to simplify tasks for all of us. Humans are highly proficient at text classification, yet even many advanced NLP models fall short of that level. So the question arises: what do we humans do differently? How do we classify text?

We know that words form sentences and sentences form a document; at a lower level, characters form words. We can guess many unknown words just from the structure of a sentence. We then interpret the message that a series of sentences conveys, and from that series of sentences we understand the meaning of a paragraph. The Hierarchical Attention model works in a similar way.

A Hierarchical Attention Network uses a stacked recurrent neural network at the word level, followed by an attention network. The goal is to extract the words that are important to the meaning of the sentence and aggregate these informative words into a sentence vector. The same technique is then applied to the resulting sentence vectors to produce a document vector that summarises the meaning of the document, and this vector is passed on for text classification.

The intention is to derive the meaning of a sentence from its informative words and the meaning of the document from its informative sentences. Not all words are equally important: some distinguish a sentence more than others. So we use the attention network so that the sentence vector pays more attention to the informative words.

The attention model consists of two parts:

1) Bidirectional Recurrent Neural Network

2) Attention networks. 

Bidirectional RNN learns the meaning behind those sequences of words and returns a vector corresponding to each word.

The attention network computes a weight for each word vector using a small neural network of its own. It then aggregates the word representations into a sentence vector by computing the weighted sum of the word vectors. This weighted sum represents the entire sentence. The same steps are applied to the sentence vectors so that the resulting vector captures the gist of the whole document. Because the model has two levels of attention, it is called a Hierarchical Attention Network.
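In symbols, one common way to write the word-level step is shown below. The notation is an assumption chosen to match the variable names in the attention layer code in this article: h_{it} stands for the Bidirectional RNN output for word t of sentence i, W and b are the weights and bias of the small network, u is a learned context vector, and s_i is the resulting sentence vector.

```latex
u_{it} = \tanh(W h_{it} + b), \qquad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u)}{\sum_{t'} \exp(u_{it'}^{\top} u)}, \qquad
s_i = \sum_{t} \alpha_{it} h_{it}
```

Applying the same three equations to the sentence-level RNN outputs, with a separate set of parameters, gives the document vector.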


1. Use the necessary dependencies


2. We can test the module using 


3. To train, test and save our model, first import the HAN module

    import HAN


4. Import the dataset (preferably as a pandas DataFrame)

5. Import the pretrained embedding vectors

6. Initialise the HAN module

    han_network = HAN.HAN(
        text=df.text,
        labels=df.category,
        num_categories=total_categories,
        pretrained_embedded_vector_path=embedded_vector_path,
        max_features=max_num_of_features,
        max_senten_len=max_sentence_len,
        max_senten_num=max_sentence_num,
        embedding_size=size_of_embedded_vectors,
    )
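The `max_senten_len` and `max_senten_num` arguments suggest that each document is padded or truncated to a fixed grid of sentences and words before it reaches the network. The sketch below shows one way to build such a grid with plain Python and NumPy. It is not the HAN module's own preprocessing: the helper name `docs_to_tensor`, the regex-based sentence and word splitting, and the toy vocabulary are all assumptions made for illustration.

```python
import re
import numpy as np

def docs_to_tensor(docs, word_index, max_senten_num, max_senten_len):
    """Hypothetical helper: encode documents as a (docs, sentences, words) integer array.

    Index 0 is reserved for padding; words missing from word_index are skipped.
    """
    data = np.zeros((len(docs), max_senten_num, max_senten_len), dtype="int32")
    for i, doc in enumerate(docs):
        # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        for j, sent in enumerate(sentences[:max_senten_num]):
            words = re.findall(r"[a-z0-9']+", sent.lower())
            ids = [word_index[w] for w in words if w in word_index]
            # Truncate long sentences; short ones stay zero-padded on the right.
            for k, idx in enumerate(ids[:max_senten_len]):
                data[i, j, k] = idx
    return data

docs = ["Great phone. Battery lasts long!", "Bad service."]
vocab = sorted({w for d in docs for w in re.findall(r"[a-z0-9']+", d.lower())})
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 stays free for padding
x = docs_to_tensor(docs, word_index, max_senten_num=3, max_senten_len=4)
print(x.shape)  # (2, 3, 4)
```

Each document becomes a 2D slice of word ids, so the word-level encoder can run over every sentence row and the sentence-level encoder over the rows themselves.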

7. Import the essential libraries

import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras import initializers, regularizers, constraints
from keras.callbacks import Callback, ModelCheckpoint
from keras.utils.np_utils import to_categorical
from keras.layers import Embedding, Input, Dense, LSTM, GRU, Bidirectional, TimeDistributed, Dropout
from keras import backend as K
from keras import optimizers
from keras.models import Model
import nltk
import re
import matplotlib.pyplot as plt
import sys
from sklearn.metrics import roc_auc_score
from nltk import tokenize
import seaborn as sns


8. Next, I will build an attention layer that can be used in hierarchical attention networks. I am using TensorFlow as the backend.

  • Input shape: 3D tensor with shape `(samples, steps, features)`.
  • Output shape: 2D tensor with shape `(samples, features)`.
  • How to use: put it on top of an RNN layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
  • The dimensions are inferred from the output shape of the RNN.
  • The layer has been tested with Keras 2.0.6. For example:

            model.add(LSTM(64, return_sequences=True))
            model.add(AttentionWithContext())


  • Next, add a Dense layer (for classification/regression).


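from keras.layers import Layer


def dot_product(x, kernel):
    """
    Wrapper for the dot product operation, in order to be compatible with both
    Theano and TensorFlow.

    Args:
        x (): input
        kernel (): weights
    """
    if K.backend() == 'tensorflow':
        # Give the kernel an extra trailing axis for K.dot, then squeeze it away again
        return K.squeeze(K.dot(x, K.expand_dims(kernel)), axis=-1)
    else:
        return K.dot(x, kernel)

class AttentionWithContext(Layer):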

    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.u_regularizer = regularizers.get(u_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.u_constraint = constraints.get(u_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

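        # Weight matrix of the one-layer network applied to every time step
        self.W = self.add_weight(name='{}_W'.format(self.name),
                                 shape=(input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight(name='{}_b'.format(self.name),
                                     shape=(input_shape[-1],),
                                     initializer='zeros',
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        # Context vector that each transformed time step is compared against
        self.u = self.add_weight(name='{}_u'.format(self.name),
                                 shape=(input_shape[-1],),
                                 initializer=self.init,
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)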

        super(AttentionWithContext, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        uit = dot_product(x, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)
        ait = dot_product(uit, self.u)

        a = K.exp(ait)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[-1]


This attention network is used for document classification in hierarchical attention networks.
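To make the shapes and the masking step concrete, here is a minimal NumPy sketch that mirrors the arithmetic of the `call` method above outside of Keras. It is an illustration under stated assumptions rather than part of the layer: the function name `attention_with_context`, the random inputs, and the `eps` value of 1e-7 (standing in for `K.epsilon()`) are all chosen for the example.

```python
import numpy as np

def attention_with_context(x, W, b, u, mask=None, eps=1e-7):
    """NumPy sketch of the forward pass: (samples, steps, features) -> (samples, features)."""
    uit = np.tanh(x @ W + b)          # transform every time step, same shape as x
    ait = uit @ u                     # score against the context vector, (samples, steps)
    a = np.exp(ait)
    if mask is not None:
        a = a * mask                  # padded steps get zero weight
    # Renormalise over the steps; eps guards against division by a near-zero sum.
    a = a / (a.sum(axis=1, keepdims=True) + eps)
    pooled = (x * a[..., None]).sum(axis=1)   # weighted sum over the steps
    return pooled, a

rng = np.random.default_rng(0)
samples, steps, features = 2, 5, 4
x = rng.normal(size=(samples, steps, features))
W = rng.normal(size=(features, features))
b = np.zeros(features)
u = rng.normal(size=features)
# First sample has two padded steps at the end; second sample has none.
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=float)

pooled, weights = attention_with_context(x, W, b, u, mask)
print(pooled.shape)      # (2, 4)
print(weights[0, 3:])    # [0. 0.]
```

Because the mask is applied after the exponential and before normalisation, the remaining weights of each sample still sum to (almost exactly) one.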



  • Marketing: With text classification using hierarchical attention networks, businesses can group users by their opinions about a product. This is useful for identifying trends and customer types.
  • Reviews: Businesses can find the aspects of their services or products that customers are unhappy with by classifying review text with hierarchical attention networks.
  • Sentiment analysis: Text classification using hierarchical attention networks is used for sentiment analysis, which predicts the sentiment expressed towards specific characteristics of a product or service.
  • Spam detection: Text classification using hierarchical attention networks is used to filter spam emails and spam text messages.

Frequently Asked Questions

  1. What is the structure of a hierarchical attention network?
    A hierarchical attention network uses a stacked recurrent neural network at the word level, followed by an attention network. The same structure is repeated at the sentence level.
  2. What does an attention model consist of?
    1. Bidirectional Recurrent Neural Network
    2. Attention network
  3. What is the function of the bidirectional RNN in the attention model?
    The bidirectional RNN learns the meaning behind a sequence of words and returns a vector corresponding to each word.
  4. What is the function of the attention network in the attention model?
    The attention network computes a weight for each word vector using a small neural network of its own. It then aggregates the word representations into a sentence vector by computing the weighted sum of the word vectors.
  5. How many attention levels are there in a hierarchical attention network?
    Hierarchical attention networks have two levels of attention: one over words and one over sentences.

Key Takeaways

Hierarchical attention networks have a range of interesting use cases. We looked at the concept and at the implementation of the attention layer used in hierarchical attention networks.

Are you interested in learning similar Machine Learning tools? Also check:

Autoencoders VS PCA

Convolutional Neural Networks
