Introduction
Text classification is one of the fundamental tasks in natural language processing. In more general terms, artificial intelligence is the field that tries to build human-like intelligent models to simplify tasks for all of us. Humans are remarkably good at classifying text, yet even many advanced NLP models fall well short of that level. So the question arises: what do we humans do differently? How do we classify text?
We know that language is hierarchical: at the lowest level characters form words, words form sentences, and sentences make up a document. We can guess many unknown words just from the structure of a sentence. We then interpret the message that each sentence imparts, and from this series of sentences we understand the meaning of a paragraph. The Hierarchical Attention model does something similar.
A Hierarchical Attention Network uses stacked recurrent neural networks at the word level, followed by an attention network. The goal is to extract the words that are important to the meaning of the entire sentence and aggregate these informative words into a sentence vector. The same technique is then applied to the resulting sentence vectors, producing a vector that summarizes the meaning of the whole document, which is passed on for text classification.
The intention is to derive the meaning of a sentence from its informative words, and the meaning of the document from those informative sentences. Not all words are equally important; some words characterize a sentence more than others. So we use an attention network, so that the sentence vector pays more attention to the informative words.
The attention model consists of two parts:
1) Bidirectional Recurrent Neural Network
2) Attention network.
Bidirectional RNN learns the meaning behind those sequences of words and returns a vector corresponding to each word.
The attention network computes a weight for each word vector using a small external neural network. It then aggregates the word representations into a sentence vector by taking their weighted sum; this weighted sum represents the entire sentence. The same steps are applied to the sentence vectors, so that the resulting vector captures the gist of the whole document. Because it has two such levels of attention, the model is called a Hierarchical Attention Network.
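As a rough illustration of this weighted-sum idea, here is a small NumPy sketch with made-up dimensions (the actual Keras attention layer appears later in the implementation):
import numpy as np

# Hypothetical output of a bidirectional RNN: 10 words, 64 features per word
word_vectors = np.random.rand(10, 64)

# A small scoring network assigns one relevance score per word (random stand-ins here)
scores = np.random.rand(10)
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

# The sentence vector is the attention-weighted sum of the word vectors
sentence_vector = (weights[:, None] * word_vectors).sum(axis=0)   # shape: (64,)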
Implementation
1. Install the necessary dependencies:
bash setup.sh
2. We can test the module using
python3 run_han.py
3. To train, test and save our model, first import the HAN module
import HAN
4. Import the dataset (preferably as a pandas DataFrame).
5. Import the pretrained embedding vectors.
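For example, steps 4 and 5 might look like the following; the file names, column names, and the use of GloVe vectors are illustrative assumptions, not part of the repository:
import pandas as pd

# Hypothetical dataset: one document and one label per row
df = pd.read_csv('news.csv')                # expects columns 'text' and 'category'
total_categories = df.category.nunique()    # number of distinct labels

# Hypothetical pretrained embeddings, e.g. 100-dimensional GloVe vectors
embedded_vector_path = 'glove.6B.100d.txt'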
6. Initialise the HAN module:
han_network = HAN.HAN(text=df.text,
                      labels=df.category,
                      num_categories=total_categories,
                      pretrained_embedded_vector_path=embedded_vector_path,
                      max_features=max_num_of_features,
                      max_senten_len=max_sentence_len,
                      max_senten_num=max_sentence_num,
                      embedding_size=size_of_embedded_vectors)
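The remaining arguments are placeholder variables; one plausible configuration (the values below are illustrative assumptions, not defaults of the module) is:
max_num_of_features = 200000       # vocabulary size kept by the tokenizer
max_sentence_len = 40              # maximum number of words per sentence
max_sentence_num = 15              # maximum number of sentences per document
size_of_embedded_vectors = 100     # dimensionality of the pretrained embeddings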
7. Import the essential libraries
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras import initializers as initializers, regularizers, constraints
from keras.callbacks import Callback, ModelCheckpoint
from keras.utils.np_utils import to_categorical
from keras.layers import Embedding, Input, Dense, LSTM, GRU, Bidirectional, TimeDistributed, Dropout
from keras.engine.topology import Layer
from keras import backend as K
from keras import optimizers
from keras.models import Model
import nltk
import re
import matplotlib.pyplot as plt
import sys
from sklearn.metrics import roc_auc_score
from nltk import tokenize
import seaborn as sns
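These imports hint at the preprocessing that happens behind the scenes: each document is split into sentences with NLTK and converted into a padded 3D tensor of word indices. Here is a rough sketch of that idea, using the placeholder variables from the earlier steps (this is only an illustration, not the module's actual code; nltk.download('punkt') may be needed for sentence splitting):
tokenizer = Tokenizer(num_words=max_num_of_features)
tokenizer.fit_on_texts(df.text)

# documents -> sentences -> word indices, padded/truncated to fixed sizes
data = np.zeros((len(df), max_sentence_num, max_sentence_len), dtype='int32')
for i, text in enumerate(df.text):
    for j, sent in enumerate(tokenize.sent_tokenize(text)[:max_sentence_num]):
        for k, word in enumerate(text_to_word_sequence(sent)[:max_sentence_len]):
            idx = tokenizer.word_index.get(word, 0)
            data[i, j, k] = idx if idx < max_num_of_features else 0

labels = to_categorical(df.category.astype('category').cat.codes)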
8. I will now build an attention layer that can be used in hierarchical attention networks. I am using TensorFlow as the backend.
- Input shape: 3D tensor with shape `(samples, steps, features)`.
- Output shape: 2D tensor with shape `(samples, features)`.
- How to use: just place it on top of an RNN layer (GRU/LSTM/SimpleRNN) with `return_sequences=True`.
- The dimensions are inferred based on the output shape of the RNN.
- The layer has been tested with Keras 2.0.6. For example:
model.add(LSTM(64, return_sequences=True))
model.add(AttentionWithContext())
- Next, add a Dense layer on top (for classification/regression).
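A classification head might look like this (`num_classes` is a placeholder for the number of target labels):
model.add(Dense(num_classes, activation='softmax'))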
# Backend-agnostic dot product helper used by the attention layer below.
def dot_product(x, kernel):
    """
    Wrapper for the dot product operation, in order to be compatible with both
    Theano and TensorFlow.
    Args:
        x: input tensor
        kernel: weights
    Returns:
        The dot product of x and kernel.
    """
    if K.backend() == 'tensorflow':
        # TensorFlow needs an explicit expand/squeeze when the kernel is a vector
        return K.squeeze(K.dot(x, K.expand_dims(kernel)), axis=-1)
    else:
        return K.dot(x, kernel)

class AttentionWithContext(Layer):
    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.u_regularizer = regularizers.get(u_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.u_constraint = constraints.get(u_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        # Projection matrix W, bias b and context vector u
        self.W = self.add_weight(shape=(input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[-1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        self.u = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_u'.format(self.name),
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)

        super(AttentionWithContext, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # Hidden representation of each timestep: uit = tanh(W . h + b)
        uit = dot_product(x, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)

        # Similarity of each timestep with the learned context vector u
        ait = dot_product(uit, self.u)

        a = K.exp(ait)

        # apply mask after the exp; will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            a *= K.cast(mask, K.floatx())

        # in some cases, especially in the early stages of training, the sum may be almost zero
        # and this results in NaNs. A workaround is to add a very small positive number epsilon to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # Weighted sum over the timesteps: (samples, steps, features) -> (samples, features)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[-1]
This attention layer, applied at both the word level and the sentence level, is what the hierarchical attention network uses for document classification.
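To illustrate how the layer fits into the full architecture, here is a minimal sketch of the two-level model, reusing the placeholder variables from the earlier steps (the GRU sizes and the embedding layer here are assumptions; the HAN module builds its actual model internally):
# Word-level encoder: a sequence of word indices -> one sentence vector
embedding_layer = Embedding(max_num_of_features, size_of_embedded_vectors,
                            input_length=max_sentence_len)   # weights would normally come from the pretrained vectors
word_input = Input(shape=(max_sentence_len,), dtype='int32')
word_seq = embedding_layer(word_input)
word_gru = Bidirectional(GRU(50, return_sequences=True))(word_seq)
sent_vector = AttentionWithContext()(word_gru)
sent_encoder = Model(word_input, sent_vector)

# Sentence-level encoder: a sequence of sentence vectors -> document vector -> class probabilities
doc_input = Input(shape=(max_sentence_num, max_sentence_len), dtype='int32')
doc_seq = TimeDistributed(sent_encoder)(doc_input)
doc_gru = Bidirectional(GRU(50, return_sequences=True))(doc_seq)
doc_vector = AttentionWithContext()(doc_gru)
preds = Dense(total_categories, activation='softmax')(doc_vector)
han_model = Model(doc_input, preds)
han_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])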