Introduction
This article covers the theoretical background and a hands-on implementation of sentiment analysis using BERT.
Bidirectional Encoder Representations from Transformers (BERT) was proposed by Google AI Language researchers in 2018. Although the original goal was to better understand the meaning of Google Search queries, BERT has become one of the most important and versatile architectures for natural language processing, producing state-of-the-art results on sentence-pair classification, question answering, and many other language tasks.
Architecture
One of BERT's key advantages is its versatility: a single pre-trained model can be fine-tuned for many NLP tasks with state-of-the-art accuracy, much like the transfer learning we use in computer vision. The original paper also describes the task-specific setups used for this purpose. In this post, we'll apply the BERT architecture to a single-sentence classification problem, specifically the setup used for the binary classification task CoLA (Corpus of Linguistic Acceptability). We went over the BERT architecture in detail in the previous post, but let's review some of the key points:
BERT was released in two sizes:
- BERT BASE: a stack of 12 encoder layers with 12 bidirectional self-attention heads and 768 hidden units.
- BERT LARGE: a stack of 24 encoder layers with 16 bidirectional self-attention heads and 1024 hidden units.
Google has released TensorFlow implementations of both BERT BASE and BERT LARGE in two variants: Uncased and Cased. In the uncased version, the text is lowercased before WordPiece tokenization.
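To see what "uncased" means in practice, a quick sketch like the one below tokenizes the same sentence with both checkpoints (the exact subword splits depend on each checkpoint's vocabulary):

from transformers import BertTokenizer

# Load both variants of the BERT BASE tokenizer
uncased_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
cased_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# The uncased tokenizer lowercases the text before WordPiece tokenization,
# while the cased tokenizer preserves the original casing.
print(uncased_tokenizer.tokenize("This Movie was GREAT"))  # all tokens lowercased
print(cased_tokenizer.tokenize("This Movie was GREAT"))    # casing kept; all-caps words may split into subwords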
Sentiment Analysis with BERT
Steps needed to train the sentiment analysis model:
- Install the Transformers library;
- Load the BERT classifier and tokenizer, along with the InputExample and InputFeatures classes;
- Download the IMDB Reviews data and create a processed dataset (this takes several steps);
- Configure the loaded BERT model and fine-tune it;
- Use the fine-tuned model to make predictions.
Installing Transformers
Installing the Transformers library is pretty straightforward:
pip install transformers

Once the installation is complete, we'll load the pre-trained BERT tokenizer and sequence classifier, as well as InputExample and InputFeatures. Then we'll create our model and tokenizer from the sequence classifier and BERT's tokenizer.
CODE-
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

IMDB Dataset
The IMDB Reviews dataset, compiled by Andrew L. Maas, is a large movie-review dataset collected from the leading movie-rating site. The task is to determine whether a review is positive or negative. The dataset contains 25,000 movie reviews for training and 25,000 for testing, all of which are labeled and can be used for supervised deep learning. It also includes another 50,000 unlabeled reviews, which we will not use here. In this case study, we will work solely with the training split.
TensorFlow and pandas will be our first two imports:
CODE-
import tensorflow as tf
import pandas as pd

Get the Data from the Stanford Repo
Then we can download the dataset from Stanford's repository with the tf.keras.utils.get_file function, as shown below:
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz", origin=URL,
                                  untar=True, cache_dir='.', cache_subdir='')
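For orientation, the extracted aclImdb folder looks roughly like this; the unsup folder holds the 50,000 unlabeled reviews mentioned earlier:

aclImdb/
    train/
        pos/     12,500 labeled positive reviews
        neg/     12,500 labeled negative reviews
        unsup/   50,000 unlabeled reviews (removed in the next step)
    test/
        pos/     12,500 labeled positive reviews
        neg/     12,500 labeled negative reviews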
Remove Unlabeled Reviews

The following operations remove the unlabeled reviews; each operation is explained in the comments below:
# The shutil module provides a number of high-level
# operations on files and collections of files.
import os
import shutil
# Create main directory path ("/aclImdb")
main_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
# Create sub directory path ("/aclImdb/train")
train_dir = os.path.join(main_dir, 'train')
# Remove unsup folder since this is a supervised learning task
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
# View the final train folder
print(os.listdir(train_dir))

Train and Test Split
Now that we've cleaned and prepared our data, we can build our datasets with tf.keras.preprocessing.text_dataset_from_directory, as shown below. I'd like to process all of the data in one go, which is why I went with a large batch size:
# Create the training and validation sets from our
# "aclImdb/train" directory with an 80/20 split.
train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2,
    subset='training', seed=123)
test = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2,
    subset='validation', seed=123)

Convert to Pandas to View and Process
Now that we have our raw train and test datasets, let's prepare them for the BERT model. To make the data easier to inspect and process, I'll convert the TensorFlow Dataset objects into pandas dataframes. The train Dataset object is converted to a train dataframe with the following code (the test split gets the same treatment right after):
for i in train.take(1):
    train_feat = i[0].numpy()
    train_lab = i[1].numpy()

train = pd.DataFrame([train_feat, train_lab]).T
train.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
train['DATA_COLUMN'] = train['DATA_COLUMN'].str.decode("utf-8")
train.head()
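The conversion functions in the next section expect the test split as a dataframe as well, so we apply exactly the same steps to the test Dataset object:

for j in test.take(1):
    test_feat = j[0].numpy()
    test_lab = j[1].numpy()

test = pd.DataFrame([test_feat, test_lab]).T
test.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
test['DATA_COLUMN'] = test['DATA_COLUMN'].str.decode("utf-8")
test.head()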
Creating Input Sequences

We now have two pandas dataframes that need to be converted into objects the BERT model can accept. We'll use the InputExample class to build sequences from our dataset. An InputExample is constructed like this:

InputExample(guid=None, text_a="Hello, world", text_b=None, label=1)

We'll now construct two major functions:
1. convert_data_to_examples: takes our train and test dataframes and turns every row into an InputExample object.
2. convert_examples_to_tf_dataset: tokenizes the InputExample objects, builds the required input format from the tokenized objects, and finally creates a tf.data.Dataset to feed to the model.
CODE-
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN):
    # Turn every row of the train dataframe into an InputExample object
    train_InputExamples = train.apply(
        lambda x: InputExample(guid=None,  # globally unique ID, not used here
                               text_a=x[DATA_COLUMN],
                               text_b=None,
                               label=x[LABEL_COLUMN]),
        axis=1)

    # Do the same for the test (validation) dataframe
    validation_InputExamples = test.apply(
        lambda x: InputExample(guid=None,
                               text_a=x[DATA_COLUMN],
                               text_b=None,
                               label=x[LABEL_COLUMN]),
        axis=1)

    return train_InputExamples, validation_InputExamples


def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = []  # will hold one InputFeatures object per example

    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,    # adds [CLS] and [SEP]
            max_length=max_length,      # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length',       # pads to the right up to max_length
            truncation=True)

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict["attention_mask"])

        features.append(
            InputFeatures(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                label=e.label))

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

We call the functions above as follows:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)

Our dataset containing the processed input sequences is ready to be fed to the model.
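As an optional sanity check (a quick sketch assuming the batch size of 32 and the max_length of 128 used above), you can peek at one batch to confirm the shapes the model will receive:

# Inspect one batch of the processed training dataset
for batch_inputs, batch_labels in train_data.take(1):
    print(batch_inputs["input_ids"].shape)       # (batch_size, max_length), e.g. (32, 128)
    print(batch_inputs["attention_mask"].shape)  # same shape as input_ids
    print(batch_labels.shape)                    # (batch_size,)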
Configuring the BERT model and Fine-tuning
Our optimizer will be Adam, our loss function will be SparseCategoricalCrossentropy, and our accuracy metric will be SparseCategoricalAccuracy. Fine-tuning the model for two epochs gives around 95% accuracy, which is fantastic.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
model.fit(train_data, epochs=2, validation_data=validation_data)

Making Predictions
We made a list with two reviews: the first one is clearly positive, while the second one is clearly negative.
pred_sentences = ['This was an awesome movie. I watch this beautiful movie twice my time if I have known it was this good',
                  'Worst movies of all time. I lost two hours of my life because of this movie']

We'll use our pre-trained BERT tokenizer to tokenize the reviews, feed the tokenized sequences into the model, and run a final softmax layer to get the predictions. The argmax function then determines whether each predicted sentiment is positive or negative. Finally, a simple for loop prints the results. All of these steps are performed in the following lines:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": \n", labels[label[i]])

You've created a transformer network using a pre-trained BERT model and achieved a sentiment analysis accuracy of around 95% on the IMDB reviews dataset!