Introduction
In this article, we will use the simple IMDB dataset that ships with the Keras datasets. Sentiment Analysis, also known as Opinion Mining, is an important research topic currently used by most e-commerce companies and many customer-service-based companies. Given a sentence or a piece of text, we want to predict or analyze its sentiment, opinion, or overall context. This involves machine learning concepts such as the RNN.
RNNs are designed to perform tasks involving sequential/time-series data. The key idea behind an RNN is that each step builds on the previous activity: the network combines the current input with the state carried over from previous inputs and, using the tanh function, produces a new hidden state. This mechanism repeats until the end of the sequence is reached.
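To make this concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The weight names (W_x, W_h, b) and the sizes are illustrative assumptions for this sketch, not part of the Keras model we build later:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # Combine the current input with the previous hidden state,
    # then squash through tanh to produce the new state.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Illustrative sizes: 8-dimensional inputs, 16-dimensional hidden state
rng = np.random.default_rng(0)
W_x = rng.normal(size=(8, 16))
W_h = rng.normal(size=(16, 16))
b = np.zeros(16)

h = np.zeros(16)
for x_t in rng.normal(size=(5, 8)):  # a toy sequence of 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)  # each step feeds on the previous state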
Why Sentiment Analysis using RNN?
As we discussed earlier, RNNs perform tasks that involve sequential/time-series data, and sentiment analysis involves sequential data. Wait! But how?
Let’s take a simple example: “I am very happy about meeting you! But this time, the meeting became unfair.”
On reading the example sentence, we can say that the person is in a happy mood, but there is a reason for that happiness. Each part of the sentence contributes its own shade of meaning: “I am very happy!” on its own means one thing, and the phrase that follows, “about meeting you,” shifts that meaning, while the second sentence changes it again. The sequence of phrases builds up the overall feeling, so sentiment analysis deals with sequential data. This is why an RNN is a good fit.
Let’s see a simple Python script that implements a Recurrent Neural Network for sentiment analysis. We will follow a chapter from the Udacity NLP Nanodegree and work through it.
Building our First Sentiment Analysis Model
For this task, we will use the Keras library. Keras also ships with several built-in datasets for learning purposes. We will use one of them, the IMDB dataset, which contains movie reviews that are already preprocessed, so you don’t need to repeat the preprocessing yourself. The dataset includes 25,000 training reviews and 25,000 test reviews. You can learn more about the dataset and how it was preprocessed in the Keras documentation.
from keras.datasets import imdb # import the built-in imdb dataset in Keras
# Set the vocabulary size
vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))
Output:
Loaded dataset with 25000 training samples, 25000 test samples
Next, let’s check the format of a review by printing a random sample, as shown below.
print("--- Review ---")
print(X_train[17])
print("--- Label ---")
print(y_train[17])
Output:
(The review prints as a list of integers, i.e., word IDs, and the label prints as 1.)
Here we can see that the review is represented as numbers: each integer is the ID of a word in the dataset’s vocabulary. The label indicates whether the review is positive or negative: an integer value of 1 represents a positive review and 0 a negative one, so this particular review is positive. We can also recover the original words from their IDs with the imdb.get_word_index() method. I won’t do that here, but you can try it yourself, as sketched below.
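A quick sketch of the decoding step. Note that, by default, imdb.load_data() reserves indices 0–2 for padding, sequence start, and unknown words, and offsets the real word IDs by 3, which is why we subtract 3 when looking a word up:

word_index = imdb.get_word_index()  # maps word -> integer ID
index_to_word = {i: w for w, i in word_index.items()}
# Subtract 3 because load_data() offsets word IDs by default
decoded = ' '.join(index_to_word.get(i - 3, '?') for i in X_train[17])
print(decoded)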
The reviews have different lengths: some are longer than others. But in order to pass these reviews into our RNN model, we need all inputs to be of the same length. This can be accomplished with the pad_sequences() method, which takes the data and a maxlen parameter and pads (or truncates) each review to exactly that many words.
from keras.preprocessing import sequence

max_words = 500  # pad/truncate every review to exactly 500 words
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
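As a quick sanity check, we can confirm the resulting shapes (assuming the default 25,000-sample splits). Note that pad_sequences pads and truncates at the start of each sequence by default:

print(X_train.shape)  # expected: (25000, 500)
print(X_test.shape)   # expected: (25000, 500)
print(X_train[17][:10])  # leading zeros if the review was shorter than 500 words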
Now that the data is ready for training, our next step is to build the model, again with the help of the Keras library.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size = 64
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))  # word IDs -> 64-d vectors
model.add(LSTM(100))                        # 100-unit LSTM reads the embedded sequence
model.add(Dense(1, activation='sigmoid'))   # single probability: positive vs. negative
For this simple task, we used an embedding layer, an LSTM layer, and a final dense layer with a sigmoid activation function.
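Before compiling, it can help to inspect the architecture with model.summary(). For these layer sizes, the parameter counts should work out as in the comments below:

model.summary()
# Expected parameter counts:
#   Embedding: 5000 * 64              = 320,000
#   LSTM:      4 * (64 + 100 + 1) * 100 = 66,000
#   Dense:     100 + 1                =     101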
As our model is ready, and since we are using the Keras library, we next need to compile the model by specifying the loss function, optimizer, and metrics, as shown below:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Once our model is compiled, we are ready to train it on the data.
batch_size = 64
num_epochs = 3

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]    # first batch_size samples for validation
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]  # the rest for training

model.fit(X_train2, y_train2,
          validation_data=(X_valid, y_valid),
          batch_size=batch_size, epochs=num_epochs)
Output:
(Training log showing loss and accuracy on the training and validation sets for each of the 3 epochs.)
Checking the test accuracy:
scores = model.evaluate(X_test, y_test, verbose=0) # returns loss and other metrics specified in model.compile()
print("Test accuracy:", scores[1]) # scores[1] should correspond to accuracy if you passed in metrics=['accuracy']
Output:
Test accuracy: 0.87604
Well, 87% accuracy is a good score even for this simple model. You can try adding additional layers, changing the parameter values, or using different optimizers and metrics; in this way, you can push for better results, as sketched below.
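For example, here is one illustrative variant, a sketch rather than a tuned model, that stacks a second LSTM and adds Dropout for regularization. The layer sizes and the 0.5 dropout rate are assumptions worth experimenting with:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

model2 = Sequential()
model2.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model2.add(LSTM(100, return_sequences=True))  # return the full sequence so the next LSTM can consume it
model2.add(Dropout(0.5))                      # illustrative rate; tune as needed
model2.add(LSTM(100))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

You can then train and evaluate model2 exactly as we did above and compare the test accuracy.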