Table of contents
1. Introduction
2. What is Sentiment Analysis?
3. Understanding the IMDB Datasets
   3.1. Characteristics of IMDB Datasets
4. Pre-requisites
5. Building Model to Predict Sentiments with Keras
6. Preprocessing of Data
7. Analysis of the IMDB Dataset
8. Truncating Review Words
9. Building IMDB Model
10. Training of IMDB Model
11. Testing of IMDB Model
12. Frequently Asked Questions
   12.1. What is Keras?
   12.2. Which datasets can be preferred for sentiment analysis with Keras?
   12.3. How to deal with unbalanced sentiment classes in the dataset?
13. Conclusion
Last Updated: Mar 27, 2024

Predicting Sentiments with Keras

Author Ayush Mishra

Introduction

Sentiment expresses feelings, thoughts, ideas, or attitudes toward a specific issue or topic. Machine learning models that predict sentiments have helpful real-world applications, such as measuring public opinion and determining customer happiness.

In this blog, we will be predicting sentiments with Keras. Let's get started!

What is Sentiment Analysis?

Sentiment analysis, also known as opinion mining, is an NLP (Natural Language Processing) task that identifies the sentiment or emotion expressed in a text. Its main objective is to understand the author's underlying attitude towards a given topic, whether positive, negative, or neutral.

Some applications of sentiment analysis are social media monitoring, brand monitoring, business intelligence, and market research. It enables companies to extract useful information from text data, keep track of public opinion, and base decisions on the opinions and sentiments of their customers.

In this blog, we will predict sentiments with Keras on the IMDB movie review dataset.


Understanding the IMDB Datasets

The IMDB dataset is a standard benchmark for evaluating sentiment analysis models. It consists of movie reviews, each labeled with a positive (1) or negative (0) sentiment.

The dataset is easily accessible through the keras.datasets module in Python.

Characteristics of IMDB Datasets

The characteristics of the IMDB dataset are:

  • Size: There are 50,000 movie reviews in the IMDB dataset. They are split into two sets: a training set of 25,000 reviews and a testing set of the remaining 25,000 reviews
     
  • Label Distribution: Each review is labeled either positive or negative, so this is a binary sentiment classification problem. The dataset is evenly balanced, with 50% positive and 50% negative reviews
     
  • Text Data: The length of each movie review in the dataset varies, and each review is a sequence of words. These English-language reviews cover various movie genres and themes

Pre-requisites

Before predicting sentiments with Keras, you should have some experience with the technologies mentioned below.

  • Familiarity with Python and Natural Language Processing concepts
     
  • Familiarity with Keras and TensorFlow for building deep learning models
     
  • Some basic knowledge of the numpy and matplotlib libraries in Python

Building Model to Predict Sentiments with Keras

In this section, we will build an NLP model in Keras that predicts whether the sentiment of a given review (text) is positive or negative, based on the words used in the text. We will use the IMDB dataset to train and test it.

Preprocessing of Data

In this section of "Predicting Sentiments with Keras," we will import all the important and necessary libraries along with the Keras framework.

Code

# Loading Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf


# Importing Keras Modules
import keras
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense
from keras import Sequential


# Loading IMDB Dataset
from keras.datasets import imdb


import warnings
warnings.filterwarnings('ignore')


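Now, we will load the dataset, which comes already split into training and test sets. We pass num_words=5000 so that only the 5,000 most frequent words are kept, which matches the vocabulary size of the embedding layer used later:

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)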


Next, we will check by printing whether the data is split into 25,000 training reviews and 25,000 test reviews.
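A quick way to confirm the split is to count the reviews and labels in each part. The helper below is a minimal sketch: `describe_split` is a hypothetical name, and the toy lists stand in for the arrays returned by `imdb.load_data()` so that the snippet runs without downloading anything.

```python
def describe_split(x_train, y_train, x_test, y_test):
    """Return the number of reviews and labels in each split."""
    return {
        "train_reviews": len(x_train),
        "train_labels": len(y_train),
        "test_reviews": len(x_test),
        "test_labels": len(y_test),
    }

# Toy stand-in data (assumption): each review is a list of word indices,
# each label is 0 (negative) or 1 (positive).
toy_x_train, toy_y_train = [[1, 14, 22], [1, 194, 5]], [1, 0]
toy_x_test, toy_y_test = [[1, 591, 202]], [1]

sizes = describe_split(toy_x_train, toy_y_train, toy_x_test, toy_y_test)
for name, count in sizes.items():
    print(name, count)
```

With the real IMDB arrays passed in, each of the four counts should be 25,000.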


Analysis of the IMDB Dataset

In this section of "Predicting Sentiments with Keras," we will analyze the IMDB Dataset by exploring it. First, we will check the number of unique words and sentiment classes in the training dataset.

Code

print("Number of Unique Words: ")
print(len(np.unique(np.hstack(x_train))))
print("Sentiment Classes:")
print(np.unique(y_train))



Now, we will look at a few words and their indices in the IMDB dataset. We will print the first 10 entries of the dictionary returned by the function get_word_index() with the code below:

Code

w_i = tf.keras.datasets.imdb.get_word_index()
for x in list(w_i)[0:10]:
  print("{}:{}".format(x, w_i[x]))
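The word index can also be used in reverse, to turn an encoded review back into text. The sketch below assumes the default settings of `imdb.load_data()`, where every word rank is shifted by 3 and the indices 0, 1, and 2 are reserved for padding, the start marker, and unknown words. The tiny `toy_word_index` is a hypothetical stand-in for the real dictionary, and `decode_review` is an illustrative helper, not part of Keras.

```python
# Hypothetical miniature word index; the real one comes from imdb.get_word_index().
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 17, "great": 84}

# load_data() shifts every rank by 3 and reserves 0 (padding), 1 (start), 2 (unknown).
INDEX_OFFSET = 3
reverse_index = {rank + INDEX_OFFSET: word for word, rank in toy_word_index.items()}
reverse_index.update({0: "<pad>", 1: "<start>", 2: "<unk>"})

def decode_review(encoded):
    """Map a list of integer word indices back to a readable string."""
    return " ".join(reverse_index.get(i, "<unk>") for i in encoded)

encoded_review = [1, 4, 20, 2, 87]  # start, "the", "movie", unknown, "great"
decoded = decode_review(encoded_review)
print(decoded)
```

Decoding a few training reviews this way is a simple sanity check that the data was loaded as expected.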

 


Now, we will check the average length of a review with the code given below:

Code

print('Length of the Review')
r_len = list(map(len, x_train))


print("Mean %.2f" % np.mean(r_len))



Now, we will draw a box plot of the review lengths (in words) using the code below.

Code

# plot review length
res = [len(x) for x in x_train]
plt.boxplot(res)
plt.show()



In the above plot, we can see that the bulk of the distribution can be covered with a clipped length of roughly 400 to 500 words, while a tail of outliers runs to 1,000 words and beyond.
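Besides reading the plot, the share of reviews that fit within a candidate cut-off can be computed directly. The following is a minimal sketch using only the standard library; `length_summary` is a hypothetical helper, and `toy_reviews` is made-up data standing in for `x_train`.

```python
import statistics

def length_summary(reviews, maxlen):
    """Summarise review lengths and the share of reviews that fit within maxlen."""
    lengths = [len(r) for r in reviews]
    covered = sum(1 for n in lengths if n <= maxlen)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "share_within_maxlen": covered / len(lengths),
    }

# Toy stand-in for x_train (assumption): five "reviews" of different lengths.
toy_reviews = [[0] * 100, [0] * 200, [0] * 300, [0] * 400, [0] * 1000]
summary = length_summary(toy_reviews, maxlen=500)
print(summary)
```

A cut-off that covers most reviews while staying well below the maximum length keeps the input small without discarding much text.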

Truncating Review Words

The reviews in the dataset differ in length. To bring them to a common length, we will use the maxlen parameter.

maxlen: This parameter specifies the maximum length, in words, of a text such as a movie review. We must cap the review length in order to process the data efficiently. Any review that is longer than maxlen will be truncated.

A neural network needs all of its inputs to be the same length. So we will convert every review to a length of exactly 500 using the pad_sequences function in Keras, which truncates longer reviews and pads shorter ones with zeros.

Code:

x_train = pad_sequences(x_train, maxlen=500)
x_test = pad_sequences(x_test, maxlen=500)


In the above code, reviews longer than 500 words are truncated to 500 words, and shorter reviews are padded with zeros up to that length.
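It helps to know which end of a review is cut or padded. By default, pad_sequences pads at the front and also truncates from the front, so the last words of a long review are the ones that are kept. The pure-Python function below is only an illustrative re-implementation of that default behaviour for a single sequence; `pad_or_truncate` is a hypothetical name, not a Keras function.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence:
    keep the last maxlen items, then left-pad with `value`."""
    trimmed = sequence[-maxlen:]                 # truncating='pre' drops the earliest items
    padding = [value] * (maxlen - len(trimmed))  # padding='pre' adds zeros in front
    return padding + trimmed

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)
print(padded_short)
print(padded_long)
```

If the beginning of a review matters more for a task, pad_sequences accepts `padding='post'` and `truncating='post'` to change this.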

Building IMDB Model

We will build a simple sequential neural network made of an embedding layer, an LSTM layer, and a dense output layer.

LSTM models are well suited to this problem since they handle sequential data and are better at retaining long-range dependencies. Using an LSTM in NLP tasks is advantageous because it takes a whole sentence as input for prediction rather than a single word.

The embedding layer transforms each word in the input into a dense vector of a specific size (the embedding dimension). To increase the model's accuracy, we must also set hyperparameters such as the batch size, the number of epochs, and the number of LSTM units.

The layers are stacked in Keras using a Sequential model. To improve the model, you can experiment with the number of layers, and to prevent overfitting, you can add dropout layers.

Code

num_words = 5000
embeding_dim = 32 
output_lstm = 100


# Setting up the model
model = Sequential() 
model.add(Embedding(num_words, embeding_dim, input_length=500)) 
model.add(LSTM(output_lstm)) 
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy']) 
model.summary()



We have set the vocabulary size to 5,000 words, the word vector size to 32 dimensions, and the input length to 500. For each review, the output of this first layer will be a 500×32 matrix.
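The layer sizes also determine how many trainable parameters the model has, which is what model.summary() reports. The arithmetic below is a sketch based on the standard parameter-count formulas for embedding, LSTM, and dense layers; the variable names are chosen for illustration.

```python
num_words = 5000      # vocabulary size
embedding_dim = 32    # size of each word vector
lstm_units = 100      # number of LSTM units
review_length = 500   # padded review length

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = num_words * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
embedding_output_shape = (review_length, embedding_dim)  # per review

print("Embedding:", embedding_params)
print("LSTM:", lstm_params)
print("Dense:", dense_params)
print("Total:", total_params)
print("Embedding output per review:", embedding_output_shape)
```

Most of the parameters sit in the embedding layer, so the vocabulary size is the main lever on model size.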

The output layer has one neuron with a sigmoid activation, which produces a value between 0 and 1. Values close to 1 indicate a positive review, and values close to 0 indicate a negative one.
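To turn the sigmoid output into a class label, a threshold is applied. The sketch below uses the common choice of 0.5, which is an assumption rather than something fixed by the model; `sigmoid` and `to_label` are illustrative helpers.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Read the sigmoid output as the probability of a positive review."""
    return "Positive" if probability >= threshold else "Negative"

for z in (-2.0, 0.0, 1.5):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.3f}  label={to_label(p)}")
```

Raising or lowering the threshold trades false positives against false negatives.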

Training of IMDB Model

We train the model on the training set and pass (x_test, y_test) as validation_data, so that accuracy and loss are reported for both the training and validation sets at each epoch.

To train our model with two epochs, we will use model.fit().

Code

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2, batch_size=128, verbose=2)
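The batch size and the number of epochs together decide how many weight updates the model receives. The short calculation below is a sketch using the values from this walkthrough (25,000 training reviews, a batch size of 128, and 2 epochs).

```python
import math

train_size = 25000   # number of training reviews
batch_size = 128
epochs = 2

# The final batch of an epoch is smaller when train_size is not a multiple of batch_size.
steps_per_epoch = math.ceil(train_size / batch_size)
last_batch_size = train_size - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print("Steps per epoch:", steps_per_epoch)
print("Size of last batch:", last_batch_size)
print("Total weight updates:", total_updates)
```

A smaller batch size gives more updates per epoch at the cost of longer training time.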

 


After two epochs, the model reaches an accuracy of about 87%. Training for more epochs may improve the accuracy further, although training for too long can lead to overfitting.

If no validation data is provided while training the model, the model can be assessed separately using the following code:

Code

model_eval = model.evaluate(x_test, y_test)



To save the model, write the below code.

Code

model.save('imdb_analysis.h5')

Testing of IMDB Model

We will now load the saved model and pass it a sample review to check whether it works correctly. The words of the review are converted to indices, the sequence is padded to the expected length, and the model then predicts the review's sentiment.

Code

# Load the saved model
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from keras.datasets import imdb
loaded_model = load_model('imdb_analysis.h5')


# Encode the review like load_data() does: word rank + 3, 1 = start, 2 = unknown word
word_index = imdb.get_word_index()
words = 'This movie is adventures.'.lower().strip('.').split()
ids = [word_index.get(w, -1) + 3 for w in words]
sequence = [[1] + [i if 3 < i < 5000 else 2 for i in ids]]
test = pad_sequences(sequence, maxlen=500)


# The sigmoid output is the probability that the review is positive
score = loaded_model.predict(test)[0][0]
print(score)
print('Positive' if score >= 0.5 else 'Negative')



In the run shown here, the model returned a score of 0.684. This value is a predicted probability, not an accuracy. Since it is above 0.5 and therefore closer to 1 (positive) than to 0 (negative), the review is classified as positive. Training for more epochs may improve the model's accuracy.

Frequently Asked Questions

What is Keras?

Python-based Keras is an open-source deep-learning library. To create neural networks, it offers a high-level interface. TensorFlow is frequently utilized as Keras' backend.

Which datasets can be preferred for sentiment analysis with Keras?

IMDb movie reviews, Sentiment140 (Twitter), Amazon product reviews, and Yelp reviews are well-known datasets for sentiment analysis. Pre-labeled versions of these datasets are readily available online.

How to deal with unbalanced sentiment classes in the dataset?

Suppose, for example, that the dataset contains many more positive reviews than negative ones. In that case, you can tackle the imbalance during training using oversampling, undersampling, or class weighting strategies.
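As an illustration of class weighting, the sketch below computes "balanced" weights with the common formula n_samples / (n_classes × class_count), so that the rarer class counts more during training. The function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy imbalanced labels (assumption): 80 positive reviews and 20 negative ones.
toy_labels = [1] * 80 + [0] * 20
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary like this can be passed to model.fit() through its class_weight argument, so that errors on the rarer class contribute more to the loss.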

Conclusion

In this blog, we have discussed how to train an LSTM model in Keras to predict the sentiment (positive or negative) of movie review text using the IMDB dataset. We have trained the model and tested that it works.

We hope this blog has helped you gain knowledge of predicting sentiments with Keras. Do not stop learning!

 


We wish you Good Luck! 

Happy Learning!
