Table of contents
1.
Introduction
2.
Implementation
3.
FAQs
4.
Key Takeaways
Last Updated: Mar 27, 2024

Word Embedding with Gensim

Author Rajkeshav
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

There are very famous Natural Language Processing libraries such as NLTK, spaCy, etc. They are very relaxed and have their merits, but Gensim has some unique features.

Gensim is super fast as it is written so that its code is highly optimised and parallelized, and most of the routine is C routine. Further, Gensim can process arbitrarily large corpora using data stream algorithms. Gensim has no platform dependents. Gensim runs on Linux, Windows, Mac OS, and any other platform, including Python and NumPy. In further discussion, we will be using the Gensim library to build a word2vec model. Let's get started with the implementation part.

Implementation

We will be using Amazon product reviews for cell phone accessories and building word2vec models in a Gensim library. Gensim library is an NLP library in Python, and it's straightforward to use. The syntax is brief compared to TensorFlow, so we will use this and train a model.

Let's first install the Gensim library. We can do PIP install gensim.  In Google collab, they provide these inbuilt features, so we don't need to download it explicitly. We need to install another module called python-Levenshtein.

# !pip install gensim
# !pip install python-Levenshtein
import gensim
import pandas as pd
You can also try this code with Online Python Compiler
Run Code

After importing Gensim and Pandas, I am downloading the Amazon product review data set, and these are the product reviews, especially for cell phones accessories categories. You can get the data from here. It has a huge file and has JSON records with all product reviews, and Pandas supports reading JSON files. Now, We are going to create a data frame.

df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True)
You can also try this code with Online Python Compiler
Run Code

Let's print a couple of rows of the data frame.. 

print(df)
You can also try this code with Online Python Compiler
Run Code

Here we can see the columns as reviewer ID, reviewer name and text, etc. We will train a word2vec model for the cell phone accessories using only a review text. The remaining columns are not helpful to us.

Let's know the shape of the data frame.

df.shape
You can also try this code with Online Python Compiler
Run Code

 

(194439, 9)

We have 194 thousand records; that's a lot of data. That kind of data set is enough to train our model.

The first step of training our word2vec model is preprocessing because these texts have things like stop words, and we don't want these. We want to convert this word into lowercase to know everything is lowercase and comparable, then remove the trailing spaces and punctuation marks. All of these things can be done using a function in Gensim. Gensim library has 'utils.py simple preprocess' that will process these texts.

review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
print(review_text)
You can also try this code with Online Python Compiler
Run Code

 

 

We can see it is tokenizing the sentences, meaning that all uppercase letters are converted into lower case letters, and punctuation marks and stop words are removed. By the way, it is not very perfect, and it uses simple heuristic rules for doing this preprocessing. But this is good enough to build our word2vec model. If we want to do the same thing for the entire column, we can do the 'apply' function in the review text. It's going to return to a new Pandas series. 

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)
You can also try this code with Online Python Compiler
Run Code

Now we will initialize the Gensim model. Gensim is an NLP library, and it comes with the word2vec class. Now I am going to create this model using a couple of parameters. The first parameter is a window that is equal to 10. Here 10 means ten words before the target word and ten words after the target word. We can also experiment with size, and there is no fixed rule; we can even make it 5. Another parameter called mean count is that if we have a sentence with only one word that doesn't use that sentence, at least two words need to be present in the sentence to be considered for the training. 'Workers' is the number of CPU threads we want to use between the models. 

Now we need to build a vocabulary. Building a vocabulary means making a unique list of words.. 

model.build_vocab(review_text, progress_per=1000)
You can also try this code with Online Python Compiler
Run Code

To perform the actual training, we will do 'model.train.' It will take a couple of parameters;  review the text, and then real examples that tell the real examples we have. The training may take time depending on the computer's CPU, GPU, etc.  

model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)
You can also try this code with Online Python Compiler
Run Code

I will save the model in a file because we train a model, save it to a file, and then use a pre-trained model in most locations.

model.save("./word2vec-amazon-cell-accessories-reviews-short.model")
You can also try this code with Online Python Compiler
Run Code

There is a product review, and someone says the word 'bad,', and we want to know what the word is similar to bad?

model.wv.most_similar("bad")
You can also try this code with Online Python Compiler
Run Code

                                                                    

We can see these words are similar to bad following the similarity scores. 

FAQs

1. What is Gensim in NLP?

Gensim is an open-source natural language processing library used for unsupervised modelling.

2. What is Gensim word2vec?

Word2vec in Gensim is a widely used NLP algorithm based on a neural network used to extract the notion of related words and the semantic relationship between words.

3. Does the Gensim library support GPU acceleration?

Yes, the Gensim library does support GPU acceleration.

4. Why is Gensim so fast?

Gensim is so fast because of its data access design and numerical processing implementation.

5. Does Gensim use Tensorflow?

No, Gensim does not use Tensorflow. 

Key Takeaways

We discussed the features of Gensim over other NLP tool kits and the implementation of word2vec using Gensim.  If you find it exciting and want to learn more about NLP and Machine Learning, visit here. 

Further readings-

Hyperparameter Tuning and Predicting scores

Restricted Boltzmann Machine on MNIST Dataset

Restaurant Reviews Analysis with NLP

Live masterclass