Last Updated: Mar 27, 2024

Word Embedding with Gensim

Author Rajkeshav

Introduction

There are several well-known Natural Language Processing libraries, such as NLTK and spaCy. They are well established and have their own merits, but Gensim has some unique features.

Gensim is fast because its code is highly optimised and parallelised, and most of its core routines are implemented in C. Gensim can also process arbitrarily large corpora using data-streaming algorithms, so the whole corpus never has to fit in memory. Gensim is platform independent: it runs on Linux, Windows, macOS, and any other platform that supports Python and NumPy. In this article, we will use the Gensim library to build a word2vec model. Let's get started with the implementation.
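
The streaming idea can be sketched in plain Python. The class below is a minimal illustration, not Gensim's own code: 'ReviewStream' is a hypothetical name, the regex tokenizer is a simple stand-in for real preprocessing, and the two records are made up. It reads a JSON Lines file one record at a time, so only one review is held in memory at once.

```python
import json
import re
import tempfile
from pathlib import Path


class ReviewStream:
    """Restartable iterable that yields one tokenized review at a time."""

    def __init__(self, path, field="reviewText"):
        self.path = path
        self.field = field

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                text = record.get(self.field, "")
                # Stand-in tokenizer: lowercase runs of ASCII letters only
                yield re.findall(r"[a-z]+", text.lower())


# Tiny made-up file so the sketch runs without the real data set
demo_path = Path(tempfile.mkdtemp()) / "reviews.jsonl"
demo_path.write_text(
    '{"reviewText": "Great case, fits well"}\n'
    '{"reviewText": "Bad charger. Stopped working!"}\n',
    encoding="utf-8",
)

stream = ReviewStream(demo_path)
for tokens in stream:
    print(tokens)
```

Because '__iter__' re-opens the file, the object can be iterated more than once, which matters for training procedures that make several passes over the corpus.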

Implementation

We will use Amazon product reviews for cell phone accessories to build a word2vec model with the Gensim library. Gensim is an NLP library for Python, and it is straightforward to use. Its syntax is brief compared to TensorFlow, so we will use it to train our model.

Let's first install the Gensim library with 'pip install gensim'. Google Colab usually ships with Gensim preinstalled, so we may not need to install it explicitly there. We can also install python-Levenshtein, an optional module that some Gensim versions use to speed up certain similarity computations.

# !pip install gensim
# !pip install python-Levenshtein
import gensim
import pandas as pd

After importing Gensim and Pandas, we load the Amazon product review data set, which contains reviews for the cell phones and accessories category. The data set is available for download as a JSON file. It is a large file with one JSON record per product review, and Pandas supports reading such files. Now we are going to create a data frame.

df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True)
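
As a small illustration of what 'lines=True' does, the sketch below parses two made-up records from an in-memory string instead of the real file. The column values are invented for the example; only the 'reviewText' column name is taken from the code used later in this article.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line
raw = (
    '{"reviewerID": "A1", "reviewText": "Great case", "overall": 5}\n'
    '{"reviewerID": "B2", "reviewText": "Bad charger", "overall": 1}\n'
)

# lines=True tells pandas to parse each line as a separate record
sample_df = pd.read_json(io.StringIO(raw), lines=True)

print(sample_df.shape)
print(list(sample_df.columns))
```

Each line becomes one row, and each key becomes one column.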

Let's print the first few rows of the data frame.

print(df.head())

Here we can see columns such as reviewerID, reviewerName, and reviewText. We will train the word2vec model for cell phone accessories using only the review text; the remaining columns are not needed.

Let's check the shape of the data frame.

df.shape

(194439, 9)

We have about 194 thousand records and 9 columns. That is enough data to train our model.

The first step in training our word2vec model is preprocessing, because the raw texts contain mixed case, punctuation, and extra whitespace. We want to convert every word to lowercase so that tokens are comparable, and strip punctuation marks and surrounding spaces. Gensim provides a function for this: 'gensim.utils.simple_preprocess' tokenizes each text, lowercases it, and discards punctuation as well as very short and very long tokens. Note that it does not remove stop words by itself.

review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
print(review_text)

Calling 'apply' on the review text column runs the function on every review and returns a new Pandas Series in which each review is a list of tokens. We can see that the sentences are tokenized: all uppercase letters are converted to lowercase, and punctuation marks are removed. Common stop words such as 'the' or 'is' are still present, since the function only drops tokens by length. It is not perfect, because it relies on simple heuristic rules, but it is good enough to build our word2vec model.

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

Now we initialize the Gensim model. Gensim comes with a Word2Vec class, and we create the model with a few parameters. The first parameter, 'window', is set to 10, meaning that up to ten words before and ten words after the target word are used as context. There is no fixed rule for the window size, so we can experiment with it; we could even use 5. The next parameter, 'min_count', makes the model ignore every word that occurs fewer than that many times in the whole corpus; with min_count=2, a word must appear at least twice to be included in training. 'workers' is the number of CPU threads used to train the model.

Now we need to build a vocabulary. Building a vocabulary means making a unique list of words.
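
The two parameters can be illustrated with a plain-Python sketch. This is a simplified picture, not Gensim's implementation: 'context_words' and 'filter_vocab' are hypothetical helper names, the sentences are made up, and the window shown is the maximum one (Gensim may use a smaller effective window for individual words during training).

```python
from collections import Counter


def context_words(tokens, position, window):
    """Return the tokens within `window` positions of tokens[position]."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def filter_vocab(sentences, min_count):
    """Keep words whose total count across all sentences is >= min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


sentences = [
    ["the", "case", "fits", "the", "phone", "well"],
    ["the", "charger", "fits", "badly"],
]

# Context of "fits" (position 2) in the first sentence with window=2
print(context_words(sentences[0], position=2, window=2))
# ['the', 'case', 'the', 'phone']

# Words that appear at least twice in the whole corpus
print(sorted(filter_vocab(sentences, min_count=2)))
# ['fits', 'the']
```

Note that the count is taken over the whole corpus, not per sentence: 'fits' is kept because it appears once in each of the two sentences.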

model.build_vocab(review_text, progress_per=1000)

To perform the actual training, we call 'model.train'. It takes a few parameters: the review text, 'total_examples', which tells the model how many sentences (here, reviews) the corpus contains, and 'epochs', the number of passes over the corpus. Training may take a while, depending on the computer's CPU.

model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

I will save the model to a file, because in most workflows we train a model once, save it, and later load the pre-trained model wherever it is needed.

model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

Suppose a product review contains the word 'bad', and we want to know which words are most similar to it.

model.wv.most_similar("bad")

The output lists the words most similar to 'bad', each with its similarity score.
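
The similarity score is a cosine similarity between word vectors. The sketch below shows the idea in plain Python with made-up three-dimensional vectors; 'cosine_similarity' and 'rank_neighbours' are hypothetical helpers written for this illustration, not Gensim functions, and real word2vec vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors, chosen only to illustrate the ranking
vectors = {
    "bad": [1.0, 0.0, 0.0],
    "terrible": [0.9, 0.1, 0.0],
    "great": [-1.0, 0.2, 0.0],
}

ranking = rank_neighbours("bad", vectors)
for word, score in ranking:
    print(word, round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, and a negative score means they point in roughly opposite directions.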

FAQs

1. What is Gensim in NLP?

Gensim is an open-source Python library for natural language processing, used mainly for unsupervised topic modelling and word embeddings.

2. What is Gensim word2vec?

Word2vec is a widely used neural-network-based algorithm that learns vector representations of words, capturing related words and the semantic relationships between them. Gensim provides an implementation of it in its Word2Vec class.

3. Does the Gensim library support GPU acceleration?

No, Gensim does not provide GPU acceleration. Its training routines run on the CPU, using optimised C code and multiple worker threads.

4. Why is Gensim so fast?

Gensim is fast because of its streamed data access design and because its numerical routines are implemented in optimised, parallelised C code.

5. Does Gensim use TensorFlow?

No, Gensim does not use TensorFlow.

Key Takeaways

We discussed the features that set Gensim apart from other NLP toolkits and implemented a word2vec model using Gensim on Amazon product reviews.

