Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Introduction to Bag of Words Model
3. Modeling with Bag-of-Words
3.1. Without Preprocessing
3.2. With Preprocessing
3.3. Implementation in Python
4. Limitations of Bag-of-Words in NLP
5. Frequently Asked Questions
5.1. What do you mean by the Term Frequency?
5.2. What do you mean by the term Inverse Document Frequency?
5.3. Is it important to calculate the IDF Value?
6. Conclusion
Last Updated: Mar 27, 2024
Easy

Bag of Words in NLP

Author Komal

Introduction

Welcome, Ninjas. This blog looks into a commonly used concept in Natural Language Processing (NLP): the Bag of Words model. But before jumping to the main topic of the blog, let us get started with what NLP is.


Just as the name suggests, it processes natural language. The natural language of whom? Of human beings. NLP refers to the ability of computers to understand human language, spoken or written. It combines computational linguistics with machine learning, deep learning, and statistical models.

Introduction to Bag of Words Model

Bag of Words is a text-modeling technique from Natural Language Processing (NLP). It is widely used for feature extraction from text data.


This technique represents a document by the counts of the words it contains (their term frequencies). We will learn about the limitations of Bag of Words later in the blog; one of them is that all information about the order of words is discarded.

Modeling with Bag-of-Words

Without Preprocessing

Let's first implement a bag of words without preprocessing. Take a look at the following example:

Sentence 1: His best trait is honesty

Sentence 2: Honesty is the best policy

As we are not doing any preprocessing, we keep the stop words and preserve letter case. So, in this case, 'honesty' and 'Honesty' are two different words.

First, we list all the words available in both sentences:

  • His
  • best
  • trait
  • is
  • honesty
  • Honesty
  • the
  • policy

Now, we count the frequency of each word in each sentence:

Word      Sentence 1   Sentence 2
His           1            0
best          1            1
trait         1            0
is            1            1
honesty       1            0
Honesty       0            1
the           0            1
policy        0            1

Hence, the vector for sentence 1 = [1, 1, 1, 1, 1, 0, 0, 0]

And for sentence 2 = [0, 1, 0, 1, 0, 1, 1, 1]
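The scoring above can be sketched in a few lines of plain Python. This is a minimal illustration of the no-preprocessing case, building the vocabulary in order of first appearance:

```python
# Bag-of-words vectors for the two example sentences,
# without any preprocessing (case kept, stop words kept).
s1 = "His best trait is honesty".split()
s2 = "Honesty is the best policy".split()

# Vocabulary: every distinct word, in order of first appearance.
vocab = []
for word in s1 + s2:
    if word not in vocab:
        vocab.append(word)

vec1 = [s1.count(word) for word in vocab]
vec2 = [s2.count(word) for word in vocab]

print(vocab)  # ['His', 'best', 'trait', 'is', 'honesty', 'Honesty', 'the', 'policy']
print(vec1)   # [1, 1, 1, 1, 1, 0, 0, 0]
print(vec2)   # [0, 1, 0, 1, 0, 1, 1, 1]
```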

With Preprocessing 

We first convert all the words to lowercase and remove the stopwords, if any. After that, sentence 1 and sentence 2 become:

Sentence 1: his best trait honesty

Sentence 2: honesty best policy

As we can see, the vocabulary now has only five words. Now, we do the scoring as we did before:

Word      Sentence 1   Sentence 2
his           1            0
best          1            1
trait         1            0
honesty       1            1
policy        0            1

Hence, the vector for sentence 1 = [1, 1, 1, 1, 0]

And for sentence 2 = [0, 1, 0, 1, 1]

Of the two approaches above, the Bag of Words model typically uses the latter: machine-learning datasets are often large, and without preprocessing the vocabulary becomes bloated and hard to interpret.
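The preprocessing step described above can be sketched as follows. Note that the stop-word list here is a small hand-picked set for this example, not a standard list (libraries like NLTK ship much larger ones):

```python
# Minimal preprocessing: lowercase the text and drop stop words.
# STOP_WORDS is a tiny example set chosen for this illustration.
STOP_WORDS = {"is", "the", "a", "an"}

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(preprocess("His best trait is honesty"))   # ['his', 'best', 'trait', 'honesty']
print(preprocess("Honesty is the best policy"))  # ['honesty', 'best', 'policy']
```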

Implementation in Python

Let's analyze the reviews of the series 'Friends'.

Review 1 = 'Friends is a great tv series!'

Review 2 = 'Friends is the best tv series!'

Review 3 = 'Friends is so amazing'


For implementing in Python, first, we import the required libraries:

# pandas for the document-term DataFrame, re for punctuation removal
import pandas as pd
import re

Assign values to d1, d2, and d3:

d1 = 'Friends is a great tv series!'
d2 = 'Friends is the best tv series!'
d3 = 'Friends is so amazing'

Then, we remove the punctuation, lowercase the text, and split each string into a list of words.

l_d1 = re.sub(r"[^a-zA-Z0-9]", " ", d1.lower()).split()
l_d2 = re.sub(r"[^a-zA-Z0-9]", " ", d2.lower()).split()
l_d3 = re.sub(r"[^a-zA-Z0-9]", " ", d3.lower()).split()

Now, we build the vocabulary shared by all documents and count the term frequencies in each one:

# Shared vocabulary: the union of words from all three reviews
wordset = set(l_d1) | set(l_d2) | set(l_d3)

def calcBOW(wordset, l_doc):
    # Start every vocabulary word at 0, then count its occurrences in the doc
    tf_diz = dict.fromkeys(wordset, 0)
    for word in l_doc:
        tf_diz[word] = l_doc.count(word)
    return tf_diz

bow1 = calcBOW(wordset, l_d1)
bow2 = calcBOW(wordset, l_d2)
bow3 = calcBOW(wordset, l_d3)
df_bow = pd.DataFrame([bow1, bow2, bow3]).fillna(0)
df_bow

To compare the results, we use scikit-learn's CountVectorizer. We first instantiate a CountVectorizer object, then call fit_transform to obtain the document-term matrix, which holds the frequency of each word in each document.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([d1, d2, d3])
# get_feature_names() was removed in newer scikit-learn versions;
# get_feature_names_out() is the current equivalent
print(vectorizer.get_feature_names_out())
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

Limitations of Bag-of-Words in NLP

Although the Bag of Words model is simple, easy to implement, and flexible enough to customize for most text data, it still has some drawbacks:

  • For large documents, the Bag of Words method produces vectors of very high dimension, most of whose entries are zeros (sparse vectors).
  • The Bag of Words model often fails to capture the meaning of the data. For example, the two sentences "I like vanilla and hate chocolate" and "I like chocolate and hate vanilla" contain exactly the same words but have opposite meanings, yet Bag of Words gives them identical vectorized representations.
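The second limitation is easy to verify: counting the words of both sentences gives exactly the same bag, so their representations are indistinguishable.

```python
from collections import Counter

# Both sentences contain exactly the same words, so their
# bag-of-words representations (word -> count) are identical,
# even though their meanings are opposite.
s1 = "I like vanilla and hate chocolate"
s2 = "I like chocolate and hate vanilla"

bow1 = Counter(s1.lower().split())
bow2 = Counter(s2.lower().split())
print(bow1 == bow2)  # True
```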

Frequently Asked Questions

What do you mean by the Term Frequency?

Term Frequency (TF) measures how often a term appears in a document, usually normalized by the total number of terms in that document.

What do you mean by the term Inverse Document Frequency?

Inverse Document Frequency (IDF) measures how important a term is across a collection of documents: terms that appear in many documents get a low IDF, while rare terms get a high IDF.

Is it important to calculate the IDF Value?

Yes, it's important to calculate the IDF value, because TF alone treats very common words (such as 'the' or 'is') as important simply because they are frequent; IDF down-weights such words so that rarer, more informative terms stand out.
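The FAQ answers above can be tied together in a small sketch, using the common definitions tf = (count of term in doc) / (terms in doc) and idf = log(total docs / docs containing the term); other normalizations exist, so treat this as one illustrative variant:

```python
import math

# A minimal TF-IDF sketch using the 'Friends' reviews from earlier.
docs = [d.lower().split() for d in
        ["Friends is a great tv series",
         "Friends is the best tv series",
         "Friends is so amazing"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)                  # term frequency in this doc
    df = sum(1 for d in docs if term in d)           # docs containing the term
    idf = math.log(len(docs) / df)                   # inverse document frequency
    return tf * idf

# 'friends' appears in every review, so its IDF (and TF-IDF) is 0,
# even though its TF is high -- exactly why TF alone is not enough.
print(tf_idf("friends", docs[0], docs))     # 0.0
print(tf_idf("great", docs[0], docs) > 0)   # True
```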

Conclusion

We hope this blog successfully helped you understand the concept of Bag of Words in Natural Language Processing and how it can be implemented easily in Python.

If you found this blog interesting and insightful, refer to similar blogs:

ARIMA model for time series analysis

Refer to the Basics of C++ with Data Structure, DBMS, and Operating System by Coding Ninjas, and keep practicing on our platform Coding Ninjas Studio. You can check out the mock test series on Code Studio.

You can also refer to our Guided Path on Coding Ninjas Studio to upskill yourself in domains like Data Structures and Algorithms, Competitive Programming, Aptitude, and many more! Refer to the interview bundle if you want to prepare for placement interviews. Check out interview experiences to understand various companies' interview questions.

Give your career an edge over others by considering our premium courses!

Happy Learning!
