TF-IDF Vectorizer
Term frequency-inverse document frequency (TF-IDF) converts text into a usable vector. It combines two ideas: Term Frequency (TF) and Document Frequency (DF). Term frequency is the number of times a term appears in a document. A Count Vectorizer, by contrast, converts each sentence into a vector of raw counts, but it does not consider a word's relevance across the entire corpus of documents.
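To see that limitation, here is a minimal sketch using scikit-learn's CountVectorizer (an illustrative choice assuming a recent scikit-learn is installed; the toy corpus is made up for this example):

from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: 'the' dominates the counts even though it carries little meaning.
corpus = ['the cat sat on the mat', 'the dog sat on the log']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # raw counts only; no weighting by corpus-wide relevance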
The TF-IDF is made up of two parts:
- Term Frequency (TF) is the number of times a word appears in a given document (here, a sentence).
- Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents in which the word appears, as in the formula below.
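Putting the two parts together (this is one common variant; libraries such as scikit-learn add smoothing terms):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t. A rare word therefore gets a large IDF boost, while a word that appears in every document gets log(1) = 0 and is weighted down to nothing.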
It can be broken down into three steps:
- Find the unique terms in the entire text data.
- For each sentence, create an array of zeros whose length equals the number of unique terms.
- Calculate the TF-IDF value for each word in each sentence and update the matching position in that sentence's vector (see the sketch after this list).
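Rather than coding these steps by hand, the whole pipeline can be sketched with scikit-learn's TfidfVectorizer (an assumption of this example; the sentences are reused from the word2vec section below):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'He is playing in the field.',
    'He is running towards the football.',
    'The football game ended.'
]
vectorizer = TfidfVectorizer()

# fit_transform performs all three steps: it builds the vocabulary of unique
# terms, allocates one vector per sentence, and fills in the TF-IDF weights.
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one row per sentence, one column per term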
Hashing Vectorizer
The hashing vectorizer maps token strings to feature integer indices using a hashing function. It converts a collection of text documents into a sparse matrix that holds the token occurrence counts. This vectorizer is useful because any word can be turned into its hash on the fly, so no vocabulary needs to be generated. The procedure is:
- Define the size of the vector to be created for each sentence
- Apply the hashing algorithm to the sentence
- Repeat step 2 for all sentences, as shown in the sketch below.
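A minimal sketch of these three steps with scikit-learn's HashingVectorizer (again an illustrative choice, not this article's own code; the vector size of 16 is arbitrary):

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'He is playing in the field.',
    'The football game ended.'
]

# Step 1: fix the size of the vector created for each sentence.
vectorizer = HashingVectorizer(n_features=16)

# Steps 2 and 3: hash every token of every sentence. No fit is needed
# because the vectorizer is stateless and builds no vocabulary.
hashed = vectorizer.transform(corpus)

print(hashed.shape)  # (2, 16)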
Word2Vec
The word2vec algorithm learns word associations from a massive corpus of text using a neural network model, producing a representation of words in vector space. These models are highly effective at capturing the context and relationships between words: related words are clustered together in the vector space, while dissimilar words lie farther apart.
This family contains two models:
- Continuous Bag of Words (CBOW): The neural network takes the surrounding context words as input and attempts to predict the target word.
- Skip-gram: The neural network takes a word as input and attempts to predict the words that surround it.
The neural network contains one input layer, one hidden layer, and one output layer to train on data and create vectors.
CODE
I'll use the gensim package to create the word2vec model, which offers numerous features such as picking the odd one out, finding the most similar words, and so on. It does not, however, lowercase or tokenize the sentences, so I do that first. The tokenized sentences are then passed to the model. I've set the vector size to 2, the window to 3 (which specifies how far around a word to look), and sg = 0 to use the CBOW model.
from gensim.models import word2vec

sntncs = [
    'He is playing in the field.',
    'He is running towards the football.',
    'The football game ended.',
    'It started raining while everyone was playing in the field.'
]

# Lowercase and tokenize each sentence, stripping the trailing period.
for i, sntnc in enumerate(sntncs):
    tknzd = []
    for wrd in sntnc.split(' '):
        wrd = wrd.split('.')[0]
        wrd = wrd.lower()
        tknzd.append(wrd)
    sntncs[i] = tknzd

# Train a CBOW model (sg = 0) with 2-dimensional vectors and a window of 3.
model = word2vec.Word2Vec(sntncs, workers=1, vector_size=2, min_count=1, window=3, sg=0)

smlar_word = model.wv.most_similar('football')[0]
print("Most common word to football is: {}".format(smlar_word[0]))

OUTPUT
# Most common word to football is: game
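As a follow-up, gensim also exposes the odd-one-out feature mentioned above, and setting sg = 1 trains a skip-gram model instead of CBOW. A short sketch continuing from the model above (with such a tiny corpus and 2-dimensional vectors, the outputs are not stable, so treat the results as illustrative):

# Pick the word least similar to the others in the list.
odd = model.wv.doesnt_match(['football', 'game', 'raining'])
print("Odd one out: {}".format(odd))

# The same data trained as a skip-gram model (sg = 1) instead of CBOW.
sg_model = word2vec.Word2Vec(sntncs, workers=1, vector_size=2, min_count=1, window=3, sg=1)
print(sg_model.wv.most_similar('football')[0])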

FAQs
1. What is a vector of words?
A word vector is a row of real-valued numbers (as opposed to binary dummy indicators). Each dimension captures an aspect of the word's meaning, and semantically similar words have similar vectors.
2. What is a word embedding example?
Word embeddings group words with similar meanings together in vector space; for example, in a trained embedding, 'king' and 'queen' lie close to each other.
3. What is word representation?
Word representation, which aims to represent a word with a vector, is a crucial task in NLP.
4. What is the word embedding model?
A word embedding is a learned representation of text in which words with related meanings are represented similarly. This way of expressing terms and documents may be one of the significant achievements of deep learning on challenging natural language processing problems.
Key Takeaways
In this article, we have extensively discussed the Vectorial Representation of Words.
Isn't Machine Learning exciting! We hope that this blog has helped you enhance your knowledge regarding the Vectorial Representation of Words, and if you would like to learn more, check out our articles on MACHINE LEARNING COURSE. Do upvote our blog to help other ninjas grow. Happy Coding!