TF-IDF Vectorizer
Term frequency-inverse document frequency (TF-IDF) converts text into a usable vector. It combines two ideas: Term Frequency (TF) and Document Frequency (DF). Term frequency is the number of times a term appears in a document. A Count Vectorizer, by contrast, converts each sentence into a vector of raw counts, but it does not consider a word's relevance across the entire corpus of documents.
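To see that limitation, here is a minimal sketch using scikit-learn's CountVectorizer (an illustrative choice assuming a recent scikit-learn is installed; the toy corpus is made up for this example):

from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: 'the' dominates the counts even though it carries little meaning.
corpus = ['the cat sat on the mat', 'the dog sat on the log']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # raw counts only; no weighting by corpus-wide relevance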
The TF-IDF is made up of two parts:
- Term Frequency (TF) is the number of times a word appears in a given document (here, a sentence).
- Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents in which the word appears, as in the formula below.
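Putting the two parts together (this is one common variant; libraries such as scikit-learn add smoothing terms):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t. A rare word therefore gets a large IDF boost, while a word that appears in every document gets log(1) = 0 and is weighted down to nothing.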
It can be broken down into three steps:
- Find the unique terms in the entire text data.
- For each sentence, create an array of zeros whose length equals the number of unique terms.
- Calculate the TF-IDF value for each word in each sentence and update the matching position in that sentence's vector (see the sketch after this list).
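Rather than coding these steps by hand, the whole pipeline can be sketched with scikit-learn's TfidfVectorizer (an assumption of this example; the sentences are reused from the word2vec section below):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'He is playing in the field.',
    'He is running towards the football.',
    'The football game ended.'
]
vectorizer = TfidfVectorizer()

# fit_transform performs all three steps: it builds the vocabulary of unique
# terms, allocates one vector per sentence, and fills in the TF-IDF weights.
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one row per sentence, one column per term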
Hashing Vectorizer
The hashing vectorizer maps token strings to feature integer indices using a hashing function. It converts a collection of text documents into a sparse matrix that holds the token occurrence counts. This vectorizer is useful because any word can be turned into its hash on the fly, so no vocabulary needs to be generated. The procedure is:
- Define the size of the vector to be created for each sentence
- Apply the hashing algorithm to the sentence
- Repeat step 2 for all sentences, as shown in the sketch below.
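A minimal sketch of these three steps with scikit-learn's HashingVectorizer (again an illustrative choice, not this article's own code; the vector size of 16 is arbitrary):

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'He is playing in the field.',
    'The football game ended.'
]

# Step 1: fix the size of the vector created for each sentence.
vectorizer = HashingVectorizer(n_features=16)

# Steps 2 and 3: hash every token of every sentence. No fit is needed
# because the vectorizer is stateless and builds no vocabulary.
hashed = vectorizer.transform(corpus)

print(hashed.shape)  # (2, 16)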
Word2Vec
The word2vec algorithm learns word associations from a massive corpus of text using a neural network model, producing a representation of words in vector space. These models are highly effective at capturing the context and relationships between words: related words are clustered together in the vector space, while dissimilar words lie farther apart.
This family contains two models:
- Continuous Bag of Words (CBOW): The neural network takes the surrounding context words as input and attempts to predict the target word.
- Skip-gram: The neural network takes a word as input and attempts to predict the words that surround it.
The neural network contains one input layer, one hidden layer, and one output layer to train on data and create vectors.
CODE
I'll use the gensim package to create the word2vec model, which offers numerous features such as picking the odd one out, finding the most similar words, and so on. It does not, however, lowercase or tokenize the sentences, so I do that first. The tokenized sentences are then passed to the model. I've set the vector size to 2, the window to 3 (which specifies how far around a word to look), and sg = 0 to use the CBOW model.
from gensim.models import word2vec

sntncs = [
    'He is playing in the field.',
    'He is running towards the football.',
    'The football game ended.',
    'It started raining while everyone was playing in the field.'
]

# Lowercase and tokenize each sentence, stripping the trailing period.
for i, sntnc in enumerate(sntncs):
    tknzd = []
    for wrd in sntnc.split(' '):
        wrd = wrd.split('.')[0]
        wrd = wrd.lower()
        tknzd.append(wrd)
    sntncs[i] = tknzd

# Train a CBOW model (sg = 0) with 2-dimensional vectors and a window of 3.
model = word2vec.Word2Vec(sntncs, workers=1, vector_size=2, min_count=1, window=3, sg=0)

smlar_word = model.wv.most_similar('football')[0]
print("Most common word to football is: {}".format(smlar_word[0]))

OUTPUT
# Most common word to football is: game
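As a follow-up, gensim also exposes the odd-one-out feature mentioned above, and setting sg = 1 trains a skip-gram model instead of CBOW. A short sketch continuing from the model above (with such a tiny corpus and 2-dimensional vectors, the outputs are not stable, so treat the results as illustrative):

# Pick the word least similar to the others in the list.
odd = model.wv.doesnt_match(['football', 'game', 'raining'])
print("Odd one out: {}".format(odd))

# The same data trained as a skip-gram model (sg = 1) instead of CBOW.
sg_model = word2vec.Word2Vec(sntncs, workers=1, vector_size=2, min_count=1, window=3, sg=1)
print(sg_model.wv.most_similar('football')[0])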

FAQs
1. What is a vector of words?
A word vector is a row of real-valued numbers (as opposed to binary dummy indicators). Each dimension captures an aspect of the word's meaning, and semantically similar words have similar vectors.
2. What is a word embedding example?
Word embeddings group words with similar meanings together in vector space; for example, in a trained embedding, 'king' and 'queen' lie close to each other.
3. What is word representation?
Word representation, which aims to represent a word with a vector, is a crucial task in NLP.
4. What is the word embedding model?
A word embedding is a learned representation of text in which words with related meanings are represented similarly. This way of expressing terms and documents may be one of the significant achievements of deep learning on challenging natural language processing problems.
Key Takeaways
In this article, we have extensively discussed the Vectorial Representation of Words.
Isn't Machine Learning exciting! We hope that this blog has helped you enhance your knowledge regarding the Vectorial Representation of Words, and if you would like to learn more, check out our articles on MACHINE LEARNING COURSE. Do upvote our blog to help other ninjas grow. Happy Coding!