What are the drawbacks of Word2Vec?

Word2Vec struggles with out-of-vocabulary (OOV) words, that is, words that were not in the training vocabulary. An OOV word is assigned a random vector representation, which can be a poor fit. Word2Vec also relies only on the local information around each word.

What can you do with Word2Vec?

The Word2Vec model is used to extract concepts like semantic relatedness, synonym recognition, concept classification, selectional preferences, and analogies from words or items. A Word2Vec model discovers meaningful relationships and converts them into vector similarities.

Why is Word2Vec so important?

The purpose of Word2Vec is to group the vectors of similar words together in vector space. In other words, it uses math to find similarities. Word2Vec generates vectors that are distributed numerical representations of word features, such as the context of individual words.

What exactly is the Skip-gram model?

Skip-gram is an unsupervised learning technique for finding the words most related to a given word. It predicts the context words for a target word, which makes it the reverse of the CBOW algorithm: the target word is the input, and the context words are the output.

How do you go about putting Word2Vec into action?

There are two flavors of Word2Vec to choose from: Continuous Bag-of-Words (CBOW) and continuous Skip-gram (SG). In a nutshell, CBOW tries to guess the output (target word) from its surrounding words (context words), whereas continuous Skip-gram tries to guess the context words from the target word.

Table of contents

1. Introduction
2. Need For Word2Vec
3. Word2Vec Model
4. General Algorithm
5. Working Of Word2Vec model
5.1. CBOW Model
5.1.1. Steps
5.1.2. Advantages
5.1.3. Disadvantages
5.2. Skip Gram Model
5.2.1. Steps
5.2.2. Advantages
6. Implementation
7. FAQs
8. Key Takeaways

Last Updated: Mar 27, 2024

Word2Vec

Author Mayank Goyal

Introduction

Word embedding is one of the most useful ways to represent document vocabulary. It can capture a word's context in a document, its semantic and syntactic similarity to other words, and its relationships with other words, among other things.

The term "word embedding" refers to the representation of words as vectors. Word embedding's primary purpose is to convert the high-dimensional feature space into low-dimensional feature vectors while keeping the corpus's contextual similarity.

These models are widely used across NLP problems. They first build a vocabulary from a training corpus and then learn the word embedding representations. Put another way, these models take a text corpus as input and output word vectors.

The resulting vectors can be used as feature vectors in a machine learning model. They can also be used to measure text similarity with cosine similarity, to cluster words, and to classify text, all of which will be covered in the next part of this series.

Need For Word2Vec

Consider the following sentences: "Have a good day" and "Have a great day". They hardly differ in meaning. If we built an exhaustive vocabulary (let's call it V), it would be V = {Have, a, good, great, day}.

Let us now create a one-hot encoded vector for each word in V. The length of each one-hot encoded vector equals the size of V (= 5). Each vector is all zeros except for the element at the index of the corresponding word in the vocabulary, which is one. The encodings below illustrate this.

Have = [1,0,0,0,0]; a = [0,1,0,0,0]; good = [0,0,1,0,0]; great = [0,0,0,1,0]; day = [0,0,0,0,1]

If we try to visualize these encodings, we can think of a five-dimensional space in which each word occupies one dimension and has nothing to do with the others (no projection along the other dimensions). This means that 'good' and 'great' are as different from each other as 'day' and 'have', which is not true.

Our goal is for words with similar contexts to be clustered together in space. The cosine of the angle formed by such vectors should be close to 1, i.e., the angle should be close to 0.
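To make the problem concrete, here is a minimal sketch that builds one-hot vectors for the five-word vocabulary and measures the cosine between them. The helper names one_hot and cosine are made up for this illustration and do not come from any library.

```python
import math

vocab = ["Have", "a", "good", "great", "day"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

good = one_hot("good", vocab)
great = one_hot("great", vocab)
day = one_hot("day", vocab)

# Distinct one-hot vectors are orthogonal, so every pair of different
# words gets the same similarity regardless of meaning.
sim_good_great = cosine(good, great)
sim_good_day = cosine(good, day)
sim_good_good = cosine(good, good)
```

Both sim_good_great and sim_good_day come out as 0, so one-hot encoding cannot express that 'good' is closer to 'great' than to 'day'.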

This is where the idea of distributed representations comes in. Intuitively, we introduce some dependence of one word on the other words. The words in the context of a given word get a greater share of this dependence. In a one-hot encoding, by contrast, all words are independent of each other, as noted above.

Word2Vec Model

Word2Vec generates word vectors, which are distributed numerical representations of word features. These features could be, for example, the words that make up the context of individual words in our vocabulary. Through the produced vectors, word embeddings help establish the association of a word with other words of similar meaning.

As the graphic below demonstrates, words with similar meanings lie closer together in space when word embeddings are plotted, which suggests semantic similarity.

[Figure: word embeddings plotted in space]

These models use context. This means that they look at neighboring words to learn the embedding; if a group of words is always found close to the same words, their embeddings will end up similar.

To decide which words count as close to one another, we must first define the window size, which determines which neighboring words we select.
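As a small illustration of what the window size controls, the sketch below lists the neighbors selected for each word. The function name context_windows is hypothetical, and window_size is taken here to mean the number of words on each side of the target.

```python
def context_windows(tokens, window_size):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]     # up to window_size words before
        right = tokens[i + 1:i + 1 + window_size]    # up to window_size words after
        yield target, left + right

tokens = "have a great day".split()
pairs = list(context_windows(tokens, window_size=1))
# pairs[2] is ("great", ["a", "day"])
```

A larger window pulls in more distant neighbors, while words at the edges of a sentence simply get fewer context words.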

Skip-Gram and Continuous Bag of Words (CBOW) are the two distinct architectures that Word2Vec can use to build word embeddings.

General Algorithm

Step-1: Initially, we assign a vector of random numbers to each word in the corpus.
Step-2: Then, we iterate through each word of the document, grab the vectors of the nearest n words on either side of our target word, concatenate these vectors, forward propagate the concatenated vector through a linear layer followed by a softmax function, and try to predict what our target word was.
Step-3: In this step, we compute the error between our estimate and the actual target word, backpropagate the error, and update the weights of the linear layer as well as the vectors (embeddings) of the neighboring words.
Step-4: Finally, we extract the weights from the hidden layer and use them to encode the meaning of the words in the vocabulary.
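The four steps can be sketched in plain Python as follows. This is only an illustrative toy under stated assumptions, not the reference Word2Vec implementation: the corpus, the embedding dimension, the learning rate, and all names (emb, W, train_step, epoch) are invented for the example.

```python
import math
import random

random.seed(0)
tokens = "have a good day have a great day".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
V, dim, n, lr = len(vocab), 4, 1, 0.1  # assumed toy hyperparameters

# Step 1: one random vector per word.
emb = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
# Linear layer mapping the concatenated context (2 * n * dim values) to V scores.
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * n * dim)] for _ in range(V)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(context_ids, target_id):
    """One forward and backward pass; returns the loss before the update."""
    # Step 2: concatenate the context vectors, apply linear layer + softmax.
    x = [value for c in context_ids for value in emb[c]]
    probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
    loss = -math.log(probs[target_id])
    # Step 3: the gradient w.r.t. the scores is (probs - one_hot(target)).
    d_scores = [p - (1.0 if k == target_id else 0.0) for k, p in enumerate(probs)]
    d_x = [sum(d_scores[k] * W[k][j] for k in range(V)) for j in range(len(x))]
    for k in range(V):
        for j in range(len(x)):
            W[k][j] -= lr * d_scores[k] * x[j]
    for pos, c in enumerate(context_ids):
        for j in range(dim):
            emb[c][j] -= lr * d_x[pos * dim + j]
    return loss

def epoch():
    """Run one pass over every position that has a full context window."""
    losses = []
    for i in range(n, len(tokens) - n):
        context = [index[tokens[j]] for j in range(i - n, i + n + 1) if j != i]
        losses.append(train_step(context, index[tokens[i]]))
    return sum(losses) / len(losses)

first_loss = epoch()
for _ in range(200):
    last_loss = epoch()

# Step 4: the learned vectors serve as the word representations.
vector_for_good = emb[index["good"]]
```

In this sketch the embedding table plays the role of the hidden-layer weights mentioned in Step-4, and the average loss drops as the vectors and the linear layer are updated.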

Rather than being a single method, the Word2Vec model is made up of two techniques:

Continuous Bag of Words (CBOW) and Skip-Gram.

Both are shallow neural networks that map one or more words to a target variable, which is itself one or more words. Both techniques learn weights that act as the word vector representations, and either one can be used to implement word embedding with Word2Vec.

Working Of Word2Vec model

Continuous Bag of Words (CBOW) and Skip-Gram are the two model architectures that Word2Vec can employ to build word embeddings.

CBOW Model

Even though Word2Vec is an unsupervised model that can construct dense word embeddings from a corpus without any label information, Word2Vec internally uses a supervised classification model to extract these embeddings from the corpus.

The CBOW architecture is a classification model that uses context words as input (X) to predict our target word, Y. Consider the following sentence: "Have a great day".

Let the word "great" be the input to the neural network. Note that we are trying to predict a target word (day) from a single context input word, great. More specifically, we feed in the one-hot encoding of the input word and measure the output error against the one-hot encoding of the target word (day). In the process of predicting the target word, we learn its vector representation.

[Figure: CBOW model with a single context word as input]

Steps

The model's operation is described in the steps below:

The context words are first passed as input to an embedding layer.
The word embeddings are then passed to a lambda layer, where they are averaged.
The averaged embedding is then passed to a dense softmax layer, which predicts our target word. We compare the prediction with the actual target word, compute the loss, and run backpropagation in each epoch to update the embedding layer.

Once the training is complete, we may extract the embeddings of the required words from our embedding layer.
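The forward pass through these layers can be sketched as below. This is a plain Python approximation under assumptions, not the article's implementation: the tiny vocabulary, the dimension of 3, and the names embedding, dense, and cbow_forward are all illustrative, and the weights are random rather than trained.

```python
import math
import random

random.seed(1)
vocab = ["have", "a", "great", "day"]
index = {w: i for i, w in enumerate(vocab)}
V, dim = len(vocab), 3  # assumed toy sizes

# Embedding layer: one row per vocabulary word.
embedding = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
# Dense layer: one row of weights per output word.
dense = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]

def cbow_forward(context_words):
    """Embedding lookup, then averaging, then a dense softmax over the vocabulary."""
    vectors = [embedding[index[w]] for w in context_words]
    # Averaging step (the role of the lambda layer in the steps above).
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in dense]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["a", "day"])         # context around the target "great"
loss = -math.log(probs[index["great"]])    # cross-entropy against the target
```

Training would backpropagate this loss into both the dense layer and the embedding rows of the context words; after training, the rows of embedding are the word vectors.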

Advantages

CBOW has the following advantages:

It is generally thought to outperform deterministic approaches because of its probabilistic nature.
It does not require a large amount of RAM, so its memory footprint is low.

Disadvantages

CBOW has the following drawbacks:

It averages over the contexts of a word. Consider the word apple, which can refer to both a fruit and a company: CBOW averages the two meanings and places the word in a cluster between fruits and companies.
Training a CBOW model from scratch can take a long time if it is not optimized properly.

So far, we've seen how context words are used to construct word representations. However, there is another way to achieve the same result: we can use the target word (whose representation we wish to build) to predict the context and generate the representations in the process. Another variant, known as the Skip Gram model, does this.

Skip Gram Model

In the skip-gram model, the context words are predicted given a target (center) word. Consider the following sentence: "Word2Vec uses a deep learning model in the backend." Given the center word 'learning' and a context window of size 2 (one word on each side), the model tries to predict ['deep', 'model'], and so on.

Because the model has to predict several words from a single given word, we feed the skip-gram model pairs of (X, Y), where X is our input and Y is our label. This is accomplished by creating positive and negative input samples.

These samples tell the model which words are contextually relevant, so that it builds similar embeddings for words with similar meanings. This looks like a multiple-context CBOW model that has been flipped, and to a degree, that is correct.
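The positive and negative samples described above could be generated along these lines. This is a hedged sketch: skipgram_samples is a hypothetical helper, window_size counts words on each side of the target, and negatives are drawn uniformly from the rest of the vocabulary for simplicity, which is an assumption of this example rather than something the text specifies.

```python
import random

def skipgram_samples(tokens, window_size, negatives_per_positive, rng):
    """Return ((target, word), label) pairs: 1 for real context words, 0 for random ones."""
    vocab = sorted(set(tokens))
    samples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Positive samples: words that really occur inside the window.
        for word in context:
            samples.append(((target, word), 1))
        # Negative samples: random words that are not in the window.
        candidates = [w for w in vocab if w != target and w not in context]
        k = min(negatives_per_positive * len(context), len(candidates))
        for word in rng.sample(candidates, k):
            samples.append(((target, word), 0))
    return samples

tokens = "word2vec uses a deep learning model in the backend".split()
samples = skipgram_samples(tokens, window_size=1, negatives_per_positive=1,
                           rng=random.Random(0))
learning_positives = [pair for pair, label in samples
                      if label == 1 and pair[0] == "learning"]
# learning_positives is [("learning", "deep"), ("learning", "model")]
```

Each positive pair is matched by a negative pair here; the ratio is a tunable choice in this sketch.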

The target word is fed into the network, and the model outputs C probability distributions. What does this mean?

For each of the C context positions, we get a probability distribution over the V words of the vocabulary, with one probability per word.

[Figure: skip-gram model architecture]

Steps

The model's operation is described in the steps below:

The target word and the context word of each pair are passed to separate embedding layers, yielding a dense word embedding for each of the two words.
A 'merge layer' then computes the dot product of these two embeddings.
The dot product is passed to a dense sigmoid layer, which outputs a value between 0 and 1.
The output is compared with the actual label (0 or 1), the loss is calculated, and backpropagation is used to update the embedding layers in each epoch.
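A single training step of this kind can be sketched in plain Python. The names target_emb, context_emb, predict, and train_pair, the three-word vocabulary, and the learning rate are all assumptions made for illustration; the "dense sigmoid layer" is reduced here to a bare sigmoid on the dot product.

```python
import math
import random

random.seed(2)
dim, lr = 4, 0.5  # assumed toy hyperparameters
words = ["learning", "deep", "backend"]
# Two separate embedding tables: one for target words, one for context words.
target_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}
context_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(target, context):
    """Dot product of the two embeddings, squashed to a value in (0, 1)."""
    dot = sum(t * c for t, c in zip(target_emb[target], context_emb[context]))
    return sigmoid(dot)

def train_pair(target, context, label):
    """One gradient step on the binary cross-entropy loss for a single pair."""
    grad = predict(target, context) - label  # d(loss) / d(dot product)
    t_vec, c_vec = target_emb[target], context_emb[context]
    for j in range(dim):
        t_old = t_vec[j]
        t_vec[j] -= lr * grad * c_vec[j]
        c_vec[j] -= lr * grad * t_old

for _ in range(100):
    train_pair("learning", "deep", 1)      # positive sample
    train_pair("learning", "backend", 0)   # negative sample

p_pos = predict("learning", "deep")
p_neg = predict("learning", "backend")
```

After the updates, the positive pair scores above 0.5 and the negative pair below it, which is the sense in which related words are pulled together in the embedding space.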

Advantages

The Skip-Gram Model has the following advantages:

1. It can capture two interpretations for a single word. In other words, there are two vector representations of the word Apple. One is for the business, while the other is for the fruit.
2. Skip-gram with negative sampling generally outperforms the other methods.

Both CBOW and skip-gram have their own benefits and drawbacks. According to Mikolov, Skip Gram works well with small amounts of data and represents rare words well.

CBOW, on the other hand, is speedier and provides better representations for terms that are used more frequently.