Introduction
A word embedding is a learned representation of text in which words that have a similar meaning have a similar representation. This approach to representing words and documents may be viewed as one of the key breakthroughs of deep learning on challenging natural language processing problems. Word embeddings are a technique for extracting features out of text so that we can feed those features into a machine learning model that works with text data. They attempt to preserve syntactic and semantic information.
OddOneOut Model using Word Embedding
Word embedding
In natural language processing, word embedding is a term used for the representation of words for text analysis, typically as a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
Word2Vec, as implemented in Gensim, is a word embedding approach that addresses this problem and enables similar terms to have similar vector representations and, consequently, to end up close together in the vector space.
Word embeddings are learned with a neural network that has one input layer, one hidden layer, and one output layer.
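As a toy illustration of the idea that words closer in the vector space are closer in meaning, consider the sketch below; the three-dimensional vectors are made up for illustration, while real embeddings have tens to hundreds of dimensions.

import numpy as np

# made-up three-dimensional "embeddings"; real models use many more dimensions
dog = np.array([0.9, 0.1, 0.3])
puppy = np.array([0.8, 0.2, 0.3])
car = np.array([0.1, 0.9, 0.7])

# words with similar meaning sit closer together in the vector space
print(np.linalg.norm(dog - puppy))  # small distance
print(np.linalg.norm(dog - car))    # much larger distance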
Odd One Out
The odd one out problem is one of the most interesting and widely used problems for testing a person's logical reasoning abilities. It appears regularly in competitive exams and aptitude tests to check a candidate's analytical skills and decision-making ability. In this article, we will write Python code that can be used to find the odd word among a given set of words.
We will find the average vector of all the given word vectors. Then we compare the cosine similarity of each word vector with the average vector; the word with the least similarity will be our odd word.
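Below is a minimal sketch of that idea with made-up two-dimensional vectors; the actual implementation later in this article uses pre-trained word vectors instead.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# made-up 2-D vectors: three "fruit-like" words and one outlier
words = ['apple', 'mango', 'guava', 'python']
vectors = [np.array([0.9, 0.1]), np.array([0.8, 0.2]),
           np.array([0.85, 0.15]), np.array([0.1, 0.9])]

# average vector of all the word vectors
mean_vector = np.mean(vectors, axis=0)

# the word whose vector is least similar to the average is the odd one out
similarities = [cosine_similarity([v], [mean_vector])[0][0] for v in vectors]
print(words[int(np.argmin(similarities))])  # prints 'python'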
Steps in the process
- Word2Vec in Python.
- Installing modules. We start by installing the 'gensim' and 'nltk' modules.
- Importing libraries: from nltk.tokenize import sent_tokenize, word_tokenize; import gensim; from gensim.models import Word2Vec.
- Reading the text data.
- Preparing the corpus.
- Building the Word2Vec model using Gensim (a minimal sketch of these steps follows this list).
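The sketch below walks through the importing, corpus-preparation, and model-building steps above, assuming a small made-up text in place of a real corpus; note that the vector size parameter is named size in gensim 3.x and vector_size in gensim 4.x.

from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec
# nltk.download('punkt') may be required once before tokenizing

text = "Dogs are loyal animals. Puppies are young dogs. Cars need fuel."

# prepare the corpus: a list of tokenized, lower-cased sentences
corpus = [word_tokenize(sentence.lower()) for sentence in sent_tokenize(text)]

# build the Word2Vec model using gensim (use vector_size=100 on gensim 4.x)
model = Word2Vec(corpus, size=100, window=5, min_count=1)

# every word seen in the corpus now has a 100-dimensional vector
print(model.wv['dogs'].shape)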
Using Word2Vec and Gensim
Word2Vec is perhaps the most popular technique for learning word embeddings using a shallow neural network. It was developed by Tomas Mikolov and his colleagues at Google in 2013. Word2vec is a combination of models used to produce distributed representations of words in a corpus C. Word2Vec (W2V) is an algorithm that accepts a text corpus as input and outputs a vector representation for each word. We will use the Google pre-trained model for the Odd One Out task that we will implement shortly. To add the gensim library to Python, install it with pip (pip install gensim) and then use the following code.
import numpy as np
import gensim
from gensim.models import word2vec,KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
# Load the pre-trained Google News vectors (the filename below is assumed; use the path of the model downloaded to your machine)
vector_word_notations = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

Code to define the function that points out the odd word:
def odd_word_out(input_words):
    '''The function accepts a list of words and returns the odd word.'''
    # Generate all word embeddings for the given list of words
    whole_word_vectors = [vector_word_notations[i] for i in input_words]
    # Average vector of all the word vectors
    mean_vector = np.mean(whole_word_vectors, axis=0)
    # Iterate over every word and find its similarity to the average vector
    odd_word = None
    minimum_similarity = 99999.0  # can be any very high value
    for i in input_words:
        similarity = cosine_similarity([vector_word_notations[i]], [mean_vector])[0][0]
        if similarity < minimum_similarity:
            minimum_similarity = similarity
            odd_word = i
        print("cosine similarity score between %s and mean_vector is %.3f" % (i, similarity))
    print("\nThe odd word is: " + odd_word)

The cosine similarity function is essential for implementing this algorithm. It computes similarity as the normalized dot product of X and Y. In short, we can use it to tell how closely two terms are related. Let us look at a concrete example.
Now we can test this with example inputs such as:
input_1 = ['apple','mango','juice','python','orange','guava'] # python is odd word
odd_word_out(input_1)
In this implementation, we have used KeyedVectors (from the gensim module) and the cosine_similarity function (provided by sklearn). For the input above, the function should report 'python' as the odd word.
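The earlier claim that cosine similarity is the normalized dot product of X and Y can be checked directly; the sketch below uses made-up vectors purely for illustration.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])

# normalized dot product of x and y
manual = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# sklearn expects 2-D inputs and returns a 2-D array
from_sklearn = cosine_similarity([x], [y])[0][0]

print(manual, from_sklearn)  # the two values agree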
The algorithm of OddOneOut
What we are doing is passing a list of words to our program. We then take the average of the word vectors of all the words; that is, if the word vectors of the words in the list are v1, v2, v3, ..., vn (n = number of words in the list), the average vector can be found by taking the mean of all the word vectors with np.mean([v1,v2,v3,...,vn], axis=0). Next, we set a variable minimum_similarity and give it a very high initial value, which makes the comparisons that follow straightforward. We then start a loop, iterate over each of the words in the list, and compute the cosine similarity between each word and the average vector we calculated.
Our odd word will be the one with the minimum similarity to the average vector. The average vector is computed from n-k words with similar contexts and k words (where k is a small number) whose context differs from that of the n-k words; because k is small, the average stays close to the majority, so the outlier ends up with the lowest similarity score.
Word2Vec was presented in two papers published in September and October 2013 by a team of researchers at Google. Along with the papers, the researchers released their implementation in C. A Python implementation in Gensim followed soon after the first paper.
The basic assumption of Word2Vec is that two words with similar contexts also share a similar meaning and therefore a similar vector representation in the model. For example, "dog", "puppy", and "pup" are frequently used in similar contexts, with similar surrounding words such as "good", "fluffy", or "cute", and according to Word2Vec they will therefore share a similar vector representation.
From this assumption, Word2Vec can be used to find the relations between words in a dataset, compute the similarity between them, or use the vector representations of those words as input for other applications such as text classification or clustering.
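Assuming the pre-trained vectors loaded earlier as vector_word_notations, the sketch below shows how such relations and similarities might be queried; the example words are only illustrative.

# assuming vector_word_notations is the KeyedVectors object loaded earlier
# similarity between two words (higher means more closely related)
print(vector_word_notations.similarity('dog', 'puppy'))
print(vector_word_notations.similarity('dog', 'car'))

# nearest neighbours of a word in the embedding space
print(vector_word_notations.most_similar('dog', topn=3))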
Libraries that are used in the code are as follows:
- xlrd==1.1.0: reads data from Excel spreadsheets.
- spaCy==2.0.12: natural language processing toolkit for tokenization, tagging, and parsing.
- gensim==3.4.0: topic modelling and word embedding library that provides Word2Vec and KeyedVectors.
- scikit-learn==0.19.1: machine learning library that provides the cosine_similarity function.
- seaborn==0.8: statistical data visualization library built on top of matplotlib.




