Basic Implementation
Python's 'nltk' (Natural Language Toolkit) is one of the largest and most popular libraries for natural language processing, and many data scientists use it for text processing and related tasks.
To implement the lemmatization concept in Python, 'nltk' is the best option for us.
'nltk' uses the 'WordNet' database to reduce words to their root forms.
To use the lemmatizer, we need to import 'WordNetLemmatizer', which is packaged in the nltk.stem.wordnet module:
from nltk.stem.wordnet import WordNetLemmatizer

Then we can call:
lemmed_data = WordNetLemmatizer().lemmatize(word)  # 'word' is the word to be reduced

Here, the lemmatize() method performs the task discussed above using the WordNet database: it takes a word as its argument and returns that word's lemma.
# Basic example of using the lemmatization concept
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # download the required corpora (first run only)
nltk.download('wordnet', quiet=True)

text = "Are the human people the ones who started the war? Is AI a bad thing ?, It will change your view of the matrix. Look at it at least twice and definitely watch part 2. The first time you see The Second Renaissance it may look boring."
# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
# Tokenize it
words = text.split()
print(words)

The output will be the list of tokens from the split text:
['are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring']
# The statement below removes all the stop words such as 'the', 'you', 'it', etc.
words = [word for word in words if word not in stopwords.words("english")]
# Implementing the lemmatization concept
from nltk.stem.wordnet import WordNetLemmatizer as wnl
# Reduce words to their root form
lemmed = [wnl().lemmatize(word) for word in words]
print(lemmed)

The output of the Python snippet will be as follows:
['human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing', 'change', 'view', 'matrix', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring']
Lemmatization with Parts Of Speech
The above example shows that the only change after lemmatization is 'ones' to 'one', i.e., a plural noun turned into a singular noun. The lemmatizer needs to know, or make an assumption about, the part of speech of each word in the given input before it can reduce the word to its normalized form; by default, WordNetLemmatizer treats every word as a noun.
This makes sense: the lemmatizer reduces words based on their parts of speech.
This leads to an additional optional parameter of the lemmatize() method, 'pos':
lemmed_data = wnl().lemmatize(word, pos='v')  # 'v' stands for verb here

For the same example, we can use the pos parameter to convert 'boring' and 'started' to their root forms 'bore' and 'start'. This can be done as shown below:
lemmed = [wnl().lemmatize(word, pos='v') for word in lemmed]
print(lemmed)

The output, as discussed earlier, will be:
['human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing', 'change', 'view', 'matrix', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore']
We can use other pos values when we need them: 'n' (noun, the default), 'v' (verb), 'a' (adjective), and 'r' (adverb).
Compared with lemmatization, stemming doesn't consult a dictionary the way the WordNet lemmatizer does; it applies rule-based suffix stripping instead, so stemming is the less memory-intensive of the two methods.
You can learn more about this functionality in the nltk documentation on lemmatization.
FAQs
-
What is the Lemmatization of words?
Lemmatization of words is an important step in the Natural Language Processing pipeline, where we reduce words to their root forms using morphological analysis.
-
What are the common differences between Lemmatization and Stemming?
Lemmatization uses morphological analysis to reduce words to their root forms, whereas stemming uses algorithms that cut off the tails of words. Lemmatization uses a dictionary, but stemming doesn't.
-
How does Python support Lemmatization?
Python's nltk library provides a default lemmatizer, WordNetLemmatizer(), which uses the WordNet database. Its lemmatize() method takes a word and, optionally, a pos tag as parameters.
-
What is the output of the Lemmatization step?
The output of the lemmatization step is the lemma of the input word; the lemma is the reduced or normalized form of the word.
Key Takeaways
In this article, we have briefly discussed the concept of lemmatization, how it is used, the differences between stemming and lemmatization, and how to use Python to implement lemmatization.
Here is a small task for you:
Given a list of words such as 'Change', 'Changes', 'Changed', 'Changing',
what is the output when we apply stemming and lemmatization?
Comment your answer below.
Hey Ninjas! You can check out more unique courses on machine learning concepts through our official website, Coding Ninjas, and check out Coding Ninjas Studio to learn through articles and other content important to your growth.
Happy Learning!