Code360 powered by Coding Ninjas X Naukri.com
Last Updated: Mar 27, 2024
Difficulty: Easy

Stemming in NLP

Introduction 

Stemming is one of the most widely used preprocessing steps in Natural Language Processing (NLP) projects. If you are new to the field, you may have heard the term without knowing exactly what it means.

Stemming reduces a word to its stem, or root, by removing part of the word, usually a suffix. The resulting stem is not always the word's dictionary root, and it may not be a valid word at all.

Assume we have a collection of three words: go, going, and gone. Each is a distinct form of the same root word, go, so ideally all three reduce to the single stem go.
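
As a rough illustration, the sketch below strips a few common suffixes by hand. The function `toy_stem` and its suffix list are made up for this example and are far cruder than any real stemmer; the point is only to show several word forms collapsing to one stem.

```python
# Toy suffix stripper: an illustrative assumption, not a real stemming algorithm.
SUFFIXES = ("ing", "es", "s")  # checked in this order, longest first


def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["go", "goes", "going", "gone"]
stems = {w: toy_stem(w) for w in words}
print(stems)
```

Note that the irregular form gone is left untouched, because no suffix rule in this toy list covers it. Plain suffix stripping generally cannot handle irregular forms like this.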

Why do we need Stemming? 

In English, a single word can appear in many variations. When building NLP or machine learning models, a text corpus that contains many such variants carries redundant data, which can make the models less effective.

Stemming is a text normalization step that is applied to words, sentences, and documents during preprocessing.

Normalizing text by reducing duplication and stemming words to their base form is vital for building a solid model.
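
To make the redundancy argument concrete, here is a minimal sketch that counts distinct tokens before and after stemming. The sample sentence and the `toy_stem` helper are invented for this example; a real project would use a library stemmer instead.

```python
import re
from collections import Counter


# Hypothetical mini-stemmer for illustration only.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


text = "She connects the cables. He connected them yesterday. We are connecting more."
tokens = re.findall(r"[a-z]+", text.lower())

raw_counts = Counter(tokens)                         # vocabulary before stemming
stem_counts = Counter(toy_stem(t) for t in tokens)   # vocabulary after stemming

print(len(raw_counts), "distinct tokens before stemming")
print(len(stem_counts), "distinct tokens after stemming")
print("occurrences of 'connect':", stem_counts["connect"])
```

The three variants connects, connected, and connecting are counted as one term after stemming, so the vocabulary shrinks and the counts for that term are pooled.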

Over Stemming

Over stemming occurs when more of a word is removed than necessary, so that two or more words with different meanings are mistakenly reduced to the same stem when they should have distinct stems. For instance, consider university and universe. Some stemming algorithms reduce both words to the stem univers, implying that they mean the same thing, which is incorrect. As a result, we must be cautious when selecting a stemming algorithm and attempting to optimize the model. Under stemming, as you might expect, is the opposite problem.
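
The following sketch shows one way to spot this kind of collision: stem every word and group the words by the stem they land on. The suffix list is deliberately aggressive and was invented for this illustration; it does not reproduce any real algorithm.

```python
from collections import defaultdict

# Deliberately aggressive suffix list, invented for this illustration.
AGGRESSIVE_SUFFIXES = ("ities", "ity", "al", "e")


def aggressive_stem(word):
    for suffix in AGGRESSIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["universe", "university", "universal", "universities"]

# Group the words by the stem they collapse to.
groups = defaultdict(list)
for w in words:
    groups[aggressive_stem(w)].append(w)

print(dict(groups))
```

Any group that mixes words with unrelated meanings, as this one does, is a sign of over stemming.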

Under Stemming

Under stemming occurs when two or more words are reduced to different stems even though they should be reduced to the same one. Take the words "data" and "datum." Some algorithms may reduce these words to dat and datu respectively, which is incorrect: both should be reduced to a single stem, datum. Making the algorithm more aggressive to fix this, however, may result in over stemming.
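
A simple check for under stemming is to list groups of words that ought to share a stem and flag every group whose members end up with different stems. In the sketch below, both the conservative stemmer and the word groups are assumptions made up for the example.

```python
# Conservative toy stemmer (an assumption for this example): strips only a plural "s".
def conservative_stem(word):
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Hand-made groups of words that we would like to share one stem.
related_groups = [
    ["data", "datum"],
    ["connect", "connects"],
]

# A group is under-stemmed when its members end up with different stems.
under_stemmed = [
    group for group in related_groups
    if len({conservative_stem(w) for w in group}) > 1
]
print(under_stemmed)
```

Here connect and connects are merged correctly, while data and datum stay apart, which is the under stemming case described above.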

Types of Stemming: 

In NLTK there are various types of stemming; let's see how they are different: 

Porter, Snowball, Lancaster, and Regexp stemming are four alternative ways of finding the stem of a word. They differ in how they reduce a word to its root.

Porter Stemmer – PorterStemmer()

Martin Porter published the Porter Stemmer, or Porter algorithm, in 1980. The method applies five phases of word reduction, each with its own set of mapping rules. The Porter Stemmer is one of the oldest stemmers still in common use and is noted for its speed and ease of use.
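
Many of Porter's mapping rules are conditioned on the "measure" m of a stem, which counts the vowel-consonant sequences in the pattern [C](VC)^m[V]. The sketch below computes m and applies a single rule from the algorithm's first phase, (m > 0) EED -> EE. The helper names are our own, and this is only a small fragment for illustration, not the full algorithm.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences in the pattern [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m


def apply_eed_rule(word):
    """Rule '(m > 0) EED -> EE': applied only if the part before 'eed' has m > 0."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word


for w in ["tree", "trouble", "troubles"]:
    print(w, measure(w))
print(apply_eed_rule("agreed"), apply_eed_rule("feed"))
```

The condition on m is what keeps short words such as feed intact while longer words such as agreed are reduced.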

Snowball Stemmer – SnowballStemmer()

Martin Porter also created the Snowball Stemmer. For English, the algorithm it uses is known as the "English Stemmer" or "Porter2 Stemmer." It offers a minor improvement over the original Porter Stemmer in both logic and speed, and it is more accurate.

Lancaster Stemmer – LancasterStemmer()

The Lancaster Stemmer is simple but aggressive, so it tends to over-stem. Over-stemming produces stems that are not linguistically valid or that carry no meaning.

Regexp Stemmer – RegexpStemmer()

The Regexp stemmer uses regular expressions to identify morphological affixes. Any substring that matches the regular expressions is removed.
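
The idea can be sketched with Python's standard `re` module alone. The pattern is the one used in the NLTK example in this article; the helper `regexp_stem` and its `min_length` parameter are our own names and only approximate what a regexp-based stemmer does.

```python
import re

# Suffixes to strip: a trailing "ing", "s", "e", or "able".
SUFFIX_PATTERN = re.compile(r"ing$|s$|e$|able$")


def regexp_stem(word, min_length=4):
    """Remove whatever the pattern matches; leave words shorter than min_length alone."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)


for w in ["running", "cars", "advisable", "is", "university"]:
    print(w, "->", regexp_stem(w))
```

Because the stemmer does nothing beyond deleting matches, its quality depends entirely on the pattern: running becomes runn here, and university is untouched because no alternative in the pattern matches it.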

Let's see how we can use these in NLTK. 

Stemming in NLTK

Let's look at the code, where we apply each kind of stemmer in Python with the help of the NLTK library.

# Import the four stemmers provided by NLTK
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
# Strip a trailing "ing", "s", "e", or "able" from words of at least 4 characters
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ["universe", "university", "universal", "universities"]

# Print a header row, then one row of stems per word
row = "{0:20}{1:20}{2:20}{3:30}{4:40}"
print(row.format("Word", "Porter Stemmer", "Snowball Stemmer", "Lancaster Stemmer", "Regexp Stemmer"))
for word in word_list:
    print(row.format(word, porter.stem(word), snowball.stem(word), lancaster.stem(word), regexp.stem(word)))

 

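Output

The script prints a table with one row per word and one column per stemmer. The Porter stemmer reduces all four words to univers, while the regexp stemmer turns universe into univers and universities into universitie and leaves university and universal unchanged.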

FAQs

1. When should we use stemming?
Use stemming when matching different forms of a word matters. Once a word's variants are reduced to a common stem, a search can return results that would otherwise be missed. Because it increases the amount of relevant information retrieved, stemming is widely used in search queries and information retrieval.
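
A minimal sketch of this retrieval effect: build an inverted index keyed by stems, then stem the query before looking it up. The documents, the `toy_stem` helper, and the `search` function are all invented for this example.

```python
from collections import defaultdict


# Hypothetical mini-stemmer used only to keep the example self-contained.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


documents = {
    1: "the router connected to the network",
    2: "connecting printers is easy",
    3: "the weather is nice",
}

# Inverted index from stem to the ids of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[toy_stem(token)].add(doc_id)


def search(query):
    """Return ids of documents that share a stem with the query word."""
    return sorted(index.get(toy_stem(query.lower()), set()))


print(search("connects"))
print(search("weather"))
```

The query connects appears in none of the documents verbatim, yet it retrieves the two documents that contain connected and connecting, because all three share the stem connect.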

2. What are the methods of stemming?
There are three types of stemming algorithms: truncating methods, statistical methods, and mixed methods. The kind of method to be used depends on the application.

3. Is stemming beneficial to improving performance?
Yes, in many applications. Because a single stem stands in for a wide range of word variants, the model or search system treats them as one term, which often improves performance.

4. When it comes to stemming and lemmatization, what's the difference?
Both stemming and lemmatization produce the base form of inflected words. The main difference is that a stem may or may not be a real word, whereas a lemma is always a real word in the language.
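
The contrast can be sketched side by side. In the example below, a hand-written lookup table stands in for a real lemmatizer and a crude suffix stripper stands in for a real stemmer; both are assumptions made purely for illustration.

```python
# Hand-written lookup table standing in for a real lemmatizer (an assumption
# for this example; real lemmatizers rely on a vocabulary and morphology).
LEMMA_TABLE = {"studies": "study", "studying": "study", "better": "good"}


def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)


def toy_stem(word):
    # Crude suffix stripping, also only for illustration.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["studies", "studying", "better"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stem of studies comes out as studi, which is not an English word, while the lemma is study. For better, suffix stripping changes nothing, whereas the lookup returns good.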

Key Takeaways

In a nutshell, stemming reduces a word to its root by stripping affixes, typically suffixes. The resulting root need not be an actual English word. Stemming can improve a model's performance and is used in many applications, such as information retrieval and search engines.

Happy Learning!

 
