Why should we remove Stop Words?
Stop words are in plenty in any natural language, and therefore by removing them, we can filter out the low-level information and focus on more important words.
For example, if we search “How to study NLP” on a search engine and the engine matches pages containing “how” and “to”, we will get a lot of unwanted results, since these words appear in almost every English web page. Removing “how” and “to” lets the search engine focus on the more important words “study” and “NLP”, so we get resources that are actually of interest to us.
Therefore, removing stop words shrinks the data, which reduces the training time of the model. It can also improve performance, as the search engine example above shows: dropping the stop words led to more relevant results.
When do we remove Stop Words?
Do you think that we can remove stop words in every task?
The answer is a big NO!
Let’s say we want to predict the sentiment of the sentence “The decoration was not good.”
After removing the stop words, we are left with “decoration good.”
Although the original sentence is a negative review, removing the stop words, including “not”, flips it into a positive one.
So removing stop words is not suitable for this case.
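To see this concretely, here is a minimal sketch using a tiny illustrative stop word set (NLTK’s full English list, shown later in this article, also contains “not”):
# Minimal illustration: treating "not" as a stop word throws away the negation
stop_words = {"the", "was", "not"}  # tiny illustrative set, not a library's full list
review = "The decoration was not good."
kept = [w for w in review.split() if w.lower().strip(".") not in stop_words]
print(" ".join(kept))  # -> decoration good.
The sentiment-bearing word “not” is gone, which is exactly why the filtered review now reads as positive.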
In general, stop word removal suits tasks like text classification, but it can be a curse in tasks like sentiment analysis, machine translation, etc. Therefore, research your task before removing stop words to check whether it is appropriate.
Removing stop words using NLTK
Natural Language Toolkit (NLTK) is a popular suite of Python libraries for working on NLP.
There is no universally accepted list of stop words, but most libraries provide their own list, and we can add or remove words from that list as per our task.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print("Number of stop words is ", len(stop_words))
print("Stop words are:",stop_words)

Output
Number of stop words is 179
Stop words are: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
As we can see, NLTK’s English stop word list contains 179 words.
Let us see how we can remove the stop words.
text = "Coding ninjas is one of the best learning platforms."
words = [word for word in text.split() if word.lower() not in stop_words]
modified_text = " ".join(words)
print("Original text:", text)
print("Modified text:", modified_text)

Output
Original text: Coding ninjas is one of the best learning platforms.
Modified text: Coding ninjas one best learning platforms.
As we can see, “is”, “of”, and “the” have been removed from the text.
In the second line of code, we first split the text into words, since stop words are stored as individual words. Each word is lowercased only for the comparison, because the words in the stop word list are lowercase, and we keep the words that are not in the stop_words list. In the third line, we join the kept words with spaces and print the modified sentence.
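Note that text.split() leaves punctuation attached to the last word (“platforms.”). If you need cleaner tokens, you can tokenize with NLTK instead of split(); the following is a small sketch, assuming the tokenizer data has been downloaded (the resource is named 'punkt', or 'punkt_tab' in newer NLTK versions):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data; newer NLTK versions may ask for 'punkt_tab'
stop_words = set(stopwords.words('english'))

text = "Coding ninjas is one of the best learning platforms."
tokens = word_tokenize(text)  # splits the trailing period into its own token
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # roughly: ['Coding', 'ninjas', 'one', 'best', 'learning', 'platforms', '.']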
We can also edit the list of stop words.
To add a single word to the stop words list:
stop_words = stopwords.words('english') # returns a list of stop words in English language
print("Original length:", len(stop_words))
stop_words.append('example') # to add a single word
print("Modified length:", len(stop_words))

Output
Original length: 179
Modified length: 180
To add a list of words to the stop_words list:
stop_words = stopwords.words('english') # returns a list of stop words in English language
print("Original length:", len(stop_words))
stop_words.extend(['stopwordone', 'stopwordtwo']) # to add a list of stop words
print("Modified length:", len(stop_words))

Output
Original length: 179
Modified length: 181
To remove a word from the list of stop words:
stop_words = stopwords.words('english') # returns a list of stop words in English language
print("Original length:", len(stop_words))
stop_words.remove('a') # to remove a word from the list of stop words
print("Modified length:", len(stop_words))

Output
Original length: 179
Modified length: 180
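Putting these edits together, a common pattern for sentiment-style tasks is to keep the negation words and to convert the list to a set for faster membership checks. A minimal sketch:
from nltk.corpus import stopwords

# Keep negations so that "not good" does not collapse to "good"
custom_stop_words = set(stopwords.words('english')) - {"not", "no", "nor"}
review = "The decoration was not good."
kept = [w for w in review.split() if w.lower().strip(".") not in custom_stop_words]
print(" ".join(kept))  # -> decoration not good.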
NLTK has stop words in 24 different languages, which we can check below.
print(stopwords.fileids()) # to see the available languages

Output
['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
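Any of these names can be passed to stopwords.words() to load that language’s list, for example:
from nltk.corpus import stopwords

# The same API works for any language returned by stopwords.fileids()
french_stop_words = stopwords.words('french')
print("Number of French stop words:", len(french_stop_words))
print("First few:", french_stop_words[:10])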
FAQs
1. What is removing stop words in Python?
Ans: Stop words are the most frequently occurring words, such as “a”, “an”, and “the”, that add little to the meaning of the data, and hence they can be removed from it.
2. What are the advantages of removing stop words?
Ans: Removing stop words can help reduce data size and hence reduce the training time of the model. It can also increase the model's performance by giving more accurate results.
3. Is it mandatory to remove stop words?
Ans: No, it depends on your task. Removing stop words in a sentiment analysis task can hamper your performance as it can flip the intent of the sentence.
4. What is NLTK?
Ans: Natural Language Toolkit (NLTK) is a super helpful library that can be used for NLP tasks like text preprocessing, removing stop words, etc.
5. How do you remove stop words in NLP?
Ans: You can use various libraries like NLTK, spaCy, etc., to remove stop words.
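For reference, here is a minimal spaCy sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coding ninjas is one of the best learning platforms.")

# Each token carries an is_stop flag based on spaCy's built-in stop word list
filtered = [token.text for token in doc if not token.is_stop]
print(filtered)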
Key Takeaways
This article discussed stop words, and when and how we can remove them using NLTK.
We hope this blog has helped you enhance your knowledge regarding stop words and NLTK. If you would like to learn more, check out our free content on NLP and more unique courses. Do upvote our blog to help other ninjas grow.
Happy Coding!