Table of contents
1. Introduction
2. Important operations
   2.1. Sentence Segmentation
   2.2. Word Tokenization
   2.3. Lowercase
   2.4. Stemming
   2.5. Lemmatization
   2.6. POS Tagging
3. FAQs
4. Key Takeaways
Last Updated: Mar 27, 2024

Introduction to Natural Language Processing (NLP)

Author: Soham Medewar

Introduction

Have you ever wondered how personal digital assistants such as Google Assistant, Siri, and Alexa are able to talk to us and understand human languages? They are built on advanced machine learning models that recognize human language and act accordingly. These models need large amounts of data for training. The raw data consists of human spoken or written sentences, and training directly on such data is of little use, because computers understand data in structured formats like spreadsheets and tables, while raw text is completely unstructured. So the main problem is the format of the data. The process of converting data from human language into a form that computers can understand is what we call Natural Language Processing.

Let us see some use cases of NLP in the real world.

  • You might have ordered food online from a restaurant. Contacting each customer individually is quite difficult for the company, so they add an extra feature to their service: an AI bot. This bot can understand human language and respond to it.
  • In sentiment analysis, we find the emotion of a sentence, which lets us categorize sentences according to their emotion. For example, “I am happy today!” indicates that the person is happy, while “I lost the game.” indicates that the person is disappointed.


Let us now look at different techniques to process the data so that computer systems can understand human language.

Important operations

Let us discuss some important operations that are used while preprocessing text.

Sentence Segmentation

To understand a piece of text, a human reads it sentence by sentence; joining the sentences together gives the full text. So, to capture the essence of the text, we must first break it into separate sentences before extracting any information. The first operation, therefore, is sentence segmentation. We will perform this operation using the nltk library.

If you don't have the nltk library, you can install it using the command below:

!pip install nltk
# nltk provides the tokenizers used throughout this article
import nltk
nltk.download('punkt')  # sentence tokenizer models (needed on the first run)

# sample text
text = "Code Ninjas buildings are separated into dojos and lobbies. The lobbies are also where parents pick up the kids. Different belts have different coding languages e.g. white belts are JavaScript, blue belts are Lua, purple and onwards use C#."

# split the text into a list of sentences
sentence_list = nltk.tokenize.sent_tokenize(text)
sentence_list
['Code Ninjas buildings are separated into dojos and lobbies.',
 'The lobbies are also where parents pick up the kids.',
 'Different belts have different coding languages e.g.',
 'white belts are JavaScript, blue belts are Lua, purple and onwards use C#.']

Word Tokenization

Breaking sentences into individual words is known as tokenization. Roughly speaking, we split off a word each time we encounter a space. Even a single punctuation mark is considered an individual token, as it carries some meaning.

def tokenize(sentences):
    # tokenize each sentence into a list of words and punctuation marks
    tokenized_words = []
    for sentence in sentences:
        tokenized_words.append(nltk.tokenize.word_tokenize(sentence))
    return tokenized_words

# store the result, since the later steps operate on this list
tokenized_list = tokenize(sentence_list)
print(tokenized_list)
[['Code', 'Ninjas', 'buildings', 'are', 'separated', 'into', 'dojos', 'and', 'lobbies', '.'], ['The', 'lobbies', 'are', 'also', 'where', 'parents', 'pick', 'up', 'the', 'kids', '.'], ['Different', 'belts', 'have', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'are', 'JavaScript', ',', 'blue', 'belts', 'are', 'Lua', ',', 'purple', 'and', 'onwards', 'use', 'C', '#', '.']]

Lowercase

We will convert all our data to lowercase to reduce the size of the vocabulary. Otherwise, the words “Code”, “code”, and “CODE” would be treated as three different words by the system, so we standardize all the data to lowercase.

We will use the lower() function to convert the data to lowercase.

# convert every token to lowercase in place
for i in range(len(tokenized_list)):
    for j in range(len(tokenized_list[i])):
        tokenized_list[i][j] = tokenized_list[i][j].lower()
print(tokenized_list)
[['code', 'ninjas', 'buildings', 'are', 'separated', 'into', 'dojos', 'and', 'lobbies', '.'], ['the', 'lobbies', 'are', 'also', 'where', 'parents', 'pick', 'up', 'the', 'kids', '.'], ['different', 'belts', 'have', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'are', 'javascript', ',', 'blue', 'belts', 'are', 'lua', ',', 'purple', 'and', 'onwards', 'use', 'c', '#', '.']]

Stemming

Stemming reduces an inflected word to its root form, for example “playing” becomes “play”. A closely related preprocessing step is stopword removal: stopwords are words that add no real information to your dataset. For example, in the sentence “I will play in the evening”, the words will, in, and the don’t add much meaning, so we remove them. The code below uses the stopwords list from the nltk library to remove these unnecessary words; a small stemming sketch follows the output.

nltk.download('stopwords')  # stopword lists (needed on the first run)
# drop the English stopwords from every sentence
stop_words = set(nltk.corpus.stopwords.words("english"))
for i in range(len(tokenized_list)):
    tokenized_list[i] = [w for w in tokenized_list[i] if w not in stop_words]
print(tokenized_list)
[['code', 'ninjas', 'buildings', 'separated', 'dojos', 'lobbies', '.'], ['lobbies', 'also', 'parents', 'pick', 'kids', '.'], ['different', 'belts', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'javascript', ',', 'blue', 'belts', 'lua', ',', 'purple', 'onwards', 'use', 'c', '#', '.']]
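
For stemming itself, nltk provides ready-made stemmers. Below is a minimal sketch using the Porter stemmer on a few sample words; the word list and variable names are only for illustration.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# sample words chosen only to show how suffixes are stripped
sample_words = ["playing", "played", "plays", "buildings", "languages"]
print([stemmer.stem(w) for w in sample_words])
# typical output: ['play', 'play', 'play', 'build', 'languag']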

Lemmatization

Lemmatization is the process of replacing a word with its root word (lemma), i.e., its simplest dictionary form. For example, “building” and “buildings” mean the same thing, but the computer would treat them as different words, so we replace each word with its root form.

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data used by the lemmatizer (needed on the first run)
lemmatizer = WordNetLemmatizer()
# replace every token with its lemma (defaults to noun lemmas)
for i in range(len(tokenized_list)):
    for j in range(len(tokenized_list[i])):
        tokenized_list[i][j] = lemmatizer.lemmatize(tokenized_list[i][j])
print(tokenized_list)
[['code', 'ninja', 'building', 'separated', 'dojos', 'lobby', '.'], ['lobby', 'also', 'parent', 'pick', 'kid', '.'], ['different', 'belt', 'different', 'coding', 'language', 'e.g', '.'], ['white', 'belt', 'javascript', ',', 'blue', 'belt', 'lua', ',', 'purple', 'onwards', 'use', 'c', '#', '.']]

POS Tagging

POS tagging (part-of-speech tagging) is the process of marking up each word in a text with a particular part of speech, based on its definition and its context. It reads the text in a language and assigns a specific tag (part of speech) to each word. It is also called grammatical tagging.

nltk.download('averaged_perceptron_tagger')  # POS tagger model (needed on the first run)
# tag the tokens of each sentence with their part of speech
for i in range(len(tokenized_list)):
    print(nltk.pos_tag(tokenized_list[i]))
[('code', 'NN'), ('ninja', 'IN'), ('building', 'NN'), ('separated', 'VBN'), ('dojos', 'JJ'), ('lobby', 'NN'), ('.', '.')]
[('lobby', 'NN'), ('also', 'RB'), ('parent', 'NN'), ('pick', 'NN'), ('kid', 'NN'), ('.', '.')]
[('different', 'JJ'), ('belt', 'VBD'), ('different', 'JJ'), ('coding', 'NN'), ('language', 'NN'), ('e.g', 'NN'), ('.', '.')]
[('white', 'JJ'), ('belt', 'NN'), ('javascript', 'NN'), (',', ','), ('blue', 'JJ'), ('belt', 'NN'), ('lua', 'NN'), (',', ','), ('purple', 'NN'), ('onwards', 'NNS'), ('use', 'VBP'), ('c', 'JJ'), ('#', '#'), ('.', '.')]

Some of the abbreviations and their meanings:

Abbreviation   Meaning
IN             preposition / subordinating conjunction
NN             noun, singular (cat, tree)
VBP            verb, present tense, not 3rd person singular (wrap)
NNS            noun, plural (desks)
JJ             adjective (large)
RB             adverb (occasionally, swiftly)

FAQs

  1. What is the difference between NLP and Machine Learning?
    NLP focuses on interpreting and processing human language, whereas Machine Learning is the broader practice of making predictions from patterns learned in data; in practice, most modern NLP systems are built on machine learning models.
     
  2. What is the purpose of stemming?
    Stemming is a natural language processing technique that reduces inflected words to their root forms, which helps in preprocessing text, words, and documents for text normalization.
     
  3. What are the 5 steps of NLP?
    The five phases of NLP are lexical (structural) analysis, syntactic analysis (parsing), semantic analysis, discourse integration, and pragmatic analysis.
     
  4. Why is NLP important to study?
    NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics.
     
  5. What are the disadvantages of NLP?
    Building NLP systems takes a lot of time and training data, and the results are not 100% reliable, since ambiguous or context-dependent language can still be misinterpreted.

Key Takeaways

In this article, we have discussed the following topics:

  • Segmenting a text
  • Tokenizing a sentence
  • Stemming and Lemmatization
  • Tagging
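
Putting it all together, here is a minimal end-to-end sketch of the preprocessing pipeline covered above (segmentation, tokenization, lowercasing, stopword removal, lemmatization, and POS tagging). The helper function name preprocess and the sample text are only for illustration.

import nltk
from nltk.stem import WordNetLemmatizer

# download the resources used below (needed on the first run)
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(resource)

def preprocess(text):
    # segment, tokenize, lowercase, remove stopwords, lemmatize, and POS-tag the text
    stop_words = set(nltk.corpus.stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tagged_sentences = []
    for sentence in nltk.tokenize.sent_tokenize(text):
        tokens = [t.lower() for t in nltk.tokenize.word_tokenize(sentence)]
        tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
        tagged_sentences.append(nltk.pos_tag(tokens))
    return tagged_sentences

print(preprocess("I will play in the evening. The buildings are tall."))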


Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning. 

Happy Coding!
