Introduction
Have you ever wondered how personal digital assistants such as Google Assistant, Siri, and Alexa are able to talk to us and understand human language? They are built on machine learning models that recognize human language and act on it. These models need large amounts of data for training. The raw data consists of spoken or written sentences, and training directly on it is of little use: computers work with data in structured formats such as spreadsheets and tables, while raw text is completely unstructured. So the core problem is the format of the data. Natural Language Processing (NLP) is the field concerned with transforming human language into a form that computers can understand and process.
Let us see some use cases of NLP in the real world.
- You might have ordered food online from a restaurant. Contacting each customer individually is difficult for the company, so they add a chatbot to their service. The bot can understand human language and respond to it.
- In sentiment analysis, we identify the emotion of a sentence, which lets us categorize sentences by the feeling they express. For example, “I am happy today!” indicates that the person is happy, while “I lost the game.” indicates that the person is disappointed. A minimal sketch of this appears right after this list.
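As a concrete taste of sentiment analysis, here is a minimal sketch using NLTK's bundled VADER analyzer. This is just one possible tool, shown for illustration; the vader_lexicon data package it needs is part of NLTK's standard data distribution.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER word list

analyzer = SentimentIntensityAnalyzer()
for sentence in ["I am happy today!", "I lost the game."]:
    # polarity_scores returns neg/neu/pos scores plus an overall compound score
    print(sentence, analyzer.polarity_scores(sentence))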
Let us now look at the different techniques used to process text data so that computer systems can understand human language.
Important operations
Let us discuss some important operations that are used while preprocessing text.
Sentence Segmentation
To understand a piece of text, a human reads it sentence by sentence; the text is simply those sentences joined together. So to extract information from a text, we must first break it into separate sentences. This first operation is called sentence segmentation, and we will perform it using the nltk library.
If you don't have the nltk library installed, you can install it using the command below:
!pip install nltk
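Installing the package is not quite enough: NLTK downloads its tokenizer models, stopword lists, and other resources separately, on demand. A one-time download of everything this article uses (these are NLTK's standard data package names) looks like this:

import nltk

# one-time download of the NLTK data packages used in this article
nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # dictionary used by the lemmatizer
nltk.download('omw-1.4')                     # extra WordNet data some NLTK versions need
nltk.download('averaged_perceptron_tagger')  # POS tagger model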
# import the NLTK toolkit
import nltk

# sample text
text = "Code Ninjas buildings are separated into dojos and lobbies. The lobbies are also where parents pick up the kids. Different belts have different coding languages e.g. white belts are JavaScript, blue belts are Lua, purple and onwards use C#."

# splitting the text into a list of sentences
sentence_list = nltk.tokenize.sent_tokenize(text)
sentence_list
['Code Ninjas buildings are separated into dojos and lobbies.',
'The lobbies are also where parents pick up the kids.',
'Different belts have different coding languages e.g.',
'white belts are JavaScript, blue belts are Lua, purple and onwards use C#.']
Word Tokenization
Breaking sentences into individual words is known as tokenization. Roughly, a new token starts after each space, and each punctuation mark is also treated as an individual token, since punctuation carries meaning of its own.
def tokenize(sentences):
    # word-tokenize each sentence and collect the results
    tokenized_words = []
    for sentence in sentences:
        tokenized_words.append(nltk.tokenize.word_tokenize(sentence))
    return tokenized_words

# store the result, since the later steps modify it in place
tokenized_list = tokenize(sentence_list)
print(tokenized_list)
[['Code', 'Ninjas', 'buildings', 'are', 'separated', 'into', 'dojos', 'and', 'lobbies', '.'], ['The', 'lobbies', 'are', 'also', 'where', 'parents', 'pick', 'up', 'the', 'kids', '.'], ['Different', 'belts', 'have', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'are', 'JavaScript', ',', 'blue', 'belts', 'are', 'Lua', ',', 'purple', 'and', 'onwards', 'use', 'C', '#', '.']]
Lowercase
We convert all our data to lowercase to reduce the size of the vocabulary: the words “Code”, “code”, and “CODE” would otherwise be treated as three different words by the system, so we standardize everything to lowercase.
We will use the lower() string method to convert the data to lowercase.
for i in range(len(tokenized_list)):
    for j in range(len(tokenized_list[i])):
        tokenized_list[i][j] = tokenized_list[i][j].lower()
print(tokenized_list)
[['code', 'ninjas', 'buildings', 'are', 'separated', 'into', 'dojos', 'and', 'lobbies', '.'], ['the', 'lobbies', 'are', 'also', 'where', 'parents', 'pick', 'up', 'the', 'kids', '.'], ['different', 'belts', 'have', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'are', 'javascript', ',', 'blue', 'belts', 'are', 'lua', ',', 'purple', 'and', 'onwards', 'use', 'c', '#', '.']]
Stop Word Removal
Stopwords are words that add little or no information to your dataset; removing them is called stop word removal. For example, in the sentence “I will play in the evening”, the words will, in, and the don't carry much meaning, so we remove them. We will use the stopwords corpus from the nltk library to filter out these unnecessary words.
for i in range(len(tokenized_list)):
    words = tokenized_list[i]
    words = [w for w in words if w not in nltk.corpus.stopwords.words("english")]
    tokenized_list[i] = words
print(tokenized_list)
[['code', 'ninjas', 'buildings', 'separated', 'dojos', 'lobbies', '.'], ['lobbies', 'also', 'parents', 'pick', 'kids', '.'], ['different', 'belts', 'different', 'coding', 'languages', 'e.g', '.'], ['white', 'belts', 'javascript', ',', 'blue', 'belts', 'lua', ',', 'purple', 'onwards', 'use', 'c', '#', '.']]
Lemmatization
Lemmatization is the process of replacing a word with its root word, that is, its simplest dictionary form. For example, “building” and “buildings” mean essentially the same thing, but the computer would treat them as different words, so we replace each with its root form.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for i in range(len(tokenized_list)):
    for j in range(len(tokenized_list[i])):
        tokenized_list[i][j] = lemmatizer.lemmatize(tokenized_list[i][j])
print(tokenized_list)
[['code', 'ninja', 'building', 'separated', 'dojos', 'lobby', '.'], ['lobby', 'also', 'parent', 'pick', 'kid', '.'], ['different', 'belt', 'different', 'coding', 'language', 'e.g', '.'], ['white', 'belt', 'javascript', ',', 'blue', 'belt', 'lua', ',', 'purple', 'onwards', 'use', 'c', '#', '.']]
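Lemmatization is often mentioned alongside stemming, which chops suffixes off by rule instead of looking up a dictionary form, and can therefore produce non-words. Here is a quick side-by-side sketch using NLTK's PorterStemmer, shown only for comparison and not part of the pipeline above:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["buildings", "languages", "lobbies"]:
    # the stemmer may return non-words (e.g. "languag"); the lemmatizer returns dictionary forms
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))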
POS Tagging
POS tagging (part-of-speech tagging) is the process of marking up each word in a text with a particular part of speech, based on both its definition and its context. A POS tagger reads text in a language and assigns a specific tag (part of speech) to each word. It is also called grammatical tagging.
for i in range(len(tokenized_list)):
    print(nltk.pos_tag(tokenized_list[i]))
[('code', 'NN'), ('ninja', 'IN'), ('building', 'NN'), ('separated', 'VBN'), ('dojos', 'JJ'), ('lobby', 'NN'), ('.', '.')]
[('lobby', 'NN'), ('also', 'RB'), ('parent', 'NN'), ('pick', 'NN'), ('kid', 'NN'), ('.', '.')]
[('different', 'JJ'), ('belt', 'VBD'), ('different', 'JJ'), ('coding', 'NN'), ('language', 'NN'), ('e.g', 'NN'), ('.', '.')]
[('white', 'JJ'), ('belt', 'NN'), ('javascript', 'NN'), (',', ','), ('blue', 'JJ'), ('belt', 'NN'), ('lua', 'NN'), (',', ','), ('purple', 'NN'), ('onwards', 'NNS'), ('use', 'VBP'), ('c', 'JJ'), ('#', '#'), ('.', '.')]
Some of the abbreviations and their meanings:

| Abbreviation | Meaning |
| --- | --- |
| IN | preposition/subordinating conjunction |
| NN | noun, singular (cat, tree) |
| VBP | verb, present tense, not 3rd person singular (wrap) |
| NNS | noun, plural (desks) |
| JJ | adjective (large) |
| RB | adverb (occasionally, swiftly) |
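One practical payoff of POS tagging: WordNetLemmatizer treats every word as a noun unless told otherwise, so passing it the POS tag yields better lemmas. Below is a sketch of the usual mapping from Penn Treebank tags to WordNet's POS constants; the to_wordnet_pos helper is our own, not an NLTK function.

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    # our own helper: map a Penn Treebank tag to a WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun, as the lemmatizer itself does

lemmatizer = WordNetLemmatizer()
# 'separated' was tagged VBN above; lemmatizing it as a verb gives 'separate'
print(lemmatizer.lemmatize('separated', pos=to_wordnet_pos('VBN')))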