Table of contents
1. Introduction
2. Meaning Of NLP
3. Applications Of NLP
4. Text Classification
5. Steps involved in Text Classification
6. Importing Libraries
7. Loading the Dataset and EDA
8. Text Preprocessing
8.1. Bag of Words
8.2. Tokenization
9. Frequently Asked Questions
10. Key takeaways
Last Updated: Mar 27, 2024

Text Classification in NLP

Author: Mayank Goyal

Introduction

Natural Language Processing (NLP) is a broad field of study that combines artificial intelligence, computer science, and linguistics. It covers a wide range of topics, including named entity recognition, machine translation, and question answering, all of which have intriguing real-world applications. Each of these subjects takes its own approach to textual data.

Text and speech are the most common kinds of unstructured data. There is a great deal of it, but extracting relevant information from it is difficult, and mining it by hand would take a very long time. Written text and spoken language both carry a wealth of information, because, as intelligent beings, we communicate primarily through writing and speaking. Sentiment analysis, cognitive assistants, spam filtering, fake-news detection, and real-time language translation are all tasks that NLP can perform for humans.

Meaning Of NLP

NLP is commonly used in automobiles, smartphones, smart speakers, computers, and web pages. Google Translate is a machine translation system built on natural language processing. Early versions translated the languages users wanted crudely, more or less word by word; NLP now helps Google Translate comprehend words in context, remove extraneous noise, and use neural networks to understand native speech.

NLP is often used in chatbots. Chatbots are beneficial because they reduce the need for a human agent to ask about each customer's needs. A natural language processing chatbot can ask a series of questions, such as what the user's problem is and where a solution can be found. Apple and Amazon both have comprehensive chatbot-style assistants in place. When a user asks a query, the chatbot turns it into intelligible units referred to as tokens. The tokens are then analyzed with natural language processing to determine what the user is asking.

NLP is also employed in information retrieval (IR). An IR system deals with storing and evaluating information from large repositories of text documents, and it retrieves only the relevant information. It is used in Google Voice Detection, for example, to cut out unnecessary words.

Applications Of NLP

  • Machine translation, e.g., Google Translate
  • Information retrieval
  • Question answering, e.g., chatbots
  • Summarization
  • Sentiment analysis
  • Social media analysis
  • Mining large volumes of text data

Text Classification

One of the essential tasks in supervised machine learning (ML) is text classification. It is the technique of assigning tags/categories to documents in order to organize and evaluate text automatically and cost-effectively. It is a fundamental problem in Natural Language Processing with many applications, including sentiment analysis, spam detection, topic labeling, and intent detection.
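To make this concrete, here is a minimal, self-contained sketch of a text classifier; the toy documents and labels are purely illustrative and not taken from the dataset used later:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#toy corpus and tags (hypothetical examples)
docs = ["free prize, click now", "meeting at 10 am",
        "win money fast", "project update attached"]
tags = ["spam", "ham", "spam", "ham"]

#vectorize the text and fit a Naive Bayes classifier in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, tags)
print(clf.predict(["claim your free money"]))  #most likely ['spam']

This same pattern, vectorize then classify, is what the rest of the article walks through on real tweets.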

Steps involved in Text Classification

Let's break down the classification problem into steps:

  1. Setup: importing libraries
  2. Loading the dataset and exploratory data analysis (EDA)
  3. Preparing the text
  4. Extracting vectors from text (vectorization)
  5. Applying ML algorithms
  6. Conclusion

Importing Libraries

import pandas as pd
import numpy as np
#for visualization
import matplotlib.pyplot as plt
import seaborn as sns
#for text pre-processing
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer, PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
#for model-building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
# bag of words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#for word embedding
import gensim
from gensim.models import Word2Vec

Loading the Dataset and EDA

#loading dataset
data = pd.read_csv("Twitter_Data.csv")
data.head(5)
#checking for missing data (null data)
data.isnull().sum()
data.shape
#dropping missing data
data.dropna(axis=0, inplace=True)
data.shape  #data dimensions
#mapping tweet categories
data['category'] = data['category'].map({-1.0:'Negative', 0.0:'Neutral', 1.0:'Positive'})

#distribution of sentiments
data.groupby('category').count().plot(kind='bar')

#plotting the distribution of text length for positive sentiment tweets
fig = plt.figure(figsize=(14,7))
data['length'] = data.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(data[data['category']=='Positive']['length'], ax=ax1,color='green')
describe = data.length[data.category=='Positive'].describe().to_frame().round(2)
ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size=14
bbox = [0,0,1,1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox, colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for positive sentiment tweets.', fontsize=16)
plt.show()

#plotting the distribution of text length for negative sentiment tweets
fig = plt.figure(figsize=(14,7))
data['length'] = data.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(data[data['category']=='Negative']['length'], ax=ax1,color='red')
describe = data.length[data.category=='Negative'].describe().to_frame().round(2)
ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size=14
bbox = [0,0,1,1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox, colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for negative sentiment tweets.', fontsize=16)
plt.show()
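The positive and negative plots above differ only in the category and the color, so, as an optional refactor, the duplicated figure code can be folded into one helper (the function name here is ours, not from the original article):

def plot_length_distribution(df, category, color):
    #histogram of tweet lengths plus a summary-statistics table for one sentiment
    fig = plt.figure(figsize=(14,7))
    ax1 = fig.add_subplot(122)
    sns.histplot(df[df['category']==category]['length'], ax=ax1, color=color)
    describe = df.length[df.category==category].describe().to_frame().round(2)
    ax2 = fig.add_subplot(121)
    ax2.axis('off')
    table = ax2.table(cellText=describe.values, rowLabels=describe.index,
                      bbox=[0,0,1,1], colLabels=describe.columns)
    table.set_fontsize(14)
    fig.suptitle(f'Distribution of text length for {category.lower()} sentiment tweets.', fontsize=16)
    plt.show()

plot_length_distribution(data, 'Positive', 'green')
plot_length_distribution(data, 'Negative', 'red')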

#count tweets in each sentiment class
labels = ['Negative', 'Neutral', 'Positive']
sizes = []
colors = ['red', 'yellow', 'green']
p = 0  #positive count
n = 0  #negative count
N = 0  #neutral count
for i in data['category']:
    if i=='Negative':
        n+=1
    elif i=='Positive':
        p+=1
    else:
        N+=1
sizes.append(n)
sizes.append(N)
sizes.append(p)
#pie chart for tweets
explode = (0.05, 0.05, 0.05)
plt.pie(sizes,explode = explode,colors=colors,labels=labels,autopct='%1.1f%%',shadow=True,startangle=90)
plt.axis('equal')
plt.title("Tweets Distribution")
plt.show()
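As a side note, the counting loop above can be replaced by pandas' built-in value_counts, which produces the same numbers in a couple of lines:

#count tweets per sentiment with pandas instead of a manual loop
counts = data['category'].value_counts()
sizes = [counts['Negative'], counts['Neutral'], counts['Positive']]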

Text Preprocessing

Before model construction, we must first preprocess our dataset: eliminating punctuation and special characters, cleaning the text, removing stop words, and applying stemming or lemmatization.

The following are some of the most common text cleaning steps:

  • Removing punctuation, special characters, URLs, and hashtags
  • Removing leading, trailing, and extra white spaces/tabs
  • Correcting typos and slang, and writing abbreviations out in their full forms
  • Stop-word removal: nltk can remove a list of generic stop words from the English lexicon, such as 'I', 'you', 'a', 'the', 'he', and 'which'
  • Stemming: slicing the end or beginning of words to remove affixes (prefixes/suffixes)
  • Lemmatization: reducing a word to its simplest dictionary form; the short comparison below illustrates how it differs from stemming
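Here is a small, self-contained comparison of the two (the word list is illustrative):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "reforming", "governance", "promised"]:
    #stemming chops affixes off; lemmatization maps to a dictionary form (noun by default)
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))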

def tweet_to_words(tweet):
    text = tweet.lower() #convert all letters to lowercase
    text = re.sub(r"[^a-zA-Z0-9]", " ", text) #remove non-letters
    words = text.split() #tokenize
    words = [w for w in words if w not in stopwords.words("english")] #remove stopwords
    words = [PorterStemmer().stem(w) for w in words] #apply stemming
    return words

print("\nOriginal tweet -> ", data['clean_text'][0])
print("\nProcessed tweet -> ", tweet_to_words(data['clean_text'][0]))

 

Output

Original tweet ->  when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples

Processed tweet ->  a list of lowercased, stemmed tokens with the stop words removed

#apply preprocessing to every tweet
X = list(map(tweet_to_words, data['clean_text']))

#encode the three sentiment labels as integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(data['category'])

Bag of Words

Working with text data when creating machine learning models is tough, since these models require well-defined numerical data. The process of converting text data into numerical vectors is called vectorization (in NLP, the related term word embedding is also used). Two well-known methods for translating text data into numerical data are Bag-of-Words (BoW) and word embeddings such as Word2Vec.

#bag of words
from sklearn.feature_extraction.text import CountVectorizer
#split the tokenized tweets and labels before vectorizing (an 80/20 split; the ratio is our choice)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
count_vector = CountVectorizer(max_features=5000, preprocessor=lambda x: x, tokenizer=lambda x: x)
#fit the vocabulary on the training set only, then reuse it to transform the test set
X_train = count_vector.fit_transform(X_train).toarray()
X_test = count_vector.transform(X_test).toarray()
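Step 5 of our outline, applying ML algorithms, can now run on these vectors. MultinomialNB and the metric helpers were already imported at the top, so a minimal sketch of the modelling step looks like this (hyperparameters are left at their defaults):

#train Multinomial Naive Bayes on the Bag-of-Words vectors
nb = MultinomialNB()
nb.fit(X_train, Y_train)
Y_pred = nb.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))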

 

#for sequence tokenization and padding (used in the next section)
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_words = 5000  #vocabulary size for the tokenizer
max_len = 50      #length every sequence is padded or truncated to

Tokenization

def tokenize_pad_sequences(text):
    '''
    This function tokenizes the input text into sequences of integers and then
    pads each sequence to the same length.
    '''
    # Text tokenization
    tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
    tokenizer.fit_on_texts(text)
    # Transform text to sequences of integers
    X = tokenizer.texts_to_sequences(text)
    # Pad sequences to the same length
    X = pad_sequences(X, padding='post', maxlen=max_len)
    return X, tokenizer
print('Before Tokenization & Padding \n', data['clean_text'][0])
X, tokenizer = tokenize_pad_sequences(data['clean_text'])
print('After Tokenization & Padding \n', X[0])
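Finally, the gensim Word2Vec import at the top was never used. As an optional sketch, the stemmed token lists from the preprocessing step can be used to train word embeddings; the parameters below are illustrative defaults for gensim 4.x, not values from the original article:

#re-create the stemmed token lists from the preprocessing step (X was overwritten above)
tokens = list(map(tweet_to_words, data['clean_text']))
#train a small Word2Vec model on the tokenized tweets
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=2, workers=4)
#each word in the vocabulary now has a 100-dimensional vector; similar words lie close together
print(w2v.wv.most_similar('modi', topn=5))  #works only if 'modi' made it into the vocabulary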

Frequently Asked Questions

1. What exactly is a text classification problem?
Text classification is a supervised learning problem that uses machine learning and natural language processing to organize text/tokens into predefined categories.

2. What is the purpose of text classification?
Classifying vast amounts of textual material aids platform standardization, makes search more efficient and relevant, and enhances user experience by making navigation easier. Machine learning and deep learning keep making inroads into sectors where this was previously inconceivable.

3. What is the significance of text preprocessing in NLP?
Text data, in addition to numerical data, is widely available and is used to analyze and solve business challenges. However, before you can use the data for analysis or prediction, you must first process it. Text preprocessing prepares text data for model building.

4. In NLP, what is the proper order for preprocessing?
A typical order is: tokenization, lowercasing, stop-word removal, and then stemming or lemmatization.

Key takeaways

Let us briefly sum up the article.

First, we built a basic understanding of NLP and its applications. Then we saw what text classification is and the steps involved in it. Finally, we implemented those steps for text classification on the Twitter dataset, explaining some of the essential concepts along the way. That's all from the article. I hope you liked it.

Want to learn more about Data Analysis? Here is an excellent course that can guide you in learning.

Happy Learning, Ninjas!

 
