Applications of NLP
- Machine translation, e.g., Google Translate
- Information retrieval
- Question answering, e.g., chatbots
- Summarization
- Sentiment analysis
- Social media analysis
- Mining large datasets
Text Classification
Text classification is one of the essential tasks in supervised machine learning (ML). It is the technique of assigning tags/categories to documents so that material can be organized and evaluated automatically and cost-effectively. It is a central problem in Natural Language Processing with many applications, including sentiment analysis, spam detection, topic labeling, and intent detection.
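To make the idea concrete, here is a minimal, self-contained sketch of a text classifier using scikit-learn. The sentences and labels are made-up toy examples, not the article's dataset:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

#toy documents with made-up sentiment labels
texts = ["great product, loved it", "terrible service, very slow",
         "loved the fast delivery", "awful, would not recommend"]
labels = ["positive", "negative", "positive", "negative"]

#vectorize the text and fit a Naive Bayes classifier in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["slow and terrible"])) #likely ['negative']
The rest of this article walks through each stage of this pipeline in detail on a real dataset.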
Steps involved in Text Classification
Let's break the classification problem into steps:
- Setup: importing libraries
- Loading the dataset and exploratory data analysis (EDA)
- Text preprocessing
- Extracting vectors from text (vectorization)
- Applying ML algorithms
- Conclusion
Importing Libraries
import pandas as pd
import numpy as np
#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
#for text pre-processing
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
#for model-building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
# bag of words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#for word embedding
import gensim
from gensim.models import Word2Vec

Loading the Dataset and EDA
#loading the dataset (a Twitter sentiment CSV with 'clean_text' and 'category' columns)
data = pd.read_csv("Twitter_Data.csv")
data.head(5)

#checking for missing data (null data)
data.isnull().sum()
data.shape

#dropping missing data
data.dropna(axis=0, inplace=True)
data.shape #data dimensions

#mapping tweet categories
data['category'] = data['category'].map({-1.0:'Negative', 0.0:'Neutral', 1.0:'Positive'})
#distribution of sentiments
data.groupby('category').count().plot(kind='bar')


#plotting the distribution of text length for positive sentiment tweets
fig = plt.figure(figsize=(14,7))
data['length'] = data.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(data[data['category']=='Positive']['length'], ax=ax1,color='green')
describe = data.length[data.category=='Positive'].describe().to_frame().round(2)

ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size=14
bbox = [0,0,1,1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox, colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for positive sentiment tweets.', fontsize=16)
plt.show()


#plotting the distribution of text length for negative sentiment tweets
fig = plt.figure(figsize=(14,7))
data['length'] = data.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(data[data['category']=='Negative']['length'], ax=ax1,color='red')
describe = data.length[data.category=='Negative'].describe().to_frame().round(2)

ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size=14
bbox = [0,0,1,1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox, colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for negative sentiment tweets.', fontsize=16)
plt.show()


#counting tweets per sentiment for the pie chart
labels = ['Negative', 'Neutral', 'Positive']
sizes = []
colors = ['red', 'yellow', 'green']
p = 0
n = 0
N = 0
for i in data['category']:
    if i == 'Negative':
        n += 1
    elif i == 'Positive':
        p += 1
    else:
        N += 1
sizes.append(n)
sizes.append(N)
sizes.append(p)

#pie chart for tweets
explode = (0.05, 0.05, 0.05)
plt.pie(sizes,explode = explode,colors=colors,labels=labels,autopct='%1.1f%%',shadow=True,startangle=90)
plt.axis('equal')
plt.title("Tweets Distribution")
plt.show()


Text Preprocessing
Before building the model, we must preprocess the dataset: clean the texts by eliminating punctuation and special characters, delete stop words, and apply stemming or lemmatization.
Some of the most common text cleaning steps are:
- Removing punctuation, special characters, URLs, and hashtags
- Removing leading, trailing, and extra white spaces/tabs
- Correcting typos and slang, and expanding abbreviations to their full forms
- Stop-word removal: nltk provides a list of generic English stop words, such as 'I', 'you', 'a', 'the', 'he', and 'which', that can be filtered out.
- Stemming: slicing the end or beginning of words to remove affixes (prefixes/suffixes).
- Lemmatization: reducing a word to its base (dictionary) form; see the comparison with stemming right after this list.
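To see the difference between the two, here is a small sketch using the nltk classes imported above; the example words are our own:
#comparing stemming and lemmatization on a few example words
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "running", "ate"]:
    #e.g., 'studies' -> stem 'studi' vs. lemma 'study'
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))
Stemming simply chops affixes, so it can produce non-words like 'studi', while lemmatization maps to a valid dictionary form.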
def tweet_to_words(tweet):
    text = tweet.lower() #make all letters lowercase
    text = re.sub(r"[^a-zA-Z0-9]", " ", text) #remove non-alphanumeric characters
    words = text.split() #tokenize
    words = [w for w in words if w not in stopwords.words("english")] #remove stopwords
    words = [PorterStemmer().stem(w) for w in words] #apply stemming
    return words
print("\nOriginal tweet -> ", data['clean_text'][0])
print("\nProcessed tweet -> ", tweet_to_words(data['clean_text'][0]))

Output
Original tweet -> when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples
Processed tweet -> ['modi', 'promis', 'minimum', 'govern', 'maximum', 'govern', 'expect', 'begin', 'difficult', 'job', 'reform', 'state', 'take', 'year', 'get', 'justic', 'state', 'busi', 'exit', 'psu', 'templ']
X = list(map(tweet_to_words, data['clean_text']))
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(data['category'])
#splitting into train and test sets before vectorization
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Bag of Words
Working with text data when creating machine learning models is difficult because these models require well-defined numerical input. Vectorization, also called word embedding in the NLP field, is the process of converting text data into numerical vectors. Two well-known methods for converting text data to numerical data are Bag-of-Words (BoW) and word embeddings such as Word2Vec.
#bag of words
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(max_features=5000, preprocessor=lambda x: x, tokenizer=lambda x: x)
X_train = count_vector.fit_transform(X_train).toarray()
X_test = count_vector.transform(X_test).toarray() #use transform (not fit_transform) on the test set
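TfidfVectorizer was imported earlier but never used; as a minimal alternative sketch, TF-IDF weighting can stand in for raw counts (the commented lines would replace the two CountVectorizer transforms above):
#alternative: TF-IDF weighting instead of raw counts
tfidf_vector = TfidfVectorizer(max_features=5000, preprocessor=lambda x: x, tokenizer=lambda x: x)
#X_train = tfidf_vector.fit_transform(X_train).toarray()
#X_test = tfidf_vector.transform(X_test).toarray()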

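The gensim import at the top points at the second vectorization approach, Word2Vec. The snippet below is a minimal sketch trained on the tokenized tweets in X from the preprocessing step; the hyperparameter values are illustrative, not tuned:
#training a Word2Vec embedding model on the tokenized tweets (illustrative settings)
w2v_model = Word2Vec(sentences=X, vector_size=100, window=5, min_count=2, workers=4)
#each word now maps to a 100-dimensional vector
#(assumes 'modi' survives preprocessing and appears at least min_count times)
print(w2v_model.wv['modi'][:5]) #first few dimensions of one word's vector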
#for tokenization and padding (Keras)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_words = 5000
max_len = 50

Tokenization
def tokenize_pad_sequences(text):
    '''
    Tokenizes the input text into sequences of integers and then
    pads each sequence to the same length.
    '''
    # Text tokenization
    tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
    tokenizer.fit_on_texts(text)
    # Transform text to a sequence of integers
    X = tokenizer.texts_to_sequences(text)
    # Pad sequences to the same length
    X = pad_sequences(X, padding='post', maxlen=max_len)
    # Return sequences and the fitted tokenizer
    return X, tokenizer

print('Before Tokenization & Padding \n', data['clean_text'][0])
X, tokenizer = tokenize_pad_sequences(data['clean_text'])
print('After Tokenization & Padding \n', X[0])

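The model-building imports at the top (MultinomialNB, classification_report) point at the final step, applying an ML algorithm. As a minimal sketch, a Naive Bayes classifier can be trained on the Bag-of-Words features built earlier (assuming X_train, X_test, Y_train, and Y_test from the split above):
#training a Naive Bayes classifier on the Bag-of-Words features
nb = MultinomialNB()
nb.fit(X_train, Y_train)
Y_pred = nb.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred, target_names=le.classes_))
Naive Bayes is a common baseline for text classification; the LogisticRegression import suggests it could be swapped in the same way.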

Frequently Asked Questions
1. What exactly is a text classification problem?
Text classification is a supervised learning problem that uses Machine Learning and Natural Language Processing to organize text/tokens into predefined categories.
2. What is the purpose of text classification?
Classifying vast amounts of textual material aids standardization across a platform, makes search more efficient and relevant, and enhances the user experience by making navigation easier. Machine learning and deep learning are also making inroads into conventional sectors where this was previously inconceivable.
3. What is the significance of text preprocessing in NLP?
Text data, in addition to numerical data, is widely available and is used to analyze and solve business problems. However, before you can use the data for analysis or prediction, you must first process it. Text preprocessing prepares text data for model building.
4. In NLP, what is the proper order for preprocessing?
A typical order is: tokenization, lowercasing, stop-word removal, and then stemming or lemmatization.
Key takeaways
Let us briefly recap the article.
First, we built a basic understanding of NLP and its applications. Then, we saw what text classification is and the steps involved in it. Lastly, we implemented those steps for text classification on the Twitter dataset, explaining some of the essential concepts along the way. That's all from the article. I hope you liked it.
Check out this problem - First Missing Positive
Want to learn more about Data Analysis? Here is an excellent course that can guide you in learning.
Happy Learning, Ninjas!