Table of contents
1. Introduction
2. Methods in NLTK
3. FAQs
4. Key Takeaways
Last Updated: Mar 27, 2024

Methods in NLTK

Author Prakriti

Introduction

Natural Language Processing (NLP) aims to understand and interpret human language. The Natural Language Toolkit (NLTK) is a Python package that bundles multiple datasets and pre-trained models, making common NLP tasks easier to carry out. NLTK helps with tokenizing text into sentences and words, removing stopwords, searching, counting, plotting frequency distributions, and more. If you are new to NLTK, feel free to refer to this blog to learn more about NLTK, its use cases, and installation.

Methods in NLTK

Corpus operations

import nltk
nltk.download('brown')
from nltk.corpus import brown
print("Words are")
print(brown.words())
print("Total number of words is",len(brown.words()))

print("Sentences are")
print(brown.sents())
print("Total number of sentences is",len(brown.sents()))

print("Fileids are")
print(brown.fileids()[:10])  # the Brown corpus has 500 files; show only the first ten

Output

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Words are
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
Total number of words is 1161192
Sentences are
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
Total number of sentences is 57340
Fileids are
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']

We can also access the data using fileids.

print(brown.sents(fileids='ca01'))  # sentences from the file 'ca01' only

Output

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
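
To make the relationship between words(), sents(), and fileids more concrete, here is a toy model in plain Python. It is only a sketch and not how NLTK's corpus readers are implemented: the dictionary TOY_CORPUS, the two tiny made-up "files", and the helper functions sents and words are all hypothetical stand-ins used for illustration.

```python
# Toy stand-in for a corpus (hypothetical data, not the Brown corpus):
# each made-up fileid maps to a list of already-tokenized sentences.
TOY_CORPUS = {
    "doc01": [["The", "jury", "said", "."], ["It", "met", "Friday", "."]],
    "doc02": [["No", "evidence", "was", "found", "."]],
}


def sents(fileids=None):
    """Return the sentences from the given fileid(s), or from every file."""
    if fileids is None:
        fileids = list(TOY_CORPUS)
    elif isinstance(fileids, str):
        fileids = [fileids]
    return [sentence for fid in fileids for sentence in TOY_CORPUS[fid]]


def words(fileids=None):
    """Flatten the selected sentences into a single list of tokens."""
    return [token for sentence in sents(fileids) for token in sentence]


print("Total number of words is", len(words()))
print("Total number of sentences is", len(sents()))
print(sents(fileids="doc01"))
```

The point of the sketch is that the word list is just the sentence list flattened, and that passing a fileid narrows both views to a single file.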

If we have our own text file, we can use an NLTK corpus reader to read and process it.

from nltk.corpus import PlaintextCorpusReader
corp_txt = PlaintextCorpusReader(r'C:/', 'NLTK.txt')  # arguments: the directory containing the file, then the file name
# corp_txt can now be used just like brown above, e.g. corp_txt.words()

 

Searching

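We can search for a word in a corpus using the concordance() method of nltk.Text, which shows every occurrence of the word together with its surrounding context. The similar() method lists words that appear in similar contexts, and plot() draws the frequencies of the most common words.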

import nltk
from nltk.corpus import brown

text = nltk.Text(brown.words())
print("Concordance:")
text.concordance("news")
print()
print("Distributionally similar words:")
text.similar("news")
print()
print("Vocabulary plot:")
text.plot(20)

Output

Concordance:

Displaying 25 of 102 matches:

altimore's Florida Grapefruit League news ripened considerably late today when

. The appointment was announced at a news conference at which Skorich said he 

 who will be the unhappiest over the news that Musial probably will sit out mo

ts win a dramatic pennant . Romantic news concerns Mrs. Joan Monroe Armour and

entific training , he added . `` The news of their experiments reaches the far

resented in the averages . Some good news Although it looked like a routine te

ters it was accompanied by some good news . A substantial rise in new orders a

 New York Social Register which made news last week . Published annually by Wi

ut to President Kennedy at his first news conference last January was about hi

ead as an exaggeration ( see foreign news ) , and the U.S. was agreeing with i

 to his private office and broke the news : he would lead the fight to oust Co

to regard as `` an inferior man '' . News of Rayburn's commitment soon leaked 

 moment . In 1920 , as the startling news that the 1919 White Sox had conspire

embles . Soon after Loper leaked the news that Frankie had ordered `` two of e

ation . The editorial was based on a news association dispatch which said that

d . According to The Chicago Tribune News Service , State Atty. Gen. Stanley M

unese , much of each day's deluge of news will become clearer . At least , I h

stify its arms drive '' . The Soviet news agency TASS datelined from New York 

 Providence Journal is desperate for news . Usually a veteran has to hang hims

s Downers Grove , Aug. 8 -- A recent news story reported that Frank Sinatra an

ey should be examples . Church finds news features are helpful to the editor :

as made of material from The Detroit News on the King James version of the New

y helpful . We feel that The Detroit News is to be complimented upon arranging

enn T. Seaborg , `` admitted '' to a news conference in Las Vegas , Nevada , t

d . Mercenary : term of honor ? ? In news broadcasts I consistently hear the f



Distributionally similar words:

time house years state one night man people world door place and

audience word children idea president trial way heart



Vocabulary plot:

[Plot: frequencies of the 20 most common tokens in the corpus]
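
To see what a concordance search does, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name concordance, its width parameter, and the sample sentence are assumptions made for illustration. Like the output above, where both "news" and "News" are matched, the sketch compares words case-insensitively.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the matched word with brackets so it stands out in the line.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Made-up sample text, split on whitespace for simplicity.
tokens = "Some good news arrived today and the news spread quickly".split()
for line in concordance(tokens, "news"):
    print(line)
```

NLTK's version works on character widths rather than token counts, which is why its lines start and end mid-word.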

Counting

We can also count how often a particular word occurs in a corpus, compute its relative frequency, and count the total number of tokens using the FreqDist class in NLTK.

from nltk.probability import FreqDist
fdist = FreqDist(brown.words()[0:50])  # use only the first 50 tokens of the corpus
print("fdist")
print(fdist)
print()
print("Total number of tokens is", fdist.N())
print("The number of times said occurred in the corpus is",fdist['said'])
print("The relative frequency of said is",fdist.freq('said'))

Output

Total number of tokens is 50
The number of times said occurred in the corpus is 2
The relative frequency of said is 0.04

fdist

FreqDist({"''": 1,

          ',': 2,

          '.': 1,

          "Atlanta's": 1,

          'City': 1,

          'Committee': 1,

          'County': 1,

          'Executive': 1,

          'Friday': 1,

          'Fulton': 1,

          'Grand': 1,

          'Jury': 1,

          'The': 2,

          '``': 2,

          'an': 1,

          'any': 1,

          'charge': 1,

          'deserves': 1,

          'election': 2,

          'evidence': 1,

          'further': 1,

          'had': 1,

          'in': 1,

          'investigation': 1,

          'irregularities': 1,

          'jury': 1,

          'no': 1,

          'of': 2,

          'over-all': 1,

          'place': 1,

          'praise': 1,

          'presentments': 1,

          'primary': 1,

          'produced': 1,

          'recent': 1,

          'said': 2,

          'term-end': 1,

          'that': 2,

          'the': 3,

          'took': 1,

          'which': 1})
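
FreqDist is a subclass of Python's collections.Counter, so the three numbers above can be reproduced with a plain Counter. The snippet below is a small sketch on a made-up token list (not the Brown corpus), and the variable names are our own.

```python
from collections import Counter

# Made-up tokens for illustration only.
tokens = ["the", "jury", "said", "the", "election", "was", "fair", ",",
          "the", "jury", "said"]
counts = Counter(tokens)

total = sum(counts.values())    # plays the role of fdist.N()
said_count = counts["said"]     # plays the role of fdist['said']
said_freq = said_count / total  # plays the role of fdist.freq('said')

print("Total number of tokens is", total)
print("The number of times said occurred is", said_count)
print("The relative frequency of said is", round(said_freq, 4))
print("Most common token:", counts.most_common(1))
```

The relative frequency is simply the count of a word divided by the total number of tokens, which is why 2 occurrences in 50 tokens gives 0.04 in the output above.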

Plotting the frequency of words

fdist.plot(30)

Output

[Plot: counts of the 30 most frequent tokens in fdist]

Lexical Dispersion Plot

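A lexical dispersion plot shows where each chosen word occurs across a text. The x-axis is the word offset from the start of the text, and each word gets its own row of marks, so we can see at a glance where a word is used heavily and where it is absent.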

from nltk.draw.dispersion import dispersion_plot
import nltk
from nltk.corpus import brown
text = nltk.Text(brown.words())
dispersion_plot(text, ['news','said'], ignore_case=True, title='Lexical Dispersion Plot')

Output

[Plot: lexical dispersion of 'news' and 'said' across the Brown corpus]
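
Each mark in a dispersion plot corresponds to a token position at which the word occurs. The sketch below computes those positions in plain Python without drawing anything. It is an illustration under our own assumptions, not NLTK's code: the function name word_offsets and the sample sentence are hypothetical.

```python
def word_offsets(tokens, targets, ignore_case=True):
    """Map each target word to the list of token positions where it occurs."""
    def norm(word):
        return word.lower() if ignore_case else word

    offsets = {target: [] for target in targets}
    lookup = {norm(target): target for target in targets}
    for position, token in enumerate(tokens):
        key = lookup.get(norm(token))
        if key is not None:
            offsets[key].append(position)
    return offsets


# Made-up sample text, split on whitespace for simplicity.
tokens = "The news said little but the News spread and he said nothing".split()
print(word_offsets(tokens, ["news", "said"]))
```

Plotting these positions on the x-axis, with one row per word, gives the same kind of picture that dispersion_plot draws.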

FAQs

1. What is NLTK in data science?
The Natural Language Toolkit (NLTK) is a Python package used for NLP tasks such as removing stopwords, tokenizing text, etc.

2. Is NLTK a part of NLP?
NLTK is a tool for doing NLP rather than a part of NLP itself. Natural Language Processing (NLP) aims to understand and interpret human language to perform tasks such as language translation and automatic question answering. The NLTK package provides various modules for performing NLP tasks in Python.

3. What is tokenize in NLTK?
NLTK contains a tokenize module, which provides sent_tokenize() for splitting a text into sentences and word_tokenize() for splitting a sentence into words.

4. How do I use NLTK in Python?
You can use Google Colab to easily use NLTK in Python. Run pip install nltk and then import nltk.

5. What is a lexical dispersion plot?
A lexical dispersion plot shows where given words occur across a text, which makes it easy to see how their usage is distributed.
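
To illustrate the tokenization question above, here is a rough approximation using regular expressions. This is not what sent_tokenize() and word_tokenize() do internally; NLTK's tokenizers handle abbreviations, contractions, and other cases that this naive sketch gets wrong. The function names simple_sent_tokenize and simple_word_tokenize are hypothetical.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (naive; breaks on abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(sentence):
    """Separate runs of word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "NLTK is a Python package. It helps with NLP tasks!"
for sentence in simple_sent_tokenize(text):
    print(simple_word_tokenize(sentence))
```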

Key Takeaways

This article discussed several methods available in NLTK: corpus operations, searching, counting, plotting word frequencies, and lexical dispersion plots.

We hope this blog has helped you enhance your knowledge of the NLTK package. If you would like to learn more, check out our free content on NLP.

Happy Coding!
