Table of contents
1. Introduction
2. Why do we need Chunking?
3. Types of Chunking
3.1. Chunking Up
3.2. Chunking Down
4. Implementation of chunking in Python
4.1. Python
5. Regular Expressions
6. Chunking in Python
7. Frequently Asked Questions
7.1. What are the different types of Chunking in NLP?
7.2. Why is Chunking important?
7.3. What is the difference between chunk and phrase?
7.4. What is RegexpParser?
8. Conclusion
Last Updated: Mar 27, 2024

Chunking in NLP (Natural Language Processing)

Introduction

Chunking is the process of extracting phrases from unstructured text by analysing a sentence and identifying its constituents (noun groups, verbs, verb groups, etc.). However, it does not describe their internal structure or their role in the main sentence.

Recall from school grammar that there are eight parts of speech: noun, verb, adjective, adverb, preposition, conjunction, pronoun, and interjection. The short phrases mentioned in the definition of chunking above are formed by combining these parts of speech.

Chunking can be used to identify and group nouns or noun phrases, adjectives or adjective phrases, and so on.

Consider the following sentence:

"I had my breakfast, lunch and dinner."

In this case, if we wish to group or chunk noun phrases, we will get "breakfast", "lunch", and "dinner", which are the nouns or noun groups of the sentence. 
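
As a rough illustration of this grouping, the sketch below collects runs of adjacent noun-tagged words from the example sentence. The part-of-speech tags are assigned by hand for this illustration (an assumption, not the output of a tagger), and noun_groups is a hypothetical helper name.

```python
from itertools import groupby

# Hand-assigned (word, tag) pairs for the example sentence; a real tagger may differ.
tagged_sentence = [
    ("I", "PRP"), ("had", "VBD"), ("my", "PRP$"), ("breakfast", "NN"),
    (",", ","), ("lunch", "NN"), ("and", "CC"), ("dinner", "NN"), (".", "."),
]

def noun_groups(tagged_tokens):
    """Group runs of adjacent tokens whose tag starts with 'NN'."""
    groups = []
    is_noun_tag = lambda pair: pair[1].startswith("NN")
    for is_noun, run in groupby(tagged_tokens, key=is_noun_tag):
        if is_noun:
            # Join the words of one uninterrupted run of nouns into a single chunk.
            groups.append(" ".join(word for word, _ in run))
    return groups

print(noun_groups(tagged_sentence))  # ['breakfast', 'lunch', 'dinner']
```

Adjacent nouns end up in the same group, so a pair such as ("coding", "NN"), ("ninja", "NN") would come out as the single chunk "coding ninja".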

Why do we need Chunking? 

A sentence can contain several different entities, such as a person, a date, and a location. The individual words that make up these entities are of little use on their own.

Chunking breaks sentences down into phrases that are more useful than single words and yield more meaningful results.

Chunking is critical when extracting information such as place names and person names from text (entity extraction).

Types of Chunking

Chunking Up

When chunking up, we don't go into great detail; we settle for a high-level overview that gives a quick summary of the information.

Chunking Down

Unlike chunking up, chunking down gives us more detailed information.

Consider "chunking up" if you only need an insight; otherwise, "chunking down" is preferable.

Implementation of chunking in Python

In general Python programming, chunking refers to splitting a sequence or list into smaller chunks of a fixed size (the last chunk may be shorter). This is often useful when dealing with large datasets or when processing data in batches. One common approach is to use a list comprehension with the range() function to create slices of the original sequence. Another uses the islice() function from the itertools module to consume an iterable one chunk at a time. Alternatively, libraries such as numpy or pandas can handle chunking of numerical or tabular data efficiently. Overall, chunking breaks data processing tasks into manageable portions, which can improve memory efficiency and computational performance.

Here's how you can implement chunking in Python using list comprehension:

Python

def chunk_list(lst, chunk_size):
    # Slice the list at every multiple of chunk_size; the last slice may be shorter.
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
chunk_size = 3
chunks = chunk_list(my_list, chunk_size)
print(chunks)

Output:

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

 

The function chunk_list takes a list lst and a chunk_size as input and returns a list of lists, each containing up to chunk_size elements from the original list (the last chunk may be shorter). The range() function steps through the start indices of the chunks, and list slicing extracts each chunk.
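
The islice() approach mentioned earlier can be sketched in a similar way. This is a minimal sketch, not a canonical recipe: iter_chunks is a hypothetical helper name, and unlike the slicing version it works on any iterable (including generators), because it never needs len() or indexing.

```python
from itertools import islice

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of up to chunk_size items from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(iterable)
    while True:
        # islice pulls at most chunk_size items from the shared iterator.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the iterator is exhausted
        yield chunk

chunks = list(iter_chunks(range(1, 11), 3))
print(chunks)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```

Because the chunks are produced lazily, only one chunk has to be held in memory at a time when the result is consumed in a loop.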

Regular Expressions

To learn how to implement chunking, some knowledge of regular expressions (regex) is required.

A regex is a pattern that describes which substrings should be selected from a text.

Regexes follow a specific set of syntax rules. Here is a brief introduction to them.

Here's where you can learn more about Regular Expressions: Regex

Character  Description                                                                  Example
[]         A set of characters                                                          "[a-m]"
\          Signals a special sequence (can also be used to escape special characters)   "\d"
.          Any character (except the newline character)                                 "he..o"
^          Starts with                                                                  "^hello"
$          Ends with                                                                    "planet$"
*          Zero or more occurrences                                                     "he.*o"
+          One or more occurrences                                                      "he.+o"
?          Zero or one occurrence                                                       "he.?o"
{}         Exactly the specified number of occurrences                                  "he.{2}o"
|          Either or                                                                    "falls|stays"
()         Capture and group

Symbol  Meaning                                                      Example
*       Zero or more of the preceding character (it may be absent)   ab* matches a, ab, abb, abbb, and so on
+       One or more of the preceding character                       a+ matches a, aa, aaa, and so on
?       Zero or one of the preceding character (it is optional)      ab? matches a and ab, but not abb
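
The quantifiers can be checked directly with Python's built-in re module. The snippet below is a small sketch; matches is a hypothetical helper that keeps only the candidate strings a pattern matches in full.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "ab", "abb", "abbb"]

print(matches(r"ab*", candidates))    # ['a', 'ab', 'abb', 'abbb']
print(matches(r"ab+", candidates))    # ['ab', 'abb', 'abbb']
print(matches(r"ab?", candidates))    # ['a', 'ab']
print(matches(r"ab{2}", candidates))  # ['abb']

# "." stands for any single character, so "he..o" needs exactly two characters in the middle.
print(re.findall(r"he..o", "hello world, helio"))  # ['hello', 'helio']
```

Note that a quantifier applies only to the single character (or group) directly before it, which is why ab* accepts a lone a.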

Chunking in Python 

The high-level idea is that we first tokenize the text. NLTK then provides a utility, pos_tag, which attaches a part-of-speech tag (verb, conjunction, and so on) to each word.

With these tags in place, we can perform chunking. For example, to select verbs, we can write a grammar that matches the words carrying verb tags.

Let's look at the code (the tokenizer and tagger models must be downloaded once with nltk.download() before it will run):

import nltk
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))

Output: 

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]

'CC' is a coordinating conjunction, 'RB' is an adverb, and so on.

The table below lists the meaning of some of the verb tags used by NLTK:

POS  Meaning
VB   Verb in its base form
VBD  Verb in its past tense
VBG  Verb in its gerund or present participle form
VBN  Verb in its past participle form
VBP  Verb in its present tense but not in third person singular
VBZ  Verb in its present tense and in third person singular
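
Once the words are tagged, picking out the verbs is a simple filter on these tags. The sketch below uses only plain Python; the tagged words are copied from the start of the pos_tag output for the sample sentence shown further below, and select_verbs and VERB_TAG_MEANINGS are hypothetical names introduced for this illustration.

```python
# Meanings of the verb tags listed in the table above.
VERB_TAG_MEANINGS = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present tense, not third person singular",
    "VBZ": "present tense, third person singular",
}

def select_verbs(tagged_tokens):
    """Return (word, tag, meaning) for every token that carries a verb tag."""
    return [
        (word, tag, VERB_TAG_MEANINGS[tag])
        for word, tag in tagged_tokens
        if tag in VERB_TAG_MEANINGS
    ]

tagged_words = [("I", "PRP"), ("am", "VBP"), ("a", "DT"), ("coding", "NN"), ("ninja", "NN")]
print(select_verbs(tagged_words))
# [('am', 'VBP', 'present tense, not third person singular')]
```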

 

After tagging, we can define our rules and perform verb chunking, noun chunking, and so on.

Let's look at an example.

import nltk

sample_text = "I am a coding ninja, and I am the best in coding."

# Chunk grammar: any adverbs, then any verbs, then one or more proper nouns,
# optionally followed by a single noun.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)

for sentence in nltk.sent_tokenize(sample_text):
    words = nltk.word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    print(tagged_words)
    chunked = chunk_parser.parse(tagged_words)
    chunked.draw()  # opens a window that displays the chunk tree

Output: 

[('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('coding', 'NN'), ('ninja', 'NN'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('am', 'VBP'), ('the', 'DT'), ('best', 'JJS'), ('in', 'IN'), ('coding', 'NN'), ('.', '.')]

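Note that this grammar requires at least one proper noun (NNP). The tagged output above contains none, so no chunk is formed for this sentence. Writing grammars takes practice, and more rules can be added, but the overall process stays the same.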
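
To experiment with grammars without installing anything, the matching step can be imitated with the standard library. The sketch below is a simplified imitation of tag-pattern matching, not NLTK's actual implementation: it writes the tags as one string such as <PRP><VBP><DT> and runs an ordinary regular expression over it. The helper names tag_pattern_to_regex and find_chunks, and the noun-phrase pattern in the second call, are assumptions made for this illustration.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<RB.?>*<NNP>+' into a regex over '<TAG>' strings."""
    def convert(match):
        # "." must not run across tag boundaries, so restrict it to non-bracket characters.
        inner = match.group(1).replace(".", "[^<>]")
        # Wrap the whole tag in a group so a following *, + or ? applies to the tag.
        return "(?:<(?:" + inner + ")>)"
    return re.sub(r"<([^<>]+)>", convert, tag_pattern)

def find_chunks(tagged_tokens, tag_pattern):
    """Return the word groups whose tag sequence matches the tag pattern."""
    regex = re.compile(tag_pattern_to_regex(tag_pattern))
    tag_string, offsets = "", []
    for _, tag in tagged_tokens:
        offsets.append(len(tag_string))  # where this token's tag starts in the string
        tag_string += "<" + tag + ">"
    chunks = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        chunks.append([word for (word, _), offset in zip(tagged_tokens, offsets)
                       if match.start() <= offset < match.end()])
    return chunks

# Tagged words copied from the output shown above.
tagged_words = [('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('coding', 'NN'),
                ('ninja', 'NN'), (',', ','), ('and', 'CC'), ('I', 'PRP'),
                ('am', 'VBP'), ('the', 'DT'), ('best', 'JJS'), ('in', 'IN'),
                ('coding', 'NN'), ('.', '.')]

print(find_chunks(tagged_words, "<RB.?>*<VB.?>*<NNP>+<NN>?"))  # []
print(find_chunks(tagged_words, "<DT>?<JJ.?>*<NN.?>+"))
# [['a', 'coding', 'ninja'], ['coding']]
```

The first call uses the grammar from the example and finds nothing, while the second, a hypothetical noun-phrase pattern (optional determiner, any adjectives, one or more nouns), picks up the noun groups.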

Frequently Asked Questions

What are the different types of Chunking in NLP?

Groups of words make up phrases, and there are five major categories:
  - Noun phrase (NP)
  - Verb phrase (VP)
  - Adjective phrase (ADJP)
  - Adverb phrase (ADVP)
  - Prepositional phrase (PP)

Why is Chunking important?

Chunking is the process of breaking large strings of data into units or chunks. The resulting bits of information are easier to remember than a lengthier uninterrupted stream of data. Information understanding and retrieval are aided by good Chunking.

What is the difference between chunk and phrase?

As nouns, the difference between chunk and phrase is that chunk is a part of something that has been separated while the phrase is part of a sentence that has a meaning in itself. 

What is RegexpParser?

RegexpParser specifies the parser's behaviour using a set of regular expression patterns. A ChunkString is used to encode the text chunking, and each rule modifies the Chunking in the ChunkString. Regular expression matching and substitution are used to implement all of the rules.

Conclusion

In a nutshell, chunking means extracting groups of words from a sentence by first tagging the words and then running regular-expression patterns over the tags. It helps to narrow down the amount of information that needs to be processed and gives us a more detailed view of the sentence's structure.
