Code360 powered by Coding Ninjas X Naukri.com
Last Updated: Mar 27, 2024
Difficulty: Hard

Conditional Random Fields


Introduction

Let’s say we build an application that, given a feature vector X, predicts an output vector y = {y1, y2, ..., yn}.

A well-known example in NLP is Part of Speech (POS) tagging, where the input X is divided into features {X1, X2, ..., Xn}, one per word, and each output variable yi is the POS tag of word i.

In such a problem, we have two primary goals:

  • Predict each output label correctly.
  • Predict the labels as a coherent sequence, since neighbouring labels depend on one another. This is where conditional random fields come to the rescue.

CRF for Sequence Models

When a model has to predict many interdependent variables, CRF models are a natural fit.

Consider Named Entity Recognition (NER). The major difficulty with the NER problem is that most entities are too infrequent to occur in the training data, forcing the model to identify them based on context alone. The simplistic solution is to classify each word separately. The fundamental flaw in this strategy is that it assumes named entity labels are independent of one another, which is not the case.

Maharashtra, for example, is a state, whereas the Maharashtra Times is a news organisation.

To address this issue, we employ CRFs, in which the input data is a sequence and the output data is likewise a sequence, and the prediction for each data point takes its context into account, in particular the previous label. We do this with feature functions that take several input values.
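The idea of context-aware feature functions can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the label set, the four hand-written feature functions, and their weights are hypothetical, and the best label sequence is found by brute force. A real CRF learns its weights from training data and uses dynamic programming for decoding.

```python
from itertools import product

LABELS = ["O", "LOC", "ORG"]  # hypothetical label set for this sketch

def next_is_times(X, i):
    """True when the word after position i is 'Times'."""
    return i + 1 < len(X) and X[i + 1] == "Times"

# Each feature function sees the whole input X, the position i,
# the previous label and the current label, and returns 0 or 1.
def f_loc_alone(X, i, y_prev, y_curr):
    return int(X[i] == "Maharashtra" and not next_is_times(X, i) and y_curr == "LOC")

def f_org_before_times(X, i, y_prev, y_curr):
    return int(next_is_times(X, i) and y_curr == "ORG")

def f_org_continues(X, i, y_prev, y_curr):
    return int(y_prev == "ORG" and y_curr == "ORG" and X[i].istitle())

def f_lower_is_other(X, i, y_prev, y_curr):
    return int(X[i].islower() and y_curr == "O")

# Hand-picked (not learned) weights, one per feature function
FEATURES = [
    (f_loc_alone, 2.0),
    (f_org_before_times, 2.0),
    (f_org_continues, 2.0),
    (f_lower_is_other, 3.0),
]

def score(X, y):
    """Weighted sum of all feature functions over all positions."""
    total = 0.0
    for i in range(len(X)):
        y_prev = y[i - 1] if i > 0 else None
        total += sum(w * f(X, i, y_prev, y[i]) for f, w in FEATURES)
    return total

def best_labels(X):
    """Brute-force search over every label sequence (fine for toy inputs only)."""
    return max(product(LABELS, repeat=len(X)), key=lambda y: score(X, y))

print(best_labels(["Maharashtra", "is", "a", "state"]))
# ('LOC', 'O', 'O', 'O')
print(best_labels(["the", "Maharashtra", "Times", "reported"]))
# ('O', 'ORG', 'ORG', 'O')
```

The same word receives a different label in each sentence because the score depends on the neighbouring words and labels, which a word-by-word classifier cannot express.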


Building and Training a CRF Module in Python

#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report

Defining and Building New Functions

# function to read every annotated XML file in a folder into one string
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # encoding="unicode" returns a str; the default returns bytes,
        # which cannot be appended to a str in Python 3
        new_data += ElementTree.tostring(data, encoding="unicode")
    return new_data

# function to remove special characters and punctuation
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

# function to extract the features of every token in a document
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

# function to extract the label of every token in a document
def get_labels(doc):
    return [label for (token, postag, label) in doc]

Importing the Annotated Training Data

files_path = "D:/Annotated/"
allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

# identify the tagged elements: each document becomes a list of (token, label) pairs
docs = []
for d in soup.find_all("document"):
    sents = []  # start a fresh token list for every document
    for wrd in d.contents:
        # plain text nodes have no tag name and are labelled 'NA';
        # annotated elements use their tag name as the label
        if wrd.name is None:
            text, label = str(wrd), 'NA'
        else:
            text, label = wrd.get_text(), wrd.name
        for token in word_tokenize(remov_punct(text)):
            sents.append((token, label))
    docs.append(sents)  # one list of tagged tokens per document
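The article does not show what an annotated file looks like, so the following is only a sketch under assumptions: the sample sentence is invented, and the `claim_number` and `claimant` tags are taken from the labels used later in the performance check. It also swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` so that it runs without extra packages, and it splits on whitespace instead of using `word_tokenize`.

```python
from xml.etree import ElementTree

# Hypothetical annotated document: entities are wrapped in tags named after their label
sample = (
    "<document>Please process claim <claim_number>12345</claim_number> "
    "for <claimant>John Smith</claimant> today</document>"
)

def xml_to_tagged_tokens(xml_text):
    """Turn one annotated document into a list of (token, label) pairs."""
    root = ElementTree.fromstring(xml_text)
    # Text before the first child element is untagged
    tagged = [(tok, "NA") for tok in (root.text or "").split()]
    for child in root:
        # Text inside an element takes the element's tag as its label
        tagged += [(tok, child.tag) for tok in (child.text or "").split()]
        # Text that follows the element (its tail) is untagged again
        tagged += [(tok, "NA") for tok in (child.tail or "").split()]
    return tagged

for token, label in xml_to_tagged_tokens(sample):
    print(token, label)
```

The result has the same shape as each entry of `docs` above: one list of `(token, label)` pairs per document, with `'NA'` marking tokens outside any entity.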

Generate features. These are the default features that the NER algorithm in nltk uses; you can modify them for your own use case.
data = []
for i, doc in enumerate(docs):
   tokens = [t for t, label in doc]    
   tagged = nltk.pos_tag(tokens)    
   data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])
def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features

Building Features and Testing Data Frames

X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
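The testing code in the next section opens a saved model file named crf.model, but the step that creates that file is not shown. A minimal sketch of the training step with python-crfsuite could look like the following. The function name `train_crf` and the hyperparameter values are assumptions chosen for illustration, not the settings used by the author.

```python
def train_crf(X_train, y_train, model_path="crf.model"):
    """Train a linear-chain CRF on feature/label sequences and save it to model_path."""
    # Imported lazily so that this function can be defined even when
    # python-crfsuite is not installed yet
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)

    # Each training item is one document: a list of per-token feature lists
    # paired with a list of per-token labels
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

    # Illustrative hyperparameters (assumed values, tune them for your data)
    trainer.set_params({
        "c1": 0.1,                             # L1 regularisation strength
        "c2": 0.01,                            # L2 regularisation strength
        "max_iterations": 200,                 # cap on optimiser iterations
        "feature.possible_transitions": True,  # also model unseen label transitions
    })

    trainer.train(model_path)  # writes the model file to disk
    return model_path
```

Calling `train_crf(X_train, y_train)` would write crf.model to the working directory, where a `pycrfsuite.Tagger` can open it.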

Testing the model

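Let’s test our model. This assumes a trained model has already been saved as crf.model. The following lines of code tag every test sequence and then print each word of the first sequence next to its predicted label.

tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]
i = 0
# the second feature of every token is 'word.lower=<word>', so split on the first '=' to recover the word
for label, word in zip(y_pred[i], [x[1].split("=", 1)[1] for x in X_test[i]]):
    print("%s (%s)" % (word, label))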

Performance Check of the Model

# Create a map of labels to indices (one distinct index per label)
labels = {"claim_number": 0, "claimant": 1, "NA": 2}
# Conversion of tags into 1-D arrays
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])
# Print the classification report; passing labels keeps target_names aligned
# even if a label does not occur in the test split
print(classification_report(
    truths, predictions,
    labels=[0, 1, 2],
    target_names=["claim_number", "claimant", "NA"]))
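To make the numbers in the report easier to interpret, here is a small standard-library sketch of per-label precision and recall on flattened tag sequences. It is an illustration only: the helper `per_label_scores` and the toy sequences are hypothetical, and the sketch is not meant to replace `classification_report`.

```python
from collections import Counter

def per_label_scores(truths, predictions):
    """Return {label: (precision, recall)} for two flat lists of tags."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truths, predictions):
        if t == p:
            tp[t] += 1   # correct prediction for label t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the truth but was missed
    scores = {}
    for label in sorted(set(truths) | set(predictions)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

# Toy data: two documents, one mislabelled 'claimant' token
y_test_toy = [["NA", "claim_number", "NA"], ["claimant", "claimant", "NA"]]
y_pred_toy = [["NA", "claim_number", "NA"], ["claimant", "NA", "NA"]]

# Flatten the per-document sequences, as the article does before scoring
flat_true = [tag for row in y_test_toy for tag in row]
flat_pred = [tag for row in y_pred_toy for tag in row]

scores = per_label_scores(flat_true, flat_pred)
print(scores)
# {'NA': (0.75, 1.0), 'claim_number': (1.0, 1.0), 'claimant': (1.0, 0.5)}
```

Because most tokens are 'NA', overall accuracy can look high even when entity labels are missed, so the per-label rows are the ones worth reading.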

# predict new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")

# build (token, label) pairs for every document, as for the training data
docs = []
for d in soup_test.find_all("document"):
    sents = []  # start a fresh token list for every document
    for wrd in d.contents:
        if wrd.name is None:
            text, label = str(wrd), 'NA'
        else:
            text, label = wrd.get_text(), wrd.name
        for token in word_tokenize(remov_punct(text)):
            sents.append((token, label))
    docs.append(sents)  # one list of tagged tokens per document

# add POS tags and extract the same features used for training
data_test = []
for doc in docs:
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])
data_test_feats = [extract_features(doc) for doc in data_test]

tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

Checking the Predicted Data

i = 0
# print each word of the first new document next to its predicted label
for label, word in zip(newdata_pred[i], [x[1].split("=", 1)[1] for x in data_test_feats[i]]):
    print("%s (%s)" % (word, label))

Here, we used Python to train a CRF model, and finally discovered how to identify entities from new text.

Frequently Asked Questions

What is entity recognition?

With the growing interest in Natural Language Processing, entity recognition has seen a recent spike in use. An entity is a section of text that is of interest to the data scientist or the company. Names of persons, addresses, account numbers, and localities are examples of frequently extracted entities. These are only examples; you may define your own entity types to suit the problem. As a basic example of entity recognition, if a text in the dataset contains the word "London," the algorithm would classify it as a place.

What are discriminative models?

Discriminative models describe how to take a feature vector X and assign it an output vector y. A discriminative model models the decision boundary between the different classes. An example of a discriminative model is logistic regression, which is fit by maximizing the conditional likelihood of the labels.

What are generative models?

Generative models describe how a label vector y can probabilistically generate the feature vector X. As a simple example, Naive Bayes (which is a very popular probabilistic classifier) is a generative algorithm.
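In probabilistic terms, the difference between the two families can be summarised as follows. This is the standard textbook formulation rather than something specific to this article; the symbols $\sigma$, $w$, $b$ and $x_j$ are introduced here only for illustration.

```latex
% Discriminative: model the conditional distribution directly,
% e.g. logistic regression for a binary label
p(y = 1 \mid X) = \sigma(w^\top X + b)

% Generative: model the joint distribution, then classify with Bayes' rule
p(X, y) = p(y)\, p(X \mid y),
\qquad
p(y \mid X) = \frac{p(y)\, p(X \mid y)}{\sum_{y'} p(y')\, p(X \mid y')}

% Naive Bayes additionally assumes the features x_j are
% conditionally independent given the label
p(X \mid y) = \prod_{j} p(x_j \mid y)
```

A CRF belongs to the discriminative family: it models $p(y \mid X)$ for a whole label sequence rather than for a single label.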

What is the feature function of CRF?

The feature function can be written as:

$f(X, i, y_{i-1}, y_i)$

where

$X$ = the set of input vectors

$i$ = the position of the data point we want to predict

$y_{i-1}$ = the label of data point $i-1$ in $X$

$y_i$ = the label of data point $i$ in $X$
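For context, feature functions of this form are combined in a linear-chain CRF roughly as shown below. This is the standard formulation, given here as a sketch: the index $k$ over feature functions, the weights $\lambda_k$ and the normaliser $Z(X)$ are symbols introduced for illustration and do not appear elsewhere in the article.

```latex
p(y \mid X) = \frac{1}{Z(X)}
  \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(X, i, y_{i-1}, y_i) \Big),
\qquad
Z(X) = \sum_{y'} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(X, i, y'_{i-1}, y'_i) \Big)
```

Training estimates the weights $\lambda_k$ from labelled sequences, and prediction picks the label sequence $y$ with the highest $p(y \mid X)$.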

What are the applications of CRF? 

CRFs can model sequential data for applications in Natural Language Processing and Computer Vision. Named Entity Recognition is one use of CRFs in NLP, where we predict a sequence of labels that depend on one another. There are also several CRF variants: Hidden CRF, which is used for gesture recognition; Dynamic CRF, which is used to label sequence data; Skip Gram CRF, which is used for activity recognition; and so on. Gene prediction is another application.

Conclusion

In this article, we have extensively discussed Conditional Random Fields, their implementation, and their importance.


Happy Learning!
