Introduction
Let’s say we build an application where, given a feature vector X, we can predict an output vector y = {y1, y2, ..., yn}.
A famous example in NLP is Part of Speech (POS) tagging, where the input is divided into features {X1, X2, ..., Xn} and each output variable yi is the POS tag of word i.
Therefore, in such a problem, we have two primary goals:
Predict each output variable correctly.
Determine the correct sequence of predictions, which is where conditional random fields come to the rescue.
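As a quick illustration of the input and output sequences in POS tagging, here is a minimal sketch using nltk's built-in tagger (this assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are installed):
import nltk
from nltk import word_tokenize, pos_tag

# the input sequence X is the token list; the output sequence y is the tag list
tokens = word_tokenize("Maharashtra Times is a news organization.")
print(pos_tag(tokens))
# e.g. [('Maharashtra', 'NNP'), ('Times', 'NNP'), ('is', 'VBZ'), ...]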
CRF for Sequence Models
When a model must predict many interdependent variables, CRF models are a natural fit.
The major difficulty with the NER problem is that entities are often too infrequent in the training data, forcing the model to identify them based on context alone. The simplistic solution is to classify each word independently. The fundamental flaw in this strategy is that it assumes named entity labels are independent of one another, which is not the case.
Maharashtra, for example, is a state, whereas the Maharashtra Times is a news organization.
To address this issue, we employ CRFs, in which both the input and the output are sequences, and we must consider the surrounding context when predicting each data point. To do this, we'll use a feature function that takes several input values.
Building and Training a CRF Module in Python
#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report
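Note that word_tokenize and pos_tag rely on nltk data packages; if your environment does not already have them (an assumption about your setup), a one-time download is needed:
# one-time downloads required by word_tokenize and pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')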
#function to collect the contents of all annotated XML files into one string
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # ElementTree.tostring returns bytes in Python 3, so decode before concatenating
        new_data += ElementTree.tostring(data).decode('utf-8')
    return new_data
#function to remove special characters and punctuation
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct
# function to extract features from a document
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

# function to collect the label of each token in a document
def get_labels(doc):
    return [label for (token, postag, label) in doc]
files_path = "D:/Annotated/"
allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

#identify the tagged elements
docs = []
sents = []
for d in soup.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            # untagged text gets the default 'NA' label
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            # tagged text inherits the tag name as its label
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  # puts all the tokens of a document in one list
    docs.append(sents)  # appends all the individual documents into one list
Next, generate features. These are default features along the lines of what the NER algorithm in nltk uses; you can modify them for your own use case.
data = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    # nltk's POS tagger supplies the part-of-speech feature for each token
    tagged = nltk.pos_tag(tokens)
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])
def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features
X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
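The testing step below loads a saved model file named 'crf.model', but the training call itself does not appear above. A typical pycrfsuite training step looks like the following; the hyperparameter values here are illustrative defaults, not tuned settings.
#train the CRF model on the training split and write it to 'crf.model'
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 0.1,    # L1 regularization coefficient (illustrative)
    'c2': 0.01,   # L2 regularization coefficient (illustrative)
    'max_iterations': 200,
    'feature.possible_transitions': True
})
trainer.train('crf.model')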
Testing the Model
Let’s test our model by executing the following lines of code.
tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]
i = 0
# print each word (recovered from its 'word.lower' feature) with its predicted tag in parentheses
for x, y in zip(y_pred[i], [x[1].split("=")[1] for x in X_test[i]]):
    print("%s (%s)" % (y, x))
Performance Check of the Model
# Create a map of labels to indices
labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the tag sequences into flat 1-D arrays
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

#print the classification report (target_names follow the index order above)
print(classification_report(
    truths, predictions,
    target_names=["NA", "claim_number", "claimant"]))
#predict on new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")
docs = []
sents = []
for d in soup_test.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  # puts all the sentences of a document in one element of the list
    docs.append(sents)  # appends all the individual documents into one list
data_test = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]
Checking the Predicted Data
i = 0
for x, y in zip(newdata_pred[i], [x[1].split("=")[1] for x in data_test_feats[i]]):
    print("%s (%s)" % (y, x))
Here, we used Python to train a CRF model and saw how to identify entities in new text.
Frequently Asked Questions
What is entity recognition?
With the rising interest in Natural Language Processing, entity recognition has seen a spike in use. An entity is a span of text that is of interest to the data scientist or the company. Names of persons, addresses, account numbers, and locations are examples of frequently extracted entities. These are only samples; you may define your own entity types to suit your problem. As a basic example of entity recognition, if any text in the dataset contains the word "London", the algorithm will automatically categorize or classify it as a place.
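A minimal sketch of the "London" example with nltk's built-in chunker (assuming the 'maxent_ne_chunker' and 'words' data packages are downloaded):
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# requires nltk.download('maxent_ne_chunker') and nltk.download('words')
tree = ne_chunk(pos_tag(word_tokenize("He moved to London last year.")))
print(tree)  # 'London' appears inside a GPE (geo-political entity) chunk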
What are discriminative models?
Discriminative models describe how to take a feature vector X and assign it an output vector y; they model the decision boundary between classes. Logistic regression, which maximizes the conditional likelihood, is an example of a discriminative model.
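For instance, a minimal scikit-learn sketch (the toy data here is made up purely for illustration):
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D data: class 0 for small values, class 1 for large values
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.5]]))  # models P(y | X) directly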
What are generative models?
Generative models describe how a label vector y can probabilistically generate the feature vector X. As a simple example, Naive Bayes (which is a very popular probabilistic classifier) is a generative algorithm.
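Continuing with the same toy data (again, purely illustrative), a Naive Bayes classifier learns P(X | y) and P(y) and then classifies via Bayes' rule:
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = GaussianNB().fit(X, y)  # fits per-class distributions, i.e. P(X | y)
print(clf.predict([[1.5]]))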
What is the feature function of CRF?
The feature function can be given as:
f(X, i, y_{i-1}, y_i)
where
X = the set of input vectors
i = the position of the data point we want to compute features for
y_{i-1} = the label of data point i-1 in X
y_i = the label of data point i in X
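Mapped to code, one such feature might be the indicator below; this is a hypothetical example, not a function used in the pipeline above:
def feature(X, i, y_prev, y_curr):
    # fires when the current word is title-cased and the previous label is 'NA'
    return 1 if X[i].istitle() and y_prev == 'NA' else 0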
What are the applications of CRF?
CRFs can model sequential data for applications in Natural Language Processing and Computer Vision. Named Entity Recognition is one use of CRFs in NLP, where we predict a sequence of labels that depend on one another. Other CRF variants include the Hidden CRF, used for gesture recognition; the Dynamic CRF, used to label sequence data; and the Skip-chain CRF, used for activity recognition. Gene prediction is another application.
Conclusion
In this article, we have extensively discussed Conditional Random Fields, their implementation, and their importance.