Table of contents
1. Introduction
2. Markov Random Fields
3. Conditional Random Fields
4. CRF Theory and Likelihood Optimization
5. Building and Training a CRF Module in Python
   5.1. Defining and Building New Functions
   5.2. Importing the Annotated Training Data
   5.3. Building Features and Testing Data Frames
   5.4. Testing the Model
   5.5. Performance Check of the Model
   5.6. Checking the Predicted Data
6. FAQs
7. Key Takeaways

Conditional Random Fields

Author: Soham Medewar

Introduction

Conditional Random Fields (CRFs) are a type of discriminative model well suited to prediction tasks in which contextual information or the state of the neighbors influences the current prediction. CRFs are used for problems such as named entity recognition, part-of-speech tagging, gene prediction, noise reduction, and object detection.

In this article, we'll go through the fundamental math and terminology of Markov Random Fields, the concept on which CRFs are built. We will then discuss a simple Conditional Random Field model and show why it is well suited to sequence prediction problems. Finally, in the context of the CRF model, we will work through the likelihood maximization problem and the related derivations.

Markov Random Fields

A Markov Random Field, also known as a Markov Network, is a type of graphical model in which random variables are connected in an undirected graph. The random variables' dependency or independence is determined by the graph's structure.

[Figure: an example Markov Network over the random variables A, B, C, and D]

  • The graph G = (V, E) represents a Markov Network, with the vertices or nodes representing random variables and the edges reflecting the interdependence between those variables.
  • The graph can be factorized into J cliques or factors, each governed by a factor function ϕⱼ whose scope Dⱼ is a subset of the random variables. ϕⱼ(dⱼ) should be strictly positive for all possible values dⱼ of Dⱼ.
  • The product of all the factor functions gives the unnormalized joint probability of the variables. For the MRF illustrated above with V = (A, B, C, D), the joint probability can therefore be written as:

P(A, B, C, D) = ∏ⱼ ϕⱼ(Dⱼ) / Σ_{A, B, C, D} ∏ⱼ ϕⱼ(Dⱼ)

The denominator is the sum of the factors' products across all possible values for the random variables. It's a constant that's also known as the partition function and is abbreviated as Z.
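
To make the normalization concrete, here is a small sketch that enumerates every assignment of two binary variables for a single, hypothetical factor table (the factor values below are made up purely for illustration) and computes both Z and the normalized joint probabilities:

import itertools

# Hypothetical factor over two binary variables (A, B); the values are arbitrary
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}

# Partition function Z: sum of the factor products over all possible assignments
Z = sum(phi[(a, b)] for a, b in itertools.product([0, 1], repeat=2))

# Normalized joint probability of each assignment
joint = {assignment: value / Z for assignment, value in phi.items()}
print(Z)      # 6.5
print(joint)  # the probabilities now sum to 1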


Conditional Random Fields

Let's assume we have a Markov Random Field that is divided into two sets of random variables, Y and X.

"When we condition the graph on X globally, i.e., when the values of random variables in X are fixed or given, all the random variables in set Y follow the Markov property p(Yᵤ/X, Yᵥ, u≠v) = p(Yᵤ/X, Yₓ, Yᵤ~Yₓ), where Yᵤ~Yₓ implies that Yᵤ and Yₓ are neighbors in the graph." The Markov Blanket of a variable is made up of its adjacent nodes or variables.

The chain-structured graph shown below is one such graph that satisfies the aforementioned property:

 

[Figure: a chain-structured CRF in which each label Yᵢ is connected to its neighboring labels and to its evidence variable Xᵢ]

As the CRF is a discriminative model, it models the conditional probability P(Y | X); that is, X is always given or observed. As a result, the graph effectively reduces to a simple chain over the label variables.

We call X and Y the evidence and label variables, respectively, because we condition on X and aim to find the appropriate Yᵢ for every Xᵢ.

The "factor-reduced" CRF model in the figure above satisfies the Markov property, as shown for the variable Y₂ in the equation below: the conditional probability of Y₂ depends only on its neighboring nodes.

p(Y₂ | X, Y₁, Y₃, Y₄, …, Y_L) = p(Y₂ | X, Y₁, Y₃)

CRF Theory and Likelihood Optimization

Let's start by defining the parameters, then use the Gibbs notation to construct the equations for joint (and conditional) probabilities.

1. Label domain: Assume that the domain of random variables in set Y is {m ϵ ℕ | 1≤m ≤M}, i.e., the first M natural numbers.

2. Evidence structure and domain: Assume that the random variables in set X are S-dimensional real-valued vectors, i.e., ∀ Xᵢ ϵ X, Xᵢ ϵ ℝˢ.

3. Let the length of the CRF chain be L, i.e., there are L label variables and L evidence variables.

4. Let βᵢ(Yᵢ, Yⱼ) = Wcc’ if Yᵢ = c, Yⱼ = c’ and j = i+1, 0 otherwise.

5. Let β’ᵢ(Yᵢ, Xᵢ) = W’c · Xᵢ if Yᵢ = c, and 0 otherwise (a dot product between the weight vector W’c and the observation Xᵢ).

6. The total number of parameters is M x M + M x S: one parameter for each possible label transition (M x M transitions) and S parameters for each of the M labels, which are multiplied with the observation vector (of size S) for that label. For example, with M = 3 labels and S = 5 observation features, that is 3 x 3 + 3 x 5 = 24 parameters.

7. Let D = {(xⁿ, yⁿ)}, n = 1, …, N, be the training data comprising N examples.

So, the energy of a configuration and the conditional likelihood can be expressed in the following way:

E(x, y) = Σᵢ βᵢ(yᵢ, yᵢ₊₁) + Σᵢ β’ᵢ(yᵢ, xᵢ),   with the first sum over i = 1, …, L−1 and the second over i = 1, …, L

p(y | x) = exp(E(x, y)) / Σ_{y′} exp(E(x, y′))

so the log-likelihood of the training data D is

L(W, W’) = Σₙ [ E(xⁿ, yⁿ) − log Σ_{y′} exp(E(xⁿ, y′)) ],   with n running over the N training examples.

As a result, the training problem boils down to maximizing the log-likelihood for all Wcc' and W'cs model parameters.

The gradient of the log-likelihood with respect to W’cs is derived as follows (1(·) is the indicator function):

∂L/∂W’cs = Σₙ Σᵢ [ 1(yⁿᵢ = c) · xⁿᵢₛ − p(y′ᵢ = c | xⁿ) · xⁿᵢₛ ]

Note that the second term in the above equation is the marginal probability of y′ᵢ being equal to c, weighted by xⁿᵢₛ (the s-th component of the i-th observation of the n-th example) and summed over all positions and examples. The marginal is obtained by summing over y′₋ᵢ, the set of label variables at every position except position i.

A similar derivation gives the gradient with respect to Wcc′:

∂L/∂Wcc′ = Σₙ Σᵢ [ 1(yⁿᵢ = c, yⁿᵢ₊₁ = c′) − p(y′ᵢ = c, y′ᵢ₊₁ = c′ | xⁿ) ]

i.e., the empirical count of the transition c → c′ minus its expected count under the model.
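
To make these formulas concrete, here is a small, self-contained sketch that evaluates the log-likelihood and the gradient with respect to the W’ parameters by brute-force enumeration of all label sequences for a tiny chain (a single training example). This is only feasible at toy sizes; real CRF training computes the marginals with the forward-backward algorithm. All sizes, weights, and sequences below are illustrative, with W playing the role of Wcc′ and W2 the role of W’cs.

import itertools
import numpy as np

M, S, L = 2, 3, 4                       # number of labels, feature size, chain length (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(M, M))             # transition parameters W_cc'
W2 = rng.normal(size=(M, S))            # observation parameters W'_cs
x = rng.normal(size=(L, S))             # one evidence sequence x_1 ... x_L
y = [0, 1, 1, 0]                        # one observed label sequence y_1 ... y_L

def energy(ys):
    # E(x, ys) = sum_i W[ys_i, ys_{i+1}] + sum_i W2[ys_i] . x_i
    trans = sum(W[ys[i], ys[i + 1]] for i in range(L - 1))
    emit = sum(W2[ys[i]] @ x[i] for i in range(L))
    return trans + emit

# Brute-force partition function Z(x) and single-position marginals p(y'_i = c | x)
all_seqs = list(itertools.product(range(M), repeat=L))
scores = np.array([energy(s) for s in all_seqs])
Z = np.exp(scores).sum()
probs = np.exp(scores) / Z
marginals = np.zeros((L, M))
for p, s in zip(probs, all_seqs):
    for i, c in enumerate(s):
        marginals[i, c] += p

log_likelihood = energy(y) - np.log(Z)

# dL/dW'_cs = sum_i x_is * [ 1(y_i = c) - p(y'_i = c | x) ]   (single training example)
grad_W2 = np.zeros_like(W2)
for i in range(L):
    grad_W2[y[i]] += x[i]
    grad_W2 -= marginals[i][:, None] * x[i][None, :]

print(log_likelihood, grad_W2)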

 

Building and Training a CRF Module in Python

# import the required libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report

Defining and Building New Functions

# function to read all annotated XML files in a folder and concatenate their contents
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # ElementTree.tostring returns bytes by default; request a str instead
        temp = ElementTree.tostring(data, encoding="unicode")
        new_data += temp
    return new_data
# function to remove special characters and punctuation
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

# function to extract features from a document
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

# function to extract the labels from a document
def get_labels(doc):
    return [label for (token, postag, label) in doc]
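
As a quick, hypothetical check of the helper above, remov_punct strips the listed punctuation characters before tokenization:

print(remov_punct("Claim, number: 12345!"))                  # -> Claim number 12345
print(word_tokenize(remov_punct("Claim, number: 12345!")))   # -> ['Claim', 'number', '12345']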

Importing the Annotated Training Data

files_path = "D:/Annotated/"
allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")
#identification of the tagged element
docs = []
sents = []
for d in soup.find_all("document"):
 for wrd in d.contents:    
   tags = []
   NoneType = type(None)   
   if isinstance(wrd.name, NoneType) == True:
       withoutpunct = remov_punct(wrd)
       temp = word_tokenize(withoutpunct)
       for token in temp:
           tags.append((token,'NA'))            
   else:
       withoutpunct = remov_punct(wrd)
       temp = word_tokenize(withoutpunct)
       for token in temp:
           tags.append((token,wrd.name))    
   sents = sents + tags 
 docs.append(sents) #appends all the individual documents into one list
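
At this point, each element of docs holds one annotated document as a list of (token, tag) pairs, where the tag is either an XML element name or 'NA'. The output below is only illustrative, since the tags depend on your own markup:

print(docs[0][:4])
# e.g. [('Claim', 'claim_number'), ('12345', 'claim_number'), ('filed', 'NA'), ('by', 'NA')]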

Next, generate the features. These are similar to the default features that the NER algorithm in nltk uses; you can modify them for your own use case.
data = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]
    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]
    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')
    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')
    return features
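
As a quick sanity check, you can print the feature list produced for the first token of the first document; the exact strings depend on your data, so this output is only indicative:

print(extract_features(data[0])[0])
# e.g. ['bias', 'word.lower=claim', 'word[-3:]=aim', 'word[-2:]=im', 'word.isupper=False',
#       'word.istitle=True', 'word.isdigit=False', 'postag=NN', 'BOS', '+1:word.lower=number', ...]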

Building Features and Testing Data Frames

X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
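
The next step loads a model file named 'crf.model', but the training call that produces it is not shown above. Below is a minimal training sketch with pycrfsuite; the regularization and iteration settings are illustrative choices rather than the article's original values:

# Train the CRF on the training split and write the model to 'crf.model'
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 0.1,                             # L1 regularization coefficient
    'c2': 0.01,                            # L2 regularization coefficient
    'max_iterations': 200,                 # cap on optimization iterations
    'feature.possible_transitions': True   # also generate transitions not seen in training
})
trainer.train('crf.model')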

Testing the Model

Let's now test the CRF model by executing the following lines of code.

tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]

# print each token of the first test document (recovered from its 'word.lower=' feature)
# alongside its predicted tag
i = 0
words = [feats[1].split("=")[1] for feats in X_test[i]]
for pred, word in zip(y_pred[i], words):
    print("%s (%s)" % (word, pred))

Performance Check of the Model

# Create a map of labels to indices (one index per class)
labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the tag sequences into 1-D arrays of indices
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

# print the classification report (target_names in the order of the indices above)
print(classification_report(
    truths, predictions,
    target_names=["NA", "claim_number", "claimant"]))

# predict on new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")

docs = []
for d in soup_test.find_all("document"):
    sents = []
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd.text)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  # puts all the sentences of a document in one element of the list
    docs.append(sents)  # appends all the individual documents into one list

data_test = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

Checking the Predicted Data

# print each token (recovered from its 'word.lower=' feature) alongside its predicted tag
i = 0
words = [feats[1].split("=")[1] for feats in data_test_feats[i]]
for pred, word in zip(newdata_pred[i], words):
    print("%s (%s)" % (word, pred))

Here, we used Python to train a CRF model and saw how to use it to identify entities in new text.

FAQs

1. What do you mean by CRF?

CRF stands for Conditional Random Field. It is a type of discriminative model best suited to prediction tasks in which contextual information or the state of the neighbors influences the current prediction.
 

2. What is CRF in image segmentation?

When the class labels of different inputs are not independent, a conditional random field is used as a discriminative statistical modelling tool. In image segmentation, for example, the class label of a pixel also depends on the labels of its neighboring pixels.
 

3. What is the difference between CRF and HMM (Hidden Markov Model)?

An HMM is based on a directed graph, whereas a CRF uses an undirected graph. The HMM is generative: it models the joint probability of labels and observations by explicitly modelling the transition probabilities and the emission probabilities, while a CRF directly models the conditional probability of the labels given the observations.
 

4. What is the difference between CRF and MRF (Markov Random Fields)?

A Conditional Random Field (CRF) is a type of MRF that defines a posterior for the variables x given the data z. Unlike the hidden MRF, the factorization into the data distribution P(z|x) and the prior P(x) is not made explicit.
 

Key Takeaways

In this article, we have discussed the following topics:

  • Introduction to CRF
  • MRF
  • CRF Theory and Likelihood Optimization
  • Building and training a CRF model in Python


Happy Coding!
