Conditional Random Fields
Let's assume we have a Markov Random Field that is divided into two sets of random variables, Y and X.
"When we condition the graph on X globally, i.e., when the values of random variables in X are fixed or given, all the random variables in set Y follow the Markov property p(Yᵤ/X, Yᵥ, u≠v) = p(Yᵤ/X, Yₓ, Yᵤ~Yₓ), where Yᵤ~Yₓ implies that Yᵤ and Yₓ are neighbors in the graph." The Markov Blanket of a variable is made up of its adjacent nodes or variables.
The chain-structured graph shown below is one such graph that satisfies the aforementioned property:

[Figure: a chain-structured CRF in which the label variables Y₁, …, Y_L form a chain and each Yᵢ is connected to its evidence variable Xᵢ]
As the CRF is a discriminative model, it models the conditional probability P(Y | X), which means that X is always given or observed. As a result, the graph effectively reduces to a simple chain over the label variables Y.

We call X and Y the evidence and label variables, respectively, because we condition on X and aim to find the appropriate Yᵢ for every Xᵢ.
We can see that the "factor-reduced" CRF model in the above figure follows the Markov property, as shown for variable Y₂ in the equation below: the conditional probability of Y₂ depends only on its neighboring nodes.

p(Y₂ | X, Y₁, Y₃, Y₄, …, Y_L) = p(Y₂ | X, Y₁, Y₃)
CRF Theory and Likelihood Optimization
Let's start by defining the parameters, then use the Gibbs notation to construct the equations for joint (and conditional) probabilities.
1. Label domain: Assume that the domain of the random variables in set Y is {m ∈ ℕ | 1 ≤ m ≤ M}, i.e., the first M natural numbers.
2. Evidence structure and domain: Assume that the random variables in set X are S-dimensional real-valued vectors, i.e., ∀ Xᵢ ∈ X, Xᵢ ∈ ℝˢ.
3. Let the length of the CRF chain be L, comprising L labels and L evidence variables.
4. Let βᵢ(Yᵢ, Yⱼ) = Wcc’ if Yᵢ = c, Yⱼ = c’ and j = i+1, 0 otherwise.
5. Let β′ᵢ(Yᵢ, Xᵢ) = W′c · Xᵢ if Yᵢ = c, and 0 otherwise.
6. The total number of parameters is M × M + M × S: there is a single parameter for each possible label transition (M × M possible transitions) and S parameters for each of the M possible labels, which are multiplied with the observation variable (a vector of size S) for that label. For example, with M = 5 labels and S = 10 features, the model has 25 + 50 = 75 parameters.
7. Let D = {(xₙ, yₙ)}, n = 1, …, N, be the training data comprising N examples.
So, the energy and the likelihood can be expressed in the following way:

E(x, y) = Σᵢ βᵢ(yᵢ, yᵢ₊₁) + Σᵢ β′ᵢ(yᵢ, xᵢ)

p(y | x) = exp(E(x, y)) / Z(x), where Z(x) = Σ_y′ exp(E(x, y′))

ℒ = Σₙ log p(yₙ | xₙ) = Σₙ [ E(xₙ, yₙ) − log Z(xₙ) ]
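To make these definitions concrete, here is a minimal NumPy sketch of the energy and the log-likelihood. The names W, W_prime, energy, and log_likelihood are illustrative choices for this sketch, not part of any library, and the partition function Z(x) is computed by exhaustive enumeration, which is only feasible for tiny M and L; practical implementations compute it with the forward algorithm.

```python
import itertools
import numpy as np

M, S, L = 3, 4, 5  # number of labels, feature dimension, chain length

rng = np.random.default_rng(0)
W = rng.normal(size=(M, M))        # W[c, c']: weight for the transition c -> c'
W_prime = rng.normal(size=(M, S))  # W_prime[c]: feature weights for label c

def energy(y, x):
    """E(x, y) = sum_i beta_i(y_i, y_{i+1}) + sum_i beta'_i(y_i, x_i)."""
    transition = sum(W[y[i], y[i + 1]] for i in range(L - 1))
    emission = sum(W_prime[y[i]] @ x[i] for i in range(L))
    return transition + emission

def log_likelihood(y, x):
    """log p(y | x) = E(x, y) - log Z(x), with Z(x) by exhaustive enumeration."""
    log_z = np.logaddexp.reduce(
        [energy(y_prime, x) for y_prime in itertools.product(range(M), repeat=L)]
    )
    return energy(y, x) - log_z

x = rng.normal(size=(L, S))     # one evidence sequence (L vectors of size S)
y = rng.integers(0, M, size=L)  # one label sequence
print(log_likelihood(y, x))
```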
As a result, the training problem boils down to maximizing this log-likelihood with respect to all the Wcc′ and W′cs model parameters.
The gradient of the log-likelihood with respect to W′cs is derived in the equation below:

∂ℒ/∂W′cs = Σₙ Σᵢ 1[yₙᵢ = c] · xₙᵢₛ − Σₙ Σᵢ p(y′ᵢ = c | xₙ) · xₙᵢₛ
Note that the second term in the above equation denotes the sum of the marginal probabilities of y′ᵢ being equal to c, weighted by xₙᵢₛ. The marginal p(y′ᵢ = c | xₙ) is obtained by summing p(y′ | xₙ) over y′₋ᵢ, the set of label (y) variables at every position except position i.
A similar derivation yields ∂ℒ/∂Wcc′, as shown below.

∂ℒ/∂Wcc′ = Σₙ Σᵢ 1[yₙᵢ = c, yₙᵢ₊₁ = c′] − Σₙ Σᵢ p(y′ᵢ = c, y′ᵢ₊₁ = c′ | xₙ)
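As a sanity check on these formulas, the sketch below (reusing the hypothetical energy, log_likelihood, W_prime, x, and y names from the previous snippet) computes ∂ℒ/∂W′cs for a single training example via brute-force marginals and compares it against a finite-difference estimate; the outer sum over n in the formula would simply add these per-example gradients. Real CRF trainers obtain the marginals from the forward-backward algorithm rather than enumeration.

```python
def marginals(x):
    """p(Y_i = c | x) for every position i and label c, by brute force."""
    seqs = list(itertools.product(range(M), repeat=L))
    log_ps = np.array([energy(y_prime, x) for y_prime in seqs])
    probs = np.exp(log_ps - np.logaddexp.reduce(log_ps))  # normalized p(y' | x)
    marg = np.zeros((L, M))
    for y_prime, p in zip(seqs, probs):
        for i, c in enumerate(y_prime):
            marg[i, c] += p
    return marg

def grad_w_prime(y, x):
    """dL/dW'_{cs} = sum_i (1[y_i = c] - p(Y_i = c | x)) * x_{is}, one example."""
    marg = marginals(x)                  # shape (L, M)
    grad = np.zeros((M, S))
    for i in range(L):
        grad[y[i]] += x[i]               # observed feature counts
        grad -= np.outer(marg[i], x[i])  # expected feature counts under the model
    return grad

# Finite-difference check on a single entry (c, s) = (0, 0).
eps = 1e-5
W_prime[0, 0] += eps
plus = log_likelihood(y, x)
W_prime[0, 0] -= 2 * eps
minus = log_likelihood(y, x)
W_prime[0, 0] += eps  # restore the original value
print(grad_w_prime(y, x)[0, 0], (plus - minus) / (2 * eps))
```

The two printed numbers should agree to several decimal places. A full training loop would ascend this gradient, together with the analogous ∂ℒ/∂Wcc′ term, summed over all N examples.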
FAQs
1. What do you mean by CRF?
CRF stands for Conditional Random Field. It is a type of discriminative model that is best suited to prediction tasks in which the current prediction is influenced by contextual information or the states of neighboring variables.
2. What is CRF in image segmentation?
When the class labels for different inputs are not independent, a conditional random field can be used as a discriminative statistical modelling tool. In image segmentation, for example, the class label for a pixel also depends on the labels of its neighboring pixels.
3. What is the difference between CRF and HMM (Hidden Markov Model)?
An HMM is a directed graphical model, whereas a CRF is undirected. An HMM is generative: it models the joint probability of labels and observations by explicitly modelling the transition and emission probabilities. A CRF, by contrast, models the conditional probability of the labels given the observations directly.
4. What is the difference between CRF and MRF (Markov Random Fields)?
A Conditional Random Field (CRF) is a type of MRF that models the posterior over the variables x given the data z directly. Unlike a hidden MRF, the factorization into a data likelihood P(z|x) and a prior P(x) is not made explicit.
Key Takeaways
In this article, we have discussed the following topics:
- Introduction to CRF
- MRF
- CRF Theory and Likelihood Optimization
Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning.
Happy Coding!