Table of contents
1. Introduction
2. Probability Distribution
3. Maximum Likelihood Estimation
  3.1. Likelihood function
  3.2. Log of likelihood
4. Bayesian Estimation
  4.1. Bayes Theorem
  4.2. Bayesian Estimation
5. Key Differences between MLE and Bayesian Estimation
6. Frequently Asked Questions
7. Conclusion
Last Updated: Mar 27, 2024

Maximum Likelihood Estimation vs Bayesian Estimation

Author: Arun Nawani

Introduction

We have already discussed Maximum Likelihood Estimation and Bayesian Estimation in detail in separate blogs. If you have missed either of the two, you can still continue with this blog, since we cover both techniques here as well and point out the key differences between these two parameter estimation techniques. Estimation is the process of obtaining model parameters from randomly distributed observations. Before we dive into contrasting the two techniques, let us first understand what we mean by a probability distribution.

Probability Distribution

In statistics, a probability distribution describes how likely the different outcomes of a random variable are. Probability distributions can be divided into two types:

  • Discrete Probability Distribution: Here the random variable can take only a discrete, distinct set of values, each with its own probability.
    For example, a die rolled once can take only 6 values, from 1 to 6, and each of these outcomes has a probability of 1/6.

  • Continuous Probability Distribution: Here the random variable can take an infinite number of values, and the probability of any single exact value is essentially zero. Probabilities are therefore assigned to ranges of values.
    For example, say we choose a person at random and want the probability that they weigh exactly 70 kg. That exact event is vanishingly unlikely, but we can define a probability for a range, say 65-70 kg, by integrating the probability density function over that range (equivalently, taking the difference of the cumulative distribution function at the two endpoints). A short sketch of both cases follows below.
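To make this concrete, here is a minimal Python sketch (using NumPy and SciPy, which the blog itself does not reference but are standard choices) that computes a discrete die probability and a continuous probability over a range. The weight distribution parameters are purely illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Discrete case: a fair six-sided die.
# Each face has probability 1/6, and the probabilities of all outcomes sum to 1.
die_probs = np.full(6, 1 / 6)
print("P(die shows 3) =", die_probs[2])                     # 0.1667
print("Sum of all face probabilities =", die_probs.sum())   # 1.0

# Continuous case: assume (purely for illustration) that body weight
# follows a normal distribution with mean 70 kg and std dev 10 kg.
weight = norm(loc=70, scale=10)
# A single exact value has zero probability under a continuous distribution.
# Probability of a range = integral of the pdf = difference of the CDF.
print("P(65 kg <= weight <= 70 kg) =", weight.cdf(70) - weight.cdf(65))
```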

 


Maximum Likelihood Estimation

In MLE, the objective is to maximize the likelihood of observing the data under an assumed probability distribution. In other words, we estimate the parameter values that make the observed data most probable.

Likelihood function

The objective is to maximise the joint probability of observing the data points under an assumed probability distribution. This is formally stated as

P(X | θ)

Here, θ is the unknown parameter (or vector of parameters) and X = {x1, x2, ..., xn} is the set of observed data points. This may also be written as

P(X ; θ)

P(x1, x2, x3, ..., xn ; θ)

This is the likelihood function, commonly denoted by L:

L(X ; θ)

Since the aim is to find the parameters that maximise the likelihood function:

max over θ of L(X ; θ)

Assuming the observations are independent and identically distributed, the joint probability can be restated as a product of the conditional probability of every observation given the distribution parameters:

L(X ; θ) = ∏ (i = 1 to n) P(xi | θ)

 

Log of likelihood

Taking the product of all these conditional probabilities is cumbersome (and numerically unstable for many observations), so to make it easier we take the natural logarithm on both sides:

ln L(X ; θ) = ln( ∏ (i = 1 to n) P(xi | θ) )

which becomes

ln L(X ; θ) = ∑ (i = 1 to n) ln P(xi | θ)

Since the logarithm is strictly increasing, the parameters that maximise the log-likelihood also maximise the likelihood itself.

MLE is an optimisation technique that underlies many machine learning models, such as logistic regression and linear regression.
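As an illustration, here is a minimal sketch (my own example, not from the original blog) that estimates the mean and standard deviation of a Gaussian by numerically maximising the log-likelihood with SciPy, and compares the result with the closed-form sample statistics.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # synthetic observations

def negative_log_likelihood(params, x):
    """Negative of sum_i ln P(x_i | theta) for a Gaussian with theta = (mu, sigma)."""
    mu, sigma = params
    if sigma <= 0:                                # sigma must stay positive
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Maximising the log-likelihood is the same as minimising its negative.
result = minimize(negative_log_likelihood, x0=[0.0, 1.0], args=(data,),
                  method="Nelder-Mead")
mu_mle, sigma_mle = result.x

print("MLE estimates:      mu =", round(mu_mle, 3), " sigma =", round(sigma_mle, 3))
# For a Gaussian, the closed-form MLE is the sample mean and the (biased) sample std.
print("Closed-form checks: mu =", round(data.mean(), 3),
      " sigma =", round(data.std(), 3))
```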

Bayesian Estimation

Bayes Theorem 

Most of you may already be aware of Bayes' theorem, proposed by Thomas Bayes. The theorem gives a formula for conditional probability:

P(A | B) = P(B | A) · P(A) / P(B)

Here we compute the probability of event A given that event B has occurred, while P(A) and P(B) are the marginal (unconditional) probabilities of events A and B on their own.

You may also come across these quantities in purely statistical terminology:

P(A) is the prior probability: the probability of the event before we take any new piece of information into account.

P(B) is referred to as the evidence: how likely the observation B is overall, averaged over our prior beliefs about A.

P(B | A) is the likelihood: how likely the observation B is for a fixed A.

P(A | B) is the posterior probability: the updated probability of A after B has been observed.
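As a quick numeric sketch (the numbers here are invented purely for illustration), here is how the four quantities fit together for a classic diagnostic-test scenario:

```python
# Bayes' theorem with made-up numbers: P(disease | positive test).
p_disease = 0.01                   # prior P(A): 1% of people have the disease
p_pos_given_disease = 0.95         # likelihood P(B | A): test sensitivity
p_pos_given_healthy = 0.05         # false-positive rate

# Evidence P(B): total probability of a positive test under the prior.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B) = P(B | A) * P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print("Posterior P(disease | positive) =", round(p_disease_given_pos, 3))  # ~0.161
```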

Bayesian Estimation


In Bayesian Estimation, the same equation is applied to probability distributions over parameters instead of single numeric values. Writing the data as D and the parameters as θ, Bayes' theorem becomes:

P(θ | D) = P(D | θ) · P(θ) / ∫ P(D | θ) P(θ) dθ

Notice that the evidence in the denominator is written as the integral of the numerator over all parameter values. This is because P(D) is hard to compute directly, does not depend on θ, and writing it this way ensures that the posterior distribution integrates to 1.

Here ∫ P(D | θ) P(θ) dθ is known as the evidence.

In Bayesian Estimation, we therefore compute a distribution over the parameter space, known as the posterior pdf, P(θ | D).



We see that Bayesian estimation combines both the prior distribution and the likelihood function to produce the posterior distribution.
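To illustrate (this is my own sketch, not from the original blog), here is a conjugate Beta-Binomial example: a Beta prior over a coin's heads probability is combined with observed flips to give a Beta posterior, and the posterior mean is compared with the MLE.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
true_p = 0.7
flips = rng.random(40) < true_p           # 40 coin flips, heads with probability 0.7
heads, tails = flips.sum(), (~flips).sum()

# Prior over theta (the heads probability): Beta(2, 2), mildly favouring fair coins.
a_prior, b_prior = 2, 2

# With a Beta prior and a Bernoulli likelihood, the posterior is also a Beta
# distribution (conjugacy): Beta(a_prior + heads, b_prior + tails).
a_post, b_post = a_prior + heads, b_prior + tails
posterior = beta(a_post, b_post)

print("MLE of theta            =", round(heads / (heads + tails), 3))
print("Posterior mean of theta =", round(posterior.mean(), 3))
print("95% credible interval   =", np.round(posterior.interval(0.95), 3))
```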

Key Differences between MLE and Bayesian Estimation

While both Maximum Likelihood Estimation and Bayesian Estimation are parameter estimation techniques based on probability distributions, there are some key differences between the two:

  • MLE treats the parameters as fixed, unknown quantities and returns a single point estimate, while Bayesian estimation treats the parameters as random variables and returns a full posterior distribution over them.
  • MLE uses only the likelihood of the observed data, whereas Bayesian estimation combines the likelihood with a prior distribution that encodes beliefs held before seeing the data.
  • MLE is usually simpler and cheaper to compute; Bayesian estimation may require evaluating or approximating the evidence integral, which is often intractable.
  • With a uniform (uninformative) prior, or as the number of observations grows large, Bayesian estimates tend to agree with the MLE.


Frequently Asked Questions

  1. Given suitable conditions for both the techniques, which one should be preferred?
    A general consensus is that Bayesian Estimation provides more accurate results than MLE, but it is also more expensive to compute.
     
  2. How are Maximum Likelihood Estimation and Bayesian Estimation different from other parameter optimisation techniques?
    Both Maximum Likelihood Estimation and Bayesian Estimation rely on a likelihood function to decide which parameters give the best-fitting model, something that other techniques, such as ordinary least squares (OLS), do not use directly.
     
  3. When do Maximum Likelihood Estimation and Bayesian Estimation predict similar values?
    There are a few conditions under which Bayesian estimation comes extremely close to MLE. When the Bayesian prior is uniform over all parameter values, Bayesian predictions are very close to the MLE. Also, if the prior is well defined and non-zero at all observed values, Bayesian estimation and MLE converge to the same value, provided we have plenty of observations (see the short sketch below).
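As a rough illustration of that convergence (again an invented sketch, reusing the conjugate Beta-Binomial setup from earlier), the gap between the posterior mean and the MLE shrinks as the number of observations grows:

```python
import numpy as np

rng = np.random.default_rng(2)
true_p = 0.7
a_prior, b_prior = 2, 2                   # Beta(2, 2) prior over the heads probability

for n in (10, 100, 10_000):
    flips = rng.random(n) < true_p        # n coin flips
    heads = flips.sum()
    mle = heads / n                       # maximum likelihood estimate of theta
    post_mean = (a_prior + heads) / (a_prior + b_prior + n)   # Beta posterior mean
    print(f"n={n:>6}  MLE={mle:.4f}  posterior mean={post_mean:.4f}  "
          f"gap={abs(mle - post_mean):.4f}")
```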

Conclusion

This blog briefly explains and contrasts the two most widely used parameter estimation techniques, Maximum Likelihood Estimation and Bayesian Estimation, along with the conditions under which each works best and their key differences. We advise readers to go through the blog thoroughly. You may also check out our industry-oriented machine learning courses curated by industry experts.

Happy Learning!
