Introduction
For a discrete variable, probabilities are simple to calculate: each outcome has its own probability. But a continuous variable can take infinitely many values, so the probability of any single exact value is zero, and probability must instead be described over ranges of values. The function that describes how probability is distributed over such continuous variables is called a probability density function (PDF) in statistics.
Probability Density Function
The probability density function describes the relationship between a variable's values and their probabilities, such that we can find the probability of the variable falling in a range of values using that function. We refer to the overall shape of the probability density as a probability distribution. A PDF calculates probabilities for a specific random variable following a named distribution, such as the uniform, normal, or exponential distribution.
Knowing the probability distribution can help calculate the distribution's moments, like the mean and variance. It can also help in determining whether an observation is an outlier or not.
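For example, if we know a variable follows a normal distribution, its moments follow directly from the distribution's parameters. A quick sketch using SciPy (the loc and scale values here are arbitrary illustrative choices):

```python
from scipy.stats import norm

# for a normal distribution, the moments follow directly from
# its parameters: mean = loc, variance = scale**2
mu, var = norm.stats(loc=50, scale=5, moments='mv')
print(mu, var)  # 50.0 25.0
```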
Now, the issue is we may not know the probability distribution. We rarely see the distribution because we don't have access to all possible outcomes for a random variable. The only thing we have is a sample of observations.
We refer to this problem as probability density estimation: we use the observations in the available random sample to estimate the underlying probability density.
Estimating Probability Density Function
This post will focus on univariate data, i.e., one random variable. We can apply the same methods to multivariate data.
There are three main steps: summarizing the density with a histogram, parametric density estimation, and non-parametric density estimation.
Summarizing the density with a histogram
We first convert the data into discrete form by plotting it on a histogram. A histogram groups the observations into bins and counts the number of events in each bin. Each bin's counts, or frequencies of observations, are then plotted as a bar graph. The selection of the number of bins is crucial, as it determines how many bars the histogram will have and their width, i.e., how well the density of the observations is plotted. It is good practice to experiment with different numbers of bins to get multiple perspectives on the same data.
Let us look at the implementation part:
Implementation
Firstly, we import all the necessary modules.
from matplotlib import pyplot
from numpy.random import normal
from numpy import mean
from numpy import std
from scipy.stats import norm
Data sample generation
sample = normal(size=1000)
We use the normal function to create 1,000 data samples with a mean of zero and a standard deviation of one.
We create a histogram using the hist() function. We provide data as the first argument and the number of bins as the second argument.
pyplot.hist(sample, bins=10)
pyplot.show()
Output
In the above code, we choose the value of bins to be ten; now, let us select different values of bins to check if we get a bell curve or not.
pyplot.hist(sample, bins=4)
pyplot.show()
Output
As you can see, this histogram doesn't resemble a bell shape as closely as the one with ten bins, which can make it harder to recognize the type of distribution.
Reviewing a histogram with different numbers of bins will help to identify whether the density looks like a standard probability distribution or not.
In most cases, we will see a unimodal distribution, such as the bell shape of the normal distribution. We can also have complex bimodal or multimodal distributions, where multiple peaks do not disappear with different numbers of bins.
Parametric Density Estimation
A PDF can take a shape similar to many standard functions. The formed histogram helps to determine the type of function. We can calculate the parameters associated with the function to get our density.
For example, the mean and standard deviation are the two parameters of the normal distribution. Once we know these parameters, we know the PDF. We can estimate them by calculating the sample mean and sample standard deviation. This process is known as parametric density estimation because we use a predefined function, characterized by its parameters, to establish the relationship between observations and their probability.
After estimating density, we can check if it is a good fit or not. There are many ways to check, such as:
Sampling the density function and comparing the generated sample to the real sample.
Plotting the density function and comparing the shape to the histogram.
Using a statistical test to confirm the data fits the distribution.
Implementation
Firstly, we import all the necessary modules.
from matplotlib import pyplot
from numpy.random import normal
from scipy.stats import norm
from numpy import mean
from numpy import std
We generate a random sample of 10,000 observations from a normal distribution with a mean of fifty and a standard deviation of five.
data = normal(loc=50, scale=5, size=10000)
We pretend that we don't know the probability distribution, look at a histogram and guess it is normal. Assuming it is normal, we can calculate the mean and standard deviation distribution parameters.
We should not expect the sample mean and standard deviation to be exactly fifty and five, because of sampling noise.
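Putting the parametric approach together, a sketch might look like this (the plotting range of 30 to 70 is an arbitrary choice around the sample mean):

```python
from matplotlib import pyplot
from numpy import mean, std, linspace
from numpy.random import normal
from scipy.stats import norm

# generate the sample as before
data = normal(loc=50, scale=5, size=10000)

# estimate the distribution parameters from the sample
sample_mean = mean(data)
sample_std = std(data)
print('Mean=%.3f, Standard Deviation=%.3f' % (sample_mean, sample_std))

# define the fitted normal distribution and evaluate its PDF
dist = norm(sample_mean, sample_std)
values = linspace(30, 70, 100)
probabilities = [dist.pdf(v) for v in values]

# overlay the fitted PDF on a normalized histogram of the sample
pyplot.hist(data, bins=10, density=True)
pyplot.plot(values, probabilities)
pyplot.show()
```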
As we can see, the distribution we assumed is a good fit for the data sample. If it were not, we would have to assume the sample follows some other distribution and repeat the process.
Non-parametric Density Estimation
There are cases when the shape of the histogram doesn't match a common PDF and cannot be made to fit one. This generally happens when the data has a bimodal distribution (two peaks) or a multimodal distribution (multiple peaks). In such cases, parametric density estimation is not feasible, so alternative methods must be used. We use an algorithm that approximates the probability distribution of the data without a predefined distribution, referred to as a nonparametric method.
The commonly used nonparametric approach for estimating the PDF of a continuous random variable is kernel density estimation or kernel smoothing.
A kernel is a function that returns a probability for a given value of a random variable. The kernel effectively smooths the probabilities across the range of outcomes so that the total probability integrates to one.
The smoothing parameter, or bandwidth, controls the window of samples used to estimate the probability for a new point. We can shape the contributions of samples within the window using different functions, known as basis functions, e.g., uniform, normal, etc., with varying effects on the smoothness of the resulting density function. The basis function is chosen to control the contribution of samples in the dataset toward estimating the probability of a new point.
Implementation
Importing libraries
from matplotlib import pyplot
from numpy import hstack
from numpy import asarray
from numpy import exp
from numpy.random import normal
from sklearn.neighbors import KernelDensity
For performing non-parametric estimation, we need a multimodal or bimodal distribution. In the example below, we use a bimodal distribution. We create two normally distributed data samples consisting of different numbers of data points.
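A sketch of the sample generation (the means of twenty and forty, and the sizes of 300 and 700, are illustrative choices):

```python
from numpy import hstack
from numpy.random import normal

# two normal samples with different means produce a bimodal shape
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
data = hstack((sample1, sample2))
print(data.shape)  # (1000,)
```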
Now, we use kernel density estimation to build a model, which we then fit to our sample to estimate the probability distribution. The input to fit() must be two-dimensional, so we reshape our data sample to have 1,000 rows and 1 column.
model = KernelDensity(bandwidth=2, kernel='gaussian')
data = data.reshape((len(data), 1))
model.fit(data)
Now we evaluate how well the density estimate fits our data. We calculate the probability for a range of observations; in our case, from ten to sixty. Lastly, we compare the shape of the estimated PDF to the histogram.
values = asarray([value for value in range(10, 60)])
values = values.reshape((len(values), 1))
# score_samples returns log densities, so we exponentiate
prob = model.score_samples(values)
prob = exp(prob)
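To compare the estimate against the data, we can overlay the estimated PDF on a normalized histogram. A self-contained sketch (the bimodal sample parameters here are illustrative choices):

```python
from matplotlib import pyplot
from numpy import hstack, asarray, exp
from numpy.random import normal
from sklearn.neighbors import KernelDensity

# bimodal sample (illustrative parameters)
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
data = hstack((sample1, sample2)).reshape((1000, 1))

# fit the kernel density model
model = KernelDensity(bandwidth=2, kernel='gaussian')
model.fit(data)

# evaluate the estimated PDF over a range of values
values = asarray([v for v in range(10, 60)]).reshape((50, 1))
prob = exp(model.score_samples(values))

# overlay the estimate on a normalized histogram of the sample
pyplot.hist(data[:, 0], bins=50, density=True)
pyplot.plot(values[:, 0], prob)
pyplot.show()
```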
We can see the PDF fits the histogram quite well. We can make the PDF smoother by changing the bandwidth value. We can experiment with different bandwidth values and kernel functions.
FAQs
What are the conditions of a probability density function? A probability density function must satisfy two requirements. Firstly, f(x) must be nonnegative for every value of the random variable, and secondly, the integral of f(x) over all values of the random variable must equal one.
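We can check these conditions numerically for the standard normal PDF, integrating over [-10, 10] as a stand-in for the whole real line, since the tails beyond that are negligible:

```python
from scipy.integrate import quad
from scipy.stats import norm

# the PDF is nonnegative everywhere, and its total area is one
area, _ = quad(norm.pdf, -10, 10)
print(round(area, 6))  # 1.0
```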
What is the significance of the PDF? The PDF helps us calculate the likelihood of a continuous variable falling within a range of values. A discrete variable can be measured exactly, while a continuous variable can take infinitely many values.
Why is the PDF always nonnegative? The PDF is the derivative of the distribution function, and the distribution function is monotonically increasing on the real line, so its derivative, the PDF, is always nonnegative.
Key Takeaways
Let us summarize the article. First, we saw what a PDF is and its significance. Then we saw how to estimate the PDF using histograms, parametric density estimation, and nonparametric density estimation. That is all about the PDF in statistics. Check out this problem - Largest Rectangle in Histogram
I hope you all like this article.
Happy Learning Ninjas!!