Table of contents
1. Introduction
2. Probability Density Function
3. Estimating Probability Density Function
   3.1. Summarizing the density with a histogram
      3.1.1. Implementation
   3.2. Parametric Density Estimation
   3.3. Non-parametric Density Estimation
      3.3.1. Implementation
4. FAQs
5. Key Takeaways
Last Updated: Mar 27, 2024

Probability Density Function

Author: Mayank Goyal

Introduction

For discrete variables, probabilities are simple to calculate: each outcome has its own probability. But a continuous variable can take infinitely many values, so the probability of any single exact value is zero, and we instead describe how probability is spread across a range of values. The function that describes this for continuous variables is called a probability density function in statistics.

Probability Density Function

The probability density describes the relationship between a variable and its probability, such that we can find the probability density of any value of the variable using that function. We refer to the overall shape of the probability density as a probability distribution. Standard distributions, such as the uniform, normal, and exponential, have well-known probability density functions.

Knowing the probability distribution can help calculate the distribution's moments, like the mean and variance. It can also help in determining whether an observation is an outlier or not.
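For instance, here is a minimal sketch using scipy.stats (assuming a standard normal distribution purely for illustration) that evaluates the density at a point, reads off the distribution's moments, and numerically checks that the density integrates to one:

from scipy.stats import norm
from scipy.integrate import quad

dist = norm(loc=0, scale=1)        # assumed standard normal, for illustration
print(dist.pdf(0.0))               # density at x = 0, about 0.3989
print(dist.mean(), dist.var())     # moments: mean 0.0, variance 1.0
area, _ = quad(dist.pdf, -10, 10)  # numerically integrate the density
print(area)                        # close to 1.0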

Now, the issue is we may not know the probability distribution. We rarely see the distribution because we don't have access to all possible outcomes for a random variable. The only thing we have is a sample of observations.

We refer to this problem as probability density estimation: we use the observations in a random sample to estimate the overall density of probabilities.

Estimating Probability Density Function

This post will focus on univariate data, i.e., a single random variable. The same methods can be applied to multivariate data.

There are three main approaches: summarizing the density with a histogram, parametric density estimation, and non-parametric density estimation.

Summarizing the density with a histogram

We first convert the data into a discrete form by plotting it as a histogram. A histogram groups the observations into bins and counts the number of observations that fall in each bin. The counts, or frequencies of observations, in each bin are then plotted as a bar graph.
The choice of the number of bins is crucial: it determines how many bars the histogram will have and their width, i.e., how well the density of the observations is represented. It is good practice to experiment with different numbers of bins to get multiple perspectives on the same data.

Let us look at the implementation part:

Implementation

Firstly, we import all the necessary modules.

# Imports used across the examples in this post
from matplotlib import pyplot
import numpy as np
from numpy.random import normal
from numpy import std
from scipy.stats import norm
from numpy import mean

Data sample generation

sample = normal(size=1000)

We use the normal() function to draw 1,000 samples from a normal distribution with mean zero and standard deviation one.

We create a histogram using the hist() function. We provide data as the first argument and the number of bins as the second argument.

pyplot.hist(sample, bins=10)
pyplot.show()

Output

In the above code, we chose ten bins; now let us try a different number of bins to check whether we still get a bell curve.

pyplot.hist(sample, bins=4)
pyplot.show()

Output

As you can see, this histogram does not resemble a bell shape as clearly as the one with ten bins, which can make it harder to recognize the type of distribution.

Reviewing a histogram with different numbers of bins will help to identify whether the density looks like a standard probability distribution or not.

In most cases, we will see a unimodal distribution, such as the familiar bell shape of the normal distribution. We can also see more complex bimodal or multimodal distributions, where multiple peaks persist across different numbers of bins.
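As a quick sketch of this kind of review (reusing the same 1,000-point sample as above), we can draw the histogram at several bin counts side by side:

from matplotlib import pyplot
from numpy.random import normal

sample = normal(size=1000)
# Draw the same sample at three different bin counts for comparison.
for i, bins in enumerate([5, 10, 20], start=1):
    pyplot.subplot(1, 3, i)
    pyplot.hist(sample, bins=bins)
    pyplot.title(f'{bins} bins')
pyplot.show()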

Parametric Density Estimation

A density can take a shape similar to many standard probability distributions. The histogram helps us guess which standard distribution the data resembles; we can then calculate the parameters associated with that distribution to obtain our density.

For example, the normal distribution has two parameters: the mean and the standard deviation. Given these two parameters, we know the PDF. We can estimate them by calculating the sample mean and sample standard deviation. This process is known as parametric density estimation because we use a predefined function, together with estimated parameters, to establish the relationship between observations and their probability.

After estimating the density, we can check whether it is a good fit. There are many ways to do this, such as:

  • Sampling the density function and comparing the generated sample to the real sample.
  • Plotting the density function and comparing the shape to the histogram.
  • Using a statistical test to confirm the data fits the distribution (see the sketch below).
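As an example of the third check, here is a sketch using scipy's Kolmogorov-Smirnov test via kstest (a function we bring in for illustration; note that estimating the parameters from the same data makes the test only approximate):

from numpy.random import normal
from scipy.stats import kstest

data = normal(loc=50, scale=5, size=10000)
# Compare the sample against a normal distribution with estimated parameters.
stat, p_value = kstest(data, 'norm', args=(data.mean(), data.std()))
print(stat, p_value)  # a large p-value gives no evidence against normality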

Implementation

Firstly, we import all the necessary modules.

from matplotlib import pyplot
from numpy.random import normal
from scipy.stats import norm
from numpy import std
from numpy import mean

We generate a random sample of 10,000 normally distributed observations with a mean of fifty and a standard deviation of five.

data = normal(loc=50, scale=5, size=10000)

We pretend that we don't know the true distribution, look at the histogram, and guess that it is normal. Assuming it is normal, we can calculate the mean and standard deviation as the distribution's parameters.

We do not expect the estimated mean and standard deviation to be exactly fifty and five because of sampling noise in a finite sample.

Calculating parameters

data_mean = mean(data)
data_std = std(data)
print((data_mean, data_std))

Output

(49.94316297520495, 4.9691467863059735)

We fit the distribution with these parameters.

dist = norm(data_mean, data_std)

Now we calculate the probabilities for a range of values under this distribution, in this case between 40 and 80.

values = [value for value in range(40, 80)]
prob = [dist.pdf(value) for value in values]

Finally, we plot the histogram of the data sample and overlay a line plot of the probabilities calculated for the range of values from the PDF.

pyplot.hist(data, bins=10, density=True)
pyplot.plot(values, prob)
pyplot.show()

Output

As we can see, the assumed distribution is a good fit for the data sample. If it were not, we would assume a different distribution and repeat the process.
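As a side note, scipy can estimate the parameters in a single call; here is a small sketch using norm.fit, which returns maximum-likelihood estimates of the location and scale:

from numpy.random import normal
from scipy.stats import norm

data = normal(loc=50, scale=5, size=10000)
loc_hat, scale_hat = norm.fit(data)  # maximum-likelihood estimates
print(loc_hat, scale_hat)            # close to 50 and 5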

Non-parametric Density Estimation

There are cases when the shape of the histogram does not match a common PDF and cannot be made to fit one. This generally happens when the data has a bimodal distribution (two peaks) or a multimodal distribution (multiple peaks). In such cases, parametric density estimation is not feasible, so we use an algorithm that approximates the probability distribution of the data without a predefined distribution. This is referred to as a nonparametric method.

The commonly used nonparametric approach for estimating the PDF of a continuous random variable is kernel density estimation or kernel smoothing.

A kernel is a function that returns a probability for a given value of a random variable. The kernel effectively smooths the probabilities across the range of outcomes for the random variable, so that the total, like any density, equals one.

The smoothing parameter, or bandwidth, controls the window of samples used to estimate the probability for a new point. We can shape the contributions of samples within the window using different functions, known as basis functions, e.g., uniform, normal, etc., with varying effects on the smoothness of the resulting density function. The basis function controls how much each sample in the dataset contributes toward estimating the probability of a new point.
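To make the idea concrete, here is a minimal hand-rolled sketch of a Gaussian kernel density estimate (the function name gaussian_kde_est and the bandwidth value are our own choices for illustration): every observation contributes a Gaussian bump, and the estimate at a point is the average of those bumps.

import numpy as np

def gaussian_kde_est(x, sample, bandwidth):
    # Distance of each query point from each observation, in bandwidth units.
    u = (x - sample[:, None]) / bandwidth
    # Average of Gaussian bumps centred on the observations.
    return np.mean(np.exp(-0.5 * u ** 2) / (bandwidth * np.sqrt(2 * np.pi)), axis=0)

sample = np.random.normal(size=200)
grid = np.linspace(-4, 4, 9)
print(gaussian_kde_est(grid, sample, bandwidth=0.5))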

Implementation

Importing libraries

from matplotlib import pyplot
from numpy import hstack
from numpy.random import normal
from sklearn.neighbors import KernelDensity
from numpy import asarray
from numpy import exp

For non-parametric estimation, we need a bimodal or multimodal distribution. In the example below, we construct a bimodal sample by combining two normally distributed samples with different numbers of data points.

data1 = normal(loc=20, scale=5, size=300)
data2 = normal(loc=40, scale=5, size=700)
data = hstack((data1, data2))

 


Plotting the distribution

pyplot.hist(data, bins=40, density=True)
pyplot.show()

Output

Now, we use kernel density estimation to build a model, fit it to our sample, and use it to produce a probability distribution curve. The input to the model must be two-dimensional, so we reshape our data sample to have 1,000 rows and 1 column.

model = KernelDensity(bandwidth=2, kernel='gaussian')
data = data.reshape((len(data), 1))
model.fit(data)

Now we evaluate how well the density estimate fits our data. We calculate the probability for a range of observations, in our case from ten to sixty, and compare the result to the shape of the histogram.

values = asarray([value for value in range(10,60)])
values = values.reshape((len(values), 1))
prob = model.score_samples(values)
prob = exp(prob)

Plotting the function.

pyplot.hist(data, bins=40, density=True)
pyplot.plot(values, prob)
pyplot.show()

Output

We can see the PDF fits the histogram quite well. We can make the PDF smoother by changing the value of the bandwidth, and it is worth experimenting with different bandwidth values and kernel functions.
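One principled way to pick the bandwidth, rather than trying values by hand, is cross-validation. Here is a sketch using scikit-learn's GridSearchCV, which scores each candidate by held-out log-likelihood (the candidate bandwidth list is an arbitrary choice for illustration):

from numpy import hstack
from numpy.random import normal
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

data = hstack((normal(loc=20, scale=5, size=300),
               normal(loc=40, scale=5, size=700))).reshape(-1, 1)

# Search over candidate bandwidths with 5-fold cross-validation.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': [0.5, 1.0, 2.0, 3.0, 5.0]}, cv=5)
grid.fit(data)
print(grid.best_params_)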

FAQs

  1. What are the conditions of a probability density function?
    A probability density function f(x) must satisfy two requirements: first, f(x) must be nonnegative for every value of the random variable, and second, the integral of f(x) over the entire range of the random variable must equal one.
     
  2. What is the significance of PDF?
The PDF helps us work with the likelihood of a continuous random variable. A discrete variable takes countable values whose probabilities can be measured directly, while a continuous variable can take infinitely many values, so we describe it through its density instead.
     
  3. Why is the PDF always positive?
    The PDF is the derivative of the cumulative distribution function. The distribution function is monotonically increasing on the real line, so its derivative, the PDF, is always nonnegative.

Key Takeaways

Let us briefly recap the article.
First, we looked at the PDF and its significance. Then we saw how to estimate the PDF from a sample of data using different methods: histograms, parametric density estimation, and nonparametric density estimation. That is all about the PDF in statistics.
Check out this problem - Largest Rectangle in Histogram

I hope you all like this article.

Happy Learning Ninjas!!
