Table of contents
1. Introduction
2. What is the Central Limit Theorem (CLT)?
3. Mathematical Intuition of the Central Limit Theorem
4. Code Implementation of Central Limit Theorem
5. FAQs
6. Key Takeaways
Last Updated: Mar 27, 2024

Limit Theorem

Author Tushar Tangri

Introduction

Statistics appears almost everywhere in machine learning, so it is important to understand the theory behind the statistical tools we apply in practice. The central limit theorem is one such result, with applications across machine learning, data science, and deep learning.

In this blog, we will discuss the central limit theorem, the mathematical intuition behind it, and a Python demonstration of it.

What is the Central Limit Theorem (CLT)?

The central limit theorem states that, under fairly general conditions, the average of independent random variables tends toward a normal distribution as the number of variables being averaged (the sample size) increases. This is particularly useful in statistical inference, where we estimate population parameters from sample statistics. For example, the means of many samples drawn from the same population will themselves form an approximately normal distribution.

We can use this fact to put bounds on our population estimates. It also lets us estimate how likely it is that a sample mean lies far from the population mean.
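As a rough illustration of that last point, the sketch below uses the normal approximation to estimate how likely an unusually high sample mean is. The population mean (172 cm), the standard deviation (5 cm), the sample size (10), and the 175 cm threshold are all assumed values chosen for illustration, not measured data.

```python
from math import sqrt
from statistics import NormalDist

# Assumed, illustrative values (not measured data).
pop_mean = 172.0   # population mean height in cm
pop_std = 5.0      # population standard deviation in cm
n = 10             # number of people in one sample

# Under the CLT, the sample mean is approximately normal with
# mean pop_mean and standard deviation pop_std / sqrt(n).
sampling_dist = NormalDist(mu=pop_mean, sigma=pop_std / sqrt(n))

# Approximate probability that the mean of one sample exceeds 175 cm.
p_extreme = 1.0 - sampling_dist.cdf(175.0)
print(f"P(sample mean > 175 cm) is roughly {p_extreme:.4f}")
```

The same calculation with a larger n gives a smaller probability, because the sample mean then varies less around the population mean.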

Mathematical Intuition of the Central Limit Theorem

According to the Central Limit Theorem (CLT), the sampling distribution of the mean comes to resemble a normal distribution as the sample size increases. The standard deviation of the sampling distribution of the mean is calculated as follows:

σ_x̄ = σ / √n

where σ_x̄ is the standard deviation of the sampling distribution of the mean (also called the standard error), σ is the population standard deviation, and n is the sample size.
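To make the relationship concrete, here is a small worked calculation of the standard error, that is, the population standard deviation divided by the square root of the sample size. The helper name `standard_error` and the value σ = 5 are assumptions made for this sketch.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

sigma = 5.0  # assumed population standard deviation
for n in (10, 100, 1000):
    print(f"n = {n:>4}: standard error = {standard_error(sigma, n):.3f}")
```

Because n sits under a square root, a sample 100 times larger only narrows the spread of the sample mean by a factor of 10.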

This is significant because we frequently deal with sampling from populations, and everything we understand about the population comes from the sample. We wish to draw conclusions about the population based on this sample. 

We can do this by plotting a histogram of the sample values and computing their mean and standard deviation as estimates of the population parameters. The theorem then tells us that as the sample size grows, the distribution of the means of many such samples will resemble a Gaussian distribution.
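The theorem is most striking when the population itself is not Gaussian. The sketch below is an illustration under assumed settings (an exponential population, an arbitrary seed, and arbitrarily chosen sample sizes): it measures the skewness of the sample means, which should shrink toward 0, the skewness of a Gaussian, as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

def skewness(values):
    """Sample skewness: close to 0 for a symmetric, Gaussian-like shape."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / values.std() ** 3

n_repeats = 10_000  # number of independent samples drawn per sample size
skew_by_n = {}
for n in (1, 5, 30, 200):
    # Each row is one sample of size n from a right-skewed exponential population.
    samples = rng.exponential(scale=1.0, size=(n_repeats, n))
    sample_means = samples.mean(axis=1)
    skew_by_n[n] = skewness(sample_means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.3f}")
```

With n = 1 the "means" are just the raw exponential draws, so the skewness is large; as n increases, the means become more symmetric.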

Think of running a trial and recording its result as an observation. Repeating the trial gives a second, independent observation. A sample is a collection of many such observations.

Code Implementation of Central Limit Theorem 

To better understand how the theorem works, we will apply the concepts above in Python using NumPy, SciPy, and Matplotlib.

Assume that we live in a town of 50,000 people and that we know the height of everyone in it. That gives us 50,000 numbers, which tell us everything we need to know about our population. We'll place ourselves in a town named 'town 47', where the population mean height is 172 cm and the standard deviation is 5 cm, and we'll simulate this data.

Step 1: Import all the required libraries 

from scipy.stats import norm      # normal distribution used to generate heights
import numpy as np                # sampling and summary statistics
from numpy.random import seed     # fixes the random state for repeatable results
import matplotlib.pyplot as plt   # plotting

 

Step 2: Generate the data described above: heights with a mean of 172 cm and a standard deviation of 5 cm for the 50,000 people in town 47.

seed(47)  # fix the random state so the results are repeatable
# draw 50,000 heights from a normal distribution with mean 172 and standard deviation 5
pop_heights = norm.rvs(172, 5, size=50000)

 

Step 3: Plot a histogram of the generated population to see its shape. The vertical lines mark the mean and one and two standard deviations on either side of it.

_ = plt.hist(pop_heights, bins=30)
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in entire town population')
# solid line: population mean
_ = plt.axvline(172, color='r')
# dashed lines: one standard deviation either side of the mean
_ = plt.axvline(172+5, color='r', linestyle='--')
_ = plt.axvline(172-5, color='r', linestyle='--')
# dash-dotted lines: two standard deviations either side of the mean
_ = plt.axvline(172+10, color='r', linestyle='-.')
_ = plt.axvline(172-10, color='r', linestyle='-.')

 

Output: 

As expected, the histogram follows a bell curve, because we drew the population directly from a normal distribution.
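The lines drawn at one and two standard deviations can also be checked numerically. The sketch below regenerates a comparable population with NumPy's `default_rng` (an assumption made to keep the snippet self-contained, so the exact values differ from `pop_heights` above) and counts the share of people within each band. For a normal distribution these shares should be close to 68% and 95%.

```python
import numpy as np

rng = np.random.default_rng(47)
# Stand-in for pop_heights: 50,000 heights with mean 172 cm and std 5 cm.
heights = rng.normal(loc=172, scale=5, size=50_000)

# Fraction of the population within one and two standard deviations of the mean.
within_1_sd = np.mean(np.abs(heights - 172) <= 5)
within_2_sd = np.mean(np.abs(heights - 172) <= 10)

print(f"within 1 standard deviation:  {within_1_sd:.1%}")
print(f"within 2 standard deviations: {within_2_sd:.1%}")
```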

Step 4: Now let's see the central limit theorem at work. We first draw a small random sample from the town, say 10 people, and later compare it with the distribution of the means of many samples of the same size.

def townsfolk_sampler(n):
    # draw n heights at random (with replacement) from the population
    return np.random.choice(pop_heights, n)

seed(47)
daily_sample1 = townsfolk_sampler(10)

 

Step 5: Let's visualize the sample of 10 people.

_ = plt.hist(daily_sample1, bins=10)
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in sample size 10')

 

Output: 

 

Step 6: Let's calculate the mean of this sample.

np.mean(daily_sample1)

 

Output:

173.47911444163503

 

Step 7: Let's repeat the sampling for 365 days straight, drawing a sample of 10 people each day and recording the daily mean.

seed(47)
# draw a sample of 10 people on each of 365 days and record each day's mean height
year_samples = []
for i in range(365):
    year_samples.append(np.mean(townsfolk_sampler(10)))

 

Step 8: Now that we have collected the daily means, let's plot their distribution and compute its mean and standard deviation.

_ = plt.hist(year_samples, bins=10)
_ = plt.xlabel('daily mean height (cm)')
_ = plt.ylabel('number of days')
_ = plt.title('Distribution of means of sample size 10 collected daily for 1 year.')

mean = np.mean(year_samples)
std = np.std(year_samples)

_ = plt.axvline(mean, color='r')
_ = plt.axvline(mean+std, color='r', linestyle='--')
_ = plt.axvline(mean-std, color='r', linestyle='--')
_ = plt.axvline(mean+(2*std), color='r', linestyle='-.')
_ = plt.axvline(mean-(2*std), color='r', linestyle='-.')
plt.show()
print('sample mean: ~' + str(mean.round(2)), '\nsample std: ~'+ str(std.round(2)))

 

Output:

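The visualization is consistent with the central limit theorem: although each daily sample contains only 10 people, the means of the 365 samples cluster around the population mean and form an approximately normal, bell-shaped distribution.

A single sample of size 10 tells us little about the shape of the distribution, but the means of 365 such samples show the pattern the theorem predicts, and they are spread far more narrowly than the individual heights.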
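One way to check the experiment against the theory is to compare the spread of the daily means with the standard error σ/√n. The sketch below repeats the experiment in a self-contained form using NumPy's `default_rng`, which is an assumption made here, so its numbers will not match the outputs above exactly.

```python
import numpy as np

rng = np.random.default_rng(47)
# Stand-in population: 50,000 heights with mean 172 cm and std 5 cm.
population = rng.normal(loc=172, scale=5, size=50_000)

n = 10  # people sampled per day
# One mean per day for 365 days, sampling with replacement.
daily_means = np.array([rng.choice(population, size=n).mean() for _ in range(365)])

observed_std = daily_means.std()
predicted_std = population.std() / np.sqrt(n)  # standard error from the formula

print(f"mean of daily means:      {daily_means.mean():.2f} cm")
print(f"observed std of means:    {observed_std:.2f} cm")
print(f"predicted sigma/sqrt(n):  {predicted_std:.2f} cm")
```

The observed and predicted values should agree closely, which is the quantitative content of the histogram above.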

FAQs

  1. Why is CLT important in machine learning?
    The CLT plays an important role in statistical inference for machine learning. It quantifies how much a larger sample size reduces sampling error, which tells us the precision or margin of error of estimates computed from samples, such as means and percentages.
     
  2. What is the mathematical use of CLT? 
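    The Central Limit Theorem (CLT) states that, for a sufficiently large sample size, the sampling distribution of the mean is approximately normal regardless of the shape of the population distribution, provided the population has a finite variance. Two properties hold: the mean of the sampling distribution equals the population mean (μₓ¯ = μ), and the standard deviation of the sampling distribution (the standard error) equals σ/√n, which can be estimated by S/√n, where S is the sample standard deviation.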
     
  3. What are the assumptions that we make while using CLT? 
    The data must be sampled at random, and the observations must be independent of one another, so that one draw does not influence the others. When sampling without replacement, the sample size should not exceed 10% of the population.
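The first answer above says that a larger sample size reduces sampling error. The sketch below illustrates this under assumed settings (a simulated population, an arbitrary seed, and a hypothetical helper `spread_of_means`): each time the sample size is quadrupled, the spread of the sample means should roughly halve.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
# Simulated population: 50,000 heights with mean 172 cm and std 5 cm.
population = rng.normal(172, 5, size=50_000)

def spread_of_means(n, repeats=2_000):
    """Standard deviation of the means of `repeats` random samples of size n."""
    # Random indices select the members of each sample (with replacement).
    idx = rng.integers(0, population.size, size=(repeats, n))
    return population[idx].mean(axis=1).std()

spreads = {n: spread_of_means(n) for n in (10, 40, 160)}
for n, spread in spreads.items():
    print(f"n = {n:>3}: std of sample means = {spread:.3f} cm")
```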

Key Takeaways

In this article, we used randomly generated data to illustrate the central limit theorem in Python. We started with the basics of the CLT, namely that the means of repeated samples tend toward a normal distribution, then covered the mathematical intuition and a code implementation. To learn more about similar concepts, follow our blogs to understand the subject better.