Introduction
We use statistics almost everywhere in machine learning, but it is just as important to understand the theory behind the statistical tools as it is to apply them. The Central Limit Theorem is one such statistical result with importance and applications across machine learning, data science, deep learning, and so on.
In this blog, we will discuss this central statistical concept, better known as the Central Limit Theorem, along with its mathematical intuition and its Python implementation in machine learning.
What is the Central Limit Theorem (CLT)?
The Central Limit Theorem asserts that, under fairly general conditions, the average of independent random variables tends toward a normal distribution as the sample size grows. This is particularly useful when using statistical reasoning to estimate population parameters from sample statistics: the means of repeated samples, for example, will form an approximately normal distribution.
We can use this fact to put tighter bounds on our population estimates. It also lets us compute the likelihood that a sample mean deviates from the population mean by a given amount.
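As a quick worked example (the numbers here are hypothetical): if a population has mean μ = 172 and standard deviation σ = 5, the mean of a sample of n = 25 has standard error 5/√25 = 1, so the chance that the sample mean exceeds 174 is P(Z > (174 − 172)/1) = P(Z > 2) ≈ 2.3%.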
Mathematical Intuition of the Central Limit Theorem
According to the Central Limit Theorem (CLT), the distribution of the sample mean approaches a normal distribution as the sample size grows. The standard deviation of the sampling distribution of the mean is calculated as follows:
σx̄ = σ / √n
where σx̄ is the standard deviation of the sampling distribution of the mean (the standard error), σ is the population standard deviation, and n is the sample size.
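For instance, with a population standard deviation of σ = 5 cm, samples of size n = 100 give a standard error of 5/√100 = 0.5 cm; quadrupling the sample size to 400 halves it to 0.25 cm.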
This is significant because we frequently work with samples from populations: everything we know about the population comes from the sample, and we wish to draw conclusions about the population from it.
We can accomplish this by studying a histogram of the sampled values and computing their mean and standard deviation (as approximations of the population parameters). From the mathematical intuition above, we can conclude that as the sample size grows, the distribution of the mean across many samples will resemble a Gaussian distribution.
Think of running a trial and recording the result as making one observation. Repeating the procedure yields a second, independent observation, and a sample is simply a collection of such observations accumulated over time.
Code Implementation of Central Limit Theorem
To better understand how the theorem works, we will implement the concepts learned above in Python using libraries such as Matplotlib, NumPy, SciPy, and pandas.
Assume we live in a town of 50,000 people and know the height of everyone in it. We would then have 50,000 numbers that tell us everything we need to know about our population. We'll place ourselves in a town named 'town 47', where the mean height is 172 cm and the standard deviation is 5 cm, and simulate this data.
Step 1: Import all the required libraries
from scipy.stats import norm  # normal distribution utilities
from scipy.stats import t  # Student's t distribution (useful for small samples)
import numpy as np  # numerical arrays and random sampling
import pandas as pd  # tabular data handling
from numpy.random import seed  # fix the random seed for reproducibility
import matplotlib.pyplot as plt  # plotting
Step 2: Create the data according to the statistics discussed above, where the mean height is 172 cm and the standard deviation is 5 cm for an overall population of 50,000 people in town 47.
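The original snippet for this step isn't reproduced here, so the following is a minimal sketch assuming NumPy's normal random generator; the variable name population is our own.
seed(47)  # fix the random seed so the simulation is reproducible
population = np.random.normal(loc=172, scale=5, size=50000)  # simulated heights of all 50,000 residents
print(population.mean(), population.std())  # sanity check: should be close to 172 and 5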
Step 3: Plot the data to visualize the population distribution. Once plotted (see the sketch below), this large amount of data can clearly be seen to form a normal distribution curve.
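A plotting sketch of what this step might look like with Matplotlib (the bin count and labels are our own choices):
plt.hist(population, bins=100)  # histogram of all 50,000 heights
plt.title("Height distribution of town 47")
plt.xlabel("Height (cm)")
plt.ylabel("Number of people")
plt.show()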
Step 4: Now that we have understood and visualized the Central Limit Theorem, let's verify it by sampling a smaller number of people from the town, say 10 people, and then comparing that visualization with one built from the same sample size drawn many more times.
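The original sampling code isn't reproduced here; below is a sketch under the assumption that "taking 10 samples daily for 1 year" means 3,650 samples of 10 people each (all variable names are our own).
sample_means_few = [np.random.choice(population, size=10).mean() for _ in range(10)]  # means of 10 small samples
sample_means_many = [np.random.choice(population, size=10).mean() for _ in range(3650)]  # 10 samples a day for a year
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(sample_means_few, bins=10)  # few sample means: rough, irregular shape
axes[0].set_title("Means of 10 samples (n = 10 each)")
axes[1].hist(sample_means_many, bins=50)  # many sample means: close to a normal curve
axes[1].set_title("Means of 3,650 samples (n = 10 each)")
plt.show()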
Based on the above visualization, we have verified the Central Limit Theorem: the more sample means we collect, the more closely their distribution resembles a normal curve.
Compared to the first experiment with only 10 sample means, taking 10 samples daily for a year produces far more data, and the resulting distribution of means follows the Central Limit Theorem much more closely.
FAQs
Why is CLT important in machine learning? The CLT plays an important role in statistical inference for machine learning. It quantifies how much a larger sample size reduces sampling error, which tells us about the accuracy or margin of error of statistical estimates computed from samples, such as percentages.
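As a brief illustration (the sample size and standard deviation below are hypothetical), the margin of error for a sample mean can be computed with the already-imported norm:
n = 100  # hypothetical sample size
s = 5.0  # hypothetical sample standard deviation
margin_of_error = norm.ppf(0.975) * s / np.sqrt(n)  # 95% margin of error ≈ 1.96 * s / √n
print(round(margin_of_error, 2))  # ≈ 0.98; quadrupling n would halve this margin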
What is the mathematical use of CLT? The Central Limit Theorem (CLT) asserts that if a large enough number of samples is obtained, the distribution of their means approaches a normal distribution regardless of the underlying data. The following properties hold:
Mean of the sampling distribution (μx̄) = population mean (μ)
Standard deviation of the sampling distribution (standard error) = σ/√n ≈ s/√n
What are the assumptions that we make while using CLT? The data must be sampled at random, and the samples should be independent of one another: one sample should not influence the others. When sampling without replacement, the sample size should not exceed 10% of the population.
Key Takeaways
In this article, with the help of randomly generated data, we have implemented and verified the Central Limit Theorem in Python. We started with the basics of the CLT, seeing how a larger number of data samples, when visualized, tends toward a normal distribution curve, and then worked through the mathematical theory and the code implementation of the same. To learn more about similar concepts, follow our blogs to understand the subject better. Check out this problem - Largest Rectangle in Histogram