Central Limit Theorem



Statistics for Data Science and Machine Learning


Inferential Statistics

The Central Limit Theorem states that the sampling distribution of the sample mean (the distribution you get by taking the means of many batches) approximates a normal distribution, no matter what distribution the population has.

This theorem is one of the most important in statistics: it lets us make inferences based on the normal distribution, the easiest distribution to analyze, regardless of the population distribution.

Theorem

Let X₁, …, Xₙ be a random sample from some population with mean μ and variance σ². Then, for large n:

X̄ ≈ N(μ, σ²/n)

That is, the sample mean X̄ is approximately normally distributed with mean μ and variance σ²/n.

This means that, whatever the original distribution, if we take random samples with replacement and calculate the mean of each sample, the distribution of the resulting means will be approximately normal.

Sampling with replacement means that every value remains available for each draw, so the same value can appear in more than one sample.
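
A minimal sketch of this procedure, assuming NumPy and an exponential population chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Any clearly non-normal population works; exponential is used for illustration
population = rng.exponential(scale=2.0, size=1000)

# 5,000 samples of size 30, drawn with replacement; keep the mean of each
sample_means = rng.choice(population, size=(5000, 30), replace=True).mean(axis=1)

# Per the theorem, the means should be centred near mu with variance near sigma^2 / n
print(population.mean(), sample_means.mean())
print(population.var() / 30, sample_means.var())
```

A histogram of sample_means shows the familiar bell shape even though the population itself is strongly skewed.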

There is no fixed rule for how large the original data or the random samples must be, but statisticians often cite the rule of thumb of a sample size larger than 30. In cases of heavy skewness, the required size can reach 40 or more.

The exact number of elements per sample will depend on each case, and with no rule to define it, we need to try out different values.

The Central Limit Theorem can be used on both continuous and dichotomous data; a quick sketch of the dichotomous case follows before the main examples.
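
For dichotomous (0/1) data the resampling procedure is identical, and the sample means are just sample proportions. A minimal sketch, again assuming NumPy and a made-up 30% success rate:

```python
import numpy as np

rng = np.random.default_rng(7)

# Dichotomous population: 1,000 individuals with 0/1 outcomes (about 30% ones)
population = rng.binomial(n=1, p=0.3, size=1000)

# The mean of a 0/1 sample is its proportion of ones; by the CLT these
# proportions are approximately N(p, p(1 - p) / n) for large enough n
proportions = rng.choice(population, size=(5000, 50), replace=True).mean(axis=1)
print(proportions.mean(), proportions.var())
```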

Examples

Left Skewed Data

The first example uses left-skewed data: the original data comes from a population of 1,000 individuals and has the following skewness:

[Figure: left-skewed data, self-generated using SciPy]

Let's apply the Central Limit Theorem to get a normal distribution. We will try out different sample sizes, starting with n = 10:
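
The post does not show the code for this step; here is a hypothetical reconstruction, assuming SciPy's skewnorm with a negative shape parameter to produce the left skew:

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)

# Hypothetical population: the shape parameter a = -10 is an assumption made
# here to reproduce a left skew, not the post's actual generation code
population = skewnorm.rvs(a=-10, size=1000, random_state=42)

# Means of 1,000 samples of size n = 10, each drawn with replacement
sample_means = np.array([
    rng.choice(population, size=10, replace=True).mean()
    for _ in range(1000)
])
```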

[Figure: left-skewed data with the CLT applied (n = 10), self-generated using SciPy]

As we can see, the result looks similar to a normal distribution. Let's check it with the Shapiro-Wilk test (a test used to check whether a distribution is normal; it will be explained in future posts).
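
The test code isn't shown in the post; a minimal sketch that produces output in this format, assuming SciPy's shapiro function, a 5% significance level, and the sample_means array from the sketch above:

```python
from scipy.stats import shapiro

# H0: the sample was drawn from a normal distribution
stat, p = shapiro(sample_means)
print('Statistics=%s, p=%s' % (stat, p))

alpha = 0.05  # assumed significance level
if p > alpha:
    print('Sample looks Normal (fail to reject H0)')
else:
    print('Sample does not look Normal (reject H0)')
```

Running the test on the n = 10 sample means gives: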

Statistics=0.9887740015983582, p=6.264754119911231e-07
Sample does not look Normal (reject H0)

The results are quite promising, but the sample still fails the Shapiro-Wilk test, so let's increase n to 15.

[Figure: left-skewed data with the CLT applied (n = 15), self-generated using SciPy]

After increasing the sample size, the p-value improves by roughly three orders of magnitude, although at the 5% (and even 1%) significance level the test still rejects normality:

Statistics=0.9943469762802124, p=0.000821537512820214
Sample does not look Normal (reject H0)

After some more trial and error with the sample size, you will be able to reach a distribution that passes the test and can be used as data for your models.

Summary

This post introduced the theorem that allows data scientists to work with the distribution that has the most convenient properties.

The normal distribution gives data scientists many easy-to-apply tools, and it underlies nearly all classic machine learning models.