Original article was published by Andrew Rothman on Artificial Intelligence on Medium
Central Limit Theorem: Proofs & Actually Working Through the Math
… Not another ‘hand-wavy’ CLT explanation… Let’s actually work through the math
Background and Motivation
For anyone pursuing study in Data Science, Statistics, or Machine Learning, stating that “The Central Limit Theorem (CLT) is important to know” is an understatement. Particularly from a Mathematical Statistics perspective, in most cases the CLT is what makes recovery of valid inferential coverage around parameter estimates a tractable and solvable problem.
There are several articles on the Medium platform regarding the CLT. However, to my knowledge, not a single one delves into the mathematics of the theorem or even properly specifies the assumptions under which the CLT holds. This is a tremendous disservice in my view. These are mathematical foundations every practitioner in the above-mentioned fields should know.
It’s important to understand not only the mathematical foundations on which the CLT sits, but also the conditions under which the CLT doesn’t hold. For example, if we have n i.i.d. Cauchy distributed RVs, their sample mean does not converge in distribution to a normal: the Cauchy distribution has no defined mean or variance, so the usual mean-centered and standard-deviation-scaled sample mean cannot even be formed, and the CLT does not apply. If all we have is a “wishy-washy hand-wavy” understanding of the CLT, it would be hard to understand the Cauchy example. It is my hope that the information in this article can bridge that gap in knowledge for interested parties.
This article is split into three parts:
- CLT – Mathematical Definition (specifically the Lindeberg–Lévy CLT)
- Mathematical Preparations for Proving the CLT
- Proof of the Lindeberg–Lévy CLT
Note that the Central Limit Theorem is actually not one theorem; rather it’s a grouping of related theorems. These theorems rely on differing sets of assumptions and constraints holding. In this article, we will specifically work through the Lindeberg–Lévy CLT. This is the most common version of the CLT and is the specific theorem most folks are actually referencing when colloquially referring to the CLT.
So let’s jump in!
1. CLT – Mathematical Definition:
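Let’s describe in words (in a well-specified fashion) the Lindeberg–Lévy CLT: suppose we have n i.i.d. RVs, each with a finite mean and a finite, strictly positive variance. Then, as n grows, the sample mean, centered at the population mean and scaled by its own standard deviation, converges in distribution to the standard normal.
Okay, great. But how can we write this mathematically?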
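A standard way to write the statement is sketched below. It assumes Y* denotes the standardized sample mean; the symbols \mu, \sigma, \bar{X}_n and \Phi (the standard normal CDF) are notation assumed here.

```latex
\begin{align*}
& X_1, X_2, \dots, X_n \ \text{i.i.d.}, \qquad
  \mathbb{E}[X_i] = \mu, \qquad
  \operatorname{Var}(X_i) = \sigma^2, \quad 0 < \sigma^2 < \infty \\[4pt]
& \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad
  Y^* = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \\[4pt]
& \lim_{n \to \infty} P\left(Y^* \le y\right) = \Phi(y)
  \quad \text{for all } y \in \mathbb{R}
\end{align*}
```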
So, the (Lindeberg–Lévy) CLT tells us that, under the assumptions specified above, Y* (the standardized sample mean) converges in distribution to the standard normal distribution N(0,1).
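The claim can be checked empirically with a small simulation. The sketch below is illustrative only: the Exponential(1) distribution (mean 1, standard deviation 1), the function name, and the sample sizes are all assumptions chosen for the example. It estimates how often the standardized sample mean lands inside [-1.96, 1.96], an interval that a standard normal hits with probability of about 0.95.

```python
import math
import random


def standardized_mean_coverage(n, reps, seed=0):
    """Fraction of simulated Y* values that fall inside [-1.96, 1.96].

    Each Y* is built from n i.i.d. Exponential(1) draws, for which
    mu = 1 and sigma = 1 (an illustrative choice, not from the article).
    """
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    inside = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        # Center at the mean and scale by the standard deviation of xbar.
        y_star = (xbar - mu) / (sigma / math.sqrt(n))
        if abs(y_star) <= 1.96:
            inside += 1
    return inside / reps


if __name__ == "__main__":
    # The fraction should approach roughly 0.95 as n grows.
    for n in (2, 10, 100):
        print(n, standardized_mean_coverage(n, reps=4000))
```

The exponential distribution is strongly skewed, so it makes the effect of increasing n easy to see.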
2. Mathematical Preparations for Proving the CLT:
In preparation for proving the CLT, there are some mathematical facts and theorems we can leverage to our benefit:
2A: Change of Variables:
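The change of variables can be sketched as follows, assuming Y* is the standardized sample mean and S* is the scaled sum of the standardized variables. The symbols \mu, \sigma and W_i are notation assumed here.

```latex
\begin{align*}
W_i &= \frac{X_i - \mu}{\sigma}, \qquad
  \mathbb{E}[W_i] = 0, \qquad \operatorname{Var}(W_i) = 1 \\[4pt]
Y^* &= \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
  = \frac{\sqrt{n}}{\sigma} \cdot \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)
  = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}
  = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i
  = S^*
\end{align*}
```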
Above we can see that Y* and S* are mathematically equivalent sampling estimators. For proving the CLT, we will be using S*. This choice is simply a matter of mathematical convenience: proving the CLT with S* is easier than working directly with Y*.
2B: The Moment Generating Function (MGF) of a Standard Normal RV:
Below we derive the Moment Generating Function (MGF) of a standard Normal Random Variable Z~N(0,1). We will see why this is important in section 3.
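The derivation proceeds by completing the square in the exponent. The notation M_Z(t) for the MGF is assumed here.

```latex
\begin{align*}
M_Z(t) &= \mathbb{E}\left[e^{tZ}\right]
  = \int_{-\infty}^{\infty} e^{tz} \, \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-(z^2 - 2tz)/2} \, dz \\
% complete the square: z^2 - 2tz = (z - t)^2 - t^2
&= e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-(z - t)^2/2} \, dz \\
% the remaining integrand is the N(t, 1) density, which integrates to 1
&= e^{t^2/2}
\end{align*}
```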
2C: Properties of Moment Generating Functions (MGF):
In section 3 we will be algebraically manipulating MGFs. The properties below will be useful:
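The properties typically used in this kind of proof are summarized below. This list is assumed from standard MGF theory, for RVs X, Y and constants a, b, with the convergence property stated last.

```latex
\begin{align*}
&\text{Definition:} & M_X(t) &= \mathbb{E}\left[e^{tX}\right] \\
&\text{Moments (MGF finite near } 0\text{):} & M_X^{(k)}(0) &= \mathbb{E}\left[X^k\right] \\
&\text{Affine transformation:} & M_{aX + b}(t) &= e^{bt} \, M_X(at) \\
&\text{Sum of independent RVs:} & M_{X + Y}(t) &= M_X(t) \, M_Y(t) \\
&\text{Convergence:} & M_{X_n}(t) \to M_X(t) \ \text{for all } t \text{ in an open interval around } 0
  \ &\Longrightarrow \ X_n \xrightarrow{d} X
\end{align*}
```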
3. Proof of the Lindeberg–Lévy CLT:
We’re now ready to prove the CLT.
But what will be our strategy for this proof? Look closely at section 2C above (Properties of MGFs). What the last stated property tells us (essentially) is that if the MGFs of a sequence of RVs A_n converge pointwise, on an open interval around zero, to the MGF of RV B, then it must be the case that A_n converges in distribution to RV B.
Our approach for proving the CLT will be to show that the MGF of our sampling estimator S* converges pointwise to the MGF of a standard normal RV Z. In doing so, we will have proved that S* converges in distribution to Z, which is the CLT and concludes our proof.
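This strategy can be previewed numerically before working through the algebra. The sketch below assumes, purely for illustration, that the X_i are Exponential(1), so the standardized variable W = X - 1 has the closed-form MGF exp(-s) / (1 - s) for s < 1. The function names are hypothetical. It uses the independence property to write the MGF of S* as the n-th power of the MGF of W evaluated at t / sqrt(n), then compares it with exp(t^2 / 2).

```python
import math


def log_mgf_centered_exponential(s):
    """log of the MGF of W = X - 1 with X ~ Exponential(1), valid for s < 1.

    M_W(s) = exp(-s) / (1 - s), so log M_W(s) = -s - log(1 - s).
    """
    if s >= 1:
        raise ValueError("the MGF is only finite for s < 1")
    return -s - math.log1p(-s)


def mgf_s_star(t, n):
    """MGF of S* = (1 / sqrt(n)) * sum of n i.i.d. copies of W.

    Independence gives M_{S*}(t) = M_W(t / sqrt(n)) ** n; the power is
    computed on the log scale for numerical stability.
    """
    return math.exp(n * log_mgf_centered_exponential(t / math.sqrt(n)))


def mgf_standard_normal(t):
    """MGF of Z ~ N(0, 1)."""
    return math.exp(t * t / 2)


if __name__ == "__main__":
    t = 1.0
    for n in (10, 100, 10_000, 1_000_000):
        gap = abs(mgf_s_star(t, n) - mgf_standard_normal(t))
        print(n, mgf_s_star(t, n), gap)
```

The gap printed in the last column should shrink as n grows, which is the pointwise convergence the proof establishes in general.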
So let’s jump in:
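What follows is a sketch of the standard MGF argument. It makes the added assumption that the MGF of the standardized variable W_i = (X_i - \mu) / \sigma is finite in an open interval around zero, which is stronger than what the theorem itself requires. W_i and M_W are notation assumed here.

```latex
\begin{align*}
M_{S^*}(t)
  &= \mathbb{E}\left[\exp\left(\frac{t}{\sqrt{n}} \sum_{i=1}^{n} W_i\right)\right]
   = \prod_{i=1}^{n} M_{W_i}\!\left(\frac{t}{\sqrt{n}}\right)
   = \left[M_W\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n}
   && \text{(independence, identical distribution)} \\[4pt]
M_W(s)
  &= 1 + s \, \mathbb{E}[W] + \frac{s^2}{2} \, \mathbb{E}\left[W^2\right] + o\left(s^2\right)
   = 1 + \frac{s^2}{2} + o\left(s^2\right)
   && \text{(Taylor expansion at } 0,\ \mathbb{E}[W] = 0,\ \mathbb{E}[W^2] = 1\text{)} \\[4pt]
M_{S^*}(t)
  &= \left[1 + \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^{n}
   \xrightarrow[n \to \infty]{} e^{t^2/2}
   = M_Z(t), \qquad Z \sim N(0, 1)
   && \text{(since } (1 + a_n/n)^n \to e^{a} \text{ when } a_n \to a\text{)}
\end{align*}
```

Since the MGF of S* converges pointwise to the MGF of a standard normal, the convergence property of MGFs gives that S* converges in distribution to N(0,1).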
And that concludes our proof! Congrats, you made it through the math 🙂
Wrap-up and Final Thoughts
I think understanding the above derivations is a worthwhile exercise for any Data Science, Statistics, or Machine Learning practitioner. Doing so not only gives one an appreciation for the CLT, but importantly provides an understanding of the scenarios where the CLT doesn’t hold.
In the beginning of this article, I mentioned how the CLT does not apply to n i.i.d. Cauchy distributed RVs. The Cauchy distribution is a continuous probability distribution without a defined mean or defined variance. Hence the conditions required for the Lindeberg–Lévy CLT (a finite first moment and a finite second moment) are not met, and the CLT does not hold here. If you (the reader) can think of other common examples where the CLT would not hold, please post your thoughts in the comments section.
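The Cauchy case can be made concrete with a short simulation. It relies on a known fact: the sample mean of n i.i.d. standard Cauchy RVs is itself standard Cauchy for every n, so its interquartile range stays near 2 however large n gets. The sketch below (function names and sample sizes are illustrative assumptions) compares that with standard normal draws, where the spread of the sample mean shrinks like 1 / sqrt(n).

```python
import math
import random
import statistics


def cauchy_draw(rng):
    """One standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def normal_draw(rng):
    """One standard normal draw, used as the well-behaved comparison."""
    return rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, n, reps, seed=0):
    """Interquartile range of `reps` simulated sample means of size n."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


if __name__ == "__main__":
    for n in (1, 20, 200):
        print(
            n,
            iqr_of_sample_means(cauchy_draw, n, reps=2000),
            iqr_of_sample_means(normal_draw, n, reps=2000),
        )
```

The interquartile range is used instead of the standard deviation because the latter is not defined for the Cauchy distribution.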
I hope the above is insightful and helpful. As I’ve mentioned in some of my previous pieces, it’s my opinion that not enough folks take the time to go through these types of exercises. For me, this type of theory-based insight leaves me more comfortable using methods in practice. A personal goal of mine is to encourage others in the field to take a similar approach. I’m planning on writing similar theory-based pieces in the future, so feel free to follow me for updates!