Probability and Statistics for Machine Learning

Source: Deep Learning on Medium


Probability and statistics are among the most important mathematical topics to learn before starting machine learning. But do you really need to know everything before you start? Let's discuss the key ideas one by one.

Mean: the average of a dataset.

Median: the middle value of a sorted dataset.

Mode: the most common value in a dataset.

Variance: Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean.

Standard Deviation: The square root of the variance is the standard deviation. While variance gives you a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean. Its symbol is σ (the Greek letter sigma).
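All five quantities above can be computed with Python's standard library; the small dataset here is made up just for illustration:

```python
# Descriptive statistics with Python's built-in statistics module.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))       # 5.0  (sum 40 / 8 items)
print(statistics.median(data))     # 4.5  (average of the two middle values)
print(statistics.mode(data))       # 4    (appears three times)
print(statistics.pvariance(data))  # 4.0  (population variance)
print(statistics.pstdev(data))     # 2.0  (square root of the variance)
```

Note the `p` prefix: `pvariance`/`pstdev` treat the data as a whole population, while `variance`/`stdev` treat it as a sample.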

Population: A population is the entire set of items in a group.

The population mean symbol is μ.

The formula for the population mean is:
μ = (Σ X) / N
Σ = "the sum of"
X = all the individual items in the group
N = the number of items in the group

Sample: A sample is a subset of a population.

The sample mean formula is:
x̄ = ( Σ xi ) / n
x̄ = sample mean
Σ = "add up"
xi = all of the x-values
n = the number of items in the sample

e.g. imagine you have to estimate the average weight of every person in the world. How would you do it?

Now, you have two options:

  • go to every person, ask their weight, write it down, and take the average (the population mean), OR
  • go to some people (a random subset), write down their weights, and take the average (the sample mean).
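The second option can be sketched in a few lines; the population of "weights" below is synthetic, generated purely for illustration:

```python
# Estimate a population mean from a random sample.
import random
import statistics

random.seed(42)
# A synthetic population of 100,000 "weights" in kg (mean 70, sd 12).
population = [random.gauss(70, 12) for _ in range(100_000)]

mu = statistics.mean(population)         # population mean (usually unknowable)
sample = random.sample(population, 500)  # ask only 500 random people
x_bar = statistics.mean(sample)          # sample mean, our estimate

print(round(mu, 1), round(x_bar, 1))     # the two should be close
```

The sample mean will not equal the population mean exactly, but for a reasonably sized random sample it lands close, which is the whole point of sampling.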


A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.

Density Functions: Distributions are most often described in terms of a density function. There are two main functions: the PDF (probability density function) and the CDF (cumulative distribution function).

PDF: gives the relative likelihood of observing a given value. It is the derivative of the CDF.

CDF: gives the probability of an observation being less than or equal to a value.
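A minimal sketch of both functions for the normal distribution, using only the standard library (the helper names `normal_pdf` and `normal_cdf` are made up here):

```python
# PDF and CDF of the normal distribution, via the math module.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density at x: the derivative of the CDF."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x), expressed through the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(normal_pdf(0))  # ~0.3989, the peak of the standard bell curve
print(normal_cdf(0))  # 0.5: half the mass lies below the mean
```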

Gaussian Distribution: It can be described using two parameters: the mean μ and the variance σ². Its density is

N(x; μ, σ²) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )

where N is the normal distribution, μ the mean, and σ² the variance.

Let’s dig into the formula.

Suppose μ = 0 and σ = 1. Then

Y = (1 / √(2π)) exp(−x²/2)

Here π and e are fixed values, so up to a constant factor it can be written as:

Y ∝ exp(−x²/2)
x = 0 → Y = exp(0) = 1
x = 1 → Y = exp(−0.5) ≈ 0.61
x = 2 → Y = exp(−2) ≈ 0.14

From this we can see that as x moves away from 0 in either direction, Y decreases rapidly.

  • As x moves away from μ, Y starts decreasing.
  • It is symmetric about μ.
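Both properties can be checked numerically, using exp(−x²/2), the standard normal density with its constant factor dropped:

```python
# Symmetry and decay of the (unnormalised) standard normal density.
import math

y = lambda x: math.exp(-x * x / 2)

assert y(1.5) == y(-1.5)           # symmetric about 0
assert y(0) > y(1) > y(2) > y(3)   # decreases as |x| grows

print(y(0), round(y(1), 2), round(y(2), 2))  # 1.0 0.61 0.14
```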

Now, let's talk about the CDF of the Gaussian distribution.

  • At x = μ, the CDF equals 0.5: half of the probability mass lies below the mean.
  • The smaller σ is, the more steeply the CDF rises around μ.
  • The CDF climbs fastest at μ, where the density is highest, and the curve is symmetric about the point (μ, 0.5).

Basically, it says that

68% of values lie within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.

e.g. suppose we have to measure the age of every person in a classroom. When we plot the Gaussian, most of the students will be about the same age because they are in the same class, so they will easily lie within the first standard deviation. Some may be a bit older or younger than the rest of the class; they will lie within the second standard deviation. And there may be a teacher, much older than the rest of the class, who will lie within the third standard deviation.
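The 68-95-99.7 rule is easy to verify by simulation:

```python
# Empirical check of the 68-95-99.7 rule on simulated standard-normal data.
import random

random.seed(0)
n = 200_000
samples = [random.gauss(0, 1) for _ in range(n)]

for k in (1, 2, 3):
    frac = sum(abs(x) <= k for x in samples) / n
    print(f"within {k} sd: {frac:.3f}")  # ~0.683, ~0.954, ~0.997
```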

Kurtosis: It measures the peakedness of a Gaussian. In other words, it measures how heavily the tails of a distribution differ from the tails of a normal distribution.

Kernel Density Estimation: A non-parametric way to estimate the probability density function. It is a fundamental smoothing problem: by smoothing the histogram we can estimate the PDF.
  • The bandwidth in KDE plays the role of the variance: it controls how wide each kernel is.
  • First we plot a smooth kernel over every data point, each with its own height (often drawn in red in illustrations).
  • By adding up the heights of all the kernels we get the PDF (often drawn in blue).
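The steps above can be sketched as a minimal Gaussian-kernel KDE; the `kde` helper and the toy data are invented here for illustration:

```python
# Minimal Gaussian-kernel KDE: one bump per data point, then average.
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Estimated density at x: average of kernels centred on each point."""
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

data = [1.0, 1.2, 2.5, 2.7, 2.9]
print(kde(2.7, data, bandwidth=0.5))  # higher: near the cluster around 2.7
print(kde(0.0, data, bandwidth=0.5))  # lower: away from the data
```

A smaller bandwidth gives a spikier estimate, a larger one a smoother estimate; like the variance of each kernel, it is the main tuning knob of KDE.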

Sampling Distribution: Not every distribution is Gaussian. Draw a random sample of size n from the given dataset, compute its mean (or median), and repeat the process many times; the distribution of those sample means (or medians) is called the sampling distribution.

Central Limit Theorem: It states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is.

  • Take random samples of size at least 30 and compute the mean of each; after m rounds of sampling we have m sample means.
  • The distribution of those m sample means tends toward a Gaussian.
distribution of X̄ = sampling distribution of the sample mean
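A quick demonstration: draw means of uniform samples, whose population is flat, and watch them concentrate around the uniform's mean with the spread the CLT predicts (σ/√n):

```python
# CLT sketch: means of uniform(0, 1) samples behave like a normal variable.
import random
import statistics

random.seed(1)
m = 5_000  # number of samples
n = 30     # size of each sample (at least 30, as noted above)

sample_means = [statistics.mean(random.random() for _ in range(n)) for _ in range(m)]

print(round(statistics.mean(sample_means), 3))   # ~0.5, the uniform's mean
print(round(statistics.stdev(sample_means), 3))  # ~sqrt(1/12)/sqrt(30) = 0.053
```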

Bernoulli: A distribution that has only 2 outcomes.

The probability of a failure is labeled on the x-axis as 0 and success as 1. For example, in a Bernoulli distribution where the probability of success (1) is 0.4, the probability of failure (0) is 0.6.
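Simulating that example (p = 0.4) takes one comparison per trial:

```python
# Simulate Bernoulli(p = 0.4): success (1) with probability 0.4, else failure (0).
import random

random.seed(7)
p = 0.4

def bernoulli(p):
    return 1 if random.random() < p else 0

trials = [bernoulli(p) for _ in range(100_000)]
print(sum(trials) / len(trials))  # ~0.4 success rate
```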

Binomial Distribution: The Bernoulli distribution is closely related to the binomial distribution: a binomial variable counts the number of successes in n independent Bernoulli trials. The Bernoulli distribution is sometimes used to model a single individual experiencing an event like death, a disease, or disease exposure, and it directly models the probability that a person has the event in question.

  • 1 = "event" (P = p)
  • 0 = "non-event" (P = 1 − p)

Bernoulli distributions are used in logistic regression to model disease occurrence.
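The binomial probability mass function follows directly from counting: `math.comb(n, k)` ways to place k events, each arrangement having probability p^k (1−p)^(n−k):

```python
# Binomial(n, p): probability of exactly k events in n Bernoulli(p) trials.
import math

def binomial_pmf(k, n, p):
    """P(exactly k events in n trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 4 events in 10 trials with p = 0.4 (the mode):
print(round(binomial_pmf(4, 10, 0.4), 4))  # ~0.2508

# Sanity check: the pmf sums to 1 over k = 0..n.
print(sum(binomial_pmf(k, 10, 0.4) for k in range(11)))
```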

Log-Normal Distribution: A log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. A log-normally distributed random variable takes only positive real values.



  • the length of comments posted in internet discussion forums follows a log-normal distribution
  • users' dwell time on online articles

If the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Likewise, if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution.
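That round trip is easy to confirm on simulated data:

```python
# If Y ~ Normal(0, 1), then X = exp(Y) is log-normal (and always positive);
# taking log(X) recovers a Normal(0, 1) variable.
import math
import random
import statistics

random.seed(3)
ys = [random.gauss(0, 1) for _ in range(50_000)]  # normal
xs = [math.exp(y) for y in ys]                    # log-normal

assert min(xs) > 0                                # only positive values

logs = [math.log(x) for x in xs]                  # back to the normal variable
print(round(statistics.mean(logs), 2), round(statistics.stdev(logs), 2))  # ~0, ~1
```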

Power Law Distribution: A power law states that a relative change in one quantity results in a proportional relative change in another. The simplest example: if you double the length of a square's side (say, from 2 to 4 inches), then the area quadruples (from 4 to 16 square inches).
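The side/area example is the power law area = side², so doubling the side always multiplies the area by 2² = 4:

```python
# area = side ** 2 is a power law with exponent 2.
def area(side):
    return side ** 2

print(area(2))            # 4
print(area(4))            # 16: doubling the side quadrupled the area
print(area(4) / area(2))  # 4.0
```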