How to Verify the Distribution of Data using Q-Q Plots?

Original article was published by Satyam Kumar on Artificial Intelligence on Medium


Suppose we are given a sample from some distribution and need to check whether it is normal (Gaussian) or not. For this walkthrough, we will call the unknown distribution X and the known normal distribution Y.

Generate unknown distribution X:

import numpy as np
X = np.random.normal(loc=50, scale=25, size=1000)

Here X holds 1000 values drawn from a normal distribution with mean=50 and standard deviation=25. We treat it as unknown for the rest of the walkthrough.

(Image by Author), first 20 random values of X

Find 100 percentile values:

X_100 = []
for i in range(1,101):
    X_100.append(np.percentile(X, i))

Compute the value of X at each integer percentile (1%, 2%, 3%, . . . , 99%, 100%) and store the results in X_100.

(Image by Author), Left: Distribution of X, Right: Distribution of X_100
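The loop above can also be written without an explicit loop, since np.percentile accepts an array of percentiles. The sketch below is only an illustration: the seeded generator `rng` is an assumption added for reproducibility and is not part of the original snippet.

```python
import numpy as np

# Assumption: a seeded generator so the run is reproducible.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=25, size=1000)

# np.percentile accepts an array of percentiles, so no loop is needed.
percentiles = np.arange(1, 101)         # 1, 2, ..., 100
X_100 = np.percentile(X, percentiles)   # array with 100 values

# Cross-check against the explicit loop used in the article.
X_100_loop = [np.percentile(X, i) for i in range(1, 101)]
assert np.allclose(X_100, X_100_loop)
```

The result is a NumPy array rather than a list, which is convenient for plotting and for further arithmetic.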

Generate known random distribution Y and its percentile values:

Y = np.random.normal(loc=0, scale=1, size=1000)

Here Y holds 1000 values drawn from a normal distribution with mean=0 and standard deviation=1. It is the reference that the unknown distribution X is compared against to check whether X is normally distributed.

Y_100 = []
for i in range(1, 101):
    Y_100.append(np.percentile(Y, i))

Compute the value of Y at each integer percentile (1%, 2%, 3%, . . . , 99%, 100%) and store the results in Y_100, so that Y_100 has the same 100 entries as X_100.

Plotting:

Draw a scatter plot of the 100 percentile values of the unknown distribution (X_100) against the 100 percentile values of the normal distribution (Y_100).

(Image by Author), Q-Q Plot
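The article does not show the plotting code, so the following is a minimal sketch of how such a scatter plot could be produced, assuming matplotlib is available. The helper names `percentile_values` and `plot_qq`, the output file name and the choice of putting Y on the horizontal axis are all illustrative assumptions.

```python
import numpy as np

def percentile_values(sample):
    """Return the 1st to 100th percentile values of a 1-D sample."""
    return np.percentile(sample, np.arange(1, 101))

def plot_qq(x_q, y_q, path="qq_plot.png"):
    """Scatter-plot two sets of percentile values and save the figure.

    Hypothetical helper: matplotlib is imported here so that the
    percentile helper above still works when matplotlib is missing.
    """
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(y_q, x_q, s=10)
    ax.set_xlabel("Percentiles of Y (normal)")
    ax.set_ylabel("Percentiles of X (unknown)")
    ax.set_title("Q-Q plot")
    fig.savefig(path)
    plt.close(fig)

# Assumption: a seeded generator so the run is reproducible.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=25, size=1000)
Y = rng.normal(loc=0, scale=1, size=1000)

X_100 = percentile_values(X)
Y_100 = percentile_values(Y)
```

Calling plot_qq(X_100, Y_100) then writes the figure to qq_plot.png in the working directory.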

Here X is the unknown distribution, which is compared to Y, the normal distribution.

In a Q-Q plot, if the scatter points lie along a straight line, then both random variables have the same type of distribution; otherwise their distributions differ. The two may still differ in mean and scale, which only changes the intercept and slope of the line.

In the Q-Q plot above, the points lie along a straight line, so X is consistent with a normal distribution.
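Judging straightness by eye can be backed up with a number. The sketch below is an addition, not the author's method: it fits a line through the Q-Q points and reports the correlation coefficient, using np.corrcoef and np.polyfit. The seed is an assumption added for reproducibility.

```python
import numpy as np

# Assumption: a seeded generator so the run is reproducible.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=25, size=1000)
Y = rng.normal(loc=0, scale=1, size=1000)

p = np.arange(1, 101)
X_q = np.percentile(X, p)
Y_q = np.percentile(Y, p)

# Correlation close to 1 means the Q-Q points are close to a straight line.
r = np.corrcoef(Y_q, X_q)[0, 1]

# Fit X_q = slope * Y_q + intercept. Because Y is standard normal,
# the slope estimates the scale of X and the intercept its mean.
slope, intercept = np.polyfit(Y_q, X_q, deg=1)

print(f"r = {r:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Since X was generated with mean=50 and standard deviation=25, the intercept and slope should come out near those values, which shows why a straight line indicates the same family rather than identical parameters.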

What if both the distributions are not the same?

If X is not normally distributed but follows some other distribution, then the scatter points of a Q-Q plot between X and a normal distribution will not lie on a straight line.

(Image by Author), Q-Q Plot

Here, X follows a log-normal distribution and is compared to a normal distribution, so the scatter points in the Q-Q plot do not lie on a straight line.
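The log-normal case can be reproduced with a short experiment. This is a sketch under assumptions: the parameters mean=0 and sigma=1 and the seed are illustrative choices, and the correlation of the Q-Q points is used as a rough stand-in for straightness. Because the logarithm of a log-normal variable is normal, the Q-Q points of log(X) against Y should fall back onto a line.

```python
import numpy as np

# Assumption: a seeded generator and illustrative log-normal parameters.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
Y = rng.normal(loc=0, scale=1, size=1000)

p = np.arange(1, 101)
Y_q = np.percentile(Y, p)

# Raw log-normal sample against the normal sample: curved Q-Q plot.
r_raw = np.corrcoef(Y_q, np.percentile(X, p))[0, 1]

# log(X) is normally distributed, so its Q-Q plot is close to a line.
r_log = np.corrcoef(Y_q, np.percentile(np.log(X), p))[0, 1]

print(f"lognormal vs normal:      r = {r_raw:.3f}")
print(f"log(lognormal) vs normal: r = {r_log:.3f}")
```

The second correlation should be noticeably closer to 1 than the first, matching the curved shape seen in the plot.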

Some more observations:

Here are four Q-Q plots for four different combinations of X and Y distributions.

(Image by Author), Top Left: Q-Q plot of lognormal vs normal distribution, Top Right: Q-Q plot of normal vs exponential distribution, Bottom Left: Q-Q plot of exponential vs exponential distribution, Bottom Right: Q-Q plot of logistic vs logistic distribution
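The four combinations in the figure can be compared numerically as well. The sketch below is illustrative only: the distribution parameters, the seed and the helper `qq_corr` are assumptions, and the two same-family pairs deliberately use different location and scale. It uses percentiles 1 to 99 because the 100th percentile is the sample maximum, which is very noisy for heavy-tailed samples.

```python
import numpy as np

# Assumption: a seeded generator so the run is reproducible.
rng = np.random.default_rng(0)
n = 1000
p = np.arange(1, 100)  # percentiles 1..99; the sample maximum is left out

def qq_corr(a, b):
    """Correlation between the percentile values of two samples."""
    return np.corrcoef(np.percentile(a, p), np.percentile(b, p))[0, 1]

# Illustrative parameters; only the distribution families come from the figure.
pairs = {
    "lognormal vs normal": (
        rng.lognormal(mean=0.0, sigma=1.0, size=n),
        rng.normal(loc=0.0, scale=1.0, size=n),
    ),
    "normal vs exponential": (
        rng.normal(loc=0.0, scale=1.0, size=n),
        rng.exponential(scale=1.0, size=n),
    ),
    "exponential vs exponential": (
        rng.exponential(scale=1.0, size=n),
        rng.exponential(scale=3.0, size=n),
    ),
    "logistic vs logistic": (
        rng.logistic(loc=0.0, scale=1.0, size=n),
        rng.logistic(loc=2.0, scale=0.5, size=n),
    ),
}

results = {name: qq_corr(a, b) for name, (a, b) in pairs.items()}
for name, r in results.items():
    print(f"{name:28s} r = {r:.3f}")
```

The two same-family pairs should give correlations close to 1 even though their parameters differ, while the two mixed pairs should score visibly lower.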