Original article was published by Rashida Nasrin Sucky on Artificial Intelligence on Medium

# A Complete Guide to Hypothesis Testing for Data Scientists Using Python

## Explained Clearly with Sample Research Questions, Solution Steps, and Complete Codes

Hypothesis testing is an important part of statistics and data analysis. Most of the time, it is not practical to collect data from an entire population. Instead, we take a sample and use it to make estimates or claims about the population. These claims are hypotheses. Hypothesis testing is the process of checking whether the sample provides enough evidence to reject a hypothesis.

Hypothesis testing is normally done on proportions and means.

In this article, we are going to cover hypothesis testing for a population proportion, the difference between two population proportions, a population mean, and the difference between two population means.

I will explain the process of hypothesis testing step by step for all four categories individually, with examples.

I used a Jupyter Notebook environment for this exercise. If you do not have one, feel free to use any notebook or IDE of your choice.

A Google Colab notebook works well too, because the common libraries used here are preinstalled in it.

# Hypothesis Testing for One Proportion

This is the most basic kind of hypothesis test. Most of the time we do not have a specific fixed value to compare against, but when we do, a one-proportion test is the simplest place to start.

I used the Heart dataset from Kaggle for this demonstration. Please feel free to download the dataset for your practice. Here I import the packages and the dataset:

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats.distributions as dist
import seaborn as sns  # used later for the distribution plot

df = pd.read_csv('Heart.csv')
df.head()
```

The last column of the dataset is ‘AHD’, which indicates whether the person has heart disease. The research question for this section is:

**“The population proportion of people in Ireland having heart disease is 42%. Is the proportion of people with heart disease higher in the US?”**

Now, find the answer to this research question step by step.

**Step 1:** Define the null hypothesis and the alternative hypothesis.

In this problem, the null hypothesis is that the population proportion having heart disease in the US is less than or equal to 42%. Testing against the boundary value of exactly 42% is enough: if the data lets us reject 42% in favor of a higher proportion, it also rules out every value below 42%. So, I am stating the null hypothesis as an equality.

The alternative hypothesis is that the population proportion of the US having heart disease is more than 42%.

```
Ho: p = 0.42  # null hypothesis
Ha: p > 0.42  # alternative hypothesis
```

Let’s see if we can find the evidence to reject the null hypothesis.

**Step 2:** Assume that the dataset above is a representative sample from the population of the US. Then calculate the sample proportion with heart disease, which is our best estimate of the US population proportion.

`p_us = len(df[df['AHD']=='Yes'])/len(df)`

The sample proportion having heart disease is **0.46 or 46%**. This is higher than the 42% stated in the null hypothesis.

But the question is whether it is significantly more than 42%. If we took a different simple random sample, the observed sample proportion (46%) could be different.
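To make the point about sample-to-sample variation concrete, here is a minimal simulation sketch. It does not use the Heart dataset: the true proportion of 0.42 and the sample size of 303 are assumptions chosen only for illustration, and `simulate_sample_proportion` is a hypothetical helper name.

```python
import random
import statistics

def simulate_sample_proportion(true_p, n, rng):
    """Draw one simple random sample of size n and return its sample proportion."""
    successes = sum(1 for _ in range(n) if rng.random() < true_p)
    return successes / n

rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_P = 0.42           # assumed population proportion (illustrative)
N = 303                 # assumed sample size (illustrative)

proportions = [simulate_sample_proportion(TRUE_P, N, rng) for _ in range(2000)]

print("first five sample proportions:", [round(p, 3) for p in proportions[:5]])
print("spread (std dev) of sample proportions:", round(statistics.stdev(proportions), 4))
print("theoretical standard error:", round((TRUE_P * (1 - TRUE_P) / N) ** 0.5, 4))
```

Each simulated sample gives a slightly different proportion, and the spread of those proportions should be close to the standard error formula used in Step 3.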

To find out whether the observed sample proportion is significantly more than the hypothesized value, perform a hypothesis test.

**Step 3:** Calculate the test statistic.

Here is the formula for the test statistic:

$$z = \frac{\hat{p} - p_0}{SE}$$

where $\hat{p}$ is the best estimate (the sample proportion). We use this formula for the standard error:

$$SE = \sqrt{\frac{p_0 (1 - p_0)}{n}}$$

In this formula, p0 is 0.42 (according to the null hypothesis) and n is the sample size. Now calculate the standard error:

`se = np.sqrt(0.42 * (1-0.42) / len(df))`

Find the test statistic using the formula above:

```python
# Best estimate
be = p_us

# Hypothesized estimate
he = 0.42

test_stat = (be - he)/se
```

The test statistic came out to be 1.3665.
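The standard error and test statistic calculations above can be wrapped in a small helper. This is a sketch using only the standard library; `one_proportion_z` is a hypothetical name, and the counts 139 out of 303 are illustrative values chosen to give a sample proportion near 46%, not numbers read from Heart.csv.

```python
import math

def one_proportion_z(successes, n, p0):
    """Return (sample proportion, null standard error, z statistic)."""
    p_hat = successes / n
    # The standard error uses p0 because it is computed assuming the null is true.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    return p_hat, se, z

# Illustrative counts (assumed): 139 of 303 people with heart disease.
p_hat, se, z = one_proportion_z(successes=139, n=303, p0=0.42)
print(f"p_hat={p_hat:.4f}  se={se:.4f}  z={z:.4f}")
```

Keeping the three values together makes it easy to see how far the estimate is from the hypothesized value in raw terms and in standard-error units.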

**Step 4:** Calculate the p-value

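This test statistic is also called a z-score. You can find the p-value from a z-table, or you can compute it in Python with this formula:

`pvalue = 2*dist.norm.cdf(-np.abs(test_stat))`

**The p-value is 0.1718.** This is the two-sided probability of seeing a test statistic at least 1.3665 standard errors away from the hypothesized value when the null hypothesis is true. Because the alternative hypothesis here is one-sided (p > 0.42), the matching one-sided p-value is half of that, about 0.086.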
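Whether the p-value should count one tail or both depends on the alternative hypothesis. The sketch below, which uses the standard library's `statistics.NormalDist` instead of SciPy, shows the three cases for a z statistic; `z_p_value` and its `alternative` labels are hypothetical names.

```python
from statistics import NormalDist

def z_p_value(z, alternative="two-sided"):
    """p-value for a z statistic under the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "larger":      # Ha: parameter > hypothesized value
        return 1 - std_normal.cdf(z)
    if alternative == "smaller":     # Ha: parameter < hypothesized value
        return std_normal.cdf(z)
    if alternative == "two-sided":   # Ha: parameter != hypothesized value
        return 2 * std_normal.cdf(-abs(z))
    raise ValueError(f"unknown alternative: {alternative!r}")

z = 1.3665  # test statistic from Step 3
print("two-sided:", round(z_p_value(z, "two-sided"), 4))
print("larger:   ", round(z_p_value(z, "larger"), 4))
```

For a positive z, the "larger" p-value is exactly half of the two-sided one.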

**Step 5:** Infer the conclusion from the p-value

Consider the significance level alpha to be 5% or 0.05. A significance level of 5% means we accept a 5% chance of rejecting the null hypothesis when it is actually true.
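One way to see what a 5% significance level controls is to simulate many samples from a population where the null hypothesis really is true and count how often the test rejects it. The sketch below assumes a true proportion of 0.42 and a sample size of 303 purely for illustration, and uses a two-sided test; `rejects_null` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist

def rejects_null(n, p0, alpha, rng):
    """Simulate one sample under the null and report whether the z-test rejects it."""
    successes = sum(1 for _ in range(n) if rng.random() < p0)
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (successes / n - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p_value < alpha

rng = random.Random(1)  # fixed seed so the run is repeatable
trials = 2000
rejections = sum(rejects_null(n=303, p0=0.42, alpha=0.05, rng=rng) for _ in range(trials))
false_positive_rate = rejections / trials
print("share of true nulls rejected:", false_positive_rate)
```

The share should land near 0.05: the significance level is the rate of false alarms we accept when the null hypothesis holds.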

Here the p-value is bigger than our chosen significance level of 0.05. So, we cannot reject the null hypothesis. That means we do not have enough evidence to say that the population proportion having heart disease in the US is higher than the 42% reported for Ireland.

# Hypothesis Tests for the Difference in Two Proportions

Comparative tests are conducted much more frequently than one-proportion tests. A two-sample test of proportions assesses whether the population proportion of some trait differs between two subgroups.

**Here, we are going to test if the population proportion of females with heart disease is different from the population proportion of males with heart disease.**

**Step 1:** Set up the null hypothesis, alternative hypothesis, and significance level.

Here, we want to check if there is any difference between the population proportion of males and females having heart disease. We will start with the assumption that there is no difference.

`Ho: p1 - p2 = 0`

This is our null hypothesis. Here, p1 is the population proportion of females with heart disease and p2 is the population proportion of males having heart disease.

What could be the alternative hypothesis?

The alternative hypothesis is that there is a difference.

`Ha: p1 - p2 != 0`

Let’s use the significance level of 0.1 or 10%.

**Step 2:** Prepare a table that shows the sample proportion of males and females with heart disease and the total number of males and females in the sample.

```python
df['Gender'] = df.Sex.replace({1: "Male", 0: "Female"})

# Proportion with heart disease and group size for each gender
p = df.groupby("Gender")['AHD'].agg([lambda z: np.mean(z=='Yes'), "size"])
p.columns = ["HeartDisease", 'Total']
p
```

**Step 3:** Calculate the test statistic

We will use the same formula for the test statistic as before. The best estimate is p1 - p2, computed from the sample. Here, p1 is the sample proportion of females with heart disease and p2 is the sample proportion of males with heart disease.

```python
# Best estimate is p1 - p2. Get p1 and p2 from the table p above
p_fe = p.HeartDisease.Female
p_male = p.HeartDisease.Male
```

The standard error for the difference between two proportions is calculated with the formula below:

$$SE = \sqrt{p (1 - p) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Here, p is the overall proportion in the sample with heart disease, and n1 and n2 are the numbers of females and males in the sample.

```python
# p_us is the overall sample proportion calculated in the previous example.
# It is used directly here so that the table `p` is not overwritten.
n1 = p.Total.Female
n2 = p.Total.Male

se = np.sqrt(p_us*(1-p_us)*(1/n1 + 1/n2))
```

Now, use this standard error and calculate the test statistic.

```python
# Calculate the best estimate
be = p_fe - p_male

# Calculate the hypothesized estimate
# Our null hypothesis is p1 - p2 = 0
he = 0

# Calculate the test statistic
test_statistic = (be - he)/se
```

The calculated test_statistic is -0.296. That means that the observed difference in sample proportions is 0.296 estimated standard errors below the hypothesized value.
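For reference, here is a self-contained sketch of the pooled two-proportion z-test described in this section, written with the standard library only. `two_proportion_z` is a hypothetical helper, and the counts in the usage lines are made-up numbers, not the values from the Heart dataset.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z-test for Ho: p1 - p2 = 0. Returns (difference, se, z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # overall proportion across both groups
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p1 - p2, se, z, p_value

# Made-up counts: 30 of 100 in group 1 and 45 of 100 in group 2 have the trait.
diff, se, z, p_value = two_proportion_z(30, 100, 45, 100)
print(f"diff={diff:.3f}  se={se:.4f}  z={z:.3f}  p={p_value:.4f}")
```

Returning the raw difference alongside the standard error is a handy sanity check, since the test statistic should always equal the difference divided by the standard error.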

**Step 4:** Calculate the p-value

`pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))`

The p-value is 0.7675. That means that if the null hypothesis were true, we would see a difference at least as extreme as the one observed about 76.75% of the time.

In other words, the p-value is bigger than the significance level (0.1). So, we do not have enough evidence to reject the null hypothesis.

The population proportion of males with heart disease is not significantly different from the population proportion of females with heart disease.

# Hypothesis Testing for One Mean

This is a simple hypothesis testing process. We can perform this test if we have a specific fixed mean value to compare against. Let’s work on an example to understand the process.

This is the research question:

**“Check if the mean RestBP is greater than 135”.** Here, RestBP is resting blood pressure. We have a RestBP column in the DataFrame. Let’s solve this problem step by step.

**Step 1:** State the hypothesis

We need to find out if the mean RestBP is greater than 135. Let’s start from the assumption that the mean RestBP is less than or equal to 135.

So, the null hypothesis can be that the mean RestBP is 135. If the data lets us conclude that the mean RestBP is greater than 135, it is automatically greater than 134 or 130 as well.

If we find enough evidence to reject the null hypothesis, we can accept that the mean RestBP is greater than 135. This is the alternative hypothesis for this example.

```
Ho: mu = 135
Ha: mu > 135
```

We will check if we can reject the null hypothesis using a **significance level of 0.05**.

**Step 2:** Check the assumptions

There are two assumptions:

- The sample should be a simple random sample.
- The data need to be normally distributed.

I collected this dataset from Kaggle and was not involved in collecting the data. For demonstration purposes, just assume that this is a simple random sample. To check the second assumption, plot the data and have a look at the distribution.

`sns.histplot(df.RestBP, kde=True)`

The distribution is not exactly normal, but it is close to normal.

The good news is that we do not need to worry much about the normality of the data, because we have a large enough sample size (more than 25 observations). With a large sample, the sample mean is approximately normally distributed even when the data are not.
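The large-sample claim can be checked with a quick simulation. The sketch below draws from a strongly right-skewed exponential distribution (an assumption chosen for illustration, unrelated to RestBP) and compares the skewness of the raw values with the skewness of means of samples of size 100. `skewness` is a hypothetical helper.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(2)  # fixed seed so the run is repeatable

# Raw data: clearly not normal.
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of 2000 samples, each of size 100, from the same distribution.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(100))
    for _ in range(2000)
]

print("skewness of raw data:    ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```

The raw values are heavily skewed, while the sample means are much closer to symmetric, which is why the normal approximation is reasonable for the mean of a large sample.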

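**Step 3:** Calculate the test statistic

Here is the formula to calculate the test statistic:

$$z = \frac{\bar{x} - \mu_0}{SE}$$

where $\bar{x}$ is the sample mean and $\mu_0$ is the hypothesized mean (135). First, calculate the standard error using the formula below:

$$SE = \frac{s}{\sqrt{n}}$$

Here, s is the sample standard deviation and n is the number of observations in the sample.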

```python
std = df.RestBP.std()
n = len(df)
se = std/np.sqrt(n)
```

Now, use this standard error to find the test statistic:

```python
# Best estimate
be = df.RestBP.mean()

# Hypothesized estimate
he = 135

test_statistic = (be - he)/se
```

The test statistic came out to be -3.27. Look at the formula for the test statistic. The numerator measures the distance between the sample mean and the hypothesized mean, and the denominator is the standard error.

So, this test_statistic means that the sample mean is 3.27 standard errors below the hypothesized mean.
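The calculation in this step can be collected into one function. The sketch below uses only the standard library and synthetic readings generated with an assumed mean of 131 and standard deviation of 17; these are illustrative numbers, not the RestBP column, and `one_mean_z` is a hypothetical name.

```python
import math
import random
import statistics

def one_mean_z(values, mu0):
    """Return (sample mean, standard error, test statistic) for Ho: mu = mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # sample std dev (n - 1 denominator)
    return mean, se, (mean - mu0) / se

rng = random.Random(3)  # fixed seed so the run is repeatable
readings = [rng.gauss(131, 17) for _ in range(300)]  # synthetic data (assumed parameters)

mean, se, stat = one_mean_z(readings, mu0=135)
print(f"mean={mean:.2f}  se={se:.3f}  test statistic={stat:.2f}")
```

`statistics.stdev` uses the n - 1 denominator, which matches the default behavior of pandas `Series.std()`. A negative test statistic means the sample mean sits below the hypothesized mean.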

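**Step 4:** Infer the conclusion from the test statistic

Convert this test_statistic to a probability value to see if this difference is unusual or not. We can get the two-sided value using this Python formula:

`pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))`

The p-value is 0.001, which is less than the significance level (0.05). If the null hypothesis were true, there would be only about a 0.1% probability of seeing a sample mean this far from 135 in either direction. So the sample mean is significantly different from 135.

The direction matters, though. The alternative hypothesis is one-sided (mu > 135), while the test statistic is negative, which means the sample mean is below 135. The one-sided p-value for this alternative is `1 - dist.norm.cdf(test_statistic)`, which is about 0.999.

So, based on this sample data, we cannot reject the null hypothesis in favor of a mean RestBP greater than 135. The evidence points the other way, toward a mean RestBP below 135.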

# Hypothesis Testing for the Difference in Mean

For this example, we will use the same data, the RestBP column. But this time we will test whether there is any difference between the mean RestBP of females and the mean RestBP of males.

**Step 1:** State the hypothesis

As a null hypothesis, start with the claim that the mean RestBP of females and the mean RestBP of males are the same. So the difference between these two means will be zero.

The alternative hypothesis is, these two means are not the same. Let’s perform the test with a 10% significance level.

```
Ho: mu_female - mu_male = 0
Ha: mu_female - mu_male != 0
```

Both the male and female groups have enough observations in this dataset. So, checking for the normality of the data is not required.

**Step 2:** Calculate the test statistic

The formula for the test statistic is the same as before. But the formula for the standard error is different:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Here s1 and s2 are the sample standard deviations of the female and male groups respectively, and n1 and n2 are the sample sizes of the female and male groups. Now, calculate the standard error:

```python
pop_fe = df[df.Gender=='Female'].dropna()
pop_male = df[df.Gender=='Male'].dropna()

std_fe = pop_fe.RestBP.std()
std_male = pop_male.RestBP.std()

se = np.sqrt(std_fe**2/len(pop_fe) + std_male**2/len(pop_male))
```

Use the standard error to get the test statistic.

```python
# Calculate the best estimate
mu_fe = pop_fe.RestBP.mean()      # Mean RestBP for females
mu_male = pop_male.RestBP.mean()  # Mean RestBP for males
mu_diff = mu_fe - mu_male

# Hypothesized estimate
mu_diff_hyp = 0  # null hypothesis: difference of the two means = zero

test_statistic = (mu_diff - mu_diff_hyp)/se
```

The test_statistic is 1.086. For reference, the observed difference in means, ‘mu_diff’, is 2.52.

As we are testing whether the means are different from each other, this is a two-tailed test.

The p-value is the probability that the test statistic is either less than -1.086 or greater than 1.086.
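As a compact restatement of this step, the sketch below computes the unpooled standard error and the test statistic for two independent samples with the standard library. The two lists are synthetic values generated from assumed means and standard deviations, not the female and male RestBP readings, and `two_mean_z` is a hypothetical name.

```python
import math
import random
import statistics

def two_mean_z(sample1, sample2):
    """Return (difference in means, unpooled standard error, test statistic)."""
    diff = statistics.fmean(sample1) - statistics.fmean(sample2)
    se = math.sqrt(
        statistics.variance(sample1) / len(sample1)
        + statistics.variance(sample2) / len(sample2)
    )
    return diff, se, diff / se  # hypothesized difference is 0

rng = random.Random(4)  # fixed seed so the run is repeatable
group_a = [rng.gauss(133, 19) for _ in range(100)]  # synthetic (assumed parameters)
group_b = [rng.gauss(131, 17) for _ in range(200)]  # synthetic (assumed parameters)

diff, se, stat = two_mean_z(group_a, group_b)
print(f"difference={diff:.2f}  se={se:.3f}  test statistic={stat:.3f}")
```

Each group contributes its own variance divided by its own sample size, so the smaller group usually dominates the standard error.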

**Step 3:** Infer the conclusions from the test statistic

Calculate the p-value from this test statistic in Python:

`pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))`

The p-value came out to be 0.277. As this is a two-tailed test, it is the sum of the two tail probabilities, which the factor of 2 in the formula already accounts for:

p(z < -1.086) ≈ 0.1387

p(z > 1.086) ≈ 0.1387

p-value ≈ 0.1387 + 0.1387 ≈ 0.277

That means that if the null hypothesis were true, there would be approximately a 27.7% probability of seeing the observed result or a more extreme one.
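The tail arithmetic can be verified directly. This sketch uses `statistics.NormalDist` from the standard library and the test statistic 1.086 reported above; the variable names are illustrative.

```python
from statistics import NormalDist

z = 1.086
std_normal = NormalDist()

lower_tail = std_normal.cdf(-z)     # P(Z < -1.086)
upper_tail = 1 - std_normal.cdf(z)  # P(Z > 1.086)
two_tailed = 2 * std_normal.cdf(-abs(z))

print("lower tail:", round(lower_tail, 4))
print("upper tail:", round(upper_tail, 4))
print("two-tailed p-value:", round(two_tailed, 4))
```

By symmetry the two tails are equal, so doubling one tail gives the two-tailed p-value.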

In other words, **the p-value is much bigger than the significance level. So, we fail to reject the null hypothesis.**

The final inference is, based on the observed difference between the mean RestBP of females and the mean RestBP of males, we cannot support the idea that there is a significant difference between the two means.

# Conclusion

I explained the four most common types of research questions in this article with working examples. I hope you will be able to use hypothesis testing in decision-making from now on.