How To Detect Outliers In Dataset

Original article was published by Puneet Singh on Deep Learning on Medium


Photo by Will Myers on Unsplash

Handling outliers is one of the biggest and most challenging tasks in machine learning.

Outliers directly affect model accuracy.

First, let's understand what an outlier in a dataset is.

An outlier is a data point that is distant from all other observations: a point that lies outside the overall distribution of the dataset.

Now, let's understand this with the help of an example.

In an organization, the salaries of all employees range from $10k to $50k.

So, in the salary column, every employee's salary should fall within this range.

Suppose we have 10 employees in the organization and look at the distribution of their salaries.

This is the list of all the employees' salaries. It is clearly visible that $150,000 is out of range: it does not fall between $10k and $50k. So it is an outlier in the salary column.

Outliers can be caused by human error (such as a wrong entry), by natural variability in the data, by experimental measurement error, and so on. In our case, though, $150,000 might well be the CEO's salary, so we cannot simply assume it is a human mistake.

In our case there were only 10 entries, so we could easily find the outlier by hand or by eye. But if we have millions of entries, how do we find the outliers?

For that, there are some widely used techniques:

1. Using scatter plots.

2. Using the z-score.

3. Using the IQR (interquartile range).

4. Using box plots.

We will discuss each of these in detail.

What are the impacts of having outliers in a dataset?

  1. They cause various problems during statistical analysis.
  2. They can have a significant impact on the mean and the standard deviation.
  3. They directly affect the model's accuracy.
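
To make the second point concrete, here is a minimal sketch using Python's built-in statistics module. The salary figures are hypothetical, invented only for illustration, and the summarize helper is likewise an assumed name.

```python
import statistics

# Hypothetical salaries in dollars, invented for illustration only.
salaries = [12_000, 18_000, 22_000, 25_000, 28_000, 31_000, 35_000, 40_000, 48_000]
with_outlier = salaries + [150_000]  # add one extreme value


def summarize(values):
    """Return (mean, median, population standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
    )


mean_a, median_a, std_a = summarize(salaries)
mean_b, median_b, std_b = summarize(with_outlier)

print(f"without outlier: mean={mean_a:.0f} median={median_a:.0f} std={std_a:.0f}")
print(f"with outlier:    mean={mean_b:.0f} median={median_b:.0f} std={std_b:.0f}")
```

On this made-up data the single extreme salary pulls the mean up sharply and inflates the standard deviation, while the median barely moves.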

1. Detecting outliers using a scatter plot

import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,9,8,9,6]
y = [86,86,87,88,111,86,103,87,88,81,80,85,86]
plt.scatter(x, y)
plt.show()

x and y are our data points. Now let's try to find the outliers in the data using the scatter plot.

Here, three outliers are clearly visible in the dataset. Those three dots lie outside the range in which the rest of the data points vary.

Outliers in the scatter plot

2. Detecting outliers using the z-score

Formula for the z-score: (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

Normal distribution

A data point that falls more than 3 standard deviations away from the mean is treated as an outlier. We compute the z-score of each data point and flag the point when the absolute value of its z-score exceeds a chosen threshold, typically 3 (or 2 for a stricter rule).

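import numpy as np

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]

def detect_outliers(data, threshold=3):
    # threshold: how many standard deviations from the mean counts as an outlier
    outliers = []
    mean = np.mean(data)
    std = np.std(data)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

print(detect_outliers(dataset))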
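
The same check can also be written without an explicit loop. The following is a sketch rather than the original author's code: the function name zscore_outliers and the default threshold of 3 are assumptions, and it uses NumPy's population standard deviation (the default behaviour of np.std).

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    data = np.asarray(values, dtype=float)
    std = data.std()  # population standard deviation (ddof=0)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    z_scores = (data - data.mean()) / std
    return data[np.abs(z_scores) > threshold].tolist()


dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

print(zscore_outliers(dataset))
```

On this dataset the three large values 102, 107 and 108 all have z-scores just above 3, so they are the ones returned.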

3. Detecting outliers using the interquartile range (IQR)

The IQR is the difference between the 75th and 25th percentiles of a dataset.

Steps:

1. Arrange the data in increasing order

2. Calculate the first (q1) and third (q3) quartiles

3. Find the interquartile range: IQR = q3 - q1

4. Find the lower bound: q1 - 1.5 * IQR

5. Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside the lower and upper bounds is an outlier.

import numpy as np

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]

quantile1, quantile3 = np.percentile(dataset, [25, 75])
print("range between quantile1 to quantile3")
print(quantile1, quantile3)

print("IQR")
iqr_value = quantile3 - quantile1
print(iqr_value)

print("Find the lower bound value and the higher bound value")
lower_bound_val = quantile1 - (1.5 * iqr_value)
upper_bound_val = quantile3 + (1.5 * iqr_value)
print(lower_bound_val, upper_bound_val)

So, data below 7.5 or above 19.5 are considered outliers.
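
To go from the bounds to the outliers themselves, one possible sketch wraps the steps in a small function. The name iqr_outliers and the parameter k (the 1.5 multiplier) are assumptions made for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_bound, upper_bound)) using the IQR rule."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = float(q1 - k * iqr)
    upper = float(q3 + k * iqr)
    # Keep only the points that fall outside the two bounds.
    mask = (data < lower) | (data > upper)
    return data[mask].tolist(), (lower, upper)


dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

outliers, (lower, upper) = iqr_outliers(dataset)
print("bounds:", lower, upper)
print("outliers:", outliers)
```

Raising k makes the rule more tolerant of extreme values; lowering it flags more points.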

4. Detecting outliers using box plots

Draw a box plot of the given dataset and read the outliers off the plot.

import matplotlib.pyplot as plt
value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
plt.boxplot(value1)
plt.show()

Box plot
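
If you want the numbers behind the plot rather than the picture, matplotlib exposes the statistics it draws through matplotlib.cbook.boxplot_stats. The sketch below assumes the default whisker length of 1.5 times the IQR; the variable names are illustrative.

```python
from matplotlib import cbook

value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32, 98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# boxplot_stats returns one dict of statistics per dataset;
# we pass a single dataset, so we take the first (and only) entry.
stats = cbook.boxplot_stats(value1, whis=1.5)[0]

print("q1, median, q3:", stats["q1"], stats["med"], stats["q3"])
print("whiskers:", stats["whislo"], stats["whishi"])
print("fliers (outliers):", stats["fliers"])
```

The points listed under fliers are the ones a box plot draws as individual markers beyond the whiskers, which is how outliers show up in the plot. For value1 this should list 24 and 32, the two values below the lower whisker.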

These are the four most commonly used ways to detect outliers in a dataset.

In the next tutorial, we will discuss how to handle outliers.

If you have any doubts about this tutorial, feel free to ask on LinkedIn:

Github workspace link-