Source: Deep Learning on Medium
How to Work with an Imbalanced Dataset
In machine learning, we often have to deal with an imbalanced dataset. Before looking at how to handle one, let’s see what it is.
A dataset is either balanced or imbalanced. Suppose we have a two-class classification problem where n1 is the number of +ve points and n2 is the number of -ve points, so the total number of points is
n = n1 + n2
In case 1, we have 80 points in n1 and 40 points in n2. The two classes are of comparable size, so we treat it as a balanced dataset.
In cases 2 (n1 = 100, n2 = 1000) and 3 (n1 = 150, n2 = 975), there is a large difference between n1 and n2, so these are imbalanced datasets.
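To make "large difference" a bit more concrete, here is a minimal sketch that computes the share of +ve points and the majority-to-minority ratio for each case. The counts for cases 2 and 3 are the ones used in the sampling examples later in this post. The helper name `imbalance_ratio` is my own, and the post does not define a cut-off ratio between balanced and imbalanced, so none is assumed here.

```python
def imbalance_ratio(n1, n2):
    """Return the majority-to-minority size ratio for a two-class dataset."""
    return max(n1, n2) / min(n1, n2)

# (n1, n2) = (+ve points, -ve points) for the three cases in this post
cases = {1: (80, 40), 2: (100, 1000), 3: (150, 975)}

for case, (n1, n2) in cases.items():
    n = n1 + n2  # total number of points
    ratio = imbalance_ratio(n1, n2)
    print(f"case {case}: n={n}, +ve share={n1 / n:.1%}, ratio={ratio:.1f}:1")
```

The larger the ratio, the more a model can get away with ignoring the smaller class.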
How to work around an imbalanced dataset
Working with an imbalanced dataset is tedious, but there are some methods that can be used to deal with it.
- Under Sampling: Under sampling is a method where we discard the extra points from the majority class.
For example, in case 2 we have 100 and 1000 points in n1 and n2 respectively, so we discard 900 points from n2 to make the dataset balanced.
One thing to notice here: we have thrown away 900 points. Discarding this much data is not a good idea, because it loses information and can reduce the model’s accuracy.
- Over Sampling: Over sampling is a method that creates artificial points for the minority class. Take case 3, where we have 150 and 975 points in n1 and n2 respectively. Here we create artificial points for n1 until it is comparable in size to n2.
- Class Weight: This is a technique that gives the minority class more weight and the majority class less weight, so that mistakes on the minority class count for more during training.
For example, suppose we have 100 +ve and 900 -ve points. We split these 1000 points into two sets, 700 for training and 300 for testing, and keep the class ratio the same in both: the training set gets 630 -ve and 70 +ve points, and the test set gets 270 -ve and 30 +ve points.
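The methods above can be sketched in plain Python. This is only an illustration under a few assumptions of mine: the points are hypothetical `(label, index)` placeholders rather than real features, over sampling is done by randomly duplicating minority points (the simplest way to get extra points; the post does not say how the artificial points are generated), and the weight formula n / (k * n_c) is one common heuristic rather than something stated in the post. All function names are my own.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority points as there are minority points."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority points until both classes have the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra

def balanced_class_weights(counts):
    """Weight class c by n / (k * n_c), so rarer classes get larger weights."""
    n, k = sum(counts.values()), len(counts)
    return {label: n / (k * n_c) for label, n_c in counts.items()}

def stratified_split(pos, neg, test_fraction, seed=0):
    """Split each class separately so train and test keep the same class ratio."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos_test = round(len(pos) * test_fraction)
    n_neg_test = round(len(neg) * test_fraction)
    train = pos[n_pos_test:] + neg[n_neg_test:]
    test = pos[:n_pos_test] + neg[:n_neg_test]
    return train, test

def make_points(label, count):
    """Hypothetical stand-in data: (label, index) tuples instead of real features."""
    return [(label, i) for i in range(count)]

# Under sampling on case 2: 100 +ve vs 1000 -ve, so 900 -ve points are dropped.
neg_kept, pos_kept = undersample(make_points("-ve", 1000), make_points("+ve", 100))
print(len(pos_kept), len(neg_kept))  # 100 100

# Over sampling on case 3: 150 +ve vs 975 -ve, so 825 +ve copies are added.
neg_all, pos_grown = oversample(make_points("-ve", 975), make_points("+ve", 150))
print(len(pos_grown), len(neg_all))  # 975 975

# Class weights for 100 +ve and 900 -ve points.
weights = balanced_class_weights({"+ve": 100, "-ve": 900})
print(weights["+ve"])  # 5.0

# 700/300 split of the same data, stratified by class.
train, test = stratified_split(make_points("+ve", 100), make_points("-ve", 900), 0.3)
print(len(train), len(test))  # 700 300
```

With these numbers the training set holds 70 +ve and 630 -ve points and the test set holds 30 +ve and 270 -ve points, so both keep the original 1:9 ratio. In practice you would resample only the training set, so that the test set still reflects the real class distribution.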
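NOTE: With imbalanced data, a model can reach high accuracy simply by always predicting the majority class. Such a model is called a dumb model. For example, with 100 +ve and 900 -ve points, always predicting -ve gives 90% accuracy without finding a single +ve point.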
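A quick way to see the dumb model in action is to score a "model" that always predicts the majority class. The sketch below assumes 100 +ve and 900 -ve points, encoded as 1 and 0. The recall check is my addition, included only to show what accuracy hides.

```python
def accuracy(y_true, y_pred):
    """Fraction of points whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of the actual positive points that the model found."""
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return found / actual

# 100 +ve points (label 1) and 900 -ve points (label 0)
y_true = [1] * 100 + [0] * 900

# Dumb model: ignore the input and always predict the majority class (0)
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks good, yet the model never identifies a +ve point, which is why accuracy alone is a poor measure on imbalanced data.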
Thanks for reading!!!
Suggestions are welcome!!!