Imbalanced Data: How to Handle It in Classification Problems

Original article was published on Deep Learning on Medium

Challenges with Machine Learning Algorithms:

Conventional machine learning algorithms often classify poorly on imbalanced datasets.

Standard classifiers such as Decision Tree and Logistic Regression are biased towards the majority class. They tend to predict only the majority class, while the features of the minority class are treated as noise and often ignored. As a result, there is a high probability that minority-class samples get misclassified as the majority class.
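To make this concrete, here is a minimal sketch of why accuracy alone hides the problem. The 95/5 label split is a made-up example, not the article's dataset: a model that always predicts the majority class looks accurate but never finds a single minority sample.

```python
from collections import Counter

# Hypothetical labels: 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = [0] * 95 + [1] * 5

# A "classifier" that always predicts the most frequent class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / y_true.count(1)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

This is why metrics such as recall or the confusion matrix are more informative than accuracy when classes are imbalanced.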

Baseline Model:

We are going to use DummyClassifier, LogisticRegression, and RandomForestClassifier.

1. DummyClassifier:

2. LogisticRegression:

3. RandomForestClassifier:
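The original code for these three baselines is not shown, so the following is only a rough sketch of how they could be set up with scikit-learn. The synthetic dataset from `make_classification`, the roughly 95/5 class split, and all hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's data (assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify so the train and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    # Always predicts the most frequent class; a floor to compare against.
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # plain accuracy on the test set
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Note that the dummy model already scores high on accuracy here, which is why accuracy should not be the only metric used to compare the real models.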

Also, let’s print the confusion matrix for each algorithm.
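One possible way to do this is with `sklearn.metrics.confusion_matrix`. The sketch below again uses a synthetic dataset as a stand-in (an assumption), so the numbers it prints will not match the article's figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

matrices = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Rows are true classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]] with class 1 (the minority) as the positive class.
    matrices[name] = confusion_matrix(y_test, y_pred, labels=[0, 1])
    print(name)
    print(matrices[name])
```

The bottom row is the one to watch: it shows how many minority samples were missed (FN) versus found (TP).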

Figure: Random Forest confusion matrix
Figure: Logistic Regression confusion matrix

As we can see, Random Forest gives better results than Logistic Regression.

Also, let’s try deep learning:

To test how a neural network behaves on an imbalanced dataset, we created a small model.
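The original network is not shown, so as a stand-in here is a small feed-forward network sketched with scikit-learn's `MLPClassifier` rather than a deep learning framework. The layer sizes, the synthetic data, and the scaling step are all assumptions, and results on this data may differ from the article's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A small network with two hidden layers; inputs are standardized first
# because gradient-based training is sensitive to feature scale.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=42),
)
net.fit(X_train, y_train)

y_pred = net.predict(X_test)
# Rows are true classes, columns are predicted classes.
nn_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(nn_matrix)
```

No class weighting or resampling is applied here, so the network is trained on the raw imbalanced distribution.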

The confusion matrix for this approach is as below:

Figure: Neural network confusion matrix

We can see that our model is completely biased towards the majority class: to increase accuracy and reduce the error, it ignores the minority class.

Now let’s see how we can use sampling techniques to reduce the class imbalance problem.