Improving Performance of Machine Learning Models using Bagging Ensemble


Understand how the Bootstrap Aggregation ensemble technique works and how bagging-based models such as Random Forest are used to improve the performance of a model.


The performance of a Machine Learning model tells us how well the model performs on unseen data points. There are various strategies and hacks to improve the performance of an ML model; some of them are:

  • Fine-tuning the hyperparameters of the ML model
  • Using ensemble learning

What is Ensemble Learning?

Ensemble Learning is a technique that combines multiple ML models to form a single model. The individual models, also referred to as base models or weak learners, can use different algorithms or the same algorithm with different hyperparameters.

For classification tasks, for example, the base models can be Logistic Regression, Naive Bayes, Decision Tree, SVM, etc. For regression tasks, the base models can be Linear Regression, Lasso Regression, Decision Tree Regression, etc.

Ensemble Learning combines the advantages of base models to form a single robust model with improved performance. Various types of Ensemble learning techniques are:

  1. Bagging (Bootstrap Aggregation)
  2. Boosting
  3. Voting
  4. Cascading
  5. Stacking

and many more. This article will cover the working and implementation of the Bagging Ensemble technique.

Overview of Bagging (Bootstrap Aggregation):

The bagging ensemble technique, also known as Bootstrap Aggregation, uses randomization to improve performance. In bagging, weak learners (or base models), each trained on only part of the dataset, are used as building blocks: a complex, robust model is designed by combining several of them.

Most of the time, these base models do not perform well on their own because they either overfit or underfit. Whether a model overfits or underfits is characterized by the bias-variance tradeoff.

What is the Bias-Variance Tradeoff?

The overall error of a model depends on the bias and variance of the model according to the following equation:

Total Error = Bias² + Variance + Irreducible Error

For a good, robust model, the error should be as low as possible. To minimize the error, the bias and variance need to be minimized, while the irreducible error remains constant. The plot of error vs. model flexibility (degrees of freedom) below describes how bias and variance change along with the test and training error:

Source, Bias-Variance Tradeoff error plot

Analysis of the above plot:

  • When the model is in the initial phase of training, the training and test errors are both very high.
  • When the model is trained excessively (very flexible), the training error is very low but the test error is high.
  • The phase where both training and test error are high is Underfitting.
  • The phase where the training error is low and the test error is high is Overfitting.
  • The phase where there is a balance between the training and test errors is the Best fit.
  • An Underfit model has low variance and high bias.
  • An Overfit model has high variance and low bias.

The bagging ensemble technique works best with base models that have low bias and high variance. Bagging uses randomization of the dataset (discussed later in this article) to reduce the variance of the base models while keeping the bias low, as the sketch below illustrates.
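
To make this concrete, here is a minimal sketch (assuming scikit-learn is available, on a synthetic dataset used purely for illustration) comparing a single fully grown decision tree, which has low bias but high variance, with a bagged ensemble of such trees:

```python
# Minimal sketch: a single high-variance decision tree vs. a bagged
# ensemble of decision trees (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Synthetic dataset used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A fully grown tree: low bias, high variance
single_tree = DecisionTreeClassifier(random_state=42)

# Bagging 100 such trees (a decision tree is BaggingClassifier's
# default base estimator): similar bias, reduced variance
bagged_trees = BaggingClassifier(n_estimators=100, random_state=42)

print("Single tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```

Typically, the bagged ensemble's cross-validation scores are both higher and more stable across folds than those of the single tree, which is exactly the variance reduction described above.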

Working of Bagging:

It is now clear that bagging reduces the variance of base models while keeping the bias low. The variance of the base models is reduced by combining the strategies of bootstrap sampling and aggregation. The entire working of bagging is divided into 3 phases:

  1. Bootstrap Sampling
  2. Base Modeling
  3. Aggregation

The diagram below describes all three steps for a sample dataset D having n rows:

(Image by Author), 3 Steps of Bagging — Bootstrap Sampling, Modeling, Aggregation

Bootstrap Sampling:

A bootstrap sample is a smaller sample that is “bootstrapped” from a larger sample. Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample.

Given a dataset with n rows and f features, bootstrap sampling means sampling with replacement to create k different smaller datasets, each of size m, with the same f features. Each smaller dataset D_i sees only a subset of the original dataset. In the figure below, the initial dataset D of shape (n, f) is sampled into k datasets, each of shape (m, f), where m < n.

(Image by Author), Bootstrap Sampling of Dataset

The image below shows bootstrap sampling on a concrete example. The dataset D, having 10 rows, is sampled with replacement into k smaller datasets, each having 5 rows. In terms of the diagram above, n = 10 and m = 5.

Notice that each dataset formed by bootstrapping sees only a part of the original dataset, and the datasets are sampled independently of each other.

(Image by Author), Bootstrap Sampling for a sample dataset of 10 rows.

This is the 1st step of the bagging ensemble technique, in which k smaller datasets are created by bootstrapping, independently of each other. A minimal code sketch of this step is shown below.
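
The bootstrap sampling step can be sketched in a few lines of Python. This is a minimal illustration using NumPy, with the shapes n, k, and m following the diagrams above:

```python
# Minimal sketch of step 1 (bootstrap sampling) using NumPy.
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_samples(X, y, k, m):
    """Draw k bootstrap samples of size m (with replacement) from (X, y)."""
    n = X.shape[0]
    samples = []
    for _ in range(k):
        idx = rng.integers(0, n, size=m)   # indices drawn with replacement
        samples.append((X[idx], y[idx]))
    return samples

# Toy example matching the figure: n = 10 rows, k datasets of m = 5 rows each
X = np.arange(10).reshape(10, 1)              # single-feature toy dataset
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # toy binary labels
datasets = bootstrap_samples(X, y, k=3, m=5)
```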

Modeling:

Modeling is the 2nd step of bagging. After the k smaller datasets are created by bootstrapping, a model is trained on each of the k datasets using an ML algorithm. The algorithms used to train the k datasets can be the same (with or without a change in hyperparameters), or different algorithms can be used.

For example,

  • Decision Tree algorithms can be used as base models with different hyperparameters, such as the tree depth.
  • A combination of different algorithms, such as SVM, Naive Bayes, and Logistic Regression, can be used.

The models trained on each bootstrap dataset are called base models or weak learners. The diagram below describes the training of a separate model on each dataset; a short code sketch continuing the earlier example follows the diagram:

(Image by Author), Modeling of Bootstrap dataset
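
Continuing the NumPy sketch from the bootstrap sampling step, the modeling step could look like this (decision trees with different depths are just one illustrative choice of base models):

```python
# Step 2 of the sketch: train one base model per bootstrap dataset.
# Decision trees with different 'max_depth' values are used here as an
# illustrative choice; any mix of algorithms could be used instead.
from sklearn.tree import DecisionTreeClassifier

base_models = []
for i, (X_i, y_i) in enumerate(datasets):
    model = DecisionTreeClassifier(max_depth=2 + i, random_state=i)
    model.fit(X_i, y_i)
    base_models.append(model)
```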

Aggregation:

A final, powerful, robust model is created by combining the k different base models. Since each base model is trained on a different bootstrap sample, the models may produce different predictions. The aggregation technique used depends on the problem statement; a code sketch of this step follows the diagram below.

  • For a regression problem: the aggregation can be taking the mean of the predictions of the base models:
prediction = (pred_1 + pred_2 + ... + pred_k) / k
Notation,
prediction: final output of the bagging ensemble
k: number of base models
pred_i: prediction of the i-th base model
  • For a classification problem: the aggregation can be done by majority voting; the class having the maximum number of votes is declared the final prediction:
prediction = C, where C is the class in {1, 2, ..., c} that receives the most votes among pred_1, ..., pred_k
Notation,
prediction: final output of the bagging ensemble
pred_i: predicted target class of the i-th base model
1, 2, ..., c: the c different target classes
C: the target class having the maximum vote
(Image by Author), Aggregation of k base models
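
Putting the notation above into code, the aggregation step of the running sketch could look like this. Majority voting is shown for classification; the regression case (taking the mean) appears as a comment:

```python
# Step 3 of the sketch: aggregate the predictions of the k base models.
import numpy as np

X_new = np.array([[3], [7]])   # hypothetical query points
all_preds = np.array([m.predict(X_new) for m in base_models])  # shape (k, n_queries)

# Classification: majority vote across the k base models
prediction = np.array([
    np.bincount(all_preds[:, j].astype(int)).argmax()
    for j in range(all_preds.shape[1])
])

# Regression (if the base models were regressors): mean of the k predictions
# prediction = all_preds.mean(axis=0)

print(prediction)
```

In practice, these three steps rarely need to be written by hand: scikit-learn's BaggingClassifier and BaggingRegressor wrap all of them, and Random Forest, mentioned at the start of this article, is essentially the same pipeline with decision trees as base models plus random feature selection at each split, available as RandomForestClassifier and RandomForestRegressor.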