A Perfect Guide to Ensemble Learning

Original article was published by Shivam Singh on Artificial Intelligence on Medium



Ensemble Learning is a strategy in which we aggregate the predictions of a group of different predictors in order to improve the performance of our model. In contrast to ordinary learning approaches that try to construct one learner from the training data, ensemble methods try to construct a set of learners and combine them. Ensemble learning is also called committee-based learning or learning multiple classifier systems. If you look at the architecture of an ensemble, you can observe that it contains different learners, which are called base learners. These base learners can be KNN, SVM, Logistic Regression, Decision Trees, or any other kind of learning algorithm. Many ensemble methods use the same type of learning algorithm for all base learners; these are called homogeneous ensembles. There are also methods that combine different types of learning algorithms; these are called heterogeneous ensembles.

A Common Ensemble Architecture

The main aim of ensemble methods is to reduce the generalization error, which can be decomposed as:

Generalization Error = Bias² + Variance + Irreducible Error

Generally, there are two kinds of base learners: weak learners and strong learners. If the base models are weak learners, they are highly biased models, and if the base models are strong learners, their variance is very high. With the help of ensemble methods, we try to reduce both, i.e. the bias in the case of weak learners and the variance in the case of strong learners, and build a generalized model that maintains a balance between these two types of error. In an ensemble method we can generate our base learners in two ways: sequential ensemble methods, where the base learners are generated one after another, and parallel ensemble methods, where the base learners are generated in parallel.

There are four main ensemble techniques that are most widely used:

  1. Bagging (Bootstrap Aggregation)
  2. Boosting
  3. Stacking
  4. Cascading

Bagging

The name Bagging comes from the abbreviation of Bootstrap Aggregating [Breiman, 1996d]. Bagging has two key ingredients: bootstrap sampling and aggregation. We know that combining independent base learners leads to a dramatic decrease in error, and therefore we want our base learners to be as independent as possible. In Bagging, we create different random subsets of the dataset with the help of bootstrap sampling. In detail, given a training data set containing ‘n’ training examples, a sample of ‘m’ training examples is generated by sampling with replacement. For aggregating the outputs of the base learners, Bagging uses the most popular strategies: majority voting for classification tasks and averaging for regression tasks.

In Bagging, we usually combine several strong learners: all the base models are overfitted models with very high variance, and at the aggregation step we try to reduce that variance without affecting the bias, which can improve accuracy. We can summarize the bagging technique in a few points (a minimal code sketch follows the list):

  1. Make subsets with replacement: this means every item may appear in multiple subsets.
  2. Fit a base model on every subset of the sample.
  3. All the base models are run in parallel and are usually independent of each other.
  4. Predict on the test data with each model and then aggregate the predictions (either by voting or by averaging) to form the final prediction.
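
As a concrete illustration, here is a minimal bagging sketch using scikit-learn; the synthetic dataset and the hyperparameter values are only illustrative, and BaggingClassifier’s default base learner (a decision tree) is used.

```python
# A minimal bagging sketch (assumes scikit-learn is installed; data and
# hyperparameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, just for demonstration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 base learners (decision trees by default), each fitted on a bootstrap
# sample drawn with replacement; predictions are aggregated by majority vote.
bagging = BaggingClassifier(
    n_estimators=100,
    bootstrap=True,   # sample with replacement
    n_jobs=-1,        # train the base learners in parallel
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))
```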

Random Forest

Random Forest is a widely used ensembling (bagging) technique whose major difference from plain Bagging is the incorporation of randomized feature selection. The base learners in a Random Forest are deep decision trees. There is one hyperparameter K which controls how much randomness is introduced.

  • The constructed decision tree is identical to the traditional deterministic decision tree when the value of K equals the total number of features.
  • When K = 1, a feature will be selected randomly.

A good value of K may be the logarithm of the number of features. It is quite interesting to notice that randomness is introduced only into the feature selection process, not into the choice of split points on the selected feature. To sum up, a decision tree model is fitted on each subset created from the original dataset by bootstrapping, and the final prediction is calculated by aggregating the predictions from all the decision trees (averaging for regression, majority voting for classification).
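
A minimal Random Forest sketch with scikit-learn is shown below; here max_features plays the role of the hyperparameter K described above (the suggested logarithm rule corresponds to "log2"), and the remaining values are illustrative.

```python
# A minimal random forest sketch; max_features corresponds to the hyperparameter K
# discussed above, and the other values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="log2",  # consider log2(n_features) random features at each split
    n_jobs=-1,
    random_state=42,
)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))
```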

Boosting

Boosting is a family of algorithms that are able to convert weak learners into strong learners. In a boosting technique, the base learners are trained sequentially, and each new learner tries to reduce the error of its predecessor by updating the model so that it performs better than the previous learner. To sum up, boosting works by training a set of learners sequentially and combining them for prediction, where the later learners focus more on the mistakes of the earlier learners. There are many boosting methods available, but the most popular ones are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.

The architecture of Boosting Technique

Adaptive boosting or AdaBoost

AdaBoost is a very common and widely used boosting technique in which we pay a bit more attention to the training instances that the predecessor underfitted. To build an AdaBoost classifier, a first base classifier is trained and used to make predictions on the training set; the relative weights of the misclassified training instances are then increased. A second classifier is trained using the updated weights and again makes predictions on the training set, and the same process is repeated until the desired number of learners is reached. There are basically two hyperparameters in AdaBoost, the learning rate and n_estimators, and we can find their best values by doing cross-validation or hyperparameter tuning.
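
As a minimal sketch with scikit-learn, the two hyperparameters mentioned above can be tuned with a simple cross-validated grid search; the data and the grid values here are just illustrative.

```python
# A minimal AdaBoost sketch with cross-validated hyperparameter tuning
# (the data and grid values are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search over the two hyperparameters discussed above.
param_grid = {"n_estimators": [50, 100, 200], "learning_rate": [0.1, 0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("AdaBoost accuracy:", search.score(X_test, y_test))
```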

Gradient Boosting

Another widely used boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding models to an ensemble, where each one tries to correct its predecessor. However, instead of updating the instance weights at every iteration as AdaBoost does, this technique fits the new model to the residual errors made by the previous model. If all the base learners are decision trees, the technique is known as Gradient Boosting Decision Tree (GBDT). The learning rate scales the contribution of each tree. If the learning rate is low, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This regularization technique is known as shrinkage.
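
Here is a minimal gradient boosting sketch with scikit-learn that pairs a low learning rate (shrinkage) with a larger number of trees, as discussed above; the values are illustrative rather than tuned.

```python
# A minimal GBDT sketch; a small learning rate (shrinkage) is compensated for
# by a larger number of trees. The data and values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbdt = GradientBoostingClassifier(
    n_estimators=500,    # more trees to compensate for the small learning rate
    learning_rate=0.1,   # shrinkage: scales the contribution of each tree
    max_depth=3,         # shallow trees are typical base learners for GBDT
    random_state=42,
)
gbdt.fit(X_train, y_train)
print("GBDT accuracy:", gbdt.score(X_test, y_test))
```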

Besides AdaBoost and Gradient Boosting, there are some advanced boosting techniques that have become very popular nowadays, as they are very fast and often give much better performance than these two methods. You can read about those algorithms using the links below:

  1. XGBoost
  2. CatBoost
  3. LightGBM

Stacking

Stacking is a general procedure in which a learner is trained to combine the individual learners. In stacking, instead of combining weak learners or strong learners, we combine different base models, each of which predicts an outcome or class label with some probability, and then we combine all these predictions to make the final prediction. In stacking, we train the first-level learners on the original training data set and then generate a new data set for training the second-level learner, where the outputs of the first-level learners are used as the inputs of the second-level learner while the original labels are still used as the labels of the new training data. The first-level learners are often generated using different learning algorithms, which is why stacked ensembles are often called heterogeneous, though it is also possible to construct homogeneous stacked ensembles.
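
A minimal heterogeneous stacking sketch with scikit-learn follows; the particular first-level learners and the logistic-regression second-level learner are illustrative choices.

```python
# A minimal stacking sketch: heterogeneous first-level learners whose predictions
# are combined by a second-level (meta) learner. Data and models are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

first_level = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
    ("knn", KNeighborsClassifier()),
]
stack = StackingClassifier(
    estimators=first_level,
    final_estimator=LogisticRegression(),  # second-level learner
    cv=5,  # out-of-fold predictions of the first level train the second level
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```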

The architecture of Stacking Ensemble Learning

Cascading

According to Google, cascading in plain English means “a process whereby something, typically information or knowledge, is successively passed on”. It is a very powerful ensemble learning method, mostly used by machine learning engineers when they want to be absolutely sure about a prediction. You can understand the cascading technique through the well-known problem of deciding whether a transaction made by a user is fraudulent or not.

In this problem we have two class labels: ‘0’ means the transaction is not fraudulent and ‘1’ means it is fraudulent. The dataset is highly imbalanced: 99.4% of the instances belong to class ‘0’ and just 0.6% belong to class ‘1’. It is therefore very hard to get an accurate result with only one model, so we build a sequence of models (a cascade of models) to get the most accurate result. This approach is typically used when the cost of making a mistake is very high. You can get a clear idea of how the technique works with the help of this diagram.

Steps to get the result using cascade models (a minimal code sketch follows this list):

  • Suppose we have a transaction query point. We feed that query point to Model 1, and Model 1 gives us the class probabilities.
  • Suppose the predicted probabilities are given by P(Yq = 0) and P(Yq = 1), where Yq is the class label of the query point.
  • In this case, we set a threshold of 99%, which means that if P(Yq = 0) > 0.99, we declare the final prediction to be not fraudulent.
  • However, if P(Yq = 0) < 0.99, we are not very sure whether or not it is a fraudulent transaction, although there is still a high chance that it is not.
  • So, when we are slightly unsure about the prediction, we pass the query point to Model 2, and the same process repeats there.
  • If again we get P(Yq = 0) < 0.99, we still aren’t sure! Hence, we pass the query point to another model, Model 3, in the cascade, which does the same thing.
  • If that is still not enough, there is a human being who sits at the end of the cascade and personally asks the customer whether he or she made that transaction.
  • Now we are absolutely sure about the transaction: if the person says they made it, it is not fraudulent; otherwise, we can say that it is a fraudulent transaction.
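
The decision logic of such a cascade can be sketched as a small Python function; the 0.99 threshold and the assumption that each model exposes a scikit-learn-style predict_proba method (with class ‘0’ in the first column) are illustrative choices, not a fixed recipe.

```python
# A minimal sketch of the cascade's decision logic. Assumes each model exposes a
# scikit-learn-style predict_proba and that column 0 corresponds to class '0'
# (not fraudulent). The 0.99 threshold is illustrative.
def cascade_predict(x_query, models, threshold=0.99):
    """Return 'not fraud' or 'escalate to human' for a single query point."""
    for model in models:
        p_not_fraud = model.predict_proba([x_query])[0][0]  # P(Yq = 0)
        if p_not_fraud > threshold:
            return "not fraud"   # confident enough, stop here
        # otherwise fall through to the next, more specialized model in the cascade
    return "escalate to human"   # no model was confident enough
```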

In a typical cascading system, the complexity of the models increases as we go deeper into the cascade. Please note that every model in a cascade must be powerful and have very high accuracy on unseen data.

This is all about the Ensemble Learning method. Hope you enjoyed reading this article.

Happy Learning.