Mathematics behind Random forest and XGBoost

Source: Deep Learning on Medium

Mathematics behind Random forest and XGBoost

What are ensembles?

Ensemble means Collection or group of things

Ensemble methods is a machine learning technique that combines several base models in order to produce one optimal predictive model(powerful model).

Ensemble Methods allow us to take a sample of Decision Trees into account, calculate which features to use or questions to ask at each split and make a final predictor based on the aggregated results of the sampled Decision Trees.

Types of Ensemble Methods:

  1. BAGGing, or Bootstrap AGGregating(Random forest)
  2. Boosting
  3. Stacking
  4. Cascading

BAGGing, or Bootstrap AGGregation:

All sampling is done at Sampling with replacement. Each model is built at a different subset of data. A model is set to have high variance if the model changes a lot with changes in training data. So bagging is a concept to reduce variance in the model without impacting bias.

Bagging=DT + Row sampling

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

By taking a small sample dataset for each model, the aggregate model does not change much because it only impacts a small subset of a dataset.

Typical aggregation operation is mean/median. Mean/Median in case of regression and majority vote in case of the classification problem. Final aggregation step is the final model(h).

Take bunch of low bias and high variance models(h1,h2,h3,h4…) and combine them with bagging. you will get low bias and reduced variance model(h). Ex. Decision tree with very high depth.

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

Random forest the most popular Baggind model used now a day for low bias and high variance datasets.

Random forest:

random-forest do both row sampling and column sampling with Decision tree as a base. Model h1, h2, h3, h4 is more different than by doing only bagging because of column sampling.

As you increase the number of base learners (k), the variance will decrease. when you decrease k, variance increases. But bias remains constant for the whole process. k can be found using cross-validation.

Random forest= DT(base learner)+ bagging(Row sampling with replacement)+ feature bagging(column sampling) + aggregation(mean/median, majority vote)

here we want our base learner as low bias and high variance. so train DT to full depth length. we are not worried about depth, we let them grow because at end variance decrease in aggregation.

For model h1, (D-D’) dataset not used in modeling are out of bag dataset. they used for cross-validation for the h1 model.

Let’s look at the steps taken to implement a Random forest:

1. Suppose there are N observations and M features in the training data set. First, a sample from the training data set is taken randomly with replacement.

2. A subset of M features is selected randomly and whichever feature gives the best split is used to split the node iteratively.

3. The tree is grown to the largest.

4. The above steps are repeated and prediction is given based on the aggregation of predictions from n number of trees.

Train and run-time complexity

Training time = O(log(nd)*k)

Run time = O(depth*k)

Space = O(store each DT*K)

As the number of base model increases, training run time increases so always use Cross-validation to find optimal hyperparameter.

Code:

Extremely randomized tree:

Extreme randomized tree = DT(base learner)+ bagging(Row sampling with replacement)+ feature bagging(column sampling) + aggregation(mean/median, majority vote) + randomization when selecting threshold

One more level to decrease variance but bias may increase slightly. but this method not used much in real life.

Cases:

  1. All cases of decision trees also the same for the random forest except in RF, variance affected ny number of base learner(K).
  2. feature importance in case of decision tree only on one model but in RF we also consider all k base learner model(k).

Note: all cases of Decision trees also applicable to the random forest.

Boosting:

Boosting is another ensemble technique to create a collection of predictors. In this technique, learners are learned sequentially with early learners fitting simple models to the data and then analyzing data for errors. In other words, we fit consecutive trees (random sample) and at every step, the goal is to solve for net error from the prior tree.

this work in case of high bias and low variance base model and additively combine. we will try to reduce bias.

When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. By combining the whole set at the end converts weak learners into a better performing model.

Note: in RF we can not minimize loss because we are not using. but in Gradient boosting we can minimise any loss

Here L(y,F(x)) is loss function(log-loss, linear loss, hing loss etc) and M is the number of the base model. all have done sequencly hnce GBDT not parallisied so take more time train even compare to RF.

Gradient boosting is a base learner have high bias and low variance models. here we will use the Gradient boosting decision tree with a very low depth value.

As v reduces, giving less weighted to model. if v increase, your chances of overfit increases. always do cross-validation for the number of the base models(M) and shrinkage parameters (v).

As M increases bias decrease certainly but variance also increases. So use regularization. Here In shrinkage, we simply multiply “v” with the second term. Hence reduce “game” by constant “v”. As “v” reduces, chances of overfitting decrease so variance decreases.Both M and “v” use for the bias-variance tradeoff.

Train and Run time complexity

Training time = O(log(nd)*k)

Run time = O(depth*k)

Space = O(store each DT* k )

XGBoost

XGBoost: Boosting + Randomization

https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

https://github.com/dmlc/xgboost/blob/master/demo/guide-python/sklearn_examples.py

Adaboost:

Mostly use in image processing and face detection problem. At every stage, it gives more weight to those points which are misclassified.

You can tune the parameters to optimize the performance of algorithms, I’ve mentioned below the key parameters for tuning:

  • n_estimators: It controls the number of weak learners.
  • learning_rate: Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
  • base_estimators: It helps to specify different ML algorithms.

Stacking models:

The first train each model independently(C1, C2, C3,…)

the meta classifier can be logistic, RF or any.

Cascading classifiers:

this model mainly used when the cost of an error occurring is very high. Widely used in credit card companies to detect fraud transactions. or medical domain to detect rare problem to save a life.

dataset train in the second step is (D-D1) dataset which are not used in the first step modeling.

============Happy ending==================

Reference:

Google image

Applied AI