Ensemble Methods

We have discussed that a decision tree (DT) suffers from high variance, which makes the chances of overfitting high; that is, a DT is vulnerable to overfitting. To overcome this we use ensemble methods, which include Bagging, Boosting and Random Forest.

Before discussing ensemble methods, let us first understand the types of sampling. Assume that we have a set of size n and we need a sample of size m; this can be done in 2 ways:

  1. Sampling without replacement
  2. Sampling with replacement

numpy.random.choice(a, size=None, replace=True, p=None)

If replace=True, the sampling is with replacement; otherwise it is without replacement. We will be using sampling with replacement for reducing the variance.

Samples drawn with replacement are called bootstrap samples. The main concept behind the bootstrap is that rather than generating independent training sets of size n, we generate samples (with replacement, which is mandatory) of size n from the single given training set of size n.
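As a quick sketch (using a toy array as a stand-in "training set"), both kinds of sampling can be drawn with the numpy.random.choice call shown above; the duplicates in the second draw are what make it a bootstrap sample:

```python
import numpy as np

data = np.arange(10)  # a toy "training set" of size n = 10

# Sampling without replacement: every element appears at most once
without = np.random.choice(data, size=5, replace=False)

# Sampling with replacement (a bootstrap sample of the same size n):
# elements may repeat, and some elements of `data` will be left out
bootstrap = np.random.choice(data, size=len(data), replace=True)

print(without)
print(bootstrap)
```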

Bootstrap Aggregation/ Bagging: Reduces the variance

Aim: How to reduce variance in DT?

For instance,

  1. we have a set of n independent observations X₁, X₂, …, Xₙ
  2. each with variance σ²
  3. the variance of the mean of the observations is σ²/n, where the mean is X̄ = (X₁ + X₂ + … + Xₙ)/n
  4. so, on averaging a set of observations, the variance is reduced (a quick numerical check follows below)
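A quick numerical check of the σ²/n fact (a sketch with made-up values of n and σ):

```python
import numpy as np

np.random.seed(42)
n, sigma = 50, 2.0   # n observations, each with variance sigma^2 = 4

# Draw many sets of n observations and compute the mean of each set
means = np.random.normal(loc=0.0, scale=sigma, size=(100_000, n)).mean(axis=1)

print("variance of a single observation:", sigma ** 2)      # 4.0
print("empirical variance of the mean:  ", means.var())     # close to sigma^2 / n = 0.08
print("theoretical sigma^2 / n:         ", sigma ** 2 / n)
```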

Bagging

When we want to reduce the variance of a prediction algorithm, we take many training sets, construct a separate prediction rule using each training set, and lastly average the resulting predictions.

Mathematically,

  1. Calculate f̂¹(x), f̂²(x), …, f̂ᴮ(x), where B = the number of separate training sets
  2. Average them: f̂_avg(x) = (1/B) Σᵇ f̂ᵇ(x), summing over b = 1, …, B

However, we do not have access to multiple training sets. Instead, we use bootstrap samples: we train the prediction rule on each bootstrap sample and average all the predictions.

Steps:

  1. Build B different bootstrapped training sets
  2. Train the prediction algorithm on the bth bootstrapped training set to get f̂*ᵇ(x)
  3. Average all the predictions to get f̂_bag(x) = (1/B) Σᵇ f̂*ᵇ(x), summing over b = 1, …, B

In general, the above procedure is called bagging.
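A minimal sketch of this procedure for regression, assuming NumPy arrays and scikit-learn's DecisionTreeRegressor as the prediction algorithm (the function and its arguments are illustrative, not code from the original article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    """Train B trees on B bootstrapped training sets and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.choice(n, size=n, replace=True)   # bth bootstrapped training set
        tree = DecisionTreeRegressor()              # grown deep, not pruned
        tree.fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                       # f_bag(x): average of the B predictions
```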

Bagging in Regression Trees(RT)

To apply bagging to RTs, we build B regression trees using B bootstrapped training sets and, since it is a regression problem, average the resulting predictions. The trees are grown deep and are not pruned, so every single tree has very high variance and low bias; averaging these B trees reduces the variance.

Bagging in Classification Trees(CT)

It is the same as ordinary bagging; the difference is that, for every test input, we record the class predicted by each of the B trees and, since it is a classification problem, take a majority vote: the prediction is the most commonly occurring class among the B predictions.
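A minimal sketch of the majority vote, assuming the B per-tree class predictions have been collected as integer labels in a NumPy array (names here are illustrative):

```python
import numpy as np

def majority_vote(class_preds):
    """class_preds: array of shape (B, n_test) holding each tree's predicted class labels.
    Returns the most frequent class for each test input."""
    vote = lambda col: np.bincount(col).argmax()
    return np.apply_along_axis(vote, axis=0, arr=class_preds)

# Example: 5 trees, 3 test inputs
preds = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 1, 2],
                  [0, 0, 2],
                  [0, 1, 2]])
print(majority_vote(preds))   # [0 1 2]
```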

Key Points from Bagging

  1. The bagged test error rate is lower than the test error rate obtained from a single tree
  2. The number of trees B is not critical; an increase in B does not lead to overfitting
  3. The estimated test error rate can be obtained without cross-validation or a validation set
  4. Each bagged tree uses around 2/3 of the observations; the remaining 1/3 are called out-of-bag (OOB) observations and can be used to estimate the test error (a quick check of this fraction follows below)
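The roughly 2/3 figure comes from the fact that the probability an observation appears in a bootstrap sample of size n is 1 − (1 − 1/n)ⁿ ≈ 1 − 1/e ≈ 0.632. A quick empirical check (toy n and trial count):

```python
import numpy as np

n, trials = 1000, 200
rng = np.random.default_rng(1)

# Fraction of distinct observations appearing in each bootstrap sample
in_bag_fraction = np.mean([
    len(np.unique(rng.choice(n, size=n, replace=True))) / n
    for _ in range(trials)
])
print("average fraction of observations used per bagged tree:", in_bag_fraction)  # ~0.632
print("theoretical value 1 - 1/e:", 1 - np.exp(-1))
```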

Random Forest

It is similar to the bagged tree, but the trees are less correlated: when building each DT, at every split we pick a random sample of m attributes out of the full set of p attributes and consider only those m as candidate split parameters. A fresh sample of m attributes is taken at each split.

  • If m = p, the random forest is the same as bagging
  • Therefore, in RF m is usually taken to be approximately √p (see the sketch below)
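In scikit-learn terms (a sketch, not the article's code), this choice corresponds to the max_features parameter of a random forest:

```python
from sklearn.ensemble import RandomForestRegressor

# Random forest: at each split only ~sqrt(p) randomly chosen attributes are considered
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt")

# With max_features=None, all p attributes are considered at every split,
# which reduces the random forest to plain bagging (m = p)
bagging_equivalent = RandomForestRegressor(n_estimators=500, max_features=None)
```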

Boosting for Regression Tree

The theory behind boosting is that if we combine many weak learners, the combination as a whole will be a strong learner.

Algorithm

(Boosting for regression trees algorithm; source: An Introduction to Statistical Learning with Applications in R.)

Process

  • Boosting learns slowly
  • Given the current prediction rule, we fit a DT to its residuals
  • Then we add a bit of this new DT to the prediction rule so as to improve the residuals
  • Each of the new trees can be small, as determined by the parameter d (what d is will be discussed shortly)
  • This is how we slowly improve f̂.

Theory Behind It

  • B = number of trees; in boosting, overfitting may occur if B is too large, so we can use cross-validation or a validation set to select B
  • η = learning rate (a small positive number); a very small η requires a very large B. Typical values of η are 0.01 or 0.001, but the right value depends on the problem
  • d = number of splits in each tree; it controls the complexity. Usually d = 1 is used, and such a tree is known as a decision stump (a code sketch using B, η and d follows below)
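A minimal sketch of the boosting procedure described above, with B, η (eta) and d as defined in this list and scikit-learn regression trees as the weak learners (parameter values and function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, eta=0.01, d=1):
    """Boosting for regression trees: repeatedly fit a small tree to the residuals."""
    r = np.asarray(y, dtype=float).copy()   # start with residuals r_i = y_i (f_hat = 0)
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # a tree with d splits (d=1: stump)
        tree.fit(X, r)                      # fit the small tree to the current residuals
        r -= eta * tree.predict(X)          # add a shrunken version of the tree, update residuals
        trees.append(tree)
    return trees

def boosted_predict(trees, X_new, eta=0.01):
    """Final prediction rule: f_hat(x) = sum over the B trees of eta * tree(x)."""
    return eta * np.sum([t.predict(X_new) for t in trees], axis=0)
```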

Wrapping Up

RF vs. Boosting

  • In RF, the DTs are produced independently of one another.
  • In boosting, the DTs are produced in a sequential manner; each new tree is fit using information from the previously constructed trees, hence smaller DTs are often sufficient. (Source: An Introduction to Statistical Learning with Applications in R.)