The recent QUEEN of AI Algos: XGBoost, and its future


Understanding XGBoost

In order to understand XGBoost, we must first understand Gradient Descent and Gradient Boosting.

a) Gradient Descent:

A cost function measures how close the predicted values are to the corresponding actual values. Ideally, we want the difference between the predicted and actual values to be as small as possible, so we want the cost function to be minimized.

The weights of a trained model determine how close its predictions are to the actual values: the better the weights, the more accurate the predictions and the lower the cost function. The weights are learned from the records in the training set and updated as training proceeds.

Gradient Descent is an iterative optimization algorithm for minimizing a function of several variables, so it can be used to minimize the cost function. It first runs the model with initial weights, then repeatedly updates those weights, over several iterations, in the direction that reduces the cost function.
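
To make this concrete, here is a minimal NumPy sketch (not from the original post) of Gradient Descent minimizing a mean-squared-error cost for a simple linear model; the toy data, learning rate and iteration count are illustrative choices.

```python
import numpy as np

# Toy data: y = 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 2 + rng.normal(scale=0.1, size=100)

# Initial weights (slope w and intercept b)
w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(500):
    y_pred = w * X + b
    error = y_pred - y
    cost = np.mean(error ** 2)            # MSE cost function
    # Gradients of the cost with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Update the weights in the direction that lowers the cost
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final cost={cost:.4f}")
```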

b) Gradient Boosting:

Boosting: An ensemble of weak learners is built, where the misclassified records are given greater weight (‘boosted’) so that later models predict them correctly. These weak learners are then combined to produce a single strong learner. There are many Boosting algorithms, such as AdaBoost, Gradient Boosting and XGBoost. The figure below depicts a tree ensemble model.

[Figure] A tree ensemble model predicting whether a given user likes computer games. The values +2, +0.1, -1, +0.9 and -0.9 are the prediction scores in each leaf. The final prediction for a given user is the sum of the predictions from each tree.
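
For example (an illustrative sketch only, assuming the first tree assigns the user a leaf score of +2 and the second tree a leaf score of +0.9):

```python
# Leaf scores assigned to one user by the two trees in the figure (assumed values)
tree_scores = [2.0, 0.9]

# The ensemble's final prediction is simply the sum of the per-tree scores
final_prediction = sum(tree_scores)
print(final_prediction)  # 2.9
```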

Gradient Boosting carries the principles of Gradient Descent and Boosting over to supervised learning. Gradient Boosted Models (GBMs) are ensembles of trees built sequentially, in series, and the final prediction is a weighted sum of the individual models.

  • Each new model uses Gradient Descent optimization to update (i.e., correct) the weights it learns, so as to reach a local minimum of the cost function.
  • The vector of weights assigned to each model is not derived from the previous model’s misclassifications (with misclassified records up-weighted), but from the weights optimized by Gradient Descent to minimize the cost function. The result of Gradient Descent is the same model function as at the beginning, just with better parameters.
  • Gradient Boosting, at each step, adds a new function to the existing function to predict the output. The result of Gradient Boosting is an altogether different function from the one it started with, because it is the sum of multiple functions (see the sketch after this list).
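
The following sketch (not from the original post) illustrates this additive idea with a hand-rolled gradient-boosting loop for squared-error loss: each new shallow tree is fit to the current residuals (the negative gradient of the loss) and added to the running prediction. The data, tree depth, learning rate and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean), then add one small tree per round
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                     # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                         # each new tree fits the current residuals
    prediction += learning_rate * tree.predict(X)  # add the new function to the ensemble
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```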

c) XGBoost:

XGBoost was built to push the limits of computational resources for boosted trees. It is an implementation of GBM with major engineering improvements. The trees are still built sequentially, but the construction of each individual tree (the search for the best splits across features) is parallelized, which makes XGBoost much faster.
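
A minimal usage sketch with the xgboost scikit-learn wrapper is shown below; the synthetic dataset, the train/test split and the hyper-parameter values are illustrative assumptions, not recommendations from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=200,   # number of boosting rounds (trees built in sequence)
    learning_rate=0.1,  # shrinkage applied to each new tree
    max_depth=4,        # depth of each individual tree
    n_jobs=-1,          # parallelize tree construction across all CPU cores
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```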