Inside the for loop, it minimizes the loss of the old classifier f_{m-1} plus the current classifier G, picks the best one (with respect to the best parameters), and updates the classifier. For AdaBoost, the loss L is chosen to be the exponential loss.

If the closed-form solution to the minimization problem in forward stagewise additive modeling is hard to find, we can instead move our prediction f(x_i) toward y_i by gradient descent, as shown in Equation (105).

This is the core idea of gradient boosting. To implement gradient boosting, we can proceed as follows:

1. fit a simple model to get a prediction f(x) for all the data points
2. compute the gradient g based on Equation (105) for all data points
3. train a new model using g
4. update f(x) using Equation (105)
5. repeat steps 2 to 4 until g converges
As a concrete example, suppose the loss is mean squared error and the training targets are [2, 3, 4]. Assume in step 1 we predict [1, 2, 3]; in step 2 we get the residual [2-1, 3-2, 4-3] = [1, 1, 1]. In step 3 we train a model on the residual [1, 1, 1], and assume it predicts [0.5, 0.5, 0.5]; in step 4 we update our prediction from [1, 2, 3] to [1.5, 2.5, 3.5]. Repeat steps 2 to 4 until the residual stops changing.
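The example above can be sketched in a few lines of numpy. The "weak learner" here is a stand-in that, as in the text's example, recovers only half of the residual it is trained on; in a real implementation it would be a small decision tree:

```python
import numpy as np

def weak_fit(g):
    """Stand-in for a weak learner: assume the fitted model
    recovers only half of the residual it is trained on."""
    return 0.5 * g

def gradient_boost_mse(y, n_rounds=100):
    f = np.zeros_like(y)                # step 1: initial prediction
    for _ in range(n_rounds):
        g = y - f                       # step 2: negative gradient of MSE = residual
        f = f + weak_fit(g)             # steps 3-4: fit a model to g, update f
        if np.abs(g).max() < 1e-6:      # step 5: stop once the residual converges
            break
    return f

y = np.array([2.0, 3.0, 4.0])
print(np.round(gradient_boost_mse(y), 3))  # converges to [2. 3. 4.]
```

Each round shrinks the residual by half, so the prediction approaches the targets geometrically.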

That’s all about boosting. To summarize, the pros of boosting are:

decrease in bias
better accuracy
additive modeling
The cons are:

increase in variance
prone to overfitting
Neural networks
A neural network consists of layers of neurons, where each neuron consists of a linear function followed by an activation. Layers of neurons form a model, and a model is architecture plus parameters. There are already lots of tutorials on neural networks, so in this post I think it is better to walk through an example that covers the basic techniques: forward propagation, backward propagation, and optimization.

The example we are going to talk about is image classification. Say we have a bunch of images of cats and dogs as a training set, and we want our model to determine whether a given image is a cat or a dog. Instead of applying a convolutional neural network, we go with the most basic neural network: we flatten the pixels of each image into a feature vector of size n and feed it into our model. Our model is a 3-layer network that looks like:

In terms of forward propagation, if we use the sigmoid activation function, we can write down the equations like:

In practice we would like to use vectorization: process a batch of examples at a time instead of one example at a time. Therefore, we can stack m examples together to form the input matrix X:

We stack the m input vectors as columns to form a matrix of shape n*m. While passing through the first layer, we can do the matrix multiplication as shown in the picture above. Notice b[1] will be broadcast from 3*1 to 3*m so that every example shares the same b values. In the end Z[1] is of shape 3*m. Following the same procedure, we can write down the shape of every variable in the forward propagation as follows:
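The vectorized forward pass can be sketched in numpy as follows. The first hidden layer has 3 units as in the text; the remaining layer sizes (2 hidden units, 1 output unit) and the input/batch sizes are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: n input features, m examples stacked as columns of X.
n, m = 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))

W1, b1 = rng.normal(size=(3, n)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))
W3, b3 = rng.normal(size=(1, 2)), np.zeros((1, 1))

Z1 = W1 @ X + b1    # b1 broadcasts from 3x1 to 3xm; Z1 is 3xm
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2   # Z2 is 2xm
A2 = sigmoid(Z2)
Z3 = W3 @ A2 + b3   # Z3 is 1xm
A3 = sigmoid(Z3)    # one predicted probability per example
print(A3.shape)     # (1, 5)
```

Every example flows through the network at once; no Python loop over examples is needed.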

In order to train this neural network, we can define cost function as the mean of binary cross entropy:

Next we use back-propagation to compute the gradient of the cost function with respect to each parameter w and b, and then update w and b using gradient descent. The way to compute a gradient is to first write it out using the chain rule (following the reverse order of forward propagation); you will also notice that some of the terms are shared, so they do not need to be recomputed multiple times:

We can unfold every term of the gradient w.r.t. w[3] and get:
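A standard reconstruction of this unfolded gradient, assuming the sigmoid output layer and the mean cross-entropy cost defined above (A^[3] is the network output, Y the labels, m the number of examples):

```latex
\frac{\partial J}{\partial W^{[3]}}
  = \frac{\partial J}{\partial A^{[3]}}\,
    \frac{\partial A^{[3]}}{\partial Z^{[3]}}\,
    \frac{\partial Z^{[3]}}{\partial W^{[3]}}
  = \frac{1}{m}\left(A^{[3]} - Y\right){A^{[2]}}^{\top}
```

The sigmoid derivative A^[3](1 - A^[3]) cancels against the denominators of the cross-entropy derivative, which is why the final expression is so simple.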

Similarly, we can unfold the gradient w.r.t. w[2]. w[1] follows with the same idea.

Notice we need to rearrange each term to make sure the shape of the right hand side matches the left hand side. One trick is that every time we take the derivative of sigmoid, it becomes an element-wise product. The way to identify the shape of the left hand side is: if the numerator is a constant (which is the case for the cross-entropy loss) and the denominator is a matrix, then the final shape is the same as the denominator's; if the numerator is a vector of size m and the denominator is a vector of size n, then the left hand side is of shape m*n.

That’s all about back-propagation. In practice, deep learning frameworks such as TensorFlow and PyTorch have already done it for us; we only need to construct the model by specifying the forward propagation.

Next we are going to talk about optimization.

First we can normalize the input to make sure the feature in every dimension has mean 0 and variance 1. This avoids the case where the gradient in some dimensions is too large while in others it is small. In other words, it avoids wiggling in the gradient updates:
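A minimal sketch of this normalization, keeping the n-features-by-m-examples layout used above (eps is a small constant to guard against zero-variance features):

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Standardize each feature (row) of X to zero mean, unit variance.
    X has shape n x m: n features, m examples."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + eps), mu, sigma

X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
Xn, mu, sigma = normalize(X)
print(Xn.mean(axis=1))  # ~[0. 0.]
print(Xn.std(axis=1))   # ~[1. 1.]
```

Note that mu and sigma should be computed on the training set and then reused to normalize the test set, so both go through the same transformation.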

Parameter initialization in neural network is also very important to avoid vanishing or exploding gradients during back-propagation.

We can initialize parameters using random initialization. A better way is to apply Xavier/He initialization to all the weights. The idea of Xavier/He initialization is to make the input variance similar to the output variance for each layer.
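Both schemes amount to scaling random weights by a factor that depends on the layer's fan-in; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_out, n_in):
    """Xavier/Glorot initialization (suits tanh/sigmoid): variance 1/n_in."""
    return rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / n_in)

def he_init(n_out, n_in):
    """He initialization (suits ReLU): variance 2/n_in."""
    return rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)

W = he_init(256, 512)
print(W.std())  # roughly sqrt(2/512) ~ 0.0625
```

Scaling by 1/n_in (or 2/n_in) keeps the variance of each layer's pre-activations roughly constant as signals flow through the network, which is what prevents gradients from vanishing or exploding.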

During training, we can use mini-batches like we mentioned before in the linear regression section. Also, we can use momentum on top of gradient descent. The idea is to apply a moving average to the gradient so that the update remembers past gradients and follows a smoother direction:
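A minimal sketch of the momentum update on a toy one-dimensional problem (minimizing w^2, whose gradient is 2w; the learning rate and beta values are illustrative):

```python
def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    """One momentum update: v is an exponential moving average of gradients."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # converges toward 0
```

The moving average v damps oscillations across iterations while accumulating speed along directions where the gradient sign is consistent.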

Another adaptive learning rate algorithm to use by default is Adam. The idea is to apply a moving average not only to the gradient (the first moment) but also to the element-wise squared gradient (the second moment):

To avoid a cold start (m and v are very small at the beginning, so the update is almost 0), we can add a bias correction mechanism. The idea is to divide by a term involving the iteration t so that m and v are rescaled to the right magnitude early on:
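The full Adam update with bias correction can be sketched as follows, again on a toy scalar problem; the hyperparameters are the commonly used defaults:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (the t-dependent rescaling)."""
    m = b1 * m + (1 - b1) * grad        # moving average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2   # moving average of the squared gradient
    m_hat = m / (1 - b1 ** t)           # bias correction: inflates the early,
    v_hat = v / (1 - b2 ** t)           # near-zero m and v to the right scale
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize f(w) = w^2, whose gradient is 2w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # close to 0
```

Note that t starts at 1, so the correction terms 1 - b1^t and 1 - b2^t are small at first (making m_hat and v_hat larger) and approach 1 as training proceeds.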

That’s all about neural networks. We’ve covered forward propagation, backward propagation, normalization, parameter initialization, adaptive learning algorithms such as momentum and Adam.

In the last section of supervised learning, we are going to talk about evaluation metrics.

Evaluation metrics
In binary classification, we can use the confusion matrix. From the confusion matrix, we can compute accuracy, precision, recall, specificity, and the F1 score as follows:
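These metrics all derive from the four confusion-matrix cells; a minimal sketch:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute the confusion-matrix cells and the derived metrics."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # 1 - false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, precision=precision,
                recall=recall, specificity=specificity, f1=f1)

metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(metrics)  # accuracy 4/6, precision 2/3, recall 2/3, f1 2/3
```

In a real implementation the denominators would need guards against division by zero (e.g. when the model never predicts positive).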

We can also rely on ROC (receiver operating characteristic) and AUROC (area under ROC) to further measure the performance of our model.

ROC plots the true positive rate (recall) against the false positive rate (1 minus specificity) as the decision threshold varies. It can be used to select the threshold between “predict positive” and “predict negative”. In the figure above, we use 0.5 as the threshold. AUROC tells how well our model is capable of distinguishing between the classes.

That’s all for part I of this comprehensive summary. We’ve gone through various supervised learning algorithms in CS229, as well as learning theory, regularization, model selection, and evaluation metrics, which are general to machine learning.

What’s next?

I will publish the link to part 2 about unsupervised learning algorithms once it is done
I will add code to some of the algorithms
Useful resources and references:

[CS229 official website](http://cs229.stanford.edu/syllabus-autumn2018.html)