Summary of algorithms in Stanford Machine Learning (CS229) Part I

Source: Deep Learning on Medium

Supervised learning in general and notations

The key point is that supervised learning has both feature x and label y (while unsupervised learning only has feature x but not label). Supervised learning has a set of training examples (x, y) and the goal is to learn a hypothesis h that can map x to y. In order to make predictions, we give a new x’ that is not in the training set and use h learned above to predict y. So y_prediction can be written as h(x’).

x is usually a vector of n features, denoted as x subscript i as: {x_{1}, x_{2}, …, x_{n}}

y is a constant

We use superscript i to denote ith training example. For instance x^{1} represents the feature vector of first training example.

There are two types of supervised learning. If our prediction is continuous such as housing price, sales estimation, this is called regression problem. If our prediction is discrete, such as spam or non-spam email, like or don’t like, this is called classification problem.

Linear regression

This is one of the most widely used regression algorithms. The assumption is that our hypothesis h has a linear relationship with x defined as follows:

During the training process, our goal is to learn those thetas such that for whatever x^{i} given in the training set, our prediction h(x^{i}) can be as close to actual label y^{i} as possible.

How do we measure the closeness of our predictions and actual labels? We define a cost function (or loss function) as follows:

Here we sum the least square error of each example in the training set together and get the training loss. If the training loss is low, it means our model’s predictions are close to the actual labels and vice versa.

How do we train our model (adjust parameter thetas)? We use gradient descent.

The idea is that every time we take a step in the direction of steepest decrease of J. The alpha in Equation (3) is a hyper-parameter which means it is not learnable and has to be defined by at the beginning. We call it learning rate. Here we use the fixed learning rate along the whole process. Later on we will see some adaptive learning rate methods such as RMSProp and Adam.

To unfold Equation (3), we consider the partial derivative of J(theta) with respect to theta_{j} for one training example as follows:

Then we sum over all the training examples and combine Equation (3) and (4):

Since the pseudocode above looks at every example in the entire training set on every step, it is called batch gradient descent. There is another variation called stochastic gradient descent (SGD). The idea is that instead of going through all the training examples and update theta, we update theta while each time we encounter a training example:

Pros of SGD:

  • start making progress/update right away
  • usually SGD converges faster than batch GD

Cons of SGD:

  • the update will oscillating and it does not guarantee the decrease of J(theta) in each update step

There is another idea called mini-batch gradient descent. Instead of taking the full example or just one example during the update, it takes a chunk of examples. It is widely used in the training of neural network. The pseudocode is like follows:

This is pretty much all the key concepts in linear regression. We have covered hypothesis of linear regression, cost function, batch GD, SGD, mini-batch GD.

Next there is some advanced topic about probabilistic interpretation of why Equation (2) above is a good cost function for linear regression. It involves some basic probability theory such as Gaussian distribution (also known as normal distribution), probability density function (pdf), maximum log likelihood. The knowledge of maximum log likelihood will also be applied in logistic regression later on.

The probabilistic interpretation starts with an assumption as shown in Equation (5). The idea is that for each training example i, we add an error term which captures the difference between our prediction and true label:

We further assume the error term of each training example (each i) is independently and identically distributed (I.I.D.) and satisfy Gaussian distribution with 0 mean and some variance denoted as sigma².

I would say people like to use Gaussian distribution to model the error term due to two reasons. First is the central limit theorem which states that the sum of n i.i.d. random variables is normally distributed. An error can be thought as a summation of lots of independent effects so by central limit theorem, it satisfies normal distribution or close to normal distribution. The other reason is that Gaussian distribution is the least assuming distribution, all you need is the mean and variance.

The I.I.D. assumption means the error term of one training example has the same distribution of the error term of other examples. By same distribution, it means they have the same mean and variance. It does not necessarily mean each error term has the same value, they just has the same chance to have the same value.

With all the clarifications about I.I.D. and Gaussian distribution assumption above, let’s move forward. Because each error term satisfy Gaussian distribution with 0 mean and sigma² variance, its probability density function (pdf) can be written as follows:

By looking at Equation (5) and (6), the error term is a random variable that satisfies Gaussian distribution, our prediction can be deemed as a constant. So y^{i} is also a random variable that satisfies Gaussian distribution with mean being our prediction (theta.T*x) and variance being sigma². You can think of this as shift the mean of the error distribution by a constant value. So the distribution of y^{i} is as follows:

Next we apply a math trick which is to look at Equation (7) above as a function of theta rather than y and introduce a term called likelihood:

We consider all the input and labels in Equation (8) rather than a single example i because we use the same theta for all the training examples. Our goal is to maximize the likelihood, in other words we choose theta so as to make the data as high probability as possible. This might be confusing at first glance, you might wonder why data has a probability. But this is how we think of problems in a probabilistic view. Based on Equation (5) above, by varying theta, we can have different data (x^{i}, y^{i} pair just to be clear). So we want to choose theta that can make sure the given training data has the highest probability to exist (even though the data already exists).

Professor Andrew Ng has a good summary as follows: when we say likelihood, we fixed the data, parameter theta can vary; when we use probability, theta is fixed, data can vary. So we can say the likelihood of theta and the probability of data but not the other way around. In machine learning, we want to maximize the likelihood or minimized loss. You will find soon from math equation, loss is just the negative of likelihood. So minimize a negative function is same as maximize the function.

Going back to Equation (8), since the assumption is every example is i.i.d. and combine with Equation (7), we can expand Equation (8) as:

In order to avoid sequence of multiplication and simplify Equation (9), it is a common math trick to consider the log likelihood since log is a strictly increasing function, maximize log(L(theta)) is equivalent to maximize L(theta):

The first term of Equation (10) is a constant, so maximizing l(theta) is equivalent to maximize the second term which is equivalent to minimize Equation(2) above.

This is the end of probabilistic interpretation, we’ve covered i.i.d, Gaussian distribution, maximize log likelihood (which is equivalent to minimize cost function).

Normal equation

Normal equation is parallel to gradient descent and it is another way to find parameter theta of linear regression. And it works only for linear regression.

The idea is that the cost function in Equation (2) is a convex function w.r.t. theta, so by taking the derivative w.r.t. theta and set it to 0, we can solve the optimal theta directly. The full derivation is written clearly in the lecture notes which involves matrix derivatives, properties of trace. In the end we can get the normal equation as:

Therefore, the optimal theta can be computed directly as:

Pros:

  • Equation (12) can find optimal parameter theta directly so when the dimension of feature is small, it is more efficient

Cons:

  • When the dimension of feature is large, the computation cost is huge especially matrix inversion takes O(n³) time complexity
  • If there are some redundant features, X.T X is not invertible. In this case you can either remove all the redundant features or use pseudo inverse.

Locally weighted linear regression

As the name indicates, this algorithm gives weight to each training example and focus more on local training examples. By local, it refers to the training examples close to the query point x. The cost function of this algorithm is:

As can be seen from Equation (14), the point that is far away from x has smaller weight and the point that close to x has larger weight.

For every query point x, this algorithm will fit a model and get optimal theta by minimizing Equation (13). Then the output will be same as Equation (1) which is theta.T * x.

Locally weighted linear regression is a case of non-parametric learning algorithm which means the amount of data/parameter you need to keep grows (linearly) with the size of data. For example from Equation (14), it can be seen that we need to keep all the training examples in order to get w^{i}. While linear regression belongs to parametric learning algorithm which has fixed set of parameters.

This algorithm is used when the number of features is not much but data is large and you don’t want to think about which feature to use.

Logistic regression

Logistic regression is one of the widely used binary classification algorithms. The input x is a bunch of features and y takes either 1 (positive class) or 0 (negative class). Spam filter is one of the applications of logistic regression where we determine whether a given email is spam or non-spam.

The hypothesis of logistic regression is as follows:

where g is called sigmoid function which is also a popular activation function in neural networks. It is defined as follows:

As can be seen, g(z) is bounded between 0 and 1. As z approaches to positive infinity, g(z) approaches to 1; as z approaches to negative infinity, g(z) approaches to 0. When z = 0, the segment around z = 0 is approximately like a straight line (in 2D). You might wonder why we use sigmoid function rather than something else. We will talk more about this in the later part of generalized linear models (GLM).

Notice now our hypothesis in Equation (15) outputs a value between 0 and 1. By convention, we define it to be the probability that we output y = 1. You can also define it to be the probability that we output y=0, but the definition above is just the convention. So mathematically we have:

A little digression, you might also hear about Bernoulli distribution which models the distribution of y as follows:

Therefore, by comparing Equation (17) and (19), another way to think about the hypothesis of logistic regression is that it is the same as parameter phi in the Bernoulli distribution.

Coming back to Equation (17) and (18), people tend to combine those two equations because it would be easier to write down the likelihood and then further maximize the log likelihood to learn parameter theta as we did before in the probabilistic interpretation of linear regression, so we have:

You can verify that Equation (21) is the same as Equation (17) and (18) by plugging in y = 1 and y = 0. Next consider all the training examples and assume each examples were generated independently, so we have the likelihood of theta as follows:

Take the log of Equation (22) and get the log likelihood as shown below:

You might also hear binary cross entropy loss, which is just the negative of l(theta). Since we want to maximize log likelihood in order to find optimal parameter theta , we use gradient ascent in this case. In Equation (3) above, we want to minimize loss, so we use gradient descent. Again, loss is just the negative of likelihood, so gradient ascent and descent turn out to be the same thing. So for each parameter theta, we do:

Consider one specific parameter theta j and take the derivative of l(theta) in Equation (23) w.r.t. theta j, we have Equation (25). Then combine (24) and (25) we can derive the update rule for theta j which is shown in Equation (26):

Once we find the best parameter theta using the training set based on Equation (26), we can make prediction based on Equation (17) and (18). We can set a threshold, say if P(y=1|x;theta)>0.5, we predict 1 otherwise 0.

Another interesting thing is that the parameter update rule for both linear regression (Equation (3), (4)) and logistic regression (Equation (26)) are identical except the definition of hypothesis. This is because both of them are the special case of generalized linear models which we will talk about later.

Newton’s method

Newton’s method is another way to find optimal theta by maximizing log likelihood of l(theta) in Equation (23).

Consider a function f in 2D which takes a real number, say theta, as input and output a real number. The graph of f and a simple derivation is shown below:

Figure 1: A simple derivation of Newton’s method

The idea is to find theta* (which satisfy f(theta*) = 0) by choosing a random starting position theta(0) and keep updating theta until converges. More generally, the update rule for theta is:

Now notice that the optimal theta in Equation (23) satisfy that the gradient of l(theta) w.r.t. theta = 0. So if we let gradient of l(theta) be the f(theta) in Figure 1 above, we can directly apply Newton’s method to find the optimal theta:

This is the idea of how we apply Newton’s method to solve for optimal theta. Equation (29) only consider the case that theta is a real number, if theta is a vector (multidimensional), the generalization of Equation (29) is:

Where gradient of l(theta) is a generalization of first derivative of l(theta), Hessian matrix H is a generalization of second derivative of l(theta).

Pros:

  • quadratic convergence which indicates fast convergence and few iterations

Cons:

  • if the number of features is large, computing Hessian matrix H and do the inverse is computational expensive.

So as a summary, in linear regression, we minimize the least square loss using gradient descent to learn parameter theta. We can also use normal equation to find optimal theta directly. In logistic regression, we maximize log likelihood using gradient ascent to learn parameter theta. We can also apply Netwon’s method to get faster convergence but potential computational complexity.

Generalized linear models (GLM)

Now it’s time to talk about GLM which is a generalization of linear regression, logistic regression and many more.

GLM is based on exponential family and the exponential family is defined as:

With proper choice of T, a, b (parameterized by eta), we can get various distributions such as Bernoulli, Gaussian, multinomial, Poisson, gamma, exponential, beta and Dirichlet etc. And Bernoulli corresponds to logistic regression, Gaussian corresponds to linear regression (recall the probabilistic interpretation), multinomial corresponds to softmax regression.

Now let’s consider Bernoulli distribution, with the math trick applied in Equation (21), we can write the distribution of Bernoulli as:

We can get T, a, b (parameterized by eta) by comparing Equation (32) with Equation (31):

From the middle equation of Equation (33) which relates eta and phi, you can see the structure of sigmoid function. We will use it while constructing logistic regression later. One more thing to be clear is that T, a, b are parametrized by eta, and eta is parameterized by parameters of a specific distribution. For instance, in Bernoulli distribution, eta is parameterized by phi. And in Gaussian distribution, eta will be parameterized by u which represents mean of Gaussian distribution.

Now consider Gaussian distribution, comparing its general form (with mean = u and variance = 1 for simplicity) with Equation (32), we can get:

Some properties of the exponential family is:

  • maximum likelihood estimation w.r.t. eta is concave and negative log likelihood or cost function is convex. That’s way the normal equation and Newton’s method work as stated above.
  • The expectation of the distribution is equal to the derivative of a(eta) w.r.t. eta:
  • The variance of distribution is equal to the second derivative of a(eta) w.r.t. eta:
  • The parameter update rule for all the exponential family is same as Equation (26) above.

The last two properties can be used to derive the expectation and variance of all the distributions belonging to the exponential family.

With the knowledge of the exponential family, we can work on constructing GLMs. If we want to predict some random variable y given input x, we make the following assumptions:

  1. y|x; theta ~ ExponentialFamily(eta). This states that the distribution of y given x belongs to exponential family parameterized by eta.
  2. the output h(x) is equal to the expectation of y given x (E[y|x])
  3. eta = theta.T * x

With the assumptions above, we can start to derive hypothesis of linear regression, logistic regression and softmax regression.

Linear regression:

First equality is based on assumption 2, second equality is because the distribution of y|x satisfy Gaussian distribution, third equality is based on Equation (34) and the last one is based on assumption 3.

Logistic regression:

Softmax regression:

Softmax itself is a generalization of logistic regression in the sense that y can takes {1,2,…k} rather than only {0,1}. We model it as distributed according to a multinomial distribution. With k possible outcomes, we need to use k-1 parameters specifying the probability of each of the outcome and the probability of the last outcome is just 1 subtract all the previous ones so we don’t need k parameters. More formally, we have:

In multinomial distribution, T(y) is a vector of dimension k-1, so T(y) is not equal to y anymore. Instead, T(y)_{i} is a one-hot vector with the corresponding index i being 1 and the rest element being 0. Formally, this can be expressed by an indicator function in Equation (39). With the settings above, we can write the multinomial distribution as exponential family by getting the T, a, b, eta (w.r.t. phi). Apply the same trick as Equation (21) but generalizes to k dimension, we have:

Substitute Equation (39) to Equation (40) and then compare with Equation (31) we can get:

To express phi w.r.t. eta (which is called response function formally), we use the fact that the summation of phi from 1 to k is 1 and rearrange Equation (41):

The last equation that map phi to eta is called softmax function. To further convert it to the form we normally seen, we can apply assumption 3 of GLM stated above:

And finally our hypothesis h(x) will be the probability of y belong to class 1 to k-1 as follows:

To train the softmax regression, we maximize the log likelihood shown in Equation (45) below either using gradient ascent or Newton’s method to find the optimal theta. To make a prediction, we use the learned theta and applied Equation (44) above.

That’s all for GLMs. In my opinion, GLM gives a deep reasoning how people come up with hypothesis for linear regression, logistic regression and softmax regression and many more distributions belonging to the exponential family. I felt it was amazing when I first saw it.

Generative learning algorithms in general

There are two types of learning algorithms in machine learning, one is called discriminative learning algorithms, the other is generative learning algorithms.

Discriminative learning algorithms are algorithms we have seen before. It models p(y|x), the conditional distribution of y given x. In other word, it learns features to labels mapping. Generative learning algorithms model p(x|y) and p(y). p(x|y) models the distribution of features given y. For instance, if y can take 0 and 1, p(x|y=0) models the distribution of features when y is 0, p(x|y=1) models the distribution of features when y is 1. During prediction, when new feature comes in, it matches against the features of y=0 and y=1 using Bayes rule and decide which one is more similar. p(y) is called prior probability which indicates your prior belief about the probability before seeing the data.

Bayes rule computes the posterior distribution of y given x as follows. We call it posterior because it represents our belief about y after seeing the data x.

The denominator p(x) can be computed using law of total probability. But quite often it can be ignored. Because all we care about is the relative value of p(y=1|x) and p(y=0|x) (in the case of y = 0 or 1), we can take the ratio and cancel out the denominator. More formally our prediction is y maximizes the right hand side of Equation (46):

That’s it about the general introduction of generative learning algorithms. Next we will describe two classic generative learning algorithms, Gaussian discriminant analysis (GDA) and Naive Bayes.

Gaussian discriminant analysis (GDA)

GDA is used to solve classification problems. Consider the case of binary classification, the idea is as follows: it models x|y=0 as multivariate normal distribution and models x|y = 1 as another multivariate normal distribution. From the data, it learns the parameters of two Gaussian distributions (mean and variance) and uses Equation (47) above to make predictions.

The multivariate normal distribution (multivariate Gaussian distribution) generates the 1-dimension normal distribution we have seen in Equation (7) to n-dimensions. It has two parameters as well, mean vector u and covariance matrix sigma respectively. Its probability density function is given by:

As a side note, covariance matrix is positive semidefinite and symmetric. It is also widely used in statistics which is used to quantify how two RVs vary together. If X and Y are independent, their covariance is 0 but not the other way around. If n = 1, Equation (48) will be reduces to the 1-dimension normal distribution we have seen before.

With the knowledge of multivariate normal distribution, we formally define GDA (in the case of binary classification) as follows:

The reason to use the same covariance matrix is because we want the decision boundary to be linear (decision boundary is defined as p(x|y=0) = p(x|y=1) = 0.5). The parameters of GDA are phi, u0, u1 and sigma. In order to learn those parameters we write down the log likelihood of the data:

In order to find parameters that maximize the log likelihood, we take the gradient of the log likelihood w.r.t. phi, u0, u1, and sigma, set to 0 and get:

Even though the equation might look complex, intuitively it make sense. phi is computed by counting the number of training examples that have label 1 divide the total number of training examples m; u0 is computed by taking the average of all x that have label y = 0; u1 is computed by taking the average of all x that have label 1; sigma is the outer product of each example subtract its class mean.

As seen in the class notes, pictorially what GDA doing can be seen as follows:

Two Gaussian distributions are fit to the data. The straight line is the decision boundary where y=0 and y=1 has the same chance to be predicted. On one side of the boundary, we will predict y=1 to be he most likely outcome and y=0 on the other side.

Pros:

  • when p(x|y) is indeed distributed as multivariate Gaussian or close to multivariate Gaussian, GDA requires less training data to learn well
  • GDA is easy to learn because its parameter learning is purely based on counting(as seen in Equation (52)). So if you don’t know much about the data, just give GDA a try and see how it works.

Cons:

  • GDA makes stronger assumption of the data compared with other classification algorithms, say logistic regression. In other words, it is not robust to modeling assumptions.

That’s all the knowledge about GDA. What it is doing is fit multivariate Gaussian distribution to data, learn parameters by counting and predict based on the relative value of probability of each class. Next we will talk about Naive Bayes which has discrete-valued features rather than continuous-valued features in GDA.

Naive Bayes

Naive Bayes is widely used in text classification. For instance, it classifies whether a given email is spam or not spam. It is also one of the generative learning algorithms.

The commonly used Naive Bayes algorithm is the type of multivariate Bernoulli event model. Take spam filter as an example. The input is a set of emails and their corresponding labels (spam or non spam). The feature vector we constructed has length equal to the number of words in a dictionary prebuilt by us. If an email contains jth word of the dictionary, it will set x_{j} to be 1; otherwise x_{j} is 0. But notice that if the jth word appears in an email multiple times, this algorithm still set x_{j} to be 1 rather than the count of appearance. This is why this model has “Bernoulli” in its name.

Next step is to model p(x|y). The Naive Bayes assumption says each word x_{i} is conditionally independent given y. This is saying if an email is spam, the knowledge of each word in the email does not influence our belief about the chance of appearance of the other words. For instance, the appearance of word “buy” given a spam email does not influence the chance of appearance about “price” given this is a spam email. This is not a reasonable assumption in reality but works well in many problems.

So based on the Naive Bayes assumption above, take ith example, we can expand p(x^{i}|y^{i}) as follows:

Since this is a multivariate Bernoulli event model, we parameterize it by phi_{j|y=1), phi_{j|y=0} and phi_{y}. Formally:

In order to learn those parameters, we write down the likelihood in Equation (56) and maximizes the likelihood (in this case we take the gradient of likelihood w.r.t. each parameter, set to 0 and solve for the optimal parameter) and get:

So similar to what we have seen in Equation (52), the parameter learning is also purely based on counting. phi_{j|y=1) is just the fraction of the spam(y=1) emails in which word j does appear; phi_{j|y=0) is just the fraction of the non-spam(y=0) emails in which word j does appear; phi_{y} is just the fraction of spam emails which is intuitively pretty straightforward.

To make a prediction, given a new feature x, we apply Bayes theorem and law of total probability and select the class that gives higher probability:

But like stated above in Equation (47), we can take the ratio of p(y=1|x) and p(y=0|x) so that we don’t need to compute the denominator using law of total probability. We can even take the log so the series of multiplication becomes a series of addition to avoid round-off or precision out-of-range errors:

Every term in Equation (59) above is purely based by counting as shown in Equation (57). And if the log ratio is bigger than 0 (which means ratio is bigger than 1), we predict y=1; otherwise we predict y=0, as simple as this.

What if p(x_j|y=1) or p(x_j|y=0) is 0 in Equation (58) or Equation (59)? It means a word (in the vocabulary) that does not appear in our training set but does appear during prediction. If we don’t handle it correctly, we might end up with case like 0/0, 0/infinity or infinity/infinity. We apply Laplace smoothing.

The idea of Laplace smoothing is that when we compute parameters in Equation (57), we add 1 to the numerator and add k (where k is the number of classes) to the denominator. There is lots of math behind the scenes that why this is the optimal estimator of each parameter. But knowing how it works and just apply it is already good enough.

Therefore the modified version of Equation (57) with Laplace smoothing is:

That’s all for the Naive Bayes of multivariate Bernoulli event model. We talked about how to construct features, Naive Bayes assumption, how to learn parameters by counting, how to make predictions and Laplace smoothing. Next we will talk about Naive Bayes of multinomial event model.

In this model the way to construct feature is different. Recall in the multivariate model, the feature vector for each sentence has length of vocabulary. If a word appears in the sentence, we set the corresponding index in the feature to be 1, otherwise 0. Here the feature vector has the same length as the sentence, every element of the feature vector represents the word index in the vocabulary. For instance, if an email starts with “TOM WON TOM…”, x_{1}= 100 (if assuming “TOM” is the 100th word in the vocabulary), x_{2} = 300 (if assuming “WON” is the 300th word in the vocabulary), x_{3} = 100.

The way to model p(x|y) looks the same as Equation (53) above except that index j means the actual word index of sentence i rather than the word index in the vocabulary:

The parameters are:

Notice that Equation (62) above is does not depend on position j because the Naive Bayes assumption we have in the multinomial event model is that a word is generated does not depend on its position j within the email. Takes the same example “TOM WON TOM…”, the assumption means the chance of appearance of “TOM” in position 1 is the same as the chance of appearance of “TOM” in position 3 (mathematically, p(x_1=100|y=1) = p(x_3=100|y=1)) and all we need is phi_{100|y=1} which is equal to p(100|y=1).

To learn those parameters, again we write down the likelihood, set gradient w.r.t. each parameter to 0 and solve for each parameter. The likelihood is same as Equation (56) and maximizing it yields optimal parameters as follows:

Equation (63) might look complex at first glance, but it is actually straightforward. To compute phi_{k|y=1}, the numerator means the number of times k appears and the email is spam (y=1) among all the examples and all the word. The numerator means the total number of words of all the spam emails. The interpretation is similar to phi_{k|y=0} and it is the same for phi_y.

Apply Laplace smoothing we have:

Recall we mentioned that we add the number of classes to the denominator during Laplace smoothing. So we add vocabulary size in Equation (64) since k has that many choices.

To make predictions, the formula is same as Equation (58) and (59) except that the parameters are different. More specifically, to compute each conditional probability we need to use the ones in Equation (64) rather than those in Equation (60).

That’s all for Naive Bayes multinomial model. This model is different from multivariate model in terms of feature construction and parameterization. It turns out that this model is better than the multivariate model in text classification since it handles the case that the same word appears multiple times in a sentence, but if you search Naive Bayes, most of the tutorial will only tells you the multivariate model. Hope my explanation above does not confuse you.

Support vector machines (SVM)

SVM is the core fundamental of machine learning. It does not have many parameters and work very well as a supervised learning algorithm. It consists of two parts, one is optimal margin classifier, the other is kernel. To solve for the optimal margin classifier, knowledge of convex optimization especially Lagrange duality is involved. However, thanks to the third-party libraries, nowadays we only need to formulate the convex optimization equation and those libraries provide highly optimized solver to solve it. In this post, we will talk more about formulating the problem and kernel and brief mention about Lagrange duality.

SVM is used in classification problems. If it is binary classification, its hypothesis is as follows:

Where b plays the role of theta_{0} in the hypothesis of linear regression in Equation (1), w plays the role of theta_{1} to theta_{n} in Equation (1) and we drop the convention that x_{0} = 1 since b is already separated out to handle this case. Same as the case in perceptron algorithm, g(z) is defined as:

So our goal is to find the optimal w and b and then use Equation (65) to make a prediction given x.

In order to find optimal w and b, we need to introduce the notion of functional and geometric margins.

A functional margin of (w, b) is defined as:

As can be seen, if y^{i} = 1, in order to let functional margin to be large, we need w^{T}x + b to be a large positive number. A large positive number indicates that we are confident with our prediction in this case; similarly, if y^{i} = 1, we want w^{T}x + b to be a large negative number. A large negative number indicates that we are confident with our prediction in this case. Also if our prediction is correct, the functional margin is bigger than 0. Therefore, the functional margin measures how confident our model is during classification and how correctly our prediction is.

But because of the choice of g in Equation (66), if we enlarge w and b to say (2w, 2b), our prediction stays the same but functional margin will be doubled. In this way we make the functional margin arbitrarily large but not change anything meaningful. So we need to think a way to normalize w and b. This is the idea of geometric margin.

Pictorially, a geometric margin of a point is defined as the distance from the point to the decision boundary. For instance, the geometric margin of point i (denoted as gamma_{i}) is the line segment AB:

Mathematically, we can find the geometric margin of the example i by taking point B into the decision boundary:

Solve for Equation (69) we get:

More generally, in order to make Equation (70) work for the point in the other side of the decision boundary, we have:

Similar to Equation (68), we define geometric margin to be the smallest among all examples as shown above.

By comparing Equation (67) and Equation (71), the relationship between functional margin and geometric margin is:

That’s all the knowledge about functional and geometric margins, next we are going to talk about optimal margin classifier. To make things easier, we first assume the training data is linear separable and then consider non linear separable case.

we can formulate the optimal margin classifier as optimization problem:

If this type of formulation is not familiar to you, don’t worry about it. All this is saying is to maximize the geometric margin under the constraint that every training example has margin bigger or equal to gamma.

Notice while talking about functional margin, we mentioned that w can be scaled by any factor without influencing the final result. So to simplify formulation (73), we can let ||w|| = 1/gamma (by Equation (72), this also means we set the functional margin to be 1) and the new formulation becomes:

The solution to to formulation (75) gives us the optimal margin classifier. We can solve it using third-party libraries to find w,b through quadratic programming or Lagrange duality.

Next we will briefly talk about Lagrange and Lagrange duality, you can skip this part if it is too abstract.

Lagrange is a tool to solve optimization problem, if we have an optimization problem as:

Then the generalized Lagrangian can be written as:

The objective is to maximize the Lagrangian and the optimal value of the objective (denoted as p*) is to find the minimum of objective. So p* is defined as:

The optimal value of the dual form is denoted as d* and d* is:

Under certain conditions called Karush-Kuhn-Tucker (KKT) conditions, d* is equal to p*. So if d* is easier to solve which is the case in SVM, we can find p* by solving d*. So if you apply Lagrange duality to formulation (75), you can get:

Next we consider the non-separable case. The idea is to add penalty to the optimization problem if a data point is on the wrong side of margin. For each data point, if margin is bigger or equal to 1, we don’t care. If margin is smaller than 1, we add a linear penalty (l1 regularization). Mathematically it looks like:

It can be solved using Lagrangian as well. Another easier to understand but slower way is to define loss function and use gradient descent as we did before. The loss function in SVM is called hinge loss which is:

Notice the second term inside max function is the functional margin. It should be bigger or equal to 1 (with the choice of our w as shown in Equation (74)). But if it is not, we add it to the loss. So this is the same idea as we described in formulation (81).

C is a hyper-parameter. If C is large, we force the model to make correct classification (otherwise the penalty is too large), so our model will be susceptible to outliers. If C is small, then we can set slack variable to be anything bigger or equal to 0 and can make whatever mistakes in classification. In this case, we basically ignores the data. So C is a very important hyper parameter to tune.

With the hinge loss in Equation (82), we can take gradient of J w.r.t. w and b and apply gradient descent to iteratively update w and b until convergence. Again once optimal w and b is found, we can use Equation (65) to make a prediction given any x.

That’s all for the maximal margin classifier. We’ve talked about functional and geometric margins, how to formulate optimization problems to maximize geometric margin, how to solve it using Lagrange duality and gradient descent. Next we are going to talk about the other important component of SVM which is kernels.

The idea of kernel is to replace the original attributes with features through attribute to feature mapping called phi. Here the terminology is that the original feature is called attribute and the feature get from phi is called feature. Usually it is the case that low dimensional attribute can be mapped to high dimensional features and when we run SVM using the features we can have non-linear decision boundary. I feel like it is better to visualize this video (https://www.youtube.com/watch?v=3liCbRZPrZA) to see the effect of kernels in SVM.

Applying kernel to SVM is not hard, there are four steps:

To achieve step 1, we can combine Equation (80) and Equation (81) as follows:

Further rearrangement we have:

To make prediction, we combine Equation (65) and Equation (80) and get:

So that’s how we write algorithm in terms of inner product.

In terms of step 2, there is a Mercer theorem which states that K is a valid kernel function (K(x,z) = phi(x)^T phi(z)) if and only if for any d points {x1, …, xd}, the corresponding kernel matrix is symmetric positive semi-definite.

Also there are some commonly used kernels like:

Let’s talk about step 3 and the way to simplify the calculation of K(x,z) is called kernel trick. Consider a concrete example as follows:

What the kernel trick says is that K(x,z) = (x^T z )². Because if we expand it we can get:

If you compare Equation (84) with Equation (83), they are the same. But Equation (83) takes O(n²) to compute since phi(x) is of shape O(n²), phi(z) is of shape O(n²), but Equation (84) only takes O(n) time to compute since both x and z are of shape O(n). If phi maps x to even higher dimension say n⁵, the kernel trick still takes O(n) time to compute which is very powerful.

Step 4 is straightforward, we use need to replace all the inner product in step 1 to the kernel we selected or constructed.

That’s all about kernels. One last thing is that kernel is actually more general, it can be applied in all discriminative learning algorithms such as linear regression and logistic regression to make the decision boundary non-linear.

Learning theory

In my opinion, the role of learning theory in machine learning is like the role of operating system in computer science. So it is not contained only in supervised learning algorithms (that’s why I put a bracket in the roadmap at the beginning), but I put it here following the syllabus.

Let’s dive into the learning theory. First we start with some fundamental assumptions:

  1. Both train and test data are drawn from a same distribution D. Without this assumption, what we are doing in the training set cannot be generalized to the test set.
  2. All samples are independent

So the learning process can be thought as some random variables S that are generated through distribution D, pass through learning algorithm (which is deterministic) and then we get hypothesis h or equivalently parameter theta which are random variable. All random variables have a distribution associated with it. For S, its distribution is D. For theta, its distribution is called sampling distribution. We assume there exists h* or theta* that is the true parameter (they are constant not random variable) in our set of hypothesis (for instance hypothesis of logistic regression etc). We can only approach the true parameter but never reach it.

With all the basic setting in mind, we will talk about bias and variance. From the data view, you probability has seen it lots of times. Basically high bias means our model does not learn well from the data, high variance means our model overfits the data and does not generalize well.

But it is better to view it through the parameter perspective. To better visualize, we assume there are only two parameters in our model, theta1 and theta2. The red point is theta*, each blue point is the best parameter we learned through each training of our model:

So from the picture above, we can see high bias means the expectation of our learned parameters does not center around theta*; high variance means our learned parameters is widely spread (which is why if we have a data point that we haven’t seen during prediction, the result might be way off).

More generally, we have:

Where H is the set of hypothesis we have. This figure can be used to illustrate why regularization can reduce variance. Because regularization penalizes large parameter so it shrinks the size of H and the estimation error will decrease. At the same time, the approximation error might increase, so it might increase bias as well.

Ways to reduce variance:

  • increase data size
  • regularization

Ways to reduce bias:

  • use more complex models
  • complex architectures

Next we introduce some notations related to error:

As can be seen the generalization error is got from the whole data distribution while the empirical risk is got from limited amount of data points.

One interesting question to ask is that what is the relationship between empirical risk and generalization error. Basically in our learning algorithms, we always deal with a limited amount of data and our goal is to minimize the error among the limited amount of data and the question is why this can help generalizing our models?

In order to answer this question, two tools are required:

  1. Union Bound

2. Hoeffding inequality (in binary case)

I think the formulation of Hoeffding inequality is really beautiful, especially in explaining some of our common sense. For instance, if we toss a coin m times and count how many times the coin lands on head, then if m is large, the ratio of head count and m is approximately the probability of the coin being on head. Going back to Equation (91), the left hand side means the difference between our prediction and actual value is out of range and the right hand side approaches to 0 as m is large. So combing together it means the chance that our prediction is out of range is 0 which just means our prediction is approximately equal to the actual value.

With the knowledge of Hoeffding inequality, we can first choose a specific hypothesis h_i and have:

So you can already have a feeling that if m is large, the empirical risk for the specific hypothesis h_i will be close to the generalization error. But to make it more robust, we consider all the hypothesis and apply union bound and we can get:

As can be seen in Equation (93), as m is large, the empirical risk will still be close to the generalization error. Actually we can explore more and get the relationship between delta (probability of error), gamma (margin of error) and m (sample size) and quantify the exact probability. But I feel like this is too complex to be useful in practice. Having an intuition is already good enough.

That’s all about learning theory, we talked about bias/variance and the relationship between empirical risk and generalization error (using union bound and Hoeffding inequality).

Regularization and model selection

Regularization penalizes high value of parameter and it reduces the set of hypothesis H thus reduces variance and possibly increase bias. More detailed explanation can be seen in the last section (learning theory).

For logistic regression, the loss function after adding regularization is:

Normally we use L2 regularization which is adding theta² in the loss function. L2 regularization makes theta spread across every dimension.

Next we are going to talk about cross validation and feature selection.

Cross validation set is just the validation set we used while training machine learning models. The standard procedure is to split the data set into train, dev and test set. Train models on the training set, evaluate on the dev set; select the best model and retrain using train and dev and finally test on the test set.

Another common technique used when data is small is called k-fold cross validation:

Feature selection is applied when the number of features is very large (maybe even larger than the training examples). One procedure is called forward search:

Another method is called mutual information. The idea is to compute a score between each feature x_i and label y that measures how informative each feature is and then pick k features with the largest score. The way to compute score for binary feature is called mutual information defined as follows:

The computation is similar to KL divergence which is a way to measure how similar two distributions are.

Decision trees and ensemble methods

Decision tree is an inherently non-linear machine learning techniques which means it can provide non-linear decision boundary. There are two types of decision trees, regression tree and classification tree respectively. In this section, we will talk about selecting regions, loss function and prediction of these two types of decision trees.

In terms of selecting regions, both types of decision trees splits the parent region R_p using feature X_j and a threshold t to two child regions R1 and R2 such that:

As an example, we can choose feature X_j to be time, threshold t to be 3, parent region to be the original whole dataset, so pictorially we have:

And we can recursively split R1 and R2 to sub regions until satisfying the stop criterion which we will talk about later.

The purpose of each split is to maximize the decrease in loss, in other words we maximize the information gain:

For classification tree, we use cross-entropy as our loss. In information theory, cross-entropy means the lowest number of bits needed to transmit data. So if we minimize the loss, it means we can transmit less data for communication, our listener gains more information just due to the split. The cross entropy is defined as:

For regression tree, the loss is mean squared error of our prediction and all the elements in region R:

In terms of prediction, for classification tree, we predict the class that has the most element in region R:

For regression tree, we predict the mean of all values in region R:

That’s how we select regions (by minimizing loss) and how we make predictions. Next we are going to talk about the stop criterion.

Normally there are 4 kinds of stop criterion:

  1. minimum leaf size: do not split R if the number of elements in child region is smaller than this fixed threshold.
  2. maximum depth: do no split R if the depth of the split is already reach this fixed threshold.
  3. maximum number of nodes: do not split R if the number of leaf nodes reach this fixed threshold.
  4. pruning: fully grow the tree and prune away nodes that minimally decrease misclassification or squared error.

In the end we summarize the pros and cons of decision trees

pros:

  • it is interpretable
  • if the variable is categoric variable such as {left, right, up, down}, you don’t need to convert it to one-hot vector, just use it directly to split

cons:

  • tend to overfit if the stop criterion is not well designed
  • poor additive modeling: this is saying each split is based on only one feature. So if the data is linear separable, multiple features are changed at the same time, decision tree performs bad on this kind of data distribution:
As can be seen it is required multiple splits to separate a linear separable data in decision tree
  • usually low predictive accuracy

Next we are going to talk about ensemble methods which can greatly improve the performance of decision tree and works well in other machine learning models.

ensemble methods

There are in general two types of ensemble methods which are bagging and boosting. Bagging stands for “Bootstrap Aggregation”, as its names says, the idea of bagging is to sample the training data with replacement and train multiple models and average the prediction result during evaluation. The idea of boosting is that it starts with a weak learner and iteratively mitigates the incorrect examples made by the weak learner using a new weak learner. In the end, it outputs a combination of those weak learners as the final evaluation. More detailed explanation will be in the later part of this section.

Before diving into bagging and boosting, let’s talk about why we can derive benefit from ensemble.

If we have n i.i.d. random variables denoted as X_i for 0≤i<n and each random variable has variance sigma², then the summation of n random variable has variance n*sigma². The variance of the mean is:

If we drop the independence assumption, then:

From Equation (103), in general we can decrease variance by decreasing the pearson correlation coefficient rho (decrease the correlation between models) or increase n where n is the number of random variables (models in our case).

Bagging can be used to reduce variance. The whole process of bagging can be summarized in the following picture:

Basically we sample with replacement n times to create n sample sets and train each model using each sample set and aggregate all the models to give final result. If we denote each model to be Gm, then the final aggregate predictor can be computed as:

As M increases, the variance will decrease. Also because we sample with replacement and we can also set a threshold, for instance each sample set only has size 2/3 of the original training set, the correlation of each model will be lower compared to just using the whole training set for each model.

Just to be clear with sample replacement, for instance if our training set is [1,2,3], we set our sample set to be size 2, then the sample set can be [1, 1], [1,2], [2,2] etc. It does not matter if we pick one data point multiple times.

A special technique when applying bagging to decision tree is called random forests. The idea can be summarized in the following figure:

Basically, for each sample set, we use a sample of features to train it instead of all the features. In the aggregation step, we can take average in regression tree and max vote in the classification tree.

That’s all about bagging. We cover the bagging technique in general and random forest for decision trees.

As a summary the pros of bagging is:

  • decrease variance
  • better accuracy
  • can be trained in parallel

Cons:

  • increase bias
  • hard to interpret
  • still it has poor additive modeling
  • expensive to train

Next we are going to talk about boosting. Boosting is used for bias reduction. We start with a weak learner (for instance in decision tree we can restrict the number of leaf nodes) make a prediction, then the next weak learner can be trained on the points that are predicted wrong, so on and so forth. In the end we combine all the weak learners as an ensemble learner.

One popular boosting algorithm is Adaboost. In the case of binary classification using Adaboost, the algorithm is outlined as follows:

As can be seen, we use M weak classifiers. For each classifier, we compute the weight of this classifier alpha_m and update the weights of each examples for the next classifier. If err_m is small, alpha_m approaches infinity which means this classifier is really important; otherwise this classifier can almost be ignored. Misclassified examples will have high weights while correctly classified examples will have low weights. So the next learner learns from the fault of previous learner. Obviously in this setting, there is strong correlation between each learner, so the variance increases, but bias will decrease.

A more general paradigm of Adaboost is called Forward Stagewise Additive Modeling which works as follows:

Inside the for loop, it minimizes the loss of old classifier f_{m-1} plus the current classifier G, pick the best one (w.r.t. the best parameter) and update the classifier. For Adaboost, the loss L is chosen to be exponential loss.

If the closed-form solution to the minimization problem is hard to find in the forward stagewise additive modeling, we can think of approximating our prediction f(x_i) to y_i by gradient descent as shown in Equation (105).

This is the core idea of gradient boosting. In terms of the implementation of gradient boosting, we can do as follows:

  1. fit a simple model to get prediction for all the data points f(x)
  2. compute gradient g based on Equation (105) for all data points
  3. train a new model using g
  4. update f(x) using Equation (105)
  5. repeat step 1 to 4 until g converges

As a concrete example, if Loss is mean square error, training data is [2,3,4]. Assume step 1 we predict [1,2,3], step 2 we get the residual as [2–1, 3–2, 4–3] = [1,1,1]. Step 3 we train a model using the residual [1,1,1], assume we get [0.5, 0.5, 0.5], step 4 we update our prediction [1,2,3] to [1.5, 2.5, 3.5]. Repeat step 1 to 4 until the residual does not change.

That’s all about boosting. As a summary the pros of boosting is:

  • decrease in bias
  • better accuracy
  • additive modeling

cons:

  • increase in variance
  • prone to overfitting

Neural networks

Neural network consists of layers of neurons and each neuron consists of linear function and activation. Layers of neurons form a model and model is architecture plus parameters. There are already lots of tutorial for neural networks. In this post, I think it would be better just going through an example to talk about basic techniques in neural networks including forward propagation, backward propagation and optimization.

The example we are going to talk about is image classification. Say we have a bunch of images of cat and dog as a training set, we want our model to be able to determine if a given image is cat or dog. Instead of applying convolutional neural network, we just go with the most basic neuron network. We flatten every pixel in the image and generate a feature vector of size n for each image and feed into our model. Our model is a 3 layer network that looks like:

In terms of forward propagation, if we assume using sigmoid function, we can write down equations like:

In practice we would like to use vectorization, basically apply a bunch of examples at a time instead of solving one example at a time. Therefore, we can stack m examples together and form the input vector X:

We stack every m inputs vertically to form a matrix with shape n*m. While passing through the first neuron, we can do the matrix multiplication as shown in the picture above. Notice b[1] will be broadcast from 3*1 to 3*m so that every examples share the same b value. In the end Z[1] is of shape 3*m. Following the same procedure, we can write down the shape of every variable in the forward propagation as follows:

In order to train this neural network, we can define cost function as the mean of binary cross entropy:

Next we use back-propagation to compute the gradient of cost function w.r.t. each parameter w and b and then update w and b using gradient descent. The way to compute gradient is first writing the gradient using chain rule (following the reverse order of forward propagation) and you can also find some of the terms are redundant and you don’t need to recompute multiple times:

We can unfold every term of the gradient w.r.t. w[3] and get:

Similarly, we can unfold the gradient w.r.t. w[2]. w[1] follows with the same idea.

Notice we need to rearrange each term to make sure the shape of right hand side matches the left hand side. One trick is that every time we take derivative of sigmoid, it becomes element-wise product. The way to identify the shape of left hand side is that if the numerator is a constant (which is the case for cross-entropy loss) and denominator is a matrix, then the final shape is the same as the denominator. If the numerator is a vector of size m and denominator is a vector of shape n, then the left hand size is of shape m*n.

That’s all about back-propagation. In practice deep learning framework such as tensorflow and pytorch has already done it for us. We only need to construct the model by specifying forward propagation.

Next we are going to talk about optimization.

First we can normalize input to make sure the feature in every dimension has mean 0 and variance 1. This can avoid the case that the gradient in some dimension is too large while in others is small. In other words, it avoids wiggle in gradient update:

Parameter initialization in neural network is also very important to avoid vanishing or exploding gradients during back-propagation.

We can initialize parameters using random initialization. A better way is to apply Xavier/He initialization for all the weights. The idea of Xavier/He initialization is to let the input variance to be similar to the output variance for each layer.

During training, we can use mini-batch like we mentioned before in the linear regression section. Also we can use momentum instead of gradient descent. Momentum is an adaptive learning rate method. The idea is to apply a moving average to the gradient so that we can have memory about past update and find a direction to update parameter:

Another default adaptive learning rate algorithm to use is Adam. The idea is that instead of applying moving average to first derivative of J w.r.t. w, it applies to second derivative as well:

To avoid code start (m and v are very small at the beginning so that the update is almost 0), we can add bias correction mechanism. The idea is to include iteration t and makes sure initially m and v are large:

That’s all about neural networks. We’ve covered forward propagation, backward propagation, normalization, parameter initialization, adaptive learning algorithms such as momentum and Adam.

In the last section of supervised learning, we are going to talk about evaluation metrics.

Evaluation metrics

In binary classification, we can use confusion matrix. From the confusion matrix, we can get accuracy, precision, recall, specificity, F1 score as follows:

We can also rely on ROC (receiver operating characteristic) and AUROC (area under ROC) to further measure the performance of our model.

ROC measures the relationship between false positive rate (1 minus specificity) and true positive rate (recall). It can be used to select threshold between “predict positive” and “predict negative”. In the figure above, we use 0.5 as the threshold. AUROC tells how much our model is capable of distinguishing between classes.

That’s all about part I of this comprehensive summary. We’ve went through various supervised learning algorithms in CS229 and knowledge about learning theory, regularization, model section and evaluation metrics which are general to machine learning.

What’s next?

  1. I will publish the link to part 2 about unsupervised learning algorithms once it is done
  2. I will add code to some of the algorithms

Useful Resources and reference:

[CS229 official website]http://cs229.stanford.edu/syllabus-autumn2018.html