Diving Deeper into Linear Regression

Original article can be found here (source): Artificial Intelligence on Medium

When I say “linear regression”, most people think of good old Ordinary Least Squares (OLS) regression. If you are not familiar with the term, these equations might help…

β_1, β_2: weights; β_0: bias; J(β): cost function
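For a model with two features, the setup these symbols describe is usually written as below. This is a sketch in standard notation: the feature names x_1 and x_2, the prediction ŷ, and the sample index i are assumed names, and some texts add a constant factor such as 1/(2n) to the cost, which does not change the minimizer.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\qquad
J(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```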

Did you also think about OLS? If yes, then you are on the right track. But there’s more to linear regression than just OLS! First, let us look at OLS a bit more closely.

OLS

The name of this technique comes from its cost function. Here, we take the sum of squared errors (the differences between ground truths and predictions) and try to minimize it. By minimizing the cost function we obtain the optimal value of the vector β (which contains the bias and the weights). In the plot below, the contours (concentric ellipses) of the cost function are shown. After the minimization, we get β as the point at the center.

Figure: OLS (elliptical contours of the cost function, with the optimal β at the center)
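To make the minimization concrete, here is a minimal sketch that fits OLS with NumPy's least-squares solver. The synthetic data, the helper name `fit_ols`, and the true coefficients are all made up for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return beta = [bias, weights...] that minimizes the sum of squared errors."""
    # Prepend a column of ones so that beta[0] plays the role of the bias.
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Hypothetical data: two features, true bias 1 and true weights (2, -3).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta_ols = fit_ols(X, y)
print(beta_ols)  # estimated [bias, weight_1, weight_2]
```

With little noise and plenty of samples, the estimate should land close to the coefficients used to generate the data.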

At first, it seems like OLS is enough for any regression problem. But as the number of features and the complexity of the data grow, OLS tends to overfit the training data. Overfitting is a vast topic that deserves a separate article (you can find plenty of them), so I’ll only give a brief summary. Overfitting means the model has learned the training data so well that it fails to generalize. In other words, the model has learned even the small-scale (insignificant) variations in the training data, so it fails to produce good predictions on unseen (validation and test) data.

To tackle overfitting we can use many techniques. Adding a regularization (penalty) term to the cost function is one of them. But what term should we use? We generally use one of the following two methods.

Ridge

In this case, we add the sum of squares of the weights to our least-squares cost function. So now it looks something like this…

m: number of weights (so β, including the bias, has 1+m components); λ: regularization parameter
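In standard notation, the ridge cost takes the form below. This is a sketch that assumes m counts the weights, so the penalty sum runs over β_1 … β_m and skips the bias β_0.

```latex
J_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{m} \beta_j^2
```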

But how does this term prevent overfitting? Adding it is equivalent to placing an extra constraint on the possible values of β: to achieve the minimum cost, the sum of the β_j²’s must not exceed a certain value (say r). This prevents the model from assigning very large weights to some features over the others, thus tackling overfitting. Mathematically, Σ_j β_j² ≤ r, where j runs over the weights.

In other words, with two weights, β should lie inside (or on) the circle of radius √r centered at the origin. Here’s the visualization…

Figure: Ridge (cost-function contours with the circular constraint region in red)

Notice that because of the constraint (red circle), the final value of β is closer to the origin than it was in OLS.
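The shrinkage toward the origin can be checked numerically. Below is a minimal sketch using the closed-form ridge solution with the bias left unpenalized; the data, the helper name `fit_ridge`, and the value λ = 10 are assumptions chosen only for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit; returns beta = [bias, weights...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(X1.shape[1])
    penalty[0, 0] = 0.0  # the bias (beta_0) is not penalized
    # Solve (X^T X + penalty) beta = X^T y
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)

# Hypothetical data: few samples, five features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = np.array([4.0, -3.0, 0.0, 0.0, 2.0])
y = 1.0 + X @ true_w + rng.normal(scale=0.5, size=30)

beta_ols = fit_ridge(X, y, lam=0.0)     # lam = 0 reduces to plain OLS
beta_ridge = fit_ridge(X, y, lam=10.0)

# Length of the weight vector (bias excluded) before and after the penalty.
print(np.linalg.norm(beta_ols[1:]), np.linalg.norm(beta_ridge[1:]))
```

The second number should be smaller than the first: the penalized weights sit closer to the origin, which is what the constraint picture describes.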

Lasso

The only difference between Ridge and Lasso is the regularization term. Here, we add the sum of the absolute values of the weights to our least-squares cost function. So the cost function becomes…
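In standard notation, the lasso cost looks like this (a sketch, again assuming the penalty runs over the m weights and skips the bias β_0):

```latex
J_{\text{lasso}}(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{m} \left| \beta_j \right|
```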

In this case, the constraint can be written as Σ_j |β_j| ≤ r, for some bound r as before.

Now we can visualize the constraint as a square (a diamond with its corners on the axes) instead of a circle.

Figure: Lasso (cost-function contours with the square constraint region)

It is worth noting that if the contours first touch the square at a corner, one weight becomes exactly 0 and the corresponding feature is completely neglected. In a higher-dimensional feature space, we can use this property to reduce the number of features.
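To see weights being driven to exactly zero, here is a minimal sketch of lasso fitted by coordinate descent with soft-thresholding. This is one common way to minimize the lasso cost, not necessarily the one any particular library uses. The data, the helper names, and λ = 100 are assumptions for illustration.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X, y, lam, n_iters=500):
    """Minimize sum of squared errors + lam * sum(|w_j|); the bias is unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean       # centering takes care of the bias
    w = np.zeros(X.shape[1])
    z = (Xc ** 2).sum(axis=0)             # squared norm of each feature column
    for _ in range(n_iters):
        for j in range(len(w)):
            # Residual with feature j's current contribution removed.
            residual_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ residual_j
            w[j] = soft_threshold(rho, lam / 2.0) / z[j]
    bias = y_mean - x_mean @ w
    return bias, w

# Hypothetical data: only features 0 and 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 1.0 + X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)

bias, w = fit_lasso(X, y, lam=100.0)
print(w)  # weights of the irrelevant features should be exactly 0
```

The surviving weights are shrunk relative to their true values, which is the price paid for the penalty. The irrelevant features drop out entirely.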

Note: The regularization term does not include the bias (β_0), because only very large weights (β_i’s for i>0) on the features contribute to overfitting. The bias term is just an intercept, so it has little to do with overfitting.

Phew…that was a lot about regularization. The methods above have one thing in common: they all use residuals/errors (ground truth minus prediction) in their cost function. These errors are measured parallel to the y-axis. We could also consider errors along the x-axis and proceed similarly. See the plot below.

Figure: y-errors and x-errors

What if we use a different kind of error?

Major axis (Orthogonal) regression

In this case, we consider errors in both directions (x-axis and y-axis). We minimize the sum of the squared perpendicular distances between the observed data points and the fitted line. Let’s visualize this by taking only one feature.

(X_i, Y_i): the foot of the perpendicular drawn from (x_i, y_i) to the best-fit line

Let our model be

Then the regression coefficients can be obtained by minimizing

under the constraints
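Put together, the model, the quantity to minimize, and the constraints can be written as below. This is a reconstruction from the definitions above, where (X_i, Y_i) is the foot of the perpendicular from (x_i, y_i); the symbol d_i for the perpendicular distance is an assumed name.

```latex
\text{Model:} \quad Y_i = \beta_0 + \beta_1 X_i

\text{Minimize:} \quad \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left[ (X_i - x_i)^2 + (Y_i - y_i)^2 \right]

\text{Subject to:} \quad Y_i - \beta_0 - \beta_1 X_i = 0, \qquad i = 1, \dots, n
```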

Reduced Major axis regression

This is very similar to the above method, with a slight change. Here, we minimize the sum of the areas of the rectangles that have (X_i, Y_i) and (x_i, y_i) as opposite corners.

Figure: Reduced Major Axis

The total area extended by n data points is,
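Reading the description literally, each rectangle's area is the product of the horizontal and vertical gaps between data point i and the line, and the total is their sum. This is a sketch under that reading; A and A_i are assumed names.

```latex
A = \sum_{i=1}^{n} A_i = \sum_{i=1}^{n} \left| X_i - x_i \right| \, \left| Y_i - y_i \right|
```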

The constraints here are the same as in orthogonal regression.
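For a single feature, reduced major axis regression has a well-known closed form: the slope is the ratio of the standard deviations, with the sign of the correlation. The sketch below uses that closed form and checks it against the rectangle-area criterion, taking the horizontal and vertical offsets from each point to the line as the rectangle's sides. The data and function names are assumptions for illustration.

```python
import numpy as np

def fit_rma(x, y):
    """Reduced major axis fit for one feature; returns (intercept, slope)."""
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def total_rectangle_area(x, y, intercept, slope):
    """Sum over points of (vertical offset) * (horizontal offset) to the line."""
    vertical = np.abs(y - (intercept + slope * x))
    horizontal = vertical / abs(slope)
    return float(np.sum(vertical * horizontal))

# Hypothetical data with noise in both x and y.
rng = np.random.default_rng(0)
x_true = rng.normal(scale=3.0, size=200)
x = x_true + rng.normal(scale=0.5, size=200)
y = 1.0 + 2.0 * x_true + rng.normal(scale=0.5, size=200)

intercept, slope = fit_rma(x, y)
area_fit = total_rectangle_area(x, y, intercept, slope)
area_steeper = total_rectangle_area(x, y, intercept, 1.2 * slope)
area_flatter = total_rectangle_area(x, y, intercept, 0.8 * slope)
print(slope, area_fit, area_steeper, area_flatter)
```

Perturbing the slope in either direction should give a larger total area than the closed-form fit.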

When should you use orthogonal regression?

One should go for orthogonal and reduced major axis regression when uncertainties are present in both the study (y) and explanatory (x) variables.

One interesting thing about orthogonal regression is that it produces a fit that is symmetric with respect to y-errors and x-errors. OLS does not have this symmetry, because it minimizes either y-errors or x-errors, not both.
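The symmetry can be checked by swapping the roles of x and y. In the sketch below, the orthogonal fit is computed from the leading principal direction of the centered data (a standard way to solve this problem for one feature), and the data and function names are assumptions. If a fit is symmetric, the slope of x on y is exactly the reciprocal of the slope of y on x.

```python
import numpy as np

def fit_ols_line(x, y):
    """OLS of y on x (minimizes y-errors only); returns (intercept, slope)."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y.mean() - slope * x.mean(), slope

def fit_orthogonal_line(x, y):
    """Orthogonal fit: the line through the centroid along the major axis."""
    pts = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    dx, dy = vt[0]  # direction of largest spread
    slope = dy / dx
    return y.mean() - slope * x.mean(), slope

# Hypothetical data with noise in both variables.
rng = np.random.default_rng(0)
x_true = rng.normal(scale=2.0, size=300)
x = x_true + rng.normal(scale=0.5, size=300)
y = 1.0 + 2.0 * x_true + rng.normal(scale=0.5, size=300)

_, ols_yx = fit_ols_line(x, y)           # regress y on x
_, ols_xy = fit_ols_line(y, x)           # regress x on y
_, orth_yx = fit_orthogonal_line(x, y)
_, orth_xy = fit_orthogonal_line(y, x)

# A symmetric fit gives a product of 1: both regressions describe the same line.
print(ols_yx * ols_xy, orth_yx * orth_xy)
```

For OLS the product equals the squared correlation, which is below 1 whenever the data are noisy, so the two OLS lines differ. For the orthogonal fit the product is 1 up to floating-point error.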

Still curious? Watch a video that I made recently…

I hope you enjoyed the reading. Until next time…Happy learning!