# Diving Deeper into Linear Regression

Original article can be found here (source): Artificial Intelligence on Medium

When I say “linear regression”, most of the people start thinking about the good old Ordinary Least Square(OLS) regression. If you are not familiar with the term, these equations might help…

Did you also think about OLS? If yes then you are on the right track. But there’s more to linear regression than just OLS! First, let us look at OLS a bit more closely.

# OLS

The name of this technique came from the cost function. Here, we take the sum of squared errors (the difference between ground truths and predictions) and try to minimize this. By minimizing the cost function we achieve the optimal value of the vector β (contains bias and weights). In the below plot, the contour (concentric ellipses) of the cost function is shown. After the minimization, we get β as the point at the center.

At first, it seems like OLS is enough for any regression problem. But as we increase the number of features and the complexity of data OLS tends to overfit the training data. The concept of overfitting is vast and deserves a separate article (you can find plenty of them) so I’m going to give you a brief. Overfitting means the model has learned the training data so well that it fails to generalize. In other words, the model has learned even the small scale (insignificant) variations in the train data so it fails to produce good predictions on unseen (validation and test) data. To tackle the problem of overfitting we can use many techniques. Adding a regularization (penalty) term to our cost function is one such technique. But what term should we use? We generally use one of the following two methods.

# Ridge

In this case, we add the sum of squares of weights to our least square cost function. So now it looks something like this…

But how does this term prevent overfitting? Adding this term is equivalent to adding an extra constraint on the possible values of β. Because to achieve the minimum cost, the sum of β²_j’s must not exceed a certain value (say r). This technique prevents the model from assigning very large weights to some features over the others thus tackling overfitting. Mathematically,