Linear Regression: A complete story

Source: Deep Learning on Medium

2. Mathematics behind the model:

Even though we have Python libraries that can perform regression analysis in a single line of code, it is really important to know the mathematics behind the model. Only when you know how a model works from scratch can you tweak the model parameters with respect to your problem statement and the dataset at hand to get the desired result.

There are two kinds of variables in a linear regression model:

  • The input, independent, or predictor variable(s) is the input to the model and helps in predicting the output variable. It is represented as X.
  • The output or dependent variable is the output of the model, i.e., the variable that we want to predict. It is represented as Y.

2.1 Simple linear regression:

When there is only one input/independent variable (X), it is called simple linear regression.

The simple linear regression equation looks like this:
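Yₑ = β0 + β1·X

where Yₑ is the predicted value, X is the input variable, β0 is the intercept, and β1 is the slope of the line.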

The main idea behind this model is to fit a straight line to the data. In order to get the best-fit line, we have to find the optimum values for the coefficients/parameters β0 and β1 such that they minimize the error between the predicted and the actual values.

So, how do we find the optimum values for β0 and β1? Simple linear regression can be solved using Ordinary Least Squares (OLS), a statistical method for finding the model parameters.

Ordinary Least Squares:

Ordinary Least Squares (OLS) regression is a statistical method that estimates the relationship between one or more independent variables and a dependent variable. The objective of the least squares method is to find the values of β0 and β1 that minimize the sum of the squared differences between Y and Yₑ.
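For the one-variable case, the OLS estimates have a well-known closed form: β1 is the covariance of X and Y divided by the variance of X, and β0 = mean(Y) − β1·mean(X). A minimal NumPy sketch of these formulas might look like this (the data values are made up purely for illustration):

```python
import numpy as np

# Made-up example data: one input variable X and one output variable Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS estimates for simple linear regression:
#   beta1 = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
#   beta0 = mean(Y) - beta1 * mean(X)
x_mean, y_mean = X.mean(), Y.mean()
beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

Y_pred = beta0 + beta1 * X  # the fitted values Ye
print(beta0, beta1)
```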

2.2 Multiple Linear Regression:

When there are multiple input variables/independent variables (X1, X2, X3, …), it is called multiple linear regression.

The multiple linear regression equation looks like this:
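Yₑ = β0 + β1·X1 + β2·X2 + … + βn·Xn

where X1, X2, …, Xn are the input variables and β0, β1, …, βn are the model parameters.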

In simple linear regression, we used the OLS formulas to find the best-fit line. In this case, however, we have more than one predictor variable, so those simple two-parameter formulas are no longer enough.

But we can still implement a linear regression model that performs Ordinary Least Squares regression using one of the following approaches:

  • Solving the model parameters analytically (Normal Equations method)
  • Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, etc.)

Normal Equation (closed-form solution)

This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values of the model parameters. It means that all of the data must be available and you must have enough memory to hold it and perform the matrix operations, so this method should be preferred for smaller datasets.

The multiple linear regression equation, with the parameters now written as θ, looks like this:
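Yₑ = θ0 + θ1·X1 + θ2·X2 + … + θn·Xn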

Now, taking the model parameters θ and X in a matrix form:
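θ = [θ0, θ1, …, θn]ᵀ is the column vector of model parameters, and each sample is written as a row X = [1, X1, X2, …, Xn], where the leading 1 multiplies the intercept term θ0.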

Thus the vectorized form of multiple linear regression equation looks like this:
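Yₑ = X·θ

where X is the m × (n+1) matrix whose rows are the individual samples (m samples, n features, plus the leading column of ones).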

The closed-form solution for finding the optimal values for the model parameters is:
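θ = (Xᵀ·X)⁻¹ · Xᵀ · Y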

For very large datasets, computing the matrix inverse is costly, and in some cases the inverse does not exist (the matrix is non-invertible or singular, e.g., in the case of perfect multicollinearity). In such cases, the Gradient Descent approach explained below is preferred.
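Before moving on, here is a minimal NumPy sketch of the normal-equation approach; the data values are made up for illustration, and np.linalg.pinv (the pseudo-inverse) is used so the sketch still works even when Xᵀ·X happens to be singular:

```python
import numpy as np

# Made-up example data: 5 samples, 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
Y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Add a column of ones so theta[0] acts as the intercept (theta_0).
X = np.c_[np.ones(len(X_raw)), X_raw]

# Normal equation: theta = (X^T X)^(-1) X^T Y
theta = np.linalg.pinv(X.T @ X) @ X.T @ Y
print(theta)  # [theta_0, theta_1, theta_2]
```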

Gradient Descent:

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Error/cost here represents the sum of squared errors between the predicted and the actual values. This error is defined in terms of a function called the Mean Squared Error (MSE) cost function. So, the objective of the Gradient Descent optimization algorithm is to minimize the MSE cost function.

Suppose you are standing on top of a hill in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with respect to the parameter vector θ and goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum.

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, the algorithm can jump over the minimum and may even diverge, so it never reaches the minimum.

To implement Gradient Descent, you need to compute the gradient of the cost function with respect to the parameter vector θ. The gradient tells you how much the cost function will change if there is a small change in θ. To get the gradient, we take the partial derivative of the cost function with respect to each parameter.

The MSE cost function is defined as,
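MSE(θ) = (1/m) · Σᵢ (Yₑ⁽ⁱ⁾ − Y⁽ⁱ⁾)²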

where Yₑ⁽ⁱ⁾ is the prediction for the i-th sample, Y⁽ⁱ⁾ is the corresponding actual value, and m is the number of samples in the dataset.

The partial derivative of the cost function is,
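∂MSE(θ)/∂θⱼ = (2/m) · Σᵢ (Yₑ⁽ⁱ⁾ − Y⁽ⁱ⁾) · Xⱼ⁽ⁱ⁾

where Xⱼ⁽ⁱ⁾ is the value of the j-th feature for the i-th sample.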

The above equation computes the partial derivative for each parameter θⱼ separately, summing over every data point. Instead, using the vectorized form, we can compute all of the partial derivatives in one go.

The vectorized form looks like,
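∇θ MSE(θ) = (2/m) · Xᵀ · (X·θ − Y)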

The gradient vector above contains all the partial derivatives of the cost function (one for each model parameter). The update rule to get the updated weights/parameters is given below,
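θ := θ − α · ∇θ MSE(θ)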

where α (alpha) is the learning rate hyperparameter.
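Putting the cost function, its gradient, and the update rule together, a minimal batch Gradient Descent implementation in NumPy might look like the sketch below; the function name, default hyperparameters, and data are purely illustrative:

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.01, n_iterations=1000):
    """Batch gradient descent for linear regression.

    X is the design matrix with a leading column of ones (for the intercept),
    Y is the target vector, and alpha is the learning rate.
    """
    m, n = X.shape
    theta = np.zeros(n)  # start from all-zero parameters
    for _ in range(n_iterations):
        gradient = (2 / m) * X.T @ (X @ theta - Y)  # vectorized gradient of MSE
        theta = theta - alpha * gradient            # update rule
    return theta

# Made-up example: 5 samples, 2 features, plus the intercept column of ones.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
Y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])
X = np.c_[np.ones(len(X_raw)), X_raw]
print(gradient_descent(X, Y))
```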

This sums up the methods available for finding the parameters of a Linear Regression model. From a practical point of view, we don't need to write the algorithm from scratch every time we want to apply a linear regression model. Python provides a machine learning library called scikit-learn, which contains a linear regression implementation that we can use with a single line of code, as you will see in the next section.