Source: Deep Learning on Medium
Linear regression is a supervised learning algorithm that finds the equation of the line that best fits the given data, so that it can be used to predict future values.
The linear relationship between the input variable x (independent variable) and the output variable y (dependent variable) can be represented by a straight line called the regression line.
Let's look at the equation of a line, y = mx + b. This is a hypothesis. In Machine Learning the same line is conventionally written as h(x) = Θ₀ + Θ₁x, which is the most common way of writing a hypothesis.
Now, to understand this hypothesis, we will take the example of housing prices.
To say it more technically, we have to tune the values of Θ₀ (the base price) and Θ₁ in such a way that our line fits the dataset in the best way possible. Now we need some metric to determine the 'best' line, and we have it. It's called a cost function. Let's look into it.
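As a small illustrative sketch (the numbers and names here are invented, not from the article), the hypothesis with a base price Θ₀ and a per-unit rate Θ₁ is just a linear function:

```python
def hypothesis(theta0, theta1, x):
    """Predict y for input x using h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: a base price of 50 plus 10 per unit of x (illustrative values)
print(hypothesis(50, 10, 3))  # 80
```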
The cost function of linear regression is

J(Θ₀, Θ₁) = (1/2m) · Σᵢ (h(xᵢ) − yᵢ)²

Since h(xᵢ) = Θ₀ + Θ₁xᵢ, we can rewrite it as

J(Θ₀, Θ₁) = (1/2m) · Σᵢ (Θ₀ + Θ₁xᵢ − yᵢ)²
Here m means the total number of examples in your dataset. In our example, m will be the total number of houses in our dataset.
What we are doing is simply averaging the squared distances between the predicted values and the actual values over all m examples.
Look at the graph above, here m = 4. The points on the blue line are predicted values while the red points are actual values. The green line is the distance between the actual value and the predicted value.
The cost for this line is just the mean of the squared lengths of the green lines. We also divide by 2 to simplify some calculations that come later: the 2 cancels against the exponent when we differentiate the cost.
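This cost can be computed in a few lines of NumPy. A minimal sketch (the sample data below is made up for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x   # h(x) for every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(cost(0.0, 2.0, x, y))  # 0.0 — the line y = 2x fits this data perfectly
```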
Linear Regression tries to minimize this cost by finding the proper values of Θ₀ and Θ₁. How? By using Gradient Descent.
Gradient Descent is a very important algorithm when it comes to Machine Learning. Right from Linear Regression to Neural Networks, it is used everywhere.
The weights are updated with the rule

Θⱼ := Θⱼ − α · ∂J/∂Θⱼ  (for j = 0 and j = 1)

This update rule is executed in a loop and moves us toward the minimum of the cost function. Here α is a constant called the learning rate.
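Putting the update rule into code, here is a hedged sketch of batch gradient descent for the two parameters (the data and hyperparameters below are illustrative, not from the article):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Minimise the cost by repeatedly stepping against the gradient."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y   # h(x_i) - y_i for every example
        # Partial derivatives of J; the 1/2 in the cost cancels the 2
        # produced by differentiating the square.
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # Simultaneous update with constant learning rate alpha
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # true line: y = 2x + 1
t0, t1 = gradient_descent(x, y)
print(round(t0, 2), round(t1, 2))    # converges close to 1.0 and 2.0
```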
In a visualization of this process, you can see how the line fits itself to the dataset. Note that initially the line moves very quickly, but as the cost decreases, the updates become smaller and the line slows down.
Assumptions of Linear Regression
First, linear regression requires the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to them. The linearity assumption is best tested with scatter plots.
Secondly, linear regression analysis requires all variables to be multivariate normal. This assumption is best checked with a histogram or a Q-Q plot, and normality can be tested with a goodness-of-fit test such as the Kolmogorov-Smirnov test. When the data is not normally distributed, a non-linear transformation (e.g., a log transformation) may fix the issue.
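As a sketch of such a check (assuming SciPy is available; the "residuals" below are synthetic stand-ins, not real regression output):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic residuals

# Kolmogorov-Smirnov test against a normal with the sample's mean/std
stat, p_value = stats.kstest(residuals, 'norm',
                             args=(residuals.mean(), residuals.std()))
print(p_value > 0.05)  # True here: no evidence against normality

# A log transformation is one common fix for right-skewed data:
skewed = rng.lognormal(size=500)
print(stats.skew(np.log(skewed)) < stats.skew(skewed))  # transform reduces skew
```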
Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
Fourth, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other.
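A quick way to screen for the multicollinearity mentioned above is a correlation matrix of the independent variables. A hedged sketch (the DataFrame below is synthetic; the column names only mimic typical housing features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 3500, size=200)
df = pd.DataFrame({
    "sqft_living": sqft,
    "sqft_above": sqft * 0.8 + rng.normal(0, 50, size=200),  # nearly collinear
    "bedrooms": rng.integers(1, 6, size=200),
})

# Pairwise correlations between independent variables;
# values near +1 or -1 flag multicollinearity.
corr = df.corr()
print(corr.loc["sqft_living", "sqft_above"] > 0.95)  # True: strongly correlated
```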
Advantages of Linear Regression
1. Linear Regression performs well when the relationship in the data is approximately linear. We can use it to determine the nature of the relationship between the variables.
2. Linear Regression is easy to implement and interpret, and very efficient to train.
3. Linear Regression is prone to over-fitting, but this can be avoided using dimensionality reduction techniques, regularization (L1 and L2), and cross-validation.
Disadvantages of Linear Regression
1. The main limitation of Linear Regression is the assumption of linearity between the dependent variable and the independent variables. In the real world, relationships are rarely perfectly linear, so assuming a straight-line relationship between the dependent and independent variables is often incorrect.
2. Prone to noise and overfitting: if the number of observations is less than the number of features, Linear Regression should not be used, as it may overfit by starting to model the noise while building the model.
3. Prone to outliers: Linear regression is very sensitive to outliers (anomalies). So, outliers should be analyzed and removed before applying Linear Regression to the dataset.
4. Prone to multicollinearity: Before applying Linear regression, multicollinearity should be removed (using dimensionality reduction techniques) because it assumes that there is no relationship among independent variables.
In summary, Linear Regression is a great tool for analyzing the relationships among variables, but it isn't recommended for many practical applications because it over-simplifies real-world problems by assuming a linear relationship among the variables.
An Example: Predicting house prices with linear regression using scikit-learn
Setting the environment:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
Read the house prices data:
houses = pd.read_csv("kc_house_data.csv")
You can find the full code for this house-price analysis on my GitHub profile.
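In case the King County CSV is not at hand, here is a self-contained sketch of the same workflow on synthetic data (the column names only mirror typical fields in `kc_house_data.csv`; the numbers and coefficients are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the house-price data
rng = np.random.default_rng(42)
n = 500
sqft = rng.uniform(500, 4000, size=n)
bedrooms = rng.integers(1, 6, size=n)
price = 50_000 + 200 * sqft + 10_000 * bedrooms + rng.normal(0, 20_000, size=n)
houses = pd.DataFrame({"sqft_living": sqft, "bedrooms": bedrooms, "price": price})

X = houses[["sqft_living", "bedrooms"]]
y = houses["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(round(model.coef_[0]))  # recovers a slope close to the true 200 per sqft
print(rmse < 25_000)          # the noise std was 20,000, so RMSE lands near it
```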