Original article was published by Sandeep Ram on Artificial Intelligence on Medium
As a fresher in the field of machine learning, the first thing that you learn would be simple univariate linear regression. However, for the past decade or so, tree-based algorithms and neural networks have overshadowed the significance of linear regression on a commercial scale. The purpose of this blog post is to highlight why linear regression and other linear algorithms are still very relevant and how you can improve the performance of such rudimentary models to compete with large and sophisticated algorithms like XGBoost and Random Forests.
How to improve the performance of linear models:
Many self-taught data scientists start code-first, learning how to implement various machine learning algorithms without actually understanding the mathematics behind them. By understanding the math behind these algorithms, we can get an idea of how to improve their performance.
The mathematics behind Linear Regression makes a few fundamental assumptions about the data that the model will be receiving:
- Linearity
- No Heteroskedasticity
- No Multicollinearity
- Normal Distribution
- No Autocorrelation
Let’s dive deeper into a few of these assumptions and find ways to improve our models.
Linearity
A linear model tries to fit a straight line through the data points given to it. It looks similar to the graph given below.
However, this kind of model fails to fit data points that do not follow a linear pattern. Consider the relation y = x² + c ± noise. The graph of this function is parabolic, so fitting a straight line through it would not result in a good fit.
We need to transform the independent variables by applying exponential, logarithmic, or other transformations to get a function that is as close as possible to a straight line.
What this means is that by changing my independent variable from x to x² by squaring each term, I would be able to fit a straight line through the data points while maintaining a good RMSE.
Thus, before building our final model, we need to figure out whether the dependent variable is related directly to each independent variable or to a transformation of it.
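The effect of squaring the independent variable can be checked with a small experiment. The following is a minimal sketch on synthetic data, assuming c = 3 and unit-variance Gaussian noise; the helper name fit_line_rmse is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following y = x^2 + c +/- noise, with an assumed c = 3.
x = np.linspace(-5, 5, 200)
y = x**2 + 3.0 + rng.normal(scale=1.0, size=x.size)


def fit_line_rmse(feature, target):
    """Fit target = a * feature + b by least squares and return the RMSE."""
    a, b = np.polyfit(feature, target, deg=1)
    predictions = a * feature + b
    return float(np.sqrt(np.mean((target - predictions) ** 2)))


rmse_raw = fit_line_rmse(x, y)         # straight line through (x, y)
rmse_squared = fit_line_rmse(x**2, y)  # straight line through (x^2, y)

print(f"RMSE using x as the feature:   {rmse_raw:.2f}")
print(f"RMSE using x^2 as the feature: {rmse_squared:.2f}")
```

The model is still linear in both cases; only the feature fed to it changes, and the error with the squared feature should drop to roughly the noise level.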
No Heteroskedasticity (constant variance)
Consider a situation where you are employed at a job that pays $3,000 a month, while your living expenses total $2,950. You are left with $50 every month to spend at your leisure. You could choose to spend this money or save it. In either case, it is not going to make a significant difference immediately: your bank balance changes by a minimal amount (anywhere from $0 to $50). Ten years down the line, you are the CEO of a multinational company earning upwards of $100,000 every month, and your living expenses total $25,000. Now you can choose to spend the remaining $75,000 or just a fraction of it. The change in your bank balance at the end of the month can be anywhere between $0 and $75,000. The range here is much larger than it was 10 years ago: the variance in how much you can save has increased with time.
Linear regression assumes that the variance of the error terms is constant, i.e., it does not increase or decrease as the independent variables (or the predicted values) change.
The graph should look more like this to fit a good linear model.
When the variance is not constant (heteroskedasticity), the standard errors of the linear model will not be reliable.
How to detect this:
How to fix this:
- Use variable transformations, similar to fixing linearity.
No Multicollinearity
Consider a problem statement where you are asked to predict the cost of a real-estate property based on the length of the plot, the land area, and proximity to schools and public infrastructure. Here it is evident that two of the independent variables (the length and the area of the plot) are directly related: as the length increases, the area also increases. Such a correlation between independent variables makes the coefficient estimates unstable and affects the performance of linear regression.
How to identify this:
- Using the Pearson correlation coefficient.
How to fix this:
- Drop one of the two variables
- Combine the correlated features into a single new independent variable and drop the original correlated features.
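The detection and both fixes can be sketched with pandas on made-up real-estate data. The column names, the 0.9 correlation threshold, and the averaged z-score used as the combined feature are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Hypothetical plots: width is nearly constant, so area closely tracks length.
length = rng.uniform(10, 50, size=n)
width = rng.uniform(18, 22, size=n)
df = pd.DataFrame({
    "plot_length": length,
    "land_area": length * width,
    "school_distance": rng.uniform(0.1, 5.0, size=n),
})

# Pearson correlation between every pair of independent variables.
corr = df.corr(method="pearson")
threshold = 0.9  # assumed cut-off for "too correlated"

correlated_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(correlated_pairs)

# Fix 1: drop one of the two correlated variables.
df_dropped = df.drop(columns=["plot_length"])

# Fix 2: combine the correlated variables into one new feature
# (here the mean of their z-scores) and drop the originals.
size_cols = ["plot_length", "land_area"]
z_scores = (df[size_cols] - df[size_cols].mean()) / df[size_cols].std()
df_combined = df.drop(columns=size_cols).assign(plot_size=z_scores.mean(axis=1))
print(df_combined.head())
```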
Normality of Distribution
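This is one of the most important factors that people usually forget before building a linear model. It is important that the continuous variables in the dataset be approximately Gaussian distributed. (Strictly speaking, the assumption applies to the model's error terms, but heavily skewed variables often lead to skewed errors.)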
Gaussian distribution: a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, a normal distribution appears as a bell curve.
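Skewness measures how asymmetric a distribution is. A Gaussian distribution has a skewness of zero, so a large positive or negative skew signals a deviation from it. The skewness of a distribution can be calculated in Python using the scipy module (scipy.stats.skew).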
In the above graph, the black line refers to the Gaussian distribution that we hope to reach, and the blue line represents the kernel density estimate (KDE) of the given data before the transformation.
Fixing the skewness of data can provide a huge boost to the accuracy of linear models.
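Measuring skewness with scipy takes a single call. The sketch below compares a symmetric sample with a right-skewed one; both samples are synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)

# A symmetric (Gaussian) sample and a right-skewed (log-normal) sample.
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_symmetric = float(skew(symmetric))
skew_right = float(skew(right_skewed))

# Skewness near 0 means roughly symmetric; a positive value means a long right tail.
print(f"Skewness of the Gaussian sample:   {skew_symmetric:.3f}")
print(f"Skewness of the log-normal sample: {skew_right:.3f}")
```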
Transformations that can be applied to fix skewness:
- Logarithmic Transformation: This works best if the data is right-skewed, i.e., the distribution has a long tail on the right end.
- Exponential Transformation: Raise each value to a power ‘c’, where ‘c’ is an arbitrary constant (usually between 0 and 5).
- Box-Cox Transformation: This transformation fixes skewness in at least 80% of the cases. This is easily the most powerful tool to fix skewness. Note that it requires strictly positive values.
- Reciprocal Transformation: Replace each value with its reciprocal (inverse).
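The four transformations listed above can be compared on one right-skewed sample. This is a minimal sketch on synthetic log-normal data; the power c = 0.5 is an arbitrary choice, and which transformation works best depends on the data at hand.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(4)

# Strictly positive, right-skewed sample (Box-Cox and log need positive values).
data = rng.lognormal(mean=1.0, sigma=0.8, size=5_000)

log_data = np.log(data)                    # logarithmic transformation
power_data = data ** 0.5                   # power transformation with c = 0.5
boxcox_data, fitted_lambda = boxcox(data)  # lambda chosen by maximum likelihood
reciprocal_data = 1.0 / data               # reciprocal transformation

candidates = {
    "original": data,
    "log": log_data,
    "power (c=0.5)": power_data,
    "box-cox": boxcox_data,
    "reciprocal": reciprocal_data,
}
for name, values in candidates.items():
    print(f"{name:>14}: skewness = {skew(values):.3f}")
print(f"Box-Cox lambda: {fitted_lambda:.3f}")
```

For this particular sample the logarithm and Box-Cox should bring the skewness close to zero, while the reciprocal does not help, which is a reminder to check the skewness after each transformation rather than assume it worked.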
No Autocorrelation
The textbook definition of autocorrelation is:
Autocorrelation refers to the degree of correlation between the values of the same variables across different observations in the data.
Let’s try to understand this with the help of an example. Autocorrelation is most often discussed in the context of time series data, in which observations occur at different points in time, so we will take the example of the stock price of an imaginary company (XYZ Inc.). The stock price of this company has been stable at around $50 for the past 3 days. From this data, we can infer that the stock price on the 4th day will most likely be around the same $50. However, the same cannot be said about the price 2 months from now. Data in which points that are close to each other are more strongly correlated than points that are far apart is called autocorrelated data.
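The stock-price intuition can be reproduced with a simulated series. The sketch below assumes a simple AR(1) process hovering around $50 (a modelling assumption, not real market data) and compares the correlation at a lag of 1 day with the correlation at a lag of 60 days.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated daily price that drifts around $50; each day depends on the previous one.
n_days = 2_000
phi = 0.9  # assumed strength of the day-to-day dependence
price = np.empty(n_days)
price[0] = 50.0
for t in range(1, n_days):
    price[t] = 50.0 + phi * (price[t - 1] - 50.0) + rng.normal(scale=0.5)


def lag_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    return float(np.corrcoef(series[:-lag], series[lag:])[0, 1])


lag1 = lag_correlation(price, 1)    # tomorrow vs. today
lag60 = lag_correlation(price, 60)  # about two months apart

print(f"Correlation at lag 1:  {lag1:.3f}")
print(f"Correlation at lag 60: {lag60:.3f}")
```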
Impacts on Linear Regression:
- Autocorrelation distorts the standard errors, which affects the apparent goodness of fit of the line. However, the coefficient estimates remain unbiased.
- The statistical measures such as p-value, t-value and standard error will not be reliable in the presence of autocorrelation.
How to detect the presence of autocorrelation:
- Durbin-Watson test: This compares successive error terms to check whether they are directly or inversely correlated with each other. The test statistic ranges from 0 to 4. Values between 1.5 and 2.5 tell us that autocorrelation is not a problem in that predictive model.
Values from 0 to 1.5 indicate significant positive autocorrelation, while values above 2.5 indicate significant negative autocorrelation.
However, because it only compares successive error terms, this test fails to detect autocorrelation between data points that are not consecutive but equally spaced (e.g., the stock price on every Friday is $52.50 ± $0.50).
- Breusch-Godfrey Test: This is slightly more complicated than the previous test. Simply put, this test requires you to build a model, calculate the error terms for each of the data points, and try to predict the error term at time t as a function of all the preceding error terms.
This test can be performed using the statsmodels module as well; see its official documentation for details.
I hope you found this story informative. Leave a clap if you think this has helped you, and comment if you have any queries. Wish you luck on your data-science journey