Original article was published on Artificial Intelligence on Medium
In machine learning, linear regression is applied to predict an outcome (called the dependent variable) as a function of one or more predictors (called independent variables), which are correlated with the outcome. For example, the weight of students in a class can be predicted with two variables — age, height — which are correlated with weight. In a linear function this relation can be represented as:
weight = c + b1*age + b2*height
c, b1, b2 are parameters to be estimated from training data.
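As a quick sketch, the parameters c, b1, b2 can be estimated by ordinary least squares. The ages, heights, and weights below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical training data: age (years), height (cm), weight (kg)
age = np.array([10, 11, 12, 13, 14], dtype=float)
height = np.array([140, 145, 150, 158, 163], dtype=float)
weight = np.array([35, 38, 42, 48, 52], dtype=float)

# Design matrix with a column of ones for the intercept c
X = np.column_stack([np.ones_like(age), age, height])

# Least-squares estimates of c, b1, b2
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)
c, b1, b2 = coeffs

# Predict the weight of a new student (age 12, height 152 cm)
new_student = np.array([1.0, 12.0, 152.0])
predicted_weight = new_student @ coeffs
```

The same fit is what libraries like scikit-learn's LinearRegression perform under the hood.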
In this kind of regression, two important assumptions are made: a) that the outcome is a continuous variable, and b) that it is normally distributed.
However, in reality, this is not always the case. Outcomes are not always normally distributed, nor do they always have to be continuous variables.
Here are just a few of many examples where these assumptions are violated:
- The outcome variable is binary, with two classes; e.g. survived/not-survived in the Titanic disaster
- The outcome is a categorical variable, of multiple classes; e.g. land cover classes to be predicted from satellite imagery
- The outcome is count data; e.g. number of road accidents per year
- The outcome is a continuous variable but skewed, not normally distributed; e.g. US income distribution is skewed to the right
Since ordinary linear regression is not suitable in those instances, what are the alternatives? Here are some options:
In the first two cases above, where the outcome/dependent variable is binary or categorical, machine learning classification models work well. Logistic Regression for binary outcomes and Random Forest for multi-class classification are two frequently applied algorithms in the machine learning world.
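A minimal sketch of both classifiers, using scikit-learn with a synthetic dataset standing in for a real binary outcome such as survived/not-survived:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (placeholder for real data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic Regression for a binary outcome
logit = LogisticRegression().fit(X_train, y_train)
logit_acc = logit.score(X_test, y_test)

# Random Forest, which also handles multi-class outcomes
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)
```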
That leaves us with the following two situations where neither ordinary linear regression nor classification algorithms will work:
1) Count outcome
2) Continuous but skewed outcome
This is where Generalized Linear Models (GLMs) come in handy (aside: it’s generalized linear models, NOT general linear model, which refers to conventional OLS regression). GLMs are a class of models applied in cases where linear regression isn’t applicable or fails to make appropriate predictions.
A GLM consists of three components:
- Random component: an exponential family of probability distributions;
- Systematic component: a linear predictor; and
- Link function: a function connecting the linear predictor to the mean of the outcome distribution, which is what generalizes linear regression.
There are several great packages in R and Python for fitting GLMs, but below is an implementation using the statsmodels library in Python.
Note that if you are into R programming language, be sure to check out this example from a Princeton researcher.
In summary, we’ve discussed that ordinary linear regression applies when the outcome is a continuous, normally distributed variable. There are many cases where these two assumptions do not hold, and in those situations the suite of Generalized Linear Models applies. A GLM has three elements: a random component, a systematic component, and a link function, each of which needs to be specified in a model implementation.
Hope this was useful, you can follow me on Twitter for updates and new article alerts.