Evaluation metrics & Model Selection in Linear Regression

Original article was published by NVS Yashwanth on Artificial Intelligence on Medium

Model selection & Subset Regression

Let me make this clear: when you develop a model using all of the available predictor (regressor) variables, it is termed a full model.
If you drop one or more of those predictors, the resulting model is a subset model.

The general idea behind subset regression is to find out which does better: the subset model or the full model.

We select the subset of predictors that does best among all the candidate subsets, i.e., the one with the largest adjusted R² or the smallest MSE.

However, R² itself is never used for comparing models, as its value increases with the number of predictors, even when those predictors add no value to the model.

Reason for model selection
We set out to select the best subset of predictors that explain the data well.
A simpler model that adequately explains the relationship is always a better option due to the reduced complexity. The addition of unnecessary regressor variables will add noise.

We will now look at the most common criteria and strategies for comparing and selecting the best models.

Adjusted R-squared — selection criterion

The main difference between adjusted R-squared and R-squared is that R-squared measures the proportion of variance in the dependent variable explained by all of the independent variables together, while adjusted R-squared corrects that figure for the number of predictors, so it only rises when a new predictor genuinely improves the fit.

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)

In the equation above, n is the number of data points while k is the number of variables in your model, excluding the constant.

R² tends to increase with an increase in the number of independent variables. This can be misleading. Thus, the adjusted R-squared penalizes the model for adding further independent variables (k in the equation) that do not improve the fit.
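The penalty can be seen numerically. A minimal sketch (the numbers below are made up for illustration): the second model gains a fourth predictor that nudges R² up only slightly, and adjusted R² goes down.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - k - 1),
    with n data points and k predictors (excluding the constant)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A fourth predictor nudges R-squared from 0.750 to 0.752,
# yet adjusted R-squared falls: the gain did not justify the extra term.
print(adjusted_r2(0.750, n=100, k=3))  # 0.7421875
print(adjusted_r2(0.752, n=100, k=4))  # ~0.7416
```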

Mallow’s Cp — selection criterion

Mallow’s Cp measures the usefulness of a subset model by estimating its mean squared prediction error.

Cp = (RSSₚ / MSEₖ) − n + 2p

Here p is the number of regressors in the subset model, RSSₚ is that model's residual sum of squares, MSEₖ is the mean squared error of the full model with k predictors, and n is the sample size. The statistic is useful when n ≫ k > p.

Mallow’s Cp compares the full model with a subset model. If Cp is approximately equal to p (smaller is better), then the subset model is an appropriate choice.

One can plot Cp versus p for every subset model to identify a candidate model.
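In code, the statistic is a one-liner. A minimal sketch, assuming the full model's MSE is already in hand (the numbers are illustrative):

```python
def mallows_cp(rss_p, mse_full, n, p):
    """Mallow's Cp = RSS_p / MSE_full - n + 2p for a subset model with p
    regressors, given the full model's mean squared error and sample size n."""
    return rss_p / mse_full - n + 2 * p

# A well-specified subset model should land near Cp = p:
print(mallows_cp(rss_p=90.0, mse_full=2.0, n=50, p=5))   # 5.0 -> good subset
print(mallows_cp(rss_p=120.0, mse_full=2.0, n=50, p=5))  # 20.0 -> poor subset
```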

Exhaustive and Best subset searching

The exhaustive search looks at all the models. If there are k regressors, there are 2ᵏ possible models, so this is a very slow process.

The best subset strategy simplifies the search by finding, for each subset size p, the model that minimizes RSS.
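The two strategies can be sketched together with a brute-force loop: for every subset size p, fit each combination of columns by least squares and keep the one with the smallest RSS. The `best_subset` helper and the synthetic data below are illustrative assumptions, not code from the article.

```python
import itertools
import numpy as np

def best_subset(X, y):
    """For each subset size p, return the column subset of X with minimal RSS.
    Brute force over all 2^k - 1 non-empty subsets; an intercept is always included."""
    n, k = X.shape
    best = {}
    for p in range(1, k + 1):
        for cols in itertools.combinations(range(k), p):
            Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ beta) ** 2))
            if p not in best or rss < best[p][1]:
                best[p] = (cols, rss)
    return best

# Only the first column truly drives y, so the best size-1 subset should be (0,).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)
print(best_subset(X, y)[1][0])
```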

Stepwise Regression

Stepwise regression is faster than exhaustive or best subset searching. It is an iterative procedure for choosing the best model.
Stepwise regression is classified into backward and forward selection.
Backward selection starts with the full model; step by step we remove regressor variables, looking for the model with the smallest RSS, largest adjusted R², or smallest MSE. The variables to drop are the ones with high p-values. It is, however, important to note that you cannot drop just one level of a categorical variable, as doing so would result in a biased model: you either drop all levels of the categorical variable or none.
Forward selection starts with the null model; step by step we add regressor variables until we can no longer improve the error performance of the model. We usually pick the model with the highest adjusted R².
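Forward selection can be sketched as a greedy loop. The `forward_select` helper, its adjusted-R² scoring rule, and the synthetic data are illustrative assumptions, not the article's code:

```python
import numpy as np

def forward_select(X, y, names):
    """Greedy forward selection: start from the null model, at each step add the
    candidate predictor that most improves adjusted R-squared, and stop when no
    addition improves it."""
    n, k = X.shape

    def adj_r2(cols):
        # Fit by least squares with an intercept plus the chosen columns.
        Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        tss = float(np.sum((y - y.mean()) ** 2))
        r2 = 1 - rss / tss
        return 1 - (1 - r2) * (n - 1) / (n - len(cols) - 1)

    selected, current = [], adj_r2([])  # null model scores 0
    while len(selected) < k:
        scores = {c: adj_r2(selected + [c]) for c in range(k) if c not in selected}
        cand = max(scores, key=scores.get)
        if scores[cand] <= current:
            break  # no remaining predictor improves the model
        selected.append(cand)
        current = scores[cand]
    return [names[c] for c in selected]

# y depends on x0 (strongly) and x1 (weakly); x2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=40)
print(forward_select(X, y, ["x0", "x1", "x2"]))
```

The same loop run in reverse (start full, drop the worst predictor each step) gives backward selection.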