Machine Learning With R: Linear Regression

The original article was published by Dario Radečić in Artificial Intelligence on Medium.


Model training and evaluation

This will be the longest section thus far, so get yourself a cup of coffee. We’ll start with the train/test split: we want to divide our dataset into two parts, a larger one on which the model is trained and a smaller one used for model evaluation.

Before we do anything, let’s set a random seed. The train/test split is a random process, and a seed ensures the randomization works the same way on your computer and mine:

set.seed(42)

Great! Let’s perform the split now. 70% of the data is used for training, and the remaining 30% is used for testing. Here’s the code:

# sample.split() comes from the caTools package
library(caTools)

sampleSplit <- sample.split(Y=df$Weight, SplitRatio=0.7)
trainSet <- subset(x=df, sampleSplit==TRUE)
testSet <- subset(x=df, sampleSplit==FALSE)

After executing the code, you should see the new variables appear in the Environment panel (top right in RStudio). We have 159 rows in total, of which 111 were allocated for model training, and the remaining 48 are used to test the model.
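You can confirm those counts from the console; a quick sanity check, not part of the original walkthrough:

nrow(df)       # 159 rows in total
nrow(trainSet) # 111 rows for training
nrow(testSet)  # 48 rows for testing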

We can now proceed with the model training.

R uses the following syntax for linear regression models:

model <- lm(target ~ var_1 + var_2 + ... + var_n, data=train_set)
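For instance, a model with just two predictors would look like the sketch below. Keep in mind that Height and Width are hypothetical column names here, used purely for illustration:

# Hypothetical example -- assumes the dataset has Height and Width columns
smallModel <- lm(Weight ~ Height + Width, data=trainSet)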

That’s okay, but imagine we had 100 predictors; writing every single one into the equation would be a nightmare. Instead, we can use the following syntax:

model <- lm(target ~ ., data=train_set)

Keep in mind — this only works if you decide to use all predictors for model training. Accordingly, we can train the model like this:

model <- lm(formula=Weight ~ ., data=trainSet)

In a nutshell — we’re trying to predict the Weight attribute as a linear combination of every other attribute. R also handles the categorical attributes automatically. Take that, Python!
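Here’s a quick way to see that in action. I’m assuming the dataset contains a categorical column named Species, and the level names in the comment are made up for illustration:

# lm() expands factor/character columns into dummy variables automatically.
# Assuming a categorical Species column exists:
str(trainSet$Species)

# In summary(model), each non-reference level then gets its own coefficient,
# e.g. SpeciesParkki, SpeciesPerch, and so on.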

Next, we can take a look at the summary of our model:

summary(model)

The most interesting thing here is the P-values, displayed in the Pr(>|t|) column. A P-value is the probability of observing a coefficient at least this extreme if the variable actually had no effect on the target (the null hypothesis). It’s common to use a 5% significance threshold, so if a P-value is 0.05 or below, we can reject that null hypothesis and treat the variable as significant for the analysis. Sorry for the negation, that’s just how the hypotheses are formed.
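If you’d rather grab those P-values programmatically than read them off the printed summary, the coefficient table returned by summary() contains them as a column:

# Extract the Pr(>|t|) column from the coefficient table
pValues <- summary(model)$coefficients[, 'Pr(>|t|)']

# Keep only the variables that clear the 5% threshold
pValues[pValues <= 0.05]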

Awesome! Next, we can make a residuals plot, or a residuals histogram to be more precise. Here we expect to see something approximately normally distributed. Let’s see what the histogram looks like:

library(ggplot2)

# Collect the residuals into a data frame with a clean column name
modelResiduals <- data.frame(Residuals = residuals(model))

ggplot(modelResiduals, aes(x = Residuals)) +
  geom_histogram(fill='deepskyblue', color='black')

Well, there’s a bit of skew due to the value on the far right, but after eyeballing it we can conclude that the residuals are approximately normally distributed.
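If you don’t want to rely on eyeballing alone, a normal Q-Q plot is a standard base-R check (this goes a step beyond the original walkthrough):

# Points hugging the line suggest approximately normal residuals
qqnorm(residuals(model))
qqline(residuals(model))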

And now we can finally make predictions! It’s quite easy to do so:

preds <- predict(model, testSet)

And now we can evaluate. We’ll create a dataframe of actual and predicted values, for starters:

modelEval <- cbind(testSet$Weight, preds)
colnames(modelEval) <- c('Actual', 'Predicted')
modelEval <- as.data.frame(modelEval)

Let’s examine the first couple of rows to see how the predictions compare to the actual values.
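A quick way to print that side-by-side comparison (head() shows the first six rows by default):

head(modelEval)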

It’s not the best model — at least not without any tuning, but we’re still getting decent results. How decent? Well, that’s what metrics like MSE and RMSE will tell us. Here are the calculations:

mse <- mean((modelEval$Actual - modelEval$Predicted)^2)
rmse <- sqrt(mse)

We got an RMSE value of 95.9, and MSE is, well, just the square of that. This means that we’re wrong by 95.9 units of Weight on average. I’ll leave it up to you to decide how good or bad that is.
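One rough way to judge it (my own rule of thumb, not from the original article) is to express the RMSE as a fraction of the average Weight:

# RMSE relative to the mean actual value -- lower is better
rmse / mean(modelEval$Actual)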

In my opinion, the model has overfit the training data due to the high correlations between the input variables (multicollinearity). Also, the coefficient of determination (R²) on the training set is insanely high (0.93+).
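Both claims are easy to verify. Here’s a minimal sketch, assuming the non-target columns are numeric apart from the categorical one:

# Correlation matrix of the numeric columns -- predictor pairs
# with values near 1 indicate multicollinearity
numericCols <- sapply(df, is.numeric)
cor(df[, numericCols])

# R-squared on the training set, straight from the model summary
summary(model)$r.squared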

And that’s just enough for today. Let’s wrap things up in the next section.