Machine Learning with Python: Regression (complete tutorial)

Original article was published on Artificial Intelligence on Medium

Setup

First of all, I need to import the following libraries.

## for data
import pandas as pd
import numpy as np
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for statistical tests
import scipy
import statsmodels.formula.api as smf
import statsmodels.api as sm
## for machine learning
from sklearn import model_selection, preprocessing, feature_selection, ensemble, linear_model, metrics, decomposition
## for explainer
from lime import lime_tabular

Then I will read the data into a pandas Dataframe. The original dataset contains 81 columns, but for the purposes of this tutorial, I will work with a subset of 12 columns.

dtf = pd.read_csv("data_houses.csv")cols ["OverallQual","GrLivArea","GarageCars", 
"GarageArea","TotalBsmtSF","FullBath",
"YearBuilt","YearRemodAdd",
"LotFrontage","MSSubClass"]
dtf = dtf[["Id"]+cols+["SalePrice"]]
dtf.head()

Details about the columns can be found in the provided link to the dataset.

Please note that each row of the table represents a specific house (or observation). If you are working with a different dataset that doesn’t have a structure like that, in which each row represents an observation, then you need to summarize data and transform it.

Now that it’s all set, I will start by analyzing data, then select the features, build a machine learning model and predict.

Let’s get started, shall we?

Data Analysis

In statistics, exploratory data analysis is the process of summarizing the main characteristics of a dataset to understand what the data can tell us beyond the formal modeling or hypothesis testing task.

I always start by getting an overview of the whole dataset, in particular, I want to know how many categorical and numerical variables there are and the proportion of missing data. Recognizing a variable’s type sometimes can be tricky because categories can be expressed as numbers. To this end, I am going to write a simple function that will do that for us:

'''
Recognize whether a column is numerical or categorical.
'''

def utils_recognize_type(dtf, col, max_cat=20):
if (dtf[col].dtype == "O") | (dtf[col].nunique() < max_cat):
return "cat"
else:
return "num"

This function is very useful and can be used on several occasions. To give an illustration I’ll plot a heatmap of the dataframe and visualize columns type and missing data.

dic_cols = {col:utils_recognize_type(dtf, col, max_cat=max_cat) for col in dtf.columns}heatmap = dtf.isnull()
for k,v in dic_cols.items():
if v == "num":
heatmap[k] = heatmap[k].apply(lambda x: 0.5 if x is False else 1)
else:
heatmap[k] = heatmap[k].apply(lambda x: 0 if x is False else 1)
sns.heatmap(heatmap, cbar=False).set_title('Dataset Overview')
plt.show()
print("\033[1;37;40m Categerocial ", "\033[1;30;41m Numeric ", "\033[1;30;47m NaN ")

There are 1460 rows and 12 columns:

  • each row of the table represents a specific house (or observation) identified by Id, so I’ll set it as the index (or primary key of the table for SQL lovers).
  • SalePrice is the dependent variable that we want to understand and predict, so I’ll rename the column “Y”.
  • OverallQuall, GarageCars, FullBath and MSSubClass are categorical variables while the others are numerical.
  • Only LotFrontage contains missing data.
dtf = dtf.set_index("Id")dtf = dtf.rename(columns={"SalePrice":"Y"})

I believe visualization is the best tool for data analysis, but you need to know what kind of plots are more suitable for the different types of variables. Therefore, I’ll provide the code to plot the appropriate visualization for different examples.

First, let’s have a look at the univariate distributions (probability distribution of just one variable). A histogram is perfect to give a rough sense of the density of the underlying distribution of a single numerical data. I recommend using a box plot to graphically depict data groups through their quartiles. For example, let’s plot the target variable:

x = "Y"fig, ax = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False)
fig.suptitle(x, fontsize=20)
### distribution
ax[0].title.set_text('distribution')
variable = dtf[x].fillna(dtf[x].mean())
breaks = np.quantile(variable, q=np.linspace(0, 1, 11))
variable = variable[ (variable > breaks[quantile_breaks[0]]) & (variable < breaks[quantile_breaks[1]]) ]
sns.distplot(variable, hist=True, kde=True, kde_kws={"shade": True}, ax=ax[0])
des = dtf[x].describe()
ax[0].axvline(des["25%"], ls='--')
ax[0].axvline(des["mean"], ls='--')
ax[0].axvline(des["75%"], ls='--')
ax[0].grid(True)
des = round(des, 2).apply(lambda x: str(x))
box = '\n'.join(("min: "+des["min"], "25%: "+des["25%"], "mean: "+des["mean"], "75%: "+des["75%"], "max: "+des["max"]))
ax[0].text(0.95, 0.95, box, transform=ax[0].transAxes, fontsize=10, va='top', ha="right", bbox=dict(boxstyle='round', facecolor='white', alpha=1))
### boxplot
ax[1].title.set_text('outliers (log scale)')
tmp_dtf = pd.DataFrame(dtf[x])
tmp_dtf[x] = np.log(tmp_dtf[x])
tmp_dtf.boxplot(column=x, ax=ax[1])
plt.show()

The average price of a house in this population is $181k, the distribution is highly skewed and there are outliers on both sides.

Moreover, A bar plot is appropriate to understand labels frequency for a single categorical variable. Let’s take the FullBath (number of bathrooms) variable for instance: it has ordinality (2 bathrooms > 1 bathroom) but it’s not continuous (a home can’t have 1.5 bathrooms), so it can be analyzed as a categorical.

x = "Y"ax = dtf[x].value_counts().sort_values().plot(kind="barh")
totals= []
for i in ax.patches:
totals.append(i.get_width())
total = sum(totals)
for i in ax.patches:
ax.text(i.get_width()+.3, i.get_y()+.20,
str(round((i.get_width()/total)*100, 2))+'%',
fontsize=10, color='black')
ax.grid(axis="x")
plt.suptitle(x, fontsize=20)
plt.show()

Most of the houses have 1 or 2 bathrooms, there are some outliers with 0 and 3 bathrooms.

I’ll take the analysis to the next level and look into the bivariate distribution to understand if FullBath has predictive power to predict Y. This would be the case of categorical (FullBath) vs numerical (Y), therefore I shall proceed like this:

  • split the population (the whole set of observations) into 4 samples: the portion of houses with 0 bathroom (FullBath = 0), 1 bathroom (FullBath = 1), and so on …
  • Plot and compare densities of the 4 samples, if the distributions are different then the variable is predictive because the 4 groups have different patterns.
  • Group the numerical variable (Y) in bins (subsamples) and plot the composition of each bin, if the proportion of the categories is similar in all of them then the variable is not predictive.
  • Plot and compare the box plots of the 4 samples to spot different behaviors of the outliers.
cat, num = "FullBath", "Y"fig, ax = plt.subplots(nrows=1, ncols=3, sharex=False, sharey=False)
fig.suptitle(x+" vs "+y, fontsize=20)

### distribution
ax[0].title.set_text('density')
for i in dtf[cat].unique():
sns.distplot(dtf[dtf[cat]==i][num], hist=False, label=i, ax=ax[0])
ax[0].grid(True)
### stacked
ax[1].title.set_text('bins')
breaks = np.quantile(dtf[num], q=np.linspace(0,1,11))
tmp = dtf.groupby([cat, pd.cut(dtf[num], breaks, duplicates='drop')]).size().unstack().T
tmp = tmp[dtf[cat].unique()]
tmp["tot"] = tmp.sum(axis=1)
for col in tmp.drop("tot", axis=1).columns:
tmp[col] = tmp[col] / tmp["tot"]
tmp.drop("tot", axis=1).plot(kind='bar', stacked=True, ax=ax[1], legend=False, grid=True)
### boxplot
ax[2].title.set_text('outliers')
sns.catplot(x=cat, y=num, data=dtf, kind="box", ax=ax[2])
ax[2].grid(True)
plt.show()

FullBath seems predictive because the distributions of the 4 samples are very different in price levels and number of observations. It appears that the more bathrooms there are in the house the higher is the price, but I wonder whether the observations in the 0 bathroom sample and in the 3 bathrooms sample are statistically significant because they contain very few observations.

When not convinced by the “eye intuition”, you can always resort to good old statistics and run a test. In this case of categorical (FullBath) vs numerical (Y), I would use a one-way ANOVA test. Basically, it tests whether the means of two or more independent samples are significantly different, so if the p-value is small enough (<0.05) the null hypothesis of samples means equality can be rejected.

cat, num = "FullBath", "Y"model = smf.ols(num+' ~ '+cat, data=dtf).fit()
table = sm.stats.anova_lm(model)
p = table["PR(>F)"][0]
coeff, p = None, round(p, 3)
conclusion = "Correlated" if p < 0.05 else "Non-Correlated"
print("Anova F: the variables are", conclusion, "(p-value: "+str(p)+")")

We can conclude that the number of bathrooms determines a higher price of the house. That makes sense as more bathrooms mean a bigger house and the size of the house is an important price factor.

In order to check the validity of this first conclusion, I will have to analyze the behavior of the target variable with respect to GrLivArea (above ground living area in square feet). This is a case of numerical (GrLivArea) vs numerical (Y), so I’ll produce 2 plots:

  • first, I’ll group GrLivArea values into bins and compare the mean value (and median) of Y in each bin, if the curve isn’t flat then the variable is predictive because the bins have different patterns.
  • Second, I’ll use a scatter plot with the distributions of the two variables on the sides.
x, y = "GrLivArea", "Y"### bin plot
dtf_noNan = dtf[dtf[x].notnull()]
breaks = np.quantile(dtf_noNan[x], q=np.linspace(0, 1, 11))
groups = dtf_noNan.groupby([pd.cut(dtf_noNan[x], bins=breaks,
duplicates='drop')])[y].agg(['mean','median','size'])
fig, ax = plt.subplots(figsize=figsize)
fig.suptitle(x+" vs "+y, fontsize=20)
groups[["mean", "median"]].plot(kind="line", ax=ax)
groups["size"].plot(kind="bar", ax=ax, rot=45, secondary_y=True,
color="grey", alpha=0.3, grid=True)
ax.set(ylabel=y)
ax.right_ax.set_ylabel("Observazions in each bin")
plt.show()
### scatter plot
sns.jointplot(x=x, y=y, data=dtf, dropna=True, kind='reg',
height=int((figsize[0]+figsize[1])/2) )
plt.show()

GrLivArea is predictive, there is a clear pattern: on average, the larger the house the higher the price, even though there are some outliers with an above-average size and a relatively low price.

Just like before, we can test the correlation between these 2 variables. Since they are both numerical, I’d test the Pearson’s Correlation Coefficient: assuming that two variables are independent (null hypothesis), it tests whether two samples have a linear relationship. If the p-value is small enough (<0.05), the null hypothesis can be rejected and we can say that the two variables are probably dependent.

x, y = "GrLivArea", "Y"dtf_noNan = dtf[dtf[x].notnull()]
coeff, p = scipy.stats.pearsonr(dtf_noNan[x], dtf_noNan[y])
coeff, p = round(coeff, 3), round(p, 3)
conclusion = "Significant" if p < 0.05 else "Non-Significant"
print("Pearson Correlation:", coeff, conclusion, "(p-value: "+str(p)+")")

FullBath and GrLivArea are examples of predictive features, therefore I’ll keep them for modeling.

This kind of analysis should be carried on for each variable in the dataset to decide what should be kept as a potential feature and what can be dropped because not predictive (check out the link to the full code).

Feature Engineering

It’s time to create new features from raw data using domain knowledge. I will provide one example: the MSSubClass column (the building class) contains 15 categories, which is a lot and can cause a dimensionality problem during modeling. Let’s see:

sns.catplot(x="MSSubClass", y="Y", data=dtf, kind="box")

There are many categories and it’s hard to understand what’s the distribution inside each one. So I will group these categories into clusters: the classes with higher Y value (like MSSubClass 60 and 120) will go into the “max” cluster, the classes with lower prices (like MSSubClass 30, 45, 180) will go into the “min” cluster, the rest will be grouped into the “mean” cluster.

## define clusters
MSSubClass_clusters = {"min":[30,45,180], "max":[60,120], "mean":[]}
## create new columns
dic_flat = {v:k for k,lst in MSSubClass_clusters.items() for v in lst}
for k,v in MSSubClass_clusters.items():
if len(v)==0:
residual_class = k
dtf[x+"_cluster"] = dtf[x].apply(lambda x: dic_flat[x] if x in
dic_flat.keys() else residual_class)
## print
dtf[["MSSubClass","MSSubClass_cluster","Y"]].head()

In this way, I reduced the number of categories from 15 to 3, which is way better for analysis:

The new categorical feature is easier to read and keeps the pattern shown into the original data, therefore I am going to keep MSSubClass_cluster instead of the column MSSubClass.

Preprocessing

Data preprocessing is the phase of preparing raw data to make it suitable for a machine learning model. In particular:

  1. each observation must be represented by a single row, in other words, you can’t have two rows describing the same passenger because they will be processed separately by the model (the dataset is already in such form, so ✅). Moreover, each column should be a feature, so you shouldn’t use Id as a predictor, that’s why this kind of table is called “feature matrix”.
  2. The dataset must be partitioned into at least two sets: the model shall be trained on a significant portion of your dataset (so-called “train set”) and tested on a smaller set (“test set”).
  3. Missing values should be replaced with something, otherwise, your model may freak out.
  4. Categorical data must be encoded, which means converting labels into integers because machine learning expects numbers, not strings.
  5. It’s good practice to scale the data, it helps to normalize the data within a particular range and speed up the calculations in an algorithm.

Alright, let’s begin by partitioning the dataset. When splitting data into train and test sets you must follow 1 basic rule: rows in the train set shouldn’t appear in the test set as well. That’s because the model sees the target values during training and uses it to understand the phenomenon. In other words, the model already knows the right answer for the training observations and testing it on those would be like cheating.

## split data
dtf_train, dtf_test = model_selection.train_test_split(dtf,
test_size=0.3)
## print info
print("X_train shape:", dtf_train.drop("Y",axis=1).shape, "| X_test shape:", dtf_test.drop("Y",axis=1).shape)
print("y_train mean:", round(np.mean(dtf_train["Y"]),2), "| y_test mean:", round(np.mean(dtf_test["Y"]),2))
print(dtf_train.shape[1], "features:", dtf_train.drop("Y",axis=1).columns.to_list())

Next step: the LotFrontage column contains some missing data (17%) that need to be handled. From a Machine Learning perspective, it’s correct to first split into train and test and then replace NAs with the average of the training set.

dtf_train["LotFrontage"] = dtf_train["LotFrontage"].fillna(dtf_train["LotFrontage"].mean())

The new column I created MSSubClass_cluster contain categorical data that should be encoded. I shall use the One-Hot-Encoding method, transforming 1 categorical column with n unique values into n-1 dummies.

## create dummy
dummy = pd.get_dummies(dtf_train["MSSubClass_cluster"],
prefix="MSSubClass_cluster",drop_first=True)
dtf_train= pd.concat([dtf_train, dummy], axis=1)
print( dtf_train.filter(like="MSSubClass_cluster",axis=1).head() )
## drop the original categorical column
dtf = dtf_train.drop("MSSubClass_cluster", axis=1)

Last but not least, I’m going to scale the features. For regression problems, it is often desirable to transform both the input and the target variables. I shall use the RobustScaler which transforms the feature by subtracting the median and then dividing by the interquartile range (75% value — 25% value). The advantage of this scaler is that it’s less affected by outliers.

## scale X
scalerX = preprocessing.RobustScaler(quantile_range=(25.0, 75.0))
X = scaler.fit_transform(dtf_train.drop("Y", axis=1))
dtf_scaled= pd.DataFrame(X, columns=dtf_train.drop("Y",
axis=1).columns, index=dtf_train.index)
## scale Y
scalerY = preprocessing.RobustScaler(quantile_range=(25.0, 75.0))
dtf_scaled[y] = scalerY.fit_transform(
dtf_train[y].values.reshape(-1,1))
dtf_scaled.head()

Feature Selection

Feature selection is the process of selecting a subset of relevant variables to build the machine learning model. It makes the model easier to interpret and reduces overfitting (when the model adapts too much to the training data and performs badly outside the train set).

I already did a first “manual” feature selection during data analysis by excluding irrelevant columns. Now it’s going to be a bit different because we have to deal with the multicollinearity problem, which refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.

I’ll explain with an example: GarageCars is highly correlated with GarageArea because they both give the same information (how big the garage is, one in terms of how many cars can fit, the other in square feet). Let’s compute the correlation matrix to see it:

corr_matrix = dtf_train.corr(method="pearson")
sns.heatmap(corr_matrix, vmin=-1., vmax=1., annot=True, fmt='.2f', cmap="YlGnBu", cbar=True, linewidths=0.5)
plt.title("pearson correlation")

One among GarageCars and GarageArea could be unnecessary and we may decide to drop it and keep the most useful one (i.e. the one with the lowest p-value or the one that most reduces entropy).

Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. Univariate linear regression tests are widely used for testing the individual effect of each of many regressors: first, the correlation between each regressor and the target is computed, then an ANOVA F-test is performed.

RIDGE regularization is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters.

X = dtf_train.drop("Y", axis=1).values
y = dtf_train["Y"].values
feature_names = dtf_train.drop("Y", axis=1).columns
## p-value
selector = feature_selection.SelectKBest(score_func=
feature_selection.f_regression, k=10).fit(X,y)
pvalue_selected_features = feature_names[selector.get_support()]

## regularization
selector = feature_selection.SelectFromModel(estimator=
linear_model.Ridge(alpha=1.0, fit_intercept=True),
max_features=10).fit(X,y)
regularization_selected_features = feature_names[selector.get_support()]

## plot
dtf_features = pd.DataFrame({"features":feature_names})
dtf_features["p_value"] = dtf_features["features"].apply(lambda x: "p_value" if x in pvalue_selected_features else "")
dtf_features["num1"] = dtf_features["features"].apply(lambda x: 1 if x in pvalue_selected_features else 0)
dtf_features["regularization"] = dtf_features["features"].apply(lambda x: "regularization" if x in regularization_selected_features else "")
dtf_features["num2"] = dtf_features["features"].apply(lambda x: 1 if x in regularization_selected_features else 0)
dtf_features["method"] = dtf_features[["p_value","regularization"]].apply(lambda x: (x[0]+" "+x[1]).strip(), axis=1)
dtf_features["selection"] = dtf_features["num1"] + dtf_features["num2"]
dtf_features["method"] = dtf_features["method"].apply(lambda x: "both" if len(x.split()) == 2 else x)
sns.barplot(y="features", x="selection", hue="method", data=dtf_features.sort_values("selection", ascending=False), dodge=False)

The blue features are the ones selected by both ANOVA and RIDGE, the others are selected by just the first statistical method.

Alternatively, you can use ensemble methods to get feature importance. Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. I will give an example using a gradient boosting algorithm: it builds an additive model in a forward stage-wise fashion and in each stage fits a regression tree on the negative gradient of the given loss function.

X = dtf_train.drop("Y", axis=1).values
y = dtf_train["Y"].values
feature_names = dtf_train.drop("Y", axis=1).columns.tolist()
## call model
model = ensemble.GradientBoostingRegressor()
## Importance
model.fit(X,y)
importances = model.feature_importances_
## Put in a pandas dtf
dtf_importances = pd.DataFrame({"IMPORTANCE":importances,
"VARIABLE":feature_names}).sort_values("IMPORTANCE",
ascending=False)
dtf_importances['cumsum'] =
dtf_importances['IMPORTANCE'].cumsum(axis=0)
dtf_importances = dtf_importances.set_index("VARIABLE")

## Plot
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False)
fig.suptitle("Features Importance", fontsize=20)
ax[0].title.set_text('variables')
dtf_importances[["IMPORTANCE"]].sort_values(by="IMPORTANCE").plot(
kind="barh", legend=False, ax=ax[0]).grid(axis="x")
ax[0].set(ylabel="")
ax[1].title.set_text('cumulative')
dtf_importances[["cumsum"]].plot(kind="line", linewidth=4,
legend=False, ax=ax[1])
ax[1].set(xlabel="", xticks=np.arange(len(dtf_importances)),
xticklabels=dtf_importances.index)
plt.xticks(rotation=70)
plt.grid(axis='both')
plt.show()

It’s really interesting that OverallQual, GrLivArea and TotalBsmtSf dominate in all the methods presented.

Personally, I always try to use the fewest features possible, so here I select the following ones and proceed with the design, train, test, and evaluation of the machine learning model:

X_names = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', "GarageCars"]X_train = dtf_train[X_names].values
y_train = dtf_train["Y"].values
X_test = dtf_test[X_names].values
y_test = dtf_test["Y"].values

Please note that before using test data for prediction you have to preprocess it just like we did for the train data.

Model Design

Finally, it’s time to build the machine learning model. I will first run a simple linear regression and use it as a baseline for a more complex model, like the gradient boosting algorithm.

The first metric I normally use is the R squared, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variable.

I will compare the linear regression R squared with the gradient boosting’s one using k-fold cross-validation, a procedure that consists in splitting the data k times into train and validation sets and for each split, the model is trained and tested. It’s used to check how well the model is able to get trained by some data and predict unseen data.

I will visualize the results of the validation by plotting predicted values against the actual Y. Ideally, points should be all close to a diagonal line where predicted = actual.

## call model
model = linear_model.LinearRegression()
## K fold validation
scores = []
cv = model_selection.KFold(n_splits=5, shuffle=True)
fig = plt.figure()
i = 1
for train, test in cv.split(X_train, y_train):
prediction = model.fit(X_train[train],
y_train[train]).predict(X_train[test])
true = y_train[test]
score = metrics.r2_score(true, prediction)
scores.append(score)
plt.scatter(prediction, true, lw=2, alpha=0.3,
label='Fold %d (R2 = %0.2f)' % (i,score))
i = i+1
plt.plot([min(y_train),max(y_train)], [min(y_train),max(y_train)],
linestyle='--', lw=2, color='black')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('K-Fold Validation')
plt.legend()
plt.show()

The linear regression scores an average R squared of 0.77. Let’s see how the gradient boosting validation goes:

The gradient boosting model presents better performances (average R squared of 0.83), so I will use it to predict test data:

## train
model.fit(X_train, y_train)
## test
predicted = model.predict(X_test)

Remember that data were scaled, therefore in order to compare the predictions with the actual house prices in the test set they must be unscaled (with the inverse transform function):

predicted = scalerY.inverse_transform( 
predicted.reshape(-1,1) ).reshape(-1)

Evaluation

Moment of truth, we’re about to see if all this hard work is worth it. The whole point is to study how much variance of Y the model can explain and how the errors are distributed.

I’ll evaluate the model using the following common metrics: R squared, mean absolute error (MAE), and root mean squared error (RMSD). The last two are measures of error between paired observations expressing the same phenomenon. Since errors can be both positive (actual > prediction) and negative (actual < prediction), you can measure the absolute value and the squared value of each error.

## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, predicted), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-predicted)/predicted)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, predicted)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, predicted))))
## residuals
residuals = y_test - predicted
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], predicted[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

The model explains 86% of the variance of the target variable. On average, predictions have an error of $20k, or they’re wrong by 11%. The biggest error on the test set was over $170k. We can visualize the errors by plotting predicted against actuals and the residual (the error) of each prediction.

## Plot predicted vs true
fig, ax = plt.subplots(nrows=1, ncols=2)
from statsmodels.graphics.api import abline_plot
ax[0].scatter(predicted, y_test, color="black")
abline_plot(intercept=0, slope=1, color="red", ax=ax[0])
ax[0].vlines(x=max_pred, ymin=max_true, ymax=max_true-max_error, color='red', linestyle='--', alpha=0.7, label="max error")
ax[0].grid(True)
ax[0].set(xlabel="Predicted", ylabel="True", title="Predicted vs True")
ax[0].legend()

## Plot predicted vs residuals
ax[1].scatter(predicted, residuals, color="red")
ax[1].vlines(x=max_pred, ymin=0, ymax=max_error, color='black', linestyle='--', alpha=0.7, label="max error")
ax[1].grid(True)
ax[1].set(xlabel="Predicted", ylabel="Residuals", title="Predicted vs Residuals")
ax[1].hlines(y=0, xmin=np.min(predicted), xmax=np.max(predicted))
ax[1].legend()
plt.show()

There it is, the biggest error of -170k: the model predicted about 320k while the true value of that observation is about 150k. It seems that most of the errors lie between 50k and -50k, let’s have a better look at the distribution of the residuals and see if it looks approximately normal:

fig, ax = plt.subplots()
sns.distplot(residuals, color="red", hist=True, kde=True, kde_kws={"shade":True}, ax=ax)
ax.grid(True)
ax.set(yticks=[], yticklabels=[], title="Residuals distribution")
plt.show()

Explainability

You analyzed and understood the data, you trained a model and tested it, you’re even satisfied with the performance. You can go the extra mile and show that your machine learning model is not a black box.

The Lime package can help us to build an explainer. To give an illustration I will take a random observation from the test set and see what the model predicts:

print("True:", "{:,.0f}".format(y_test[1]), "--> Pred:", "{:,.0f}".format(predicted[1]))

The model predicted a price for this house of $194,870. Why? Let’s use the explainer:

explainer = lime_tabular.LimeTabularExplainer(training_data=X_train, feature_names=X_names, class_names="Y", mode="regression")
explained = explainer.explain_instance(X_test[1], model.predict, num_features=10)
explained.as_pyplot_figure()

The main factors for this particular prediction are that the house has a large basement (TotalBsmft > 1.3k), it was built with high-quality materials (OverallQual > 6), and it was built recently (YearBuilt > 2001).

The predicted against actuals plot is a great tool to show how the testing went, but I also plot the regression plane to give a visual aid of the outliers observations that the model didn’t predict correctly. Since it works better for linear models, I will use linear regression to fit bidimensional data. In order to plot the data in 2 dimensions, some dimensionality reduction is required (the process of reducing the number of features by obtaining a set of principal variables). I will give an example using the PCA algorithm to summarize the data into 2 variables obtained with linear combinations of the features.

## PCA
pca = decomposition.PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
## train 2d model
model_2d = linear_model.LinearRegression()
model_2d.fit(X_train, y_train)
## plot regression plane
from mpl_toolkits.mplot3d import Axes3D
ax = Axes3D(plt.figure())
ax.scatter(X_test[:,0], X_test[:,1], y_test, color="black")
X1 = np.array([[X_test.min(), X_test.min()], [X_test.max(),
X_test.max()]])
X2 = np.array([[X_test.min(), X_test.max()], [X_test.min(),
X_test.max()]])
Y = model_2d.predict(np.array([[X_test.min(), X_test.min(),
X_test.max(), X_test.max()],
[X_test.min(), X_test.max(), X_test.min(),
X_test.max()]]).T).reshape((2,2))
Y = scalerY.inverse_transform(Y)
ax.plot_surface(X1, X2, Y, alpha=0.5)
ax.set(zlabel="Y", title="Regression plane", xticklabels=[],
yticklabels=[])
plt.show()

Conclusion

This article has been a tutorial to demonstrate how to approach a regression use case with data science. I used the house prices dataset as an example, going through each step from data analysis to the machine learning model.

In the exploratory section, I analyzed the case of a single categorical variable, a single numerical variable, and how they interact together. I gave an example of feature engineering extracting a feature from raw data. Regarding preprocessing, I explained how to handle missing values and categorical data. I showed different ways to select the right features, how to use them to build a regression model, and how to assess the performance. In the final section, I gave some suggestions on how to improve the explainability of your machine learning model.

An important note is that I haven’t covered what happens after your model is approved for deployment. Just keep in mind that you need to build a pipeline to automatically process new data that you will get periodically.

Now that you know how to approach a data science use case, you can apply this code and method to any kind of regression problem, carry out your own analysis, build your own model and even explain it.