Building Linear Regression Models: modeling and predicting


Hi everyone! In this post I will give you some insight into regression analysis, along with an experiment on a dataset that will introduce you to the concepts of simple and multiple linear regression.

In this tutorial, we will be working with the pandas, numpy, and sklearn libraries, plus visualization libraries like matplotlib and seaborn. Since you are getting into regression algorithms, you have probably used these libraries for data analysis tasks before. If not, WHAT ARE PYTHON PACKAGES FOR DATA SCIENCE? and IMPORTING AND EXPORTING DATA IN PYTHON WITH PANDAS are short introductory articles on these libraries.

Before diving into simple and multiple linear regression, let me give you some theory on regression itself. You must have heard of correlation: on a simple data set, the relationship between two observed variables is called correlation. We examine correlation to identify the type of relationship between our variables and its strength, represented by a numerical value between -1 and 1. Regression is another method often used to describe the same relationship: it fits a straight line that best represents the relation between the two variables. This straight line is described by a simple formula, also called the regression equation:

Y=a+bX+u

Where:

Y = dependent variable (the variable that you are trying to predict)

X = independent variable (the variable that you are using to predict Y)

a = the intercept.

b = the slope.

u = the regression residual.

Here X and Y are the two variables that we are observing. The equation simply says that the value of Y depends on the value of X, together with the intercept, the slope, and a residual term. Generally, regression analysis is done for prediction: knowing the X values, you can estimate a Y value that is significantly close to the real one.
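As a minimal sketch with made-up numbers (purely for illustration, not from any real dataset), both the correlation coefficient and the fitted line from the equation above can be computed in a few lines of NumPy:

import numpy as np

# Toy observations (made-up numbers, purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Correlation: a single number between -1 and 1
print(np.corrcoef(X, Y)[0, 1])

# Fit Y = a + bX by least squares; polyfit returns [b, a] for degree 1
b, a = np.polyfit(X, Y, 1)
print('a =', a, 'b =', b)

# Use the fitted line to predict Y for a new X
print(a + b * 6.0)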

Basically, there are two main types of regression:

  1. Simple Linear regression
  2. Multiple Linear Regression

Simple Linear Regression models the relationship between two variables through a straight-line equation relating one dependent and one independent variable. The equation for simple linear regression is given by:

Y=a+bX+u

Multiple Linear Regression models the relation between three or more variables through an equation relating one dependent variable to multiple independent variables. With two independent variables the fitted model is a plane, and with more than two it becomes a higher-dimensional hyperplane. The equation for multiple linear regression is given by:

Y = a + b1X1 + b2X2 + b3X3 + … + bnXn + u

The process of building such an equation for a given dataset, so that we can predict future outcomes from a few known independent variables, is called model building.
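Before touching real data, here is a minimal sketch of solving the multiple-regression equation above for a, b1, and b2 by least squares. The numbers are made up for illustration; np.linalg.lstsq is a standard NumPy routine:

import numpy as np

# Toy data (made-up numbers): y depends on two features X1 and X2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([8.1, 6.9, 15.2, 13.8, 19.1])

# Prepend a column of ones so the first coefficient is the intercept a
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print('a  =', coef[0])  # intercept
print('b1 =', coef[1])  # slope for X1
print('b2 =', coef[2])  # slope for X2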

Let's load a relevant data set and the necessary libraries for regression analysis. We will eventually build a linear model for it, so let's import the 'Boston housing prices' data set, which is available as a sample data set in the scikit-learn library. This data set is frequently used when introducing model development with linear regression.

# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150

# loading the data
boston = load_boston()
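One caveat worth knowing: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If your version no longer ships it, the deprecation notice suggests fetching the same data from the original source, roughly like this:

import numpy as np
import pandas as pd

# Fetch the raw Boston data from the original source, as suggested
# in the scikit-learn deprecation notice for load_boston
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record is spread over two rows in the raw file; stitch them together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

The rest of this tutorial assumes the load_boston dictionary is available.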

The data set, now in the variable boston, is in dictionary form with key-value pairs. You can check those keys with the following code.

print(boston.keys())

The output will be as follows:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Before doing any analysis on the dataset, you must understand it first. The 'DESCR' key in the boston dictionary explains the details of the variables in the dataset. Let's check it out quickly.

print(boston.DESCR)

You will find these details in the output:

:Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

The dataset itself lies under the boston.data key, and we need it as a data frame for ease of use. This can be achieved with the following code.

# creating a data frame from the feature data
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# print the columns present in the dataset
print(df.columns)

# print the top 5 rows in the dataset
print(df.head())
First five records from the data set

Notice that the "MEDV" column, which is the target variable in this dataset, is missing from boston.data. This is because the target (dependent) variable is kept separate from the independent variables when building a model. For now, though, let's add the "MEDV" column to the data frame and perform correlation analysis on the whole frame.

df['MEDV'] = boston.target

Before doing any analysis, let's check whether we have any missing values.

df.isna().sum()
Number of null values in each column

So there are no null values. Checking for nulls is important because the regression will not work if the data has missing values.
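There happen to be none here, but if a dataset did contain missing values, two minimal options (shown as a hypothetical sketch, since our df has no nulls) are dropping or imputing them:

# Hypothetical handling if nulls existed: drop the affected rows...
df_clean = df.dropna()

# ...or impute them, for example with each column's mean
df_imputed = df.fillna(df.mean())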

Now comes correlation analysis, which identifies the degree of relation between the variables in our data set so that we can pick highly correlated variables for the regression analysis. Let's plot a heat map that represents the correlation between all columns in the data set.

# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
Heat map of overall data set

From the above heat map, let's look for a few independent variables that have a significant correlation with MEDV for building a linear regression. RM (average number of rooms per dwelling) has a high positive correlation with MEDV, so let's draw a regression plot to see the relationship between RM and MEDV.

sns.lmplot(x='RM', y='MEDV', data=df)
Regression plot with RM and MEDV
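If you prefer exact numbers to colors, the same information can be read off the correlation matrix numerically; a quick sketch ranking every feature by its (absolute) correlation with MEDV:

# Rank all features by absolute correlation with the target MEDV
corr_with_medv = df.corr()['MEDV'].drop('MEDV')
print(corr_with_medv.reindex(corr_with_medv.abs().sort_values(ascending=False).index))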

Now let's build a simple linear regression model to predict the price of a house based on its RM feature. The first thing to do when building a model is to identify the X and Y variables from the data set. We have already analyzed the data: the X variable will be RM, and the Y variable, the one to be predicted later on, will be MEDV.

# Initializing the variables X and y
X = df[['RM']]
y = df[['MEDV']]

The next task is to split the prepared data into a training set and a testing set. The reason is simple: we will train the regression model on the training set and test it later on the testing set. sklearn provides a train_test_split() function that does this for us, splitting the given dataset into four parts: X_train, X_test, y_train, and y_test. It takes a test_size parameter, which sets the ratio in which the data is divided, and a random_state parameter, which seeds the random number generator used to pick records at random.

Our goal is to make the model predict outcomes as close as possible to the actual values. So we need a separate set of X values to test the model: we predict Y from them and compare the predictions with the already known Y values. This is called model evaluation.

# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

Now let's train the regression model. sklearn.linear_model provides the LinearRegression() class, which seamlessly does all the mathematics of fitting the training dataset for us.

# Fitting the training data to our model
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Our model is now fitted to the training dataset. It's time to evaluate the model, as mentioned above, using the testing dataset. LinearRegression provides a score() function that takes the test sets as parameters and returns a value representing how well the model fits the testing data. This value is the coefficient of determination R², and for a reasonable model it typically falls between 0 and 1.
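For intuition, score() returns the same number as sklearn.metrics.r2_score, namely R² = 1 - SS_res / SS_tot. A quick sketch of the equivalent manual calculation, using the model we just fitted:

# R² by hand: 1 - SS_res / SS_tot, cross-checked against r2_score
from sklearn.metrics import r2_score

y_hat = regressor.predict(X_test)
ss_res = ((y_test.values - y_hat) ** 2).sum()
ss_tot = ((y_test.values - y_test.values.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)       # manual R²
print(r2_score(y_test, y_hat))   # matches regressor.score(X_test, y_test)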

# check prediction score
regressor.score(X_test, y_test)

The output is:

0.5383003344910231

Here the score is about 0.53, i.e. 53%, which is fine. Now let's see the predictions behind this score. We can get them right after fitting the model: LinearRegression provides a predict() function that takes the X_test values and returns an array of y values predicted by the regression model we built. I will show the actual and predicted values side by side in a data frame.

# predict the y values
y_pred = regressor.predict(X_test)

# a data frame with actual and predicted values of y
evaluate = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred.flatten()})
evaluate.head(10)

The output is:

MEDV values, actual vs. predicted

There are some differences between the known and predicted values, since the model only achieves a 53% score for now. Let's plot this data frame as a bar plot to visualize the differences.

evaluate.head(10).plot(kind='bar')
Bar chart illustrating differences between actual and predicted outcomes

If you are still with me: we just built a Simple Linear Regression model, since the X variable held only one feature from the dataset. If the details above made sense, multiple linear regression will be easy to understand. The only difference when building a multiple linear regression model is that the X variable holds multiple columns of variables, and the choice of those variables makes a difference to the model's accuracy.

So let's start by initializing the X and Y variables for the multiple linear regression model, just as above.

# Preparing the data
X = df[['LSTAT', 'INDUS', 'CRIM', 'NOX', 'TAX', 'PTRATIO', 'CHAS', 'ZN', 'DIS']]
y = df[['MEDV']]

The other steps, splitting the dataset, fitting, evaluating, and predicting, are done just as above.

# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

# Fitting the training data to our model
regressor.fit(X_train, y_train)

# score of this model
regressor.score(X_test, y_test)

We got the following score:

0.6252457568746427

You can see the model's score increased by roughly 9 percentage points. Let's build a data frame of actual and predicted MEDV values, as we did for the earlier model.

# predict the y values
y_pred = regressor.predict(X_test)

# a data frame with actual and predicted values of y
evaluate = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred.flatten()})
evaluate.head(10)

The output dataframe is:

MEDV values, actual vs. predicted
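Since the model now has several slopes, it can be instructive to map the fitted parameters back onto the equation Y = a + b1X1 + … + bnXn. A small sketch (coef_ and intercept_ are standard LinearRegression attributes):

# Map the fitted parameters back onto Y = a + b1X1 + ... + bnXn
coefs = pd.Series(regressor.coef_.flatten(), index=X.columns)
print('intercept (a):', regressor.intercept_)
print(coefs)  # one slope per feature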

We built this model quite fast, right? Let's make another multiple linear regression model with a different set of features in the X variable.

# Preparing the data
X = df[['LSTAT', 'INDUS', 'CRIM', 'NOX', 'TAX', 'PTRATIO']]
y = df[['MEDV']]

# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

# Fitting the training data to our model
regressor.fit(X_train, y_train)

# score of this model
regressor.score(X_test, y_test)

The score is:

0.5793234700425467

As you can see, the score dropped by about 5 percentage points when a different subset of features was taken from the overall data set. So keep in mind that many things affect model accuracy; feature choice is only one of them. As you learn to build other ML models, you will find that parameters such as random_state and test_size (used in the train-test split here) also affect the measured accuracy, as the quick sketch below shows.
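As a small experiment (a sketch, reusing the X and y defined just above), you can watch how much the score moves when only random_state changes:

# A quick check of how much the score moves when only random_state changes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

for seed in [0, 10, 42, 99]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    score = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print(f'random_state={seed}: score={score:.3f}')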

That's it for this introduction to building a linear regression model. There is a lot more to learn ahead. A few things above might not fully make sense right now, but they will as you keep learning.