Original article was published by Vagif Aliyev on Artificial Intelligence on Medium
Coding Linear Regression from Scratch
Ok, now the moment you have been waiting for; the implementation! Without further ado, let’s begin!
Note: all the code can be downloaded from this Github repo. However, I recommend you follow along with the tutorial before you do so, because then you will gain a better understanding of what you are actually coding!
First, let’s do some basic imports:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
Yep, that’s all the imports! All we are using is numpy for the mathematical implementation, matplotlib to plot graphs, and the boston dataset from scikit-learn.
Next, let’s load our data and define our features and our labels:
# Load and split data
data = load_boston()
X,y = data['data'],data['target']
Next, let’s create a custom train_test_split function to split our data into a training and test set:
# Custom train test split
def train_test_divide(X, y, test_size=0.3, random_state=42):
    np.random.seed(random_state)  # make the split repeatable
    train_size = 1 - test_size
    arr_rand = np.random.rand(X.shape[0])
    # rows whose random value falls below the train-size percentile go to train
    split = arr_rand < np.percentile(arr_rand, 100 * train_size)
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_divide(X, y, test_size=0.3, random_state=42)
Basically, we are just
- taking in a test size.
- setting a random seed to make sure our results are repeatable.
- Obtaining the train set size based on the test set size
- Picking random samples from our features
- Splitting the randomly selected instances into train and test set
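The steps above can be sketched as a quick sanity check. Here it runs on a small synthetic array (not the Boston data) so the split sizes are easy to verify:

```python
import numpy as np

def train_test_divide(X, y, test_size=0.3, random_state=42):
    # Seed so the random split is repeatable
    np.random.seed(random_state)
    train_size = 1 - test_size
    arr_rand = np.random.rand(X.shape[0])
    # Rows whose random draw falls below the train-size percentile go to train
    split = arr_rand < np.percentile(arr_rand, 100 * train_size)
    return X[split], X[~split], y[split], y[~split]

X = np.arange(200).reshape(100, 2)  # 100 toy instances, 2 features
y = np.arange(100)
X_train, X_test, y_train, y_test = train_test_divide(X, y, test_size=0.3)
print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)
```

Because the threshold is a percentile of the random draws themselves, the split sizes land exactly on the requested 70/30 proportions.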
Our cost function
We will implement MSE, or Mean Squared Error, a common cost function used for regression tasks:
def mse(preds, y):
    m = len(y)
    return 1/m * np.sum(np.square(y - preds))
- m refers to the number of training examples
- y_i refers to a single instance in our label vector
- and preds refers to the predictions we make
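As a quick worked example with made-up numbers: if the errors are 1, -1 and 2, the squared errors are 1, 1 and 4, so the MSE is 6/3 = 2:

```python
import numpy as np

def mse(preds, y):
    m = len(y)
    return 1/m * np.sum(np.square(y - preds))

y = np.array([3.0, 5.0, 7.0])
preds = np.array([2.0, 6.0, 5.0])  # errors: 1, -1, 2
print(mse(preds, y))  # 2.0
```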
In order to write clean, repeatable and efficient code, as well as adhere to good software development practices, we will create a Linear Regression class:
class LinReg:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.m = len(y)
        self.bgd = False
- The bgd boolean is a parameter that tracks whether or not Batch Gradient Descent was used.
We will now create a method to add an intercept term:
def add_intercept_term(self, X):
    X = np.insert(X, 0, np.ones(X.shape[0]), axis=1)
    return X
This basically inserts a column of ones at the beginning of our features. It is just there for matrix multiplication purposes, so that the first weight acts as the intercept.
If we did not add this, then we would be forcing the hyperplane to pass through the origin, causing it to tilt considerably and hence fit the data poorly.
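To make the intercept column concrete, here is a minimal sketch on a toy 2x2 array (not the Boston data) showing what the insertion does:

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
# Insert a column of ones at index 0 so theta_0 acts as the intercept
X_with_bias = np.insert(X, 0, np.ones(X.shape[0]), axis=1)
print(X_with_bias)
# [[1. 2. 3.]
#  [1. 4. 5.]]
```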
We will now scale our features:
def feature_scale(self, X):
    X = (X - X.mean()) / X.std()
    return X
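This is standardisation: subtract the mean and divide by the standard deviation. A one-line sketch on a toy vector shows the effect — after scaling, the mean is 0 and the standard deviation is 1:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_scaled = (X - X.mean()) / X.std()
# After standardisation: mean ~0, std ~1
print(X_scaled.mean(), X_scaled.std())
```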
Next, we will randomly initialise the weights:
def initialise_thetas(self):
    self.thetas = np.random.rand(self.X.shape[1])
We will now code the normal equation from scratch using the following formula:
def normal_equation(self):
    A = np.linalg.inv(np.dot(self.X.T, self.X))
    B = np.dot(self.X.T, self.y)
    thetas = np.dot(A, B)
    return thetas
Essentially, we split the algorithm into 3 parts:
- We get the inverse of the dot product of X transposed and X
- We get the dot product of X transposed and our labels
- We get the dot product of our two calculated values
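The three steps above can be checked end-to-end on a tiny made-up dataset where the answer is known in advance (y = 1 + 2x, so the recovered weights should be [1, 2]):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # first column is the intercept term
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 1 + 2x

A = np.linalg.inv(np.dot(X.T, X))  # inverse of X^T X
B = np.dot(X.T, y)                 # X^T y
thetas = np.dot(A, B)
print(thetas)  # [1. 2.]
```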
So, that’s the Normal Equation! Not too bad! Now, we will implement Batch Gradient Descent using the following formula:
def batch_gradient_descent(self, alpha, n_iterations):
    self.cost_history = [0] * n_iterations
    self.n_iterations = n_iterations
    for i in range(n_iterations):
        h = np.dot(self.X, self.thetas.T)
        gradient = alpha * (1/self.m) * (h - self.y).dot(self.X)
        self.thetas = self.thetas - gradient
        self.cost_history[i] = mse(np.dot(self.X, self.thetas.T), self.y)
    return self.thetas
Here, we do the following:
- We take in alpha, or the learning rate, and the number of iterations
- We create a list to store our cost functions history to later plot in a line plot
- We loop through the dataset n_iterations times,
- We obtain the predictions, and calculate the gradient (the slope of the tangent line of the function).
- We update the weights to move down the gradient
- We record the values using our custom MSE function
- Repeat, and when finished, return our results
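The loop above can be sketched outside the class on toy data (y = 2x with an intercept column of ones; the feature values and learning rate here are my own picks, not the article's). Starting from zero weights, the loop should drive the weights towards [0, 2] and the recorded cost should fall:

```python
import numpy as np

# Toy data: y = 2x, one feature plus an intercept column of ones
X = np.c_[np.ones(5), np.arange(5, dtype=float)]
y = 2 * np.arange(5, dtype=float)
m = len(y)

thetas = np.zeros(2)
alpha, n_iterations = 0.1, 1000
cost_history = [0] * n_iterations
for i in range(n_iterations):
    h = np.dot(X, thetas)                      # predictions
    gradient = alpha * (1/m) * (h - y).dot(X)  # learning-rate-scaled gradient
    thetas = thetas - gradient                 # step downhill
    cost_history[i] = 1/m * np.sum(np.square(y - h))

print(thetas)  # approaches [0, 2]
```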
Let’s define a fit function to fit our data:
def fit(self, bgd=False, alpha=0.158, n_iterations=4000):
    self.X = self.add_intercept_term(self.X)
    self.X = self.feature_scale(self.X)
    if bgd == False:
        self.thetas = self.normal_equation()
    else:
        self.bgd = True
        self.initialise_thetas()
        self.thetas = self.batch_gradient_descent(alpha, n_iterations)
Here, we just check if the user wants Gradient Descent or not, and do our steps accordingly.
Let’s Build a function to plot our cost function:
def plot_cost_function(self):
    if self.bgd == True:
        plt.plot(range(self.n_iterations), self.cost_history)
        plt.xlabel('No. of iterations')
        plt.ylabel('Cost Function')
        plt.title('Gradient Descent Cost Function Line Plot')
        plt.show()
    else:
        print('Batch Gradient Descent was not used!')
And finally a method that predicts unlabelled instances:
def predict(self, X_test):
    self.X_test = X_test.copy()
    self.X_test = self.add_intercept_term(self.X_test)
    self.X_test = self.feature_scale(self.X_test)
    predictions = np.dot(self.X_test, self.thetas.T)
    return predictions
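The prediction step itself is just a dot product between the prepared test features and the learned weights. A minimal sketch, using hypothetical weights [intercept, slope] = [1, 2] rather than weights actually fitted on the Boston data:

```python
import numpy as np

# Hypothetical learned weights: intercept 1.0, slope 2.0
thetas = np.array([1.0, 2.0])
# Two unseen instances, already with the intercept column of ones added
X_test = np.array([[1.0, 3.0],
                   [1.0, 5.0]])
predictions = np.dot(X_test, thetas.T)
print(predictions)  # [ 7. 11.]
```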
Now, let’s see which optimisation produced better results. First, let’s try Gradient Descent:
lin_reg_bgd = LinReg(X_train, y_train)
lin_reg_bgd.fit(bgd=True)

mse(lin_reg_bgd.predict(X_test), y_test)
Let’s plot our cost function and see how it decreased and whether Gradient Descent converged:
lin_reg_bgd.plot_cost_function()
So we can see that at around 1000 iterations was when it began to converge.
And now the Normal Equation:
lin_reg_normal = LinReg(X_train, y_train)
lin_reg_normal.fit()

mse(lin_reg_normal.predict(X_test), y_test)
So we can see that the Normal Equation slightly outperformed Gradient Descent. This could be because the dataset is small and we have not picked the best parameters for the learning rate.
- Try to increase the learning rate greatly. What happens?
- Don’t apply feature scaling. Is there a difference?
- Try researching and see if you can implement an even better optimisation algorithm. Evaluate your model on the test set.