Machine Learning Algorithms from Start to Finish in Python: Linear Regression

Original article was published by Vagif Aliyev on Artificial Intelligence on Medium

Coding Linear Regression from Scratch

Photo by Chris Ried on Unsplash

Ok, now the moment you have been waiting for; the implementation! Without further ado, let’s begin!

Note: all the code can be downloaded from this Github repo. However, I recommend you to follow along the tutorial before you do so, because then you will gain a better understanding of what you are actually coding!

First, let’s do some basic imports:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

Yep, that’s all the imports! All we are using is numpy for the mathematical implementation, matplotlib to plot graphs, and the boston dataset from scikit-learn.

Next, let’s load our data and define our features and our labels:

# Load and split data
data = load_boston()
X,y = data['data'],data['target']

Next, let’s create a custom train_test_split function to split our data into a training and test set:

# Custom train test split
def train_test_divide(X,y,test_size=0.3,random_state=42):
train_size = 1 - test_size
arr_rand = np.random.rand(X.shape[0])
split = arr_rand < np.percentile(arr_rand,(100*train_size))

X_train = X[split]
y_train = y[split]
X_test = X[~split]
y_test = y[~split]

return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test = train_test_divide(X,y,test_size=0.3,random_state=42)

Basically, we are just

  1. getting in a test size.
  2. setting a random seed to make sure our results and repeatable.
  3. Obtaining the train set size based on the test set size
  4. Picking random samples from our features
  5. Splitting the randomly selected instances into train and test set

Our cost function

We will implement MSE, or Mean Squared Error, a common cost function used for regression tasks:

Photo By Author
def mse(preds,y):
m = len(y)
return 1/(m) * np.sum(np.square((y - preds)))
  • M referring to the number of training examples
  • yi referring to a single instance in our label vector
  • and preds refers to our predictions made

In order to make clean, repeatable and efficient code, as well as adhere to software development practices, we will make create a Linear Regression class:

class LinReg:
def __init__(self,X,y):
self.X = X
self.y = y
self.m = len(y)
self.bgd = False
  • The bgd boolean is a parameter that defines whether or not we should use Batch Gradient Descent or not.

We will now create a method to add an intercept term:

def add_intercept_term(self,X):
X = np.insert(X,1,np.ones(X.shape[0:1]),axis=1).copy()
return X

This basically inserts a column on at the beginning of our features. It is just there for matrix multiplication purposes.

If we did not add this, then we would be forcing the hyperplane to pass through the origin, causing it to tilt considerably and hence not fitting the data properly

We will not scale our features:

Photo By Author
def feature_scale(self,X):
X = (X - X.mean()) / (X.std())
return X

Next, we will randomly initialise the weights:

def initialise_thetas(self):
self.thetas = np.random.rand(self.X.shape[1])

We will now code the normal equation from scratch using the following formula:

Photo By Author
def normal_equation(self):
A = np.linalg.inv(,self.X))
B =,self.y)
thetas =,B)
return thetas

Essentially, we split the algorithm into 3 parts:

  1. We get the inverse of the dot product of X transposed and X
  2. We get the dot product of our weights and our labels
  3. We get the dot product of our two calculated values

So, that’s the Normal Equation! Not too bad! Now, we will implement Batch Gradient Descent using the following formula:

def batch_gradient_descent(self,alpha,n_iterations):
self.cost_history = [0] * (n_iterations)
self.n_iterations = n_iterations

for i in range(n_iterations):
h =,self.thetas.T)
gradient = alpha * (1/self.m) * ((h - self.y)).dot(self.X)

self.thetas = self.thetas - gradient
self.cost_history[i] = mse(,self.thetas.T),self.y)

return self.thetas

Here, we do the following:

  1. We take in alpha, or the learning rate, and the number of iterations
  2. We create a list to store our cost functions history to later plot in a line plot
  3. We loop through the dataset n_iterations times,
  4. We obtain the predictions, and calculate the gradient(the slope of the tangential line of the function).
  5. We update the weights to move down the gradient
  6. We record the values using our custom MSE function
  7. Repeat, and when finished, return our results

Let’s define a fit function to fit our data:

def fit(self,bgd=False,alpha=0.158,n_iterations=4000):
self.X = self.add_intercept_term(self.X)
self.X = self.feature_scale(self.X)
if bgd == False:

self.thetas = self.normal_equation()
self.bgd = True
self.thetas = self.batch_gradient_descent(alpha,n_iterations)

Here, we just check if the user wants Gradient Descent or not, and do our steps accordingly.

Let’s Build a function to plot our cost function:

def plot_cost_function(self):

if self.bgd == True:
plt.xlabel('No. of iterations')
plt.ylabel('Cost Function')
plt.title('Gradient Descent Cost Function Line Plot')
print('Batch Gradient Descent was not used!')

And finally a method that predicts unlabelled instances:

def predict(self,X_test):
self.X_test = X_test.copy()
self.X_test = self.add_intercept_term(self.X_test)
self.X_test = self.feature_scale(self.X_test)
predictions =,self.thetas.T)
return predictions

Now, let’s see which optimisation produced better results. First, let’s try Gradient Descent:

lin_reg_bgd = LinReg(X_train,y_train)

Let’s plot our function and see how the cost function decreased and if Gradient Descent converged:

Photo By Author

So we can see that at around 1000 iterations was when it began to converge.

And now the Normal Equation:

lin_reg_normal = LinReg(X_train,y_train)

So we can see that the Normal Equation slightly outperformed Gradient Descent. This could be because the dataset is small and we have not picked the best parameters for the learning rate.


  1. Try to increase the learning rate greatly. What happens?
  2. Don’t apply feature scaling. Is there a difference?
  3. Try research and see if you can implement an even better optimisation algorithm. Evaluate your model on the test set