Linear Regression (Part 2)

Original article was published on Artificial Intelligence on Medium

In this part, we implement the simple linear regression algorithm from scratch in Python, without using any machine learning libraries.

The data is in sample.csv, which can be downloaded from:

https://github.com/appyavi/Simple-Linear-Regression.git

The file has two columns, SAT and GPA, and our example is built around them.

Problem Statement: Create a linear regression which predicts the GPA of a student based on their SAT score.

Before jumping into the code, let’s stop and think for a moment: why would we predict GPA from the SAT score?

Every regression you create should be meaningful. The SAT is considered one of the best estimators of intellectual capacity and capability: on average, if you did well on the test, you will do well in college and in the workplace. So it is safe to say that our regression makes sense.

Remember that the equation is Ŷ = b0 + b1X1.

Our dependent variable is GPA and our independent variable is the SAT score. We will first estimate the coefficient b1 and the constant b0.

The formulas for b0 and b1 are as follows:

b1 = covariance(x, y) / variance(x)

i.e. b1 = sum((x - x̅) * (y - ȳ)) / sum((x - x̅)**2)

b0 = mean(y) - b1 * mean(x)
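It may help to see why the ratio of covariance to variance reduces to a ratio of two plain sums. The block below is a short derivation sketch, assuming the sample covariance and sample variance are both computed with the same 1/(n - 1) factor (the index i and the count n are notation introduced here, not taken from the article):

```latex
\[
b_1
= \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
= \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\]
```

The common factor cancels, so the same b1 comes out whether the covariance and variance are divided by n or by n - 1, as long as both use the same divisor.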

Once we have all values we can begin prediction by plugging in values in our equation.

Let’s apply this to the sample.csv dataset.

1. Load the dataset.

import pandas as pd
import matplotlib.pyplot as plt
# read the CSV file into a data frame
a = pd.read_csv('sample.csv')
# summary statistics for each column
a.describe()

Here, describe() is a pandas method that returns the most useful descriptive statistics for each column of the data frame, such as the number of observations, the mean, the standard deviation and so on.

2. Define the dependent and independent variables.

x = a['SAT']  # predictor (independent variable)
y = a['GPA']  # predicted (dependent variable)

3. Plot the data to understand it better and to see whether there is a relationship to be found.

plt.scatter(x,y)
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()

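4. Estimate b0 and b1.

# deviations of each observation from the column mean
x1 = x - x.mean()
y1 = y - y.mean()
# slope: sum of cross-deviations divided by sum of squared x deviations
b1 = sum(x1 * y1) / sum(x1**2)
# intercept: makes the line pass through (mean of x, mean of y)
b0 = y.mean() - b1 * x.mean()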
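For readers who want to drop pandas as well, the same estimate can be sketched with plain Python lists. This is a minimal sketch, not the article's own code: the helper names `mean` and `coefficients` are hypothetical, and the toy numbers are made up for illustration (they are not taken from sample.csv). It also checks two properties that a least-squares fit should have: the residuals sum to roughly zero, and the line passes through the point (mean of x, mean of y).

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def coefficients(x, y):
    """Estimate (b0, b1) for yhat = b0 + b1 * x using the sums-of-deviations formulas."""
    x_mean, y_mean = mean(x), mean(y)
    # numerator: sum of products of the deviations of x and y
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    # denominator: sum of squared deviations of x
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Toy data, made up for illustration only (not from sample.csv)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

toy_b0, toy_b1 = coefficients(toy_x, toy_y)

# Sanity checks on the fitted line
residuals = [yi - (toy_b0 + toy_b1 * xi) for xi, yi in zip(toy_x, toy_y)]
residual_sum = sum(residuals)                 # should be close to 0
line_at_mean = toy_b0 + toy_b1 * mean(toy_x)  # should be close to mean(toy_y)

print(toy_b0, toy_b1, residual_sum, line_at_mean)
```

If either check fails noticeably on your own data, the most likely cause is a slip in the formula, such as subtracting the deviations instead of multiplying them.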

5. Define the regression equation.

yhat = b0 + b1 * x

6. Plot the regression line over the data.

plt.plot(x, yhat, c='red')
plt.scatter(x, y)
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()

7. Make predictions.

Once the coefficients are estimated, we can use them to make predictions: plug any value of the predictor x (the SAT score, in our example) into the equation to get the corresponding yhat.
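As a rough illustration of that last step, prediction can be wrapped in a small function. This is only a sketch: `predict_gpa` is a hypothetical helper name, and the coefficient values below are placeholders chosen to show the call, not the values estimated from sample.csv. In practice you would pass in the b0 and b1 computed in step 4.

```python
def predict_gpa(sat_score, b0, b1):
    """Return the predicted GPA for one SAT score using yhat = b0 + b1 * x."""
    return b0 + b1 * sat_score


# Placeholder coefficients for illustration only (NOT estimated from sample.csv)
b0_example = 0.5
b1_example = 0.002

# Hypothetical SAT scores for new students
new_scores = [1500, 1700, 1900]
predictions = [predict_gpa(s, b0_example, b1_example) for s in new_scores]

for score, gpa in zip(new_scores, predictions):
    print(f"SAT {score} -> predicted GPA {gpa:.2f}")
```

Keep in mind that predictions are most trustworthy for SAT scores inside the range seen in the data; the fitted line says little about scores far outside it.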