Linear Regression from Scratch with TensorFlow 2, Part 1



The intuition behind linear regression with TensorFlow.


Linear regression is one of the most basic and perhaps most commonly used machine learning algorithms, one that beginners and experts alike should know by heart. In this article I will walk you through how to implement linear regression using only TensorFlow. The discussion is divided into two articles: the first part (this article) explains the concept of linear regression, and the second part walks through how to implement it in TensorFlow.

If you’re looking for the actual implementation of this in TensorFlow, please skip ahead to the second part here.

The Concept

Linear regression attempts to model the relationship between dependent and independent variables by fitting a linear equation to the data. Suppose we have the quiz scores and lengths of study hours of 100 students.

Quiz Scores vs Study Hours of 100 students

By visually inspecting the scatter plot, we can easily draw a line with the equation y=mx+b, where m and b are the slope and y-intercept, respectively. From the figure, these look to be roughly 40 and 0 for m and b, respectively.

Let’s draw a line where m=40 and b=0.

Fitting a line with equation y=40x

The y=40x line looks good! We can then estimate a student’s score as simply 40 multiplied by the number of hours that student studied.
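The article’s actual data isn’t reproduced here, but a minimal sketch like the following generates comparable hypothetical data (the seed, noise level, and hour range are all made up for illustration) and applies the eyeballed y=40x fit:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
hours = rng.uniform(0.0, 2.5, size=100)    # study hours per student (hypothetical)
noise = rng.normal(0.0, 4.0, size=100)     # scatter around the line
scores = 40.0 * hours + noise              # hypothetical "true" relation y = 40x

predicted = 40.0 * hours                   # the eyeballed fit y = 40x, b = 0
```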

Linear regression works exactly like this, except that the algorithm cannot visually check the slope and y-intercept from a scatter plot. What it does instead is guess the slope and y-intercept first, then measure how good its guess is. If the guess is not good enough, it adjusts the slope and y-intercept until the line fits the data well.

Linear regression is a three-step algorithm:

  1. Initialize the parameters of a linear equation (first guess of slope and y-intercept).
  2. Measure the goodness of fit based on some function.
  3. Adjust the parameters until the measure in step 2 looks good.

1. Linear Equation and Initialization

Now that we’ve built our intuition around linear regression, let’s talk about each step mathematically.

The first step in a linear regression model is to initialize a linear equation. Yes, we’ll use y=mx+b, but we have to generalize our approach, because we might be facing data with multiple independent variables. Think of it as adding another variable to our quiz score data, like the amount of coffee consumed while studying. Having this coffee dimension makes the linear equation look like this: y=m1x1+m2x2+b, where m1 and m2 are the slopes for the study-hours and coffee dimensions, respectively, and x1 and x2 are the study-hours and coffee variables.

Instead of writing a longer equation for every new variable, we’ll use the dot product to represent the product of the matrices m and x. (It is also valid to use the term tensor, since a tensor is the generalization of a matrix.) Bold letters denote matrices, so the linear equation is written as y=mx+b.
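As a quick sketch of this matrix form (the feature values, slopes, and intercept below are made up for illustration), one TensorFlow call computes predictions for every sample and every feature at once:

```python
import tensorflow as tf

# Each row is one student: [study hours, cups of coffee].
x = tf.constant([[1.5, 2.0],
                 [3.0, 1.0]])     # shape (n_samples, n_features)
m = tf.constant([[40.0],
                 [-2.0]])         # one slope per feature, shape (n_features, 1)
b = tf.constant(5.0)              # y-intercept

y = tf.matmul(x, m) + b           # y = mx + b for all samples at once
```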

There are many ways to initialize the parameters of the equation; the most common are random values, zeros, or ones. You are free to use any type of initialization, though this choice can affect how quickly the learning algorithm converges. In each iteration of the algorithm, these parameters will be updated based on the function discussed in step 2.
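For example, a minimal initialization in TensorFlow 2 might look like this; the shapes and initializers shown are just one choice among many:

```python
import tensorflow as tf

# Parameters are tf.Variable objects so gradient descent can update them in place.
m = tf.Variable(tf.random.normal(shape=(1,)), name="slope")   # random initialization
b = tf.Variable(tf.zeros(shape=(1,)), name="intercept")       # zero initialization
```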

2. Loss Function

Now let’s say the initial values you set for m and b are both one, so your equation is y=1x+1. The initial predictions will look like the orange dots in the figure below. It’s obviously a very bad prediction; we need a number to quantify how bad or good these predictions are.

Initial Prediction

There are many ways to measure the goodness of our predictions; we’ll use one of them, called mean squared error (MSE). In this context, error means difference, so MSE literally means taking the square of the difference between the actual and predicted values, then taking the average. It is written mathematically as

MSE = (1/n) Σ (y_i − ŷ_i)², where n is the number of data points, y_i is the actual value, and ŷ_i is the predicted value.

Functions like MSE are called loss functions or objective functions. These are the functions that the algorithm wants to minimize. If our linear regression model perfectly predicted the quiz scores, its MSE would be 0. So in every iteration, the algorithm should update the parameters such that the MSE comes closer to 0 without overfitting. Overfitting is an entire topic in itself, but essentially it means we don’t want our learning algorithm to be so good on the training data that it fails miserably on the test set.
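In TensorFlow, MSE is a one-liner with tf.reduce_mean and tf.square. A minimal sketch, using hypothetical data and the bad initial guess y=1x+1 from above:

```python
import tensorflow as tf

def mse(y_actual, y_predicted):
    # Average of the squared differences between actual and predicted values.
    return tf.reduce_mean(tf.square(y_actual - y_predicted))

# Hypothetical study hours and quiz scores:
x = tf.constant([0.5, 2.0, 1.0])
y_actual = tf.constant([20.0, 80.0, 40.0])
loss = mse(y_actual, 1.0 * x + 1.0)   # large, since the guess y = 1x + 1 is far off
```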

3. Gradient Descent

Sure, we could keep guessing the parameters until we get close enough to zero MSE, but that would take time and effort; gradient descent will do this for us. If you’re not familiar with the term, there are tons of articles and videos explaining the concept.

Gradient descent is one of the building blocks of artificial intelligence. It is the learning in machine learning. Algorithms like gradient descent allow models to learn without being explicitly told what to do. (You’ll need to brush up on your calculus to understand how gradient descent works.)

Gradient descent is an optimization algorithm that we will use to minimize our loss function (MSE in this case). It does so by updating the parameters with a small change in each iteration; this change can also be big, depending on the learning rate you choose.

In each new iteration, a parameter is updated as p_new = p_old − l·(dL/dp), where p is the parameter, which could be the slope, m, or the y-intercept, b. The new variables, l and dL/dp, are the learning rate and the partial derivative of the loss function with respect to the parameter.

With enough iterations, the slope and y-intercept will get closer to 40 and 0, the values we consider “close enough” to fit our data. As you may have observed, if you happen to initialize the parameters close to 40 and 0, say 35 and 0.5, the algorithm will take fewer iterations.
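Here is a minimal sketch of one gradient descent step in TensorFlow 2, using tf.GradientTape to obtain the partial derivatives in the update rule above (the learning rate and initial values are illustrative, not prescriptive):

```python
import tensorflow as tf

m = tf.Variable(1.0)      # initial guesses, as in the y = 1x + 1 example
b = tf.Variable(1.0)
learning_rate = 0.01      # l in the update rule

def train_step(x, y_actual):
    with tf.GradientTape() as tape:
        y_predicted = m * x + b
        loss = tf.reduce_mean(tf.square(y_actual - y_predicted))   # MSE
    dm, db = tape.gradient(loss, [m, b])   # dL/dm and dL/db
    m.assign_sub(learning_rate * dm)       # m_new = m_old - l * dL/dm
    b.assign_sub(learning_rate * db)       # b_new = b_old - l * dL/db
    return loss
```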

This article is very helpful if you want to dig deeper into the math of gradient descent.

Terminating the algorithm

Here are some of the possible ways to terminate the algorithm:

  1. Terminate the algorithm once a specified number of iterations is reached.
  2. Terminate the algorithm once the MSE falls below a specified threshold.
  3. Terminate the algorithm if the MSE does not improve from one iteration to the next. You can specify a precision such as 0.001, so that if the difference between two successive MSEs is less than this precision, the algorithm stops (see the sketch after this list).
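As a sketch, criteria 1 and 3 could be combined like this, reusing the hypothetical train_step function and data x, y_actual from the gradient descent sketch above:

```python
precision = 0.001          # stop when successive MSEs differ by less than this
max_iterations = 10_000    # criterion 1: a hard cap on iterations
previous_loss = float("inf")

for _ in range(max_iterations):
    loss = float(train_step(x, y_actual))
    if abs(previous_loss - loss) < precision:   # criterion 3
        break
    previous_loss = loss
```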

Moving Forward

Now that we’ve built our intuition on linear regression, let’s go ahead and implement it in TensorFlow in the second part of this article; read it here.