Original article was published on Deep Learning on Medium
You may be wondering why we are getting into statistics when we were supposed to discuss the Linear Regression model in Machine Learning. Let me set out the relationship. Machine Learning models are built to predict output (target) variables after learning from a large set of similar data. The model then tries to predict the result as accurately as possible and, where an exact prediction is not feasible, to at least minimize the error in its response. This is exactly what Linear Regression does, as we discussed earlier. In a nutshell, Machine Learning has borrowed the Linear Regression model from statistics.
Python is one of the most widely used programming languages in machine learning, and PyTorch is the library of interest here: we will use it for the numerical computation behind the model and its predictions.
Building a machine learning model is primarily a five-step process.
- Data Preparation: The training data set is prepared and then fed to the algorithm.
- Creating Learnable Parameters: The parameters the model will learn, its weights and biases, are created.
- Model itself: The model produces the output for the input data by applying a linear rule, multiplying the inputs by the weights and adding a bias term: y = Wx + b (we will explore this in a moment).
- Loss: This step tells us how good our model is.
- Optimizer: In this step, we adjust the initial weights so that the model calculates the target value more accurately; in essence, fine tuning.
Let us try to understand this through a real-world problem statement using fictitious data. We will create a model that predicts the yield of Blueberries and Blackberries from weather information such as the temperature, rainfall and humidity of a region. Here is our training data set.
In Linear Regression, each target variable is estimated as a weighted sum of one or more input parameters plus a bias value (a constant offset). Remember the formula: y = W * x (weighted sum of the parameters) + b (bias).
Since we have two target variables (Blackberries and Blueberries), we have to calculate this for each of them, once for Blueberries and once for Blackberries. It should look like this:
Blueberries = (w11 * temp + w12 * rainfall + w13 * humidity ) + b1
Blackberries = (w21 * temp + w22 * rainfall + w23 * humidity ) + b2
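As a small sketch of the two equations above, the snippet below evaluates them term by term and then as a single matrix-vector product y = Wx + b. The weights, biases and weather readings are hypothetical numbers chosen only to make the arithmetic easy to follow; they are not taken from the training data.

```python
import numpy as np

# Hypothetical weights, biases and weather readings (illustration only).
W = np.array([[0.3, 0.2, 0.1],    # w11, w12, w13 (Blueberries)
              [0.4, 0.1, 0.5]])   # w21, w22, w23 (Blackberries)
b = np.array([1.0, 2.0])          # b1, b2
temp, rainfall, humidity = 70.0, 60.0, 40.0

# The two equations written out term by term.
blueberries = W[0, 0] * temp + W[0, 1] * rainfall + W[0, 2] * humidity + b[0]
blackberries = W[1, 0] * temp + W[1, 1] * rainfall + W[1, 2] * humidity + b[1]

# The same computation as one matrix-vector product: y = W x + b.
x = np.array([temp, rainfall, humidity])
y = W @ x + b

print(blueberries, blackberries)
print(y)
```

Both forms give the same pair of numbers, which is why the rest of the walkthrough can work with matrices instead of writing out each equation.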
The learning part of linear regression is to figure out a set of weights (w11, w12 through w23) and biases (b1 and b2) by looking at the training data, so that the model makes accurate predictions for new data (i.e. predicts the yields of Blackberries and Blueberries in a new region from its average temperature, rainfall and humidity). This is done by adjusting the weights slightly, many times over, to make better predictions, using an optimization technique called gradient descent (step number 5 above).
Let us get into the code now by importing Torch and NumPy. We will be using a Jupyter notebook for this demonstration.
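The import cell would look roughly like this. Printing the versions is an optional addition that only confirms both libraries are installed.

```python
import numpy as np
import torch

# Optional sanity check: show which versions are installed.
print(np.__version__, torch.__version__)
```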
Now we have to start capturing the training data set. In our case, the inputs are the temperature, rainfall and humidity values. We will store these values in a NumPy array. Here we go:
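A minimal sketch of the input array follows. The numbers are placeholders for illustration only, not the rows of the training table, and should be replaced with the real values; the name `inputs` is likewise an assumption.

```python
import numpy as np

# Placeholder values (illustration only); one row per region.
# Columns: temperature, rainfall, humidity.
inputs = np.array([[73, 67, 43],
                   [91, 88, 64],
                   [87, 134, 58],
                   [102, 43, 37],
                   [69, 96, 70]], dtype='float32')

print(inputs.shape)
```

Using `float32` matches the default floating-point type of PyTorch tensors, which avoids type mismatches later on.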
Now let us create a NumPy array of targets for the given set of inputs. The values come from the target variables in the training set mentioned above.
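The target array can be sketched the same way. Again, these numbers are placeholders for illustration and do not reproduce the training table; each row must line up with the corresponding row of inputs.

```python
import numpy as np

# Placeholder values (illustration only); one row per region.
# Columns: Blueberries yield, Blackberries yield.
targets = np.array([[56, 70],
                    [81, 101],
                    [119, 133],
                    [22, 37],
                    [103, 119]], dtype='float32')

print(targets.shape)
```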
We will have to convert them into tensors using PyTorch’s from_numpy function. You may be wondering why we are doing this conversion when we could have created the tensors directly. We could very well do so. But in most real-world problems you would be reading the data from Excel, a database, a CSV file or some other source, so reading the training data into NumPy first makes sense. PyTorch tensors can also run on GPUs (Graphics Processing Units), while NumPy arrays work on the CPU (Central Processing Unit). For large workloads the GPU can be much faster; gains of 50–100 times are sometimes quoted, though the actual figure depends on the hardware and the workload.
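The conversion might be sketched as follows. The array values are placeholders for illustration, and the optional move to a GPU at the end is an addition that is not part of the original walkthrough.

```python
import numpy as np
import torch

# Placeholder NumPy data (illustration only).
inputs_np = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                      [102, 43, 37], [69, 96, 70]], dtype='float32')
targets_np = np.array([[56, 70], [81, 101], [119, 133],
                       [22, 37], [103, 119]], dtype='float32')

# from_numpy keeps the dtype and shape of the source array.
inputs = torch.from_numpy(inputs_np)
targets = torch.from_numpy(targets_np)
print(inputs.dtype, inputs.shape, targets.shape)

# Optional: use a GPU when one is available, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs, targets = inputs.to(device), targets.to(device)
```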
The weights and biases can be created as matrices of random values, because we do not know which initial values to use. Please pay attention to the shape (size) of the weight tensor, [2,3]. We have two target variables (Blackberries and Blueberries) in our training data set, so we need two rows, one for each fruit; and each target variable depends on three parameters (temperature, rainfall and humidity), so we need three columns. Hence the shape [2,3]. The bias tensor holds one value per target (b1 and b2), so it has two elements.
The next question is why we have added requires_grad = True. This tells PyTorch to record the operations performed on these tensors so that it can later compute the gradient of the loss with respect to them. torch.randn() is the PyTorch function that creates a tensor of the given shape ([2,3] in our case) filled with random numbers drawn from a normal distribution with a mean of 0 and a standard deviation of 1.
So it is time to create a model. Our model is a function that performs a matrix multiplication of the input tensor (x) and the weights (w), and then adds the bias (b) to the product. Remember our formula y = W * x + b.
Before we begin, let us first display the tensors w and b and see what we have on our plate.
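Creating and displaying the learnable parameters could look like this. The fixed seed is an optional addition so that repeated runs print the same random values.

```python
import torch

torch.manual_seed(0)  # optional, for repeatable random values

# One row of weights per fruit, one column per input (temp, rainfall, humidity).
w = torch.randn(2, 3, requires_grad=True)
# One bias per fruit.
b = torch.randn(2, requires_grad=True)

print(w)
print(b)
```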
Let us now define the Linear Regression function based on our universal formula.
Please pay attention to the ‘@’ symbol. It represents matrix multiplication in PyTorch, and torch.transpose() returns the transpose of a matrix (w in our case) by swapping the given dimensions dim0 and dim1. The transpose is needed because the inputs have shape [5,3] and w has shape [2,3]; transposing w to [3,2] makes the product well defined. Now let’s generate the predictions for the given set of inputs and weights for the target variables (Blueberries and Blackberries in our case).
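A minimal sketch of the model function and the first round of predictions is shown below. The input values are placeholders for illustration, and the function name `model` is an assumption about how the original cell was written.

```python
import numpy as np
import torch

# Placeholder inputs (illustration only): temperature, rainfall, humidity.
inputs = torch.from_numpy(np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                                    [102, 43, 37], [69, 96, 70]], dtype='float32'))

w = torch.randn(2, 3, requires_grad=True)  # one row of weights per fruit
b = torch.randn(2, requires_grad=True)     # one bias per fruit

def model(x):
    # x is [5, 3] and the transposed w is [3, 2], so the product is [5, 2];
    # b is broadcast across the five rows.
    return x @ torch.transpose(w, 0, 1) + b

preds = model(inputs)
print(preds)
```

Because w and b start out random, the printed predictions will differ from run to run and will generally be far from the targets.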
As you can imagine, the predictions from our randomly initialized parameters are nowhere close to the actual values. Our model predicted values of -77 and 65 where, if you recollect, the actual targets are 59 and 74. Our model is doing pretty badly. Let us see what we have in our target tensor.
As we started with random values, our model and its learnable parameters, w and b, have produced predictions nowhere close to the actual values. So we need to define a function that tells the model how close its predictions are to the actual values. Since this is a regression problem, we will use a loss function called mean squared error (MSE). We take the difference between the predicted and actual target values, square it, and average the squares over all elements. The torch.nn library provides built-in loss functions such as MSELoss and CrossEntropyLoss. However, for the sake of understanding, let us implement the loss function ourselves.
PyTorch’s torch.sum function returns the sum of all the elements in a tensor, and the .numel() method returns the number of elements in a tensor. Let’s compute the mean squared error for the current predictions of our model.
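A hand-written mean squared error along these lines might look as follows. The function name `mse` and the tiny test tensors are illustrative choices, picked so the result can be checked by hand.

```python
import torch

def mse(predictions, targets):
    # Element-wise difference, squared, summed, then divided by the element count.
    diff = predictions - targets
    return torch.sum(diff * diff) / diff.numel()

# Tiny hand-checkable example (values are illustrative only).
p = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
t = torch.tensor([[1.0, 0.0], [0.0, 4.0]])

loss = mse(p, t)  # (0 + 4 + 9 + 0) / 4 = 3.25
print(loss)
```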
So what does this number mean? On average, the squared difference between a predicted element and its actual value is ~20054. Because the loss is a mean of squared errors, this number is the square of the typical error. If we take the square root of ~20054, we get approximately 141.6. That means the predictions typically differ from the actual values by about 141, which is pretty bad; the lower the loss, the better the model.
PyTorch allows us to automatically compute the gradient (derivative) of the loss with respect to the weights and biases, because they have requires_grad set to True. The .backward() method does this for us, and the gradients are stored in the .grad property of each tensor. Let’s now calculate and display the gradients of the weights; they should have the same shape as the weights, [2,3].
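The gradient computation can be sketched end to end as below. The data values are placeholders for illustration, and `model` and `mse` are assumed helper names.

```python
import numpy as np
import torch

# Placeholder data (illustration only).
inputs = torch.from_numpy(np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                                    [102, 43, 37], [69, 96, 70]], dtype='float32'))
targets = torch.from_numpy(np.array([[56, 70], [81, 101], [119, 133],
                                     [22, 37], [103, 119]], dtype='float32'))

w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(predictions, actual):
    diff = predictions - actual
    return torch.sum(diff * diff) / diff.numel()

loss = mse(model(inputs), targets)
loss.backward()   # fills w.grad and b.grad

print(w.grad)     # same shape as w: [2, 3]
print(b.grad)     # same shape as b: [2]
```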
Two very important points to remember here:
If a gradient element is positive:
1. Increasing the element’s value slightly will increase the loss.
2. Decreasing the element’s value slightly will decrease the loss.
If a gradient element is negative:
1. Increasing the element’s value slightly will decrease the loss.
2. Decreasing the element’s value slightly will increase the loss.
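These rules can be checked on a deliberately tiny example. The single-parameter loss w² below is a hypothetical stand-in, not the regression loss; at w = 3 its gradient is positive, so nudging w up should raise the loss and nudging it down should lower it.

```python
import torch

# Hypothetical one-parameter "loss" used only to illustrate the sign rule.
w = torch.tensor(3.0, requires_grad=True)
loss = w ** 2
loss.backward()
grad = w.grad.item()   # d(w^2)/dw = 2w = 6.0, a positive gradient

step = 0.01
loss_if_increased = (w.item() + step) ** 2
loss_if_decreased = (w.item() - step) ** 2

print(grad, loss.item(), loss_if_increased, loss_if_decreased)
```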
We now know that the model is not giving us good predictions, so we will have to change the element values according to the rules above. Before we do, note that PyTorch accumulates gradients: every call to .backward() adds the newly computed gradients to whatever is already stored in the .grad property. Since the gradients above were computed only for inspection, we instruct PyTorch to reset them to zero so that fresh values can be calculated.
Let us check whether PyTorch did what we expected, i.e. reset the values to zero.
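The sketch below shows why the reset matters and how it is done with the in-place `zero_()` method. The loss here is a simple stand-in (sum of squares of the parameters), used only so the example stays short.

```python
import torch

w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# First backward pass on a stand-in loss.
loss = (w * w).sum() + (b * b).sum()
loss.backward()
first = w.grad.clone()

# Second backward pass without a reset: new gradients are added to the old ones.
loss = (w * w).sum() + (b * b).sum()
loss.backward()
accumulated = w.grad.clone()   # twice the first gradient

# Reset the stored gradients in place.
w.grad.zero_()
b.grad.zero_()

print(w.grad)
print(b.grad)
```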
And here we go, it did just as we expected. So we are ready for the next cycle: adjusting the weights and biases using gradient descent to minimize the loss. Basically, we are optimizing the model as discussed earlier (step number 5) and then starting the cycle again. We repeat this many times, until further adjustments no longer reduce the loss (that is, more training makes no change to the loss, or makes it worse). Let’s take a look at one such cycle for ease of understanding.
Before that, let us check what our loss function has to say about this model.
These are the same steps we performed in cell number 12. In a real-world program it is not necessary to calculate this again, but for demonstration purposes we recalculate the loss and compute the gradients.
Now, as planned earlier, let us adjust the element values slightly.
torch.no_grad() tells PyTorch not to track the operations inside it for gradient computation, since updating the weights and biases is not part of the model itself. Also note that we subtract from each element its gradient multiplied by a small number (10^-5, written as 1e-5), so that the weights change by a really small amount at each step; this ensures that the results are not adversely affected and improve slowly but steadily. This number is called the learning rate of the algorithm. Let us check the adjusted weights and biases (after modification).
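One full adjustment cycle might be sketched as follows. The data values are placeholders for illustration and `model` and `mse` are assumed helper names; the learning rate of 1e-5 is the one quoted in the text.

```python
import numpy as np
import torch

# Placeholder data (illustration only).
inputs = torch.from_numpy(np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                                    [102, 43, 37], [69, 96, 70]], dtype='float32'))
targets = torch.from_numpy(np.array([[56, 70], [81, 101], [119, 133],
                                     [22, 37], [103, 119]], dtype='float32'))

torch.manual_seed(0)  # optional, for repeatable random values
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(predictions, actual):
    diff = predictions - actual
    return torch.sum(diff * diff) / diff.numel()

loss_before = mse(model(inputs), targets)
loss_before.backward()

with torch.no_grad():
    # Step against the gradient, scaled by the learning rate.
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    # Reset the gradients so the next backward pass starts fresh.
    w.grad.zero_()
    b.grad.zero_()

loss_after = mse(model(inputs), targets)
print(loss_before.item(), loss_after.item())
```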
As you can see, the weights have changed, and we hope that with the new weights and biases the loss will reduce a bit. Let’s check that as well.
The loss has reduced drastically: in the initial run it was close to ~20054, and it has come down to ~10370, which indicates that the logic of adjusting the weights and biases is working as expected. Let us now run these steps, say, 1000 times to see whether the loss is reduced further, if not minimized.
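Repeating the cycle 1000 times could be sketched as a simple loop. As before, the data values are placeholders for illustration and the helper names are assumptions, so the printed losses will not match the figures quoted in the text.

```python
import numpy as np
import torch

# Placeholder data (illustration only).
inputs = torch.from_numpy(np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                                    [102, 43, 37], [69, 96, 70]], dtype='float32'))
targets = torch.from_numpy(np.array([[56, 70], [81, 101], [119, 133],
                                     [22, 37], [103, 119]], dtype='float32'))

torch.manual_seed(0)  # optional, for repeatable random values
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(predictions, actual):
    diff = predictions - actual
    return torch.sum(diff * diff) / diff.numel()

initial_loss = mse(model(inputs), targets).item()

for _ in range(1000):
    loss = mse(model(inputs), targets)   # predict and measure the error
    loss.backward()                      # compute gradients
    with torch.no_grad():
        w -= w.grad * 1e-5               # adjust weights
        b -= b.grad * 1e-5               # adjust biases
        w.grad.zero_()                   # reset gradients for the next cycle
        b.grad.zero_()

final_loss = mse(model(inputs), targets).item()
print(initial_loss, final_loss)
```

In practice the number of repetitions and the learning rate are tuned together: a larger learning rate needs fewer steps but risks overshooting, while a smaller one is safer but slower.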
Now that we have trained our model, let’s calculate the loss after adjusting the weights and biases 1000 times. In theory, the loss should have reduced.
As you can see, the loss (~8) is drastically lower than our first loss calculation (~20054). Let’s now check the predictions and how good or bad they are with respect to the target values (Blueberries and Blackberries). Since the loss has reduced, the predictions should in theory be close to the actual values listed above as the target elements.
As you can see, the predicted elements are quite close to the target elements, which indicates that the model works well on the training data. This is nothing but Linear Regression implemented using PyTorch.