Original article was published on Deep Learning on Medium

# Chapter 4.2 — Linear Regression using PyTorch Built-ins

In the last blog, Chapter 4.1, we discussed in detail some commonly used built-in PyTorch packages and some basic concepts we will be using to build our linear regression model. In this blog, we will build our model using those PyTorch built-ins.

In this blog, we’re going to use information like a person’s age, sex, BMI, number of children and smoking habits to predict insurance costs. This kind of model is useful for insurance companies to determine the yearly insurance premium for a person. The dataset for this problem is taken from https://www.kaggle.com/mirichoi0218/insurance.

We will create a model with the following steps:

- Download and explore the dataset
- Prepare the dataset for training
- Create a linear regression model
- Train the model to fit the data
- Make predictions using the trained model

We start by importing the required packages. Most of the packages used here were discussed in the previous blog.
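For reference, a typical import cell for this notebook might look like the following (these are the built-ins covered in Chapter 4.1):

```python
# Typical imports for this chapter (all discussed in Chapter 4.1).
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, random_split
```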

**Step 1 :- Download and explore the data**

For this blog, we will be using the Kaggle platform to build our model, since we can load the dataset directly from Kaggle.

To load the dataset into memory, we’ll use the read_csv function from the pandas library. The data will be loaded as a Pandas dataframe.

We could print the first five lines of the dataset using the head function in Pandas.
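In the notebook this is a single read_csv call; the sketch below inlines a few illustrative sample rows via io.StringIO so it runs without the Kaggle file (the filename insurance.csv is an assumption):

```python
import io
import pandas as pd

# In the notebook the file comes straight from Kaggle
# (the filename is an assumption):
# dataframe_raw = pd.read_csv("insurance.csv")

# Illustrative sample rows with the dataset's columns, so the
# snippet runs stand-alone:
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""
dataframe_raw = pd.read_csv(io.StringIO(csv_text))
print(dataframe_raw.head())  # first five rows of the dataframe
```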

We are going to apply a slight customization to the dataset so that every reader gets a slightly different dataset. This step is not mandatory.

The customize_dataset function will customize the dataset slightly using your name as a source of random numbers.

Now let’s call the customize function, pass dataset and your_name as arguments, and check out the first few lines of our dataset using the head function.
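The exact customization used in the course is not reproduced here; the sketch below is one plausible implementation in the same spirit (a hypothetical version that derives a deterministic seed from your name, drops a few rows and rescales the charges), applied to illustrative sample rows:

```python
import io
import pandas as pd

# Illustrative sample rows standing in for the Kaggle dataset.
dataframe_raw = pd.read_csv(io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
    "33,male,22.705,0,no,northwest,21984.47061\n"
    "32,male,28.88,0,no,northwest,3866.8552\n"
))

def customize_dataset(dataframe_raw, your_name):
    # Illustrative sketch: derive a deterministic seed from the name,
    # drop a few rows and rescale the charges column slightly.
    seed = sum(ord(c) for c in your_name)
    dataframe = dataframe_raw.sample(frac=0.9, random_state=seed).copy()
    dataframe.charges = dataframe.charges * (0.95 + (seed % 10) / 100)
    return dataframe.reset_index(drop=True)

your_name = "yourname"  # replace with your own name
dataframe = customize_dataset(dataframe_raw, your_name)
print(dataframe.head())
```

Because the seed comes from your name, two readers with different names end up with slightly different data, which is the whole point of this step.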

Now let’s find out the number of rows and columns in our dataset.

Now we should assign the input, output and categorical columns (the input columns that are non-numerical).

We can find the minimum, maximum and mean values of the output column “charges”. We can also plot the distribution of charges in a graph. For reference, take a look at https://jovian.ml/aakashns/dataviz-cheatsheet.
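The column assignments and basic statistics might look like this (again using illustrative sample rows as a stand-in for the customized dataset):

```python
import io
import pandas as pd

# Illustrative sample rows standing in for the customized dataset.
dataframe = pd.read_csv(io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
    "33,male,22.705,0,no,northwest,21984.47061\n"
    "32,male,28.88,0,no,northwest,3866.8552\n"
))

input_cols = ["age", "sex", "bmi", "children", "smoker", "region"]
categorical_cols = ["sex", "smoker", "region"]  # non-numerical input columns
output_cols = ["charges"]

num_rows, num_cols = dataframe.shape
print("rows:", num_rows, "columns:", num_cols)

# Basic statistics of the output column.
print("min:", dataframe.charges.min())
print("max:", dataframe.charges.max())
print("mean:", dataframe.charges.mean())
```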

**Step 2 :- Prepare dataset for training**

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. To do this, the first step is to convert the categorical columns to numbers, and the second step is to convert the dataframe to NumPy arrays.

Read through the Pandas documentation to understand how we’re converting categorical variables into numbers.

The next step is to convert these NumPy arrays to PyTorch tensors.
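A sketch of this conversion, assuming the column lists from Step 1 (categoricals are turned into integer codes via pandas category dtype, then everything is extracted as float32 arrays and wrapped in tensors):

```python
import io
import numpy as np
import pandas as pd
import torch

# Illustrative sample rows standing in for the customized dataset.
dataframe = pd.read_csv(io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
    "33,male,22.705,0,no,northwest,21984.47061\n"
    "32,male,28.88,0,no,northwest,3866.8552\n"
))

input_cols = ["age", "sex", "bmi", "children", "smoker", "region"]
categorical_cols = ["sex", "smoker", "region"]
output_cols = ["charges"]

def dataframe_to_arrays(dataframe):
    df = dataframe.copy(deep=True)
    # Step 1: turn the categorical columns into integer codes.
    for col in categorical_cols:
        df[col] = df[col].astype("category").cat.codes
    # Step 2: extract NumPy arrays for inputs and targets.
    inputs_array = df[input_cols].to_numpy(dtype=np.float32)
    targets_array = df[output_cols].to_numpy(dtype=np.float32)
    return inputs_array, targets_array

inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs = torch.from_numpy(inputs_array)
targets = torch.from_numpy(targets_array)
print(inputs.shape, targets.shape)
```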

We use the inputs and targets tensors to create PyTorch datasets and data loaders for training and validation. We’ll start by creating a TensorDataset, then use the random_split function to split it into training and validation datasets. Finally, we pick a batch size and create the data loaders for training and validation.

Before we move on to the next step, let’s take a look at a batch of data.
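The steps above can be sketched as follows (random stand-in tensors replace the real insurance data so the snippet runs on its own; the split sizes and batch size are illustrative choices):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in tensors shaped like the insurance data
# (6 input features, 1 target per row).
inputs = torch.randn(100, 6)
targets = torch.randn(100, 1)

dataset = TensorDataset(inputs, targets)

# Split off a validation set.
val_size = 20
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

# Pick a batch size and create the data loaders.
batch_size = 32
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

# Take a look at one batch of data.
for xb, yb in train_loader:
    print("inputs:", xb.shape, "targets:", yb.shape)
    break
```

Shuffling is enabled only on the training loader; validation batches can stay in a fixed order since no gradient steps are taken on them.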

**Step 3 :- Create a Linear Regression model**

Let’s create the class definition of our InsuranceModel. Before creating the class, we must find the sizes of the input and output columns.

Let’s discuss what each function does. The training_step function receives a batch of data. From the batch we get the inputs and call self(inputs); here self is the model itself, and calling it invokes the forward function with the inputs as arguments, because nn.Module implements a __call__ method that simply passes the inputs on to forward.

The validation_step function follows the same methodology as the training_step function; in addition, we also calculate and return the validation loss.

After we run through all the batches, we get back a list of outputs. We extract the loss of each batch and average the batch losses to get the validation loss for the epoch. This is done by the validation_epoch_end function.

After each epoch we log the epoch number and validation loss for that epoch. We display the epoch number and the validation loss for every 20th epoch and the final epoch. This is done by the epoch_end function.

Let’s create a model using the InsuranceModel() class and check out its weights and biases using model.parameters().
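The class described above might look like the sketch below, in the course’s style; input_size and output_size come from the column counts (6 input features, 1 output), and L1 loss (mean absolute error) is one reasonable choice for this regression problem:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_size, output_size = 6, 1  # from the input/output column counts

class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)  # single linear layer

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)              # calls forward via nn.Module's __call__
        loss = F.l1_loss(out, targets)  # mean absolute error
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        loss = F.l1_loss(out, targets)
        return {"val_loss": loss.detach()}

    def validation_epoch_end(self, outputs):
        batch_losses = [x["val_loss"] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()  # average the batch losses
        return {"val_loss": epoch_loss.item()}

    def epoch_end(self, epoch, result, num_epochs):
        # Log every 20th epoch and the final epoch.
        if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch + 1, result["val_loss"]))

model = InsuranceModel()
print(list(model.parameters()))  # one 1x6 weight matrix and one bias
```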

**Step 4 :- Train the Model**

To train our model we will use the fit method.

The evaluate function takes two arguments: the model and val_loader. It iterates over the batches in the validation dataset, calling the validation_step function for each batch. Each call returns the validation loss as a dictionary, and these are combined into a list, which is the output.

After this we call validation_epoch_end(outputs), which returns a dictionary containing the average loss over the validation dataset.

The fit function creates an optimizer, which by default is torch.optim.SGD, performing gradient descent on the model parameters (weights and biases). It also takes the number of epochs, learning rate, model, train_loader and val_loader as inputs. We declare history = [] to store the validation loss for each epoch.

Call the evaluate function to calculate the loss on the validation set before training.
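Taken together, evaluate and fit might look like the sketch below. TinyModel is a hypothetical minimal stand-in with the step methods described in Step 3, included only so the snippet runs on its own with random stand-in data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

class TinyModel(nn.Module):
    # Minimal stand-in with the step methods described in Step 3.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(6, 1)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        xb, yb = batch
        return F.l1_loss(self(xb), yb)

    def validation_step(self, batch):
        xb, yb = batch
        return {"val_loss": F.l1_loss(self(xb), yb).detach()}

    def validation_epoch_end(self, outputs):
        return {"val_loss": torch.stack([o["val_loss"] for o in outputs]).mean().item()}

    def epoch_end(self, epoch, result, num_epochs):
        if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch + 1, result["val_loss"]))

def evaluate(model, val_loader):
    # One validation pass: per-batch results, then their average.
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []  # validation loss for each epoch
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training phase: one gradient-descent step per batch.
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase.
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history

# Random stand-in data (100 rows, 6 features).
inputs, targets = torch.randn(100, 6), torch.randn(100, 1)
train_loader = DataLoader(TensorDataset(inputs[:80], targets[:80]), 32, shuffle=True)
val_loader = DataLoader(TensorDataset(inputs[80:], targets[80:]), 32)

model = TinyModel()
print(evaluate(model, val_loader))  # validation loss before training
history = fit(5, 1e-2, model, train_loader, val_loader)
```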

We are now ready to train the model. You may need to run the training loop many times, for different numbers of epochs and with different learning rates, to get a good result. Also, if your loss becomes too large (or nan), you may have to re-initialize the model by re-running model = InsuranceModel(). Experiment with this for a while, and try to get to as low a loss as possible.

We got a final validation loss of 6369.4995. Re-initialize the model and try different values for the batch size, number of epochs, learning rate, etc. to get an even lower validation loss.

**Step 5 :- Make predictions using the trained model**

Let’s write a function to make some predictions.

We use the unsqueeze(0) method to add an extra dimension so that the single input is treated as a batch of size one, since the model expects a batch of inputs. As we can see, the predictions are pretty close to the target values.
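A sketch of such a prediction helper, using an untrained stand-in model and made-up, already-encoded feature values (so the printed prediction here is not meaningful, unlike the trained model's):

```python
import torch
import torch.nn as nn

# Untrained stand-in for the trained model (6 features -> 1 charge).
model = nn.Linear(6, 1)

def predict_single(input, target, model):
    inputs = input.unsqueeze(0)       # add a batch dimension: (6,) -> (1, 6)
    predictions = model(inputs)       # forward pass on the batch of one
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
    return prediction

# Made-up, already-encoded feature values for one person.
sample_input = torch.tensor([30.0, 1.0, 28.5, 1.0, 0.0, 2.0])
sample_target = torch.tensor([4500.0])
pred = predict_single(sample_input, sample_target, model)
```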