Original article was published on Deep Learning on Medium
Predict your Wine Quality using Deep Learning with PyTorch
Sounds interesting? I bet the process we will implement to find the best-quality red wine will be even more engrossing.
So if you’ve done some Machine Learning and you want to dive into Deep Learning, the most common question is: “How do I get started?”. Well, you need to choose a good framework for it. There are many deep learning frameworks, and many of them are viable tools, but a quick search shows two great ones dominate the market.
TensorFlow is developed by Google Brain and actively used at Google for both research and production needs. Its closed-source predecessor is called DistBelief.
PyTorch is a cousin of the Lua-based Torch framework, which was developed and used at Facebook. However, PyTorch is not a simple set of wrappers to support a popular language; it was rewritten and tailored to be fast and feel native.
Both TensorFlow and PyTorch provide useful abstractions that reduce the amount of boilerplate code and speed up model development. The main difference between them is that PyTorch may feel more “pythonic” and has an object-oriented approach, while TensorFlow offers several options from which you may choose. Personally, I consider PyTorch to be clearer and more developer-friendly. Its torch.nn.Module gives you the ability to define reusable modules in an OOP manner, and I find this approach very flexible and powerful.
And here we will see how to implement Linear Regression using PyTorch.
So let’s get started. I have covered some basics of PyTorch tensor functions in my previous blog: https://medium.com/@srijaneogi31/pytorch-getting-started-c9d03f02e4ff. It should give you some idea of how you can play with PyTorch datatypes. You can also check the official PyTorch documentation to learn about the various functions the library provides: https://pytorch.org/docs/stable/tensors.html.
First of all, what is Linear Regression?
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value(y) as accurately as possible as a function of the feature or independent variable(x).
The linear regression model can be represented by the following equation
Y = θ₀ + θ₁x₁ + … + θₙxₙ
Y is the predicted value
θ₀ is the bias term.
θ₁,…,θₙ are the model parameters
x₁,…,xₙ are the feature values.
The above hypothesis can also be represented by
Y = θᵀ·x
θ is the model’s parameter vector including the bias term θ₀
x is the feature vector with x₀ = 1
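To make the vectorized form concrete, here is a tiny sketch (with made-up numbers) of computing θᵀ·x in PyTorch:

```python
import torch

# Hypothetical example: 3 features, so theta has 4 entries (bias + 3 weights)
theta = torch.tensor([0.5, 1.0, -2.0, 3.0])   # [θ0, θ1, θ2, θ3]
x = torch.tensor([1.0, 2.0, 0.5, 4.0])        # x0 = 1 for the bias term

# Dot product: 0.5*1 + 1.0*2 + (-2.0)*0.5 + 3.0*4 = 13.5
y = theta @ x
print(y.item())   # 13.5
```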
Now we will implement this in our model. We have used the Red Wine Quality dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine+quality. This dataset can be used for both regression modelling and classification tasks. It includes information about the chemical properties of different wines and how they relate to overall quality.
First we will import all the libraries.
We begin by importing torch and torchvision. torchvision contains some utilities for working with image data.
Import numpy and pandas. Pandas is one of the most popular Python libraries for data science and analytics. It is built on the NumPy package, and its key data structure is the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
Matplotlib is a plotting library for Python for creating static, animated, and interactive visualizations.
torch.nn is the module that provides a modular way to build neural networks in PyTorch; we will be using several of its subclasses, such as nn.Linear, which applies a linear transformation to the incoming data.
torch.nn.functional provides functional versions of these operations (for example F.l1_loss), which we will use for the loss function.
Step 1: Download and explore the data
To load the dataset into memory, we’ll use the read_csv function from the pandas library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
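As a sketch of this step: the red-wine CSV is semicolon-separated, so read_csv needs sep=';'. The snippet below uses two illustrative rows embedded as a string instead of downloading the real file:

```python
import io
import pandas as pd

# A few semicolon-separated rows in the same shape as winequality-red.csv
# (the values here are illustrative, not guaranteed to match the real file)
csv_text = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5
7.8;0.88;0.0;2.6;0.098;25.0;67.0;0.9968;3.2;0.68;9.8;5
"""

# The real file uses ';' as the separator, so sep=';' is required
dataframe = pd.read_csv(io.StringIO(csv_text), sep=';')
print(dataframe.shape)   # (2, 12)
```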
Let’s take a look at the structure of our dataframe.
This dataframe contains 1599 rows and 12 columns. We separate 11 columns, viz. fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, and alcohol, as inputs, and one column, quality, as the output.
Step 2: Prepare the dataset for training
We need to convert the data from the Pandas dataframe into PyTorch tensors for training. The first step is to convert it to numpy arrays, for which we create a dataframe_to_arrays function.
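A minimal sketch of such a dataframe_to_arrays helper (the column lists are abbreviated here; the real dataset has 11 input columns):

```python
import pandas as pd

input_cols = ['fixed acidity', 'volatile acidity', 'citric acid']  # abbreviated list
output_cols = ['quality']

def dataframe_to_arrays(dataframe):
    # Make a copy so the original dataframe is not modified
    df = dataframe.copy(deep=True)
    # Extract inputs and targets as numpy arrays
    inputs_array = df[input_cols].to_numpy()
    targets_array = df[output_cols].to_numpy()
    return inputs_array, targets_array

df = pd.DataFrame({'fixed acidity': [7.4, 7.8],
                   'volatile acidity': [0.70, 0.88],
                   'citric acid': [0.0, 0.0],
                   'quality': [5, 5]})
inputs_array, targets_array = dataframe_to_arrays(df)
print(inputs_array.shape, targets_array.shape)   # (2, 3) (2, 1)
```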
Now we’ll convert the numpy arrays to torch tensors.
Note that by default the datatype of the tensors will be float64; we need to convert them to float32 to avoid a parameter-datatype mismatch when applying the linear function. Casting with torch.float converts the datatype to float32.
Let me show the dtype and shape of the tensors so that it is easier to understand.
Next we’ll wrap the torch tensors in a TensorDataset.
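A sketch of the conversion, assuming the numpy arrays from the previous step:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

inputs_array = np.array([[7.4, 0.70, 9.4], [7.8, 0.88, 9.8]])   # float64 by default
targets_array = np.array([[5.0], [5.0]])

# torch.from_numpy keeps float64; .float() casts to float32 for the linear layer
inputs = torch.from_numpy(inputs_array).float()
targets = torch.from_numpy(targets_array).float()
print(inputs.dtype)          # torch.float32

# Wrap the tensors so each element of the dataset is an (input, target) pair
dataset = TensorDataset(inputs, targets)
print(dataset[0])
```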
Now that we have a single dataset, we need to split it into 2 parts:
Training set — used to train the model i.e. compute the loss and adjust the weights of the model using gradient descent.
Validation set — used to evaluate the model while training, adjust hyperparameters (learning rate etc.) and pick the best version of the model.
While building real world machine learning models, it is quite common to split the dataset into 3 parts. Besides these two, a testing set is required to compare different models, or different types of modeling approaches, and report the final accuracy of the model.
Here I have set aside 1300 of the 1599 rows for training and the remaining 299 for validation. Next I initialised batch_size to 50. The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters. Here we encounter two useful utilities from the torch.utils.data module:
torch.utils.data.random_split is a built-in function that randomly splits a dataset into non-overlapping new datasets of the given lengths.
torch.utils.data.DataLoader represents a Python iterable over a dataset. A data loader combines a dataset and a sampler, and yields data as batches every epoch.
We have set shuffle=True to shuffle the data while training, so that the inputs and outputs for each batch are drawn from the dataset in a rearranged, intermixed order. This randomisation helps generalise and speed up the training process. On the other hand, since the validation data loader is used only for evaluating the model, there is no need to shuffle its data. I recommend reading more at https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader.
So far we’ve created training and validation data loaders to help us load the data in batches.
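The split-and-load step might look like the following sketch (using a small random toy dataset in place of the 1599 wine rows):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Toy dataset standing in for the real wine data
inputs = torch.randn(100, 11)
targets = torch.randn(100, 1)
dataset = TensorDataset(inputs, targets)

# Split into 80 training and 20 validation samples (1300/299 in the article)
train_ds, val_ds = random_split(dataset, [80, 20])

batch_size = 10
# Shuffle the training data each epoch; validation order does not matter
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

for xb, yb in train_loader:
    print(xb.shape, yb.shape)   # torch.Size([10, 11]) torch.Size([10, 1])
    break
```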
Step 3: Create a Linear Regression Model
Our model itself is a fairly straightforward linear regression
Let’s check the input and output sizes and datatypes once again, so that the weights can be initialised with the specified sizes.
Now we will create our WineQuality model class.
Our model is built on the nn.Linear class, which applies a linear transformation to the incoming data: y = xWᵀ + b. This is where the linear function discussed earlier happens. The weights (W) and bias (b) are initialised by the layer and can be inspected via model.parameters(). forward() does the batch-wise forwarding, i.e. it feeds a batch of inputs through the linear layer to produce predictions.
training_step() performs batch-wise training and calculates the batch loss.
validation_step() performs batch-wise validation and calculates the batch loss.
The loss function measures how different the model’s predictions are from the actual targets. PyTorch provides various loss functions, such as cross entropy, mean squared error, and mean absolute error; you can try any of them. Here we used l1_loss (mean absolute error). Check https://pytorch.org/docs/stable/nn.html#loss-functions for more details.
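A possible sketch of the WineQuality class along the lines described above (the exact structure is an assumption; only training_step and validation_step are named in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_size, output_size = 11, 1   # 11 chemical features -> 1 quality score

class WineQuality(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)  # y = xW^T + b

    def forward(self, xb):
        return self.linear(xb)            # pass a batch through the linear layer

    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)                # generate predictions
        return F.l1_loss(out, targets)    # mean absolute error

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return {'val_loss': F.l1_loss(out, targets).detach()}

model = WineQuality()
# 11 weights + 1 bias = 12 trainable parameters
print(sum(p.numel() for p in model.parameters()))   # 12
```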
Let us create an object of the WineQuality model.
Now we’ll define an evaluate function, which will perform the validation phase, and a fit function, which will perform the entire training process.
```python
def evaluate(model, val_loader):
    # Validation phase: calculate the loss for each batch and average it
    outputs = [model.validation_step(batch) for batch in val_loader]
    return {'val_loss': torch.stack([x['val_loss'] for x in outputs]).mean()}

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training phase
        for batch in train_loader:
            loss = model.training_step(batch)  # generate predictions & calculate loss
            loss.backward()                    # compute gradients
            optimizer.step()                   # update weights
            optimizer.zero_grad()              # reset gradients
        # Validation phase: calculate loss and metrics on the validation set
        result = evaluate(model, val_loader)
        history.append(result)
    return history
```
Step 4: Train the model
Here we will use two hyperparameters, the learning rate and the number of epochs, to fit the data. You can try different values of lr and epochs to get a better result. Our aim is to minimise val_loss.
I have repeated the previous step, varying the learning rate by factors of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6), to figure out what works.
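A self-contained sketch of such a sweep, using synthetic data and a bare nn.Linear in place of the full WineQuality model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
# Synthetic stand-in data: 11 features, linear target plus a bias
inputs = torch.randn(200, 11)
targets = inputs @ torch.randn(11, 1) + 0.5
loader = DataLoader(TensorDataset(inputs, targets), batch_size=50, shuffle=True)

def train(lr, epochs=20):
    model = nn.Linear(11, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xb, yb in loader:
            loss = F.l1_loss(model(xb), yb)   # mean absolute error
            loss.backward()
            opt.step()
            opt.zero_grad()
    return F.l1_loss(model(inputs), targets).item()

# Sweep learning rates by powers of 10 and record the final losses
losses = {lr: train(lr) for lr in [1e-1, 1e-2, 1e-3]}
print(losses)
```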
Let’s check how val_loss changes with respect to the number of epochs.
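Plotting the recorded validation losses might look like this sketch (the loss values below are made up for illustration; the Agg backend is only needed outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')          # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical validation losses recorded after each epoch
val_losses = [2.1, 1.4, 0.9, 0.72, 0.65, 0.61, 0.60]

plt.plot(val_losses, '-x')
plt.xlabel('epoch')
plt.ylabel('val_loss')
plt.title('Loss vs. No. of epochs')
plt.savefig('loss_vs_epochs.png')
```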
Now here is the final step
Step 5: Make predictions using the trained model
Let’s define a helper function, predict_single, which returns the predicted quality for a single input tensor.
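A sketch of what predict_single might look like, using an untrained nn.Linear as a stand-in for the trained model:

```python
import torch
import torch.nn as nn

model = nn.Linear(11, 1)   # stands in for the trained WineQuality model

def predict_single(input, target, model):
    # Add a batch dimension, run the model, and strip the batch dimension again
    inputs = input.unsqueeze(0)
    predictions = model(inputs)
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
    return prediction

x = torch.randn(11)       # one row of 11 chemical features
y = torch.tensor([5.0])   # its true quality score
pred = predict_single(x, y, model)
```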
Here you can see the actual quality of the wine is 5, and our model predicts 5.4455, which is pretty close. You can try other samples and gauge the accuracy of the predictions.
So here we end predicting wine quality with linear regression using PyTorch. So next time you buy wine, you can estimate its quality in advance. I am sure that’s going to be a lovely experience. You can try building new models with some other dataset. Get some here: https://jovian.ml/outlink?url=https%3A%2F%2Flionbridge.ai%2Fdatasets%2F10-open-datasets-for-linear-regression%2F. You can also implement logistic regression or classification using these datasets.
My jovian notebook https://jovian.ml/srijaneogi31/02-wine-quality-prediction
Contact me https://www.linkedin.com/in/srija-neogi-a699a51aa