Facial Expression Recognition Using PyTorch

Original article was published on Artificial Intelligence on Medium

Computer vision is a well-known keyword in 2020. There is real hype around it: research teams and many others are working to make computers see and understand the way humans do. One feature of human vision is reading the facial expressions of friends, family, and strangers, so today we will build a project that lets a computer do the same!

The Main Idea
Computer Vision is a field of artificial intelligence that trains computers to understand the visual world by replicating the complexity of the human visual system. Using digital images acquired from cameras and videos, machines can be “trained” to accurately identify and classify objects and even react to what they perceive.
Facial expression recognition is one of the classification tasks people would really like to include in the computer vision toolbox. Our project will look through a camera, which serves as the machine's eyes, and classify the face of the person in view (if any) by their current expression or mood.

The Planning

So now we will plan out a checklist to follow from the start to the completion of the project, so that the strategy is easy to understand and it is clear what we should and should not do to make the project work.

First, we will find a dataset to train the model on, then explore it a little to gain insights into how we can adjust the model later.
Second, we will get the data ready and create a model pipeline. Once the model is ready, we will train it and tweak the hyperparameters as needed. After the model is fully trained, we will evaluate it on the test data and save it.

Finally, we will write a Python script that uses a Haar cascade classifier from the OpenCV module to detect our face through the camera, then uses the saved model to classify the expression.

Our planning of the roadmap is ready. Now we can get on with the project!

Working With The Data

After searching the internet for some time, I found a dataset that interests me and fits the problem we are going to solve in this project very well!

The dataset is called FER2013, but the only copy I found was the one here. Unfortunately, it came with no information about which class was which, so we have to extract that information ourselves to make the project fruitful.

After taking a look at the data frame, we can clearly see that it has 3 columns. The first holds the index number of the class (for instance, one number might stand for a happy face and another for a sad one), but we do not know which is which yet, so we will work that out ourselves a little later.

The second column contains the pixel values of each face photo with its respective expression, and the Usage column states which rows are for training and which are for testing.
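To make the three-column layout concrete, here is a minimal sketch that parses rows of that shape with the standard library. The column names `emotion`, `pixels`, and `Usage` are an assumption about the CSV header, and the two sample rows are made up (real rows carry 2304 pixel values each).

```python
import csv
import io

# Made-up sample in the assumed FER2013 layout: emotion, pixels, Usage.
# Real rows hold 2304 space-separated pixel values; these toy rows hold 4.
SAMPLE_CSV = """emotion,pixels,Usage
0,10 20 30 40,Training
3,50 60 70 80,PublicTest
"""


def read_rows(text):
    """Parse CSV text into (label, pixel list, usage) tuples."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        label = int(record["emotion"])
        # The pixels column is one string of space-separated integers.
        pixels = [int(value) for value in record["pixels"].split()]
        rows.append((label, pixels, record["Usage"]))
    return rows


rows = read_rows(SAMPLE_CSV)
for label, pixels, usage in rows:
    print(label, len(pixels), usage)
```

For the real file, the same function could be fed the contents of the downloaded CSV instead of `SAMPLE_CSV`.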

After a little exploration of the dataset, this is what I found:

So now we know the data frame has 35887 rows in total. That is quite a lot and should be helpful enough for training. There are three kinds of values in the Usage column: Training, PrivateTest, and PublicTest. Probably the dataset comes from a contest where contestants trained on the Training rows and made predictions on PublicTest, while the online leaderboard used PrivateTest to measure accuracy.

We can see that the two test sets each take up about 10% of the data frame, while the training set takes up about 80% of it.
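The split percentages can be checked with a few lines of arithmetic. The per-split row counts below are the ones commonly reported for FER2013 and are an assumption here; only the total of 35887 comes from the text above.

```python
from collections import Counter

# Assumed per-split row counts (commonly reported for FER2013).
usage_counts = Counter({"Training": 28709, "PublicTest": 3589, "PrivateTest": 3589})

total = sum(usage_counts.values())
# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in usage_counts.items()}

print(total)
print(shares)
```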

We can also see that there are 7 expression classes in total. So let us look at the photos alongside their class numbers and work out which class represents which mood.

While running the code I learned that each row holds 2304 pixel values, that is, each picture is 48×48. After looking at the photos I also concluded that 0 represents anger, 1 represents disgust, 2 represents fear, 3 represents happiness, 4 represents sadness, 5 represents surprise, and 6 represents neutral.
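The class mapping and the 48×48 shape can be written down as a small sketch. The names `EXPRESSIONS` and `to_grid` are illustrative, not taken from the original code.

```python
# Class index to mood, as concluded from looking at the photos.
EXPRESSIONS = {
    0: "Angry",
    1: "Disgust",
    2: "Fear",
    3: "Happy",
    4: "Sad",
    5: "Surprise",
    6: "Neutral",
}

IMAGE_SIDE = 48  # 48 * 48 = 2304 pixel values per image


def to_grid(pixels, side=IMAGE_SIDE):
    """Turn a flat list of pixel values into a side x side list of rows."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    return [pixels[row * side:(row + 1) * side] for row in range(side)]


grid = to_grid(list(range(IMAGE_SIDE * IMAGE_SIDE)))
print(len(grid), len(grid[0]), EXPRESSIONS[3])
```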

Since we have relatively little data for a big deep learning model, we will apply some transformations to the images. To get more variety out of the dataset and avoid overfitting, I pad each image by 4 pixels on every side and then take a random 48×48 crop. I also flip each image horizontally with a probability of 50%. Finally, I normalize the images with a mean of 0.5 and a standard deviation of 0.5.

Now we will create the dataset class, which will also reshape our pixel values into tensors so they are easy to feed to the model later. I am going to use the test dataset as the validation dataset, since the labels for the test images are available.
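One way such a dataset class could look is sketched below. The class name `ExpressionDataset` and its constructor arguments are hypothetical, and the scaling to [0, 1] is an assumption rather than something the article states.

```python
import torch
from torch.utils.data import Dataset


class ExpressionDataset(Dataset):
    """Wraps pixel strings and labels; returns (1 x side x side tensor, label)."""

    def __init__(self, pixel_strings, labels, transform=None, side=48):
        self.pixel_strings = pixel_strings
        self.labels = labels
        self.transform = transform
        self.side = side

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        values = [float(v) for v in self.pixel_strings[index].split()]
        # Reshape the flat values into one grayscale channel and scale to [0, 1].
        image = torch.tensor(values, dtype=torch.float32)
        image = image.reshape(1, self.side, self.side) / 255.0
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]


# Two made-up rows of uniform gray pixels, just to exercise the class.
fake_pixels = " ".join(["128"] * (48 * 48))
dataset = ExpressionDataset([fake_pixels, fake_pixels], [3, 4])
sample_image, sample_label = dataset[0]
print(len(dataset), sample_image.shape, sample_label)
```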

Now that the dataset is fully ready, we will create our data loaders with a batch size of 128. We keep a separate validation data loader because it is good practice: it lets us know when the model may be overfitting and gives us a better picture of the accuracy.
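As a rough illustration, the two loaders could be set up as follows. Random tensors stand in for the real images here, and the validation batch size of 256 is an arbitrary choice, not a value from the article.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 300 grayscale 48x48 images with labels 0..6.
images = torch.rand(300, 1, 48, 48)
labels = torch.randint(0, 7, (300,))

train_ds = TensorDataset(images[:256], labels[:256])
valid_ds = TensorDataset(images[256:], labels[256:])

# Shuffle only the training data; validation order does not matter.
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=256)

first_images, first_labels = next(iter(train_dl))
print(first_images.shape, first_labels.shape, len(train_dl))
```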

Since everything is ready now with the data, let us take a look at the photos of the first batch of the training data loader.

Everything is perfect now with the dataset!! So now we can begin with our model!

Getting The Model Ready

Since the data is fully ready to use, we will create the model now. The first thing I will do is create a base classifier class that keeps track of the validation loss and accuracy and the overall epoch loss and accuracy.
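A base class along these lines is sketched below. The method names (`training_step`, `validation_step`, `validation_epoch_end`) and the use of cross-entropy loss are assumptions, and `TinyModel` exists only to exercise the base class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = outputs.argmax(dim=1)
    return (predictions == labels).float().mean()


class ExpressionClassifierBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        outputs = self(images)
        return F.cross_entropy(outputs, labels), accuracy(outputs, labels)

    def validation_step(self, batch):
        images, labels = batch
        outputs = self(images)
        return {"val_loss": F.cross_entropy(outputs, labels).detach(),
                "val_acc": accuracy(outputs, labels)}

    def validation_epoch_end(self, batch_results):
        # Average the per-batch metrics into one number per epoch.
        losses = torch.stack([r["val_loss"] for r in batch_results])
        accuracies = torch.stack([r["val_acc"] for r in batch_results])
        return {"val_loss": losses.mean().item(),
                "val_acc": accuracies.mean().item()}


class TinyModel(ExpressionClassifierBase):
    """Throwaway linear model used only to try out the base class."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(48 * 48, 7))

    def forward(self, images):
        return self.net(images)


model = TinyModel()
batch = (torch.rand(8, 1, 48, 48), torch.randint(0, 7, (8,)))
summary = model.validation_epoch_end([model.validation_step(batch)])
print(summary)
```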

Then I am going to create a convolutional neural network with residual connections. It gradually increases the number of channels of the facial data while shrinking the spatial dimensions, and is followed by a fully connected layer that outputs an array of 7 values, one score per facial expression class, indicating which class the face most likely belongs to.

The network architecture I used here is commonly known as ResNet9.
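For reference, a common ResNet9 layout adapted to single-channel 48×48 inputs and 7 classes is sketched below. The channel widths (64, 128, 256, 512) and the pooling placement follow the usual ResNet9 recipe and are assumptions; the author's exact layers may differ.

```python
import torch
import torch.nn as nn


def conv_block(in_channels, out_channels, pool=False):
    """3x3 convolution, batch norm, ReLU, and optionally 2x2 max pooling."""
    layers = [
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)


class ResNet9(nn.Module):
    def __init__(self, in_channels=1, num_classes=7):
        super().__init__()
        self.conv1 = conv_block(in_channels, 64)         # 64 x 48 x 48
        self.conv2 = conv_block(64, 128, pool=True)      # 128 x 24 x 24
        self.res1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))
        self.conv3 = conv_block(128, 256, pool=True)     # 256 x 12 x 12
        self.conv4 = conv_block(256, 512, pool=True)     # 512 x 6 x 6
        self.res2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))
        self.classifier = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),                     # 512 x 1 x 1
            nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        out = self.conv1(x)
        out = self.conv2(out)
        out = self.res1(out) + out   # residual connection
        out = self.conv3(out)
        out = self.conv4(out)
        out = self.res2(out) + out   # residual connection
        return self.classifier(out)


model = ResNet9()
model.eval()
with torch.no_grad():
    scores = model(torch.rand(2, 1, 48, 48))
print(scores.shape)
```

The 7 outputs are raw scores; applying a softmax turns them into probabilities that sum to 1.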

The model is working perfectly. Next we set up torch to use the GPU for training and load the datasets onto the device available to us, which is the GPU in our case. We can now happily head on to training our model!
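The device setup can be sketched with two small helpers. The names `get_default_device` and `to_device` are illustrative, and the sketch falls back to the CPU when no GPU is present.

```python
import torch


def get_default_device():
    """Pick the GPU if one is available, otherwise the CPU."""
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


def to_device(data, device):
    """Move a tensor, or a list/tuple of tensors, to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(item, device) for item in data]
    return data.to(device, non_blocking=True)


device = get_default_device()
batch = to_device((torch.rand(4, 1, 48, 48), torch.tensor([0, 1, 2, 3])), device)
print(device, batch[0].device)
```

The model itself is moved the same way, for example with `model.to(device)`.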

Training The Model

For this model, I am going to use the 1Cycle learning rate scheduler. Instead of being set manually, the learning rate starts very low, increases to a maximum, and is then reduced again.
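The shape of the 1Cycle schedule can be seen by stepping PyTorch's `OneCycleLR` without doing any real training. The Adam optimizer, the stand-in linear model, and the 10 steps per epoch are assumptions made for illustration; the maximum rate of 0.001 and the 30 epochs match the values chosen in this article.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in model, only here to own parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

epochs, steps_per_epoch = 30, 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.001, epochs=epochs, steps_per_epoch=steps_per_epoch
)

learning_rates = []
for _ in range(epochs * steps_per_epoch):
    # Record the rate used for this step, then advance the schedule.
    learning_rates.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()

peak = max(learning_rates)
print(learning_rates[0], peak, learning_rates[-1])
```

In a real training loop, `scheduler.step()` is called once per batch, right after `optimizer.step()`.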

I am going to clip the gradients to the range -0.1 to 0.1 so that a gradient descent step does not jump too far. After a lot of training runs, I finally settled on 30 epochs and a maximum learning rate of 0.001 for the 1Cycle schedule. This gives me 70% accuracy at this moment.
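Clipping gradients to the range -0.1 to 0.1 can be sketched with `torch.nn.utils.clip_grad_value_`. The tiny linear model and the exaggerated inputs are made up so that the unclipped gradients are clearly out of range.

```python
import torch

model = torch.nn.Linear(3, 1)
inputs = torch.tensor([[100.0, -200.0, 300.0]])

# For a single linear output, the weight gradient equals the inputs,
# so it lies far outside [-0.1, 0.1] before clipping.
loss = model(inputs).sum()
loss.backward()

# Clamp every gradient element into [-0.1, 0.1] in place.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.1)

print(model.weight.grad, model.bias.grad)
```

In the training loop, this call sits between `loss.backward()` and `optimizer.step()`.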