Facial Emotion Recognition using ResNet50

Original article was published on Deep Learning on Medium


Facial emotion recognition (FER) is an important topic in computer vision and artificial intelligence. It has been an active research area for the past few decades, and it remains challenging because of the high variation in how faces and expressions appear across people and imaging conditions.

What is facial emotion recognition?

Facial emotion recognition is the process of detecting human emotions from facial expressions. Facial emotions are important factors in human communication that help us understand the intentions of others. Humans recognize emotions naturally, because we have been trained to do so throughout our lives, and AI software has now been developed that can recognize emotions as well. Such systems keep becoming more accurate as the technology matures.


In general, people infer the emotional states of others, such as joy, sadness, and anger, from facial expressions and vocal tone. Among the nonverbal components of communication, facial expressions carry emotional meaning and are one of the main information channels in interpersonal communication.

In this work, we built a model that is trained on static images and classifies static images. We trained a ResNet50 convolutional neural network from scratch: all parameters were learned directly on the FER2013 dataset, so the model learns its patterns from the dataset itself.

A facial emotion recognition system comprises two steps: face detection in the image, followed by emotion detection on the detected face.
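The two steps can be sketched as a small pipeline. This is only an illustration of the structure: the detector and classifier below are hypothetical stand-ins (`fake_detector`, `fake_classifier`), not the models used in this work, and the box format `(top, left, height, width)` is an assumption made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]      # rows of grayscale pixel values
Box = Tuple[int, int, int, int]      # (top, left, height, width), assumed format


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the rectangle described by `box` out of `image`."""
    top, left, height, width = box
    return [list(row[left:left + width]) for row in image[top:top + height]]


def recognize_emotions(
    image: Image,
    detect_faces: Callable[[Image], List[Box]],
    classify_emotion: Callable[[List[List[int]]], str],
) -> List[Tuple[Box, str]]:
    """Step 1: find the faces. Step 2: classify the emotion of each face crop."""
    return [(box, classify_emotion(crop(image, box))) for box in detect_faces(image)]


# Hypothetical stand-ins so the sketch runs without any trained model.
def fake_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2)]            # pretend one face sits in the top-left corner


def fake_classifier(face: List[List[int]]) -> str:
    mean = sum(map(sum, face)) / (len(face) * len(face[0]))
    return "Happy" if mean > 127 else "Neutral"


image = [[200, 210, 0], [220, 230, 0], [0, 0, 0]]
results = recognize_emotions(image, fake_detector, fake_classifier)
print(results)
```

In a real system, the first callable would wrap a face detector and the second would wrap the trained emotion classifier; keeping them as separate arguments lets either stage be swapped independently.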

Convolutional Neural Network

A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. CNNs have been predominant among deep learning models and have produced astonishing results in most computer vision tasks. They are used to process data that has a grid pattern, such as images, and are built to learn features and patterns from the images automatically and adaptively. A convolutional neural network typically comprises three major parts: convolutional blocks, pooling layers, and fully connected layers.

A CNN accepts its input as a tensor with the shape of an image. The image is then passed through convolutional layers, which extract features from it. Each convolutional layer convolves its input and passes the output to the next layer.

Pooling layers reduce the dimensions of their input by aggregating the outputs of a cluster of neurons in one layer into a single neuron in the next layer. The fully connected layers are used to classify the images; every neuron in one layer is connected to every neuron in the next. The hierarchy of extracted features becomes more complex as each layer feeds its output to the next layer as input.
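To make the convolution and pooling steps concrete, here is a minimal sketch in plain Python. It assumes a single-channel image, no padding, stride 1 for the convolution, and non-overlapping max pooling; the edge-detecting kernel is hand-picked for illustration, whereas a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation, as CNN layers compute it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: each size x size cluster becomes one value."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a vertical edge between a dark left part and a bright right part.
image = [[0, 0, 0, 9, 9] for _ in range(5)]
# Hand-picked kernel that responds where brightness increases from left to right.
edge_kernel = [[-1, 0, 1] for _ in range(3)]

features = conv2d(image, edge_kernel)   # 3x3 feature map
pooled = max_pool(features, size=2)     # 1x1: the leftover row and column are dropped

print(features)
print(pooled)
```

The feature map is zero where the image is flat and large where the edge lies, and pooling keeps only the strongest response in each cluster, which is how the spatial dimensions shrink from layer to layer.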


ResNet stands for Residual Network. It is a convolutional neural network architecture built from deep residual blocks. Deeper neural networks are more difficult to train, and the residual learning framework was proposed to ease the training of networks that are substantially deeper than those used previously. According to the universal approximation theorem, a feedforward network with a single hidden layer is sufficient to approximate any continuous function. However, that layer might need to be massive, and the network is then prone to overfitting the data.

Simply stacking more layers does not work because of the vanishing gradient problem: as the gradients are backpropagated to earlier layers, repeated multiplication can make them extremely small, so performance saturates and then degrades rapidly as depth grows. ResNet addresses this with skip connections, or shortcuts, that jump over some layers. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. An additional weight matrix may be used to learn the skip weights.
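A scalar toy calculation shows why the skip connection helps. It is a deliberate simplification, not a model of the real network: each layer is reduced to a single number, its local slope, and the depth of 50 and the slope of 0.01 are assumed values chosen for illustration.

```python
def gradient_through_stack(local_slopes, residual):
    """Derivative of the stack output with respect to its input, by the chain rule.

    A plain layer y = f(x) contributes a factor f'(x).
    A residual layer y = x + f(x) contributes 1 + f'(x): the skip
    connection adds an identity path that the gradient can follow.
    """
    grad = 1.0
    for slope in local_slopes:
        grad *= (1.0 + slope) if residual else slope
    return grad


# Assumed toy setting: 50 layers, each with a small local slope of 0.01.
slopes = [0.01] * 50

plain = gradient_through_stack(slopes, residual=False)
skip = gradient_through_stack(slopes, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3f}")
```

In the plain stack the product of fifty small factors is practically zero, so the earliest layers receive almost no learning signal. In the residual stack every factor stays close to 1, so the gradient reaches the early layers at a usable magnitude.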

ResNet50 is 50 layers deep. Its convolutional layers extract features from the image, and their weights are updated using the gradients of the loss. In the standard architecture, a fully connected layer at the end classifies the image into the different classes.

Dataset Used

We used FER2013, which was introduced for the Facial Expression Recognition Challenge in 2013. The data consists of 48×48 pixel grayscale images of faces, each labeled with one of seven emotions. The task is to assign each face to one of the seven categories based on the emotion shown in the facial expression (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). The dataset contains 35,887 images, divided into three groups: a training set of 28,709 images, a test set of 3,589 images, and a validation set of 3,589 images.
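The following sketch shows how records of this kind could be decoded and grouped into their splits using only the Python standard library. It assumes the layout of the public CSV release: a header `emotion,pixels,Usage`, pixel values stored as one space-separated string per image, and split names such as `Training` and `PublicTest`. The two records are made up so the example runs without the real file, and `parse_row` and `load_splits` are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
SIDE = 48  # images are 48x48 grayscale


def parse_row(row):
    """Turn one CSV record into (label_index, 48x48 grid of pixel values)."""
    values = [int(v) for v in row["pixels"].split()]
    if len(values) != SIDE * SIDE:
        raise ValueError(f"expected {SIDE * SIDE} pixels, got {len(values)}")
    grid = [values[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return int(row["emotion"]), grid


def load_splits(csv_file):
    """Group parsed records by the value of their Usage column."""
    splits = defaultdict(list)
    for row in csv.DictReader(csv_file):
        splits[row["Usage"]].append(parse_row(row))
    return splits


# Two made-up all-black records standing in for the real file.
blank = " ".join(["0"] * (SIDE * SIDE))
sample = io.StringIO(
    f"emotion,pixels,Usage\n3,{blank},Training\n6,{blank},PublicTest\n"
)

splits = load_splits(sample)
label, grid = splits["Training"][0]
print(EMOTIONS[label], len(grid), len(grid[0]))
```

To read the real dataset, the `io.StringIO` object would be replaced by an opened file handle; the length check in `parse_row` guards against truncated rows.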


We used the PyTorch framework to implement the model. The dataset is a CSV (comma-separated values) file with the columns emotion, pixels, and Usage. We split it into separate dataframes based on the Usage column and then loaded the data in batches of 64, ready to pass to the network. Because we chose the ResNet50 model, we changed the number of output classes to 7. We chose stochastic gradient descent (SGD) as the optimization algorithm, with a learning rate of 0.01, and cross-entropy as the loss function.
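To show what the batching, the loss, and the optimizer compute, here is a minimal sketch in plain Python rather than PyTorch. It is not the training code used for this work. As a simplification, it treats the seven output logits themselves as the trainable parameters, whereas in the real network the gradient flows back into the ResNet50 weights. In PyTorch, the corresponding pieces would be `torch.nn.CrossEntropyLoss` and `torch.optim.SGD`.

```python
import math


def batches(items, batch_size=64):
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def sgd_step(logits, target, lr=0.01):
    """One gradient step; d(loss)/d(logit_k) = softmax_k - [k == target]."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if k == target else 0.0))
            for k, (z, p) in enumerate(zip(logits, probs))]


logits = [0.0] * 7      # an untrained model: all seven classes equally likely
target = 3              # index of "Happy"

before = cross_entropy(logits, target)                    # equals ln(7)
after = cross_entropy(sgd_step(logits, target), target)   # slightly lower

print(before, after)
print(sum(1 for _ in batches(range(28709))))              # number of training batches
```

With a batch size of 64, the 28,709 training images form 449 batches, the last of which holds only 37 images. One SGD step at a learning rate of 0.01 lowers the loss only slightly, which is why training runs over many batches and epochs.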

Result And Conclusion

We constructed a CNN model to recognize human facial expressions. Using a ResNet50 convolutional network trained on the FER2013 dataset, we obtained a training accuracy of 75% and a test accuracy of 45%. The gap between the two suggests that the model overfits the training data.