Building A Real-Time Emotion Detector: Towards Machine with E.Q.

Source: Deep Learning on Medium

4. The Performance

After training our model, we test it on a test set of faces that the model has not seen before. Arguably, the model's performance at categorizing the emotion of a face is best evaluated using the F1-score, rather than accuracy, precision or recall alone. Intuitively, accuracy is the proportion of correct predictions relative to the whole test set.

Accuracy on the test set = Number of faces categorized correctly in the test set / Total number of faces in the test set

However, a poor model may report high accuracy when the data set is skewed: a model that always predicts the most common class is right most of the time without having learned anything useful. For these cases, alternative metrics such as precision or recall are preferred. These two metrics can be combined into a single metric, the F1-score (their harmonic mean), which is particularly relevant for a model trained and tested on a skewed data set like FER2013.
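To make the point about skewed data concrete, here is a minimal sketch in plain Python. The toy labels and the helper names `accuracy` and `per_class_scores` are made up for illustration and are not taken from the project code. It uses the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall).

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one emotion label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy skewed test set (illustrative only): 9 happy faces, 1 disgusted face.
y_true = ["happy"] * 9 + ["disgust"]
# A lazy model that always predicts "happy".
y_pred = ["happy"] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("disgust (precision, recall, F1):",
      per_class_scores(y_true, y_pred, "disgust"))
```

On this toy set the lazy model reaches 90% accuracy, yet its F1-score for ‘disgust’ is 0, which is exactly the failure that accuracy alone hides.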

F1-score for the test set for the 7 types of emotions

From this bar chart of F1-scores, we can see that our model predicts happy, surprised, neutral, sad and angry faces decently. However, it performs worse on faces that show disgust and fear. To investigate the misclassified cases, we plot the confusion matrix for the test set.

The normalized confusion matrix reports the proportion of faces classified correctly or otherwise. The vertical and horizontal axes of the matrix show the true and predicted labels of the faces respectively. The diagonal of the confusion matrix shows the proportion of faces classified correctly, and every other cell shows the proportion of faces with a given true label that were assigned a particular wrong label.

Confusion matrix of the test set
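As a rough illustration of how such a matrix can be computed, the sketch below builds a confusion matrix in plain Python with true labels as rows and predicted labels as columns, matching the axes described above. It assumes each row is normalized by the number of faces with that true label; the article does not say which normalization was used. The function name and the toy labels are hypothetical.

```python
def normalized_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels.

    Each row is divided by its total (an assumption), so a row gives the
    proportions of one true emotion assigned to each predicted emotion.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy example with three of the seven emotions (illustrative only).
labels = ["angry", "disgust", "sad"]
y_true = ["disgust"] * 4 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["angry", "angry", "sad", "disgust",  # the four disgusted faces
          "angry", "angry",                    # the two angry faces
          "sad", "angry"]                      # the two sad faces

matrix = normalized_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}", row)
```

Reading along the ‘disgust’ row of the toy output shows where the disgusted faces ended up, which is the same kind of row-wise reading applied to the real matrix.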

With the matrix, we can analyze why the model performs poorly on ‘disgust’ and ‘fear’ by looking at the rows of the confusion matrix where the true label is ‘disgust’ or ‘fear’.

Ah, it is clear now. Most misclassified ‘disgusted’ faces are predicted as angry, sad or fearful, while most misclassified ‘fearful’ faces are predicted as sad or angry. This should not come as a surprise, as the limited number of ‘disgusted’ faces in the training set might not be enough for the model to learn what a ‘disgusted’ face looks like.

More importantly, we often express more than one emotion on our faces. A person can experience and express varying degrees of fear, sadness and anger, all with one expression. Thus, even a human may not be able to accurately distinguish the type of negative emotion from a face, let alone a machine. Needless to say, surpassing human-level accuracy is challenging, though not impossible.

5. The Limitations and Improvements

In fact, that is the biggest limitation of the model. It cannot squarely categorize the emotions we experience into 7 neat little boxes, simply because we experience more than 7 emotions, and an (arguably) infinite combination of them. Thus, a better version of this model should recognize a combination of different emotions. For instance, if a face expresses both sadness and anger, both emotions are shown as the output. This can be done by setting a probability threshold: as long as the predicted probability of a particular emotion is higher than the threshold, it is shown as one of the outputs.
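The thresholding idea can be sketched in a few lines of Python. This is only an illustration: the label order below is the commonly used FER2013 ordering and is an assumption here, the threshold value of 0.3 is arbitrary, and the function name `emotions_above_threshold` is hypothetical.

```python
# Assumed label order (the usual FER2013 convention); adjust to match the model.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def emotions_above_threshold(probabilities, labels=EMOTIONS, threshold=0.3):
    """Return every emotion whose probability is higher than the threshold.

    Results are sorted from most to least probable. If no emotion clears
    the threshold, fall back to the single most probable emotion.
    """
    picked = [(label, p) for label, p in zip(labels, probabilities)
              if p > threshold]
    if not picked:
        best = max(range(len(probabilities)), key=lambda i: probabilities[i])
        picked = [(labels[best], probabilities[best])]
    picked.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in picked]


# Made-up model output for a face that looks both angry and sad.
mixed = [0.40, 0.05, 0.05, 0.02, 0.38, 0.05, 0.05]
print(emotions_above_threshold(mixed))
```

One design caveat: if the network ends in a softmax layer, the probabilities sum to 1 and compete with each other, so the threshold has to be fairly low for two emotions to clear it at once.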

Another clear limitation of the model is the use of a large network with many parameters that may limit its prediction speed. In fact, the modified VGGFace network (19.5 million parameters) is significantly larger than lighter networks like MobileNet (2.3 million parameters). Other models worth exploring are LightFace and SqueezeNet, both of which are lightweight models used for computer vision.

6. Prediction in Real Time

To predict in real time, the live video feed is first captured using OpenCV (the cv2 module) in Python and fed into a face detection network, MTCNN, which can detect faces accurately in real time. Each detected face is then fed into our trained network, which outputs the predicted emotion.
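The pipeline described above could look roughly like the sketch below. It is not the project's actual code. It assumes the `opencv-python` and `mtcnn` packages, a Keras-style model with a `predict` method, and a 48×48 RGB input scaled to [0, 1]; the input size, preprocessing and function names are all assumptions that must be adapted to the trained network.

```python
def clamp_box(box, frame_width, frame_height):
    """Convert an [x, y, width, height] box to (x1, y1, x2, y2) inside the frame.

    Face detectors can return coordinates slightly outside the image,
    so the box is clipped before cropping.
    """
    x, y, w, h = box
    x1, y1 = max(0, x), max(0, y)
    x2, y2 = min(frame_width, x + w), min(frame_height, y + h)
    return x1, y1, x2, y2


def run_realtime(emotion_model, labels, input_size=(48, 48), camera_index=0):
    """Show the webcam feed with a predicted emotion above each face."""
    # Imported here so clamp_box can be used without these packages installed.
    import cv2
    import numpy as np
    from mtcnn import MTCNN

    detector = MTCNN()
    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # OpenCV delivers BGR frames; MTCNN expects RGB.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            height, width = rgb.shape[:2]
            for face in detector.detect_faces(rgb):
                x1, y1, x2, y2 = clamp_box(face["box"], width, height)
                if x2 <= x1 or y2 <= y1:
                    continue  # box lies completely outside the frame
                crop = cv2.resize(rgb[y1:y2, x1:x2], input_size)
                # Assumed preprocessing: scale to [0, 1], add a batch axis.
                batch = np.expand_dims(crop.astype("float32") / 255.0, axis=0)
                probabilities = emotion_model.predict(batch)[0]
                label = labels[int(np.argmax(probabilities))]
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, label, (x1, max(15, y1 - 10)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
            cv2.imshow("Emotion detector", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
                break
    finally:
        capture.release()
        cv2.destroyAllWindows()
```

With a loaded model and a list of the 7 emotion names in the model's output order, calling `run_realtime(model, labels)` would open the default webcam. Running both networks on every frame may be slow with a large model, which ties back to the limitation on network size discussed earlier.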

7. The Next Steps

I am very excited to continue addressing the model’s limitations and eventually deploy it in an app. In particular, I am interested in building an app that can detect the downswings of a user’s mood based on camera images. By detecting negative emotions early, the app can provide guidance to help lift the user’s mood, through meditation, exercise or otherwise. To build the app, I plan to follow in the footsteps of Laurence Moroney in using TensorFlow Lite.

If you are interested in the project code, feel free to refer to my GitHub repository. I welcome any feedback 🙂