Original article was published on Deep Learning on Medium
Identifying Emotions from Voice using Transfer Learning
Training a neural net to recognize emotions from voice clips using transfer learning.
In the episode titled “The Emotion Detection Automation” from the iconic sitcom “The Big Bang Theory”, Howard manages to procure a device that helps Sheldon (who has trouble reading emotional cues in others) understand the feelings of the people around him by pointing the device at them…
Humans convey messages not just through the spoken word but also through tone, body language and expressions. The same message spoken in two different manners can have very different meanings. Keeping this in mind, I embarked on a project to recognize emotions from voice clips, using tone, loudness and various other factors to determine what the speaker is feeling.
This article is a brief but complete tutorial explaining how to train a neural network to predict the emotion a person is feeling. The process is divided into 3 steps:
- Understanding the data
- Pre-processing the data
- Training the neural network
We will be requiring the following libraries: librosa (to convert the sound clips into spectrograms) and fastai (to train the neural network).
You will also require a Jupyter notebook environment.
Understanding The Data
We will be using two datasets together to train the neural network :
The RAVDESS Dataset is a collection of audio and video clips of 24 actors speaking the same two lines with 8 different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised). We will be using only the audio clips for this tutorial. You can obtain the dataset from here.
The TESS Dataset is a collection of audio clips of 2 women expressing 7 different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). You can obtain the dataset from here.
We will convert the sound clips into a graphical format and then merge the two datasets into one, which we will divide into 8 folders, one for each emotion listed in the RAVDESS dataset section (the RAVDESS “surprised” and TESS “pleasant surprise” clips are merged into a single class).
Pre-processing the Data
We will convert the sound clips into graphical data so that they can be used to train the neural network. Check out the code below:
The notebook above shows how to convert the sound files into graphical data that a neural network can interpret. We use the librosa library to do so, converting the sound data into a spectrogram on the mel scale, a perceptual scale designed to mirror how humans hear pitch, which makes the result easier for the network to interpret. That notebook contains code for a single sound file; the notebook with the full code for converting the initial datasets into the final input data can be found here.
After running the code given in the notebook above for every sound file and dividing the files as necessary, you should have 8 separate folders, each labelled with an emotion and containing the graphical outputs of all the sound clips expressing that emotion.
Training the Neural Network
We will now commence training the neural net to identify emotions by looking at the spectrograms generated from the sound clips. We will train it using the fastai library, taking a CNN (resnet34) pretrained on ImageNet and fine-tuning it on our data.
What we will be doing is as follows :
1. Making a dataloader with appropriate data augmentation to feed to the neural network. The size of each image is 432 by 288.
2. We will use a neural net (resnet34) pretrained on the ImageNet dataset, reduce our images to 144 by 144 by cropping appropriately, and train the neural net on that data.
3. We will then train the neural net again on images of size 288 by 288.
4. We will then analyse the performance of the neural net on the validation set.
5. Voila! The training process will be complete and you will have a neural net which can identify emotions from sound clips.
Let’s start training!
In the above section we created a dataloader from our data. We applied the appropriate transformations to the images to reduce overfitting and to crop them down to 144 by 144, split the data into training and validation sets, and labelled each image from its folder name. As you can see, the data has 8 classes, so this is now a simple image classification problem.
In the above section we took the pretrained neural net and trained it on images of size 144 by 144 to identify emotions. At the end of training we managed to reach an accuracy of 80.1%.
So now we have a neural net which is pretty good at identifying emotions from 144 by 144 images. Next we will train the same neural net to identify emotions from images of size 288 by 288 (which it should already be reasonably good at).
In the above section we trained the neural net (previously trained on the 144 by 144 images) on the 288 by 288 images.
And voila! It can now identify emotions from sound clips, irrespective of the content of the speech, with an accuracy of 83.1% on the validation set.
In the next section we will analyse the results of the neural net using a confusion matrix.
The above section contains the confusion matrix for our dataset.
Thank you for reading this article. I hope you enjoyed it!
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Toronto emotional speech set (TESS):
Pichora-Fuller, M. Kathleen; Dupuis, Kate, 2020, “Toronto emotional speech set (TESS)”, https://doi.org/10.5683/SP2/E8H2MF, Scholars Portal Dataverse, V1