Speech Emotion Detection

Source: Deep Learning on Medium

Extract Human Emotions from Audio Files

By Ian Hatfield, Pedro Rivas, Tushar Gupta, Rocco Lange, and Arnav Deshwal


This blog chronicles our journey training models to classify audio samples from the RAVDESS dataset to their corresponding emotions. We explored the use of different model types, including but not limited to, bagging, boosting, multi-layer perceptrons, convolutional neural nets, and voting classifier ensembles. The Python library libROSA provided the main tools for processing and extracting features from the audio files utilized in this project.

The foundation of modeling began with feature selection. After extracting MFCCs, Chroma, and Mel spectrograms from the audio files, we began assembling models readily available from scikit-learn and other Python packages. Hyperparameter tuning for several of these models was accomplished using the Optuna framework.

Our voting classifier ensembles, as well as various other models, outperformed the convolutional neural networks in classifying our data. Our process for constructing and testing these models that classify audio to one of sixteen classes (male/female and eight different emotions) is discussed in greater detail below.

Introduction and Background

Classifying speech to emotion is challenging because of its subjective nature: the task can be difficult even for humans, let alone machines. Potential applications are numerous, including but not limited to call centers, AI assistants, counseling, and veracity tests.

There are numerous projects and articles available on this subject. Please see our references section for articles and other Jupyter notebooks on this or related topics that we found useful and interesting.

Our approach to classifying speech to emotion was as follows:

  1. Read WAV files in by using the libROSA package in Python
  2. Extract features from the audio time series created by libROSA using functions from the libROSA package (MFCCs, Chroma, and Mel spectrograms)
  3. Construct a series of models from various readily available Python packages
  4. Tune hyper-parameters for the models using the Optuna framework
  5. Ensemble models using soft voting classifier to improve performance
  6. Use PCA to extract features for use in a CNN
  7. Use VGG 16 model on images of spectrograms
  8. Ensemble the CNN with the VGG 16 to improve performance

Our project demonstrates the use of ensembling to improve performance and provides an instance of deep learning being outperformed by less complex models, reinforcing the idea that less complex models can yield better results in certain situations. We also demonstrate the use of Optuna for hyperparameter tuning which can be used for a variety of models in a number of different applications in data science projects. Additionally, our project explores the use of image classification to classify sound through its spectrogram.

Audio Files

For some context on our approach to the problem, we will give some background on the technicalities behind audio waves. Audio is represented as a waveform where the x-axis is time and the y-axis is amplitude. Any such waveform can be described as a sum of sine waves, each of the form A sin(omega*t + phi), where A controls the amplitude of the curve, “omega” controls the angular frequency (and therefore the period) of the curve, and “phi” controls the horizontal shift of the curve. Digital audio stores the waveform as samples recorded at regular timesteps, and the number of samples per second is called the sampling rate, measured in hertz (Hz). The default sampling rate in libROSA is 22,050 Hz, which by the Nyquist theorem can represent frequencies up to 11,025 Hz. That is below the roughly 20 kHz upper bound of human hearing, but it covers most of the energy in speech.
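To make the relationship between a sine component, the sampling rate, and the number of stored samples concrete, here is a minimal sketch using NumPy. The function name `sample_sine` and the 440 Hz tone are illustrative choices, not part of the project code.

```python
import numpy as np

def sample_sine(amplitude, freq_hz, phase, sr, duration_s):
    """Sample A * sin(omega * t + phi) at `sr` samples per second."""
    n_samples = int(round(sr * duration_s))
    t = np.arange(n_samples) / sr      # time stamps in seconds
    omega = 2.0 * np.pi * freq_hz      # angular frequency in rad/s
    return amplitude * np.sin(omega * t + phase)

sr = 22050                             # libROSA's default sampling rate
tone = sample_sine(amplitude=0.5, freq_hz=440.0, phase=0.0, sr=sr, duration_s=1.0)

print(tone.shape)                      # one second of audio holds sr samples
print(sr / 2)                          # Nyquist limit: highest representable frequency
```

One second of audio at this rate is simply an array of 22,050 amplitude values, which is the kind of 1-dimensional time series the rest of the pipeline works with.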

Data Description

The RAVDESS dataset was chosen because it consists of speech and song files rated by 247 untrained Americans across eight emotions: Calm, Happy, Sad, Angry, Fearful, Disgust, and Surprise, each at two intensity levels, along with a baseline of Neutral for each actor.

A breakdown of the emotion classes in the dataset is provided in the following table:

The dataset is gender balanced, being composed of 24 professional actors: 12 male and 12 female.

The audio files were created in a controlled environment and each consists of identical statements spoken in an American accent. Additionally, there are two distinct types of files:

  • Speech file (Audio_Speech_Actors_01–24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440
  • Song file (Audio_Song_Actors_01–24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012

The files are in the WAV raw audio file format and all have a 16-bit depth and a 48 kHz sample rate. The files are all uncompressed, lossless audio, meaning that the audio files in the dataset have not lost any information/data or been modified from the original recording.
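The sixteen gender-and-emotion classes can be derived from the file names alone. The sketch below relies on the naming convention published with RAVDESS (seven two-digit fields, where the third is the emotion code and the seventh is the actor number, with odd-numbered actors male and even-numbered actors female); that convention comes from the dataset's documentation rather than from this article. The function name and the `gender_emotion` label format are hypothetical.

```python
# Emotion codes as documented for RAVDESS file names (third field).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_from_filename(path):
    """Build a 'gender_emotion' label from a RAVDESS file name."""
    name = path.replace("\\", "/").split("/")[-1]
    fields = name.rsplit(".", 1)[0].split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {name}")
    emotion = EMOTIONS[fields[2]]
    # Odd actor numbers are male, even actor numbers are female.
    gender = "male" if int(fields[6]) % 2 == 1 else "female"
    return f"{gender}_{emotion}"

print(label_from_filename("Actor_12/03-01-06-01-02-01-12.wav"))
```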

As mentioned before, to process/manipulate these files we used the libROSA Python package. This package was originally created for music and audio analysis, making it a natural choice for dealing with our dataset.

After importing libROSA, we read in one WAV file at a time. libROSA’s “load” function returns the audio time series, a 1-dimensional array for mono or a 2-dimensional array for stereo whose elements represent the amplitude of the sound wave, along with the sampling rate, which together with the clip's duration determines the length of the array.

Data Pre-processing and Exploration

Before going into pre-processing and data exploration we will explain some of the concepts that allowed us to select our features.

  • Mel scale: deals with human perception of frequency. It is a scale of pitches judged by listeners to be an equal distance from each other
  • Pitch: how high or low a sound is. It depends on frequency; a higher pitch means a higher frequency
  • Frequency: speed of vibration of a sound, measured in wave cycles per second
  • Chroma: representation of audio where the spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma). Computed by summing the log frequency magnitude spectrum across octaves
  • Fourier Transforms: used to convert from the time domain to the frequency domain. The time domain shows how a signal changes over time. The frequency domain shows how much of the signal lies within each given frequency band over a range of frequencies
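Two of the concepts above can be illustrated with plain NumPy. This is a sketch, not the project's code: it applies a Fourier transform to a synthetic 440 Hz tone to recover its frequency, and converts hertz to mels using one common formula (the HTK convention). Audio libraries may use a different variant of the mel formula.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr                       # one second of time stamps
signal = np.sin(2 * np.pi * 440.0 * t)       # time domain: a pure 440 Hz tone

# Frequency domain: magnitude of each frequency bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / sr)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)                               # the dominant frequency of the tone

def hz_to_mel(f_hz):
    """Convert hertz to mels (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Equal steps in hertz shrink on the mel scale as frequency rises.
print(hz_to_mel(1000.0) - hz_to_mel(500.0))
print(hz_to_mel(8000.0) - hz_to_mel(7500.0))
```

The second pair of prints shows why the mel scale matters for speech: a 500 Hz step is a much larger perceptual jump at low frequencies than at high ones.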

Using the signal extracted from the raw audio file and several of libROSA’s audio processing functions, MFCCs, Chroma, and Mel spectrograms were extracted using the following function:

## get_features takes in the metadata dataframe and returns a dataframe with MFCC, Mel scale, and Chroma features (180 features in total).
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

def get_features(df):
    data = pd.DataFrame(columns=['feature'])
    label = pd.DataFrame(columns=['label'])
    name = pd.DataFrame(columns=['name'])

    for i in tqdm(range(df.shape[0])):
        ## The dataframe index holds the path to each audio file
        x, sample_rate = librosa.load(df.index[i])

        ## Numpy array that will store all the features
        result = np.array([])

        ## MFCCs (40 coefficients, averaged over time)
        mfccs = np.mean(librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))

        ## Chroma (12 bins), computed from the magnitude of the STFT
        stft = np.abs(librosa.stft(x))
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma))

        ## Mel Scale (128 bands by default, averaged over time)
        mel = np.mean(librosa.feature.melspectrogram(y=x, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel))

        label.at[i, 'label'] = df['label'].iloc[i]
        data.loc[i] = [result]
        name.at[i, 'name'] = df.index[i].split('/')[-1]

    final_data = pd.DataFrame(data['feature'].values.tolist())
    final_data = pd.concat([final_data, label, name], axis=1)

    return final_data

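The function receives the metadata dataframe, whose index holds the file paths, and loads each audio file using the libROSA library. Several libROSA functions are utilized to extract features that are averaged over time, concatenated into a single numpy array per file, and returned as a dataframe together with each file's label and name.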
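The count of 180 features follows from how the three feature matrices are averaged over time and concatenated: 40 MFCCs, 12 chroma bins, and 128 mel bands. The sketch below shows only that aggregation step, using random arrays as stand-ins for the libROSA outputs; the frame count and the `aggregate` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 130                                # varies with clip length

# Stand-ins for the (n_features, n_frames) matrices returned per clip.
mfcc = rng.normal(size=(40, n_frames))
chroma = rng.random(size=(12, n_frames))
mel = rng.random(size=(128, n_frames))

def aggregate(*feature_matrices):
    """Average each matrix over time and concatenate into one vector."""
    return np.hstack([m.mean(axis=1) for m in feature_matrices])

vector = aggregate(mfcc, chroma, mel)
print(vector.shape)                           # 40 + 12 + 128 values per clip
```

Averaging over frames gives every clip a vector of the same length regardless of its duration, at the cost of discarding how the features evolve over time.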

The spectrograms used in our VGG 16 model, discussed later, were generated with the following class.

## The Spectrograms class takes the metadata file created in the previous step along with the output path and type of data (train, validation, test).
## Users have the option to specify what kind of spectrograms they want.
## The class can generate 3 types of spectrograms: Mel Scale, MFCC, and Spectral
## If sample is set to True, the class will just display the required spectrogram of the first file in the dataset
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

class Spectrograms():
    def __init__(self, df, datasettype, outputpath, sample=False, augmentation=False, mel=True, mfcc=False, spectral=False, mfccbanks=20, n_mels=128):
        self.df = df
        self.augmentation = augmentation
        self.mel = mel
        self.mfcc = mfcc
        self.spectral = spectral
        self.mfccbanks = mfccbanks
        self.n_mels = n_mels
        self.outputpath = outputpath
        self.datasettype = datasettype
        self.sample = sample

    def get_spectrograms(self):
        if self.sample:
            x, sample_rate = librosa.load(self.df.index[0])
            self.generate(x, sample_rate, '', 0)
            return  ## Assumption: sample mode only previews the first file

        for file in tqdm(range(self.df.shape[0])):
            emotion = self.df['label'].iloc[file]
            path = self.outputpath + self.datasettype + "/" + emotion + "/"
            if not os.path.exists(path):
                os.makedirs(path)  ## Assumption: the body of this check is missing in the source
            ## Reading signal from .wav file
            x, sample_rate = librosa.load(self.df.index[file])
            self.generate(x, sample_rate, path, file)

    def generate(self, x, sample_rate, path, count):
        ## In each branch below, the body of `if self.sample:` is missing in the source.
        ## plt.show() is assumed from the description above; the code that saves the
        ## figure under `path` is not shown in the source and is left out here.
        if self.mel:
            mel_features = librosa.feature.melspectrogram(y=x, sr=sample_rate, n_mels=self.n_mels)
            log_mel_features = librosa.power_to_db(mel_features, ref=np.max)
            fig = plt.figure(figsize=(12,4))
            ax = plt.Axes(fig, [0., 0., 1., 1.])
            librosa.display.specshow(log_mel_features, sr=sample_rate, x_axis='time', y_axis='mel')
            if self.sample:
                plt.show()

        if self.mfcc:
            mfcc_features = librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=self.mfccbanks)
            fig = plt.figure(figsize=(12,4))
            ax = plt.Axes(fig, [0., 0., 1., 1.])
            librosa.display.specshow(mfcc_features, sr=sample_rate, x_axis='time', y_axis='mel')
            if self.sample:
                plt.show()

        if self.spectral:
            spectral_features = librosa.feature.spectral_contrast(y=x, sr=sample_rate)
            fig = plt.figure(figsize=(12,4))
            ax = plt.Axes(fig, [0., 0., 1., 1.])
            librosa.display.specshow(spectral_features, sr=sample_rate, x_axis='time', y_axis='mel')
            if self.sample:
                plt.show()

Summary of Features

RAW AUDIO: image output of the audio file read in by libROSA

MFCC: Mel Frequency Cepstral Coefficients

  • Voice depends on the shape of the vocal tract, including the tongue, teeth, etc.
  • MFCCs represent the short-time power spectrum of a sound, essentially a representation of the vocal tract

STFT: returns a complex-valued matrix D of short-time Fourier transform coefficients

  • abs(D[f,t]) is the magnitude of frequency bin f at frame t

CHROMA_STFT: chromagram (12 pitch classes) computed from an energy (magnitude) spectrum, obtained by taking the absolute value of the matrix returned by libROSA’s STFT function, instead of a power spectrum. It returns the normalized energy for each chroma bin at each frame

MEL SPECTROGRAM: a magnitude spectrogram is computed and then mapped onto the mel scale. The x-axis is time and the y-axis is frequency


After all of the files were individually processed through feature extraction, the dataset was split into train and test in an 80–20 split.
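An 80–20 split of this kind can be sketched with scikit-learn's `train_test_split`. The random feature matrix below stands in for the extracted features (2,452 files, 180 features, 16 classes); stratifying on the label and the fixed `random_state` are assumptions, since the article does not say how the split was drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_files, n_features, n_classes = 2452, 180, 16   # 1440 speech + 1012 song files
X = rng.normal(size=(n_files, n_features))       # stand-in for the feature table
y = rng.integers(0, n_classes, size=n_files)     # stand-in for the class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 80-20 split
    stratify=y,          # assumption: keep class proportions equal in both sets
    random_state=42,     # fixed seed so the split is reproducible
)
print(X_train.shape, X_test.shape)
```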

The modeling process was divided into two main parts: “traditional” machine learning models and deep neural networks. Simpler models were to be used as a baseline for the convolutional neural network and recurrent neural network.

Traditional Machine Learning Models:

  • Simple models: K-Nearest Neighbors, Logistic Regression, Decision Tree
  • Ensemble models: Bagging (Random Forest), Boosting (XG Boost, LightGBM)
  • Multilayer Perceptron Classifier
  • Soft Voting Classifier ensembles

The hyperparameters for each of the models above were tuned with the Optuna framework, using the mean accuracy of 3- to 5-fold cross-validation on the train set as the metric to optimize. This particular framework was chosen due to its flexibility, as it allows distributions of numerical values or lists of categories to be suggested for each of the hyperparameters, and because it prunes unpromising trials.

Deep Learning:

Two different approaches in modeling with deep learning networks were pursued:

  • Design a convolutional neural network and train it on a combination of MFCC, Mel Scale, and Chroma features
  • Take a more robust, widely tested convolutional architecture and train it on Mel spectrograms

Approach 1:

For this approach, we began with MFCCs, Mel Scale, and Chroma features (180 features in total). Following the same approach as this research paper, we reduced the number of features while attempting to keep as much information as possible using the dimensionality reduction technique called Principal Component Analysis (PCA). The resulting feature set had 66 features capturing 95% of the variance in the original set.
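With scikit-learn, keeping enough principal components to reach a target share of variance can be sketched as follows. The synthetic correlated data stands in for the 180 extracted features, and standardizing before PCA is an assumption rather than a documented step of the project, so the number of components retained here will differ from the 66 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 180 correlated features driven by 30 latent factors.
latent = rng.normal(size=(500, 30))
mixing = rng.normal(size=(30, 180))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 180))

X_scaled = StandardScaler().fit_transform(X)      # assumption: scale before PCA

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```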

We experimented with multiple combinations of the number and size for convolution layers and fully-connected layers, optimizers, batch sizes, and epochs to get the best performance. Finally, a neural network was designed with:

  • 8 convolution layers and 3 fully connected layers
  • ReLU as the activation function
  • Batch normalization and dropouts at different stages
  • SGD optimizer with momentum=0.9 and learning_rate=0.01
  • Categorical Cross-entropy as the loss function
  • Batch size as 16
  • Softmax function for final 16-class classification

The architecture can be seen below:

The convolution layers work to extract high-level complex features from the input data while the fully-connected layers are used to learn non-linear combinations of these features that will be used for classification.

Some key points about the input and output of the network:

  • The input is an (n, 66, 1) shaped array, where n is the number of audio files (1961 in our case for the train set)
  • The output of the network is an (n, 16) array, which gives the probabilities associated with each class for each audio file
  • The argmax() function was used to find the class with the maximum probability
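The shapes described above can be checked with a small NumPy sketch. The random logits stand in for the output of the network's final layer, so this shows only the input reshaping, the softmax, and the argmax step, not the network itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # a handful of clips for illustration

features = rng.normal(size=(n, 66))          # PCA-reduced features
cnn_input = features[..., np.newaxis]        # add a channel axis: (n, 66, 1)

logits = rng.normal(size=(n, 16))            # stand-in for the final layer's output

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(logits)                      # (n, 16) class probabilities
pred = probs.argmax(axis=1)                  # index of the most likely class

print(cnn_input.shape, probs.shape, pred)
```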

Approach 2:

In an alternative approach, we made use of a widely used network and trained it on our data. We tried several well-known CNN architectures, such as VGG 16, VGG 19, ResNet 50, and InceptionNet, and finally settled on VGG 16 as it was fairly robust and the most feasible option.

In this network, we used the bottleneck layers of VGG 16 to extract features from Mel spectrograms and added the same set of fully-connected layers, optimizer, batch size, and loss function used in Approach 1.

The architecture of VGG-16 can be seen below:

Mel spectrograms were converted to 224×224 pixel images. These were then used as input to the VGG 16 network. The output is of the same format as in the previous approach.

Some interesting findings:

  • The models from both approaches were found to be overfitting during training. To avoid this we tried more aggressive dropouts, L1 and L2 regularization at various convolution and fully-connected layers for Approach 1 and only on the fully-connected layer for Approach 2. However, this resulted in a significant decrease in both training and validation accuracies even with an extremely small value of lambda.
  • Batch size also had an impact on model accuracy. A decrease in batch size led to an increase in accuracy, but it also increased the training time non-linearly. After multiple iterations, we fixed the batch size at 16.

Final Approach:

In our final approach, we decided to create an ensemble of the neural nets developed in Approach 1 and Approach 2. To do this, we used the soft voting technique to combine the resultant posterior probabilities from both models. We found that giving a weight of three to posterior probabilities from Approach 1 and weight of two to posterior probabilities from Approach 2 resulted in better overall accuracy.
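Weighted soft voting amounts to a weighted average of the two models' probability arrays followed by an argmax. The sketch below uses small made-up probability tables with three classes in place of the real (n, 16) outputs; only the 3:2 weighting comes from the text.

```python
import numpy as np

def weighted_soft_vote(prob_list, weights):
    """Weighted average of class probabilities, then the most likely class."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_list)                 # (n_models, n_samples, n_classes)
    avg = np.tensordot(weights, stacked, axes=1) / weights.sum()
    return avg, avg.argmax(axis=1)

# Made-up posteriors for two clips and three classes.
p_cnn = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])               # stand-in for Approach 1
p_vgg = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.3, 0.5]])               # stand-in for Approach 2

avg, labels = weighted_soft_vote([p_cnn, p_vgg], weights=[3, 2])
print(avg)       # [[0.4, 0.46, 0.14], [0.2, 0.42, 0.38]]
print(labels)    # [1, 1]
```

In the first clip the confident lower-weight model overrides the higher-weight one, while in the second the higher weight wins; this is the behaviour that distinguishes soft voting from a simple majority vote.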


The results and parameters of the top performing models are provided below, as well as a summary of metrics obtained by other models. Note that results will vary slightly with each run of the associated Jupyter notebooks, unless seeds are set. Overfitting was an issue with the majority of our models with some models overfitting to a greater or lesser degree than others. We believe this may have been caused in part by the relatively small size of the dataset. Below is some of the code used to train and test the traditional machine learning models.

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.multiclass import unique_labels

## X_train, X_test, y_train, y_test, the tuned parameter dictionaries (dt_params, ...),
## and the print_confusion_matrix helper are defined elsewhere in the notebook.
## The dictionaries are defined before the function because they are used as default arguments.
models = {'dt':DecisionTreeClassifier(**dt_params),
          ## ... the remaining entries are not shown in the source
          }

model_abrv = {'dt':'Decision Tree Classifier',
              'rf':'Random Forest Classifier',
              'lgb':'LGBM Classifier',
              'xgb':'XGB Classifier',
              'mlp':'MLP Classifier',
              'kn':'K-Nearest Neighbors',
              'lr':'Logistic Regression',
              'v':'Voting Classifier: MLP, LGB',
              'v2':'Voting Classifier 2: KNN, XGB, MLP',
              'v3':'Voting Classifier 3: XGB, MLP, RF, LR',
              'v4':'Voting Classifier 4: MLP, XGB'}

def model(clf, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, models=models, save=False, print_stat=True, inc_train=False, cv=False):
    """Trains a model and outputs score metrics. Takes an identifier, a dictionary of models, and the split dataset as inputs and has options for
    returning the fitted model, printing the confusion matrix and classification report, and getting cross-validated 5-fold accuracy."""
    clf_model = models[clf]
    clf_model.fit(X_train, y_train)
    y_pred = clf_model.predict(X_test)
    if print_stat:
        clf_report = pd.DataFrame(classification_report(y_test, y_pred, output_dict=True)).T
        clf_report.to_csv('tuned_' + model_abrv[clf] + '_classification_report.csv')
        print('\nTest Stats\n', classification_report(y_test, y_pred))
        print_confusion_matrix(confusion_matrix(y_test, y_pred), unique_labels(y_test, y_pred), model=clf)
    if inc_train:
        y_train_pred = clf_model.predict(X_train)
        print('\nTrain Stats\n', classification_report(y_train, y_train_pred))
        print_confusion_matrix(confusion_matrix(y_train, y_train_pred), unique_labels(y_train, y_train_pred), model=clf)
    if cv:
        print(model_abrv[clf] + ' CV Accuracy:',
              np.mean(cross_val_score(clf_model, X_train, y_train, cv=5, scoring='accuracy')))
    if save:
        return clf_model

To compare the different models we had to choose a common metric. We decided on the overall accuracy of the model, since we are weighting all classes the same. Accuracy is a simple metric to compute across all the models, as it can be obtained from the confusion matrix by dividing the sum of the values on the diagonal by the total number of points.
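As a quick illustration of that calculation, the sketch below computes accuracy from a small made-up confusion matrix with three classes; the numbers are not results from the project.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Rows are true classes, columns are predicted classes (made-up counts).
cm = np.array([[8, 1, 1],
               [2, 7, 1],
               [0, 2, 8]])

print(accuracy_from_confusion(cm))   # 23 correct out of 30
```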

Traditional Machine Learning

XG Boost

The following parameters were obtained for the XG Boost model using the Optuna framework, and yielded a test set accuracy of 0.73.

Below is the confusion matrix produced by this model. The confusion matrix shows that this model has more trouble classifying fearful female, sad female, and sad male than some of the other classes.


The following parameters were obtained for the MLP model using the Optuna framework, and yielded a test set accuracy of 0.83.

Below is the confusion matrix produced by the MLP model. The model shows considerable improvement in classifying fearful female and sad male, but has even more trouble classifying sad female than the XG Boost model.

Soft Voting Classifier Ensemble (MLP and XGB)

We expected the MLP model's stronger performance on the fearful female and sad male classes to offset the XG Boost model's weakness there, and the XG Boost model's stronger performance on the sad female class to offset the MLP model's weakness, so a soft voting classifier was used to average the probabilities produced by each of the models. This voting classifier outperformed each of the component models in both the 5-fold CV accuracy over the train set and the test set accuracy, obtaining 0.84. The confusion matrix shows that although the voting classifier performs better on fearful female, sad female, and sad male, it still struggled with sad female as the component models did.
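A soft voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. To keep the example free of extra dependencies, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost; the synthetic data, the scaling step, and all hyperparameters are illustrative assumptions rather than the tuned values from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                  random_state=0))
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)  # stand-in for XGBoost

# voting="soft" averages the predicted class probabilities of the members.
ensemble = VotingClassifier(estimators=[("mlp", mlp), ("gbm", gbm)], voting="soft")
ensemble.fit(X_train, y_train)

probs = ensemble.predict_proba(X_test)
print(ensemble.score(X_test, y_test))
```

Because the averaged probabilities decide the label, a model that is confident and correct on a class can outvote a model that is unsure about it, which is the effect the ensemble was meant to exploit.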

Many more traditional machine learning models were trained, tuned, and tested other than those discussed previously. Below is a summary of the statistics associated with these other models.

Deep Networks

Convolutional Neural Network

Below is the confusion matrix produced by the Convolutional Neural Network. In addition to having issues classifying sad female and fearful female as is the case with most of the models, it shows that this model also struggles to classify angry male, fearful male, and happy male.

VGG 16

Below is the confusion matrix produced by the VGG 16. It shows comparable performance to the CNN in classifying sad female and fearful female while being less accurate with fearful male and sad male; however the model shows marked improvement in classifying angry male and happy male, as well as small improvements in several other classes.

Soft Voting Classifier Ensemble (CNN and VGG 16)

The confusion matrix for the Soft Voting Ensemble of the CNN and VGG 16 models shows improvement over its component models in classifying fearful female, sad female, fearful male and sad male, while sacrificing some accuracy with classifying happy male. Overall the ensemble boasts a better accuracy than either of its component models.

Several other deep learning models were trained and tested throughout this project. The results are summarized in the table below.

As reflected in the results tables for both traditional machine learning and deep learning, the highest accuracies for both approaches were achieved by using soft voting classifiers. It is interesting to note that some of the simpler models like logistic regression performed with comparable accuracy to the CNN. This could be due to the relatively small size of the dataset and/or the number of epochs used in training the CNN.

On this particular dataset a simple neural net, such as a multilayer perceptron, which we treated as a traditional machine learning approach, was the top performer on its own.

Some other interesting results are:

  • The deep networks trained on spectrograms performed poorly compared to the ensembles trained on a collection of features
  • Adding song data to increase the size of the dataset improved the model performance across the board even though it caused a slight imbalance of the classes


This project started with the desire to understand how to implement and use deep networks. Over the course of the project, however, the importance of feature engineering came to the fore.

The RNN trained on aggregated features performed worse than simple logistic regression, which makes it abundantly clear that complex models are not always the best performing. Even when using the same features on an MLP and a CNN, the MLP outperformed the more complex model. As mentioned before, this could be because the size of the dataset was insufficient to properly train a deeper network.

Not only does the MLP perform better, but it also takes less time and effort to train once the features have been selected and/or created. This by itself can be very significant when deploying models in a business environment.

The use of three feature sets (MFCCs, Mel spectrogram features, and chroma STFT) gave impressive accuracy in both simple and deep learning models, reiterating the importance of feature selection and understanding the data in order to select the proper preprocessing methods.

Future Work

An alternate approach that could be explored for this problem is splitting the classifying task into two distinct problems. A separate model could be used to classify gender and then separate models for each gender to classify emotion could be utilized. This could possibly lead to a performance improvement by segregating the task of emotion classification by gender.

As with many data science projects, different features could be used and/or engineered. Some possible features to explore concerning speech would be MFCC Filterbanks or features extracted using the perceptual linear predictive (PLP) technique. These features could affect the performance of models in the emotion classification task.

It would be interesting to see how a human classifying the audio would measure up to our models. However, finding someone willing to listen to more than 2,400 audio clips may be a challenge in and of itself, because a person can only listen to “the children are talking by the door” or “the dogs are sitting by the door” so many times.














Relevant Project Links



Header Photo

Lay, Holly. “day 041.” June 1, 2010. Online image. Flickr. <https://www.flickr.com/photos/hollylay/4661253331>. License: https://creativecommons.org/licenses/by/2.0/ Cropped from Original.