Beginner’s guide for making an Audio classifier

Source: Deep Learning on Medium

Problem statement: Let’s say we have to create a multi-label classifier from raw audio files that contain spoken words.

Before we dive into the nitty-gritty of how to classify a sound fragment, let’s go through the basic concepts we cannot do without. I will try to keep things simple and conceptually easy to understand.

Phase 1: Data Exploration

The first and foremost step is to explore the data and try to understand it. For instance, find out how the labels are distributed, whether there are any noisy files, and what the average duration of a file is. Listen to the audio files that seem a little odd. Being comfortable with the data gives the rest of the process solid ground to stand on.
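Here is a minimal sketch of that first look at the data. It assumes the clips are uncompressed PCM WAV files and that you already have a list of (path, label) pairs; the function names `wav_duration_seconds` and `summarize` are made up for this example.

```python
import wave
from collections import Counter


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize(labelled_files):
    """Summarize (path, label) pairs: label counts and duration statistics."""
    label_counts = Counter(label for _, label in labelled_files)
    durations = [wav_duration_seconds(path) for path, _ in labelled_files]
    return {
        "label_counts": dict(label_counts),
        "mean_duration": sum(durations) / len(durations),
        "min_duration": min(durations),
        "max_duration": max(durations),
    }
```

Files whose duration is far from the mean, or labels with very few examples, are good candidates to listen to first.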

Label distribution in data

Phase 2: Data pre-processing:

Audio signal and feature extraction: An audio signal is a sequence of sound samples captured over a period of time. In the time domain the signal changes rapidly, which makes it hard to extract features without losing information. The usual solution is to convert the signal from the time domain to the frequency domain. We generally use the Fourier transform to obtain the underlying spectrum, from which compact features such as Mel-frequency cepstral coefficients (MFCCs) are computed.
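To make the move from time domain to frequency domain concrete, here is a small sketch of a naive discrete Fourier transform applied to a toy tone. The signal, sample rate and function name `dft_magnitudes` are assumptions made for illustration. This is only the first step towards MFCCs, and in practice you would let an audio library (librosa, for example) do the whole extraction much faster.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


# A toy signal: a 1000 Hz tone sampled at 8000 Hz for 64 samples.
sample_rate = 8000
n_samples = 64
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n_samples)]

spectrum = dft_magnitudes(signal)

# The strongest frequency bin tells us which frequency dominates the signal.
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
peak_frequency_hz = peak_bin * sample_rate / n_samples
```

In the time domain the tone is 64 rapidly changing numbers; in the frequency domain it collapses to a single strong peak at the tone’s frequency.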

MFCC of a wav file

Too technical? Don’t worry if you are not familiar with Fourier transforms. Let’s just say that we are trying to represent the features of the audio signal as arrays of numbers with minimal information loss. Once we have the MFCCs, we can represent sounds as images, which makes it easy to see which part of the audio signal carries most of the useful information.

Visualizing where most of the information in wav file

You can use a plotting function to view the spectrum of an audio signal from the extracted MFCCs. The MFCC arrays (the extracted features) may not all have the same size or shape, because the clips differ in length. This can be handled either by summarizing each array over time (mean, median, standard deviation and so on) or by padding the arrays to a uniform shape. The performance of the model may differ depending on which approach you choose. In addition, audio files that contain only noise tend to lower the accuracy, so try to remove them. This will save a lot of effort later during model optimization. I will talk a little more about feature engineering during model optimization.
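Both ways of getting a uniform shape can be sketched in a few lines. This example assumes each MFCC array is a list of rows, one row per coefficient and one column per time frame; the helper names `pad_or_truncate` and `summarize_over_time` are made up for illustration.

```python
import statistics


def pad_or_truncate(mfcc, target_frames, pad_value=0.0):
    """Force a (n_coefficients x n_frames) matrix to exactly target_frames columns."""
    fixed = []
    for row in mfcc:
        row = list(row[:target_frames])  # cut clips that are too long
        row.extend([pad_value] * (target_frames - len(row)))  # pad short ones
        fixed.append(row)
    return fixed


def summarize_over_time(mfcc):
    """Collapse the time axis: one mean and one standard deviation per coefficient."""
    means = [statistics.fmean(row) for row in mfcc]
    stds = [statistics.pstdev(row) for row in mfcc]
    return means + stds
```

Padding keeps the time structure, which suits image-like models such as a 2D CNN, while summarizing gives a short fixed-length vector that works with simpler models but discards the order of the frames.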

Phase 3: Data splitting and model:

Always split the labeled data into train and test sets, so that you can measure how well your model will perform on unseen data. When it comes to choosing the model, try different models that suit the kind of problem you have. Along with that, one thing that really helps is reading. Read! Read! Read! I cannot emphasize enough the importance of reading about and trying different techniques. There are many relevant articles, discussions and models on websites such as Kaggle, GitHub, YouTube and, of course, Medium. However, keep in mind that your data may be unique, so code copied from elsewhere may not work as-is. Try to understand how a technique works underneath and choose what works best for you.
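A simple split can be sketched as below. This version is stratified, meaning every label keeps roughly the same share in both sets, and it assumes for simplicity that each clip carries exactly one label. The function name, the 20% test fraction and the seed are choices made for this example, not something prescribed by the problem.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=42):
    """Split so that each label keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # At least one test example per label, so rare labels are still evaluated.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend((sample, label) for sample in group[:n_test])
        train.extend((sample, label) for sample in group[n_test:])
    return train, test
```

Keep the test set untouched until the very end; if you tune the model against it, it stops being unseen data.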

Phase 4: Train and test the model:

Time to see what we have achieved so far with all the hard work! This is an important stage for seeing how the model is learning and performing. Some of the most common pitfalls are:

Understanding the fit of a model (source)

Over-fitting and Under-fitting:

Conceptually, overfitting is when the model memorizes the training data. It reaches high accuracy on the training set but performs poorly on test or unseen data. An overfitted model does not generalize, so its results are not reliable. Underfitting, on the other hand, means the model has failed to learn the underlying pattern in the training data. There is still room for improvement in the model, which can be addressed by adding complexity to the neural network or by training for a little longer.

How to identify if your model is over-fitting or under-fitting? Compare the training accuracy and the validation accuracy. If training accuracy is considerably higher than validation accuracy → Model is overfitting. If both training accuracy and validation accuracy are low → Model is underfitting. When validation accuracy is high and close to the training accuracy, it can be said with reasonable confidence that → Model is reliable
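As a rough sketch, the comparison can be written as a small helper. It treats a large gap between training and validation accuracy as overfitting, and low accuracy on both as underfitting. The thresholds (a gap of 0.10 and a floor of 0.70) are arbitrary assumptions for this example; sensible values depend on your data and on the number of labels.

```python
def diagnose_fit(train_accuracy, validation_accuracy,
                 gap_threshold=0.10, low_accuracy=0.70):
    """Rule-of-thumb diagnosis from two accuracies in the range 0..1."""
    gap = train_accuracy - validation_accuracy
    if gap > gap_threshold:
        # Much better on data it has seen than on data it has not.
        return "overfitting"
    if train_accuracy < low_accuracy and validation_accuracy < low_accuracy:
        # Has not even learned the training data well.
        return "underfitting"
    return "reasonable fit"


print(diagnose_fit(0.99, 0.75))
print(diagnose_fit(0.55, 0.53))
print(diagnose_fit(0.90, 0.88))
```

It is best to look at these numbers after every epoch rather than only at the end, because the point where validation accuracy stops improving while training accuracy keeps rising is where overfitting begins.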

Loss function and its role: Put simply, the loss function tells us how wrong the predictor is in predicting the correct value. During model optimization, the focus is on minimizing the loss. One interesting thing to notice is that the loss is high when we start training the model, but it starts declining as the model learns.
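For a feel of the numbers, here is a sketch using cross-entropy, a loss commonly used for classification. Choosing cross-entropy is an assumption of this example, and the two predictions are made-up values standing in for an early and a later stage of training.

```python
import math


def cross_entropy(predicted_probabilities, true_index):
    """Loss for one example: -log of the probability given to the correct class."""
    return -math.log(predicted_probabilities[true_index])


# The correct class is index 0. As training progresses, the model typically
# assigns it more probability, and the loss shrinks.
early_prediction = [0.30, 0.40, 0.30]
later_prediction = [0.90, 0.05, 0.05]

early_loss = cross_entropy(early_prediction, 0)
later_loss = cross_entropy(later_prediction, 0)

print(f"early loss: {early_loss:.3f}, later loss: {later_loss:.3f}")
```

The more probability the model puts on the correct label, the closer the loss gets to zero.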

Phase 5: Model optimization:

Now it is time to improve the performance of the model and tackle issues such as overfitting, underfitting, a high loss and low accuracy. The right measure depends on the problem. For instance, if a neural network is overfitting, add dropout so that the model learns general patterns instead of memorizing the training data.
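To show what dropout does under the hood, here is a sketch of the common “inverted dropout” formulation on a plain list of activations. Deep learning frameworks ship this as a ready-made layer, so you would not normally write it yourself; the function below is only an illustration and its name and signature are made up.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        # At test time every unit is kept and nothing is rescaled.
        return list(activations)
    keep_probability = 1.0 - rate
    return [
        # Survivors are scaled up so the expected activation stays the same.
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in activations
    ]


rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
```

Because a different random subset of units is silenced on every pass, the network cannot rely on any single unit, which pushes it towards patterns that generalize.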

Using neural networks such as a 2D CNN or an RNN, one can often reach an accuracy of around 80–90%. However, a neural network’s hunger for data is unquenchable: the more data, the better the training and the better the accuracy. An interesting technique called “data augmentation” can be used to produce more data for data-hungry neural networks. In data augmentation, a fraction of the training audio files is taken and their attributes are altered, for example by lowering or raising the pitch, adding noise, changing the duration and so on.
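Two of these augmentations can be sketched directly on a list of samples. This is a simplified illustration with made-up function names. Note that the naive resampling below changes duration and pitch together; changing only one of them needs a proper pitch-shifting or time-stretching routine from an audio library.

```python
import random


def add_noise(samples, noise_level, rng):
    """Add Gaussian noise with standard deviation `noise_level` to every sample."""
    return [s + rng.gauss(0.0, noise_level) for s in samples]


def change_speed(samples, factor):
    """Resample by linear interpolation. A factor above 1 shortens the clip and,
    when played at the original sample rate, also raises the pitch."""
    new_length = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    resampled = []
    for i in range(new_length):
        position = i * factor
        left = min(int(position), last)
        right = min(left + 1, last)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


rng = random.Random(0)
clip = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noisy_clip = add_noise(clip, noise_level=0.1, rng=rng)
faster_clip = change_speed(clip, factor=2.0)
slower_clip = change_speed(clip, factor=0.5)
```

Augment only the training set and keep the original label for every altered copy; the test set should stay as recorded so that it still measures performance on real data.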

https://github.com/nc15apr/datascience/blob/master/Audio%20Classifier.ipynb