Audio classification using a transfer learning approach


Audio classification is the task of assigning an audio segment to one of a set of classes. It is a well-known problem in the speech community and is also called acoustic event detection. Classifying an acoustic segment requires understanding the underlying frequency structure of the signal. Examples of acoustic signals are a car sound, a gunshot, cheering, and an airplane sound. We need to build a model that learns the features of each audio class so that, at evaluation time, it can assign a given audio segment to its corresponding class.

Many people have worked on the audio classification problem using many different algorithms, including GMMs, SVMs, NMF, deep neural networks, random forests, and other machine learning algorithms. With the rise of deep learning, many works have also tried recurrent neural networks and convolutional neural networks.

In our case, we want to see how to build an accurate classifier using a transfer learning approach. Transfer learning means reusing an existing model that has already been trained, with a supervised or unsupervised method, and that has learned highly discriminative features. We train only a few hidden layers on top of the existing neural network and fit the classifier using very little data, while still achieving accuracy comparable to fully trained models. All the experiments are conducted on data collected by Cogknit Semantics. We show that transfer learning can be used when little training data is available, achieving state-of-the-art accuracy with minimal compute requirements.

Transfer learning approach

The transfer learning approach uses a pretrained model, already trained on a large amount of data, as a feature extractor. In our case we use a pretrained model trained on AudioSet data, which is released by Google. AudioSet is a large dataset collected from YouTube, based on the tags provided when people upload videos. AudioSet has around 6000 hours of audio data for 527 audio classes. Google has released its model trained on the AudioSet dataset. The model is a VGG-like CNN architecture that operates on spectrograms at its lower layers and uses many convolutional layers followed by fully connected layers in its upper layers. This allows us to extract representations from the higher layers for our own audio data, which helps in transfer learning. Since this VGG network is trained on a large amount of data, it should have learnt very good discriminative features for the different audio classes, which helps classification. The inspiration comes from computer vision, where CNN architectures trained on ImageNet data give better features for modelling other tasks, for example scene classification. In this case too, we believe the CNN trained on AudioSet gives us highly discriminative features. Once the CNN has been trained, we use it for feature extraction as shown in the figure below.

The deep CNN in the above picture is the pretrained CNN model provided by Google. We generate a spectrogram for every 960 ms of audio data. The spectrogram is computed with a 25 ms window and a 10 ms shift, and its feature dimension is 64. After processing 960 ms of audio data we get a 96×64 image (the spectrogram). We forward-pass that image through the pretrained CNN and obtain a high-level embedding of dimension 128. This is repeated for all the available training data. Once we have extracted the features for all the training data, we have a training set for building a new model for our audio classification task. We then train a small DNN to classify these 128-dimensional embeddings into their classes, using the categorical cross-entropy criterion. The full pipeline of this method is shown below.
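The framing numbers above can be checked with a few lines of arithmetic. This is only a sketch: the 16 kHz sample rate and the choice of non-overlapping patches are assumptions that the text does not state, and the function names are hypothetical. The 96 rows of each patch correspond to 96 shifts of 10 ms, which is where the 960 ms comes from.

```python
# Assumption: 16 kHz mono audio (the sample rate is not given in the text).
SAMPLE_RATE_HZ = 16000
WINDOW_S = 0.025      # 25 ms analysis window (from the text)
HOP_S = 0.010         # 10 ms shift between windows (from the text)
PATCH_FRAMES = 96     # spectrogram frames per patch (from the text)
NUM_BANDS = 64        # feature dimension of each frame (from the text)


def num_frames(num_samples):
    """Number of full analysis windows that fit into the signal."""
    window = int(round(WINDOW_S * SAMPLE_RATE_HZ))  # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE_HZ))        # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def num_patches(num_samples):
    """Number of 96x64 patches, assuming patches do not overlap."""
    return num_frames(num_samples) // PATCH_FRAMES


ten_seconds = 10 * SAMPLE_RATE_HZ
print(num_frames(ten_seconds))     # frames in a 10 s clip
print(num_patches(ten_seconds))    # patches, one 128-d embedding each
print((PATCH_FRAMES, NUM_BANDS))   # shape of one input patch
```

Under these assumptions, each patch produces one 128-dimensional embedding, so a longer recording yields a sequence of embeddings rather than a single vector.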

The above picture shows the full pipeline of the transfer learning approach for audio classification. We have training data and held-out test data. We extract embeddings for all our training data and fit a DNN classifier. During testing, we extract the embedding for the test audio and expect the trained DNN to predict its class.
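A test recording longer than 960 ms produces several embeddings, and the text does not say how their predictions are combined into one label. One simple option, shown here purely as an assumption, is to average the per-patch class probabilities and pick the most likely class. The function name, the probabilities, and the three class names are hypothetical illustrations.

```python
def predict_clip(patch_probabilities, class_names):
    """Average per-patch class probabilities and return the top class.

    Assumption: averaging is one possible aggregation rule; the
    original pipeline may combine patch predictions differently.
    """
    if not patch_probabilities:
        raise ValueError("need at least one patch")
    num_classes = len(class_names)
    mean = [0.0] * num_classes
    for probs in patch_probabilities:
        for k in range(num_classes):
            mean[k] += probs[k] / len(patch_probabilities)
    best = max(range(num_classes), key=lambda k: mean[k])
    return class_names[best], mean


# Hypothetical softmax outputs of the small DNN for three patches.
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]
label, mean = predict_clip(probs, ["car", "gunshot", "cheering"])
print(label, mean)
```

Averaging smooths out a single misclassified patch, such as the second one in this made-up example.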

Experiments and Results

All our experiments use data collected by Cogknit Semantics from various sources. We have collected data for 55 classes, and each class has around 20–30 minutes of audio data. The class list is given below.

We split our data into 80% training and 20% testing. We train a 3-layer DNN classifier using the categorical cross-entropy cost function with the AdaGrad optimizer. We use the Keras deep learning library for all our experiments. We conduct 3 experiments, varying the number of classes. The results obtained are shown below.
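The setup described above could look roughly like the following sketch. Several details are assumptions, not the configuration actually used: the split is done per class (stratified), "3-layer" is read as two hidden layers plus a softmax output, and the hidden size of 256 is a placeholder. The helper names are hypothetical. Keras is imported inside the function so that the split helper runs without TensorFlow installed.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Assumption: the text only says 80% / 20%; stratifying by class
    is one reasonable way to do it.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


def build_classifier(num_classes, embedding_dim=128, hidden_units=256):
    """Small DNN on 128-d embeddings: two hidden layers and a softmax."""
    from tensorflow import keras  # lazy import; requires TensorFlow

    model = keras.Sequential([
        keras.Input(shape=(embedding_dim,)),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Categorical cross-entropy expects one-hot encoded labels.
    model.compile(
        optimizer=keras.optimizers.Adagrad(),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Toy labels: 10 embeddings of class 0 and 5 of class 1.
train_idx, test_idx = stratified_split([0] * 10 + [1] * 5)
print(len(train_idx), len(test_idx))
```

A call such as `build_classifier(num_classes=55)` would then give a model that can be fitted on the training embeddings with `model.fit`, using one-hot labels.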

Our experiments show that we can achieve very good accuracy using the transfer learning approach on the audio classification task. We also see a slight reduction in accuracy when the number of classes is increased. We attribute this to the increase in the variability of the data. Even so, the network does a very good job of capturing these variations. The demo of this project can be viewed here.


  1. Audio Set: An ontology and human-labeled dataset for audio events (PDF)
  2. CNN Architectures for Large-Scale Audio Classification (PDF)

About Me

I currently work at an AI company based in Bangalore called Cogknit Semantics. We work on speech, computer vision, and NLP problems, and we have built very good solutions in all three areas. We have published many papers in both national and international conferences. Our speech team was the runner-up in a competition conducted by Microsoft on building speech recognition systems for 3 Indian languages. Feel free to chat with us. Visit our company website here.

Source: Deep Learning on Medium