Hands-on for multi-label Audio Tagging problem using Sklearn, Keras, and Transfer learning

Original article was published by Anurag Maji on Deep Learning on Medium

In this article, I will walk you through a solution to a Kaggle competition that shows how to work with audio data and multi-label problems using Keras.


Background of the problem:

The main objective of the problem is to tag the audio data. Because of the richness of sounds, a significant amount of manual effort goes into tasks like annotating sound collections and providing captions for non-speech events in audio-visual content.

To tackle this problem, Freesound created a dataset of manually annotated audio events.

The dataset has 80 categories, and the task is to build a model that tags audio clips automatically. Since a single clip can carry more than one tag, this is a multi-label classification problem.



The dataset contains 6 files, as shown in the image below.

Dataset file structure

Train set

The train set is used to train the model. The idea is to limit the supervision provided (i.e., the manually labeled data). The train set is composed of two subsets:

Curated subset

The curated subset is a small set of manually-labeled data from FSD.

· Number of clips/class: 75, except in a few cases where there are fewer

· Total number of clips: 4970

· Average number of labels/clip: 1.2

· Total duration: 10.5 hours

The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds.

Noisy subset

The noisy subset is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset.

· Number of clips/class: 300

· Total number of clips: 19815

· Average number of labels/clip: 1.2

· Total duration: ~80 hours

The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s.

Test set

The test set is used for system evaluation and consists of manually labeled data from FSD. It covers the same 80 classes as the training data.


The six files are:

· train_curated.csv: ground truth labels for the curated subset of the training audio files

· train_noisy.csv: ground truth labels for the noisy subset of the training audio files

· sample_submission.csv: a sample submission file in the correct format, including the correct sorting of the sound categories; it contains the list of audio files found in the test.zip folder (corresponding to the public leaderboard)

· train_curated.zip: a folder containing the audio (.wav) training files of the curated subset

· train_noisy.zip: a folder containing the audio (.wav) training files of the noisy subset

· test.zip: a folder containing the audio (.wav) test files for the public leaderboard

Evaluation metric

The primary competition metric is label-weighted label-ranking average precision (lwlrap, pronounced “Lol wrap”). This measures the average precision of retrieving a ranked list of relevant labels for each test clip (i.e., the system ranks all the available labels, then the precisions of the ranked lists down to each true label are averaged).

Label weighting allows per-class values to be calculated, and still have the overall metric be expressed as a simple average of the per-class metrics.

Please refer to this notebook for the implementation of the evaluation metric.
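
To make the definition concrete, here is a minimal NumPy sketch of the metric. It is not the official competition code: the function name `lwlrap` and the toy arrays are assumptions made for this example, and ties between scores are broken by label index, which may differ from the reference implementation.

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision (illustrative sketch)."""
    truth = np.asarray(truth) > 0
    scores = np.asarray(scores, dtype=float)
    total_precision = 0.0
    total_labels = 0
    for t, s in zip(truth, scores):
        pos = np.flatnonzero(t)            # indices of the true labels
        if pos.size == 0:
            continue                       # clips without labels are skipped
        # Rank every label by score; rank 1 is the highest score.
        order = np.argsort(-s, kind="stable")
        ranks = np.empty_like(order)
        ranks[order] = np.arange(1, len(s) + 1)
        pos_ranks = ranks[pos]
        # For each true label: how many true labels sit at or above its rank.
        hits = (pos_ranks[None, :] <= pos_ranks[:, None]).sum(axis=1)
        total_precision += (hits / pos_ranks).sum()
        total_labels += pos.size
    # Averaging over labels (not clips) is what makes it "label-weighted".
    return total_precision / total_labels if total_labels else 0.0

truth = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.1, 0.2], [0.8, 0.7, 0.1]]
print(round(lwlrap(truth, scores), 4))   # 0.7222
```

In the toy example, the three true labels get precisions 1, 1/2 and 2/3, and their plain average is 13/18.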

EDA of Data

The curated data contains 4970 data points, while the noisy data contains 19815.

Total Datapoints

Both files contain 2 columns, “fname” and “labels”. The fname column holds the .wav file name, and the labels column holds one or more labels separated by commas.

A snippet of the CSV file

Let’s use Sklearn’s MultiLabelBinarizer to convert the labels column into binary indicator vectors, with one column per label.

Sklearn’s MultiLabelBinarizer()

After applying Sklearn’s MultiLabelBinarizer, we can see that there are 80 unique labels in the data and that each clip is encoded as an 80-dimensional binary vector, as expected.
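
As a small sketch of this step, the example below runs MultiLabelBinarizer on three toy rows laid out like the CSV files. The file names are made up for illustration; only the comma-separated label format comes from the dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows in the same "fname, labels" layout as train_curated.csv.
# The file names are hypothetical.
rows = [
    ("clip_a.wav", "Bark"),
    ("clip_b.wav", "Acoustic_guitar,Strum"),
    ("clip_c.wav", "Bark,Run"),
]

# MultiLabelBinarizer expects one collection of labels per sample,
# so the comma-separated string is split first.
label_lists = [labels.split(",") for _, labels in rows]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(label_lists)

print(mlb.classes_.tolist())    # ['Acoustic_guitar', 'Bark', 'Run', 'Strum']
print(y.tolist())               # [[0, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
print(y.sum(axis=0).tolist())   # clips per label: [1, 2, 1, 1]
```

Summing the encoded matrix over its rows gives the per-label clip counts used in the label distribution checks.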

The snippet below shows the number of clips for each of the first 10 labels. We can see that a few labels in the curated data are slightly imbalanced, falling short by a few data points.

Count for the sample of labels in curated data
The plot of Count for the sample of labels in curated data

In the noisy data, by contrast, the counts for the first 10 labels are perfectly balanced. I also checked the rest of the labels, and each of them has 300 clips.

Count for the sample of labels in Noisy data
The plot of Count for the sample of labels in Noisy data

Transforming Audio data into Spectrogram images

First, we have to convert the waveform data into spectrogram data.

There are 2 main methods to plot sound data.

1. Waveform plot: It plots the amplitude of the audio signal (air pressure or vibration) against time.

Waveform plot

2. Fast Fourier Transform (FFT): It plots the amplitude of the audio signal against frequency.

FFT Plot

If we consider only frequency, we lose the order in which the frequencies occur. For example, in a speech recognition system, suppose the sentence is “Hi Eva it’s Tom”. Using frequency alone we might recover the words, but without time information the sequence of the words is missing.

A spectrogram solves this by representing time and frequency in the same plot. Please refer to this article to understand the spectrogram in more detail.
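
The point about lost ordering can be checked with a short NumPy sketch. The two toy signals below contain the same two tones in opposite order; the sample rate, tone frequencies and frame length are arbitrary values chosen for the example.

```python
import numpy as np

sr = 8000                                  # sample rate in Hz (toy value)
t = np.arange(sr // 2) / sr                # 0.5 s of time stamps
low = np.sin(2 * np.pi * 440 * t)
high = np.sin(2 * np.pi * 880 * t)

a = np.concatenate([low, high])            # 440 Hz first, then 880 Hz
b = np.concatenate([high, low])            # same tones, opposite order

# 1) One FFT over the whole signal: the magnitudes are identical,
#    so the order of the tones cannot be recovered from them.
same_spectrum = np.allclose(np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b)))

# 2) One FFT per short frame (the idea behind a spectrogram):
#    the strongest frequency of each frame keeps the order.
def dominant_freqs(x, frame=1000):
    freqs = np.fft.rfftfreq(frame, d=1 / sr)
    window = np.hanning(frame)
    peaks = []
    for start in range(0, len(x) - frame + 1, frame):
        mag = np.abs(np.fft.rfft(x[start:start + frame] * window))
        peaks.append(round(float(freqs[np.argmax(mag)])))
    return peaks

print(same_spectrum)        # True
print(dominant_freqs(a))    # [440, 440, 440, 440, 880, 880, 880, 880]
print(dominant_freqs(b))    # [880, 880, 880, 880, 440, 440, 440, 440]
```

A real spectrogram keeps the full spectrum of every frame rather than only its peak, but the principle is the same.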


So, first, we have to set the parameters for the spectrogram. I do this with a class that keeps all the parameters in one place. The method and the values I use are taken from this paper and this notebook.

Class containing the properties of Spectrogram
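
Such a configuration class could look like the sketch below. Only the 2-second duration comes from this article; the class name `SpectrogramConfig` and every other value are illustrative assumptions in the spirit of [2], not necessarily the settings used here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpectrogramConfig:
    """Illustrative spectrogram settings (assumed values, see lead-in)."""
    sampling_rate: int = 44100   # Hz
    duration: int = 2            # seconds kept per clip
    n_mels: int = 128            # number of mel bands (image height)
    fmin: int = 20               # lowest frequency in Hz

    @property
    def samples(self) -> int:
        # Number of audio samples in one fixed-length clip.
        return self.sampling_rate * self.duration

    @property
    def hop_length(self) -> int:
        # Chosen so that a 2-second clip yields about 128 frames.
        return 347 * self.duration

    @property
    def n_fft(self) -> int:
        return self.n_mels * 20

    @property
    def fmax(self) -> int:
        # Nyquist frequency.
        return self.sampling_rate // 2

conf = SpectrogramConfig()
print(conf.samples, conf.hop_length, conf.n_fft, conf.fmax)   # 88200 694 2560 22050
```

Deriving the dependent values as properties keeps them consistent if the duration or the sampling rate is changed later.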

Preprocessing of Data

Based on [2], I trim every audio clip to 2 seconds, and I pad clips shorter than 2 seconds with a constant value.

Trimming the audio data to 2 seconds
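
A minimal sketch of this step is shown here. The helper name `fix_length`, the choice to keep the start of long clips, and the centred zero padding are assumptions for the example; the article only states that clips are trimmed to 2 seconds and padded with a constant.

```python
import numpy as np

def fix_length(y, n_samples):
    """Trim or zero-pad a 1-D audio array to exactly n_samples."""
    y = np.asarray(y)
    if len(y) > n_samples:
        # Long clip: keep the first n_samples (assumed strategy).
        return y[:n_samples]
    # Short clip: pad with zeros, split between both ends.
    padding = n_samples - len(y)
    left = padding // 2
    return np.pad(y, (left, padding - left), mode="constant")

print(fix_length(np.ones(10), 4).shape)      # (4,)
print(fix_length(np.ones(3), 8).tolist())    # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

For 2 seconds at a sampling rate of 44100 Hz, `n_samples` would be 88200.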

Next, using the librosa library, I convert the audio data to a spectrogram with the following functions, based on the settings in the config class.

Converting audio to Spectrogram

After setting up all the preprocessing functions, I split the noisy data and the curated data separately into Train, Test, and CV sets based on train_noisy.csv and train_curated.csv. I then generated the spectrogram images for each split in separate directories.
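
One way to make such a three-way split is to call Sklearn's train_test_split twice, as in this sketch. The split ratios, the random seed and the placeholder file names are assumptions, since the article does not state them.

```python
from sklearn.model_selection import train_test_split

# Placeholder file names standing in for the "fname" column of a CSV.
fnames = [f"clip_{i:03d}.wav" for i in range(100)]

# First hold out a test set, then carve a CV set out of the remainder.
train_cv, test = train_test_split(fnames, test_size=0.2, random_state=42)
train, cv = train_test_split(train_cv, test_size=0.25, random_state=42)

print(len(train), len(cv), len(test))   # 60 20 20
```

The same two calls would be applied once to the curated file list and once to the noisy file list.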

Directories for respective Spectrogram images

Data Augmentation:

For data augmentation, I applied random masking of rows and columns in the spectrogram for the Train data. I also applied random zoom and brightness changes to the Train data using the Keras ImageDataGenerator. Please refer to my preprocessing notebook for the implementation of random masking.

Masking of random columns and rows
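
A rough NumPy sketch of this kind of masking is given here. It is not the code from the preprocessing notebook: the function name `random_mask`, the maximum band widths and the fill value of 0 are assumptions for illustration.

```python
import numpy as np

def random_mask(spec, max_rows=8, max_cols=8, fill=0.0, rng=None):
    """Blank out one random band of rows and one random band of columns."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()                      # leave the input untouched
    n_rows, n_cols = out.shape[:2]

    # Band of frequency rows.
    h = int(rng.integers(1, min(max_rows, n_rows) + 1))
    r0 = int(rng.integers(0, n_rows - h + 1))
    out[r0:r0 + h, :] = fill

    # Band of time columns.
    w = int(rng.integers(1, min(max_cols, n_cols) + 1))
    c0 = int(rng.integers(0, n_cols - w + 1))
    out[:, c0:c0 + w] = fill
    return out

spec = np.ones((128, 128), dtype=np.float32)
masked = random_mask(spec, rng=np.random.default_rng(0))
print(masked.shape)   # (128, 128)
```

Masking is applied to training images only, so the validation and test spectrograms stay unchanged.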

Training Approach

One of the best approaches suggested in [1] is to train on the noisy data first and then use the learned weights to initialize the model for the curated data. The new model starts from these loaded weights and is then trained further on the curated data. So, I started training the model with this approach.

Pipeline of Training
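
The two-stage idea can be sketched in Keras as follows. The tiny network, the random arrays standing in for the spectrogram images, and the single epoch per stage are all placeholders chosen so that the example runs quickly; they are not the models or data used in this article.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_CLASSES = 80

def build_model():
    # Deliberately tiny stand-in architecture.
    return keras.Sequential([
        keras.Input(shape=(32, 32, 1)),
        layers.Conv2D(8, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(N_CLASSES, activation="sigmoid"),
    ])

def random_batch(n, seed):
    # Random images and sparse multi-label targets (placeholder data).
    rng = np.random.default_rng(seed)
    x = rng.random((n, 32, 32, 1), dtype=np.float32)
    y = (rng.random((n, N_CLASSES)) < 0.02).astype("float32")
    return x, y

# Stage 1: train on the (placeholder) noisy subset.
noisy_model = build_model()
noisy_model.compile(optimizer="adam", loss="binary_crossentropy")
noisy_model.fit(*random_batch(16, seed=0), epochs=1, verbose=0)

# Stage 2: start a fresh model from the noisy weights,
# then continue training on the (placeholder) curated subset.
curated_model = build_model()
curated_model.set_weights(noisy_model.get_weights())
curated_model.compile(optimizer="adam", loss="binary_crossentropy")
curated_model.fit(*random_batch(16, seed=1), epochs=1, verbose=0)
```

When the two stages run in separate notebooks, `save_weights` and `load_weights` would replace the in-memory `get_weights` and `set_weights` calls.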

Loading of Spectrogram Images using Keras Image Generator

To load the images, I am using the Keras ImageDataGenerator, which also applies the random zoom and brightness augmentation to the Train data.

Keras Image Generator

Model Architecture

First I started with a custom model with Conv2D layers and skip connections.

I created a function containing multiple Conv2D layers along with LeakyReLU and AveragePooling2D layers. I then used this function several times in my main architecture, each time with a different number of filters.

Conv Block
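
A block of this kind might look like the sketch below. The exact layer order, the placement of the skip connection, the filter counts and the 128x128x3 input shape are assumptions; the complete architecture is in the custom model's notebook.

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions with a skip connection, then 2x2 average pooling."""
    # 1x1 convolution so the shortcut has the same number of channels.
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)

    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.LeakyReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)

    y = layers.Add()([y, shortcut])        # skip connection
    y = layers.LeakyReLU()(y)
    return layers.AveragePooling2D(pool_size=2)(y)

# Reuse the block with a growing number of filters.
inputs = keras.Input(shape=(128, 128, 3))
x = inputs
for filters in (32, 64, 128):
    x = conv_block(x, filters)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(80, activation="sigmoid")(x)   # one score per label

model = keras.Model(inputs, outputs)
print(model.output_shape)   # (None, 80)
```

Each block halves the spatial resolution, so three blocks reduce a 128x128 input to 16x16 before the global pooling layer.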

Please refer to my custom model’s notebook for the complete implementation of the custom architecture.

Custom Model Architecture

For the optimizer, I am using Adam with an initial learning rate of 0.0009. For the loss function, I am using BinaryCrossentropy with sum reduction and label smoothing of 0.7, which helps keep the model from overfitting to the noisy labels. As the training metric, I am using categorical accuracy.

Setting Loss function and metrics
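
To see what a label smoothing of 0.7 does to the targets, here is a small NumPy sketch of the arithmetic. It mirrors the smoothing rule that Keras applies for binary cross-entropy, where each target y becomes y * (1 - s) + 0.5 * s; the helper names are chosen for this example.

```python
import numpy as np

def smooth_labels(y_true, label_smoothing):
    # Pull every target towards 0.5 by the smoothing factor.
    y_true = np.asarray(y_true, dtype=float)
    return y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing

def bce_sum(y_true, y_pred, label_smoothing=0.0, eps=1e-7):
    """Binary cross-entropy: mean over labels, summed over the batch."""
    y = smooth_labels(y_true, label_smoothing)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_label = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(per_label.mean(axis=-1).sum())

print(smooth_labels([1, 0], 0.7).round(2).tolist())   # [0.65, 0.35]
```

With a factor of 0.7, a positive label becomes 0.65 and a negative label becomes 0.35, so the model is never pushed towards fully confident predictions.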

I am using the following callbacks:

  1. CSVLogger is used to record the loss and metric scores of every training epoch.
  2. ReduceLROnPlateau is used to decrease the learning rate when the monitored metric has not improved for a given number of epochs.
  3. ModelCheckpoint is used to save the weights whenever the monitored metric improves.
  4. EarlyStopping is used to stop training if the monitored metric has not improved for a certain number of epochs.
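
The four callbacks can be set up as in this sketch. The monitored quantity, the patience values, the reduction factor and the file names are assumptions, since the article does not list them.

```python
from tensorflow import keras

callbacks = [
    # Write the loss and metrics of every epoch to a CSV file.
    keras.callbacks.CSVLogger("training_log.csv"),
    # Halve the learning rate after 3 epochs without improvement.
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6
    ),
    # Keep only the best weights seen so far.
    keras.callbacks.ModelCheckpoint(
        "best.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
    # Stop after 8 epochs without improvement.
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=8, restore_best_weights=True
    ),
]

# The list is then passed to training, for example:
# model.fit(train_data, validation_data=cv_data, epochs=50, callbacks=callbacks)
```

Giving EarlyStopping a larger patience than ReduceLROnPlateau lets the lower learning rate take effect before training is stopped.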

The result on Custom Model:

On the noisy data, I get a test score of 0.134 and a Kaggle score of 0.125 on the submission data. After transferring the noisy model's weights and training on the curated data, I get a test score of 0.222 and a Kaggle score of around 0.201.

So the curated model shows some improvement, but to get a better score we would have to add more layers to the model and tune its other important hyperparameters.

Transfer Learning:

Another way to improve the score is to use the well-established architectures of popular models available in Keras.

In the snippet below, I import the DenseNet169 architecture and drop its top layers by keeping everything up to the ‘avg_pool’ layer. After the ‘avg_pool’ layer, I add Flatten, Dense, and Activation layers sized for the 80 classes in our multi-label data.

Importing DenseNet169 model
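
A comparable setup can be sketched as follows. With `include_top=False` and `pooling="avg"`, the base model already ends at its global average pooling output, which plays the role of the ‘avg_pool’ cut described above. The input shape, `weights=None` (which avoids a download here; `"imagenet"` would load pretrained weights) and the sigmoid activation are assumptions for this example.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_CLASSES = 80

# DenseNet169 without its classification head, ending in average pooling.
base = keras.applications.DenseNet169(
    include_top=False,
    weights=None,                 # assumption: no pretrained weights in this sketch
    pooling="avg",
    input_shape=(128, 128, 3),    # assumed spectrogram image size
)

x = base.output                   # output of the global average pooling layer
x = layers.Flatten()(x)
x = layers.Dense(N_CLASSES)(x)
# Sigmoid gives an independent probability per label (multi-label setting).
outputs = layers.Activation("sigmoid")(x)

model = keras.Model(base.input, outputs)
print(model.output_shape)   # (None, 80)
```

Swapping `DenseNet169` for another class in `keras.applications` gives the same head on a different backbone.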

Similarly, I trained a few more architectures, such as ResNet50 and MobileNet.

For each of these architectures I followed the same procedure as for the custom model: I first trained the model on the noisy data, then transferred the trained weights to the curated model, and then trained it on the curated data.

Please check the following tables for the evaluation metric scores of different models.

Scores on different models

So, the best Kaggle score on the submission data, 0.471, comes from DenseNet169.

I also applied K-fold cross-validation to improve the model’s performance. Please check this notebook.

Please check my git repository containing all the notebooks for the implementation of all the models, EDA, preprocessing, and K-fold cross-validation implementation.

Future Scope

  1. Tune the hyperparameters and layers of the custom model to improve the scores.
  2. Implement additional augmentation techniques.
  3. Study the winning solution [3] for ideas that lead to better results.


[1] https://arxiv.org/pdf/1807.09902.pdf

[2] https://www.kaggle.com/daisukelab/cnn-2d-basic-solution-powered-by-fast-ai

[3] https://github.com/lRomul/argus-freesound

[4] https://keras.io/api/applications

[5] https://www.appliedaicourse.com