Sound-Based Bird Classification

Source: Deep Learning on Medium


How a group of Polish women used deep learning, acoustics and ornithology to classify birds

Have you ever wondered about the name of the bird you just heard singing? A group of women from the local Polish chapter of the Women in Machine Learning & Data Science (WiMLDS) organization not only wondered about it but also decided to build their own solution for detecting bird species from the sounds they make.

Female data scientists, PhD candidates, ornithologists, data analysts and software engineers who had prior experience with Python joined forces in a series of two-week-long sprints to work together on the project.

The project was designed as a collaboration on a real-life problem that machine learning can help to solve. It followed the typical structure of a data science project: data research and analysis, data preparation, model creation, analysis of results (or model improvement) and the final presentation.

After several weeks of work, the group built a solution that predicts the right bird’s name with 87% accuracy on the test sample.

Are you curious about the solution that has been built? We invite you to travel into the world of birdsong.

The birds’ problem

Birdsong analysis and classification is a very interesting problem to tackle.

Birds have many types of voices, and the different types have different functions. The most common are the song and ‘other voices’ (e.g. call-type).

The song is the “prettier”, melodic type of voice, with which birds mark their territory and attract partners. It is usually much longer and more complex than a call.

Call-type voices include contact, enticing and alarm calls. Contact and enticing calls keep birds together in a group during flight or foraging, for example in the treetops, while alarm calls alert other birds, for example when a predator arrives. Most often these are short and simple voices.


Great tit (Parus major)
  • The song is a simple, lively, rhythmic verse with a slightly mechanical sound, e.g. “te-ta te-ta te-ta”, or a three-syllable verse with a different accent, “te-te-ta te-te-ta te-te-ta”.
  • The call has a rich repertoire: joyful “ping ping” voices, a cheerful “si yut-tee yut-tee” and a chattering “te tuui”. In the autumn you can often hear a slightly questioning, shyer “te te tiuh”. The bird warns with a hoarse, crackling “yun-yun-yun-yun”. Fledglings fill the forest with a persistent, penetrating “te-te-te te-te-te”.


Why can sound-based bird classification be a challenging task?

There are many problems you can encounter:

  • background noise: especially in data recorded in a city (e.g. city noises, churches, cars)
  • multi-label classification: many species may be singing at the same time
  • different types of bird voices (as described earlier)
  • intra-species variance: birdsong can differ between populations of the same species living in different regions or countries
  • data set issues: the data can be highly imbalanced because some species are recorded far more often than others, there is a large number of different species, and recordings vary in length and quality (volume, cleanliness)

So, how were the problems solved in the past?

Recognizing birds just by their songs is a difficult task, but that does not mean it is impossible. So how can those problems be handled?

To find the answer, the team dived into research papers and discovered that most of the work had been initiated by various AI challenges, such as BirdCLEF and DCASE. Fortunately, the winners of those challenges usually describe their approaches, so checking the leaderboards yielded some interesting insights:

  • almost all winning solutions used Convolutional Neural Networks (CNNs) or Recurrent Convolutional Neural Networks (RCNNs)
  • the gap between CNN-based models and shallow, feature-based approaches remained considerable
  • even though many of the recordings were quite noisy, the CNNs worked well without any additional noise removal, and many teams claimed that noise reduction techniques did not help
  • data augmentation techniques seemed to be widely used, especially those common in audio processing, such as time or frequency shift
  • some winning teams successfully used semi-supervised learning methods (pseudo-labeling), and some increased AUC by ensembling models
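
The time and frequency shifts mentioned in the list above can be sketched directly on a spectrogram array. This is a minimal illustration using NumPy only, not the augmentation used by any of the challenge winners; the function names, the shift limits and the random test array are assumptions made for the example.

```python
import numpy as np

def time_shift(spec, max_fraction=0.2, rng=None):
    """Circularly shift a spectrogram along the time axis (axis 1)."""
    if rng is None:
        rng = np.random.default_rng()
    max_shift = int(spec.shape[1] * max_fraction)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=1)

def frequency_shift(spec, max_bins=4, rng=None):
    """Shift a spectrogram up or down by a few frequency bins (axis 0).

    Rows that become empty are filled with the spectrogram's minimum value.
    """
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-max_bins, max_bins + 1))
    shifted = np.full_like(spec, spec.min())
    if shift > 0:
        shifted[shift:, :] = spec[:-shift, :]
    elif shift < 0:
        shifted[:shift, :] = spec[-shift:, :]
    else:
        shifted[:] = spec
    return shifted

# Hypothetical spectrogram with shape (frequency bins, time frames)
spec = np.random.default_rng(0).random((128, 216))
augmented = frequency_shift(time_shift(spec))
```

Each call produces a slightly different copy of the same recording, which gives the network more varied examples of a species without collecting new audio.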

But how can CNNs, neural networks designed to extract features from images in order to classify or segment them, be applied when we only have sound recordings? The mel spectrogram is the answer.


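import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the first 10 seconds of the mp3 file (SOUND_DIR holds the path to the recording)
signal, sr = librosa.load(SOUND_DIR, duration=10)  # sr = sampling rate

# Mel-spectrogram parameters
N_FFT = 1024
HOP_SIZE = 1024
N_MELS = 128
WIN_SIZE = 1024       # equals N_FFT, which is also librosa's default window length
WINDOW_TYPE = 'hann'  # librosa's default window
FEATURE = 'mel'
FMIN = 1400           # lowest frequency (Hz) kept by the mel filter bank

# Compute the mel spectrogram.
# The arguments after `sr=sr` are an assumption based on the constants above.
S = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=N_FFT,
                                   hop_length=HOP_SIZE, n_mels=N_MELS, fmin=FMIN)

# Plot the mel spectrogram on a decibel scale
plt.figure(figsize=(10, 4))
librosa.display.specshow(librosa.power_to_db(S**2, ref=np.max), fmin=FMIN, y_axis='linear')
plt.colorbar(format='%+2.0f dB')

Example of a mel spectrogram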

But what is it and how does it work?

Each sound we hear is composed of multiple sound frequencies at the same time. That is what makes the audio sound “deep”.
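
To make this concrete, the following sketch mixes two pure tones and uses a Fourier transform to recover the frequencies hidden in the waveform. The sampling rate and the two tone frequencies are arbitrary values chosen for the illustration.

```python
import numpy as np

sr = 8000                       # sampling rate in Hz (arbitrary for this demo)
t = np.arange(sr) / sr          # one second of time stamps

# Synthetic "sound": a 440 Hz tone plus a quieter 1760 Hz tone
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)

# Magnitude spectrum and the frequency (Hz) of each bin
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks sit at the two frequencies mixed into the signal
top_two = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(top_two)  # the peaks are at 440 Hz and 1760 Hz
```

A spectrogram repeats this analysis on short consecutive windows of the recording, so that the frequency content can be followed over time.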

A spectrogram plots those frequencies over time in a single image, instead of showing only the amplitude as a waveform does. The mel scale is a scale of pitches that listeners perceive as equally spaced, so it reflects the way humans hear. Combining the two ideas gives the mel spectrogram: a spectrogram whose frequency axis follows the mel scale, which devotes more resolution to the frequency ranges where human hearing tells pitches apart well and less to the rest.
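
The mel scale can be written as a simple formula. The sketch below uses one common variant (the HTK formula, m = 2595 · log10(1 + f / 700)); other variants exist, and librosa's default is a different one, so treat the exact numbers as illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK variant of the formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Nine points equally spaced on the mel scale between 0 Hz and 8000 Hz
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(8000), num=9)
hz_points = mel_to_hz(mel_points)

# Equal steps in mel correspond to ever larger steps in Hz
print(np.round(hz_points))
print(np.round(np.diff(hz_points)))
```

Because the steps in Hz grow with frequency, a mel filter bank uses narrow bands at low frequencies and wide bands at high frequencies.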

The longer the audio clip behind a spectrogram, the more information the image contains, but also the more prone the model becomes to overfitting. If the data contains a lot of noise or silence, a 5-second clip may miss the relevant information. It was therefore decided to create images from 10-second clips, which increased the final model accuracy by 10%. Since birds sing at high frequencies, a high-pass filter was applied to remove irrelevant low-frequency noise.
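
The two preprocessing ideas in this paragraph, fixed-length clips and a high-pass filter, can be sketched with NumPy alone. This is not the project's pipeline: the helper names, the silence threshold and the crude FFT-based filter are assumptions for illustration, and the synthetic recording stands in for real audio.

```python
import numpy as np

def split_into_clips(signal, sr, clip_seconds=10.0, min_rms=1e-3):
    """Cut a recording into fixed-length clips, dropping near-silent ones."""
    clip_len = int(clip_seconds * sr)
    clips = []
    for start in range(0, len(signal) - clip_len + 1, clip_len):
        clip = signal[start:start + clip_len]
        if np.sqrt(np.mean(clip ** 2)) >= min_rms:  # root-mean-square loudness
            clips.append(clip)
    return clips

def highpass_fft(signal, sr, cutoff_hz):
    """Crude high-pass filter: zero every FFT bin below cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Synthetic 25 s recording: 100 Hz hum plus a quieter 3000 Hz "song"
sr = 8000
t = np.arange(25 * sr) / sr
recording = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
recording[10 * sr:20 * sr] = 0.0   # ten seconds of silence in the middle

clips = split_into_clips(recording, sr)                  # keeps only the first clip
filtered = [highpass_fft(clip, sr, cutoff_hz=1400) for clip in clips]
```

In this toy recording the silent clip and the incomplete last 5 seconds are discarded, and the filter removes the low hum while leaving the high-pitched tone intact.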

Examples of 5s spectrograms with not enough information (silence) and predominantly noise

Time to model!

After creating mel spectrograms with a high-pass filter from the 10-second audio files, the data was split into train (90%), validation (10%) and test (10%) sets.

IM_SIZE = (224,224,3) 
BIRDS = ['0Parus', '1Turdu', '2Passe', '3Lusci', '4Phoen', '5Erith',
'6Picap', '7Phoen', '8Garru', '9Passe', '10Cocco', '11Sitta','12Alaud', '13Strep', '14Phyll', '15Delic','16Turdu', '17Phyll','18Fring', '19Sturn', '20Ember', '21Colum', '22Trogl', '23Cardu','24Chlor', '25Motac', '26Turdu']
DATA_PATH = 'data/27_class_10s_2/'

Built-in Keras library data generators take care of data augmentation and normalization of all spectrograms.

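from keras.preprocessing.image import ImageDataGenerator
from efficientnet.keras import preprocess_input  # assumed source of preprocess_input

# BATCH_SIZE must be defined beforehand; its value is not given here.
# flow_from_directory expects target_size as (height, width), hence IM_SIZE[:2].

# Training generator; the augmentation arguments that followed
# preprocessing_function are not shown in the source.
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_batches = train_datagen.flow_from_directory(
    DATA_PATH + 'train', classes=BIRDS, target_size=IM_SIZE[:2],
    class_mode='categorical', shuffle=True, batch_size=BATCH_SIZE)

# Validation generator (no shuffling, so predictions stay aligned with labels)
valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
valid_batches = valid_datagen.flow_from_directory(
    DATA_PATH + 'val', classes=BIRDS, target_size=IM_SIZE[:2],
    class_mode='categorical', shuffle=False, batch_size=BATCH_SIZE)

# Test generator
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_batches = test_datagen.flow_from_directory(
    DATA_PATH + 'test', classes=BIRDS, target_size=IM_SIZE[:2],
    class_mode='categorical', shuffle=False, batch_size=BATCH_SIZE)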

The final model was based on EfficientNetB3 and trained on 27 classes (bird species) using the Adam optimizer, a categorical cross-entropy loss function and balanced class weights. The learning rate was reduced on plateau.
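
Balanced class weights give rare species a larger say in the loss. The sketch below reproduces the usual "balanced" heuristic, weight = n_samples / (n_classes × class count), with NumPy; the label counts are invented for the example.

```python
import numpy as np

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Invented, imbalanced label set: 80, 15 and 5 recordings of three species
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights every class contributes the same total weight to the loss, so the rarest species is not drowned out by the most common one.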

import numpy as np
import efficientnet.keras as efn  # assumed source of the `efn` alias
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import Adam
from sklearn.utils import class_weight

# Define CNN's architecture: EfficientNetB3 backbone plus a softmax classifier
net = efn.EfficientNetB3(include_top=False, weights='imagenet',
                         input_tensor=None, input_shape=IM_SIZE)
x = net.output
x = Flatten()(x)
x = Dropout(0.5)(x)
output_layer = Dense(len(BIRDS), activation='softmax', name='softmax')(x)
net_final = Model(inputs=net.input, outputs=output_layer)
net_final.compile(optimizer=Adam(), loss='categorical_crossentropy',
                  metrics=['accuracy'])

# Estimate class weights for unbalanced dataset
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_batches.classes),
    y=train_batches.classes)
class_weights = dict(enumerate(class_weights))  # Keras expects {class_index: weight}

# Define callbacks
ModelCheck = ModelCheckpoint('models/efficientnet_checkpoint.h5', monitor='val_loss',
                             verbose=0, save_best_only=True,
                             save_weights_only=True, mode='auto', period=1)
ReduceLR = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=3e-4)

# Train the model
net_final.fit_generator(train_batches,
                        validation_data=valid_batches,
                        epochs=30,
                        steps_per_epoch=1596,
                        class_weight=class_weights,
                        callbacks=[ModelCheck, ReduceLR])

The solution summary: audio data preprocessing and the neural network model
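
To show what reducing the learning rate on plateau does, here is a simplified pure-Python sketch of the rule. It is not the Keras implementation (which also has a minimum-improvement threshold and a cooldown); the defaults mirror the values in the code above plus Adam's default starting rate of 0.001 in Keras, and the list of validation losses is invented.

```python
def reduce_on_plateau(val_losses, lr=1e-3, factor=0.2, patience=5, min_lr=3e-4):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` (but never drops below `min_lr`)
    whenever the validation loss has not improved for `patience` epochs.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Invented validation losses: three improving epochs, then a long plateau
losses = [1.0, 0.8, 0.7] + [0.7] * 12
history = reduce_on_plateau(losses)
print(history)
```

With these settings a single reduction already reaches the floor: 0.001 × 0.2 would be 0.0002, so the rate is clamped to the minimum of 0.0003 and later plateaus leave it unchanged.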

On the test sample, the final solution predicted the right bird’s name with 87% accuracy, with:

  • 11 classes having F1-score over 90%
  • 8 classes having F1-score between 70% and 90%
  • 2 classes having F1-score between 50% and 70%
  • 6 classes having F1-score below 50%.

The classification report of the neural network model
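
For readers unfamiliar with the per-class F1-score, the sketch below derives it, together with the overall accuracy, from a confusion matrix. The 3×3 matrix is made up for illustration and is unrelated to the project's results.

```python
import numpy as np

def per_class_f1(confusion):
    """F1-score per class from a confusion matrix (rows = true, columns = predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)              # correctly classified samples per class
    predicted = confusion.sum(axis=0)    # how often each class was predicted
    actual = confusion.sum(axis=1)       # how often each class really occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(tp), where=denom > 0)

# Made-up confusion matrix for three species
confusion = np.array([[9, 1, 0],
                      [2, 6, 2],
                      [0, 3, 7]])

f1 = per_class_f1(confusion)
accuracy = np.trace(confusion) / confusion.sum()
print(np.round(f1, 2), round(float(accuracy), 2))
```

Accuracy summarizes the whole test set in one number, while the per-class F1-scores reveal which species the model still confuses, which is why the report lists both.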

If you are interested in seeing the code in a Jupyter notebook, you can find it here: