[Week #2— Rock or Not? ♫]

Source: Deep Learning on Medium

☞ This sure does.

We are Defne Tunçer & Kutay Barçin, and this is the second article in the series about our Machine Learning course project on music genre classification. So let’s get started!


The FMA dataset, a dump of the Free Music Archive, is suitable for evaluating several tasks in MIR (Music Information Retrieval), a field concerned with browsing, searching, and organizing large music collections. It includes 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.

For computational efficiency and information integrity, this project uses a subset of 25,000 tracks, each 30 seconds long, spread unevenly across 16 top-level genres.

After cleaning the data (dropping tracks with 0 seconds of audio and tracks with less than 30 seconds of audio), we are left with a total of 24,979 tracks. These tracks were split into a training set, a validation set, and a test set of 19,983, 2,498, and 2,498 tracks, respectively, with all data shuffled randomly.
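To make the split sizes concrete, here is a minimal sketch of a shuffled split. The 80/10/10 fractions, the seed, and the function name `split_indices` are our assumptions for illustration; the numbers above only tell us the resulting set sizes, which these fractions happen to reproduce.

```python
import random


def split_indices(n_tracks, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle track indices and cut them into train/validation/test lists."""
    indices = list(range(n_tracks))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_val = round(n_tracks * val_frac)
    n_test = round(n_tracks * test_frac)
    n_train = n_tracks - n_val - n_test  # the training set takes the remainder

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(24979)
print(len(train), len(val), len(test))  # 19983 2498 2498
```

Splitting indices rather than the feature rows themselves keeps the sketch independent of how the features are stored.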


The features are generated using LibROSA (a Python package for music and audio analysis). Each track contains 518 attributes categorized in 11 audio features:

[‘chroma_cens’, ‘chroma_cqt’, ‘chroma_stft’, ‘mfcc’, ‘rmse’, ‘spectral_bandwidth’, ‘spectral_centroid’, ‘spectral_contrast’, ‘spectral_rolloff’, ‘tonnetz’, ‘zcr’]

Each feature is stored as a set of summary statistics, including [‘kurtosis’, ‘max’, ‘mean’, ‘median’, ‘min’, ‘skew’, ‘std’].
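The figure of 518 attributes can be checked with a little arithmetic. The per-feature dimension counts below are not stated in this article; they are our understanding of how the FMA features are laid out (12 chroma bins, 20 MFCCs, 7 spectral contrast bands, 6 tonnetz dimensions, and one dimension for each remaining feature), so treat them as an assumption.

```python
# Assumed number of dimensions per feature (not given in the article itself).
feature_dims = {
    "chroma_cens": 12,
    "chroma_cqt": 12,
    "chroma_stft": 12,
    "mfcc": 20,
    "rmse": 1,
    "spectral_bandwidth": 1,
    "spectral_centroid": 1,
    "spectral_contrast": 7,
    "spectral_rolloff": 1,
    "tonnetz": 6,
    "zcr": 1,
}
statistics = ["kurtosis", "max", "mean", "median", "min", "skew", "std"]

n_dims = sum(feature_dims.values())
n_attributes = n_dims * len(statistics)  # one column per (dimension, statistic)
print(n_dims, n_attributes)  # 74 518
```

Under these assumptions, 74 feature dimensions times 7 statistics gives exactly the 518 attributes per track.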

Let’s go through the features with an example audio:

Zero Crossing Rate (zcr)

The zero crossing rate indicates how often a signal crosses the horizontal axis, that is, how many times it changes sign within a given stretch of audio.
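As a rough illustration of the definition, the sketch below counts sign changes in a synthetic sine wave using only the standard library. The sample rate, the 100 Hz tone, and the helper name `zero_crossings` are our own choices for the example. LibROSA provides `librosa.feature.zero_crossing_rate` for real audio; this is not its implementation.

```python
import math


def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(
        1 for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)
    )


sr = 8000       # assumed sample rate in Hz
freq = 100      # assumed test tone in Hz
signal = [math.sin(2 * math.pi * freq * n / sr) for n in range(sr)]  # 1 second

crossings = zero_crossings(signal)
rate = crossings / len(signal)  # crossings per sample
print(crossings, rate)
```

A 100 Hz sine crosses zero twice per period, so one second of it yields a count close to 200. Noisy or percussive sounds give much higher rates than low, tonal ones.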

We detect onsets at [0.279, 0.766, 1.277, 1.533, 1.765] seconds, sort the 100-ms segments beginning at each onset by zero crossing rate, and then concatenate the sorted segments.
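The sort-and-concatenate step can be sketched without any audio file. Here we take the onset times above as given instead of detecting them, and we stand in for the real recording with a synthetic signal that has a short tone burst at each onset. The sample rate and burst frequencies are hypothetical values chosen only to make the ordering easy to see.

```python
import math


def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


sr = 8000                                        # assumed sample rate in Hz
onset_times = [0.279, 0.766, 1.277, 1.533, 1.765]  # seconds, from the text
burst_freqs = [800, 200, 1600, 400, 100]         # hypothetical tone per onset
seg_len = int(0.100 * sr)                        # 100 ms in samples

# Build 2 seconds of silence with a 100-ms tone burst at each onset.
signal = [0.0] * (2 * sr)
starts = [int(round(t * sr)) for t in onset_times]
for start, f in zip(starts, burst_freqs):
    for n in range(seg_len):
        signal[start + n] = math.sin(2 * math.pi * f * n / sr)

# Cut out the segment at each onset, sort by zero crossings, then concatenate.
segments = [signal[s:s + seg_len] for s in starts]
order = sorted(range(len(segments)), key=lambda i: zero_crossings(segments[i]))
sorted_signal = [x for i in order for x in segments[i]]

print(order)
print(len(sorted_signal))
```

Since every segment has the same length, sorting by the raw crossing count gives the same order as sorting by rate. In this toy signal the lowest-pitched burst comes first and the highest-pitched last.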

Mel Frequency Cepstral Coefficients (mfcc)

The mel frequency cepstral coefficients of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of a spectral envelope. In MIR, they are often used to describe timbre.
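To show where these coefficients come from, here is a from-scratch sketch that computes MFCCs for a single frame: power spectrum, triangular mel filterbank, logarithm, then a discrete cosine transform. It is not LibROSA's implementation (`librosa.feature.mfcc` is what you would use in practice). The mel formula, the 26 filters, the 13 coefficients, the frame size, and all function names are assumptions made for the example.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))) ** 2
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sr / 2)
    hz_points = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_mels + 1):
        lo, mid, hi = hz_points[j - 1], hz_points[j], hz_points[j + 1]
        row = []
        for f in bin_freqs:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sr, n_mels=26, n_mfcc=13):
    """MFCCs of one frame: log mel energies followed by an unnormalized DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sr)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_mels)
            for j, e in enumerate(log_energies))
        for c in range(n_mfcc)
    ]


sr = 8000  # assumed sample rate in Hz
frame = [math.sin(2 * math.pi * 440 * n / sr) for n in range(256)]
coeffs = mfcc_frame(frame, sr)
print(len(coeffs))  # 13
```

Keeping only the first few DCT coefficients is what makes the representation compact: they capture the smooth shape of the log mel spectrum and discard fine detail.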

mfcc (left), scaled-mfcc (right)