Deep Learning based Pitch Detection (CNN, LSTM)

Source: Deep Learning on Medium


The use of deep learning models in audio signal tasks has increased substantially over time. The success of these models, however, depends largely on labeled datasets and their availability. Using digital data to render sufficient training datasets has gained attention because it reduces the dependence on real world data. However, making such digital data as realistic as the corresponding real world data takes great human effort.

In this theoretical explanation, an alternative prototype based on digital data is proposed for pitch detection. Since the relevant information in the signal is what matters for pitch detection, a synthesized audio signal is generated from MIDI data. Instead of extracting pitch values from the audio signal with conventional methods, a pitch tracker built on a deep neural network does the job, even in the presence of noise. The proposed approach is an efficient way to meet this requirement. The evaluation on real world data shows promising results. Finally, an analysis on real world data shows the criteria a dataset needs to meet for successful detection.

The appropriateness of the approach is demonstrated by training state-of-the-art neural network architectures for pitch detection. Monophonic recordings from the LMD (Lakh MIDI Dataset) are considered. The trained model is then evaluated on the Bach dataset to check its performance on real world data. The experiments show that synthetic data helps the model train and detect pitches when real world data is passed in. Instead of creating real world datasets for statistical pitch detection methods, which can be complex, synthetic data can be used as long as sound properties are taken into account.


Motivation can be best framed by naming the applications.


  • Sound transformations
  • Capturing the florid melodies of world music cultures
  • Music notation: transcribing real performances into scores
  • Converting a signal captured by a microphone into a MIDI number


MIDI (Musical Instrument Digital Interface)

MIDI has widespread applications in digital music. It is a set of instructions that tells an electronic device how to produce a certain sound. These instructions are events or messages that specify notation, pitch, velocity, clock signals (tempo), etc. Examples of MIDI messages are Note Off, Note On, Polyphonic Key Pressure, Song Position Pointer and Song Select.
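To make the idea of a MIDI message concrete, here is a minimal sketch of how a pitch relates to a MIDI note number and how Note On and Note Off messages look as raw bytes. It assumes the common tuning convention where note 69 (A4) is 440 Hz. The function names are hypothetical and chosen only for this illustration.

```python
import math

A4_HZ = 440.0   # assumed tuning reference
A4_MIDI = 69    # MIDI note number conventionally assigned to A4


def midi_to_hz(note):
    """Frequency in Hz of a MIDI note number (equal temperament)."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency in Hz."""
    return round(A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ))


def note_on(channel, note, velocity):
    """Raw Note On message: status byte 0x90 | channel, then note and velocity."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


def note_off(channel, note, velocity=0):
    """Raw Note Off message: status byte 0x80 | channel, then note and velocity."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])
```

For example, `hz_to_midi(261.63)` maps middle C to note 60, which is the kind of conversion a pitch detector performs when it turns a microphone signal into a MIDI number.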

Constant-Q Transform:

Here the audio signals are converted to time-frequency representations by applying the Constant-Q Transform (CQT). There are many other transforms, but the CQT suits this particular goal well because its frequency bins line up with musical pitches.

Let’s delve a little into the concept. We want a frequency response that matches that of the basilar membrane of the cochlea in the ear. The frequencies of western music are geometrically spaced, while an FFT yields linearly spaced frequency bins, so the FFT bins do not map efficiently onto musical frequencies.

The centre frequencies of the CQT bins are geometrically distributed, with the same number of bins in every octave. With this distribution, high frequencies get higher temporal resolution and low frequencies get higher frequency resolution.
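The geometric spacing can be sketched in a few lines. The code below follows the direct formulation in Brown (1991): bin k has centre frequency f_min * 2^(k/b), the quality factor is Q = 1 / (2^(1/b) - 1), and the analysis window for bin k is Q * fs / f_k samples long with a Hamming window. It is a slow, naive illustration and not necessarily what the original work used. The function names and the parameter values in the usage note are assumptions.

```python
import cmath
import math


def cqt_geometry(f_min, bins_per_octave, n_bins, sample_rate):
    """Return Q, the centre frequencies, and the window length of each bin."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    centres = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    # Low bins get long windows (fine frequency resolution),
    # high bins get short windows (fine time resolution).
    lengths = [math.ceil(q * sample_rate / f) for f in centres]
    return q, centres, lengths


def naive_cqt_frame(samples, f_min, bins_per_octave, n_bins, sample_rate):
    """Magnitude of each CQT bin for one frame starting at samples[0]."""
    q, _, lengths = cqt_geometry(f_min, bins_per_octave, n_bins, sample_rate)
    magnitudes = []
    for n_k in lengths:
        if n_k > len(samples):
            raise ValueError("frame is shorter than the longest analysis window")
        acc = 0j
        for n in range(n_k):
            window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_k - 1))
            acc += window * samples[n] * cmath.exp(-2j * math.pi * q * n / n_k)
        magnitudes.append(abs(acc) / n_k)
    return magnitudes
```

With 12 bins per octave starting at 220 Hz and an 8 kHz sample rate, bin 12 sits at 440 Hz, so a pure 440 Hz tone produces its largest magnitude in that bin. The window lengths shrink as the bin index grows, which is exactly the resolution trade-off described above.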

Unique features of CQT

Block Diagram of CNN based Denoiser

The figure below shows the block diagram of the CNN based denoiser. The MIDI data is fed to the synthesizer, which uses a naive technique to convert the data into time domain samples. These samples are transformed into a time-frequency representation using the CQT, and the result is marked as the ‘Clean’ CQT. To prepare inputs for the CNN, noise is added to the time domain samples, which are then transformed into a time-frequency representation marked as the ‘Noisy’ CQT. The ‘Noisy’ CQT is fed into the CNN and the corresponding ‘Clean’ CQT is used as the label for training. The trained CNN is expected to produce denoised outputs.
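The data preparation described above can be sketched as follows. The source only says the synthesizer is "naive", so this sketch assumes one sine wave per note and white Gaussian noise at a chosen signal-to-noise ratio. Both choices, along with the function names and the note list format, are assumptions made for illustration.

```python
import math
import random


def synthesize(notes, sample_rate=8000):
    """Naive synthesizer (assumption): one sine wave per (midi_note, seconds) pair."""
    samples = []
    for note, duration in notes:
        freq = 440.0 * 2.0 ** ((note - 69) / 12.0)
        count = int(duration * sample_rate)
        samples.extend(
            math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(count)
        )
    return samples


def add_noise(clean, snr_db, rng=None):
    """Add white Gaussian noise (assumption) so the result has roughly snr_db of SNR."""
    rng = rng or random.Random()
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in clean]


def make_training_pair(notes, snr_db, rng=None, sample_rate=8000):
    """Return (noisy, clean) time domain signals for one training example."""
    clean = synthesize(notes, sample_rate)
    noisy = add_noise(clean, snr_db, rng)
    return noisy, clean
```

Each signal of the pair would then be passed through the CQT, giving the ‘Noisy’ CQT as the network input and the ‘Clean’ CQT as its training target.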

Figure: Block Diagram of CNN based Denoiser

Block Diagram of LSTM based Pitch Detection

A Long Short-Term Memory (LSTM) network, a variant of the Recurrent Neural Network (RNN), is used to detect the pitch. The figure below shows the block diagram of the LSTM based pitch detection. The denoised output from the CNN serves as the input to the network, and the MIDI data serves as the labels for training. The data is preprocessed so that it matches the input format the LSTM network expects. The trained LSTM network is expected to track the pitch from the denoised spectrogram.
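The preprocessing step is not spelled out in the source, so the following is only one plausible sketch. It assumes that each spectrogram frame gets one MIDI note label, that -1 marks frames with no active note, and that frames are cut into fixed-length sequences, since recurrent layers usually take input shaped as (sequences, time steps, features). All names and parameters are hypothetical.

```python
NO_NOTE = -1  # assumed label for frames where no note is sounding


def frame_labels(notes, n_frames, hop_seconds):
    """One label per frame from (midi_note, onset_s, offset_s) tuples."""
    labels = []
    for i in range(n_frames):
        t = i * hop_seconds
        label = NO_NOTE
        for note, onset, offset in notes:
            if onset <= t < offset:
                label = note
                break
        labels.append(label)
    return labels


def to_sequences(frames, labels, seq_len):
    """Cut frames and labels into non-overlapping sequences of seq_len steps.

    Trailing frames that do not fill a whole sequence are dropped.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    xs, ys = [], []
    for start in range(0, len(frames) - seq_len + 1, seq_len):
        xs.append(frames[start:start + seq_len])
        ys.append(labels[start:start + seq_len])
    return xs, ys
```

Here `frames` would be the denoised CQT frames (one list of bin magnitudes per time step), and the note list would come from the same MIDI data that produced the audio.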

Figure: Block Diagram of LSTM based Pitch Detection

Results and Evaluation:

The deep learning based pitch detection was evaluated on real world audio signals. Training with synthetic data led to very promising results, with 95% accuracy. When tested on real world signals, however, the same network showed variations for some instruments because of timbre details, so the evaluation was divided by instrument. Below are the plots of the tests performed on synthetic data and on real world signals. The generated datasets do not offer a wide range of control over the spectrum, and this might be one reason why the detection rate is so low for the bassoon. The detection rate was good for the other three instruments.
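The source does not say how accuracy was computed. One simple way to split an evaluation by instrument, assuming frame-level accuracy (the fraction of frames where the predicted MIDI note equals the reference note), is sketched below. The function name and record format are hypothetical.

```python
from collections import defaultdict


def accuracy_by_instrument(records):
    """Frame-level accuracy per instrument.

    records: iterable of (instrument, reference_note, predicted_note),
    one entry per evaluated frame.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for instrument, reference, predicted in records:
        totals[instrument] += 1
        hits[instrument] += int(reference == predicted)
    return {name: hits[name] / totals[name] for name in totals}
```

Feeding it the per-frame predictions for the violin, clarinet, saxophone and bassoon recordings would give one score per instrument, which makes an instrument-specific weakness visible instead of hiding it in a single average.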

Figure: Output from Synthetic Data
Figure: Evaluation Dataset divided based on instruments
Figure: Evaluation results from Violin
Figure: Evaluation results from Clarinet
Figure: Evaluation results from Saxophone
Figure: Evaluation results from Bassoon


  • J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425-434, 1991.
  • C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conference, Barcelona, Spain, 2010, pp. 3-64.