How to Run GPU-Accelerated Signal Processing in TensorFlow

Somewhere deep inside the TensorFlow framework exists a rarely noticed module: tf.contrib.signal, which can help you build a GPU-accelerated audio/signal processing pipeline for your TensorFlow/Keras model. In this post, we will take a practical approach to examine some of the most popular signal processing operations and visualize the results.

You can find the source code for this post on my GitHub, as well as a runnable Google Colab notebook.

Get started

We are going to build a complete computation graph in TensorFlow that takes a wav file name and outputs the MFCC features. There are some intermediate outputs/audio features which could be fun to visualize, so we will enable TensorFlow eager execution, which allows us to evaluate operations immediately without building the complete graph. If you are new to TensorFlow eager execution, you are going to find it much more intuitive than the graph API.

The following snippet will get you started with eager execution in TensorFlow.
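A minimal sketch of that setup, assuming TensorFlow 1.x (eager execution is already the default in 2.x, which no longer has the enable function):

```python
import tensorflow as tf

# In TensorFlow 1.x, eager execution must be switched on explicitly;
# TensorFlow 2.x runs eagerly by default and has no enable function.
if hasattr(tf, "enable_eager_execution"):
    tf.enable_eager_execution()

print(tf.executing_eagerly())   # True
```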

Decode WAV file

The tf.contrib.ffmpeg.decode_audio op depends on a locally installed FFmpeg binary to decode an audio file.

To install FFmpeg on a Linux-based system, run:

apt update -qq
apt install -y -qq ffmpeg
ffmpeg -version

After that, we can download a small sample of the siren sound wav file and use TensorFlow to decode it.
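A self-contained sketch of the decoding step. It synthesizes a short sine-wave WAV as a stand-in for the downloaded siren clip (the file name is made up), and falls back to tf.audio.decode_wav on TensorFlow 2.x, where tf.contrib was removed:

```python
import math
import struct
import wave

import tensorflow as tf

# Synthesize a one-second 440 Hz tone as a stand-in for the downloaded
# siren clip; the file name here is hypothetical.
sample_rate = 16000
with wave.open('siren_sample.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)            # 16-bit PCM
    w.setframerate(sample_rate)
    frames = (struct.pack('<h', int(10000 * math.sin(2 * math.pi * 440 * t / sample_rate)))
              for t in range(sample_rate))
    w.writeframes(b''.join(frames))

audio_binary = tf.io.read_file('siren_sample.wav')
if hasattr(tf, 'contrib'):
    # TensorFlow 1.x: decode through the locally installed FFmpeg binary.
    waveform = tf.contrib.ffmpeg.decode_audio(
        audio_binary, file_format='wav',
        samples_per_second=sample_rate, channel_count=1)
else:
    # tf.contrib was removed in 2.x; tf.audio.decode_wav reads PCM WAV.
    waveform, _ = tf.audio.decode_wav(audio_binary)

print(waveform.shape)   # (samples, channels)
```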

The waveform is a Tensor; with the help of eager execution, we can evaluate its value immediately and visualize it.

A section of the waveform

From the raw waveform, we can barely see any signature of the siren sound, and it might be too much data to feed to the neural network directly anyway. The next several steps will extract the frequency-domain signatures of the audio signal.

Disclaimer: decoding an audio file with tf.contrib.ffmpeg is not supported on Windows. Refer to this issue.

An alternative on Windows is to decode the wav file with scipy.
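A sketch of the scipy route; the tiny WAV written up front is only there to keep the example self-contained (replace it with your own file):

```python
import struct
import wave

import numpy as np
from scipy.io import wavfile

# Write a tiny mono 16-bit WAV so the example runs on its own
# (the file name is hypothetical).
with wave.open('siren_sample.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack('<4h', 0, 8000, 0, -8000))

# scipy reads PCM WAV files directly, on Windows too.
sample_rate, data = wavfile.read('siren_sample.wav')

# Scale int16 samples to floats in [-1, 1], matching the range
# produced by tf.contrib.ffmpeg.decode_audio.
waveform = data.astype(np.float32) / 32768.0
if waveform.ndim == 1:
    waveform = waveform[:, np.newaxis]   # shape (samples, channels)
```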

Computing spectrograms

New to spectrograms? Check out the cool Chrome music lab experiment to visualize your voice as spectrograms in real time.

The most common approach to computing spectrograms is to take the magnitude of the STFT (Short-time Fourier Transform).

Any audio waveform can be represented by a combination of sinusoidal waves with different frequencies, phases, and magnitudes. The STFT can determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.

tf.contrib.signal.stft computes the STFT of signals. This operation accepts a Tensor “signals” of shape (batch_size, samples).

The stfts Tensor has the shape (batch_size, frames, fft_unique_bins); each value is a complex number of the form a + bi, with real part a and imaginary part b.
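A sketch of the STFT step, using a synthetic 440 Hz tone in place of the siren waveform; the frame_length, frame_step, and fft_length values are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

# tf.contrib.signal became tf.signal in later TensorFlow releases.
signal_mod = tf.signal if hasattr(tf, 'signal') else tf.contrib.signal

# A batch of one synthetic signal of shape (batch_size, samples),
# standing in for the decoded siren waveform.
sample_rate = 16000
t = np.arange(sample_rate, dtype=np.float32) / sample_rate
signals = np.sin(2 * np.pi * 440.0 * t)[np.newaxis, :]   # (1, 16000)

stfts = signal_mod.stft(signals,
                        frame_length=1024,
                        frame_step=512,
                        fft_length=1024)
print(stfts.shape, stfts.dtype)   # (1, frames, 513), complex64
```

With a 1,024-point FFT, fft_unique_bins is 1024 // 2 + 1 = 513.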

An energy spectrogram is the magnitude of the complex-valued STFT, i.e. sqrt(a^2 + b^2).

In TensorFlow, it can be computed as simply as:
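For example (recomputing the STFT of a synthetic tone so the snippet stands alone; the STFT parameters are assumed values):

```python
import numpy as np
import tensorflow as tf

signal_mod = tf.signal if hasattr(tf, 'signal') else tf.contrib.signal

# STFT of a synthetic 440 Hz tone, standing in for the siren clip.
sample_rate = 16000
t = np.arange(sample_rate, dtype=np.float32) / sample_rate
signals = np.sin(2 * np.pi * 440.0 * t)[np.newaxis, :]
stfts = signal_mod.stft(signals, frame_length=1024, frame_step=512,
                        fft_length=1024)

# tf.abs of a complex tensor is exactly sqrt(a^2 + b^2).
magnitude_spectrograms = tf.abs(stfts)
```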

We can plot the energy spectrogram and spot a pattern at the top of the image, where the frequency sweeps up and down, resembling the pitch variation of the siren sound.

Magnitude spectrograms

Computing Mel-Frequency Cepstral Coefficients (MFCCs)

As you can see, there are 513 frequency bins in the computed energy spectrogram, and many of them are “blank”. When working with spectral representations of audio, Mel Frequency Cepstral Coefficients (MFCCs) are widely used in automatic speech and speaker recognition; they yield a lower-dimensional and more perceptually relevant representation of the audio.

We can turn the energy/magnitude spectrograms into Mel-spectrograms in TensorFlow and plot the outcome like this.

If desired, we can specify the lower and upper bounds on the frequencies to be included in the Mel-spectrum.

num_mel_bins specifies how many bands the resulting Mel-spectrum contains.
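A sketch of the Mel warping step; the frequency bounds (80–7,600 Hz) and num_mel_bins=64 are assumptions for illustration, and the magnitude spectrogram is recomputed from a synthetic tone so the snippet stands alone:

```python
import numpy as np
import tensorflow as tf

signal_mod = tf.signal if hasattr(tf, 'signal') else tf.contrib.signal

# Magnitude spectrogram of a synthetic tone (stand-in for the siren clip).
sample_rate = 16000
t = np.arange(sample_rate, dtype=np.float32) / sample_rate
signals = np.sin(2 * np.pi * 440.0 * t)[np.newaxis, :]
magnitude_spectrograms = tf.abs(
    signal_mod.stft(signals, frame_length=1024, frame_step=512,
                    fft_length=1024))

# Warp the 513 linear-frequency bins onto 64 Mel bands between the
# chosen lower and upper frequency bounds.
num_spectrogram_bins = int(magnitude_spectrograms.shape[-1])   # 513
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 64
linear_to_mel_weight_matrix = signal_mod.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sample_rate,
    lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(
    magnitude_spectrograms, linear_to_mel_weight_matrix, 1)
print(mel_spectrograms.shape)   # (batch_size, frames, 64)
```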


To further compress the Mel-spectrogram magnitudes, you may apply a compressive nonlinearity such as logarithmic compression. This helps to balance the importance of detail in low- and high-energy regions of the spectrum, which more closely matches human auditory sensitivity.

log_offset is a small number added to avoid applying log() to zero in the rare case that a spectrogram value is exactly zero.
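A minimal illustration with made-up Mel-spectrogram values, including an exact zero to show why the offset matters:

```python
import tensorflow as tf

# Stand-in Mel-spectrogram values; the leading 0.0 would make a plain
# log() return -inf without the offset.
mel_spectrograms = tf.constant([[0.0, 1e-4, 0.5, 2.0]])

log_offset = 1e-6
log_mel_spectrograms = tf.math.log(mel_spectrograms + log_offset)
```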

Log mel spectrograms

The last step, tf.contrib.signal.mfccs_from_log_mel_spectrograms computes MFCCs from log_mel_spectrograms.
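A sketch of that final step; the random log-Mel input and the choice of keeping the first 13 coefficients (a common choice for speech features) are stand-ins, not values from the post:

```python
import tensorflow as tf

signal_mod = tf.signal if hasattr(tf, 'signal') else tf.contrib.signal

# A random log-Mel-spectrogram batch with 64 Mel bands stands in for
# the one computed in the previous step.
log_mel_spectrograms = tf.random.normal([1, 30, 64])

mfccs = signal_mod.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)
# Keep only the first 13 coefficients.
mfccs = mfccs[..., :13]
print(mfccs.shape)   # (1, 30, 13)
```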


Putting everything together

Let's put everything together into a single TensorFlow pipeline.
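One way to sketch the whole pipeline as a reusable function; all parameter defaults here are assumptions, and a synthetic tone stands in for the decoded waveform:

```python
import numpy as np
import tensorflow as tf

signal_mod = tf.signal if hasattr(tf, 'signal') else tf.contrib.signal


def waveform_to_mfccs(signals, sample_rate=16000,
                      frame_length=1024, frame_step=512,
                      num_mel_bins=64, num_mfccs=13,
                      lower_edge_hertz=80.0, upper_edge_hertz=7600.0):
    """Map a batch of waveforms (batch_size, samples) to MFCC features.

    Parameter defaults are illustrative assumptions, not the post's values.
    """
    stfts = signal_mod.stft(signals, frame_length=frame_length,
                            frame_step=frame_step, fft_length=frame_length)
    magnitude_spectrograms = tf.abs(stfts)
    num_spectrogram_bins = frame_length // 2 + 1
    mel_matrix = signal_mod.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, sample_rate,
        lower_edge_hertz, upper_edge_hertz)
    mel_spectrograms = tf.tensordot(magnitude_spectrograms, mel_matrix, 1)
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
    mfccs = signal_mod.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)
    return mfccs[..., :num_mfccs]


# Usage with a synthetic one-second tone standing in for the siren clip:
t = np.arange(16000, dtype=np.float32) / 16000.0
signals = np.sin(2 * np.pi * 440.0 * t)[np.newaxis, :]
mfccs = waveform_to_mfccs(signals)
print(mfccs.shape)   # (batch_size, frames, 13)
```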

Conclusion and further reading

In this post, we introduced how to do GPU-enabled signal processing in TensorFlow. We walked through each step, from decoding a WAV file to computing MFCC features of the waveform. Finally, we constructed a pipeline that you can attach to your existing TensorFlow/Keras model to build an end-to-end audio processing computation graph.

Some related resources you might find helpful.

Mel Frequency Cepstral Coefficient (MFCC) tutorial

Chrome music lab Spectrogram experiment

Source code for this post on GitHub.



Source: Deep Learning on Medium