Working with audio signals in Python

When it comes to machine learning and deep learning models, most of the attention is given to problems in the computer vision or natural language sub-domains.

However, there’s an ever-increasing need to process audio data, with emerging advancements in technologies like Google Home and Alexa that extract information from voice signals. As such, working with audio data has become a new trend and area of study.

The possible applications extend to voice recognition, music classification, tagging, and generation, and are paving the way for audio use cases to become the new era of deep learning.

Audio File Overview

Sound is a pressure wave, and such a wave can be represented by numbers over a time period. These air pressure differences are what our ears detect and communicate to the brain. Audio files are generally stored in .wav format, and the continuous signal needs to be digitized using the concept of sampling.

The sampling frequency (or sample rate) is the number of samples (data points) per second taken from a sound. For example, if the sampling frequency is 44.1 kHz, a recording with a duration of 60 seconds will contain 2,646,000 samples. In theory, sampling at just over twice the highest frequency in the signal is sufficient; in practice, sampling even higher (around 10x) helps capture the amplitude correctly in the time domain.
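
A quick check of that arithmetic in Python:

sr = 44100          # samples per second (44.1 kHz)
duration = 60       # length of the recording in seconds
num_samples = sr * duration
print(num_samples)  # 2646000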

Loading and Visualizing an audio file in Python

Librosa is a Python library that helps us work with audio data. For complete documentation, you can refer to the official Librosa docs at librosa.org/doc.

  1. Install the library: pip install librosa
  2. Loading the file: The audio file is loaded into a NumPy array after being sampled at a particular sample rate (sr), as sketched below.
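
A minimal loading sketch (the file name audio_example.wav is just a placeholder):

import librosa

# Load the file into a 1-D NumPy array of samples; sr=None keeps the file's native sample rate
audio_data, sr = librosa.load('audio_example.wav', sr=None)
print(audio_data.shape, sr)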

3. Playing audio: Using IPython.display.Audio, we can play the audio file in a Jupyter Notebook with the command IPython.display.Audio(audio_data).
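
For example, assuming audio_data and sr come from the loading step above (the rate argument is required when passing a raw NumPy array rather than a file path):

import IPython.display as ipd

# Renders an in-notebook audio player; a file path also works, e.g. ipd.Audio('audio_example.wav')
ipd.Audio(audio_data, rate=sr)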

4. Waveform visualization: To visualize the sampled signal and plot it, we need two Python libraries, Matplotlib and Librosa. The following code plots the waveform, i.e. the amplitude of the signal against time.
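
A sketch of that plot, assuming the variables from the loading step (librosa.display.waveshow replaced waveplot in Librosa 0.10):

import matplotlib.pyplot as plt
import librosa.display

plt.figure(figsize=(14, 5))
librosa.display.waveshow(audio_data, sr=sr)   # use waveplot() on older Librosa versions
plt.title('Waveform')
plt.show()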

5. Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time; spectrograms are time-frequency portraits of signals. Using a spectrogram, we can see how the energy (in dB) at each frequency varies over time.
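
A sketch, again assuming audio_data and sr from the loading step; it also defines the Xdb array used in the next step:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

X = librosa.stft(audio_data)              # short-time Fourier transform (complex-valued)
Xdb = librosa.amplitude_to_db(np.abs(X))  # convert the magnitudes to decibels
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.show()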

6. Log-frequency axis: Features can be obtained from a spectrogram by converting the linear frequency axis, as shown above, into a logarithmic axis. The resulting representation is also called a log-frequency spectrogram. The code we need to write here is:

librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')

Creating an audio signal and saving it

A digitized audio signal is a NumPy array with a specified frequency and sample rate. The analog waveform of the signal represents a function (e.g. a sine or cosine). Once the signal has been composed from such a NumPy array, we need to save it to an audio file. This kind of audio creation could be used in applications that require voice interaction, such as audio-enabled bots or search engines.
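
A minimal sketch of generating a sine tone and saving it, assuming the soundfile package (Librosa's recommended I/O backend) is installed; the tone parameters are arbitrary:

import numpy as np
import soundfile as sf

sr = 22050        # sample rate in Hz
duration = 5.0    # length of the signal in seconds
freq = 220.0      # frequency of the tone in Hz

t = np.linspace(0, duration, int(sr * duration), endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * freq * t)   # a pure sine wave

sf.write('generated_tone.wav', tone, sr)    # save the NumPy array as a .wav file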

So far, so good. It's easy and fun to learn. But the data pre-processing steps can be difficult and memory-intensive, as we'll often have to deal with audio signals longer than one second. Compared to the number of pixels per training example in popular image datasets such as MNIST or CIFAR, the number of data points in a digital audio clip is much higher: a single second of audio sampled at 44.1 kHz already contains 44,100 values, whereas an MNIST image has only 784 pixels. This can lead to memory issues.