Audio Processing and Speech Classification using Deep Learning — PART 1

Source: Deep Learning on Medium

In this first part of a two-part article, you will learn how to work with audio files and represent them in a form that can be fed to a Convolutional Neural Network (CNN) for classification. We will also look at several ways to visualize audio data.

Please refer to my GitHub account for the in-depth code.

GitHub link for the code —


Natural language is the most common way in which human beings interact with each other. Every day we exchange our thoughts and feelings through spoken language. You can only imagine the amount of data that is generated every day when we interact with each other through spoken words.

Classification of speech is one of the most important applications in Natural Language Processing. For us humans it is pretty easy to understand what a person is saying. We comprehend every word without breaking a sweat, which is quite surprising. Our brain is a powerhouse that processes spoken words with remarkable accuracy. Even when someone’s accent is difficult to understand, we only need to catch a few words and the context in which they are spoken to get an idea of what the person might be saying. This ability is what makes Natural Language Understanding (NLU) so effortless for us.

But when it comes to machines, things get a little more complicated. Making a machine understand natural language and make sense of what has been spoken is tricky. It is possible to hard-code rules that tell the machine about different words and the contexts in which they can be spoken, but that is as far as we can go with traditional programming. Such machines are limited to the knowledge that we are able to impart to them.

Human language is continuously evolving. Every day you come across words that you hear for the first time. The English vocabulary is incredibly vast, and it is practically impossible to hard-code every word in it into your machine. And so far we are only talking about English. There are thousands of languages and dialects spoken all over the world, so you can imagine the pain we would have to go through to individually tell our machine about all the different words in all the different languages.

From what we understand, speech, like any sound, is nothing but a wave. Sound is a vibration that travels through the air or another medium and can be heard when it reaches a person’s ear. When we speak, the air molecules close to our mouth start vibrating, and these vibrating molecules collide with their neighboring molecules, passing the vibration along until it reaches the ear of another person. That is how the other person hears what has been spoken. This is all getting very scientific, so I will stop right here.

What you need to understand is that these sound waves can be described by properties such as amplitude, frequency and wavelength.


  • An analog signal is a continuous wave, often illustrated as a sine wave (pictured below), that can vary in signal strength (amplitude) and in frequency (the number of cycles per second).
  • A digital signal, a must for computer processing, is stored in binary (0s and 1s) with a fixed number of bits per value, so it can take only a finite set of discrete levels rather than any value in between.
  • We sample and quantize an analog signal to get a digital one. Digital computers can only capture this data at discrete moments in time (sampling) and must round each captured value to the nearest available level (quantization). The rate at which a computer captures audio data is called the sampling rate.
  • For example, a 44.1 kHz sampling rate means that we capture 44,100 samples per second.
  • Information: Humans can hear sounds roughly between 0–140 dB and 20–20,000 Hz!
Fig 1. A typical form of sound wave (courtesy Google images)

The picture above shows how a sound wave can be represented in digital form. The highs and lows that you observe in it are the amplitude of our speech. The amplitude reflects how loud the sound is, while the pitch of the voice corresponds to its frequency, that is, how quickly the wave oscillates. Depending on their vocal tract and how loudly they speak, different individuals can say the same word with a low or a high amplitude.
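To make sampling and quantization concrete, here is a minimal sketch using NumPy. The 440 Hz tone, the one-second duration and the function name `sample_and_quantize` are assumptions chosen for illustration; they are not taken from the dataset or from the original code.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second (44.1 kHz)
DURATION = 1.0        # seconds
TONE_HZ = 440.0       # hypothetical test tone, not from the dataset


def sample_and_quantize(freq_hz, sample_rate, duration):
    """Sample a sine wave at discrete times and round it to 16-bit integers."""
    n_samples = int(round(sample_rate * duration))
    t = np.arange(n_samples) / sample_rate       # sampling: discrete moments in time
    analog = np.sin(2.0 * np.pi * freq_hz * t)   # "analog" values in [-1.0, 1.0]
    max_int = 2 ** 15 - 1                        # largest 16-bit signed value (32767)
    digital = np.round(analog * max_int).astype(np.int16)  # quantization
    return t, digital


t, digital = sample_and_quantize(TONE_HZ, SAMPLE_RATE, DURATION)
print(len(digital), digital.dtype)  # one second of audio at 44.1 kHz
```

Every value in `digital` is a whole number, which is exactly the "no fractional values" property of a digital signal described in the list above.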


The data that we will be using comes from the TensorFlow Speech Recognition Challenge, which was hosted by Google Brain on Kaggle. The dataset includes 65,000 one-second-long utterances of 30 short words by thousands of different people. I would suggest you go and check out the dataset, which I am sure you will find interesting. The utterances are stored in .wav format, a standard format for audio files. I won’t describe the dataset in much detail here, so I strongly urge you to visit the challenge page on Kaggle and get to understand the data that we have in hand.


Since the data is in wav format, it is temporal: each file is a sequence of amplitude values over time. You rarely work with this raw temporal data directly when you have the luxury of representing the wav files in a format that can be easily fed to our Deep Learning model. There are a couple of ways to represent these wav files: as a spectrogram, or as the more sophisticated MFCCs (Mel-Frequency Cepstral Coefficients). You can read more about what MFCC is here.



To represent the wav files as wave forms and spectrograms, the Python package SciPy provides a convenient way to read and process them. Start by importing the required libraries.

Fig 2. Importing the required libraries for converting wav files into their corresponding spectrograms and MFCCs

Once you have imported the required libraries, you can read the wav files as shown below. When we read a wav file using SciPy we get two outputs: the sample rate and an array holding the amplitude samples of the audio file. Here we decided to visualize the 1st (0th index), 5th and 8th wav files belonging to the word ‘yes’. The 1st and the 8th wav files contain the word spoken by a female speaker, while the 5th contains the word spoken by a male speaker. We chose these files to see whether there are any differences in the way male and female speakers say this particular word.

Fig 3 : Reading the wav files using SciPy
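Since the original code lives in a screenshot, here is a minimal sketch of reading a wav file with `scipy.io.wavfile.read`. To keep it self-contained, it first writes a synthetic one-second clip to a temporary file; the file name `example.wav`, the 440 Hz tone and the 16 kHz sample rate are assumptions for illustration, and in practice you would point `wav_path` at one of the dataset files.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Stand-in for a dataset file: a synthetic one-second, 16-bit clip.
# The 16 kHz rate is an assumption made for this sketch.
sample_rate_out = 16_000
t = np.arange(sample_rate_out) / sample_rate_out
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

wav_path = os.path.join(tempfile.mkdtemp(), "example.wav")  # hypothetical path
wavfile.write(wav_path, sample_rate_out, tone)

# wavfile.read returns the sample rate (Hz) and a NumPy array of amplitudes.
sample_rate, samples = wavfile.read(wav_path)
duration = len(samples) / sample_rate
print(sample_rate, samples.dtype, samples.shape, duration)
```

The array `samples` is what gets plotted as a wave form, and `sample_rate` is what lets you convert a sample index into a time in seconds.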

Now let’s visualize the wave forms of these wav files.

Fig 4 : The wave form comparison for male and female voice
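A comparison plot of this kind can be sketched with NumPy and Matplotlib as below. This is not the original plotting code; the helper names `time_axis` and `plot_waveforms`, the output file name and the two synthetic clips are assumptions made so the example runs without the dataset.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def time_axis(samples, sample_rate):
    """Return the time in seconds of each sample."""
    return np.arange(len(samples)) / sample_rate


def plot_waveforms(clips, sample_rate, out_path="waveforms.png"):
    """Plot each labelled clip in its own row and save the figure to out_path."""
    # Imported lazily so time_axis can be used without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(len(clips), 1, sharex=True, squeeze=False,
                             figsize=(8, 2 * len(clips)))
    for ax, (label, samples) in zip(axes[:, 0], clips.items()):
        ax.plot(time_axis(samples, sample_rate), samples)
        ax.set_title(label)
        ax.set_ylabel("Amplitude")
    axes[-1, 0].set_xlabel("Time (s)")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Synthetic stand-ins for two recordings: a short burst, loud and quiet.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
envelope = np.exp(-((t - 0.5) ** 2) / 0.02)
clips = {
    "loud clip (synthetic)": (10_000 * envelope * np.sin(2 * np.pi * 220 * t)).astype(np.int16),
    "quiet clip (synthetic)": (2_000 * envelope * np.sin(2 * np.pi * 120 * t)).astype(np.int16),
}
seconds = time_axis(clips["loud clip (synthetic)"], SAMPLE_RATE)
# plot_waveforms(clips, SAMPLE_RATE)  # uncomment to write waveforms.png
```

Each clip gets its own row with its own y-axis, so a quiet recording is not flattened by a loud one, while the shared time axis keeps the rows aligned for comparison.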

The difference in amplitude for the word ‘yes’ when spoken by the female speakers as opposed to the male speaker can be clearly seen. For the female voice the amplitude ranges roughly from -10000 to 10000, whereas for the male voice it stays between about -2000 and 2000. This difference is obvious when you actually listen to the recordings: the male speaker says the word softly, hence the low amplitude range, while the female speakers say it with noticeably higher intensity, hence the high amplitude range. Keep in mind that amplitude reflects loudness rather than pitch, so with only three files this says more about these particular recordings than about male and female voices in general. Even though the difference in amplitude is noticeable, the waveforms follow almost the same pattern, which is what makes it possible to recognize the word ‘yes’ when making a prediction on a new wav file.
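The amplitude ranges read off the plot can also be computed directly. The sketch below uses two synthetic tones with the same pitch but different loudness; the function name `amplitude_summary` and the signals are assumptions for illustration, with peak values chosen to mirror the ranges quoted above.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE

# Same 200 Hz pitch, different loudness (synthetic stand-ins for real clips).
loud = (10_000 * np.sin(2 * np.pi * 200 * t)).astype(np.int16)
quiet = (2_000 * np.sin(2 * np.pi * 200 * t)).astype(np.int16)


def amplitude_summary(samples):
    """Return the minimum, maximum and RMS amplitude of a clip."""
    x = samples.astype(np.float64)  # avoid int16 overflow when squaring
    return {"min": x.min(), "max": x.max(), "rms": np.sqrt(np.mean(x ** 2))}


loud_stats = amplitude_summary(loud)
quiet_stats = amplitude_summary(quiet)
rms_ratio = loud_stats["rms"] / quiet_stats["rms"]
print(loud_stats, quiet_stats, rms_ratio)
```

Both tones have the same frequency, yet their amplitude ranges differ by a factor of about five, which illustrates that amplitude measures loudness independently of pitch.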

You can also try visualizing the wave forms of different words. As expected, each spoken word produces a differently shaped waveform, as shown below.

Fig 5 : Wave form visualization for different spoken words


Spectrograms are another way to represent your wav files. A spectrogram shows how the energy at each frequency changes over time. The difference between the two spoken words can be clearly observed from the spectrograms of the ‘CAT’ and ‘DOG’ wav files.

Fig 6 : Spectrogram representing the ‘CAT’ wav file
Fig 7 : Spectrogram representing the ‘DOG’ wav file
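A log spectrogram can be computed with `scipy.signal.spectrogram`, as in the sketch below. The window and step sizes (20 ms and 10 ms), the helper name `log_spectrogram` and the 1 kHz test tone are assumptions for illustration, not the settings used to produce the figures above.

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
samples = np.sin(2 * np.pi * 1000 * t)  # synthetic 1 kHz tone


def log_spectrogram(samples, sample_rate, window_ms=20, step_ms=10, eps=1e-10):
    """Return frequencies, frame times and the power spectrogram in dB."""
    nperseg = int(round(window_ms * sample_rate / 1000))
    noverlap = nperseg - int(round(step_ms * sample_rate / 1000))
    freqs, times, power = signal.spectrogram(
        samples, fs=sample_rate, window="hann",
        nperseg=nperseg, noverlap=noverlap)
    # eps keeps the logarithm finite where the power is zero.
    return freqs, times, 10.0 * np.log10(power + eps)


freqs, times, log_spec = log_spectrogram(samples, SAMPLE_RATE)
dominant_hz = freqs[np.argmax(log_spec.mean(axis=1))]
print(log_spec.shape, dominant_hz)  # rows are frequency bins, columns are frames
```

The resulting 2D array can be displayed as an image (for example with Matplotlib’s `pcolormesh`), which is what makes spectrograms a natural input for a CNN.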

3. MFCC (Mel-Frequency Cepstral Coefficients)

For a more in-depth look at what MFCCs are, please refer to this excellent article. Now let’s visualize the MFCCs for ‘CAT’ and ‘DOG’.

Fig 8 : MEL Power Spectrogram representing ‘CAT’ and ‘DOG’
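To show roughly what goes into a Mel power spectrogram and MFCCs, here is a from-scratch sketch using only NumPy and SciPy. It is not the code behind the figure above, and dedicated audio libraries provide tuned implementations of the same steps. The helper names, the 1 kHz test tone and the parameter values (512-sample window, 128-sample hop, 40 Mel bands, 13 coefficients) are all assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import spectrogram


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular filters evenly spaced on the Mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    filters = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        filters[i] = np.maximum(0.0, np.minimum(rising, falling))
    return filters


def log_mel_and_mfcc(samples, sample_rate, n_fft=512, hop=128, n_mels=40, n_mfcc=13):
    """Return the log Mel power spectrogram (dB) and the first n_mfcc MFCCs."""
    _, _, power = spectrogram(samples.astype(np.float64), fs=sample_rate,
                              nperseg=n_fft, noverlap=n_fft - hop)
    mel_power = mel_filterbank(sample_rate, n_fft, n_mels) @ power
    log_mel = 10.0 * np.log10(mel_power + 1e-10)  # small offset avoids log(0)
    # MFCCs are the DCT of the log Mel energies along the Mel-band axis.
    mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]
    return log_mel, mfcc


SAMPLE_RATE = 16_000  # assumed sample rate in Hz
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
samples = np.sin(2 * np.pi * 1000 * t)  # synthetic 1 kHz tone
log_mel, mfcc = log_mel_and_mfcc(samples, SAMPLE_RATE)
print(log_mel.shape, mfcc.shape)  # (Mel bands, frames) and (coefficients, frames)
```

The Mel filterbank compresses the linear frequency axis so that low frequencies get finer resolution than high ones, and the DCT then summarizes each frame’s spectral shape in a handful of coefficients.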

With all these representations of the wav file available, you can choose any one of them as the input to your Convolutional Neural Network for the classification task; the network will then learn the internal features of whichever representation you provide.

In this article we went through several ways in which you can represent your audio files as images. In the next one we will see how to save these images and use them as input to a Convolutional Neural Network (CNN) model for speech classification. Of the above formats, I have used the Mel power spectrograms of all the audio files as the data on which my CNN model will be trained.

For more interesting content, please don’t forget to give a thumbs up and follow me on Medium and GitHub.


If you think I need to change anything in my content, please feel free to let me know. In case of any doubts, make use of the comment section.