Sense of Hearing in AI

Original article can be found here (source): Deep Learning on Medium

Humans have five senses: sight, hearing, smell, taste, and touch. As AI advances, researchers around the world are working to develop these human senses in AI assistants.

Humans have five different sense organs to perceive these senses, but a machine understands only signals, so the whole approach is to convert sound into signals. Work in this field dates back a long way: we have developed many devices to transmit sound from one place to another. The task now is for the machine to understand this signal and then respond as a human would.

How Does it Work?

It’s all about the sound signal. Sound is nothing but a wave that has some basic characteristics: frequency, amplitude, pitch, loudness, and duration.

  • Sound Pre-processing — A sound wave is a sinusoidal analog signal, so we first convert it to digital form. We apply a standard sampling rate of 44,100 Hz and store small chunks in the WAV audio format, which preserves the sound values faithfully.
  • Feature Extraction — Instead of using the raw samples directly, we apply feature extraction techniques. MFCC is one of them. It stands for Mel Frequency Cepstral Coefficients and is commonly used in speech recognition systems, including systems that automatically recognize people from their voices (speaker recognition). The MFCC of a signal is a small set of features (usually about 10–20) that concisely describes the overall shape of its spectral envelope. This technique uses the FFT (Fast Fourier Transform) to extract audio features.

Using Python with a standard sampling rate and FFT window, we obtain an MFCC feature matrix.

Alternate Approach-

We can use an alternative approach: create a spectrogram of each WAV file and use that image as input. For example, if one person says “hello” and another says “hellllooooo”, our system must recognize both as the same word, “HELLO”. To do so, we create a spectrogram of the sound wave.


From the spectrogram we get features such as frequency with respect to time, and the coloring also gives us sound intensity. These can be used to classify different sounds.
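A minimal sketch of this approach, using SciPy (an assumed choice) and a synthetic chirp in place of a spoken word:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
# a rising chirp stands in for a recorded word
signal = np.sin(2 * np.pi * (200 + 300 * t) * t)

# power at each (frequency, time) bin; plotting this matrix with a
# colormap yields the intensity-colored spectrogram image
freqs, times, power = spectrogram(signal, fs=sr, nperseg=1024)
```

The `power` matrix (frequencies down, time across) is exactly the image a classifier would consume.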

Areas and Applications-

In machine learning terms, we can group the various applications into three major categories: sound classification, speech recognition, and sound/speech generation.

Sound Classification —

As discussed, sound is nothing but a wave signal with some basic characteristics: frequency, amplitude, pitch, loudness, and duration. After pre-processing the sound, we classify it using various machine learning and deep learning approaches. Some important applications follow.

  • Surveillance Systems: — In surveillance, we can detect suspicious activity by identifying different types of crowd noise and individual aggressive sounds. The idea is that an aggressive person shouts abuse at a higher pitch or volume, and does not produce the rhythmic sound we make when, say, cheering.
  • Rigidity Detection: — Knocking on different objects such as glass, wood, a floor, a plant, or cotton generates different sounds. By detecting these we can estimate the rigidity and strength of an object.
  • Sound Source Detection: — There are many types of sound sources, either human-generated or environmental. We can classify such sounds and use them accordingly. We can also identify speakers based on the characteristics of their voices.
  • Instrumental Music Classification: — There are many musical instruments, such as the trumpet, piano, trombone, violin, guitar, saxophone, drum, conga, and cello. We can classify each of these instruments; the underlying concept is simple harmonic motion.
  • Voice/Lyrics Detection: — In many noisy environments we are unable to hear the human voice clearly. With sound classification we can separate the voice from the noise using a spectrogram. Similarly, we can separate the vocal track from the karaoke (instrumental) track.
  • Audio Fingerprinting: — Some sounds are created by someone who holds the copyright on them. Using a spectrogram, we can fingerprint a piece of music from only a few seconds of audio.
  • Child Monitoring: — In today’s busy schedules people have little time to watch over small children. In such situations we can develop an AI assistant that monitors kids and recognizes different types of sounds such as laughing, crying, and burping, helping us understand them.
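To illustrate the rigidity-detection idea above, here is a toy classifier (the tones and threshold are hypothetical stand-ins for real knock recordings) that labels a knock by its dominant FFT frequency:

```python
import numpy as np

def dominant_frequency(signal, sr):
    # frequency of the strongest FFT bin: a rough pitch estimate
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

def classify_knock(signal, sr, threshold_hz=1000.0):
    # hypothetical rule: glass rings at higher frequencies than wood
    return "glass" if dominant_frequency(signal, sr) > threshold_hz else "wood"

sr = 44100
t = np.linspace(0, 0.2, int(0.2 * sr), endpoint=False)
wood = np.sin(2 * np.pi * 300 * t)    # low-pitched thud
glass = np.sin(2 * np.pi * 2500 * t)  # high-pitched ring
```

A real system would learn such decision boundaries from labeled data rather than hard-coding a threshold.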

Speech Recognition –

In speech recognition, we convert speech into text. To build this we require a huge dataset in which each data point is a single vocal recording of a particular word. I worked on an assistant that uses basic command words and numbers such as one….nine, up, down, etc.

The data requires the pre-processing mentioned above; after that, we can create a deep learning model to perform the recognition.
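A sketch of what such a command classifier could look like, using scikit-learn (an assumed choice) and randomly generated placeholder features in place of real MFCC data:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier  # assumed model choice

# Placeholder data: each row stands in for a flattened MFCC matrix of one
# spoken command; real features would come from the pre-processing step.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 13 * 20  # ~20 frames of 13 MFCCs each
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)  # two command classes, e.g. "up"/"down"
X[y == 1] += 2.0  # shift one class so the toy data is separable

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X, y)
```

With real recordings, `X` and `y` would be built from the labeled command dataset, and a convolutional network over the unflattened MFCC matrices would be a common alternative.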

In the market, many large companies such as Amazon and Google provide their own assistants as well as paid/free APIs for use on your platform. Applications include:

  • Personal Assistant: — Nowadays there is a great craze for personal or home assistants, and many devices already exist: Alexa, Google Assistant, Siri, etc. These assistants understand human language and act on spoken commands. In this way we reduce human effort by using AI as a helper. People are now working to apply the same concept to different languages.
  • Person Identification: — Many devices can mimic voices, and while tracing a call we must keep continuous watch on a suspect to identify them. In such cases a small mistake can have large consequences; a trained AI assistant can help solve this problem.

Sound Generation-

Our era is crazy about EDM (Electronic Dance Music); people want remixes of old music. Sound generation also helps us manipulate audio with minute precision, for example making our voices sound like a celebrity’s. Some important applications:

  • Music Generation: — Many generative networks can create new content, whether images or sound. The VAE-GAN (Variational Autoencoder combined with a Generative Adversarial Network) is one of them. It can generate new percussion, melodies, and chords based on vocal input, and we can create new patterns of electronic music.
  • Speech Generation: — This is an NLP-based concept in which we do the same as text generation but in the form of sound, for example suggesting similar words or completing sentences.
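To ground the simple-harmonic-motion idea behind these generative applications, here is a minimal numerical sketch that synthesizes a chord by summing pure tones (the frequencies and amplitudes are illustrative):

```python
import numpy as np

def tone(freq, duration, sr=44100, amp=0.3):
    # a pure tone: simple harmonic motion at the given frequency
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    return amp * np.sin(2 * np.pi * freq * t)

def chord(freqs, duration, sr=44100):
    # a chord is just the sum of its component tones
    return sum(tone(f, duration, sr) for f in freqs)

# C major triad (C4, E4, G4) as a half-second clip
c_major = chord([261.63, 329.63, 392.00], 0.5)
```

Generative models like the VAE-GAN mentioned above learn to produce such waveforms (or their spectrograms) from data rather than from explicit formulas.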

Future Work –

As humans, we know that hearing is a very important sense for us. After hearing a sound, our nervous system processes it and responds remarkably quickly, and those responses are highly reliable. Let’s understand the underlying concept.

Neurolinguistics — Neurolinguistics is the study of how language is represented in the brain: that is, how and where our brains store our knowledge of the language (or languages) that we speak, understand, read, and write; what happens in our brains as we acquire that knowledge; and what happens as we use it in our everyday lives.

We can understand this with an example: “All colleges and schools are getting closed in Delhi because of………..” If you say this sentence to almost anyone in the country, their answer will surely be “coronavirus”, given the current situation. Answers such as “summer”, “pollution”, or “riots” could also be correct, but they are less closely related to the situation. This is how our neurolinguistic processing works, drawing on prior knowledge and context. With the help of Natural Language Processing, we can teach an AI about parts of speech so it can fill in a suitable word to complete the sentence. In this way we can enhance our system to use its short-term memory to answer.
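The sentence-completion idea can be sketched at miniature scale with a toy bigram model (the tiny corpus below is invented for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus: the model will learn which word most often follows another,
# the same idea (at miniature scale) behind sentence completion.
corpus = (
    "schools are closed because of coronavirus . "
    "colleges are closed because of coronavirus . "
    "parks are closed because of pollution ."
).split()

# count how often each word follows each other word
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def complete(word):
    # most frequent continuation seen in the corpus
    return follows[word].most_common(1)[0][0]
```

Modern systems replace these bigram counts with neural language models, but the principle of predicting from prior context is the same.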

Thank You