Solving Challenges in Speech Recognition in India

Source: Deep Learning on Medium


Speech is one of the most distinctive characteristics separating humans from other living beings. With advances in technology, Human-Computer Interaction (HCI) has changed considerably, and speech recognition is now considered one of the most promising and convenient modes of HCI.

Empirical research on speech technology mainly focuses on intelligent interactive voice response systems, intelligent personal assistants, modality, assistive technology and accessibility.

But accurate speech recognition applications are still scarce for real-time use cases in the Indian context, which involve mixed language (e.g. Hinglish) or any of the 19,500 Indian dialects.

Voice quality in call-centre applications is degraded by poor telecom network coverage, which adds to the challenge. Punctuation, response time in online speech recognition, and accuracy are also well-known factors affecting speech recognition based applications. Deploying speech technology on a resource-constrained device is also an extremely difficult problem.

Generative models based on the Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) were the state of the art in speech recognition for a long time. In late 2010, generative models took a new direction: the Gaussian mixtures of the HMM states were replaced by a Deep Neural Network (DNN). But the recognition model essentially retained its generative interpretation.

Deep learning is now emerging as a powerful alternative. Higher accuracy is currently being achieved with deep learning-based discriminative models, i.e. sequence-to-sequence models. The time needed to train these models and decode with them is reduced further by the powerful GPUs available today.

Several factors affect an Automatic Speech Recognition (ASR) system:

1. Dialect

To handle dialectal variations caused by regional and social influences, post-processing based on a pronunciation dictionary that includes dialectal variants is applied to the ASR output.
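As a rough illustration of dictionary-based post-processing, the sketch below maps a recognized phone sequence back to a word through a lexicon that lists several pronunciations per word. The lexicon entry, the phone symbols and the function names are hypothetical and chosen only for the example; a real system would use its own phone set and a far larger dictionary.

```python
def build_lookup(lexicon):
    """Invert {word: [pronunciation, ...]} into {phone tuple: word}."""
    lookup = {}
    for word, pronunciations in lexicon.items():
        for pron in pronunciations:
            lookup[tuple(pron.split())] = word
    return lookup


def phones_to_word(phones, lookup):
    """Return the word for a phone sequence, or None if it is not listed."""
    return lookup.get(tuple(phones))


# Hypothetical entry: a canonical pronunciation plus one dialectal variant.
lexicon = {"school": ["s k uu l", "i s k uu l"]}
lookup = build_lookup(lexicon)

canonical = phones_to_word(["s", "k", "uu", "l"], lookup)
variant = phones_to_word(["i", "s", "k", "uu", "l"], lookup)
unknown = phones_to_word(["s", "k", "oo", "l"], lookup)
```

Both the canonical and the variant pronunciation resolve to the same word, while an unlisted sequence returns None and can be passed to a fallback.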

Dialect identification is achieved by using a DNN to map acoustic and phonotactic features directly to a dialect, alongside classical techniques such as traditional machine learning classifiers over n-gram phonotactic features.
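The classical n-gram phonotactic approach can be sketched with a very small classifier. The example below counts phone bigrams per dialect and scores a new utterance with add-one smoothing. The dialect labels, phone sequences and function names are made-up assumptions for illustration, not data or code from any real system.

```python
from collections import Counter
import math


def phone_ngrams(phones, n=2):
    """Return the phone n-grams (as tuples) found in a phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def train(samples, n=2):
    """samples: list of (dialect_label, phone_list) pairs."""
    counts = {}
    for label, phones in samples:
        counts.setdefault(label, Counter()).update(phone_ngrams(phones, n))
    return counts


def classify(counts, phones, n=2):
    """Pick the dialect whose smoothed n-gram model scores the phones highest."""
    vocab = set().union(*counts.values())
    v = len(vocab) + 1  # one extra slot reserves mass for unseen n-grams
    best_label, best_score = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + v))
                    for g in phone_ngrams(phones, n))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: the two hypothetical dialects differ in "s" vs "sh".
samples = [
    ("dialect_a", ["s", "a", "k", "a", "l"]),
    ("dialect_a", ["s", "a", "b"]),
    ("dialect_b", ["sh", "a", "k", "a", "l"]),
    ("dialect_b", ["sh", "a", "b"]),
]
model = train(samples)
predicted = classify(model, ["sh", "a", "m"])  # -> "dialect_b"
```

A production system would use many more n-gram orders and a stronger classifier, but the feature extraction step looks much the same.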

2. Mixed Language

One possible solution is to treat the mixed language as a single language in its own right, e.g. treating spoken Hindi mixed with English as Hinglish.

3. Application Specific Vocabulary Design

To reduce the search space of speech recognition results, a customized, domain-specific beam search may be applied in the sequence model. This leads to results that are both more accurate and faster to generate.
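One way to picture a domain-constrained beam search is the simplified character-level sketch below. It assumes the model emits one probability distribution over characters per output step (no blank symbol, no attention) and prunes every hypothesis that is not a prefix of some domain word. The vocabulary, the probabilities and the function names are hypothetical.

```python
import math


def build_prefixes(vocab):
    """All prefixes of every domain word, including the empty prefix."""
    return {w[:i] for w in vocab for i in range(len(w) + 1)}


def constrained_beam_search(step_probs, vocab, beam_width=3):
    """step_probs: list of {char: probability} dicts, one per output step.

    Hypotheses survive only while they remain a prefix of a vocab word.
    Returns (word, log-probability) pairs, best first.
    """
    prefixes = build_prefixes(vocab)
    beams = [("", 0.0)]
    for dist in step_probs:
        candidates = []
        for text, score in beams:
            for ch, p in dist.items():
                new_text = text + ch
                if p > 0 and new_text in prefixes:
                    candidates.append((new_text, score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(text, score) for text, score in beams if text in vocab]


# Unconstrained greedy decoding of these toy outputs would give "bay".
step_probs = [
    {"b": 0.6, "p": 0.4},
    {"a": 0.7, "i": 0.3},
    {"y": 0.6, "n": 0.4},
]
results = constrained_beam_search(step_probs, {"pay", "pin"})
words = [text for text, _ in results]  # -> ["pay", "pin"]
```

Because out-of-domain prefixes are dropped at every step, fewer hypotheses are expanded, which is where both the speed and the accuracy gain come from in this toy setting.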

4. Out of Vocabulary (OOV) Detection

OOV detection is very important for capturing the user's input correctly. An OOV decision is computed from an acoustic score, a domain-specific language score and phone/character-level recognition.
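The exact way these scores are combined is not given here, so the following is only one plausible sketch. It assumes the decoder exposes three log-scores: the acoustic score of the best in-vocabulary word, the acoustic score of the best unconstrained phone sequence, and the domain language model score of the word. The weighting, the threshold and the function names are all assumptions.

```python
def oov_score(word_acoustic_logp, phone_acoustic_logp, lm_logp, lm_weight=0.5):
    """Higher values suggest the spoken word is out of vocabulary.

    word_acoustic_logp:  acoustic log-likelihood of the best in-vocabulary word
    phone_acoustic_logp: acoustic log-likelihood of the best free phone sequence
    lm_logp:             domain language-model log-probability of that word
    """
    # A large gap means the free phone recognizer fits the audio much better
    # than any in-vocabulary word does.
    acoustic_gap = phone_acoustic_logp - word_acoustic_logp
    # A very negative language score makes the word less plausible in context.
    return acoustic_gap - lm_weight * lm_logp


def is_oov(word_acoustic_logp, phone_acoustic_logp, lm_logp,
           lm_weight=0.5, threshold=5.0):
    """Flag the segment as OOV when the combined score exceeds the threshold."""
    score = oov_score(word_acoustic_logp, phone_acoustic_logp, lm_logp, lm_weight)
    return score > threshold


in_vocab_case = is_oov(-42.0, -41.0, -2.0)   # small gap, likely word
out_vocab_case = is_oov(-60.0, -45.0, -9.0)  # large gap, unlikely word
```

In practice the weight and threshold would be tuned on held-out data from the target domain.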

5. Ambiguity

There are two types of ambiguity: homophones and word-boundary ambiguity. Both can mostly be resolved using contextual dynamic programming and Bayesian belief networks.
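Word-boundary ambiguity lends itself to a short dynamic programming sketch. The example below finds the most probable split of an unsegmented string into lexicon words. For brevity it uses context-free (unigram) word probabilities, whereas a contextual version would condition each word on its neighbours. The lexicon and its probabilities are invented for illustration.

```python
import math


def segment(text, word_probs):
    """Return the most probable split of text into lexicon words, or None."""
    n = len(text)
    # best[i] = (log-probability of best split of text[:i], start of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    max_len = max(map(len, word_probs))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical probabilities: "now here", "no where" and "nowhere" all fit.
word_probs = {"no": 0.05, "now": 0.04, "here": 0.03,
              "where": 0.02, "nowhere": 0.0005}
best_split = segment("nowhere", word_probs)  # -> ["now", "here"]
```

Changing the probabilities changes the winner, which is exactly where contextual information would enter.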

6. Noise

To overcome erratic ASR behavior caused by noise, the acoustic model is trained on noisy data and noise suppression is applied as well. One approach to noise suppression is to apply gains computed by a Recurrent Neural Network (RNN) to a low-resolution spectral envelope.
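The gain-based idea can be sketched without the network itself. Below, a magnitude spectrum is split into a few coarse bands and each band is scaled by one gain between 0 and 1. In the approach described above those gains would come from an RNN; here they are fixed, hypothetical values, and the gains are held constant within each band as a simplification.

```python
def apply_band_gains(spectrum, band_edges, band_gains):
    """Scale each frequency bin by the gain of the band that contains it.

    spectrum:   magnitude per frequency bin
    band_edges: ascending bin indices; band i covers [edges[i], edges[i + 1])
    band_gains: one gain per band, typically in [0, 1]
    """
    if len(band_edges) != len(band_gains) + 1:
        raise ValueError("need exactly one more band edge than band gains")
    out = list(spectrum)
    for lo, hi, gain in zip(band_edges, band_edges[1:], band_gains):
        for k in range(lo, hi):
            out[k] = spectrum[k] * gain
    return out


# Six bins in three bands; the gains stand in for RNN outputs (assumed values).
noisy = [1.0, 2.0, 4.0, 4.0, 0.5, 0.5]
denoised = apply_band_gains(noisy, [0, 2, 4, 6], [1.0, 0.5, 0.0])
# -> [1.0, 2.0, 2.0, 2.0, 0.0, 0.0]
```

Working on a handful of bands instead of every bin keeps the network's output small, which is what makes this style of suppression cheap enough for real-time use.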

7. Spoken language (dialogue-oriented)

Speech signals of different speakers in a multi-speaker dialogue can be separated using speaker diarization algorithms, e.g. those based on i-vector or x-vector speaker embeddings.
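Once each speech segment has a speaker embedding, diarization largely reduces to clustering. The sketch below assumes the embeddings are already extracted (the two-dimensional vectors stand in for real i-vectors or x-vectors) and groups them greedily by cosine similarity. The threshold and function names are assumptions, and real systems often use more robust clustering.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diarize(embeddings, threshold=0.8):
    """Assign a speaker index to each segment embedding, in order."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            counts[k] += 1
            # Update the running mean of the matched speaker's centroid.
            centroids[k] = [c + (e - c) / counts[k]
                            for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)  # start a new speaker cluster
            centroids.append(list(emb))
            counts.append(1)
        labels.append(k)
    return labels


# Toy embeddings for four consecutive segments of a two-speaker dialogue.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
speakers = diarize(segments)  # -> [0, 1, 0, 1]
```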

8. Stumbling

Contextual and linguistic information helps solve this problem.

9. Speed in Offline

For the offline scenario, computationally less expensive algorithms such as GMM and HMM can be used instead of deep learning. Offline ASR systems can also be optimized for the target hardware architecture.

Speech recognition technology is a very competitive area of work for cutting-edge machine learning and deep learning companies, and there is ample opportunity for innovation and improvement, not only in India but worldwide.

Mihup works on these problems in the context of India's vernacular languages.


This article is a guest post by Sandipan Mandal, Co-founder of Mihup, a platform for building accurate and intuitive voice interfaces in Indian languages.