The Promise of Natural Language Processing in Speech Recognition

Speech Recognition

Speech is the most efficient, effective, and natural way for humans to exchange information, so the ability to recognize speech by technological means is valuable. The process of recognizing human speech is called Speech Recognition, and it can be defined as identifying and understanding human spoken language through speech signal processing and pattern recognition. The target of speech recognition is for a machine to hear, understand, and react accordingly to spoken information.

The goal of Automatic Speech Recognition (ASR) is to analyze, characterize, extract, and recognize information, including information about the identity of the speaker. Speech processing is an area of signal processing, and machine recognition of speech means generating the sequence of words that best matches the input speech signal. Applications of speech recognition include multimedia search, virtual reality, natural-language understanding, auto-attendants, travel information and reservation systems, translators, etc.

Different Classes of Speech Recognition Systems

Speech recognition systems can be classified into different classes based on what types of utterances they can recognize.

Isolated word recognizers require two states, “Listen” and “Not-Listen”. This approach accepts a single word or single utterance at a time.

Connected word systems are similar to isolated-word systems, but they allow separate utterances to be run together with a “minimum pause between them”.

Continuous speech recognizers allow the user to speak naturally. Special methods are needed to determine utterance boundaries, which is why this is considered one of the most difficult kinds of system to create.

Spontaneous speech is speech that is natural-sounding and not rehearsed. An automatic speech recognition system should be able to handle a variety of natural speech features, such as words being run together.

Speech Recognition Systems

There are four stages of speech recognition: capturing, pre-processing, feature extraction, and recognition.

[Figure: The general structure of a speech recognition system]

We can capture speech or any other sound with the help of a microphone, and a sound card converts the analog signal to digital.
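Below is a minimal capture sketch in Python, assuming the third-party sounddevice package and a working input device; the sample rate and duration are illustrative choices, not values from the article.

```python
# Minimal sketch: capture a few seconds of speech from the default
# microphone as a digital signal (assumes the third-party
# `sounddevice` package and a working input device).
import sounddevice as sd

SAMPLE_RATE = 16_000   # 16 kHz is a common rate for speech
DURATION = 3           # seconds to record

# rec() starts the analog-to-digital capture; wait() blocks until done.
recording = sd.rec(int(DURATION * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()
signal = recording[:, 0]   # 1-D array of samples in [-1.0, 1.0]
print(f"Captured {signal.shape[0]} samples at {SAMPLE_RATE} Hz")
```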

Once capturing is complete, the speech or sound is available as a continuous digital signal. The pre-processing step then has the following four stages (stages 2-4 are sketched in code after the list):

  1. Background Noise and Silence Removal
  2. Pre-emphasis Filter
  3. Blocking into Frames
  4. Windowing
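The following NumPy sketch illustrates stages 2-4; stage 1 (background noise and silence removal) is assumed to have been applied already, and the frame sizes and 0.97 coefficient are conventional choices rather than values prescribed above.

```python
# NumPy sketch of pre-processing stages 2-4 (stage 1 assumed done).
import numpy as np

def preprocess(signal, sample_rate=16_000, frame_ms=25, step_ms=10, alpha=0.97):
    # Stage 2. Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Stage 3. Blocking into overlapping frames
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    step = int(sample_rate * step_ms / 1000)         # e.g. 160 samples
    n_frames = 1 + (len(emphasized) - frame_len) // step
    frames = np.stack([emphasized[i * step : i * step + frame_len]
                       for i in range(n_frames)])

    # Stage 4. Windowing: taper each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)
```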

Feature extraction is the process of transforming speech into parameters that represent the speech signal in terms of feature vectors. Feature vectors can be extracted using a digital filter, the Fourier transform, or Linear Predictive Coding. Utterances of the same speech should map to the same feature vector even when the speaker changes.

The most powerful feature extraction method is Linear Predictive Coding (LPC), and it gives the most accurate speech parameters.
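As a hedged illustration, the snippet below extracts one LPC vector per windowed frame using the third-party librosa library; the model order of 12 is a typical choice for 16 kHz speech, not a value from the article.

```python
# Sketch: LPC feature extraction per frame with librosa
# (assumes non-silent frames; all-zero frames make the fit ill-conditioned).
import numpy as np
import librosa

def lpc_features(frames, order=12):
    # Each row of `frames` is one windowed frame from pre-processing;
    # librosa.lpc fits an order-`order` all-pole model to each frame.
    return np.stack([librosa.lpc(frame.astype(np.float64), order=order)
                     for frame in frames])

# feature_vectors = lpc_features(preprocess(signal))  # one vector per frame
```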

Recognition can be divided into two parts,

  1. Training Part
  2. Testing Part

In the training part, the system gains experience and learns. Current speech recognition technologies do not allow real-time implementation of models comparable to human complexity, which means the variability of speech must be limited to achieve proper results.

In the testing part, the system compares an unknown speech signal against the reference patterns; the closest pattern determines the recognized word.

Models of Speech Recognition Systems

1. Dynamic Time Warping (DTW)

DTW is a technique that can find an optimal match between two given sequences of speech. This method facilitates a non-linear mapping from one signal to another by minimizing the distance between the two signals. DTW is a template-based method, and to understand it there are two concepts to deal with:

  1. Features — the information in every signal should be represented in the same manner.
  2. Distances — some form of metric is needed to obtain a matching path.

There are two different types of distance. The first is local distance, the computational difference between a feature of one signal and a feature of the other signal. The second is global distance, the overall computational difference between a full signal and another signal of possibly different length. DTW can be introduced in two variants:

Symmetrical DTW — Speech is inherently time-dependent. The same word can be pronounced in ways that have different time durations, and even utterances of the same word with equal duration will deviate in the middle, because different parts of the word are spoken at different rates. Matching is therefore subject to the following constraints:

  • Matching paths cannot reverse in time
  • Every frame in the input must be used in the matching path
  • The global distance is obtained by combining local distance scores

This procedure is known as Dynamic Programming (DP). DP finds the minimum-distance path through the matrix while reducing the amount of computation, as the sketch below shows.
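Here is a minimal symmetric DTW sketch in NumPy; the Euclidean local distance and the three allowed moves are illustrative choices consistent with the constraints above.

```python
# Minimal symmetric DTW: fill the global-distance matrix by dynamic
# programming and return the alignment cost of two feature sequences
# (one row per frame).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            # Paths may not reverse in time: only these three moves
            D[i, j] = local + min(D[i - 1, j],      # insertion
                                  D[i, j - 1],      # deletion
                                  D[i - 1, j - 1])  # match
    return D[n, m]  # global distance
```

In a template-based recognizer, the unknown utterance is compared against every stored template, and the word whose template yields the smallest global distance is chosen.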

Asymmetrical DTW — In this approach, each frame of the input pattern is used exactly once. This dispenses with template-length normalization, and for a diagonal transition there is no need to add the local distance twice. This method is referred to as asymmetric DP.

2. Hidden Markov Model (HMM)

The Hidden Markov Model (HMM) is one of the most popular techniques in machine learning and statistics for modeling speech sequences in the field of Natural Language Processing, and speech can be recognized mathematically using this approach. An HMM is a doubly embedded stochastic process (a process with a random probability distribution that may be analyzed statistically but not predicted precisely): the underlying stochastic process is not directly observable (hidden), but it can be observed through another stochastic process that produces the sequence of observations. An HMM defines a probability distribution over a set of observations a = a1, …, at, …, aT by introducing a set of unobserved (hidden) discrete state variables u = u1, …, ut, …, uT. The main idea behind the HMM is that the hidden states have Markov dynamics: given ut, the state us is independent of ur for all s < t < r, and each observation at is independent of all other variables given ut. The model is defined by two sets of parameters: the transition matrix, whose (i, j) element is P(ut+1 = j | ut = i), and the emission matrix, whose (i, q) element is P(at = q | ut = i).
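Evaluating the probability of an observation sequence under such a model is commonly done with the forward algorithm; below is a minimal NumPy sketch whose toy parameters are invented purely for illustration.

```python
# Forward algorithm sketch: P(observations | model) for a
# discrete-observation HMM with initial probabilities pi,
# transition matrix A (A[i, j] = P(u_t+1 = j | u_t = i)), and
# emission matrix B (B[i, q] = P(a_t = q | u_t = i)).
import numpy as np

def forward(obs, pi, A, B):
    alpha = pi * B[:, obs[0]]            # initialise with first observation
    for a_t in obs[1:]:
        alpha = (alpha @ A) * B[:, a_t]  # propagate states, then emit
    return alpha.sum()                   # total probability of the sequence

pi = np.array([0.6, 0.4])                         # two hidden states
A = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition probabilities
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # three output symbols
print(forward([0, 1, 2], pi, A, B))  # likelihood of the observation sequence
```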

By using a probabilistic model, stochastic modeling deals with incomplete and uncertain data. Incompleteness and uncertainty occur in speech recognition due to, for example, speaker variability, contextual effects, confusable sounds, and homophone words.

An HMM consists of a collection of states connected by transitions. Each transition carries two sets of probabilities,

  • Transition probability — gives the probability of taking this transition
  • Output probability — gives the conditional probability of emitting each output symbol from a finite alphabet, given that the transition is taken.

The known difficulties of HMMs are the evaluation, decoding, and learning problems; a sketch of decoding follows.
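The decoding problem, finding the most likely hidden-state sequence, is classically solved with the Viterbi algorithm; here is a minimal NumPy sketch using the same invented toy parameters as the forward-algorithm example.

```python
# Viterbi decoding sketch: recover the most likely hidden-state
# sequence for a discrete-observation HMM. Log probabilities are
# used to avoid numerical underflow on long sequences.
import numpy as np

def viterbi(obs, pi, A, B):
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]
    back = []
    for a_t in obs[1:]:
        scores = delta[:, None] + logA        # best score into each state
        back.append(scores.argmax(axis=0))    # remember best predecessor
        delta = scores.max(axis=0) + logB[:, a_t]
    # Trace the best path backwards from the most likely final state
    state = int(delta.argmax())
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))  # -> [0, 0, 1]
```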