Automatic speech recognition (ASR) has come a long way in recent years. Virtual assistants that accept voice commands are a prominent example of how the technology can be applied successfully. However, when background noise or multiple speakers are introduced, such systems often prove inadequate.
At Corti, we have found that numerous ASR services perform considerably worse in noisy domains than in structured, single-speaker voice-command settings. ASR models are often trained on data sets of audio recordings of a single speaker reading written text aloud, be it news reports, as in the Wall Street Journal (WSJ) data, or audio books, as in LibriSpeech.
When such models fail to generalize to noisy domains, one can argue that they are in a certain sense overfitted to the domain of their data — they have learned that noise does not occur, or rather, they haven’t learned that noise exists, let alone what it is.
This post argues, with an example, that this ‘overfitting’ may be a consequence of the way these models are defined and trained.
Mozilla Deep Speech
Mozilla Deep Speech is an open-source implementation of Baidu’s similarly named 2014 research paper. The project provides access to a high-performing pretrained ASR model that can be used to transcribe audio.
This model has been trained on a compilation of different data sets that consist primarily of text being read aloud by a single person (audio books, news reports etc.).
At Corti, we found it interesting to see how this pretrained model performs on the type of noisy, conversational speech data that our own ASR systems handle.
One thing to note about the pretrained Deep Speech model is that it is trained on 16 kHz sound samples. Applying it to audio data of any other sample frequency requires upsampling that data. Mozilla warns that this may produce erratic results.
Before presenting and comparing results, we will briefly describe the different data sets used in this comparison.
Corti provides an advanced decision support system for emergency service dispatchers. The ASR task at hand is therefore to transcribe live emergency calls. Such calls involve at least two speakers and typically contain background noise from various sources. This differs fundamentally from the training data of Mozilla Deep Speech. For this comparison, the noisy data consists of calls from the Seattle emergency services dispatch center. As it stands, this data is recorded at 8 kHz and needs to be upsampled before being input to the Deep Speech model.
To assess the effect that upsampling has on the ASR, the model is also evaluated on the WSJ and Librivox English data sets. Neither of these sets is noisy, but the WSJ data is recorded at 16 kHz while the Librivox data is recorded at 8 kHz and requires upsampling. The entire Librivox English data set is used for evaluation, while the WSJ evaluation uses the “dev93” subset according to the Kaldi recipe.
To avoid aliasing effects and other artefacts, upsampling from 8 to 16 kHz is done by simply forward copying each amplitude value in the audio file once to effectively double the number of samples in the file. The file’s frame rate is then simply adjusted to 16 kHz. Aurally (look it up, it is a word) inspecting the upsampled audio confirms that the speech is readily understandable although slightly more ‘metallic’ sounding.
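The duplication scheme described above can be sketched in a few lines of Python. This is a minimal illustration for 16-bit mono PCM WAV files; the function name is ours, not part of the Deep Speech tooling:

```python
import wave

import numpy as np


def upsample_by_duplication(in_path: str, out_path: str) -> None:
    """Upsample an 8 kHz, 16-bit mono WAV to 16 kHz by repeating each sample once."""
    with wave.open(in_path, "rb") as f:
        params = f.getparams()
        audio = np.frombuffer(f.readframes(params.nframes), dtype=np.int16)

    # Forward-copy every amplitude value once: [a, b, c] -> [a, a, b, b, c, c]
    doubled = np.repeat(audio, 2)

    with wave.open(out_path, "wb") as f:
        f.setnchannels(params.nchannels)
        f.setsampwidth(params.sampwidth)
        f.setframerate(2 * params.framerate)  # e.g. 8000 -> 16000
        f.writeframes(doubled.tobytes())
```

A proper resampler would low-pass filter and interpolate instead, but sample duplication keeps the experiment simple and easy to reason about.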
Now to the interesting part. The table below summarizes the results of evaluating the pretrained Deep Speech model on the three data sets. We report on the word error rate (WER) as the metric for comparing the model’s performance.
| Data set | WER |
| --- | --- |
| WSJ (dev93) | 8.30% |
| Librivox | 11.68% |
| Seattle dispatch | 70.29% |
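For readers unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the hypothesis (substitutions + deletions + insertions), divided by the number of reference words. A minimal sketch (the function name is ours):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a three-word reference with one substituted word yields a WER of 1/3.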
The performance on the noisy Seattle dispatch data stands out by having more than eight times the WER of the WSJ data and six times that of the Librivox data.
The performance on the Librivox data is quite good, even considering that each recording has been upsampled to 16 kHz, and the difference from the WSJ result is small. The slightly higher WER on Librivox may simply reflect the different data distributions; with Librivox and WSJ comprising audio books and news article recordings, respectively, the language in Librivox may be expected to be more varied.
Looking at the Librivox and WSJ WERs, the high WER on the Seattle dispatch data does not seem to be due to the upsampling alone.
The comparison of the results of Deep Speech on WSJ and the upsampled Librivox seems to indicate that the drop in performance on the Seattle dispatch data is due to multiple speakers and background noise.
Clearly, this level of background noise differs fundamentally from anything in the training data of Mozilla Deep Speech. Yet it is present in many other sources of audio data and applications of ASR.
These results indicate that models pretrained on ‘read-aloud’ data sets such as WSJ and Librivox are unable to recognize speech in noisy domains at a satisfactory WER. Applying ASR to naturally noisy domains requires a different approach.
Source: Deep Learning on Medium