Deep Speech — Real Time Neural Speech Recognition

Speech-to-text algorithms are doing amazing work these days. There are Google Duplex, Amazon Alexa, Microsoft’s Cortana, and others.
Big companies are trying to make big progress in AI development.

Honestly, Baidu is not a company that surprises us much with its products. But they came up with a research paper on a speech-to-text neural network that seems pretty amazing. I give a big part of the credit to Baidu, because they came up with the neural network architecture, which seems unbelievably efficient. “Translating” the paper into an actual working prototype was hard work, and it was done by Mozilla. But let’s be clear: no one could have built a good working library without the architecture from the paper, and on the other hand, doing the research and writing the paper without bringing it into the real world as working software wouldn’t have been a game-changing step.

As Baidu says, their work is a real game changer because they implemented a couple of cool algorithms in their Deep Speech paper.
They mention that their architecture is significantly simpler than traditional speech-to-text systems. The most valuable thing they’ve done is that their deep-learning-based system can cope with effects such as background noise, reverberation, or speaker variation without hand-designed components for each of them. In their words, Deep Speech “directly learns a function that is robust to such effects”.

Additionally, they claim that Deep Speech doesn’t use phoneme dictionaries. Instead, according to the paper, the key to their approach is an optimized Recurrent Neural Network that uses GPUs for parallelized training and efficient generation. GPU-optimized training also makes it practical to train on a huge amount of data.

Furthermore, on a new noisy speech recognition dataset, Baidu’s Deep Speech achieved a 19.1% word error rate, which is almost 10.4 percentage points lower than the average error rate on the global market.
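Word error rate is the metric behind a number like that 19.1%: count the word substitutions, deletions, and insertions needed to turn the recognized text into the reference text, then divide by the number of reference words. Here is a minimal sketch of that calculation. The function name and the example sentences are mine, for illustration only, and this is not the evaluation script Baidu used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that the result can go above 1.0 when the hypothesis contains many extra words, which is why WER is an error rate and not an accuracy percentage turned upside down.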

How can we increase the accuracy for our specific languages?
We don’t need our spoken sentences converted into text with one-fifth of the words wrong.
So, here’s the solution: spell checking

Spell checking is not a huge deal in the deep learning ecosystem. All you need is a large enough dataset for your AI to learn the structure of your language: where each word should stand, and in which form. Adding spell checking to a speech-to-text neural network increases the accuracy by almost 14–15%, if the spell checker model is pretty good.
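To make the idea concrete, here is a toy sketch of spell checking as a post-processing step on the recognized text. It uses difflib from Python’s standard library to snap unknown words to the closest word in a vocabulary. The function name, the cutoff value, and the tiny word list are my own assumptions for illustration; a real spell checker would be trained on a large corpus and would look at context, not just at single words.

```python
import difflib


def correct_transcript(transcript, vocabulary, cutoff=0.75):
    """Replace out-of-vocabulary words with their closest vocabulary match."""
    vocab = sorted({word.lower() for word in vocabulary})
    known = set(vocab)
    corrected = []
    for word in transcript.lower().split():
        if word in known:
            corrected.append(word)
            continue
        # get_close_matches returns the best candidates at or above the cutoff.
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


if __name__ == "__main__":
    words = ["speech", "recognition", "is", "fun"]
    print(correct_transcript("speach recogniton is fun", words))
```

The cutoff is the knob to watch: set it too low and correct but rare words get “fixed” into wrong ones, set it too high and real mistakes slip through.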

Reinforcement Learning implementation.

The cool guys at Mozilla built a web platform where they let their users “donate” voice recordings with the matching text, and validate the data donated by others by liking or disliking how well the voice matches the given text. In simple words, Mozilla has kind people do all the reinforcement learning stuff manually. And in addition, those people also donate voice data. Pretty smart, huh? The link for the kind people:

With these steps, they can push the accuracy of the speech-to-text and text-to-speech models close to perfect.
And they don’t even have to pay for that work.

ჯიგარი is not translatable into any language, but anyway, it means something that makes you a good man :)

Baidu used a couple of optimized datasets to train their Deep Speech model. They used datasets from four sources:
1. The Wall Street Journal — 80 hours of reading data by 280 speakers
2. Switchboard — 300 hours of conversation data by 4000 speakers.
3. Fisher — 2000 hours of conversation data by 23000 speakers
4. Baidu — 5000 hours of reading data by 9600 speakers.

The first three datasets are published by the Linguistic Data Consortium. The fourth is Baidu’s own collection.
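To get a feeling for the scale, the numbers from the list can be tallied in a few lines. This is just arithmetic on the figures quoted above; the dictionary layout and function names are mine.

```python
# Hours and speaker counts exactly as listed above.
DATASETS = {
    "WSJ": {"hours": 80, "speakers": 280},
    "Switchboard": {"hours": 300, "speakers": 4000},
    "Fisher": {"hours": 2000, "speakers": 23000},
    "Baidu": {"hours": 5000, "speakers": 9600},
}


def total_hours(names=None):
    """Sum the hours of the named datasets (all of them by default)."""
    names = DATASETS if names is None else names
    return sum(DATASETS[name]["hours"] for name in names)


def hours_per_speaker(name):
    """Average amount of audio contributed by one speaker."""
    entry = DATASETS[name]
    return entry["hours"] / entry["speakers"]


if __name__ == "__main__":
    print("all datasets:", total_hours(), "hours")
    print("Switchboard + Fisher:", total_hours(["Switchboard", "Fisher"]), "hours")
    for name in DATASETS:
        print(name, round(hours_per_speaker(name) * 60, 1), "minutes per speaker")
```

All four sources together come to 7380 hours, and Switchboard plus Fisher alone give 2300 hours.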

Why do they need this much data?
Well, as they mention, the Deep Speech SWB model is a neural network with 5 hidden layers of 2048 neurons each, trained on the 300 hours of Switchboard data.
But the larger one, the Deep Speech SWB + FSH model, is an ensemble of 4 Recurrent Neural Networks, each with 5 hidden layers of 2304 neurons, trained on the full 2300 hours of the combined corpus.
That’s where all the data goes: into training these huge, deep neural networks.
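A common way to make several networks work together is to average their per-character probabilities at every time step before decoding. Below is a minimal sketch of that idea with made-up numbers. The function names, the three-symbol alphabet, and the greedy pick of the most likely symbol are my assumptions for illustration, not Baidu’s actual code or decoder.

```python
def average_ensemble(predictions):
    """Average the probabilities of several models.

    predictions[m][t][c] is the probability that model m assigns
    to symbol c at time step t.
    """
    n_models = len(predictions)
    n_steps = len(predictions[0])
    averaged = []
    for t in range(n_steps):
        frames = [model[t] for model in predictions]
        # zip(*frames) groups the probabilities of the same symbol together.
        averaged.append([sum(column) / n_models for column in zip(*frames)])
    return averaged


def greedy_labels(frames, alphabet):
    """Pick the most likely symbol at each time step."""
    return [alphabet[max(range(len(frame)), key=frame.__getitem__)] for frame in frames]


if __name__ == "__main__":
    alphabet = ["a", "b", "_"]
    model_1 = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    model_2 = [[0.4, 0.5, 0.1], [0.1, 0.8, 0.1]]
    averaged = average_ensemble([model_1, model_2])
    print(greedy_labels(averaged, alphabet))
```

In the example the second model alone would pick “b” at the first step, but the average still lands on “a”, which is the whole point of an ensemble: single networks get outvoted when they disagree.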

Where to start

Go to the Deep Speech repository on GitHub, and have a look at the files and documentation:

Firstly, I should mention that they say the model’s input is an audio file sampled at 16 kHz, which is also the output rate of the WaveNet TTS generative model. WaveNet, however, uses 48 kHz audio files for training. As Baidu says in their research paper, their training data is not of such surprisingly high quality.
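Before feeding your own recordings to the model, it’s worth checking their sample rate. Here is a small sketch using the wave module from Python’s standard library. The function names are mine, and the helper that writes a silent file exists only so the example can run without any real audio.

```python
import wave


def wav_format(path):
    """Read basic format information from a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_sec": wav.getnframes() / rate,
        }


def is_16khz(path):
    """True when the file is sampled at 16 kHz."""
    return wav_format(path)["sample_rate"] == 16000


def write_silence(path, sample_rate, seconds):
    """Demo helper: write a mono 16-bit WAV file containing only silence."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))


if __name__ == "__main__":
    write_silence("demo_16k.wav", 16000, 1.0)
    print(wav_format("demo_16k.wav"))
    print(is_16khz("demo_16k.wav"))
```

If a file comes back with a different rate, resample it with your audio tool of choice before running recognition on it.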

Let’s get back to the Deep Speech repository. You’ll need TensorFlow 1.6.0 (the GPU build, for sure) and Python 3.6 as well (you shouldn’t be surprised by this information).

So, I hope you’ve installed pip on your laptop. Either way, I’ll show you all the dependencies you should have installed.
Open up your Terminal and >

You have all the dependencies now. That leads us to the Deep Speech installation. Use the Terminal again to get the Deep Speech files and pre-trained models.

Now let’s clone the Deep Speech repository and get the pre-trained models using the following commands >

# Clone the repository
git clone

# Get the pre-trained models
wget -O - | tar xvfz -


When the downloads are finished, let’s install all the dependencies, including the virtual environment and the assets for the virtualenv itself.

We are all set. Now the algorithm with its pre-trained models is ready to be re-trained or used. Note that a GeForce GTX 1070 needs 0.44 sec for generation.
If you have a worse GPU than this, a long wait is the kindest outcome you can hope for.

Deep Speech has a lot of new upgrades and extensions, including checkpointing. With it, you can stop the training at some point and re-train on additional data later on. Or, if you just want to continue training, checkpointing gives you the ability to resume whenever you want.
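If checkpointing is new to you, the idea is simple: save the training state to disk every so often, and load it back before continuing. Here is a toy sketch of that pattern in plain Python. It is not the Deep Speech or TensorFlow API; the function name, the JSON file, and the fake “weight update” are all stand-ins I made up to show the mechanics.

```python
import json
import os


def train(total_steps, checkpoint_path, save_every=10):
    """Run a toy training loop that can resume from a saved checkpoint."""
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(checkpoint_path):
        # Resume from where the previous run stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        # Stand-in for a real gradient update: move the weight toward 1.0.
        state["weight"] += 0.1 * (1.0 - state["weight"])
        state["step"] += 1

        if state["step"] % save_every == 0 or state["step"] == total_steps:
            # Write to a temporary file first, then rename it, so a crash
            # in the middle of saving cannot corrupt the last good checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state


if __name__ == "__main__":
    print(train(20, "toy_checkpoint.json"))  # first run: steps 1 to 20
    print(train(50, "toy_checkpoint.json"))  # second run: resumes at step 21
```

The second call does not start from scratch. It picks up the saved step counter and weight, which is exactly what lets you pause a long training job or continue it later on new data.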


In case your native language is not English, you can re-train on your language’s data on top of the English pre-trained models. Note that Baidu’s collected data is pretty accurate for the model, and it’s really huge. And since Deep Speech doesn’t use the concept of phonemes at all, turning the model into a neural net that recognizes your native language should be possible just by training it further on your data.
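Because the network predicts characters instead of phonemes, one of the first things to prepare for a new language is the set of characters that appear in your transcripts. Here is a minimal sketch of building that character-to-index mapping. The function names are my own, and how exactly you plug the mapping into training depends on the Deep Speech version you use, so treat this as an illustration only.

```python
def build_alphabet(transcripts):
    """Map every character seen in the transcripts to an integer label."""
    chars = sorted({ch for line in transcripts for ch in line.lower()})
    return {ch: index for index, ch in enumerate(chars)}


def encode(text, alphabet):
    """Turn a transcript into the label sequence the network would learn."""
    return [alphabet[ch] for ch in text.lower()]


def decode(labels, alphabet):
    """Turn a label sequence back into text."""
    index_to_char = {index: ch for ch, index in alphabet.items()}
    return "".join(index_to_char[label] for label in labels)


if __name__ == "__main__":
    alphabet = build_alphabet(["ჯიგარი"])
    labels = encode("ჯიგარი", alphabet)
    print(alphabet)
    print(labels)
    print(decode(labels, alphabet))
```

No pronunciation dictionary is involved anywhere: the transcripts themselves define everything the model has to learn to output.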

I think you’re all set. You can use Deep Speech any way you want; it’s under the Mozilla Public License 2.0. Make sure to contribute at least something if you find an issue in the code, or a new way to make it better.

Thank you for reading this. I hope I made at least something clear, easy, and interesting for you.

Source: Deep Learning on Medium