DeepVoice- An Intuitive Guide

Original article was published on Deep Learning on Medium

Text-to-speech (TTS) conversion is an important problem to solve. It extends the reach of content by offering more than one way to consume it, and it makes content more accessible to people who cannot read, among other benefits. Researchers at Baidu AI came up with a long but intuitively simple pipeline to tackle this problem and provide an end-to-end solution to it. This post assumes you have a basic understanding of how RNNs work. It gives me great pleasure to cover this paper since one of my machine learning heroes, Andrew Ng, contributed to it.

The DeepVoice Solution Pipeline

DeepVoice consists of the following networks:

  • The grapheme-to-phoneme model: As the name suggests it converts text to phonemes.
  • The segmentation model: It locates boundaries between any two phonemes.
  • The phoneme duration model: Predicts the duration of each individual phoneme.
  • The fundamental frequency model: Predicts whether the phoneme is voiced and, if so, its fundamental frequency throughout its duration.
  • The audio synthesis model: Synthesizes the final audio corresponding to the required text.

Network Architecture

The grapheme to phoneme model

It employs the encoder-decoder architecture often seen in neural machine translation. The encoder has 3 bidirectional layers of 1024 GRUs (Gated Recurrent Units) each, and the decoder has 3 unidirectional layers of the same size. Decoding uses a beam search with a width of 5 candidates. The model predicts phonemes, the basic sound units that make up the pronunciation of each word.
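
To make the beam search step more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `toy_decoder` is a hypothetical stand-in for the trained GRU decoder and simply favours the phonemes of "hello", and the `</s>` end token and `max_len` limit are assumptions made for illustration.

```python
import math

def beam_search(step_log_probs, beam_width=5, max_len=10, end_token="</s>"):
    """Return the highest-scoring token sequence found by beam search.

    step_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the `beam_width` best hypotheses at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]

def toy_decoder(prefix):
    """Hypothetical decoder: puts 0.8 on the next phoneme of 'hello'."""
    target = ("HH", "EH", "L", "OW", "</s>")
    correct = target[len(prefix)] if len(prefix) < len(target) else "</s>"
    return {t: math.log(0.8 if t == correct else 0.05) for t in target}

best = beam_search(toy_decoder, beam_width=5)
print(best)
```

A wider beam keeps more partial hypotheses alive at each step, which costs more compute but makes it less likely that the best overall sequence is pruned early.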

The segmentation model

This model learns the alignment between a given utterance and its corresponding sequence of phonemes. It is trained using the CTC (connectionist temporal classification) loss. To detect precise phoneme boundaries, the model is trained to predict pairs of phonemes instead of single phonemes. A phoneme pair is most likely to be predicted at timesteps close to the boundary between its two phonemes, since the probability of finding both phonemes together is highest at the boundary. The encoding works as follows:

  • "Hello" can be written as HH EH L OW in terms of phonemes.
  • After padding with a silence token on either side, this becomes sil HH EH L OW sil.
  • Now consecutive token pairs are constructed as: “(sil, HH), (HH, EH), (EH, L), (L, OW), (OW, sil)”
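
The pair encoding in the list above can be sketched in a few lines of Python. The function name `phoneme_pairs` is made up for this illustration and does not come from the paper.

```python
def phoneme_pairs(phonemes, silence="sil"):
    """Pad a phoneme sequence with silence and return consecutive pairs."""
    padded = [silence] + list(phonemes) + [silence]
    # zip the sequence with itself shifted by one to get neighbouring pairs
    return list(zip(padded, padded[1:]))

pairs = phoneme_pairs(["HH", "EH", "L", "OW"])
print(pairs)
# [('sil', 'HH'), ('HH', 'EH'), ('EH', 'L'), ('L', 'OW'), ('OW', 'sil')]
```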

The segmentation model consists of two 2-D convolution layers, followed by 3 bidirectional GRU layers with 512 GRU units each, and finally a softmax output layer. A beam search of width 50 is then used to decode the phoneme boundaries.

Phoneme Duration and Fundamental Frequency Model

The input to the model is a sequence of phonemes with stress annotations (marking the emphasized parts), encoded as one-hot vectors. The architecture comprises 2 fully connected layers with 256 units each, followed by 2 unidirectional recurrent layers with 128 GRU units each. For each phoneme it outputs the duration, the probability that the phoneme is voiced, and 20 time-dependent fundamental frequency values. The loss function is a joint loss over all three outputs.
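
As a rough illustration of what a joint loss over three outputs can look like, here is a sketch for a single phoneme. The individual terms (absolute error for duration and frequency, binary cross-entropy for voicing), the equal weights, and the dictionary layout are all assumptions made for this example; the post does not spell out the exact formulation.

```python
import math

def joint_loss(pred, target, w_dur=1.0, w_voiced=1.0, w_f0=1.0):
    """Hypothetical joint loss for one phoneme (weights are assumptions)."""
    # Duration term: absolute error in seconds.
    dur_loss = abs(pred["duration"] - target["duration"])

    # Voicing term: binary cross-entropy on the predicted probability.
    p = min(max(pred["p_voiced"], 1e-7), 1 - 1e-7)  # avoid log(0)
    y = 1.0 if target["voiced"] else 0.0
    voiced_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

    # Frequency term: mean absolute error over the 20 values,
    # counted only when the phoneme is actually voiced.
    f0_err = sum(abs(a - b) for a, b in zip(pred["f0"], target["f0"]))
    f0_loss = y * f0_err / len(target["f0"])

    return w_dur * dur_loss + w_voiced * voiced_loss + w_f0 * f0_loss

pred = {"duration": 0.10, "p_voiced": 0.9, "f0": [120.0] * 20}
target = {"duration": 0.12, "voiced": True, "f0": [125.0] * 20}
loss = joint_loss(pred, target)
print(loss)
```

Because the three terms share one scalar loss, a single backward pass updates the shared layers for duration, voicing and frequency at the same time.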

Audio Synthesis Model

This model is a modification of the WaveNet network. As the paper mentions, it consists of two networks:

“A conditioning network, which upsamples linguistic features to the desired frequency, and an autoregressive network, which generates a probability distribution P(y) over discretized audio samples y ∈ {0, 1, . . . , 255}”
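
The 256 discrete values in the quote correspond to 8-bit audio samples. One common way to map a waveform onto such values is mu-law companding, as described in the original WaveNet paper. The sketch below assumes that scheme with mu = 255; this post does not state which quantization Deep Voice uses, so treat it as an illustration only.

```python
import math

def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer in {0, ..., mu}."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1) / 2 * mu + 0.5)

def mu_law_decode(y, mu=255):
    """Approximate inverse of mu_law_encode."""
    compressed = 2 * (y / mu) - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)

codes = [mu_law_encode(s) for s in (-1.0, 0.0, 0.5, 1.0)]
print(codes)
```

The logarithmic compression spends more of the 256 levels on quiet samples, where the ear is more sensitive to quantization error.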

The only difference from WaveNet here is that, instead of simply using transposed convolutions for upsampling, quasi-RNN (QRNN) layers are used to encode the inputs, and upsampling is then performed by repetition.
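
Upsampling by repetition simply copies each conditioning frame enough times to match the audio sample rate. A minimal sketch follows; the frame values and the factor of 3 are made up for illustration, and the QRNN encoding step is not shown.

```python
def upsample_by_repetition(frames, factor):
    """Repeat every frame `factor` times, preserving order."""
    return [frame for frame in frames for _ in range(factor)]

# Two hypothetical feature frames, stretched to six time steps.
frames = [[0.1, 0.2], [0.3, 0.4]]
upsampled = upsample_by_repetition(frames, 3)
print(len(upsampled))  # 6
```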

Training and Inference

The training and inference cycles for the pipeline are as follows.

Training:

  • Text is first fed into the grapheme-to-phoneme model, which predicts phonemes for the corresponding text. This model is trained first.
  • These phonemes, along with the audio for the same text, are fed into the segmentation model to predict the time steps at which each pair of phonemes is most likely to occur. This gives an estimate of the phoneme boundaries.
  • The duration and frequency prediction model is then trained using the phoneme boundaries and phoneme inputs to predict the duration of every phoneme. The audio input data provides the fundamental frequency for every phoneme.
  • Finally, using the phonemes, durations and fundamental frequencies, the audio synthesis network is trained. Ground truth audio is provided as the label to train against.


Inference:

  • The segmentation model is not used in the inference pipeline.
  • Text is fed through the grapheme-to-phoneme model to produce phonemes.
  • These phonemes are then passed to the duration and frequency prediction model, which predicts their durations and fundamental frequencies. These in turn are used by the audio synthesis model to generate the audio.
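
The inference steps above amount to chaining three models. The sketch below shows only that data flow: every function is a hypothetical stand-in (a lookup table, constant predictions and a sine-wave generator) rather than one of the trained networks, and the 16 kHz sample rate is an assumption.

```python
import math

TOY_LEXICON = {"hello": ["HH", "EH", "L", "OW"]}  # hypothetical lookup table

def grapheme_to_phoneme(text):
    """Stand-in for the grapheme-to-phoneme model."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes

def predict_duration_and_f0(phonemes):
    """Stand-in for the duration and frequency model.

    Returns (phoneme, duration in seconds, f0 in Hz); f0 of 0.0 means unvoiced.
    """
    voiced = {"EH", "L", "OW"}
    return [(p, 0.1, 120.0 if p in voiced else 0.0) for p in phonemes]

def synthesize(features, sample_rate=16000):
    """Stand-in for the audio synthesis model: sine tones and silence."""
    samples = []
    for _, duration, f0 in features:
        n = int(round(duration * sample_rate))
        for i in range(n):
            value = math.sin(2 * math.pi * f0 * i / sample_rate) if f0 > 0 else 0.0
            samples.append(value)
    return samples

def text_to_speech(text):
    phonemes = grapheme_to_phoneme(text)           # step 1: text -> phonemes
    features = predict_duration_and_f0(phonemes)   # step 2: durations and f0
    return synthesize(features)                    # step 3: audio samples

audio = text_to_speech("hello")
print(len(audio))  # 6400 samples, i.e. 0.4 s at 16 kHz
```

Note that nothing in this chain needs the segmentation model, which is only required to produce training labels.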

Why Deep Voice?

End to End Solution

Deep Voice provided an end-to-end deep learning solution to the TTS problem, unlike its predecessors, which relied on specially hand-engineered features.

Minimal use of hand engineered features

Deep Voice used one-hot encoded phonemes, durations and fundamental frequencies as features, which were easily available from audio transcripts, unlike other models such as WaveNet and Char2Wav, which used more complex features.

Production Ready System

Deep Voice provides fast inference. This compares well with WaveNet, which requires several minutes of runtime, and SampleRNN, which requires almost 4-5x more compute during inference.


Deep Voice was trained on an internal English speech database containing approximately 20 hours of speech data segmented into 13,079 utterances.
It generated great results and proved to be superior to its predecessors. It was also able to run at high speeds even on CPUs. I hope this post was able to shed some light on how Deep Voice works. I strongly urge you to go through the paper, since it does not really need a post like this. The paper is so well written that it is very easy to understand.

Link to the original paper:

Link to blogpost to get more clear intuition on training and inference pipeline: