Original article was published on Artificial Intelligence on Medium
Facebook’s Highly Efficient New Real-Time Text-To-Speech System Runs on CPUs
Text-to-Speech (TTS) refers to the ability of computers to read text aloud. As voice assistant technology becomes increasingly common and sophisticated, TTS has leapt far beyond the monotonous metallic voices of yesteryear. Now, computer voices are expected to sound humanlike, and recreating the nuances of the human voice using neural networks has become a research focal point for modern TTS systems.
To deliver human-level voices to its platform’s billions of users while maintaining strict compute efficiency, Facebook AI researchers have deployed a new neural TTS system that works on CPU servers. The model attains a 160x speedup over the company baseline while retaining state-of-the-art audio quality.
Previous systems mostly rely on powerful GPUs or other specialized hardware to generate high-quality speech in real time. The new Facebook system, however, can host the whole service in real time on regular CPUs without any specialized hardware, thanks to “strong engineering and extensive model optimization.”
The highly flexible system is expected to play an important role in creating and scaling new voice applications that sound more human and expressive and are more enjoyable for users to interact with. It’s currently deployed in Portal, a video-calling device, and is available as a service for other applications like reading assistance and virtual reality.
In a blog post, Facebook introduces a pipeline that efficiently combines four components, each focusing on a different aspect of speech, to solve the core efficiency challenges of deploying the new system at scale.
The researchers use a linguistic front-end to convert input text into a sequence of linguistic features such as phonemes and sentence type, followed by a prosody model that predicts rhythm and melody to re-create the expressive qualities of natural human speech. They explain that building a separate prosody model into the pipeline is essential because it allows easier control of speech style at synthesis time.
The Facebook team also adopted a conditional neural vocoder architecture, using an acoustic model to transform linguistic and prosodic information into frame-rate spectral features, which serve as the neural vocoder’s inputs. This approach enables the neural vocoder to focus on spectral information packed into a few neighbouring frames and allows the team to train a lighter and smaller neural vocoder.
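To make the data flow between the four stages concrete, here is a minimal Python sketch. It is not Facebook's implementation: every function and class name is hypothetical, each stage is a toy stub standing in for a trained model, and the `style_speed` parameter is an assumed example of the kind of style control a separate prosody model makes possible.

```python
from dataclasses import dataclass
from typing import List

FRAME_RATE_HZ = 200      # approximate frame rate quoted in the article
SAMPLE_RATE_HZ = 24_000  # output sample rate quoted in the article


@dataclass
class LinguisticFeatures:
    phonemes: List[str]  # toy stand-in: one "phoneme" per letter
    sentence_type: str   # "question" or "statement"


@dataclass
class Prosody:
    durations_frames: List[int]  # rhythm: frames allotted to each phoneme
    pitch_hz: List[float]        # melody: one pitch target per phoneme


def linguistic_frontend(text: str) -> LinguisticFeatures:
    """Stage 1 (toy): convert text to phoneme-like symbols and a sentence type."""
    phonemes = [ch for ch in text.lower() if ch.isalpha()]
    kind = "question" if text.strip().endswith("?") else "statement"
    return LinguisticFeatures(phonemes, kind)


def prosody_model(feats: LinguisticFeatures, style_speed: float = 1.0) -> Prosody:
    """Stage 2 (toy): fixed duration per phoneme, rising pitch for questions."""
    base = max(1, round(20 / style_speed))  # 20 frames is 100 ms at 200 frames/s
    n = len(feats.phonemes)
    rise = 40.0 if feats.sentence_type == "question" else 0.0
    pitch = [120.0 + rise * i / max(1, n - 1) for i in range(n)]
    return Prosody([base] * n, pitch)


def acoustic_model(feats: LinguisticFeatures, prosody: Prosody) -> List[List[float]]:
    """Stage 3 (toy): emit one 2-value "spectral" frame per frame slot."""
    frames: List[List[float]] = []
    for ph, dur, f0 in zip(feats.phonemes, prosody.durations_frames, prosody.pitch_hz):
        frames.extend([[float(ord(ph) - ord("a")), f0]] * dur)
    return frames


def neural_vocoder(frames: List[List[float]]) -> List[float]:
    """Stage 4 (toy): output silence with the right number of samples per frame."""
    per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ
    return [0.0] * (len(frames) * per_frame)


def synthesize(text: str, style_speed: float = 1.0) -> List[float]:
    """Chain the four stages: text -> features -> prosody -> frames -> samples."""
    feats = linguistic_frontend(text)
    prosody = prosody_model(feats, style_speed)
    frames = acoustic_model(feats, prosody)
    return neural_vocoder(frames)
```

Because the prosody stage is separate, changing `style_speed` alters only the predicted durations; the front-end and the downstream stages are untouched. In this toy version, doubling the speed halves the number of frames and therefore the number of output samples.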
Their neural vocoder generates a 24 kHz speech waveform conditioned on prosodic and spectral features. It consists of a convolutional neural network that expands the input feature vectors from frame rate (around 200 predictions per second) to sample rate (24,000 predictions per second) and a recurrent neural network that synthesizes audio samples autoregressively at 24,000 samples per second.
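The rates quoted above imply that each input frame must be expanded into 24,000 / 200 = 120 output samples, and that the autoregressive stage runs 24,000 sequential steps for every second of audio. The Python sketch below illustrates only that structure. It is an assumption-laden toy: simple repetition stands in for the learned convolutional upsampling, a one-line `tanh` recurrence stands in for the recurrent network, and all names are hypothetical.

```python
import math
from typing import List

FRAME_RATE_HZ = 200      # approximate frame rate quoted in the article
SAMPLE_RATE_HZ = 24_000  # output sample rate quoted in the article
UPSAMPLE_FACTOR = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 120 samples per frame


def upsample(frames: List[float]) -> List[float]:
    """Expand frame-rate values to sample rate by repetition.

    A real vocoder would use a learned convolutional network here.
    """
    return [value for value in frames for _ in range(UPSAMPLE_FACTOR)]


def autoregressive_synthesis(conditioning: List[float], feedback: float = 0.5) -> List[float]:
    """Generate one sample per step; each step needs the previous output."""
    samples: List[float] = []
    prev = 0.0
    for cond in conditioning:
        # Toy recurrence: the new sample depends on the last sample and
        # on the conditioning value for this time step.
        prev = math.tanh(feedback * prev + cond)
        samples.append(prev)
    return samples


# One second of input: 200 frames become 24,000 sequential synthesis steps.
one_second_of_frames = [0.1] * FRAME_RATE_HZ
audio = autoregressive_synthesis(upsample(one_second_of_frames))
```

Because every sample depends on the one before it, the loop cannot be parallelized across time steps, which is why making each step cheap matters so much for real-time synthesis on CPUs.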
The new TTS system lays the foundation that will make building more humanlike systems a reality, the researchers say. They will continue to add more languages, accents, and dialects to their voice portfolio while remaining focused on keeping the system light and efficient so it can run on smaller devices.