Ever Wanted to Build a Text-to-Speech App?

Original article can be found here (source): Artificial Intelligence on Medium

Application ideas and inspiration for text-to-speech

Photo by CoWomen on Unsplash

Text-to-speech synthesis, TTS for short, is the artificial production of human speech. In the past, it was largely performed through a process called concatenative text-to-speech.

This approach relied on a very large database of short speech fragments, recorded from a single speaker and then recombined to form complete utterances.

Whenever text needed to be converted to speech, the text-to-speech engine would search this large database for speech units that matched the input text defined by a user, then concatenate them to derive the final audio file. Think of this as stitching audio fragments together.
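The lookup-and-stitch process described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the unit database, the silence gap, and the synthesize function are all hypothetical, and whole words stand in for the much shorter fragments a real engine would store.

```python
# Toy illustration of concatenative synthesis: look up pre-recorded
# units and stitch them together. All names and data are hypothetical.

# Hypothetical unit database: each word maps to a short list of audio samples.
# A real system would store far shorter fragments recorded from one speaker.
UNIT_DATABASE = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.0, -0.2],
}

SILENCE = [0.0, 0.0]  # placeholder gap inserted between units


def synthesize(text):
    """Return one list of samples by concatenating the units that match the text."""
    audio = []
    for word in text.lower().split():
        if word not in UNIT_DATABASE:
            raise KeyError(f"no recorded unit for {word!r}")
        if audio:
            # Separate consecutive units with a short gap.
            audio.extend(SILENCE)
        audio.extend(UNIT_DATABASE[word])
    return audio


samples = synthesize("Hello world")
print(len(samples))
```

Because every utterance is assembled from the same fixed recordings, there is little room to vary intonation, which is one way to see why the results tended to sound flat.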

This mode of speech synthesis is now largely outdated, and for good reason: its output mostly sounded incredibly monotone and somewhat robotic.

To recall how this actually sounded: at some point, some of you may have dabbled with the text-to-speech feature in Adobe’s PDF reader and can relate to how unbearably robotic it was to listen to.

With recent advancements in deep learning and neural networks, such as WaveNet, text-to-speech has moved from mere concatenation to a near-fluid, human-like synthesis process, at least as per my comparison with conventional text-to-speech applications.

WaveNet is a deep neural network for generating raw audio waveforms. It is a probabilistic, autoregressive model designed by DeepMind, a company acquired by Google in 2014.
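To make "probabilistic and autoregressive" concrete: the waveform is produced one sample at a time, and each new sample is drawn from a probability distribution that depends on the samples generated before it. The sketch below shows only that loop. It is not WaveNet's architecture; the next_sample_distribution function is a hypothetical stand-in for the neural network, and the number of amplitude levels and the context length are assumptions chosen to keep the example small.

```python
import math
import random

NUM_LEVELS = 8   # toy number of quantized amplitude levels (assumption)
CONTEXT = 3      # how many previous samples the toy model looks at (assumption)


def next_sample_distribution(history):
    """Hypothetical stand-in for the neural network.

    Returns one probability per amplitude level, favoring levels close
    to the mean of the most recent samples.
    """
    recent = history[-CONTEXT:] or [NUM_LEVELS // 2]
    center = sum(recent) / len(recent)
    scores = [math.exp(-abs(level - center)) for level in range(NUM_LEVELS)]
    total = sum(scores)
    return [score / total for score in scores]


def generate(num_samples, seed=0):
    """Autoregressive loop: draw each sample conditioned on those before it."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        probs = next_sample_distribution(waveform)
        level = rng.choices(range(NUM_LEVELS), weights=probs)[0]
        waveform.append(level)
    return waveform


print(generate(10))
```

The key point is the feedback: whatever the model just produced becomes part of the input for the next step, so the output is generated rather than looked up.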

This new way of text-to-speech synthesis yields more realistic, human-like computer-generated speech, with human listeners rating it as significantly more natural sounding than the best concatenative systems.

With the introduction of WaveNet, the text-to-speech space has started to garner wider adoption, especially with the arrival of cloud services such as Google’s Text-to-Speech and Amazon Polly.
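To give a feel for what calling such a cloud service involves, here is a minimal sketch that builds a request body for Google's Text-to-Speech REST API and decodes the audio from a response. The endpoint and field names reflect my understanding of the v1 API and should be checked against the current documentation; the voice name is just an example, and build_request and extract_audio are hypothetical helper names. The code deliberately stops short of sending anything, since that requires credentials.

```python
import base64
import json

# Endpoint as I understand the v1 REST API; verify against the current docs.
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request(text, language_code="en-US", voice_name="en-US-Wavenet-D"):
    """Build the JSON body for a synthesis request (helper name is hypothetical)."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def extract_audio(response_body):
    """Decode the base64 'audioContent' field of a JSON response into raw bytes."""
    return base64.b64decode(json.loads(response_body)["audioContent"])


# Print the body that would be POSTed to SYNTHESIZE_URL.
print(json.dumps(build_request("Hello, world"), indent=2))
```

In a real application you would POST this body with an authenticated HTTP client, or use the provider's official client library, and write the decoded bytes to an audio file.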

Figure 1. Chart comparing WaveNet with other synthetic voices and with human speech. Source: https://cloud.google.com/text-to-speech/docs/wavenet

Natural interaction between humans and computers has long been elusive. These newer ways of text-to-speech synthesis have brought us much closer to that Star Trek age, where we can communicate verbally and get real-time feedback from computer systems in a natural, human-like voice, as depicted in the image above.

Transforming between text and voice can add powerful functionality to your applications.

The most obvious benefit of implementing text-to-speech and speech-to-text options is accessibility. A visually impaired or dyslexic user would benefit from a narrated version of an article, while a deaf person could become a member of your podcasting audience by reading a transcript of the show.

Now that we understand where text-to-speech has been and where it is now, let’s look at some practical, out-of-the-box use cases and places where we can apply text-to-speech synthesis.

Some of these may spark ideas on how you can start integrating this technology into your own applications, and help you truly appreciate the possibilities that text-to-speech integration opens up.