Survey of Text-to-Speech System


The goal of Text-to-Speech (TTS) synthesis is the automatic conversion of written text into the corresponding speech. The speech synthesis field has witnessed much advancement over the past few decades. A general TTS synthesizer comprises a text analysis module and a digital signal processing (DSP) module.

Figure 1 shows the functional diagram of a very general TTS synthesizer. The text analysis module produces a phonetic transcription of the input text, together with the desired intonation and rhythm (i.e., prosody). The DSP module produces the synthetic speech corresponding to the transcription produced by the text analysis module.

General TTS system

Text Analysis Module

Figure 2 shows the skeleton of a general text analysis module. The text analysis stage is difficult because it must derive, from plain text alone, all of the information the DSP module needs to produce speech, and plain text does not explicitly contain all of that information. The first block of the text analysis module is the Text-to-Phonetics (T2P) block, which converts the input text into a phonetic transcription; the second block is the Text-to-Prosody block, which produces prosodic information.

Text analysis module of a general Text-to-Speech (TTS) system

Text-to-Phonetics: The Text-to-Phonetics block can be further broken down into a text normalization module and a word pronunciation module. A brief description of these two modules follows.

Text Normalization: The text normalization module organizes the input text into manageable lists of words. It identifies numbers, abbreviations, acronyms, and idiomatic expressions and expands them into full text. This is commonly done using regular grammars.
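As an illustration, the following is a minimal normalization sketch in Python using regular expressions; the abbreviation table and the digit-by-digit number expansion are simplified assumptions rather than a complete rule set.

```python
import re

# Toy abbreviation table (illustrative only; real systems use much larger lists).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def expand_number(match: re.Match) -> str:
    # Spell out each digit; a real normalizer would handle cardinals, ordinals, dates, etc.
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)      # expand known abbreviations
        token = re.sub(r"\d+", expand_number, token)  # expand digit strings
        words.append(token.strip(",;:"))
    return " ".join(words)

print(normalize("Dr. Smith lives at 221 Baker St."))
# -> "doctor smith lives at two two one baker street"
```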

Word Pronunciation: Once the sequence of words has been generated by the text normalization module, their pronunciation can be determined. Simple Letter-to-Sound (LTS) rules may be applied where words are pronounced as they are written. Where this is not the case, a morpho-syntactic analyzer may be used. A morpho-syntactic analyzer tags the text with various identities, such as prefixes, roots, and suffixes, and organizes the sentences into syntactically related groups of words, such as nouns, verbs, and adjectives. The pronunciation of these can then be determined using a lexicon.
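A minimal sketch of pronunciation by lexicon lookup with a letter-to-sound fallback might look as follows; the tiny lexicon and the one-phoneme-per-letter fallback rules are illustrative assumptions only.

```python
# Toy pronunciation lexicon (word -> phoneme sequence); real lexicons hold 100k+ entries.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "the": ["DH", "AH"],
}

# Naive letter-to-sound fallback: one phoneme per letter (illustrative only).
LTS_RULES = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "g": "G",
             "o": "AO", "s": "S", "t": "T"}

def pronounce(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:          # an exact lexicon match wins
        return LEXICON[word]
    # Otherwise fall back to letter-to-sound rules, skipping unknown letters.
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]

print(pronounce("the"))   # ['DH', 'AH']
print(pronounce("cats"))  # ['K', 'AE', 'T', 'S']
```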

Text-to-Prosody

The term prosody refers to certain properties of the speech signal, such as audible changes in pitch (i.e., intonation), loudness, tempo, duration, stress, and rhythm. The naturalness of speech can be described mainly in terms of prosody. Prosodic events are also referred to as suprasegmental phenomena, as they appear to be time-aligned with syllables or groups of syllables rather than with individual segments (sounds, phonemes).

The pattern of prosody is used to communicate the meaning of sentences. The Text-to-Prosody block produces prosodic information using the text and the output of the word pronunciation module. This block can be further broken down into smaller processes that determine accenting, phrasing, duration, and intonation for each sentence. A brief description of these four processes follows:

Accenting: Accent or stress assignment is based on the category of the word. For example, content words (such as nouns, adjectives, and verbs) tend to be accented, while function words (such as prepositions and auxiliary verbs) are usually not accented. This information is used for predicting intonation and duration.

Phrasing: Sentences are broken down into phrasal units and phrase boundaries are assigned to the text. These boundaries indicate pauses and the resetting of intonation contours.
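As a toy illustration of both accenting and phrasing, the sketch below marks accents from a small set of content-word part-of-speech tags and inserts phrase breaks after punctuation; the tag set and the break rule are simplified assumptions.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}   # content words tend to be accented

def assign_prosody_marks(tagged_words):
    """tagged_words: list of (word, pos) pairs; returns (word, accented, break_after)."""
    marks = []
    for word, pos in tagged_words:
        accented = pos in CONTENT_POS
        # Very crude phrasing rule: place a phrase break after punctuation.
        break_after = word in {",", ".", "?", "!"}
        marks.append((word, accented, break_after))
    return marks

sentence = [("open", "VERB"), ("the", "DET"), ("door", "NOUN"), (",", "PUNCT"),
            ("please", "ADV"), (".", "PUNCT")]
for word, acc, brk in assign_prosody_marks(sentence):
    print(f"{word:>6}  accent={acc}  break={brk}")
```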

Intonation: Intonation clarifies the type and meaning of the sentence (neutral, imperative, or question). In addition, intonation conveys information about the speaker’s characteristics (such as gender and age) and emotions. The intonation module generates a pitch contour for each sentence. For example, the sentences “Open the door.” and “Open the door?” have very different prosody. In terms of the intonation contour (defined as the rise and fall of pitch throughout the utterance), the first sentence is declarative and has a relatively flat pitch contour, whereas the second is a question and exhibits a rise in pitch at the end of the phrase.
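The following is a minimal sketch of how a rule-based intonation module might generate a per-syllable pitch (F0) contour, with gentle declination for declaratives and a final rise for questions; the base F0 and slope values are arbitrary illustrative choices.

```python
def pitch_contour(n_syllables: int, is_question: bool, base_f0: float = 120.0):
    """Return one F0 value (Hz) per syllable for a toy declination/rise model."""
    contour = []
    for i in range(n_syllables):
        f0 = base_f0 - 2.0 * i                         # gentle declination across the utterance
        if is_question and i >= n_syllables - 2:
            f0 += 30.0 * (i - (n_syllables - 3))       # rise on the last two syllables
        contour.append(round(f0, 1))
    return contour

print(pitch_contour(4, is_question=False))  # [120.0, 118.0, 116.0, 114.0] (flat/falling)
print(pitch_contour(4, is_question=True))   # rises at the end of the phrase
```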

Duration: Segmental duration is an essential aspect of prosody. It shapes the overall rhythm of the speech and reflects stress and emphasis, the syntactic structure of the sentence, and the speaking rate. Many factors contribute to the duration of a speech segment, such as the identity of the phone itself, the identity and characteristics of neighboring phones, the accent status of the syllable containing the phone, its position within the phrase, and the speaking rate and dialect of the speaker.
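Rule-based duration models (in the spirit of Klatt’s duration rules) typically start from an intrinsic phone duration and apply multiplicative scaling factors; the sketch below uses made-up factor values for illustration.

```python
# Intrinsic (average) durations in milliseconds for a few phones (illustrative values).
INTRINSIC_MS = {"AA": 90.0, "S": 70.0, "T": 50.0}

def phone_duration(phone: str, accented: bool, phrase_final: bool,
                   speaking_rate: float = 1.0) -> float:
    """Scale an intrinsic duration by simple multiplicative prosodic factors."""
    dur = INTRINSIC_MS.get(phone, 80.0)
    if accented:
        dur *= 1.2        # accented syllables are lengthened
    if phrase_final:
        dur *= 1.4        # phrase-final (pre-pausal) lengthening
    return dur / speaking_rate

print(phone_duration("AA", accented=True, phrase_final=True))                    # 151.2 ms
print(phone_duration("AA", accented=False, phrase_final=False, speaking_rate=1.2))
```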

Digital Signal Processing (DSP) Module

The DSP module uses the phonetic transcription and prosodic information produced by the text analysis module to produce speech. This can be done in two ways, viz.,

By using a series of rules that formally describe the influence of one phoneme on another (i.e., the coarticulation effect).

By storing numerous instances of each speech sound unit and using them as they are, as ultimate acoustic units.

Based on the above two ways, two main classes of TTS systems have emerged, namely, synthesis-by-rule and synthesis-by-concatenation. Figure 3 shows a general DSP module.

Digital Signal Processing (DSP) module

Types of Synthesis Techniques

Earlier, speech synthesis techniques were classified into articulatory, formant, and concatenative speech synthesis, with concatenative synthesis being the most popular. With advances in statistical modeling of the speech production mechanism, statistical approaches emerged, first based on hidden Markov models (HMMs) and later on deep neural networks (DNNs). Next, we give a brief description of these techniques.

Articulatory Synthesis

Articulatory synthesis attempts to model the human speech organs (i.e., the articulators and vocal folds) as faithfully as possible. The articulatory control parameters include lip aperture, lip protrusion, tongue tip position, tongue tip height, tongue position, and tongue height. It has the advantage of accurately modeling transients caused by abrupt area changes as the vocal tract moves continuously. Therefore, ideally, it should be the most adequate method to produce high-quality synthetic speech. On the other hand, it is also one of the most difficult methods to implement.

The first articulatory model was based on a table of vocal tract area functions from the larynx to the lips for each phonetic segment (Klatt 1987). The articulators are usually modeled with a set of area functions between the glottis and the mouth. For rule-based synthesis, the articulatory control parameters may include, for example, lip aperture, lip protrusion, tongue tip height, tongue tip position, tongue height, tongue position, and velic aperture. Phonatory or excitation parameters may include glottal aperture, vocal fold tension, and lung pressure.

During the process of speech generation, the vocal tract muscles cause the articulators to move; hence, the shape of the vocal tract changes, which results in the production of different sounds. The data for the articulatory model, describing the movement of the vocal tract during speaking, is generally derived from X-ray analysis of natural speech. However, this data is usually only 2-D, whereas the real vocal tract is 3-D, so rule-based articulatory synthesis is very difficult to optimize due to the lack of sufficient data on the motion of the articulators during speech production. Another disadvantage of articulatory synthesis is that X-ray data do not describe the masses or degrees of freedom of the articulators. In addition, the movements of the tongue are so complicated that it is almost impossible to model them precisely. Articulatory synthesis is rarely used in present systems; however, since analysis methods are developing fast and computational resources are increasing rapidly, it might be a potential synthesis method in the future.

Sinewave Synthesis

Sinewave synthesis is a method for synthesizing speech by replacing the formants (the resonances of the vocal tract) with pure tone whistles. It is based on the assumption that the speech signal can be represented as a sum of sine waves with time-varying frequencies and amplitudes. The speech signal s(n) can therefore be modeled as the sum of N sinusoids:

s(n) = \sum_{i=1}^{N} A_i \cos(\omega_i n + \phi_i),

where A_i, \omega_i, and \phi_i represent the amplitude, frequency, and phase of the i-th sinusoidal component. These parameters are estimated from the discrete Fourier transform (DFT) of windowed signal frames by picking the peaks of the spectral magnitude in each frame. The basic model is also known as the McAulay/Quatieri sinusoidal model, and it has modifications such as ABS/OLA (Analysis-by-Synthesis/Overlap-Add) and Hybrid/Sinusoidal-Noise models. However, such a model represents periodic signals, such as vowels and voiced consonants, well; it does not work well for unvoiced speech.
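A minimal synthesis-side sketch of this model, summing a few sinusoids with hand-picked parameters instead of parameters estimated from DFT peaks, could look like this:

```python
import numpy as np

def sinewave_synthesize(components, n_samples, sample_rate=16000):
    """Sum sinusoids given as (amplitude, frequency_hz, phase) tuples."""
    n = np.arange(n_samples)
    signal = np.zeros(n_samples)
    for amplitude, freq_hz, phase in components:
        omega = 2.0 * np.pi * freq_hz / sample_rate   # normalized angular frequency
        signal += amplitude * np.cos(omega * n + phase)
    return signal

# Three "formant-like" tones standing in for DFT-peak estimates (illustrative values).
speech_like = sinewave_synthesize(
    [(0.6, 500.0, 0.0), (0.3, 1500.0, 0.5), (0.1, 2500.0, 1.0)],
    n_samples=16000,
)
print(speech_like.shape)  # (16000,) -> one second of audio at 16 kHz
```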

Formant Synthesis

Formant synthesis is based on the source-filter model of speech. It rests on the observation that the human vocal tract has several resonances, and these resonances give speech sounds their various characteristics. The vocal tract can be modeled as a cascade of several resonators for the production of most periodic sounds. However, for nasalized sounds, the nasal cavity comes into play and the tract is better modeled as a parallel resonant structure. Therefore, there are two basic structures, cascade and parallel, but for better performance some combination of the two is usually used. Formant synthesis can also produce an unlimited variety of sounds, which makes it more flexible than, for example, concatenation methods.

The lower formants are known to carry information about the identity of the sound, while the higher formants carry speaker information. At least three formants are generally required to produce intelligible speech, and up to five formants to produce high-quality speech. Each formant is usually modeled by a second-order digital band-pass resonator, in which a complex-conjugate pole pair corresponds to a formant. The pole angle corresponds to the formant’s angular frequency, and the distance of the pole from the unit circle (in the z-plane) determines its -3 dB bandwidth.
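As a sketch, a single second-order resonator can be realized with the standard two-pole difference equation below; the formant frequencies and bandwidths used in the cascade are illustrative values.

```python
import math

def resonator_coefficients(formant_hz, bandwidth_hz, sample_rate=10000):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (two-pole resonator)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * formant_hz * t)
    a = 1.0 - b - c                      # normalizes the gain at DC to unity
    return a, b, c

def resonate(x, coeffs):
    """Filter a sample sequence through one resonator."""
    a, b, c = coeffs
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = a * sample + b * y1 + c * y2
        y.append(out)
        y1, y2 = out, y1
    return y

# Cascade three formant resonators over an impulse (a stand-in for a glottal pulse).
signal = [1.0] + [0.0] * 99
for f, bw in [(500, 60), (1500, 90), (2500, 120)]:   # illustrative formants (Hz)
    signal = resonate(signal, resonator_coefficients(f, bw))
print(signal[:3])
```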

Concatenation Synthesis

Concatenative speech synthesis is one of the widely used approaches. It is based on concatenating pre-recorded speech to produce the desired utterances. In the concatenative approach, there is no need to determine speech production rules; hence, concatenative synthesis is simpler than rule-based synthesis. It generates speech by connecting natural, pre-recorded speech sound units. These sound units can be words, syllables, half-syllables, phonemes, half-phones, diphones, or triphones. Joining natural speech utterances helps to achieve very natural-sounding speech. However, the joins between speech sound units need to be smooth to avoid abrupt changes or glitches in the synthetic speech signal.

The main challenges in this approach are prosodic modification of the speech units and resolving discontinuities at unit boundaries or joins. Prosodic modification introduces artifacts that make the speech sound unnatural. Unit-selection speech synthesis, a kind of concatenative synthesis, addresses this problem by storing numerous instances of each unit with varying prosodies; the unit that best matches the target prosody is selected and concatenated. Depending on the unit chosen, a large speech corpus needs to be labeled, which is very tedious and time-consuming. These TTS systems sound more natural; however, their memory requirements are very large, so they may be difficult to port to hand-held devices.
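A minimal sketch of the unit-selection idea is shown below; it uses a greedy search instead of the usual Viterbi/dynamic-programming search, and the target and join cost functions are made-up simplifications.

```python
def target_cost(unit, target):
    """Distance between a candidate unit's prosody and the target prosody."""
    return abs(unit["f0"] - target["f0"]) + abs(unit["dur"] - target["dur"])

def join_cost(prev_unit, unit):
    """Penalize prosodic discontinuity at the join (here: F0 jump only)."""
    return 0.0 if prev_unit is None else abs(prev_unit["f0"] - unit["f0"])

def select_units(candidates_per_target, targets):
    """Greedy selection: per target, pick the candidate minimizing target + join cost."""
    selected, prev = [], None
    for candidates, target in zip(candidates_per_target, targets):
        best = min(candidates, key=lambda u: target_cost(u, target) + join_cost(prev, u))
        selected.append(best)
        prev = best
    return selected

# Two candidate instances per target unit (illustrative numbers).
candidates = [
    [{"id": "a1", "f0": 110, "dur": 80}, {"id": "a2", "f0": 130, "dur": 95}],
    [{"id": "b1", "f0": 112, "dur": 70}, {"id": "b2", "f0": 150, "dur": 90}],
]
targets = [{"f0": 120, "dur": 85}, {"f0": 118, "dur": 75}]
print([u["id"] for u in select_units(candidates, targets)])  # ['a1', 'b1']
```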

HMM-based Speech Synthesis System

In this method, instead of storing the whole speech database, statistically meaningful model parameters are stored and used to synthesize the speech waveform. Building a unit-selection system for a single speaker requires roughly 8–10 hours of recorded speech, and the syllable or phoneme coverage in such a database is very uneven. Moreover, in a concatenative approach the system can only recreate units from what has been recorded; we are effectively memorizing the speech data, whereas in the statistical approach we are attempting to learn its general properties. The Hidden Markov Model (HMM)-based speech synthesis system (HTS) comes under the category of statistical parametric speech synthesis (SPS) methods.

The basic idea is to generate an average of similar-sounding speech segments. Spectrum and excitation parameters are first extracted from the speech database. Mel-frequency cepstral coefficients (MFCCs) and their dynamic features are generally taken as spectrum (i.e., vocal tract system) parameters, and log(F0) and its dynamic features are taken as excitation (i.e., speech source) parameters. These features are then modeled by context-dependent HMMs, in which spectrum, excitation, and duration are modeled in a unified framework. At synthesis time, the sentence to be synthesized is first converted into a context-dependent phoneme sequence. An utterance HMM is then constructed by concatenating the corresponding context-dependent HMMs, and the state durations are determined. Next, the speech parameter generation algorithm generates the spectrum and excitation parameters, and finally the speech waveform is generated using the Mel log spectrum approximation (MLSA) filter.
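The following is a heavily simplified sketch of the synthesis-time parameter generation step: each HMM state contributes its mean parameter vector for its predicted number of frames. A real system would include the dynamic (delta) features and the maximum-likelihood parameter generation algorithm to smooth the trajectory, and would then drive an MLSA filter; both are omitted here.

```python
import numpy as np

def generate_trajectory(state_means, state_durations):
    """state_means: list of mean parameter vectors (one per HMM state);
    state_durations: predicted number of frames per state.
    Returns a (total_frames, dim) piecewise-constant parameter trajectory."""
    frames = []
    for mean, n_frames in zip(state_means, state_durations):
        frames.extend([np.asarray(mean)] * n_frames)
    return np.stack(frames)

# Three states of a toy context-dependent model, with 2-dimensional parameters.
means = [[1.0, 0.2], [0.8, 0.5], [0.3, 0.1]]
durations = [3, 5, 2]            # frames per state, e.g. from the duration model
trajectory = generate_trajectory(means, durations)
print(trajectory.shape)          # (10, 2)
```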

Evaluation of TTS

TTS systems need to be properly evaluated so that the gap between natural and synthetic speech can be identified and addressed by developing proper methods for each modeling block of the TTS system. TTS evaluation includes assessing the effectiveness of the individual building blocks as well as the final outcome: the synthetic speech.

Evaluation of the TTS blocks covers text normalization, pitch and duration modeling, waveform generation through a vocoder, the speech parameter generation algorithm, and many other components of the system. The quality of synthesized speech is evaluated primarily using subjective listening tests (for example, the Mean Opinion Score (MOS)). Objective tests are also conducted to further support the subjective tests, and they may capture attributes of speech quality that human listeners miss. Detailed descriptions of the evaluation of TTS voice quality are given in the lecture module “Evaluation of TTS Voice”.
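As a small illustration, a Mean Opinion Score is simply the average of listeners’ ratings (typically on a 1-5 scale), often reported with a 95% confidence interval; the ratings below are made-up numbers.

```python
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence interval half-width) for a list of 1-5 ratings."""
    mos = statistics.mean(ratings)
    # Normal approximation; adequate for the typical 20+ ratings per system.
    ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, ci95

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 3]   # hypothetical listener scores for one system
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```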

Summary

In this module, we presented a general overview of the text-to-speech synthesis system. The initial process of data collection and labeling will be discussed in the next modules, along with the techniques involved in building different types of TTS, especially concatenative and statistical parametric synthesis. The final stage involves evaluating the quality of the TTS voice; the subjective and objective measures used for evaluating TTS system quality are discussed in the concluding section.