Source: Deep Learning on Medium
End to End Speech Translation: The Promise of Breaking Down Language Barriers
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation
This research summary is one of many distributed weekly in the AI Scholar newsletter.
Getting technology to work for different languages is crucial if we are going to eliminate language barriers. Thanks to the AI community including researchers, engineers, and practitioners, there has been a lot of active research work to this end.
In the past, automatic speech translation (AST) has been achieved with cascade models that first transcribe speech with automatic speech recognition (ASR) and then translate the transcript with machine translation (MT). These models have achieved good results and power several commercial speech-to-speech translation products, such as Google Translate. Recently, Google AI released Translatotron, an attention-based sequence-to-sequence neural network that translates speech in one language directly into speech in another, without relying on an intermediate text representation.
End-to-end models for AST have been shown to perform better than or on par with cascade models when both are trained only on speech translation parallel corpora. However, when additional data are used to train its ASR and MT subsystems, the cascade outperforms the vanilla end-to-end approach.
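The structural difference between the two approaches can be sketched in a few lines. The functions below are toy stubs (tiny lookup tables standing in for trained neural models, with made-up names and data); the point is only that the cascade composes two separate models through an intermediate transcript, while the end-to-end model maps audio to target text directly.

```python
def asr(audio: str) -> str:
    """Stub ASR model: source-language audio -> source-language transcript."""
    transcripts = {"<fr-audio>": "bonjour le monde"}  # toy data, not real audio
    return transcripts[audio]

def mt(text: str) -> str:
    """Stub MT model: source-language text -> target-language text."""
    translations = {"bonjour le monde": "hello world"}
    return translations[text]

def cascade_ast(audio: str) -> str:
    """Cascade AST: transcribe first, then translate the transcript.
    Any error the ASR stage makes propagates into the MT stage."""
    return mt(asr(audio))

def end_to_end_ast(audio: str) -> str:
    """End-to-end AST: one model maps audio directly to target-language
    text, with no intermediate transcript to train or maintain."""
    direct = {"<fr-audio>": "hello world"}
    return direct[audio]

print(cascade_ast("<fr-audio>"))     # hello world
print(end_to_end_ast("<fr-audio>"))  # hello world
```

A practical consequence of this structure: the cascade's two stages can each be trained on abundant ASR and MT corpora, whereas the vanilla end-to-end model can only learn from scarcer paired speech-translation data.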
Towards Exploiting Indirect Training Data for End-to-End Automatic Speech Translation
In this paper, researchers from Facebook and Johns Hopkins University explore several techniques that leverage automatic speech recognition (ASR) and machine translation (MT) data to aid end-to-end systems by means of data augmentation. They demonstrate that cascade models are very competitive when not constrained to train only on AST data.
The researchers study several techniques aimed at bridging the gap between end-to-end and cascade models. With data augmentation, pretraining, fine-tuning, and architecture selection, they trained end-to-end models that are competitive with the cascade approach. Their techniques reduced the gap to strong cascade models from 8.2 to 1.4 BLEU on the En–Fr Librispeech AST data and from 6.7 to 3.7 BLEU on the En–Ro MuST-C corpus.
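One way such augmentation can work is to machine-translate the transcripts of an ASR corpus, turning (audio, transcript) pairs into synthetic (audio, translation) pairs that an end-to-end model can train on. The sketch below is a minimal illustration of that idea with a toy MT stub; the function names and data are hypothetical, not the paper's actual pipeline.

```python
def mt_stub(text: str) -> str:
    """Stands in for a trained En->Fr MT model (toy lookup table)."""
    table = {"good morning": "bonjour", "thank you": "merci"}
    return table[text]

def augment_asr_to_ast(asr_corpus):
    """Turn ASR training pairs (audio, source transcript) into synthetic
    AST training pairs (audio, target translation) by machine-translating
    each transcript."""
    return [(audio, mt_stub(transcript)) for audio, transcript in asr_corpus]

# A tiny illustrative ASR corpus: audio file paths with their transcripts.
asr_data = [("utt1.wav", "good morning"), ("utt2.wav", "thank you")]
synthetic_ast = augment_asr_to_ast(asr_data)
print(synthetic_ast)  # [('utt1.wav', 'bonjour'), ('utt2.wav', 'merci')]
```

The synthetic pairs can then be mixed with the genuine AST corpus, letting the end-to-end model benefit from the much larger ASR and MT resources that would otherwise only help the cascade.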
Potential Uses and Effects
In an increasingly digital world, effective speech translation has more applications than ever before. It is no wonder researchers and developers are working toward robust speech technology that can translate spoken language quickly and accurately. Better speech translation has significant potential to overcome today's global language barriers.
Doing so, however, requires sufficient amounts of high-quality data.
Through this work, the researchers evaluated several data augmentation and pretraining approaches for AST, comparing all of them on the same datasets. What's more, their work provides recommendations on how to harness such data and aims to advance the state of the art in speech translation, which can help improve the efficiency and productivity of business systems.