End to End Speech Translation: The Promise of Breaking Down Language Barriers


Harnessing Indirect Training Data for End-to-End Automatic Speech Translation

This research summary is just one of many distributed weekly in the AI Scholar newsletter. To start receiving the weekly newsletter, sign up here.

Getting technology to work across different languages is crucial if we are to eliminate language barriers. Thanks to researchers, engineers, and practitioners across the AI community, there has been a great deal of active research to this end.

Traditionally, automatic speech translation (AST) has been achieved with cascaded models that first transcribe speech using automatic speech recognition (ASR) and then translate the transcript using machine translation (MT). These cascades have achieved good results and power several commercial speech-to-speech translation products, such as Google Translate. Recently, Google AI released Translatotron, an attention-based sequence-to-sequence neural network that directly translates speech in one language into speech in another, without relying on an intermediate text representation.
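To make the contrast concrete, here is a minimal sketch of the two approaches in Python. The model objects and method names (asr_model.transcribe, mt_model.translate, ast_model.translate) are hypothetical placeholders for illustration, not the actual APIs of any system mentioned above:

```python
# Minimal sketch contrasting cascaded and end-to-end AST.
# All model objects and methods are hypothetical placeholders.

def cascade_translate(audio, asr_model, mt_model):
    """Cascade: transcribe first, then translate the transcript."""
    transcript = asr_model.transcribe(audio)       # speech -> source text
    translation = mt_model.translate(transcript)   # source text -> target text
    return translation

def end_to_end_translate(audio, ast_model):
    """End-to-end: one model maps speech directly to target-language
    text, with no intermediate source-language transcript."""
    return ast_model.translate(audio)
```

One practical consequence of the cascade design is that transcription errors propagate into the translation step; an end-to-end model avoids that failure mode but, as discussed below, has far less data to train on.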

End-to-end models for AST have been shown to perform better than or on par with cascade models when both are trained only on speech translation parallel corpora. However, when additional data are used to train the ASR and MT subsystems, the cascade outperforms the vanilla end-to-end approach.

Towards Exploiting Indirect Training Data for End-to-End Automatic Speech Translation

In this paper, researchers from Facebook and Johns Hopkins University explore several techniques that leverage automatic speech recognition (ASR) and machine translation (MT) data to aid end-to-end systems by means of data augmentation. They also demonstrate that cascaded models are very competitive when not constrained to train only on AST data.

AST datasets typically include three parts: recorded speech, transcripts, and translations. While cascaded models can leverage all three, their advantage over end-to-end systems is that they can also leverage data that provide only adjacent pairs (speech paired with transcripts, or transcripts paired with translations), which are far more prevalent. The approaches the researchers investigate involve completing the triplets, so end-to-end systems can also benefit from incomplete triplets (from the research paper).
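A rough sketch of what completing the triplets could look like in practice, assuming access to a pretrained MT model (to pseudo-translate ASR transcripts) and a TTS model (to synthesize speech for MT source text). The helper names below are hypothetical:

```python
# Hypothetical sketch of triplet completion for AST data augmentation.
# `mt_model` and `tts_model` stand in for any pretrained MT / TTS system.

def complete_asr_pair(speech, transcript, mt_model):
    """An ASR pair (speech, transcript) lacks a translation:
    generate a pseudo-translation of the transcript with MT."""
    pseudo_translation = mt_model.translate(transcript)
    return (speech, transcript, pseudo_translation)

def complete_mt_pair(transcript, translation, tts_model):
    """An MT pair (transcript, translation) lacks audio:
    synthesize speech for the source-side transcript with TTS."""
    synthetic_speech = tts_model.synthesize(transcript)
    return (synthetic_speech, transcript, translation)
```

Either completed triplet can then be mixed into the end-to-end training set alongside genuine AST triplets.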

The researchers study several techniques aimed at bridging the gap between end-to-end and cascade models. With data augmentation, pretraining, fine-tuning, and careful architecture selection, they were able to train end-to-end models whose performance is competitive with the cascade approach. Their techniques reduced the gap between end-to-end and strong cascade models from 8.2 to 1.4 BLEU on the En-Fr Librispeech AST dataset and from 6.7 to 3.7 BLEU on the En-Ro MuST-C corpus.
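One way to read the pretraining and fine-tuning recipe is as a two-stage schedule: first train the speech encoder on the plentiful ASR data, then fine-tune the full sequence-to-sequence model on direct speech-to-translation data. A schematic sketch, where `model` and `train` are placeholders for whatever network and training loop a framework provides:

```python
def pretrain_then_finetune(model, asr_data, ast_data, synthetic_triplets, train):
    """Hypothetical two-stage recipe: pretrain on ASR, fine-tune on AST.

    `model` is a speech-encoder/text-decoder seq2seq network and `train`
    is a generic training loop; both are placeholders, not a real API.
    """
    # Stage 1: learn speech representations from the larger ASR corpus,
    # predicting source-language transcripts.
    train(model, asr_data, target="transcript")

    # Stage 2: keep the pretrained encoder and train the whole model to
    # map speech directly to target-language text, mixing genuine AST
    # triplets with the synthetic ones produced by data augmentation.
    train(model, ast_data + synthetic_triplets, target="translation")
    return model
```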

Potential Uses and Effects

In an increasingly digital world, effective speech translation has more applications than ever before. It is no wonder researchers and developers are increasingly working toward robust speech technology that can translate spoken language faster and more reliably. Better speech translation has significant potential to eliminate today's global language barriers.

But doing so requires enough high-quality data.

Through this work, the researchers were able to evaluate several data augmentation and pretraining approaches for AST by comparing them all on the same datasets. What's more, their work provides recommendations on how to harness this kind of indirect data and, overall, aims at advancing the state of the art in speech translation, which can help improve the efficiency and productivity of business systems.

Read more: Harnessing Indirect Training Data for End-to-End AST: Tricks of the Trade

Thanks for reading! Comment, share, and let's connect on Twitter, LinkedIn, and Facebook. Stay updated with the latest AI research developments, news, resources, tools, and more by subscribing to our free weekly AI Scholar newsletter. Subscribe here. Remember to 👏 if you enjoyed this article. Cheers!