Getting Started with End-to-End Speech Translation

With PyTorch, you can translate English speech in only a few steps


Speech-to-text translation is the task of translating speech in a source language into text written in a different, target language. The task has a long history, dating back to a demo given in 1983. The classic approach consists of training a cascade of systems, including automatic speech recognition (ASR) and machine translation (MT). You can see it at work in your Google Translate app, where your speech is first transcribed and then translated (although the translation appears to happen in real time).

Both ASR and MT have been studied for a long time, and the quality of the systems has made significant leaps with the adoption of deep learning techniques. Indeed, the availability of big data (at least for some languages), large computing power and clear evaluation made these two tasks perfect targets for big companies like Google, which have invested heavily in research. For reference, see the papers about the Transformer [1] and SpecAugment [2]. As this blog post is not about cascaded systems, I refer the interested reader to the system that won the last IWSLT competition [3].

IWSLT is the main yearly workshop devoted to spoken language translation. Every edition hosts a “shared task”, a kind of competition, with the goal of recording the progress in spoken language technologies. Since 2018, the shared task has included a separate evaluation for “end-to-end” systems, that is, systems consisting of a single model that learns to translate directly from audio to text in the target language, without intermediate steps. Our group has participated in this new evaluation since its first edition, and I reported on our first participation in a previous story.

The quality of end-to-end models compared with the cascaded approach is still debated, but it is a growing research topic and quality improvements are reported quite frequently. The goal of this tutorial is to lower the barrier to entry to this field by providing the reader with a step-by-step guide to training an end-to-end system. In particular, we will focus on a system that translates English speech into Italian, but it can easily be extended to seven additional languages: Dutch, French, German, Spanish, Portuguese, Romanian and Russian.

What you need

The minimum requirement is access to at least one GPU, which you can get for free with Colab, and an installation of PyTorch 0.4.

However, the K80 GPUs offered there are quite slow and will require several days of training. Access to better or more GPUs will be of great help.

Getting data

We will use MuST-C, the largest multilingual corpus available for the direct speech translation task. You can find a detailed description in the paper that introduced it [4] or in the following Medium story:

To get the corpus, go to the MuST-C website, click on the button “Click here to download the corpus”, then fill in the form and you will soon be able to download it.

MuST-C is divided into 8 portions, one for each target language. Feel free to download one or all of them, but for this tutorial we will use the Italian target (it) as an example. Each portion contains TED talks given in English and translated into the target language (the translations are provided by the TED website). The size of the training set depends on the availability of translations for the given language, while the validation and test sets are extracted from a common pool of talks.

Each portion of MuST-C is divided into train, dev, tst-COMMON and tst-HE. Train, dev and tst-COMMON represent our split into training, validation and test sets, while you can safely ignore tst-HE. In each of the three directories you will find three sub-directories: wav/, txt/ and h5/. wav/ contains the audio side of the set in the form of .wav files, one for each talk. txt/ contains the transcripts and translations: for our example with Italian you will find, under the train/txt directory, the files train.it, train.en and train.yaml. The first two are, respectively, the textual translation and the transcript. train.yaml contains the audio segmentation, aligned with the textual files. As a bonus, the .en and .it files are parallel and, as such, can be used to train MT systems. If you don’t know what to do with the segmentation provided by the yaml file, don’t be afraid! In the h5/ directory there is a single .h5 file that contains the audio already segmented and transformed into 40 Mel filterbank features.
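
If you do want to work from the raw .wav files instead, the segmentation can be applied with a few lines of code. The following is only a minimal sketch: it assumes that each entry of train.yaml provides a wav file name together with an offset and a duration in seconds, and that the audio is sampled at 16 kHz. The field names, the sample rate and the example entries are assumptions for illustration and should be checked against your copy of the corpus.

```python
def segment_to_samples(offset_s, duration_s, sample_rate=16000):
    """Convert a segment given in seconds into a [start, end) range of sample indices."""
    start = int(round(offset_s * sample_rate))
    end = start + int(round(duration_s * sample_rate))
    return start, end


# Hypothetical entries in the shape assumed above (not copied from the corpus).
segments = [
    {"wav": "talk_1.wav", "offset": 0.5, "duration": 2.0},
    {"wav": "talk_1.wav", "offset": 3.25, "duration": 1.5},
]

# One sample range per segment, in the same order as the entries.
ranges = [segment_to_samples(s["offset"], s["duration"]) for s in segments]
for seg, (start, end) in zip(segments, ranges):
    print(seg["wav"], start, end)
```

Because the segmentation is aligned with the textual files, the i-th range computed this way would correspond to the i-th line of train.en and train.it.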

NOTE: The dataset is downloaded from Google Drive. If you want to download it from a machine with no GUI, you can try the tool gdown. However, it does not always work correctly. If you are unable to download with gdown, please try again after a few hours.

Getting the software

We will use FBK-Fairseq-ST, which is Facebook's fairseq tool for MT adapted to the direct speech translation task. Clone the repository from GitHub:

git clone

Then, also clone mosesdecoder, which contains useful scripts for text preprocessing.

git clone

Data preprocessing

The audio side of the data is already preprocessed in the .h5 file, so we only have to care about the textual side.

Let us first create a directory for the tokenized data.

> mkdir mustc-tokenized
> cd mustc-tokenized

Then, we can proceed to tokenize our Italian texts (an analogous process is needed for the other target languages):

> for file in $MUSTC/en-it/data/{train,dev,tst-COMMON}/txt/*.it; do
    # write the tokenized text to the current directory instead of overwriting the input file
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -l it < $file |
    $mosesdecoder/scripts/tokenizer/deescape-special-chars.perl > $(basename $file)
done
> mkdir tokenized
> for file in *.it; do
    cp $file tokenized/$file.char
    sh tokenized/$file
done

The second for-loop splits the words into characters, as done in our paper that sets baselines for all the MuST-C languages [5].
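
To make the character-level step concrete, here is a minimal Python sketch of one way to split tokenized text into characters. It is not the script used in the repository: the function names and the "<space>" boundary symbol are assumptions chosen for illustration, and the actual script may mark word boundaries differently.

```python
def to_char_level(line, space_symbol="<space>"):
    """Split a tokenized line into space-separated characters.

    Word boundaries are kept as an explicit symbol so that the
    original words can be restored after translation.
    """
    words = line.split()
    return f" {space_symbol} ".join(" ".join(word) for word in words)


def from_char_level(line, space_symbol="<space>"):
    """Invert to_char_level: rebuild the words from their characters."""
    words = line.split(f" {space_symbol} ")
    return " ".join(word.replace(" ", "") for word in words)


example = "ciao a tutti"
chars = to_char_level(example)
print(chars)
print(from_char_level(chars))
```

Keeping an explicit boundary symbol is what makes the transformation reversible, so the character-level output of the model can be turned back into words before evaluation.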

Now, we have to binarize the data to put audio and text in a single format for fairseq. First, link the h5 files in the data directory.

> cd tokenized
> for file in $MUSTC/en-it/data/{train,dev,tst-COMMON}/h5/*.h5; do
    ln -s $file
done

Then, we can move to the actual binarization:

> python $FBK-Fairseq-ST/ --trainpref train --validpref dev --testpref tst-COMMON -s h5 -t it --inputtype audio --format h5 --destdir bin

This will take a few minutes, and in the end you should get something like this:

> ls bin/
valid.h5-it.h5.bin train.h5-it.h5.bin valid.h5-it.h5.idx
test.h5-it.h5.bin train.h5-it.h5.idx

We have a dictionary for the target language and, for each split of the data, an index and a content file for the source side (*.h5.idx and *.h5.bin) and the same for the target side (*.it.idx and *.it.bin).

With this, we have finished the data preprocessing and can move on to training!

Training your model

For training, we are going to replicate the setup reported in [5]. You just need to run the following command:

> mkdir models
> CUDA_VISIBLE_DEVICES=$GPUS python $FBK-Fairseq-ST/ bin/ \
--clip-norm 20 \
--max-sentences 8 \
--max-tokens 12000 \
--save-dir models/ \
--max-epoch 50 \
--lr 5e-3 \
--dropout 0.1 \
--lr-schedule inverse_sqrt \
--warmup-updates 4000 --warmup-init-lr 3e-4 \
--optimizer adam \
--arch speechconvtransformer_big \
--distance-penalty log \
--task translation \
--audio-input \
--max-source-positions 1400 --max-target-positions 300 \
--update-freq 16 \
--skip-invalid-size-inputs-valid-test \
--sentence-avg \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1

Let me explain it step by step. bin/ is the directory containing the binarized data, as above, while models/ is the directory where the checkpoints will be saved (one at the end of each epoch). --clip-norm refers to gradient clipping, and --dropout should be clear if you are familiar with deep learning. --max-tokens is the maximum number of audio frames that can be loaded on a single GPU in each iteration, and --max-sentences is the maximum batch size, which is also limited by --max-tokens. --update-freq also affects the batch size: here we are saying that the weights are updated once every 16 iterations, which essentially emulates training with 16 times as many GPUs.

Now, the optimization policy: --optimizer adam selects the Adam optimizer, and --lr-schedule inverse_sqrt uses the schedule introduced by the Transformer paper [1]: the learning rate grows linearly over --warmup-updates steps (4000) from --warmup-init-lr (0.0003) to --lr (0.005) and then decays proportionally to the inverse square root of the number of steps. The loss to optimize is cross entropy with label smoothing (--criterion), using a --label-smoothing of 0.1. With --sentence-avg, the loss is averaged over the sentences rather than the tokens.

--arch defines the architecture to use and its hyperparameters. These can be changed when running the training, but speechconvtransformer_big uses the same hyperparameters as in our paper, except for the distance penalty, which is specified explicitly in our command.
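
To get a feeling for the learning rate schedule, the following Python sketch computes the rate at a given update. It is an illustration of the schedule as described above, not the code used by the toolkit: the function name is made up for this example, and it assumes that after warmup the rate is scaled so that it equals --lr exactly at the last warmup step.

```python
import math


def inverse_sqrt_lr(step, lr=5e-3, warmup_init_lr=3e-4, warmup_updates=4000):
    """Learning rate at a given update, with the values from the training command."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # Decay proportional to the inverse square root of the step number.
    return lr * math.sqrt(warmup_updates / step)


for step in (0, 2000, 4000, 16000, 64000):
    print(step, inverse_sqrt_lr(step))
```

With these values the rate peaks at 0.005 after 4000 updates and halves every time the number of updates quadruples, so it is 0.0025 at update 16000.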

The deep learning architecture is an adaptation of the Transformer to the speech translation task, which modifies the encoder to work with spectrograms as input. I will describe it in a future blog post.

During training, one checkpoint is saved at the end of each epoch and named after the epoch number. Additionally, two more checkpoints are updated at the end of every epoch: one is a copy of the checkpoint with the best validation loss, the other a copy of the last saved checkpoint.

Generation and evaluation

When you are ready to run a translation from audio (actually, preprocessed spectrograms), you can run the following command:

python $FBK-Fairseq-ST/ tokenized/bin/ --path models/ --audio-input \
[--gen-subset valid] [--beam 5] [--batch 32] \
[--skip-invalid-size-inputs-valid-test] [--max-source-positions N] [--max-target-positions N] > test.raw.txt

What is strictly required here is the directory with the binarized data (bin/), the path to the checkpoint with --path models/ (any of the saved checkpoints can be used), and --audio-input, which tells the software to expect audio (and not text) as input.

By design, this command will look for the “test” portion of the dataset within the given directory. If you want to translate another portion, valid or train, you can do it with --gen-subset {valid,train}. The beam size and the batch size can be modified with --beam and --batch, respectively. --skip-invalid-size-inputs-valid-test lets the software skip the segments that are longer than the limits set by --max-source-positions and --max-target-positions.
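
The effect of the two length limits can be pictured with a small Python sketch. This is only an illustration of the filtering idea, not the toolkit's code: the function name and the example lengths are hypothetical, the default limits are simply the ones from the training command, and treating a length equal to the limit as valid is an assumption.

```python
def keep_segment(num_frames, num_target_tokens,
                 max_source_positions=1400, max_target_positions=300):
    """Return True if a segment fits within both length limits."""
    return (num_frames <= max_source_positions
            and num_target_tokens <= max_target_positions)


# Hypothetical (audio frames, target tokens) pairs, for illustration only.
segments = [(900, 120), (1500, 80), (1200, 310), (1400, 300)]
kept = [seg for seg in segments if keep_segment(*seg)]
print(kept)
```

Segments dropped this way receive no translation at all, which is worth remembering when comparing scores computed with different limits.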

The output will be something like this: