Voice Translation and Audio Style Transfer with GANs

Source: Deep Learning on Medium


We have all heard about image style transfer: extracting the style from a famous painting and applying it to another image is a task that has been achieved with a number of different methods. Generative Adversarial Networks (GANs for short) are also being used on images for generation, image-to-image translation and more.

Examples of image style transfer

But what about sound? On the surface, you might think that audio is completely different from images, and that all the different techniques that have been explored for image-related tasks can’t also be applied to sounds. But what if we could find a way to convert audio signals to image-like 2-dimensional representations?

As a matter of fact, yes we can! This kind of sound representation is what we call a “spectrogram”, and it is the key that will allow us to make use of algorithms specifically designed to work with images for our audio-related task.

Spectrogram (source)


If you are new to the world of audio processing, you may be unfamiliar with what a spectrogram really is. Given a time-domain signal (1 dimension) we want to obtain a time-frequency 2-dimensional representation. To achieve that, we apply the Short-Time Fourier Transform (STFT) with a window of a certain length on the audio signal, only considering the squared magnitude of the result.

Incredible illustration of how Time and Frequency correlate from the MelNet paper page

In simpler terms, we divide our original waveform signal into chunks that overlap with one another, extract the magnitude of each frequency component in the chunk (with a Fourier Transform), and each resulting vector becomes a column of our final spectrogram. The x axis of the spectrogram stands for time, while the y axis represents frequency.
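The chunk-and-transform procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `power_spectrogram`, the Hann window, and the values of `n_fft` and `hop` are assumptions, not settings taken from the article.

```python
import numpy as np

def power_spectrogram(signal, n_fft=512, hop=128):
    """Squared-magnitude STFT: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)                      # assumed window type
    n_frames = 1 + (len(signal) - n_fft) // hop     # overlapping chunks
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)          # one Fourier Transform per chunk
    return (np.abs(spectrum) ** 2).T                # (n_fft // 2 + 1, n_frames)

# Example: one second of a 1 kHz tone sampled at 16 kHz (illustrative values).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = power_spectrogram(tone)
```

Each column of `spec` corresponds to one chunk of the waveform, so the array can be displayed directly as an image with time on the x axis and frequency on the y axis.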

To make these spectrograms even more useful for our task, we convert each “pixel” (or magnitude value) to be in the decibel scale, taking the log of each value. Finally, we convert spectrograms to the mel scale, applying a mel filter bank, resulting in what are known as “mel-spectrograms”.

Examples of mel-spectrograms with log-amplitude

This brings the spectrogram representations closer to how humans perceive sound, emphasizing the amplitudes and frequencies that we are most sensitive to.
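As a rough sketch of these two steps, the snippet below builds a triangular mel filter bank and converts a power spectrogram to log-mel values. The helper names, the HTK-style mel formula, and the choice of 80 mel bands are assumptions made for illustration. Note that this sketch applies the mel filter bank first and takes the log afterwards, which is the usual order in audio libraries.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)       # HTK-style mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # Filter edges are evenly spaced on the mel scale, then mapped back to Hz.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel(power_spec, sr, n_fft, n_mels=80, eps=1e-10):
    """Mel-scale the frequency axis, then convert power to decibels."""
    mel = mel_filter_bank(sr, n_fft, n_mels) @ power_spec
    return 10.0 * np.log10(mel + eps)               # eps avoids log(0)

# Example on a random power spectrogram (illustrative values only).
sr, n_fft = 16000, 512
power_spec = np.random.default_rng(0).random((n_fft // 2 + 1, 50))
mel_db = log_mel(power_spec, sr, n_fft)
```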

Our Task

Now that we know how to represent sounds as images, let’s have some fun.

In this article I will explain how to build and train a system capable of performing voice conversion and any other kind of audio style transfer (for example converting one music genre to another). The method is heavily inspired by recent research in image-to-image translation using Generative Adversarial Networks, the main difference being that all these techniques are applied to audio data. As a bonus feature, we will be able to translate samples of arbitrary length, which is something that we don’t see very often in GAN systems.

To get you a bit hyped up for what you are about to learn, here is a demo video of the results we can achieve with this method.

In the demo video, you can listen to different voice translation examples and also a couple of music genre conversions, specifically from Jazz to Classical music. Sounds pretty good, doesn’t it?

Choosing the Architecture

There are a number of different architectures from the computer vision world that are used for image-to-image translation, which is the task that we want to achieve with our spectrogram representations of audio (let’s say speech).

Image-to-image translation consists in converting an image from a domain A (pictures of cats for example) to a different domain B (pictures of dogs), while keeping content information from the original picture (the expression and pose of the cat). Our task is practically the same: we want to translate from speaker A to speaker B, while keeping the same linguistic information from speaker A (the generated speech should contain the same words as the original speech from speaker A).

CycleGAN architecture

The most famous GAN architecture built for this goal may be CycleGAN, introduced in 2017 and widely used since then. While CycleGAN is very successful at translating between similar domains (similar shapes and contexts), such as from horses to zebras or from apples to oranges, it falls short when trained on very diverse domains, like from fishes to birds or from apples to carrots.

The cause of this shortcoming is the fact that CycleGAN heavily relies on pixel-wise losses, or in other words, its loss tends to minimize differences in pixel values of real and generated images: intuitively, when converting an image of an object (an apple for example) to a substantially different domain (carrot) we need to change the main shape of the original object, and CycleGAN can’t help us in this case.

CycleGAN translation example. (Zebra to Horse)

Spectrograms of speeches from different people (or spectrograms of musical pieces of different genres) can be very visually different from one another: thus we need to find a more general approach to the problem, one that does not involve being constrained by translating between visually similar domains.

Spectrograms of different speakers or different music genres can be very visually different

TraVeLGAN: our Solution

Originally introduced here, the TraVeLGAN (Transformation Vector Learning GAN) aims at solving exactly our problem.

Examples of TraVeLGAN image-to-image translations with very different domains

In addition to a Generator and a Discriminator (or Critic), TraVeLGAN introduces a Siamese network (a network that encodes images into latent vectors) to allow translations between substantially different domains while keeping a content relationship between the original and converted samples.

Let’s see exactly how TraVeLGAN works.

TraVeLGAN architecture

Our goal is to find a way to keep a relationship between the original and generated samples without relying on pixel-wise losses (such as the cycle-consistency constraint in CycleGAN), which would limit translations to visually similar domains. Thus, if we encode the images (or spectrograms) into vectors that capture their content information in an organized latent space, we are able to maintain a relationship between these vectors instead of the whole images.

That’s exactly what a siamese network allows us to achieve. Originally used for the task of face recognition, the siamese network takes an image as input and outputs a single vector of length vec_len. By specifying with a loss function which image encodings should be close in the vector space (images of the same face for example) and which ones should be far apart (images of different faces), we are able to organize the latent space and make it useful for our goal.

The Siamese network encodes images into vectors

More specifically, we aim at keeping the transformation vectors between pairs of encodings equal: this may seem like a difficult concept to grasp, but it is in fact quite simple.

With G(X) as the translated image X (output of the generator), S(X) as the vector encoding of X and A1, A2 two images from the source domain A, the network must encode vectors such that:

S(A1) - S(A2) = S(G(A1)) - S(G(A2))

In this way the transformation vector that connects encodings of a pair of source images must be equal to the transformation vector between the same pair translated by the generator.

This allows us to preserve semantic information in the translation (unlike CycleGAN, which preserves more geometric content information with its cycle-consistency constraint), making it possible to constrain more “abstract” relationships between images of different domains.

Formally, to keep content information in the translation we will minimize the Euclidean distance and the cosine distance (that is, maximize the cosine similarity) between the two transformation vectors, so that both angle and magnitude of the vectors get preserved.

Formal TraVeL loss
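A minimal NumPy sketch of this idea follows, assuming the siamese encodings are already available as arrays of shape (batch, vec_len). The function name `travel_loss`, the use of the squared Euclidean distance, and the equal weighting of the two terms are assumptions for illustration and may differ from the exact formulation in the paper.

```python
import numpy as np

def travel_loss(s_a1, s_a2, s_ga1, s_ga2, eps=1e-8):
    """Penalize differences between source and translated transformation vectors.

    s_a1, s_a2:   siamese encodings of two source samples, shape (batch, vec_len)
    s_ga1, s_ga2: siamese encodings of the same samples after translation
    """
    t_src = s_a1 - s_a2                              # S(A1) - S(A2)
    t_gen = s_ga1 - s_ga2                            # S(G(A1)) - S(G(A2))
    sq_dist = np.sum((t_src - t_gen) ** 2, axis=-1)  # magnitude term
    cos_sim = np.sum(t_src * t_gen, axis=-1) / (
        np.linalg.norm(t_src, axis=-1) * np.linalg.norm(t_gen, axis=-1) + eps)
    # 1 - cos_sim is the cosine distance: zero when the vectors point the same way.
    return float(np.mean(sq_dist + (1.0 - cos_sim)))
```

The loss is close to zero when the translated pair is related by the same transformation vector as the source pair, even if both encodings have been shifted elsewhere in the latent space.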

Furthermore, it is important to clarify that both the generator and the siamese network must cooperate to achieve this objective. More specifically, the gradients of the TraVeL loss get backpropagated through both of the networks and their weights get updated accordingly. Thus, while the discriminator and the generator have an adversarial objective (they challenge one another to reach their goal), the siamese and the generator help each other, cooperating under the same rules.

In addition to this “content” loss, the generator will learn how to generate realistic samples thanks to a traditional adversarial loss (in my experiments I used the hinge loss).
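For reference, the commonly used hinge formulation of the adversarial loss can be sketched as below. The article only states that the hinge loss was used, so treat this as the standard version of that loss rather than the exact code from the experiments; the function names are hypothetical.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator: push scores of real samples above +1 and fake ones below -1."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    """Generator: raise the discriminator's score on generated samples."""
    return float(-np.mean(d_fake))
```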

If you are new to GANs and how they work, or if you want to dive a little deeper into how to preserve content information with a latent space, I recommend you check out my article here on how to apply the same techniques on a simple image-to-image translation task.

Translating Audio Signals of Arbitrary Length

Now that we have explored a method that allows us to preserve content information in the translation, we need to understand how we can make the generator convert samples (spectrograms) that are arbitrarily long, without putting extra work on computation and training times.

Let’s say we have an audio signal: “extracting” the mel-spectrogram of the signal we obtain an image with a single channel (unlike traditional 3-channel RGB images) with a fixed height H (the number of mel bands) and a width X that depends on the length of the audio sample and on the hop size used for the STFT.

However, working with images that have variable dimensions is known to be a challenging task, especially if we do not decide those dimensions beforehand. That’s why we will split all the spectrograms (of shape XxH with X that varies) into chunks with a determined width, let’s say L. Perfect! Our dataset now consists of source and target spectrograms with known dimensions (LxH), and we are ready to proceed to the next step.

Each spectrogram in the dataset has a fixed height H and width L

Before creating our generator G, we need to specify the dimensions of its inputs, which in our case will be (L/2)xH. In other words G will accept spectrograms that have half the width of those in our dataset. Why? Because in this way we will be able to make G translate the whole original XxH spectrograms that we previously split up. Let’s discover how.

Our training pipeline will consist in the following actions:

  1. Split the source LxH spectrograms in half, obtaining (L/2)xH spectrograms
  2. Feed the pairs of halves to the generator and get the translated pairs as outputs
  3. Concatenate the translated halves back to their original shape (LxH)
  4. Feed the translated and the target LxH spectrograms to the discriminator, making it distinguish one from the other and allowing the adversarial training.

Illustration of the training pipeline: splitting, converting and concatenating.
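Steps 1 to 3 of the pipeline can be sketched as follows, assuming spectrogram batches are stored as arrays of shape (N, H, L) with time on the last axis. The function name `translate_in_halves` is hypothetical, and `generator` stands for any callable that maps (N, H, L/2) arrays to arrays of the same shape.

```python
import numpy as np

def translate_in_halves(batch, generator):
    """Split each LxH spectrogram in half, translate both halves, re-join them.

    batch: array of shape (N, H, L), with L even.
    Returns an array of shape (N, H, L), ready to be fed to the discriminator.
    """
    width = batch.shape[-1]
    if width % 2 != 0:
        raise ValueError("spectrogram width L must be even")
    half = width // 2
    left, right = batch[..., :half], batch[..., half:]
    # The same generator translates both halves independently.
    return np.concatenate([generator(left), generator(right)], axis=-1)
```

Because the discriminator only ever sees the re-joined output, any visible seam at the middle of the spectrogram gives it an easy cue to reject the sample.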

Making the discriminator inspect the concatenated “fake” spectrograms and comparing them to the “real” target ones forces the generator to generate samples that when concatenated together result in a realistic spectrogram.

In other words the translated (L/2)xH samples must not present any discontinuity on the edges that would make them look unrealistic to the discriminator. Thus, this important constraint on the generator is what allows us to translate audio signals of any length from one domain to the other.

After training, when wanting to translate an arbitrary spectrogram of shape XxH where X varies and is given by the length of the original audio signal, this is what we will need to do:

  1. Split the XxH spectrogram into (L/2)xH chunks, using padding if X is not perfectly divisible by L/2
  2. Feed each (L/2)xH sample to the generator for translation
  3. Concatenate the translated samples into the original XxH shape, cutting out the extra if padding was used.

The final translated sample should not present discontinuities and should present the same style as the target domain (a particular voice or music genre). Easy, isn’t it?
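The three inference steps can be sketched like this, assuming a single spectrogram stored as an (H, X) array with X > 0. The function name `translate_full` and the zero padding are assumptions: for log-scaled spectrograms, padding with the value that represents silence may be more appropriate.

```python
import numpy as np

def translate_full(spec, generator, chunk_width):
    """Translate an (H, X) spectrogram of arbitrary width X.

    chunk_width is the generator's input width (L/2 in the text).
    """
    _, x = spec.shape
    pad = (-x) % chunk_width                         # columns needed to reach a multiple
    padded = np.pad(spec, ((0, 0), (0, pad)))        # zero padding (an assumption)
    chunks = np.split(padded, padded.shape[1] // chunk_width, axis=1)
    translated = np.concatenate([generator(c) for c in chunks], axis=1)
    return translated[:, :x]                         # cut out the padded columns
```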

Examples of source and converted spectrograms: the concatenated samples do not present discontinuities

Putting Everything Together

We have previously learned how to preserve content from the source audio sample (in the case of speech, the verbal information; in the case of music, the melody of a song) without being constrained to translate between visually similar domains (spectrograms of different voices or music genres can look extremely different). We have also seen a simple but effective technique that allows us to convert samples of arbitrary length.

Now it is finally time to put everything together.

This is an extract from my paper that presents this technique:

Putting everything together: the siamese network helps preserve content keeping vector arithmetic between source and converted samples

MelGAN-VC training procedure. We split spectrogram samples, feed them to the generator G, concatenate them back together and feed the resulting samples to the discriminator D to allow translation of samples of arbitrary length without discrepancies. We add a siamese network S to the traditional generator-discriminator GAN architecture to preserve vector arithmetic in latent space and thus have a constraint on low-level content in the translation. An optional identity mapping constraint is added in tasks which also need a preservation of high-level information (linguistic information in the case of voice translation).

Furthermore, we must add a margin loss for the siamese network to prevent it from degenerating into a trivial function that satisfies its objective. The margin loss keeps all the vectors produced by S far from one another, so that the network can’t associate the same exact vector to every input and must learn meaningful relationships, creating a useful latent space.

where delta is a fixed value and t is the transformation vector
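One common way to write such a margin loss is max(0, delta - ||t||), averaged over pairs of source samples. The sketch below assumes that form, with encodings given as arrays of shape (batch, vec_len); the function name and the default value of `delta` are placeholders.

```python
import numpy as np

def margin_loss(s_a1, s_a2, delta=1.0):
    """Penalize pairs of encodings that are closer than delta to each other."""
    t = s_a1 - s_a2                                  # transformation vector
    return float(np.mean(np.maximum(0.0, delta - np.linalg.norm(t, axis=-1))))
```

If the siamese network collapsed every input to the same vector, every t would be zero and the loss would equal delta, so the trivial solution is no longer free.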

Finally, here are the formal losses used to train the three networks:

Final losses for generator G, discriminator D, siamese network S

It is important to note that the added identity constraint (mean squared error between samples from the target domain and those same samples translated by the generator) is only useful in the case of voice translation, where linguistic information must be preserved and where our content loss (based on the vector outputs of the siamese network) struggles to capture that high-level information.
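The identity constraint described above, and one way it could enter the generator objective, can be sketched as follows. The function names and the weights `alpha` and `beta` are placeholders for illustration, not values or definitions taken from the paper.

```python
import numpy as np

def identity_loss(target_batch, generator):
    """Mean squared error between target-domain samples and their translations."""
    return float(np.mean((generator(target_batch) - target_batch) ** 2))

def generator_loss(adversarial, travel, identity, alpha=1.0, beta=1.0):
    """Weighted sum of the generator's terms; alpha and beta are placeholder weights."""
    return adversarial + beta * travel + alpha * identity
```

For tasks such as music genre conversion, where the identity constraint is not needed, `alpha` can simply be set to zero.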

I recommend and invite you to read my full paper if you’re looking for more information on this particular technique or if you prefer a more formal and methodical explanation.


Today we have learned how to perform voice translation and audio style transfer (such as music genre conversion) using a deep convolutional neural network architecture and a couple of tricks and techniques to achieve realistic translations on arbitrarily long audio samples.

We now know that we are able to leverage a large part of the recent research on deep learning for computer vision applications to also solve tasks related to audio signals, thanks to the image-equivalent spectrogram representation.

Finally, I would like to conclude by acknowledging that this and other techniques can be misused for malicious purposes, especially in the case of voice translation. With the rise of powerful machine learning methods for creating realistic fake data, we should all be aware and cautious when exploring and using this kind of algorithm. While the research won’t stop and shouldn’t be stopped, we should also allocate resources to detecting the fake data that we helped create.

Thank you so much for your precious attention, have fun!

P.S. If you’re interested in GANs and GAN related out-of-the-box ideas and applications, you should also check out:

10 Lessons I Learned Training GANs for a Year (if you’re interested in tips and tricks to help you in the super challenging task that is training GANs)

Style Transfer with GANs on HD Images (where I use a similar technique to allow style transfer of large images with very little computation resources)

A New Way to look at GANs (where I explore in great detail how a latent space works and how it can be leveraged for an image-to-image translation task)

Synthesizing Audio with Generative Adversarial Networks (where I explore a paper that proposes the use of convolutional GANs to generate audio using raw waveform data and 1-dimensional convolutions)