Paper Reading – A Universal Music Translation Network, from Facebook AI Research


A few days ago, Facebook AI Research published a paper on arXiv proposing a framework for universal music domain transfer. Before we dig into the paper, let’s check out the sample results on YouTube.

While it is true that they are over-claiming the power of their model, they have produced high-quality music domain transfer. To be precise, what they have done is make a song sound as if it were performed by a different configuration of instruments. For example, they converted a piece of Bach’s organ works into a generated piece that sounds like Beethoven’s piano music. The melody and harmony remain, but the texture of the sound (the instrumentation) differs.

So why do I say they over-claimed their model? Because the results are closer to timbral-texture transfer than to music style or domain transfer. In the example above, they actually produced new clips that sound like Beethoven’s piano in terms of timbre, rather than transforming the composition into Beethoven’s style.

Despite the over-claim, this is still an interesting piece of work. They adopted a WaveNet auto-encoder (AE) architecture similar to the one used in the NSynth project from Google Magenta. WaveNet is an autoregressive model that processes raw audio, predicting the next audio sample from the previously generated ones.

Fig. 1 WaveNet auto-encoder model from NSynth
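To make the autoregressive idea concrete, here is a minimal PyTorch sketch of a stack of dilated causal 1-D convolutions. This is not the NSynth or FAIR implementation; the layer count, channel width, and dilation schedule are assumptions for illustration. The key point is that left padding keeps every output dependent only on past samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1-D convolution that only sees past samples (achieved by left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):
        # Pad on the left so the output at time t never sees samples after t.
        return super().forward(F.pad(x, (self.left_pad, 0)))

class TinyWaveNet(nn.Module):
    """Toy autoregressive model over 8-bit mu-law audio (256-way softmax per sample)."""
    def __init__(self, channels=64, n_layers=6):
        super().__init__()
        self.input = CausalConv1d(256, channels, kernel_size=2)
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, channels, kernel_size=2, dilation=2 ** i)
             for i in range(n_layers)]
        )
        self.output = nn.Conv1d(channels, 256, kernel_size=1)

    def forward(self, x_onehot):            # x_onehot: (batch, 256, time)
        h = self.input(x_onehot)
        for layer in self.layers:
            h = h + torch.relu(layer(h))    # simple residual connection
        return self.output(h)               # logits used to predict the next sample
```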

They modified the 1–1 auto-encoder to suit the purpose of domain transfer: one universal encoder compresses audio from the different domains into domain-invariant latent codes, and k decoders generate audio, one for each domain i, where i = 1, 2, ..., k.

Fig. 2 Model used in this paper
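A rough sketch of how this wiring could look in PyTorch is below, assuming placeholder encoder and decoder modules rather than the paper’s actual WaveNet components: one shared encoder, k decoders, and a forward pass that renders the latent code with the decoder of the requested domain.

```python
import torch.nn as nn

class MusicTranslationNet(nn.Module):
    """One universal encoder, one decoder per domain (an illustrative sketch)."""
    def __init__(self, encoder, decoders):
        super().__init__()
        self.encoder = encoder                    # shared across all k domains
        self.decoders = nn.ModuleList(decoders)   # decoders[i] generates audio of domain i

    def forward(self, audio, target_domain):
        z = self.encoder(audio)                   # (ideally) domain-invariant latent code
        return self.decoders[target_domain](z)    # render the code with the target decoder
```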

Then we may wonder: how did they manage to create a universal encoder for all domains? The trick is adversarial training between the AE and the Domain Classification Network (DCN) shown in Fig. 2. The DCN is trained to classify the original domain of the input from its latent representation. By adding an adversarial term to the AE loss, the competition between the AE and the DCN forces the AE to discard domain information and compress the inputs into domain-invariant latent representations.
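This adversarial game can be sketched as two alternating updates, roughly in the spirit of a GAN: the DCN learns to recover the domain label from the latent code, while the AE learns to reconstruct the audio and fool the DCN. The λ weight, the simplified decoder call, and the update schedule below are illustrative assumptions, not the paper’s exact training procedure.

```python
import torch
import torch.nn.functional as F

def dcn_step(encoder, dcn, dcn_opt, audio, domain_labels):
    """Update the DCN so it recognises which domain a latent code came from."""
    with torch.no_grad():                          # do not update the encoder here
        z = encoder(audio)
    loss = F.cross_entropy(dcn(z), domain_labels)
    dcn_opt.zero_grad()
    loss.backward()
    dcn_opt.step()

def ae_step(encoder, decoders, dcn, ae_opt, audio, targets, j, domain_labels, lam=0.01):
    """Update the encoder and decoder j: reconstruct domain-j audio while fooling the DCN.

    `audio` is an augmented batch from domain j, `targets` its mu-law class targets;
    `lam` is an assumed weighting, and `ae_opt` holds only encoder/decoder parameters.
    """
    z = encoder(audio)
    recon = F.cross_entropy(decoders[j](z), targets)    # reconstruction term
    confusion = F.cross_entropy(dcn(z), domain_labels)  # domain classification term
    loss = recon - lam * confusion                      # small reconstruction, large confusion
    ae_opt.zero_grad()
    loss.backward()
    ae_opt.step()
```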

So let’s take a look at the training loss. Let s^j be an input sample from domain j = 1, 2, ..., k, with k being the number of domains employed during training. Let E be the shared encoder and D^j the WaveNet decoder for domain j. Let C be the domain classification network, and O(s, r) the random augmentation procedure applied to a sample s with a random seed r (a random pitch shift used to prevent overfitting). L(o, y) is the cross-entropy loss.
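As a side note, a minimal sketch of what O(s, r) could look like is given below, using librosa’s pitch shifting; the sample rate and the ±0.5-semitone range are assumptions for illustration, not the paper’s exact settings.

```python
import numpy as np
import librosa

def augment(s, r, sr=16000, max_semitones=0.5):
    """O(s, r): randomly pitch-shift raw audio s using seed r.

    The shift range and sample rate are illustrative assumptions,
    not the exact settings from the paper.
    """
    rng = np.random.default_rng(r)
    n_steps = rng.uniform(-max_semitones, max_semitones)
    return librosa.effects.pitch_shift(s, sr=sr, n_steps=n_steps)
```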

Eq. 1 shows the adversarial loss of the proposed model: the first term is the reconstruction loss and the second is the domain classification term. We want the reconstruction loss to be small while the domain classification loss is large; this is where the adversarial training comes in.

Eq. 1 The loss of the proposed auto-encoder model
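Written out with the notation above, the auto-encoder loss in Eq. 1 can be reconstructed as follows (my own transcription from the definitions in this post; λ is the hyperparameter balancing the two terms):

L_AE = Σ_{j=1..k} Σ_{s^j} [ L(D^j(E(O(s^j, r))), s^j) − λ · L(C(E(O(s^j, r))), j) ]

The DCN, in turn, is trained to minimize the classification loss L(C(E(O(s^j, r))), j), which is what makes the two objectives adversarial.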

So the above is a rough summary of the paper; for more details regarding the experiments and findings of this cool work, please refer to the original paper on arXiv. Although the paper is arguably exaggerated in describing its contribution, it still stands as a clear stepping stone that can inspire and attract more effort toward deep-learning-based music and audio technology. Last but not least, it is good material for learning how to present your own piece of work ;)

Source: Deep Learning on Medium