Music Style Transfer with Deep Learning Method

Source: Deep Learning on Medium

By Rex Zhou

Motivation and Introduction

Have you ever wanted to listen to a jazz version of Justin Bieber’s songs, or to hear Justin Bieber perform his own version of Louis Armstrong? We were curious how neural networks could help with that, since applying neural networks to audio tasks has become very popular recently.


Music is fundamentally a sequence of notes. A composer constructs long sequences of notes, which are then performed on an instrument to produce music. Two important aspects of music are composition and performance. Composition concerns the notes that define the musical score, and performance concerns how those notes are played. Humans can distinguish musical styles instinctively, but to work with style computationally we need an explicit definition. So, before starting the experiment, we give our own: we characterize a music style by its dynamics (the loudness of the music), its instruments, and its pitches.

Neural Networks

Deep generative models can be applied to change properties of existing data in a principled way, and even to transfer properties between data samples. This idea has been widely used in the computer vision domain, where it has produced many astonishing results. In this project, we take a step towards transferring properties in music. The model we experimented with is an architecture consisting of parallel Variational Autoencoders (VAEs) with a shared latent space and an additional style classifier.

Model and Experiment

Our model is based on the VAE and operates on a symbolic music representation extracted from MIDI files. We extend the standard piano roll representation of pitches with dynamics and instrument rolls, modeling what we consider the most important information contained in MIDI files. Our modified VAE uses parallel recurrent encoder-decoder pairs that share a latent space. A style classifier is applied to the latent space to force the encoder to learn a compact “latent style label” encoding, which we can then use to perform style transfer.

[Figure: model architecture. Parallel Variational Autoencoder with LSTM layers.]

Symbolic Music Representation

We represent our music in the MIDI format, which is a symbolic representation. MIDI files contain multiple tracks, which can be easily extracted with the pretty_midi package. At each time step, a track can be on with a certain pitch, held, or silent. An instrument is assigned to each track. To feed the note pitches, velocities, and instruments into the model, we represent each of them as its own tensor.
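To make the three tensors concrete, here is a minimal sketch of how pitch, velocity, and instrument rolls could be built. It is not our actual extraction code: the note tuples are hard-coded stand-ins for what a MIDI parser such as pretty_midi would provide, the function names and the [track][step][pitch] layout are assumptions made for illustration, and the sketch does not distinguish a held note from a newly played one.

```python
# Hypothetical note events: (track, pitch, velocity, start_step, end_step).
# In practice these would be read from a MIDI file; they are hard-coded here
# so the sketch runs on its own.
NUM_PITCHES = 128  # MIDI pitches range from 0 to 127


def build_rolls(notes, num_steps, num_tracks):
    """Return (pitch_roll, velocity_roll), both indexed as [track][step][pitch]."""
    pitch_roll = [[[0] * NUM_PITCHES for _ in range(num_steps)]
                  for _ in range(num_tracks)]
    velocity_roll = [[[0.0] * NUM_PITCHES for _ in range(num_steps)]
                     for _ in range(num_tracks)]
    for track, pitch, velocity, start, end in notes:
        # Mark every time step the note sounds, clipping at the end of the roll.
        for step in range(start, min(end, num_steps)):
            pitch_roll[track][step][pitch] = 1
            velocity_roll[track][step][pitch] = velocity / 127.0  # scale to [0, 1]
    return pitch_roll, velocity_roll


def build_instrument_roll(programs, num_programs=128):
    """One-hot encode the MIDI program number assigned to each track."""
    return [[1 if p == program else 0 for p in range(num_programs)]
            for program in programs]


notes = [(0, 60, 100, 0, 4), (1, 43, 80, 2, 6)]
pitch_roll, velocity_roll = build_rolls(notes, num_steps=8, num_tracks=2)
instrument_roll = build_instrument_roll([0, 32])  # illustrative program numbers
```

Each roll can then be fed to its own encoder, which is why the three are kept as separate tensors rather than merged into one.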

Parallel VAE

The parallel VAE is the model that we introduce in this project. It is based on the standard VAE, with a hyperparameter β that weights the KL divergence in the loss function.
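As a rough illustration of how β enters the loss, the sketch below adds a β-weighted KL term to a sum of reconstruction losses. The KL formula is the standard closed form for a diagonal Gaussian against a standard normal prior. The function names, the unweighted sum over reconstruction losses, and the numbers are assumptions made for this example, not our training code.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def beta_vae_loss(reconstruction_losses, mu, log_var, beta):
    """Sum of the reconstruction losses plus the beta-weighted KL term."""
    return sum(reconstruction_losses) + beta * kl_to_standard_normal(mu, log_var)


# Hypothetical values: one reconstruction loss per decoder output.
loss = beta_vae_loss([1.2, 0.4, 0.3], mu=[0.5, -0.5], log_var=[0.0, 0.0], beta=0.1)
```

With β = 1 this reduces to the standard VAE objective; larger values put more pressure on the latent code to match the prior.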

Loss of Parallel VAE

In this model, we feed three input tensors (pitch, instrument, and velocity) into parallel LSTM encoders and concatenate their outputs to get a joint latent space. The latent vector is then fed into three parallel decoders to reconstruct the pitch, velocity, and instrument rolls.

The goal of the project is to construct harmonic multitrack music, so we want to learn a joint distribution rather than three marginal distributions. This is why we chose three parallel encoder-decoder pairs that share a latent space instead of three independent autoencoders.

Choosing a high value for β (the weight of the KL term in the VAE loss function) has been shown to increase disentanglement of the latent space in the visual domain. However, increasing β has a negative effect on the reconstruction performance. Therefore, we introduce additional structure into the latent space by attaching a softmax style classifier to the top k dimensions of the latent space, where k equals the number of different styles in our dataset. This forces the encoder to write a “latent style label” into the latent space. Using only k dimensions and a weak classifier encourages the encoder to learn a compact encoding of the style.
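The sketch below illustrates the idea of a “latent style label” under a few stated assumptions: “top k dimensions” is read as the first k entries of the latent vector, the classifier is simplified to a softmax applied directly to those entries, and style transfer is done by swapping the values of two style dimensions before decoding. All names and numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def style_classifier_loss(z, k, true_style):
    """Cross-entropy of a softmax over the first k latent dimensions."""
    probs = softmax(z[:k])
    return -math.log(probs[true_style])


def swap_style(z, source_style, target_style):
    """Return a copy of z with the two style dimensions exchanged."""
    z_new = list(z)
    z_new[source_style], z_new[target_style] = z[target_style], z[source_style]
    return z_new


# Hypothetical latent vector with k = 2 styles (index 0 = jazz, index 1 = pop).
z = [2.0, -1.0, 0.3, 0.7]
loss = style_classifier_loss(z, k=2, true_style=0)
z_transferred = swap_style(z, source_style=0, target_style=1)
```

The remaining dimensions are left untouched, so the decoders would receive the same content encoding with a different style label.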


The sample output:

The quality of the transferred music is not very high. The reasons I can think of are:

  1. We limited the number of instruments per song to 4 and excluded the drum track, since drums carry no pitch value, even though drums are important in jazz. Also, a song can contain several tracks with the same instrument, and our feature extraction may concatenate them and mix up the separate tracks.
  2. Jazz and pop use quite similar instruments, so it is hard for our style classifier to tell the two genres apart.
  3. Due to our limited computational power, we were only able to train our parallel VAE for 100 epochs, using only 400 songs as input.
  4. VAEs usually produce blurry results, as many works in the computer vision domain have confirmed, even though they are generative models that are easy to train.
  5. We did not use KL weight annealing in our loss function. Adding it could push the model to learn more from the latent space instead of from the RNN’s hidden states.

To Improve

Here are the things I can think of to improve our model:

  1. Find a machine with more computational power. We could then train our model for more epochs and on more data. According to the metric plots, training has not converged yet, so the model still has a lot of room to improve.
  2. Add more genres. The transferred version of a song is produced by interpolating from the notes of a given song. With data from more genres, we would have more sources to interpolate the desired notes from.
  3. In theory, we could make the VAE easier to train by adding KL weight annealing to the loss.
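KL weight annealing can be sketched as a schedule that starts the KL weight at zero and raises it during training. The linear ramp below is only one common choice, and the function name and parameters are hypothetical. We did not run this in our experiments.

```python
def kl_weight(step, warmup_steps, beta_max):
    """Linearly increase the KL weight from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max  # no warm-up requested
    return beta_max * min(1.0, step / warmup_steps)


# The weight used in the loss at a few hypothetical training steps.
schedule = [kl_weight(step, warmup_steps=1000, beta_max=0.5)
            for step in (0, 500, 1000, 2000)]
```

Early in training the model is then free to focus on reconstruction, and the pull towards the prior is introduced gradually.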

Alternative Method

CycleGAN has proven itself in many property transfer tasks in vision. It does not require the extraction of explicit style and content features, but instead uses a pair of generators, one transforming data from a domain A to another domain B and the other transforming it back. The nature of the two domains implicitly specifies the kinds of features that will be extracted.
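One ingredient of CycleGAN is a cycle-consistency term: a sample mapped from A to B and back to A should come out close to where it started. The sketch below computes that term as a mean absolute error. The two toy functions are stand-ins for the generators, which would be neural networks in a real system, and the adversarial losses are left out.

```python
def cycle_consistency_loss(x_a, g_ab, g_ba):
    """Mean absolute error between x_a and its round trip A -> B -> A."""
    reconstructed = g_ba(g_ab(x_a))
    return sum(abs(r - x) for r, x in zip(reconstructed, x_a)) / len(x_a)


# Toy stand-ins for the two generators (hypothetical, for illustration only).
def to_b(x):
    return [v + 1.0 for v in x]


def imperfect_to_a(x):
    return [v - 0.5 for v in x]


loss = cycle_consistency_loss([0.0, 1.0, 2.0], to_b, imperfect_to_a)
```

A perfect inverse pair would give a loss of zero. Here the round trip is off by 0.5 per element, so the loss is positive.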