Music Generation using Deep Learning

In this project, my goal is to build a machine that can generate endless, temporally consistent music from a set of songs, rather than from a single song as input. The generated music consists of one melody and one accompaniment, hence the name MAC-Net (Melody and Accompaniment Composer Network).

Here’s a taste of what’s to come.

Algorithmic music composition has developed considerably in the last few years, but the idea has a long history and is certainly not a new task. There have been several notable works using deep learning approaches, such as Deep Jazz, Magenta, Bach Bot, Flow Machines, WaveNet, GRUV, and many others.

To apply deep learning to such a task, we first have to define the best format for representing the data to the model. The aforementioned works can be divided into two categories by input method: wave format and note sequences (ABC notation, MIDI). GRUV and WaveNet fall into the wave-format category, while the rest of the mentioned methods use note sequences.

Comparison of Note Sequences and Raw Audio

Knowing this, I chose to use note sequences to represent the music data. MIDI (Musical Instrument Digital Interface) is a protocol designed for recording and playing back music on digital synthesizers, and it is supported by many makers of personal-computer sound cards. Originally intended to control one keyboard from another, it was quickly adopted for the personal computer. Rather than representing musical sound directly, it transmits information about how music is produced. The command set includes note-ons, note-offs, key velocity, pitch bend, and other ways of controlling a synthesizer.

Now we need a model suitable for generating music. The two most commonly used architectures are feedforward neural networks and recurrent neural networks.

Feedforward Neural Networks:

A single node in a simple neural network takes some number of inputs and performs a weighted sum of them, multiplying each by some weight before adding them all together. A bias is then added, and the overall sum is fed into a nonlinear activation function, such as a sigmoid, ReLU, or LeakyReLU.
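This computation can be sketched in a few lines of plain Python (the weights and bias here are arbitrary illustrative values, and ReLU stands in for any activation):

```python
# A single feedforward node: a weighted sum of the inputs plus a bias,
# passed through a nonlinear activation (ReLU here; a sigmoid would work too).
def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, z)  # ReLU: pass positive values, clamp negatives to 0

print(neuron([1.0, 2.0, -1.0], [0.5, -0.25, 1.0], bias=0.5))  # -0.5 -> ReLU -> 0.0
```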

Recurrent Neural Networks

Notice that in a basic feedforward network, information flows in a single direction, from input to output. In a recurrent neural network, this constraint does not exist: we take the output of each hidden layer and feed it back to itself as an additional input. Each node of the hidden layer receives both the inputs from the previous layer and the outputs of the same layer at the previous time step.
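A minimal scalar sketch makes the feedback loop concrete (the weights here are arbitrary toy values, and tanh is one common choice of activation):

```python
import math

# One recurrent step: the new hidden state depends on the current input
# AND the hidden state from the previous time step, which is fed back in.
def rnn_step(x, h_prev, w_x, w_h, bias):
    return math.tanh(w_x * x + w_h * h_prev + bias)

h = 0.0                          # empty hidden state at t=0
for x in [1.0, 0.5, -1.0]:       # a toy input sequence
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, bias=0.0)
print(h)                         # the state now summarizes the whole sequence
```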

If you are interested in the details of how recurrent neural networks and their variants work, I suggest reading up on them for more insight.

OK, a short recap… we are going to use MIDI as the format for the musical data and a recurrent neural network as the brain that generates songs. With that said, this project aims to generate songs of arbitrary, user-specified length using the GRU, a variant of the sequence model. The model should be able to accommodate multiple songs represented in MIDI format.


Before feeding the musical data into MAC-Net, I need to perform two preprocessing steps: encoding the MIDI data into a beat representation and discretizing it. MAC-Net then tries to generate a composition, consisting of a melody and an accompaniment, that follows the style of the given songs. It extracts the temporal features contained in a song using a recurrent neural network variant (GRU) and generates songs from these learned features. Here’s an overview of the MAC-Net pipeline.

MAC-Net Pipeline Overview


In MIDI format the sound data is stored as note sequences (128 possible notes), which drastically reduces both computational and musical complexity. Instead of predicting the amplitude, phase, and other wave parameters, with a MIDI representation the model only needs to predict which note to press and for how long. MIDI stores the pitches of a song as discrete values, so it doesn’t sound rich when played back directly. However, we can use a soundfont to apply textures or styles to a MIDI file, making it sound more lively and rich.

My initial attempt used the smallest note duration as the sampling rate, with a flag of 0 or 1 indicating whether the note is sustained or articulated. This representation works very well provided that all note durations are divisible by the sampling rate.

Dummy data for data encoding illustration

For example, the data shown above has one melody track and one accompaniment track. Each has 4 notes, and both are played for 8 sampling steps. At the beginning of the melody track, note 1 is pressed, so at t=0 it is encoded as “1–1”. At t=1, however, note 2 is pressed and held over two time steps (t=1 and t=2), so it is encoded as “2–1” at t=1 and “2–0” at t=2.
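The encoding above can be sketched as follows; the `(pitch, start, duration)` note format and the “0-0” rest token are assumptions for illustration:

```python
# Encode a track as "note-flag" tokens, one per sampling step:
# flag 1 = the note is articulated (newly pressed) at this step,
# flag 0 = the note is sustained from an earlier step.
def encode_track(notes, step, total_steps):
    """notes: list of (pitch, start_time, duration); step: smallest note duration."""
    tokens = []
    for t in range(total_steps):
        time = t * step
        token = "0-0"  # silence by default (hypothetical rest token)
        for pitch, start, dur in notes:
            if start <= time < start + dur:
                flag = 1 if time == start else 0  # articulated vs sustained
                token = f"{pitch}-{flag}"
        tokens.append(token)
    return tokens

# Note 1 pressed at t=0, note 2 pressed at t=1 and held through t=2:
melody = [(1, 0.0, 1.0), (2, 1.0, 2.0)]
print(encode_track(melody, step=1.0, total_steps=3))  # ['1-1', '2-1', '2-0']
```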

The aforementioned representation works well with this kind of data, where all of the note durations are divisible by a single sampling rate.

Failure Case on Initial Attempt

But the representation fails when this assumption is not satisfied. For example, the representation expects the shortest note in a MIDI file to be 0.125, but in some MIDI songs this no longer holds: songs with different time signatures and tempos can have different shortest note durations, e.g. 0.33, 0.25, or 0.167, which leads to:

1. Shifting song data: a note A played for 0.33 seconds will be represented as 0.375 seconds, shifting the song data every time such a note is represented

2. Merging notes: two A notes played for 0.33 seconds each will be represented as a single A note played for a 0.675 duration

3. Truncating notes: notes that start between sampling points get truncated. For example, if note A is played 0.33 seconds from the beginning of the song, it gets truncated, since the sampling points fall at “0.125 0.25 0.375”
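The first failure mode is easy to reproduce with a couple of lines: snapping a time to the nearest multiple of a fixed 0.125 grid (a sketch of the quantization step, not the project’s exact code):

```python
GRID = 0.125  # the assumed smallest note duration / sampling rate

# Snap a time (in seconds) to the nearest point on the sampling grid.
def quantize(t):
    return round(t / GRID) * GRID

# Shifting: a 0.33-second note snaps to 0.375 seconds.
print(quantize(0.33))   # 0.375
# A 0.167-second note collapses down to a single 0.125 step.
print(quantize(0.167))  # 0.125
# Merging and truncation follow the same mechanism: the boundary between
# two 0.33-second notes, or an onset at 0.33 s, falls between grid points.
```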

There are three factors that affect a note’s duration in a song: time signature, BPM, and note type. For example, a 4/4 time signature, 120 BPM song with a different note type will still have the sampling-rate problem mentioned above, since a 0.33-second note duration can exist in a song whose smallest sampling rate is 0.125.

Music Sheet and MIDI visualization of 4/4 Time Signature Song with “Special Notes”

As shown above, the first note is a dotted note, which means it is played for 0.667 seconds and the next note for 0.333 seconds. That made me rethink how I represent the musical data. A 4/4 time signature song will always have 4 beats per measure regardless of BPM and note type, so instead of using the shortest note duration as the sampling rate, I sample the song data once every beat. This way, such music data is encoded as follows.


The process of tokenization first creates a dictionary containing all unique elements in the encoded song data, then converts each encoded token, e.g. “72–0.7–1#69–0.3–1”, to a single number in that dictionary. The size of this dictionary indicates the complexity of the problem the sequence model will try to estimate: more songs mean a larger dictionary, and thus greater complexity. To deal with multiple songs, I simply concatenate their tokenized vectors to create a “playlist”, which is then used to train the model.
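A minimal sketch of this tokenization step (the token strings are illustrative, and ids are assigned in order of first appearance):

```python
# Build a vocabulary over encoded song tokens, map each token to an id,
# and concatenate the tokenized songs into one "playlist".
def tokenize(songs):
    vocab = {}       # token -> integer id; its size is the problem complexity
    playlist = []
    for song in songs:
        for token in song:
            if token not in vocab:
                vocab[token] = len(vocab)
            playlist.append(vocab[token])
    return vocab, playlist

songs = [["72-0.7-1#69-0.3-1", "72-1.0-0"],
         ["72-0.7-1#69-0.3-1", "60-1.0-1"]]
vocab, playlist = tokenize(songs)
print(len(vocab))   # 3 unique tokens
print(playlist)     # [0, 1, 0, 2]
```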


After all the data preprocessing is done (encoding and tokenization), we can now start training the model. Here’s how I trained the sequence model to extract the temporal features contained in the musical data; let’s discuss training using only the melody. For each training step, a snippet of the playlist is taken and used to train the model. Since each datum is just a single number (from the tokenization process), I use an embedding to project it into a higher dimension, where the model can encode more information about it. During training, the model input follows the original data to ensure the learning process goes properly. The output of each time step is then compared to the target Y; I use softmax cross-entropy to quantify my unhappiness with the model’s results.
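The data-preparation part of this scheme can be sketched as next-token prediction over playlist snippets (the embedding, GRU, and cross-entropy loss themselves are omitted; `seq_len` is an assumed hyperparameter):

```python
# Each training example is a snippet of the playlist: the input X is the
# snippet, and the target Y is the same snippet shifted one step ahead,
# so at every time step the model learns to predict the next token.
def make_snippets(playlist, seq_len):
    pairs = []
    for i in range(len(playlist) - seq_len):
        x = playlist[i : i + seq_len]
        y = playlist[i + 1 : i + seq_len + 1]  # next-token targets
        pairs.append((x, y))
    return pairs

playlist = [0, 1, 0, 2, 3, 1]
for x, y in make_snippets(playlist, seq_len=3):
    print(x, "->", y)   # e.g. [0, 1, 0] -> [1, 0, 2]
```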


At the beginning of generation, the network’s hidden state is still empty (no memory stored); the best way to build it up is to feed a snippet of the original data as a seed. Once the seed has built up the model’s hidden state, the model takes its own output as the input of the next step. This goes on until the requested duration is reached.
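The control flow of this seeded, autoregressive generation looks like the sketch below; the stand-in `toy_model` (which just cycles token ids) is purely hypothetical, standing in for the trained GRU:

```python
# Autoregressive generation: first "warm up" the hidden state on a seed
# snippet from the real data, then feed each output back in as the next
# input until the requested length is reached.
def generate(model, seed, length):
    state, token = None, None
    for token_in in seed:                 # build up the hidden state
        token, state = model(token_in, state)
    out = []
    for _ in range(length):               # model consumes its own output
        out.append(token)
        token, state = model(token, state)
    return out

# A stand-in "model" that just cycles tokens, to show the control flow:
toy_model = lambda tok, state: ((tok + 1) % 4, state)
print(generate(toy_model, seed=[0, 1], length=5))  # [2, 3, 0, 1, 2]
```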

I evaluate my setup on the Nottingham database, a collection of 1200 British and American folk tunes (hornpipes, jigs, etc.) created by Eric Foxley and posted on Eric Foxley’s Music Database.

Check out the other sample songs here.

Listening to the generated songs, I would say the AI is already able to generate well-composed songs in which the melody and accompaniment are synced on the beat, while accommodating multiple input songs. This opens up more possibilities for machines and human composers to work side by side to create music.



Source: Deep Learning on Medium