Original article can be found here (source): Deep Learning on Medium
Imitation learning: Note to note
For a neural network to improvise music, it needs to capture motifs and musical ideas in addition to comprehending music theory. This means that the neural network must be capable of capturing long-term characteristics of the musical dataset. Additionally, as any human jazz player knows, the neural network has to be aware of the harmonic context at each moment, i.e., the chord being played by the rhythm section.
As music and language share many of the same qualities, we initially based BebopNet on a language-modeling neural network that can generate linguistic sentences. We trained BebopNet, a long short-term memory (LSTM) model, to predict the next note to be played, based on the previous notes.
The advantage of using recurrent neural networks is their ability to capture repetitions and long-term patterns. As attention-based models became prevalent in language modeling, we replaced our neural network with Transformer-XL.
As done in Markov models, we train our network by repeatedly asking it to predict the next note following a sequence of notes and harmony. The method we use is very similar to the one described in Andrej Karpathy’s famous blog post.
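To make the training setup concrete, here is a minimal sketch of how a solo could be cut into (context, next note) examples. The `Token` layout, the `context_len` parameter, and the function name are illustrative assumptions, not the actual BebopNet data pipeline.

```python
from typing import List, Sequence, Tuple

# Hypothetical token layout: (pitch, duration in beats, chord symbol).
Token = Tuple[int, float, str]


def next_note_pairs(
    solo: Sequence[Token], context_len: int
) -> List[Tuple[List[Token], Token]]:
    """Slide over a solo and emit (previous notes, note to predict) pairs."""
    pairs = []
    for i in range(1, len(solo)):
        start = max(0, i - context_len)  # keep at most context_len notes
        pairs.append((list(solo[start:i]), solo[i]))
    return pairs


solo = [(60, 0.5, "Cmaj7"), (62, 0.5, "Cmaj7"), (64, 1.0, "Dm7"), (65, 0.5, "G7")]
for context, target in next_note_pairs(solo, context_len=2):
    print(context, "->", target)
```

Each pair asks the same question the network is trained on: given these notes and their chords, which note comes next?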
However, unlike neural networks for natural language processing, where the output is a single probability vector for the next word, our output consists of two probability vectors: one for the pitch and one for the duration of the next note.
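As a rough sketch of what two output vectors mean in practice, the snippet below turns two separate sets of scores into two probability vectors and scores a prediction against the true note. Summing the two negative log-likelihoods is our assumption for illustration; the text above does not say how the two outputs are combined during training.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into a probability vector that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_head_loss(
    pitch_logits: List[float],
    duration_logits: List[float],
    true_pitch: int,
    true_duration: int,
) -> float:
    """Assumed loss: sum of the pitch and duration negative log-likelihoods."""
    p_pitch = softmax(pitch_logits)
    p_duration = softmax(duration_logits)
    return -math.log(p_pitch[true_pitch]) - math.log(p_duration[true_duration])


# Toy scores over 3 pitch classes and 2 duration classes.
print(two_head_loss([2.0, 0.5, 0.1], [0.3, 1.2], true_pitch=0, true_duration=1))
```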
After training, when using the model to generate solos, we have two options at each step: be greedy and choose the most probable note, or treat the output as a probability distribution and sample the next note. The latter ensures variability and allows us to generate different improvisations.
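The two decoding options can be sketched in a few lines. This example assumes the pitch and duration are picked independently from their own probability vectors, which is a simplification on our part; the function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def pick(probs: List[float], greedy: bool, rng: Optional[random.Random] = None) -> int:
    """Return an index: the argmax if greedy, otherwise a weighted random draw."""
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def next_note(
    pitch_probs: List[float],
    duration_probs: List[float],
    greedy: bool = False,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Choose (pitch index, duration index) for the next note."""
    return pick(pitch_probs, greedy, rng), pick(duration_probs, greedy, rng)


pitch_probs = [0.1, 0.7, 0.2]
duration_probs = [0.6, 0.4]
print(next_note(pitch_probs, duration_probs, greedy=True))   # always the same note
print(next_note(pitch_probs, duration_probs, greedy=False))  # varies between runs
```

Greedy decoding returns the same note every time for the same input, while sampling can return a different note on each call, which is where the variety between generated solos comes from.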
Our dataset, consisting of jazz solo transcriptions purchased as XML files from saxsolos.com, comprises jazz improvisations by:
– Charlie Parker (1920–1955)
– Cannonball Adderley (1928–1975)
– Sonny Stitt (1924–1982)
– Phil Woods (1931–2015)
– Sonny Rollins (1930–)
– Stan Getz (1927–1991)
– Gene Ammons (1925–1974)
– Dexter Gordon (1923–1990)
Choosing the correct input representation for the neural network is critical for its functionality and greatly affects performance. There is a complete spectrum of music representation methods, from raw audio waveforms, through MIDI, to sheet music, so choosing the one most suitable to the task is challenging.
BebopNet is based on a symbolic note representation that closely resembles the standard music notation system used to communicate music by musicians. Using this kind of representation may help to learn patterns and repetitions in beats and melodies. Each object represents a musical note with its pitch, duration, offset within a measure, and harmonic context. The harmonic context includes the four notes of the current chord. We use the above pitch representation method and concatenate all four pitches to represent a chord.
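One way to picture such a note object is the small data class below. The field names, the use of MIDI pitch numbers, and beats as the time unit are assumptions made for illustration; they are not taken from the BebopNet code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NoteEvent:
    pitch: int                        # assumed: MIDI pitch number
    duration: float                   # assumed: length in beats
    offset: float                     # assumed: position within the measure, in beats
    chord: Tuple[int, int, int, int]  # the four pitches of the current chord

    def as_features(self) -> Tuple[float, ...]:
        """Concatenate the note fields with the four chord pitches."""
        return (self.pitch, self.duration, self.offset, *self.chord)


# A G played for half a beat on beat 2 over a Cmaj7 chord (C, E, G, B).
note = NoteEvent(pitch=67, duration=0.5, offset=1.0, chord=(60, 64, 67, 71))
print(note.as_features())
```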
Having created a model, are our improvisations any good artistically?
One main success criterion of any jazz improvisation, as of any art, is creativity. How can one assess the creativity level of jazz solos? Well, defining creativity has been one of humankind’s long-debated conundrums. Far be it from us to wade into that discussion, so we decided to tackle a very modest but quantifiable aspect of creativity and assess the originality of jazz solos by their “plagiarism level”.
Using this notion of originality, we can observe several interesting facts. For example, among the Bebop giants, Sonny Stitt was the “king” of the copycats (pun intended). He copied numerous (short) phrases from Stan Getz and Sonny Rollins. Another intriguing fact: surprisingly, according to several plagiarism metrics, our trained model appears to be as original as any professional jazz player.
The measurement we define is the number of notes in the largest common subsequence. As a baseline, we look for this characteristic in our dataset: How much do jazz giants “copy” each other’s musical sentences?
Note that we defined this comparison to be invariant to pitch shift: two identical sequences will contain the same intervals between pitches and the same durations.
Comparing a pool of jazz solos generated by our jazz improvisation learning model with the dataset results in an average largest common subsequence of 4.4 notes, which means that we do not differ much from any jazz giant vis-à-vis this characteristic.
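The measurement can be sketched as follows. We assume here that a common sequence means a contiguous run of notes, and that a note is a hypothetical `(pitch, duration)` pair; two runs match when their durations are equal and the pitch intervals between consecutive notes are equal, which makes the comparison invariant to pitch shift.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # hypothetical layout: (pitch, duration in beats)


def longest_common_run(a: List[Note], b: List[Note]) -> int:
    """Length, in notes, of the longest contiguous run shared by a and b,
    ignoring transposition (same intervals, same durations)."""
    best = 0
    prev = [0] * len(b)  # prev[j]: run length ending at a[i-1], b[j]
    for i, (pitch_a, dur_a) in enumerate(a):
        curr = [0] * len(b)
        for j, (pitch_b, dur_b) in enumerate(b):
            if dur_a != dur_b:
                continue  # durations differ: no run ends here
            length = 1  # a single note with equal duration always matches
            if i > 0 and j > 0 and prev[j - 1] > 0:
                same_interval = pitch_a - a[i - 1][0] == pitch_b - b[j - 1][0]
                if same_interval:
                    length = prev[j - 1] + 1  # extend the previous run
            curr[j] = length
            best = max(best, length)
        prev = curr
    return best


phrase = [(60, 0.5), (62, 0.5), (64, 1.0), (65, 0.5)]
transposed = [(pitch + 5, dur) for pitch, dur in phrase]
print(longest_common_run(phrase, transposed))
```

Because only intervals and durations are compared, a phrase and its transposition to another key count as a full match.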
But let’s also look at shorter sequences. To do this, we calculate the percentage of sequences of length n played by an artist that also appear in the rest of the dataset. A sequence of length n is known in language modeling as an n-gram.
As expected, most of the sequences of length 1 can also be found somewhere in the dataset, and the larger n gets, the smaller the percentage of shared sequences becomes. For any n, the percentage for the imitation learning model does not exceed the largest percentage found among the artists.
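A minimal sketch of this n-gram comparison is shown below. It assumes notes are hypothetical `(pitch, duration)` pairs and makes each n-gram transposition-invariant by expressing pitches relative to the first note of the window; the function names are ours.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # hypothetical layout: (pitch, duration in beats)


def invariant_ngrams(notes: List[Note], n: int) -> List[tuple]:
    """All windows of n notes, with pitches relative to the window's first note."""
    grams = []
    for i in range(len(notes) - n + 1):
        window = notes[i:i + n]
        base = window[0][0]
        grams.append(tuple((pitch - base, dur) for pitch, dur in window))
    return grams


def overlap_percentage(solo: List[Note], others: List[List[Note]], n: int) -> float:
    """Percentage of the solo's n-grams that also occur in the other solos."""
    reference = {g for other in others for g in invariant_ngrams(other, n)}
    grams = invariant_ngrams(solo, n)
    if not grams:
        return 0.0
    return 100.0 * sum(g in reference for g in grams) / len(grams)


solo = [(60, 0.5), (62, 0.5), (64, 1.0)]
others = [[(65, 0.5), (67, 0.5), (70, 1.0)]]
for n in (1, 2, 3):
    print(n, overlap_percentage(solo, others, n))
```

In this toy example the overlap drops as n grows, which mirrors the trend described above.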
We calculated the area under the curve to achieve a normalized measurement of plagiarism:
– Cannonball 0.704
– Gordon 0.746
– Getz 0.745
– Parker 0.693
– Rollins 0.718
– Stitt 0.680
– Woods 0.714
– Ammons 0.718
– Our model 0.713
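The text does not spell out how the area was computed or normalized. One plausible reading, shown here purely as an assumption, is a trapezoidal area under the curve of shared fraction versus n, divided by the width of the n range so that the result lies between 0 and 1. The curve values below are toy numbers, not the measurements behind the list above.

```python
from typing import Sequence


def normalized_auc(ns: Sequence[int], fractions: Sequence[float]) -> float:
    """Trapezoidal area under (n, fraction) points, divided by the n range.

    Assumes ns is strictly increasing and fractions are in [0, 1].
    """
    if len(ns) != len(fractions) or len(ns) < 2:
        raise ValueError("need at least two matching (n, fraction) points")
    area = 0.0
    for k in range(1, len(ns)):
        width = ns[k] - ns[k - 1]
        area += width * (fractions[k] + fractions[k - 1]) / 2
    return area / (ns[-1] - ns[0])


# Toy curve: every 1-gram is shared, no 4-gram is shared.
print(normalized_auc([1, 2, 3, 4], [1.0, 0.5, 0.25, 0.0]))
```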