Original article was published by Ran (Reine) on Artificial Intelligence on Medium
In a nutshell, this project (OpenAI's Jukebox) uses neural networks to generate music. Given a genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch.
Automated music generation is not exactly new technology: earlier approaches include generating music symbolically. However, symbolic generators often cannot capture essential musical elements like human voices, subtle timbres, dynamics, and expressiveness. I spent some time listening to the tracks on their “Jukebox Sample Explorer” page and personally felt that it is still not at the level where listeners would be unable to tell a real track from an auto-generated one. Still, I think it is definitely exciting to see where projects like this will take us and what they mean for the music industry in the near future.
Some key technical ideas/concepts:
Their approach is a two-step process: the first step compresses music into discrete codes, and the second step generates codes using transformers. They have a really nice diagram explaining this process and I’ll include it here:
How the compression part works: they use a modified version of the Vector Quantised-Variational AutoEncoder (VQ-VAE-2), a generative model for discrete representation learning. With reference to Figure A above, the 44kHz raw audio is compressed 8x, 32x, and 128x, with a codebook size of 2048 for each level. If you go to their site and click on the sound icon to hear what each reconstructed audio sounds like, the right-most one sounds the noisiest, since it is compressed 128x and only the most essential features are retained (e.g. pitch, timbre, and volume).
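To make the compression numbers concrete, here is a minimal sketch in plain Python. It is not Jukebox's code: the exact sample rate of 44,100 Hz, the toy embedding width, and the helper names `codes_per_second` and `quantize` are my own assumptions for illustration. It shows how many discrete codes per second each compression level needs, and the nearest-neighbour codebook lookup that vector quantisation is built on.

```python
import random

SAMPLE_RATE_HZ = 44_100      # assumption: the article rounds this to "44kHz"
HOP_LENGTHS = (8, 32, 128)   # compression factor per level, from the article
CODEBOOK_SIZE = 2048         # codebook entries per level, from the article
EMBED_DIM = 4                # hypothetical toy width, chosen only for this sketch


def codes_per_second(sample_rate_hz, hop_length):
    """Number of discrete codes needed to represent one second of audio."""
    return sample_rate_hz / hop_length


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


# Random stand-ins for a learned codebook and for encoder outputs.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(CODEBOOK_SIZE)]
latents = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(10)]
codes = quantize(latents, codebook)

for hop in HOP_LENGTHS:
    rate = codes_per_second(SAMPLE_RATE_HZ, hop)
    print(f"{hop:>3}x compression -> {rate:.1f} codes per second")
print("first quantized codes:", codes[:5])
```

The 128x level needs far fewer codes per second than the 8x level, which is why it can only keep the coarsest features of the audio.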
How the generation part works: as mentioned in the VQ-VAE paper and the VQ-VAE-2 paper, a powerful autoregressive decoder is used (but in Jukebox’s algorithm, separate decoders are used and the input is reconstructed independently from the codes of each level, to maximize the use of the upper levels). This generative phase is essentially about (from their official site):
(training) the prior models whose goal is to learn the distribution of music codes encoded by VQ-VAE and to generate music in this compressed discrete space. […] The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality. […] Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
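The sampling order described in the quote (top level first, then the two upsamplers, then the decoder) can be sketched as follows. This is a toy illustration under loud assumptions: the three functions are random stand-ins I made up rather than trained transformers or the real VQ-VAE decoder, and the factor of 4 between levels is simply inferred from the 128x, 32x, and 8x compression ratios.

```python
import random

CODEBOOK_SIZE = 2048   # from the article
UPSAMPLE_FACTOR = 4    # inferred: 128x -> 32x -> 8x is a factor of 4 per step
BOTTOM_HOP = 8         # bottom level compresses raw audio 8x


def sample_top_prior(n_codes, rng):
    """Stand-in for the top-level prior: draw codes one at a time, left to right."""
    codes = []
    for _ in range(n_codes):
        # A real prior would condition on `codes` so far (plus genre, artist, lyrics).
        codes.append(rng.randrange(CODEBOOK_SIZE))
    return codes


def sample_upsampler(upper_codes, rng):
    """Stand-in for an upsampling prior: emit UPSAMPLE_FACTOR codes per upper code."""
    lower_codes = []
    for upper in upper_codes:
        for _ in range(UPSAMPLE_FACTOR):
            # Toy conditioning only: stay numerically close to the upper code.
            lower_codes.append((upper + rng.randrange(8)) % CODEBOOK_SIZE)
    return lower_codes


def decode(bottom_codes):
    """Stand-in for the VQ-VAE decoder: each code expands to BOTTOM_HOP samples."""
    return [
        code / CODEBOOK_SIZE * 2 - 1   # fake waveform value in [-1, 1)
        for code in bottom_codes
        for _ in range(BOTTOM_HOP)
    ]


def generate(n_top_codes, seed=0):
    """Top-level codes -> middle codes -> bottom codes -> raw audio samples."""
    rng = random.Random(seed)
    top = sample_top_prior(n_top_codes, rng)
    middle = sample_upsampler(top, rng)
    bottom = sample_upsampler(middle, rng)
    return decode(bottom)


audio = generate(n_top_codes=16)
print(len(audio), "samples generated from 16 top-level codes")
```

Each top-level code ends up covering 128 audio samples, so the top prior only has to model a short sequence to describe long-range structure, while the upsamplers fill in the local detail.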
Read more about it or try out the demo here.