What’s new in NLP for you!!

Original article can be found here (source): Deep Learning on Medium

So amid this deadly coronavirus pandemic, it's more important than ever to keep ourselves up to date with new developments in our field. In this article we'll look at the latest developments in the NLP space: ELECTRA from Google AI, Quant-Noise from Facebook AI, and OpenAI's Jukebox. Let's get a brief overview of each of them.


ELECTRA, a.k.a. Efficiently Learning an Encoder that Classifies Token Replacements Accurately, is a pretraining method that matches or outperforms state-of-the-art models on various downstream tasks with about a quarter of the compute. The most unique and fascinating thing about this model is the way it is trained. Let's understand how!!

Courtesy: More Efficient NLP Model Pre-training with ELECTRA by GoogleAI

Language models like GPT are trained to predict the next word from the previous words as they process text left-to-right. Masked language models such as BERT, RoBERTa, and T5 instead try to predict masked tokens in the input. The benefit of this technique is that it is bidirectional rather than unidirectional (the model can see both the left and the right context when predicting a masked word). But since only 15% of the tokens are masked during training, the bidirectional transformer learns from only a small fraction of each example, which hinders its learning efficiency.
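To make that 15% figure concrete, here is a minimal sketch of BERT-style masking (a simplification: real BERT masks 15% of tokens in expectation and sometimes keeps the original token or swaps in a random one; `mask_tokens` is just an illustrative helper):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking sketch: hide ~15% of the tokens and remember
    the originals -- these few positions are the only ones the masked
    language model gets a training signal from."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}   # what the model must predict
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
```

On this nine-token sentence only a single position is masked, so only one position contributes to the loss — that inefficiency is exactly what ELECTRA attacks.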

So what can we do to overcome this?? Use the magic spell "Wingardium Leviosa" (Harry Potter!! Just kidding). The answer is the GAN, a.k.a. the generative adversarial network. Believe me, I am not joking. The paper presenting ELECTRA is titled "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". If we recall GANs, they have a component named "the discriminator" whose sole purpose is to distinguish the generator's fake data from real data; the discriminator penalizes the generator for producing implausible results. Inspired by this, ELECTRA is trained to distinguish "real" from "fake" input tokens.
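The key change from masked language modeling can be sketched as a per-token binary classification: a small generator proposes replacements for the masked positions, and the discriminator must label every token as original or replaced — so every position carries a learning signal. A toy illustration, using the paper's "cooked" → "ate" example (`electra_labels` is a hypothetical helper, not from the paper's code):

```python
def electra_labels(original, corrupted):
    """Targets for ELECTRA's discriminator: 1 where the generator
    replaced the token, 0 where the token is original. Unlike masked
    LM, all positions contribute to the loss."""
    return [int(o != c) for o, c in zip(original, corrupted)]

original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator's plausible swap
labels = electra_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```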

Generator and Discriminator. Courtesy: More Efficient NLP Model Pre-training with ELECTRA by GoogleAI

Yeah, if you've read this far, you probably didn't see that coming. Transferring approaches from one area of deep learning to another seems like a tradition now. If you got interested enough to dive deeper, I advise you to go through the original research paper.


Ever wondered if it's possible to shrink the massive pre-trained models that keep growing day by day? Let's look at the chart to grasp the gravity of this problem.

This figure was adapted from an image published in DistilBERT.

Now you can see that model size is really a big issue. But we also do not want to compromise our state-of-the-art accuracy. Here is where Quant-Noise comes into play. It allows extreme compression of models while maintaining accuracy when they are deployed in practical applications (up to a 10x to 20x reduction in memory footprint, compared to roughly 4x for earlier approaches).

Hmm, that’s great! But how does it work??

So we use quantization to do this. And no, this is not just another piece of jargon I'm throwing at you 😅. Quantization is the process of converting a continuous range of values into a finite range of discrete values (wiki). Model weights are stored as floating-point numbers, and a 32-bit floating-point number (with a dynamic range on the order of 10⁻³⁸ to 10³⁸) requires more memory than a fixed-point number (e.g., an 8-bit integer between 0 and 255). This conversion from floating point to fixed point is quantization.
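As a rough illustration of the float-to-8-bit idea, here is a minimal sketch of uniform affine quantization (a generic scheme, not the product quantization Facebook actually uses in Quant-Noise):

```python
import numpy as np

def quantize_uint8(w):
    """Uniform affine quantization: map float weights onto 256 levels.
    Store (q, scale, lo); recover approximate weights as q * scale + lo."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0      # guard against constant weights
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scale, lo = quantize_uint8(w)
w_hat = dequantize(q, scale, lo)
# storage drops 4x (uint8 vs float32); reconstruction error is bounded
# by half a quantization step
assert q.nbytes == w.nbytes // 4
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```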

Well, the idea is great, but directly applying quantization to an already-trained model can significantly harm performance, because the model was not trained in that setting. To get around this, Quant-Noise quantizes only a random subset of the weights at training time, applying simulated quantization noise to them on each forward pass. This teaches the model to handle quantization while preserving the original accuracy.
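Here is a minimal sketch of that "noise a random subset of weights" idea, using simple scalar 8-bit fake-quantization in place of the product quantization used in the paper (in real training, gradients flow through the quantized weights via a straight-through estimator; the names below are illustrative):

```python
import numpy as np

def quant_noise(w, p=0.5, seed=0):
    """Quant-Noise-style training noise (sketch): fake-quantize a random
    fraction p of the weights on the forward pass and leave the rest in
    full precision, so the network learns to tolerate quantization."""
    rng = np.random.default_rng(seed)
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    quantized = np.round((w - lo) / scale) * scale + lo   # snapped to the 8-bit grid
    mask = rng.random(w.shape) < p                        # which weights get noise
    return np.where(mask, quantized, w)

w = np.random.default_rng(1).normal(size=1000).astype(np.float32)
noisy = quant_noise(w, p=0.5)
```

With `p` below 1, each forward pass sees a different noised subset, which is what lets the model keep learning with clean gradients while still adapting to the quantized setting.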

Original Model. Courtesy: Training with quantization noise for extreme model compression
Naive Quantization. Courtesy: Training with quantization noise for extreme model compression
Quant-Noise. Courtesy: Training with quantization noise for extreme model compression

Yeah, the results shown by Quant-Noise are pretty amazing!! Facebook AI used Quant-Noise on a RoBERTa base model and compressed it from 480 MB down to 14 MB while achieving similar accuracy. Facebook has open-sourced the code; for more info, check their blog.


OpenAI's Jukebox is a dream come true for many of us. Generating music with a neural net, in various genres and artist styles, really makes you wonder about the future of AI. Unlike normal text generation, generating music as a raw audio sequence is practically intractable due to the massive length of the sequences: a typical 4-minute song can contain about 10 million timesteps. This means the models have to deal with extremely long-range dependencies.
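The back-of-the-envelope arithmetic behind that number (assuming CD-quality audio at 44.1 kHz):

```python
# Why raw audio sequences are so long: CD-quality audio is sampled
# 44,100 times per second, so a typical 4-minute song is a sequence
# of roughly 10 million raw-audio timesteps.
sample_rate = 44_100            # samples per second (CD quality)
duration_s = 4 * 60             # 4 minutes, in seconds
timesteps = sample_rate * duration_s
print(timesteps)                # 10584000 -- about 10 million
```

Compare that with a few thousand tokens for a long text document, and the scale of the problem becomes obvious.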

One way of dealing with this is to use autoencoders that compress raw audio into a lower-dimensional space, generate audio in that compressed space, and then upsample back to the raw audio space. Jukebox uses a VQ-VAE to do this and generates music directly in the raw audio domain.
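The vector-quantization step at the core of a VQ-VAE can be sketched as a nearest-neighbour codebook lookup: each continuous latent vector is replaced by the index of its closest codebook entry, turning a long audio latent into a short sequence of discrete codes. A toy illustration with a random codebook (Jukebox's actual model is a hierarchical VQ-VAE with learned codebooks at several time resolutions):

```python
import numpy as np

def vq_encode(latents, codebook):
    """Map each continuous latent vector to the index of its nearest
    codebook entry (squared Euclidean distance)."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                      # 8 code vectors of dim 4
latents = codebook[[3, 0, 5]] + 0.01 * rng.normal(size=(3, 4))  # near-copies
codes = vq_encode(latents, codebook)
print(list(codes))  # [3, 0, 5]
```

The decoder then maps these discrete codes back to audio, which is what makes generating in the compressed space tractable.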

Courtesy: JukeBox

The actual process of music generation is quite interesting and rather complex; I advise you to take a look at the original blog post.

Well, COVID-19 has taken a huge toll on our daily lives, and for many of us on our livelihoods as well. But seeing the progress made by mankind even in these unfortunate times gives us hope for a bright future. I hope you loved reading this blog.