20 Questions To Test Your Skills In Transfer Learning For NLP

Original article was published on Deep Learning on Medium

What is the current state of pre-trained models (PTMs)?


Which tasks are used for training PTMs?


What is the current state of PTMs on GLUE?


Does more data always lead to a better language model?

The T5 paper says no. Quality matters more than quantity.


What tokenisation method seems best for training language models?

This paper says that the newer Unigram LM method is better than BPE and WordPiece.
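A minimal sketch of how unigram LM tokenisation works: given per-piece log-probabilities (the vocabulary and scores below are invented for illustration), a Viterbi-style dynamic program picks the segmentation of a word with the highest total log-probability.

```python
import math

# Toy unigram LM vocabulary: piece -> log-probability (invented numbers).
VOCAB_LOGP = {
    "un": -2.0, "related": -3.0, "rel": -4.0, "ated": -4.5,
    "u": -6.0, "n": -6.0, "r": -6.0, "e": -6.0, "l": -6.0,
    "a": -6.0, "t": -6.0, "d": -6.0,
}

def unigram_tokenize(word):
    n = len(word)
    # best[i] = (score, tokens) for the best segmentation of word[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in VOCAB_LOGP and best[start][0] > -math.inf:
                score = best[start][0] + VOCAB_LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]

print(unigram_tokenize("unrelated"))  # → ['un', 'related']
```

In a real tokenizer (e.g. SentencePiece) the piece probabilities are learned by EM over a corpus; the segmentation step is the same idea.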

Which task is best for training a language model?

The current best approach is ELECTRA's: replace input tokens with plausible alternatives sampled from a small generator, then train a discriminator to predict which tokens were corrupted.
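A toy sketch of how ELECTRA-style replaced-token-detection training data could be constructed. A real setup samples replacements from a small masked-LM generator; here a fixed candidate list (an assumption for self-containment) stands in for the generator.

```python
import random

def corrupt(tokens, mask_prob=0.3, seed=0):
    """Return a corrupted token sequence plus binary 'was replaced' labels."""
    rng = random.Random(seed)
    # Stand-in for a generator's proposals (hypothetical vocabulary).
    fake_vocab = ["the", "dog", "ran", "blue", "apple"]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            replacement = rng.choice(fake_vocab)
            corrupted.append(replacement)
            # Label 1 ("replaced") only if the sampled token actually differs,
            # mirroring ELECTRA's treatment of generator samples that match.
            labels.append(int(replacement != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

The discriminator is then trained to predict the `labels` from the `corrupted` sequence, a denser signal than masked-LM loss since every position contributes.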


The T5 paper also finds that dropping spans with a mean length of 3 works well.
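A rough sketch of T5-style span corruption, assuming a fixed span length of 3 (the paper uses a mean of 3) and hand-chosen span positions: dropped spans are replaced by sentinel tokens in the input, and the target reconstructs them.

```python
def span_corrupt(tokens, span_starts, span_len=3):
    """Replace spans with sentinels; target lists each sentinel's contents."""
    inputs, targets = [], []
    sentinel, i = 0, 0
    starts = set(span_starts)
    while i < len(tokens):
        if i in starts:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span_len])  # the dropped span
            sentinel += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks, span_starts=[1])
print(inp)  # → ['the', '<extra_id_0>', 'jumps', 'over', 'the', 'lazy', 'dog']
print(tgt)  # → ['<extra_id_0>', 'quick', 'brown', 'fox']
```

In the actual objective, span positions and lengths are sampled randomly; only the sentinel convention is shown here.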


Is gradual unfreezing needed when fine-tuning a transformer on a task?

The T5 paper says no.
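For context, gradual unfreezing (the ULMFiT-style schedule that T5 found unnecessary) can be sketched as a simple top-down schedule over layer groups; the function below is a hypothetical illustration, not from any paper.

```python
def trainable_layers(num_layers, epoch):
    """Gradual unfreezing: at epoch e, only the top (e + 1) layers train."""
    cutoff = max(num_layers - (epoch + 1), 0)
    # True means the layer's parameters receive gradient updates.
    return [layer >= cutoff for layer in range(num_layers)]

print(trainable_layers(4, 0))  # → [False, False, False, True]
print(trainable_layers(4, 2))  # → [False, True, True, True]
```

In a framework like PyTorch, each boolean would be applied as `requires_grad` on that layer group's parameters. T5's finding is that simply fine-tuning all layers at once works as well or better.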


What would you change to get a better language model if you have a fixed training budget?

The T5 paper suggests increasing both model size and the number of training steps.


Which model would you use if your sequence is longer than 512 tokens?

Transformer-XL or Longformer

How does the processing time of a transformer scale with sequence length?

Quadratically: self-attention compares every token with every other token.
How can we bring down the processing time of transformers on long documents, given that self-attention is quadratic in sequence length?

Longformer uses an attention mechanism that scales linearly with sequence length.
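A toy sketch of a Longformer-style attention pattern, assuming a local sliding window plus a handful of global tokens: the number of allowed query–key pairs grows linearly with sequence length rather than quadratically.

```python
def longformer_mask(seq_len, window=2, global_ids=(0,)):
    """Allow attention within a local window, plus full attention
    to and from designated global tokens (e.g. [CLS])."""
    g = set(global_ids)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window
            mask[i][j] = local or i in g or j in g
    return mask

m = longformer_mask(6, window=1, global_ids=(0,))
# Token 3 sees its neighbours (2, 3, 4) and the global token 0:
print([j for j in range(6) if m[3][j]])  # → [0, 2, 3, 4]
```

Per token the cost is O(window + number of global tokens), so total attention cost is linear in sequence length; the real model also implements this with banded matrix kernels rather than a dense mask.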


Longformer can be really good for encoding long documents for semantic search. The table below summarises the work done so far.


Does BERT perform well because of its attention mechanism?

The paper Attention is not Explanation argues that attention weights do not correlate reliably with model outputs, so we cannot conclude that the model performs better because of its attention mechanism.

Will the performance of BERT fall drastically if we turn off a head?

No — as per the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we turn off a layer?

No — as per the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we randomly initialise it?

No — as per the paper Revealing the Dark Secrets of BERT.

Do we really need model compression?

Maybe not! Notes from this amazing article.

“Model compression techniques give us a hint about how to train appropriately-parameterized models by elucidating the types of solutions over-parameterized models tend to converge to. There are many types of model compression, and each one exploits a different type of “simplicity” that tends to be found in trained neural networks:”

  • Many weights are close to zero (Pruning)
  • Weight matrices are low rank (Weight Factorization)
  • Weights can be represented with only a few bits (Quantization)
  • Layers typically learn similar functions (Weight Sharing)
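Two of these, pruning and quantization, can be illustrated with a toy sketch over a plain list of weights (numbers invented for illustration).

```python
def prune(weights, threshold=0.05):
    """Magnitude pruning: zero out weights below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize(weights, bits=4):
    """Uniform quantization: snap each weight to the nearest of
    2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

w = [0.80, -0.01, 0.30, 0.02, -0.55]
print(prune(w))  # → [0.8, 0.0, 0.3, 0.0, -0.55]
```

The pruned list is mostly unchanged because only two weights were near zero; that sparsity is exactly the "simplicity" the quoted passage says trained networks tend to exhibit.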

Can we steal a model if it is exposed as an API?

Yes, we can → explained in this mind-blowing post.

What is the current state of distillation?