20 Questions To Test Your Skills In Transfer Learning For NLP


What is the current state of pre-trained models (PTMs)?

https://arxiv.org/pdf/2003.08271.pdf

Which tasks are used for training PTMs?

https://arxiv.org/pdf/2003.08271.pdf

What is the current state of PTMs on GLUE?

https://arxiv.org/pdf/2003.08271.pdf

Does more data always lead to a better language model?

The T5 paper says no. Quality matters more than quantity.

https://arxiv.org/pdf/1910.10683.pdf

Which tokenisation method seems best for training language models?

This paper says that a newer method, unigram LM, is better than BPE and WordPiece.
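
As a concrete illustration, here is a minimal sketch of training a unigram LM tokenizer with the sentencepiece library; the corpus path, vocabulary size and model prefix are placeholder assumptions.

    import sentencepiece as spm

    # Train a unigram LM tokenizer on a plain-text corpus (one sentence per line).
    # "corpus.txt", vocab_size and model_prefix are placeholder values.
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="unigram_tok",
        vocab_size=8000,
        model_type="unigram",  # switch to "bpe" to compare against BPE
    )

    # Load the trained model and tokenise a sample sentence.
    sp = spm.SentencePieceProcessor(model_file="unigram_tok.model")
    print(sp.encode("Transfer learning for NLP is fun.", out_type=str))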

Which task is best for training a language model?

The current best approach is ELECTRA's → replace some input tokens with plausible alternatives sampled from a small generator, then train a discriminator to predict which tokens were corrupted.

https://arxiv.org/pdf/2003.10555.pdf
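
For intuition, here is a minimal sketch of the discriminator side of this objective using the Hugging Face ELECTRA checkpoint; the checkpoint name and the hand-corrupted sentence are illustrative assumptions, not the paper's training pipeline.

    import torch
    from transformers import ElectraTokenizer, ElectraForPreTraining

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    # "the chef cooked the soup" with "cooked" replaced by "ate" by hand
    # (during training, a small generator MLM proposes these replacements).
    corrupted = "the chef ate the soup"

    inputs = tokenizer(corrupted, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # one score per token: replaced vs. original

    # Tokens with a positive score are predicted as replaced.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, score in zip(tokens, logits[0]):
        print(f"{token:>10s}  replaced={score.item() > 0}")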

The T5 paper also finds that corrupting (dropping) spans with an average length of 3 works well.

https://arxiv.org/pdf/1910.10683.pdf
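
A rough sketch of that span-corruption idea in plain Python; the sentinel-token format and span-sampling details here are simplified assumptions, not T5's exact preprocessing.

    import random

    def corrupt_spans(tokens, span_length=3, corruption_rate=0.15, seed=0):
        """Replace spans of `span_length` tokens with sentinel markers,
        returning a (corrupted input, target) pair in a T5-like format."""
        rng = random.Random(seed)
        n_spans = max(1, int(len(tokens) * corruption_rate / span_length))
        starts = sorted(rng.sample(range(0, len(tokens) - span_length), n_spans))

        corrupted, target, prev_end = [], [], 0
        for i, start in enumerate(starts):
            start = max(start, prev_end)  # keep spans non-overlapping (simplified)
            sentinel = f"<extra_id_{i}>"
            corrupted += tokens[prev_end:start] + [sentinel]
            target += [sentinel] + tokens[start:start + span_length]
            prev_end = start + span_length
        corrupted += tokens[prev_end:]
        return corrupted, target

    tokens = "thank you for inviting me to your party last week".split()
    inp, tgt = corrupt_spans(tokens)
    print(inp)  # e.g. ['thank', 'you', '<extra_id_0>', 'your', ...]
    print(tgt)  # e.g. ['<extra_id_0>', 'for', 'inviting', 'me', ...]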

Is gradual unfreezing needed when fine-tuning a transformer on a downstream task?

The T5 paper says no.

https://arxiv.org/pdf/1910.10683.pdf

What would you change to get a better language model if you have a fixed training budget?

The T5 paper suggests increasing both the model size and the number of training steps.

https://arxiv.org/pdf/1910.10683.pdf

Which model would you use if your sequence is longer than 512 tokens?

Transformer-XL or Longformer

How does the processing time of transformer scale with sequence length?

Quadratic in the sequence length, since self-attention compares every token with every other token.
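
You can see the quadratic cost directly in the shape of the attention score matrix; a tiny PyTorch sketch (the dimensions are arbitrary):

    import torch

    d_model = 64
    for seq_len in (128, 512, 2048):
        q = torch.randn(seq_len, d_model)
        k = torch.randn(seq_len, d_model)
        scores = q @ k.T / d_model ** 0.5  # attention scores: every token vs. every token
        print(seq_len, tuple(scores.shape), scores.numel())  # memory/compute grows as n^2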

How can we bring down the processing time of transformers for long documents, given that it scales quadratically with sequence length?

Longformer uses an attention mechanism that scales linearly with sequence length.

https://arxiv.org/pdf/2004.05150.pdf

Longformer can be really good for encoding long documents for semantic search. The paper includes a table summarising prior work on long-document transformers.

https://arxiv.org/pdf/2004.05150.pdf
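
A minimal sketch of encoding a long document with the Hugging Face Longformer checkpoint; the checkpoint name and the choice of global attention only on the first ([CLS]) token are assumptions for illustration.

    import torch
    from transformers import LongformerTokenizer, LongformerModel

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    long_document = " ".join(["Transfer learning for NLP."] * 400)  # well past 512 tokens
    inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)

    # Local sliding-window attention everywhere, global attention on the [CLS] token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    with torch.no_grad():
        outputs = model(**inputs, global_attention_mask=global_attention_mask)

    doc_embedding = outputs.last_hidden_state[:, 0]  # crude document vector for search
    print(doc_embedding.shape)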

Does BERT perform so well because of its attention mechanism?

The paper Attention is not Explanation argues that attention weights are only weakly correlated with other measures of feature importance, so we cannot conclude that the model performs better because of its attention mechanism.

Will the performance of BERT fall drastically if we turn off a head?

No — as per the paper Revealing the Dark Secrets of BERT.
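
You can try this yourself with the head_mask argument in Hugging Face transformers; a minimal sketch that silences one arbitrarily chosen head in bert-base-uncased:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

    # head_mask has shape (num_layers, num_heads); 1 keeps a head, 0 switches it off.
    head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
    head_mask[3, 5] = 0  # arbitrary choice of layer 3, head 5

    with torch.no_grad():
        full = model(**inputs).last_hidden_state
        ablated = model(**inputs, head_mask=head_mask).last_hidden_state

    print((full - ablated).abs().mean())  # typically a small change in the representations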

Will the performance of BERT fall drastically if we turn off a layer?

No — as per the paper Revealing the Dark Secrets of BERT.
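
A rough way to approximate this ablation in transformers is to drop encoder layers before fine-tuning; this is a simplification of the paper's setup, and keeping 10 of 12 layers is an arbitrary choice.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Keep only the first 10 of the 12 encoder layers (arbitrary choice for illustration).
    model.encoder.layer = torch.nn.ModuleList(list(model.encoder.layer)[:10])
    model.config.num_hidden_layers = 10

    inputs = tokenizer("Dropping a layer rarely breaks BERT.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    print(out.last_hidden_state.shape)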

Will the performance of BERT fall drastically if we randomly initialise it?

No — as per the paper Revealing the Dark Secrets of BERT.
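
To compare against a randomly initialised BERT, instantiate the model from its config instead of from pretrained weights; a short sketch, with the downstream fine-tuning loop omitted.

    from transformers import BertConfig, BertForSequenceClassification

    # Pretrained weights:
    pretrained = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Same architecture, randomly initialised weights:
    config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
    random_init = BertForSequenceClassification(config)

    # Fine-tune both on the same task and compare the scores.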

Do we really need model compression?

Maybe not! Notes from this amazing article.

“Model compression techniques give us a hint about how to train appropriately-parameterized models by elucidating the types of solutions over-parameterized models tend to converge to. There are many types of model compression, and each one exploits a different type of “simplicity” that tends to be found in trained neural networks:”

  • Many weights are close to zero (Pruning)
  • Weight matrices are low rank (Weight Factorization)
  • Weights can be represented with only a few bits (Quantization)
  • Layers typically learn similar functions (Weight Sharing)
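
Two of these ideas are essentially one-liners in PyTorch; here is a sketch of dynamic quantization and magnitude pruning on a toy feed-forward block (the module and the 30% pruning amount are arbitrary choices for illustration).

    import torch
    import torch.nn.utils.prune as prune

    # A toy stand-in for a transformer feed-forward block.
    model = torch.nn.Sequential(
        torch.nn.Linear(768, 3072),
        torch.nn.GELU(),
        torch.nn.Linear(3072, 768),
    )

    # Quantization: store Linear weights as 8-bit integers for inference.
    quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

    # Pruning: zero out the 30% smallest-magnitude weights of the first layer.
    prune.l1_unstructured(model[0], name="weight", amount=0.3)

    print(quantized)
    print((model[0].weight == 0).float().mean())  # fraction of pruned weights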

Can we steal a model if it is exposed as an API?

Yes, we can → explained in this mind-blowing post.

What is the current state of distillation?