Understanding Knowledge Distillation in Neural Sequence Generation — part 1

The speaker works mostly on NLP: he is a natural language processing researcher who interned at Microsoft and now works at Facebook.

His ongoing project is on natural language generation, generating text in many different ways, for example by inserting words rather than decoding in the usual left-to-right order, which is not the typical approach. (He also focuses on multi-modal data such as text and audio.)

Knowledge distillation itself is not new: a smaller student model is trained using a teacher model, and instead of the usual hard labels the student learns from the teacher's soft label distribution. (This is the interesting part.)

There is a temperature parameter that controls how soft those labels are, and for sequence models it seems we also need some search method, such as beam search, to obtain the teacher's outputs.
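As a minimal sketch of what that word-level distillation looks like (assuming PyTorch; the tensor names and the alpha weighting are illustrative, not taken from the talk):

```python
import torch
import torch.nn.functional as F

def word_level_distillation_loss(student_logits, teacher_logits, targets,
                                 T=2.0, alpha=0.5):
    """Mix the teacher's softened distribution with the usual hard-label loss.

    student_logits, teacher_logits: (batch, vocab_size) unnormalized scores.
    targets: (batch,) ground-truth class indices.
    """
    # Soft-label term: KL between the temperature-softened teacher and student.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard-label term
    # Hard-label term: ordinary cross-entropy against the references.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```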

So we use the teacher's labels to train a smaller model, and this actually works very well for model compression. The framework itself, however, says nothing specific about how to select the teacher and student models.

Can this work in sequence generation as well? It is not a classification problem, so he first had to explain what non-autoregressive translation (NAT) is. (Decoding step by step, autoregressively, is strong but slow.)

That step-by-step decoding is slow and hard to speed up; NAT predicts all the target tokens at the same time and is much faster, although its translation quality is worse.
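The difference between the two decoding styles can be sketched roughly as follows (the model object and its decode_step / decode_all methods are hypothetical placeholders, not a real API):

```python
def autoregressive_decode(model, src, max_len, bos_id, eos_id):
    # One token per step, each conditioned on everything generated so far:
    # high quality, but the loop cannot be parallelized over positions.
    tokens = [bos_id]
    for _ in range(max_len):
        next_token = model.decode_step(src, tokens)  # hypothetical method
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens[1:]

def non_autoregressive_decode(model, src, target_len):
    # All positions are predicted simultaneously in a single parallel pass,
    # which is much faster but typically gives worse translations.
    return model.decode_all(src, target_len)  # hypothetical method
```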

So in practice, no one actually uses NAT as it is; rather, some modifications are made within the framework. (Both the student and the teacher are trained on the same data; the only difference is the label information.)
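Read together with the distillation setup above, this presumably means the distilled training set is built by having the teacher re-label the same source sentences. A rough sketch of that step, with a placeholder decoding function rather than any specific library call:

```python
def build_distilled_dataset(source_sentences, teacher_decode, beam_size=5):
    """Replace each human reference with the teacher's best hypothesis.

    teacher_decode: a placeholder callable (src, beam_size) -> hypothesis,
    e.g. beam search over a trained autoregressive teacher model.
    """
    distilled = []
    for src in source_sentences:
        hypothesis = teacher_decode(src, beam_size)
        distilled.append((src, hypothesis))
    return distilled

# The NAT student is then trained on `distilled` exactly as it would be
# trained on the original parallel data; only the target side has changed.
```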

He first started with a toy dataset in which every source sentence is translated into three different languages. (At test time, the model chooses which language to translate into, which is an interesting and unusual approach.)

So the model does learn this multi-modal information, which is very good. (Multi-modality is an interesting problem for NLP: since there are multiple languages we can translate into, choosing which label to produce is a task in itself.)
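A sketch of how such a toy multi-target dataset could be assembled (the language codes and the dict-of-corpora layout are my own illustration, not details from the talk):

```python
import random

def build_multi_target_dataset(parallel_corpora):
    """parallel_corpora: dict mapping a target-language code, e.g. "de",
    to a list of (english_source, translation) pairs over the same sources.
    """
    dataset = []
    for lang, pairs in parallel_corpora.items():
        for src, tgt in pairs:
            # Tag each target with its language so the "mode" the model picked
            # can be identified later; the model is never told which to choose.
            dataset.append((src, tgt, lang))
    random.shuffle(dataset)
    return dataset
```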

Some mathematical measurements are then used to quantify how much knowledge has been distilled.
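One way such a measurement can be formalized (a sketch of a data-complexity metric in the spirit of what was described, not necessarily the exact definition used in the talk) is the conditional entropy of the targets given the sources, estimated with a simple word-alignment model:

```latex
% Complexity of a parallel dataset D, approximated with a word-by-word
% alignment model p_align; a(t) is the source position aligned to target
% position t. Distilled data should score lower (simpler) than real data.
C(\mathcal{D}) \;=\; H(\mathbf{y} \mid \mathbf{x})
  \;\approx\; -\frac{1}{|\mathcal{D}|}
  \sum_{(\mathbf{x},\mathbf{y}) \in \mathcal{D}} \sum_{t}
  \log p_{\text{align}}\!\bigl(y_t \,\big|\, x_{a(t)}\bigr)
```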

As the number of parameters increases, we get much better results. (When they measured the complexity of the data, the real data turned out to be much harder to learn than the labels output by the teacher model.)