Knowledge Distillation

How to use your Kaggle-winning ensemble to create a lean production model

In the paper "Distilling the Knowledge in a Neural Network", Geoffrey Hinton et al. describe how this can be done, which lets you largely circumvent the problem that the best-performing models are usually too cumbersome to run in production.

For more depth and examples, here is a screencast I made: https://www.youtube.com/watch?v=lSjBc1wSJMI&feature=youtu.be.

How do you create the lean production model? Train a single "distillation model" to make the same predictions as your cumbersome ensemble.
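
Here is a minimal sketch of that idea, assuming PyTorch; `ensemble_models`, `student`, `x`, and `optimizer` are hypothetical placeholders for your own models, data batch, and optimizer, not anything prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def soft_targets(ensemble_models, x):
    # Average the ensemble members' predicted probabilities for a batch x.
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=1) for m in ensemble_models]
    return torch.stack(probs).mean(dim=0)

def train_step(student, ensemble_models, x, optimizer):
    # One training step: push the student's predictions toward the ensemble's.
    optimizer.zero_grad()
    targets = soft_targets(ensemble_models, x)           # what the ensemble predicts
    log_probs = F.log_softmax(student(x), dim=1)         # what the student predicts
    loss = F.kl_div(log_probs, targets, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```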

Since overfitting was already dealt with when training the cumbersome model, the distillation model learns from its already-regularized predictions rather than from the raw labels, so you don't need to worry much about overfitting when training it.

Hinton et al.'s main contribution is to train the "distillation model" on softened target probabilities (generated by the cumbersome model) by "raising the temperature" used in the softmax calculations of both models during training. This transfers more information per training example from the cumbersome model to the distillation model, without transferring too much noise from the very unlikely class probabilities.
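
A sketch of that temperature-softened loss, again assuming PyTorch: `teacher_logits` and `student_logits` stand for the raw (pre-softmax) outputs of the cumbersome and distillation models, and the values of `T` and `alpha` are illustrative hyperparameters, not values from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: both softmaxes use the same raised temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # Gradients through the softened softmax scale as 1/T^2, so the paper
    # multiplies the soft-target term by T^2 to keep its magnitude comparable.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Optional hard-label term at temperature 1, which Hinton et al. also use.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

At prediction time the distillation model runs its softmax at temperature 1 again; the raised temperature is only used while transferring knowledge.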
