BERT, let's make it lighter

Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing developed by Google [1]. In very simple terms, it is a language model pre-trained on Wikipedia (2.5B words) + BookCorpus (800M words) [2]. BERT itself is a deep bidirectional Transformer: instead of reading a sentence only left to right or right to left, it looks at the context on BOTH sides of every word, taking in almost all of the words at once.
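To make that concrete, here is a minimal sketch of loading a pre-trained BERT and encoding one sentence. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is named in the post.

```python
# Minimal sketch (assumed setup: Hugging Face `transformers` + `bert-base-uncased`):
# load a pre-trained BERT and encode a single sentence.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT reads the whole sentence at once.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768) for bert-base
```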

B: Bidirectional

ER: Encoder Representations

T: Transformers

The issue with BERT is that it is very time and resource consuming. Building a BERT model involves two steps: pre-training and fine-tuning. If you start from a publicly released pre-trained model, the overall training time is much shorter. But if you pre-train the model on your own knowledge domain, it can take a lot of time, possibly days or weeks, and you will need GPUs for that.
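As a rough illustration of the cheaper path, the sketch below fine-tunes a pre-trained checkpoint on a tiny toy classification task. The library (Hugging Face transformers), the placeholder texts and labels, and the hyperparameters are all illustrative assumptions, not details from the post.

```python
# Sketch: fine-tune a pre-trained BERT checkpoint instead of pre-training from scratch.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible movie"]   # placeholder labelled data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                          # a few passes, not days of pre-training
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"final loss: {loss.item():.3f}")
```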

A few ways to make a lighter version of BERT [4] are:

  1. Use TensorFlow Lite [3] (a quantization sketch follows this list)
  2. Compression during training vs. after training (e.g. quantization-aware training vs. post-training quantization)
  3. Pruning: removing individual neurons and weights, or entire weight matrices (a pruning sketch follows this list)
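For option 1, the sketch below converts a TensorFlow SavedModel of BERT to TensorFlow Lite with post-training quantization, which is the approach the Rasa post [4] benchmarks. The SavedModel path is a placeholder.

```python
# Sketch of option 1: convert a BERT SavedModel to TensorFlow Lite with
# post-training weight quantization. "path/to/bert_saved_model" is a placeholder.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/bert_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable weight quantization
tflite_model = converter.convert()

with open("bert_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```

For option 3, here is one simple way to prune weights: magnitude pruning of a single feed-forward weight matrix with torch.nn.utils.prune. The 30% sparsity level is arbitrary, and looping over all of BERT's layers is left out for brevity.

```python
# Sketch of option 3: magnitude pruning of one of BERT's dense weight matrices.
import torch
from torch.nn.utils import prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
layer = model.encoder.layer[0].intermediate.dense        # one dense weight matrix

prune.l1_unstructured(layer, name="weight", amount=0.3)  # zero the 30% smallest weights
prune.remove(layer, "weight")                            # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")
```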

More ideas are welcome, please!

References

[1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[2] Jacob Devlin, BERT slides from the Stanford NLP seminar: https://nlp.stanford.edu/seminar/details/jdevlin.pdf

[3] TensorFlow Lite: https://www.tensorflow.org/lite

[4] Rasa blog, "Compressing BERT for faster prediction": https://blog.rasa.com/compressing-bert-for-faster-prediction-2/#quantizing-with-tflite-the-results