BERT, let's make it lighter
Bidirectional Encoder Representations from Transformers (BERT) is a natural language processing technique from Google. In a very simple explanation, it is a language model pre-trained on Wikipedia (2.5B words) plus BookCorpus (800M words). BERT is built on a deep bidirectional Transformer encoder: through contextual language modeling it sees the whole sentence on either side of a word, and it processes almost all of the words at once.
The issue with BERT is that it is too time- and resource-consuming. The two important steps in building BERT are pre-training and fine-tuning. If you start from a pre-trained model, the overall training time is much shorter. But if you pre-train the model on your own knowledge domain, it takes a lot of time, possibly days or weeks, and you will need GPUs for that.
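To make the pre-training vs. fine-tuning distinction concrete, here is a minimal fine-tuning sketch. It assumes the Hugging Face `transformers` library, PyTorch, and the `bert-base-uncased` checkpoint (none of which are named in this post); starting from that pre-trained checkpoint means you only pay for fine-tuning, not for the Wikipedia + BookCorpus pre-training.

```python
# Minimal fine-tuning sketch (assumes Hugging Face transformers + PyTorch).
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy batch; in practice this would come from your own domain data.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # forward pass; pre-trained weights are reused
loss.backward()                            # only the fine-tuning step runs here
optimizer.step()
```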
A few ways to make a lighter version of BERT are:
- use TensorFlow Lite (see the conversion sketch after this list)
- compression during training vs. after training
- pruning: removing neurons, individual weights, or entire weight matrices (see the pruning sketch after this list)
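For the TensorFlow Lite idea, here is a minimal sketch of converting a Keras BERT model and applying post-training dynamic-range quantization, which also illustrates the "compression after training" case. The `build_bert_classifier()` helper is hypothetical; any `tf.keras` model exported as a SavedModel would do.

```python
import tensorflow as tf

model = build_bert_classifier()       # hypothetical helper returning a tf.keras BERT model
model.save("bert_savedmodel")         # export as a TensorFlow SavedModel

converter = tf.lite.TFLiteConverter.from_saved_model("bert_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training (dynamic-range) quantization
tflite_model = converter.convert()

with open("bert_quantized.tflite", "wb") as f:
    f.write(tflite_model)             # int8 weights make the file roughly 4x smaller
```

Compression during training, such as quantization-aware training or knowledge distillation, requires changes to the training loop itself; the post-training path above leaves training untouched, at a possible cost in accuracy.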
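For the pruning idea, here is a minimal sketch using PyTorch's built-in `torch.nn.utils.prune` on a BERT classifier (the 30% sparsity level is just an example value). Note that unstructured pruning only zeroes weights; the model only gets smaller on disk if you then store it in a sparse or compressed format.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest absolute value (L1 magnitude).
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # bake the pruning mask into the weight tensor
```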
More ideas are welcome, please!
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).