Original article was published by Rohan Jagtap on Deep Learning on Medium
ALBERT: A Lite BERT
Understanding Transformer-Based Self-Supervised Architectures
BERT was a milestone in language-model pretraining, and the state of the art in NLP has kept evolving ever since. The prevailing trend is that larger models perform better. But larger models are harder to scale: they are difficult and expensive to train, and training slows down as the model grows.
In this article, we’ll be discussing the ALBERT model by Google AI, proposed in the paper “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” The paper essentially proposes two parameter-reduction techniques for the original BERT architecture that address the above issues while preserving performance:
- Factorized Embedding Parametrization
- Cross-layer Parameter Sharing
Additionally, to improve performance, the paper proposes a self-supervised sentence order prediction (SOP) objective. This addresses the ineffectiveness of the next sentence prediction (NSP) task from BERT.
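To make the SOP idea concrete, here is a minimal sketch of how training pairs could be built: two consecutive segments in their original order form a positive example, and the same two segments swapped form a negative one. The function name `make_sop_examples`, the 50/50 swap rate, and the label convention (1 = correct order, 0 = swapped) are assumptions for illustration, not the paper's code.

```python
import random


def make_sop_examples(segments, rng):
    """Build (segment_a, segment_b, label) triples from consecutive segments.

    label 1: the pair is in its original order.
    label 0: the same pair with the order swapped.
    Both the labels and the 50% swap rate are illustrative assumptions.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))  # original order
        else:
            examples.append((second, first, 0))  # swapped order
    return examples


if __name__ == "__main__":
    document = [
        "ALBERT factorizes the embedding matrix.",
        "It also shares parameters across layers.",
        "Finally, it replaces NSP with SOP.",
    ]
    for seg_a, seg_b, label in make_sop_examples(document, random.Random(0)):
        print(label, "|", seg_a, "|", seg_b)
```

Note that both segments always come from the same document, so the model cannot solve the task by topic alone; it has to look at how the two segments fit together.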
Factorized Embedding Parametrization
In BERT, we essentially take an embedding matrix of size V x E, where V is the vocab_size and E is the embedding_dim. The embedding size is tied to the hidden size: E = H, where H is the hidden_dim.
Now, the main job of the word (or WordPiece) embeddings is to learn context-independent representations of the tokens. The job of the hidden-layer embeddings, on the other hand, is to learn context-dependent representations. Loosely speaking, the word embeddings learn to capture the correspondence between tokens irrespective of the distribution of the data, while the hidden-layer embeddings learn to capture the patterns between tokens for the specific distribution on which they are trained.
A larger hidden_dim lets the model capture more contextual information. However, with the standard parametrization E = H, so increasing H also inflates the V x E embedding matrix, which is a huge overhead. Hence, the authors of the paper have proposed a parametrization technique that separates the word and hidden-layer embeddings. The idea is to project the vocab_size-length one-hot vector to a much smaller embedding_dim-sized vector (say 128), and then project this vector to a much larger hidden_dim-sized vector (say 768). Do the math!
Thus, we reduce the number of embedding parameters from O(V x H) to O(V x E + E x H). This makes a huge difference when H is much larger than E.
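Doing the math as a quick sketch, using the example sizes from above (E = 128, H = 768) and an assumed vocabulary of 30,000 tokens. The vocabulary size and the function names are assumptions for illustration; bias terms are ignored.

```python
def tied_embedding_params(vocab_size, hidden_dim):
    """BERT-style embedding: a single V x H matrix (E is tied to H)."""
    return vocab_size * hidden_dim


def factorized_embedding_params(vocab_size, embedding_dim, hidden_dim):
    """Factorized embedding: a V x E lookup followed by an E x H projection."""
    return vocab_size * embedding_dim + embedding_dim * hidden_dim


if __name__ == "__main__":
    V, E, H = 30_000, 128, 768  # V is an assumed vocabulary size

    tied = tied_embedding_params(V, H)
    factorized = factorized_embedding_params(V, E, H)

    print(f"V x H         = {tied:,}")        # 23,040,000
    print(f"V x E + E x H = {factorized:,}")  # 3,938,304
    print(f"reduction     = {tied / factorized:.2f}x")
```

The V x E term dominates the factorized count, which is why the saving grows as H gets larger relative to E.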
Cross-Layer Parameter Sharing
There are several ways to share weights across layers: sharing only the feed-forward parameters, sharing only the attention parameters, or sharing all of them.
ALBERT shares all of its parameters across layers.
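To get a feel for what sharing all parameters buys, here is a rough count for an encoder of BERT-base size. The sizes (12 layers, hidden size 768, feed-forward size 3072) and the per-layer breakdown are assumptions for illustration, not numbers taken from this article, and the helper names are hypothetical.

```python
def layer_param_count(hidden_dim, ffn_dim):
    """Approximate parameters in one Transformer encoder layer (with biases)."""
    # Query, key, value and output projections: four H x H matrices plus biases.
    attention = 4 * (hidden_dim * hidden_dim + hidden_dim)
    # Two dense layers: H -> ffn_dim and ffn_dim -> H, each with a bias.
    feed_forward = (hidden_dim * ffn_dim + ffn_dim) + (ffn_dim * hidden_dim + hidden_dim)
    # Two layer norms, each with a scale and a shift vector of size H.
    layer_norms = 2 * (2 * hidden_dim)
    return attention + feed_forward + layer_norms


def encoder_param_count(num_layers, hidden_dim, ffn_dim, share_across_layers):
    """Total encoder parameters, with or without cross-layer sharing."""
    per_layer = layer_param_count(hidden_dim, ffn_dim)
    # With full sharing, every layer reuses the same single set of weights.
    return per_layer if share_across_layers else num_layers * per_layer


if __name__ == "__main__":
    unshared = encoder_param_count(12, 768, 3072, share_across_layers=False)
    shared = encoder_param_count(12, 768, 3072, share_across_layers=True)
    print(f"without sharing: {unshared:,}")  # 85,054,464
    print(f"with sharing:    {shared:,}")    # 7,087,872
```

Sharing does not reduce compute: the input still passes through the layer 12 times. Only the number of stored and trained weights shrinks.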
The following comparison shows that weight sharing not only reduces the number of parameters in the model but also helps to stabilize the network parameters: