Source: Deep Learning on Medium
Transfer Learning — Powering up with Pretrained Models (Read: BERT)
Bonus: Use Huggingface Transformers Pretrained Models with Internet Off
Much has been said about these pretrained models, but the important concepts around transfer learning, what it means and how we can use it, required a bit of digging. For example, if you have used the TF Hub version of BERT, you might wonder: what is the "trainable" flag in the layer?
It turns out there are three ways we can do transfer learning with pretrained models:
- Feature Extraction: the pretrained layers are used only to extract features from the input (much like a frozen BatchNormalization layer, which only normalizes its input to roughly zero mean using stored statistics without learning anything new). In this method, the weights are not updated during backpropagation. This is what is marked as non-trainable in the model summary, and in TF Hub as trainable=False.
If you would like to set this manually, you can select the layer and set its trainable attribute yourself.
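As a minimal sketch of freezing a layer by hand in Keras (the layer names and sizes here are hypothetical stand-ins, not BERT's real layers):

```python
import tensorflow as tf

# A hypothetical two-layer model: a "pretrained" feature layer plus a task head
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, name="pretrained_features"),
    tf.keras.layers.Dense(1, name="task_head"),
])

# Select the layer and freeze it: its weights are skipped during backprop
model.get_layer("pretrained_features").trainable = False

# The frozen weights now appear under "Non-trainable params" in the summary
model.summary()
```

This is the same effect you get from trainable=False on a TF Hub layer.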
- Fine Tuning: sort of what this entire competition is about. BERT is ideal here because it can be fine-tuned for question answering, so we just have to fine-tune the model to suit our purpose. What this means: the layers have been trained on a general dataset, and we retrain them to optimize for our specific task. In TF Hub this is trainable=True, and in the Keras model summary these weights are counted as trainable parameters. For pretrained models with multiple layers this count is usually ginormous; even ALBERT, the light version, contains about 12 million parameters.
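A minimal fine-tuning sketch, with a plain Dense layer standing in for the pretrained encoder (the names are hypothetical; the small 2e-5 learning rate is in the range the BERT paper recommends for fine-tuning):

```python
import tensorflow as tf

# Hypothetical stand-in for a pretrained encoder (in practice, the BERT layer
# from TF Hub or transformers)
encoder = tf.keras.layers.Dense(16, activation="relu", name="encoder")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    encoder,
    tf.keras.layers.Dense(1, activation="sigmoid", name="task_head"),
])

# Fine tuning: leave the pretrained weights trainable (trainable=True in TF Hub)
encoder.trainable = True

# A small learning rate avoids wiping out the pretrained weights during the
# few fine-tuning epochs
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy")
```

With everything trainable, all of the encoder's weights show up as trainable parameters in the model summary.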
- Extract Layers: in this method, we extract only the layers required for the task. For example, we might want to extract just the lower layers of BERT for tasks like POS tagging or sentiment analysis, where word-level features are enough and heavy context or sequence matching is not required. Below is an example of extracting the last hidden states from BERT:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple
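The snippet above returns the top layer's output. To grab a lower layer instead, the same sub-model idea can be sketched with plain Keras (the layer names and sizes here are hypothetical, not BERT's real architecture); with BERT itself you would load the model with output_hidden_states=True and index into the returned hidden states.

```python
import tensorflow as tf

# A hypothetical deep model standing in for a pretrained network
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(32, activation="relu", name="lower")(inputs)
x = tf.keras.layers.Dense(32, activation="relu", name="middle")(x)
top = tf.keras.layers.Dense(32, activation="relu", name="upper")(x)
full_model = tf.keras.Model(inputs, top)

# Extract layers: build a sub-model that stops at the lower layer we want,
# so only that layer's features are computed
feature_extractor = tf.keras.Model(
    inputs=full_model.input,
    outputs=full_model.get_layer("lower").output,
)

features = feature_extractor(tf.zeros((1, 10)))  # shape (1, 32)
```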
BERT — Important Note
Given the crazy amount of parameters, the BERT paper's recommendation is to adopt:
- Epochs — between 3 and 4
- Batch size — 4, 8, or 16 (if you are training on a specific group, like a smaller sample, then maybe 32)
- Layers — you may not have to add any additional layers apart from the output and an averaging/max pooling layer to reshape the BERT output to your requirements, since BERT (or an equivalent model) has already optimized the layers and hidden units for us. I did try adding a layer for vanishing gradients: these deep networks can dwindle the weights to almost zero, so I added a leaky ReLU layer to slow the weight deterioration. The model run time went up crazily.
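As a sketch of that minimal head, assuming a bert-base-style sequence output of shape (128, 768) (sequence length and hidden size are assumptions matching bert-base):

```python
import tensorflow as tf

# Stand-in for the BERT sequence output fed into the head
sequence_output = tf.keras.Input(shape=(128, 768), name="bert_sequence_output")

# Average pooling collapses the sequence dimension; one Dense layer produces
# the task output — no extra hidden layers needed
pooled = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
logits = tf.keras.layers.Dense(2, name="output")(pooled)  # e.g. a 2-class task

head = tf.keras.Model(sequence_output, logits)
```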