BERT in keras (tensorflow 2.0) using tfhub/huggingface

Source: Deep Learning on Medium

(Image courtesy: Jay Alammar)

Recent years have seen a wave of pretrained language models such as ELMo, GPT, ULMFiT and BERT. This has been a crucial breakthrough since the advent of pretrained word embeddings such as GloVe, fastText and word2vec. While word embeddings helped create dense representations of text sequences, these pretrained models have made it possible

  • to overcome the shortage of training data for NLP tasks
  • to fine-tune on various NLP tasks such as entity recognition, sentiment analysis and question answering.


BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Why does this matter? Two reasons:

  • Contextual representations of words that have multiple meanings, as in the case of polysemy. Take “mole” as an example: it can be (a) an animal or (b) a spy. Context-free embeddings such as word2vec or GloVe return the same embedding in both contexts.
  • Deeply bidirectional models: it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model. To solve this problem, BERT uses the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words.
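The masking idea can be illustrated with a small, simplified sketch. It assumes a masking rate of about 15% (the rate reported for BERT) and always substitutes “[MASK]”, whereas the real pre-training recipe sometimes keeps the word or swaps in a random one. The function name mask_tokens and the example sentence are made up for illustration.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a few tokens and remember what they were (simplified masked-LM setup)."""
    rng = random.Random(seed)
    # Mask at least one token so short sentences still yield a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    labels = {}  # position -> original token the model has to predict
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, labels


tokens = "the mole dug a tunnel under the garden fence".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model sees the words on both sides of each “[MASK]” but never the hidden word itself, which is what makes bidirectional conditioning safe.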

For a detailed understanding of BERT, you can read directly from google-ai and here.

Implementation of BERT

If you would like to get straight into action and suffer no further, here is the colab notebook to start playing around. Your biggest headache will come from converting your text features into the input format BERT expects. The _get_inputs function, available in the colab notebook, will help you do so.

The implementation typically takes two steps:

  • get the inputs required by BERT, which are input ids, input masks and input segments. I have done this with a function called _get_inputs
  • add the pretrained bert model as a layer to your own model

The inputs might be confusing to look at the first time. Here is a simple explanation. Given a pair of sentences like “I want to embed. Maybe now”, the BERT tokenizer will convert it into

["[CLS]", "i", "want", "to", "em", "##bed", "[SEP]", "maybe", "now", "[SEP]"].

Here ‘[CLS]’ and ‘[SEP]’ are reserved tokens: ‘[CLS]’ marks the start of the input and ‘[SEP]’ marks sentence or sequence separation. Using the tokenizer's convert_tokens_to_ids on the tokens, we can get input_ids. Let's say we fix 15 tokens as the standard length. The above sequence contains only 10, so the remaining 5 positions have to be padded. input_masks tell the model which positions hold real tokens (1) and which are padding (0). input_segments represent the separation: the first sentence is marked ‘0’ and the second is marked ‘1’. All of this is carried out in the _get_inputs function, given the dataset and the tokenizer to be used. I have used the dataset from the Google Q&A competition on Kaggle in colab (you can check here on how to get the dataset).
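To make the three inputs concrete, here is a minimal plain-Python sketch for the example above. It is not the _get_inputs function from the notebook: the name get_inputs_sketch and the toy vocabulary (with made-up ids) are assumptions for illustration, and in practice the tokenizer's own vocabulary supplies the ids.

```python
def get_inputs_sketch(tokens_a, tokens_b, vocab, max_seq_length=15):
    """Build input_ids, input_masks and input_segments for one sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("sequence longer than max_seq_length; truncate first")

    input_ids = [vocab[t] for t in tokens]
    input_masks = [1] * len(tokens)  # 1 = real token
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    input_segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Pad all three lists to the fixed length (0 = padding).
    n_pad = max_seq_length - len(tokens)
    input_ids += [vocab["[PAD]"]] * n_pad
    input_masks += [0] * n_pad
    input_segments += [0] * n_pad
    return input_ids, input_masks, input_segments


# Toy vocabulary with hypothetical ids, only for this example.
toy_vocab = {tok: i for i, tok in enumerate(
    ["[PAD]", "[CLS]", "[SEP]", "i", "want", "to", "em", "##bed", "maybe", "now"])}

ids, masks, segments = get_inputs_sketch(
    ["i", "want", "to", "em", "##bed"], ["maybe", "now"], toy_vocab)
print(ids)
print(masks)
print(segments)
```

All three lists have length 15: ten positions for the real tokens and five for padding.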

Here are the snippets on implementing a keras model

Using TFhub

For TF 2.0, hub.Module() will not work; we need to use hub.KerasLayer. This applies when the notebook has internet access. With internet turned off, use hub.load (check common issues in tfhub).
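A minimal sketch of what the TF Hub route could look like is given here; it is not the notebook's code. It assumes TF 2.x with tf.keras and version 1 of the bert_en_uncased module on tfhub.dev, which takes a list of three int32 tensors and returns the pooled and sequence outputs. The function name build_tfhub_bert_model, the num_outputs parameter and the sigmoid head are illustrative assumptions.

```python
# Assumed module handle; later versions of this module expect a different input format.
HUB_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"


def build_tfhub_bert_model(max_seq_length=15, num_outputs=1, hub_url=HUB_URL):
    """Keras model with BERT from TF Hub as a layer (sketch, downloads weights when called)."""
    # Imported lazily so that defining the function does not require TensorFlow.
    import tensorflow as tf
    import tensorflow_hub as hub

    # BERT expects int32 inputs.
    input_word_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

    bert_layer = hub.KerasLayer(hub_url, trainable=True)
    pooled_output, sequence_output = bert_layer(
        [input_word_ids, input_mask, segment_ids])

    # Reduce [batch, tokens, 768] to [batch, 768] before the task-specific head.
    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    outputs = tf.keras.layers.Dense(num_outputs, activation="sigmoid")(x)

    return tf.keras.Model(
        inputs=[input_word_ids, input_mask, segment_ids], outputs=outputs)
```

Calling build_tfhub_bert_model() needs internet access the first time, since the weights are fetched from the hub.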

Using HuggingFace/Transformers

HuggingFace is a startup that has created a ‘transformers’ package through which we can seamlessly jump between many pre-trained models and, what’s more, move between pytorch and keras. Check out here for more on this awesome startup.

The embedding layer is almost the same. The magic is the ‘TFBertModel’ class from the transformers package. In fact, it is extremely easy to switch between models. For example, to get ‘roberta’, simply use ‘TFRobertaModel’. The TF in the model name indicates TF 2.0 compatibility.
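A comparable sketch for the transformers route follows, again not the notebook's code. It assumes TF 2.x with tf.keras and the "bert-base-uncased" checkpoint; the function name build_hf_bert_model, the num_outputs parameter and the sigmoid head are illustrative assumptions.

```python
def build_hf_bert_model(max_seq_length=15, num_outputs=1,
                        model_name="bert-base-uncased"):
    """Keras model with TFBertModel as a layer (sketch, downloads weights when called)."""
    # Imported lazily so that defining the function does not require the libraries.
    import tensorflow as tf
    from transformers import TFBertModel

    # BERT expects int32 inputs.
    input_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="attention_mask")
    token_type_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name="token_type_ids")

    bert = TFBertModel.from_pretrained(model_name)
    # The first element of the output is the per-token (sequence) output.
    sequence_output = bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
    )[0]

    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    outputs = tf.keras.layers.Dense(num_outputs, activation="sigmoid")(x)

    return tf.keras.Model(
        inputs=[input_ids, attention_mask, token_type_ids], outputs=outputs)
```

Here attention_mask and token_type_ids play the roles of input_masks and input_segments. Switching to another architecture would mean swapping the class and the checkpoint name, keeping in mind that some models (RoBERTa, for instance) do not use segment ids the way BERT does.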

Common issues or errors

  • Bert requires the input tensors to be of ‘int32’. Note how the input layers have the dtype marked as ‘int32’.
  • Bert returns a 3D array [batch_size, no_of_words, embed_dim] in case of sequence output and a 2D array [batch_size, embed_dim] in case of pooled output. You will need to use a layer to convert it into the output you want. Here I have used GlobalAveragePooling1D to reduce the sequence output to a single vector per example.
  • If you are using tfhub for the bert implementation, some of the modules will not be tf2 compatible. Choose only those which have clear documentation on how to use them, like the one shown in the example.
  • People struggle to determine the input shape in keras for their dataset. An easy way is to arrange it as [sentences (batch_size), no_of_words, embed_dim]. Here, embed_dim is the number of output dimensions per word for a given embedding. For example, for bert (base) it is 768.
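The shape bookkeeping in the last points can be checked with a tiny plain-Python sketch of what averaging over the word axis does. The helper name and the toy numbers are made up for illustration; embed_dim is 4 here rather than 768 to keep the example readable.

```python
def global_average_pooling_1d(batch):
    """Average over the word axis: [batch, words, embed_dim] -> [batch, embed_dim]."""
    return [
        [sum(word[d] for word in sentence) / len(sentence)
         for d in range(len(sentence[0]))]
        for sentence in batch
    ]


# Two sentences, three words each, embed_dim = 4.
sequence_output = [
    [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [3, 6, 9, 12]],
]

pooled = global_average_pooling_1d(sequence_output)
print(len(pooled), len(pooled[0]))  # batch size and embed_dim after pooling
print(pooled)
```

Each sentence ends up as one vector of length embed_dim, which is the shape a Dense output layer expects.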