Albert Vectorization (With Tensorflow Hub)

Original article was published on Deep Learning on Medium

We vectorize input texts using pre-trained ALBERT embeddings from tensorflow-hub. The module is used as a Keras layer, so it can easily be extended to build deep learning models.

The advent of BERT has disrupted the traditional paradigm of NLP. Downstream models are now built from a model already primed with knowledge of a language rather than from scratch. ALBERT can be called a lite BERT: it uses a transformer-encoder architecture with a greatly reduced number of parameters. In the paper, the vocabulary size is 30K, the same as in the original BERT. Three major things distinguish ALBERT from BERT: factorized embedding parameterization, cross-layer parameter sharing, and sentence-order prediction.

Advantage over BERT

BERT-like models can perform poorly when one simply enlarges the hidden size of the model. ALBERT uses a parameter-reduction technique, factorized embedding parameterization, to separate the size of the hidden layers from the size of the vocabulary embedding, which makes it easy to grow the hidden size without significantly increasing the number of parameters. Cross-layer parameter sharing, in turn, prevents the number of parameters from growing with the depth of the network. Together, the two techniques significantly reduce the number of parameters relative to traditional BERT without seriously hurting performance, which improves parameter efficiency. The performance of ALBERT is further improved by introducing a self-supervised loss for sentence-order prediction (SOP).
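To make the parameter savings concrete, here is a rough back-of-the-envelope sketch. The hidden size, embedding size, and layer count below are assumptions chosen for illustration, and the per-layer count is a simplification that ignores biases and layer normalization, so the numbers are indicative only.

```python
# Illustrative parameter arithmetic. All sizes are assumptions for this sketch.
V = 30_000  # vocabulary size (30K, as mentioned above)
H = 768     # hidden size (assumed; matches the 768-dim vectors used later)
E = 128     # factorized embedding size (assumed for illustration)
L = 12      # number of encoder layers (assumed)


def direct_embedding_params(vocab, hidden):
    """Embedding table tied to the hidden size: V x H."""
    return vocab * hidden


def factorized_embedding_params(vocab, emb, hidden):
    """Small embedding table plus a projection to the hidden size: V x E + E x H."""
    return vocab * emb + emb * hidden


def approx_layer_params(hidden):
    """Rough weight count of one encoder layer: 4*H^2 for attention plus
    8*H^2 for a feed-forward block with intermediate size 4*H.
    Biases and layer normalization are ignored."""
    return 12 * hidden * hidden


direct = direct_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)
unshared_layers = L * approx_layer_params(H)  # every layer has its own weights
shared_layers = approx_layer_params(H)        # one set of weights reused by all layers

print(f"embedding params, direct:     {direct:,}")
print(f"embedding params, factorized: {factorized:,}")
print(f"encoder params, unshared:     {unshared_layers:,}")
print(f"encoder params, shared:       {shared_layers:,}")
```

Under these assumptions the embedding table shrinks from about 23M to about 3.9M parameters, and the encoder weights no longer scale with the number of layers.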

In this article, we will get the ALBERT vectors for a given text using the Keras layer format of the tensorflow-hub module.

1. Setup Environment

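We will build a model with the TensorFlow Keras API that returns the ALBERT embedding for a text input. The environment setup includes installing the required libraries and getting the tensorflow-hub module needed to compute ALBERT vectors. The libraries used in the code below are tensorflow (as tf), tensorflow_hub (as hub), and numpy (as np), along with a FullSentencePieceTokenizer class for tokenization.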


We will be using the TF2 SavedModel format of ALBERT. You can use the module hosted on tfhub directly for inference, but for a production scenario, keeping a local copy of the module is preferable. For that, we first need to download the archive of the module and extract it.

# mkdir albert_en_base
# mkdir 1
# wget
# tar -xzf albert_en_base_1.tar.gz
# rm -rf albert_en_base_1.tar.gz

2. Tokenizing

We will import the ALBERT module as a keras layer.

import tensorflow_hub as hub
albert_layer = hub.KerasLayer("albert_en_base/1", trainable=False)

The tricky part is that the ALBERT module can't be fed raw text directly. The text first needs to go through a preprocessing step.

First, we will tokenize the input text using the ALBERT tokenizer, which is based on SentencePiece, i.e. a subword-level tokenizer. It is a data-driven tokenizer, so out-of-vocabulary words are broken into known subword pieces. Since each input contains only one text, the token list of a sentence looks like ["[CLS]"] + TOKENS + ["[SEP]"].

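# FullSentencePieceTokenizer ships with tf-models-official; the module path may differ between versions
from official.nlp.bert.tokenization import FullSentencePieceTokenizer

sp_model_file = albert_layer.resolved_object.sp_model_file.asset_path.numpy()
tokenizer = FullSentencePieceTokenizer(sp_model_file)
stokens = tokenizer.tokenize(sentence)
# Truncate so that [CLS] and [SEP] still fit within MAX_SEQ_LEN
stokens = stokens[:MAX_SEQ_LEN - 2]
stokens = ["[CLS]"] + stokens + ["[SEP]"]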

For example, the text “the body is made of metallic and delivers high tension” generates a token list like:


Now we need three input sequences that can be fed to the ALBERT module.

  • Token ids: the id of every token in the sentence, looked up in the ALBERT vocabulary.
  • Mask ids: 1 for every real token and 0 for positions used only for padding (so every sequence has the same fixed length).
  • Segment ids: 0 for every token in a one-sentence sequence (our case here); when there are two sentences in the sequence, tokens of the second sentence get 1.

def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from the tokenizer vocab, padded with 0 up to max_seq_length"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length - len(token_ids))
    return input_ids

def get_masks(tokens, max_seq_length):
    """Mask ids: 1 for real tokens, 0 for padding"""
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        # Tokens after the first [SEP] belong to the second sequence
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

ids = get_ids(stokens, tokenizer, MAX_SEQ_LEN)
# [2, 13, 1, 589, 17378, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
masks = get_masks(stokens, MAX_SEQ_LEN)
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
segments = get_segments(stokens, MAX_SEQ_LEN)
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
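
The example in this article uses a single sentence, so every segment id is 0. As a minimal sketch of the two-sentence case, the snippet below builds all three inputs for a sentence pair. The function name `build_pair_inputs`, the toy vocabulary, and its token ids are hypothetical and stand in for the real ALBERT tokenizer; only the layout of the three sequences is the point.

```python
def build_pair_inputs(tokens_a, tokens_b, vocab, max_seq_length):
    """Build token ids, mask ids and segment ids for a sentence pair.

    `vocab` is a plain dict used here as a stand-in for the tokenizer vocab.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")
    pad = max_seq_length - len(tokens)

    ids = [vocab[t] for t in tokens] + [0] * pad
    masks = [1] * len(tokens) + [0] * pad

    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        # Everything after the first [SEP] belongs to the second sentence
        if token == "[SEP]":
            current_segment_id = 1
    segments += [0] * pad
    return ids, masks, segments


# Hypothetical toy vocabulary; real ids come from the ALBERT vocab file.
toy_vocab = {"[CLS]": 2, "[SEP]": 3, "how": 10, "old": 11, "your": 12, "age": 13}

pair_ids, pair_masks, pair_segments = build_pair_inputs(
    ["how", "old"], ["your", "age"], toy_vocab, max_seq_length=10
)
print(pair_ids)       # [2, 10, 11, 3, 12, 13, 3, 0, 0, 0]
print(pair_masks)     # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(pair_segments)  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

The first [SEP] still belongs to the first sentence, while the second sentence and its closing [SEP] get segment id 1; padding positions are 0 in all three sequences.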

3. Albert Vectorization

The input processing is now in place. We will build a model with the tf.keras API that takes the processed inputs and returns the ALBERT vectors of the text. The output consists of two tensors, pooled_output and sequence_output. The pooled_output is the sentence embedding, of dimension 1×768, and the sequence_output is the token-level embedding, of dimension 1×(token_length)×768, where token_length is the padded sequence length fed to the model.

import tensorflow as tf

def get_model():
    # One int32 input per sequence, each of fixed length MAX_SEQ_LEN
    input_word_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")
    pooled_output, sequence_output = albert_layer([input_word_ids, input_mask, segment_ids])
    model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])
    return model

The idea behind building a Keras model is that it can easily be extended into a classification model by adding the required layers and parameters.

Now all the hard work is done. We just need an inference function to get the ALBERT embeddings of a text.

import numpy as np

s = "This is a nice sentence."

def get_albert_vec(s):
    """Tokenize a text and build the three model inputs, each of shape (1, MAX_SEQ_LEN)"""
    stokens = tokenizer.tokenize(s)
    stokens = ["[CLS]"] + stokens + ["[SEP]"]
    ids = get_ids(stokens, tokenizer, MAX_SEQ_LEN)
    masks = get_masks(stokens, MAX_SEQ_LEN)
    segments = get_segments(stokens, MAX_SEQ_LEN)
    # MAX_SEQ_LEN is 22 in this example
    input_ids = np.asarray(ids, dtype=np.int32).reshape(1, MAX_SEQ_LEN)
    input_masks = np.asarray(masks, dtype=np.int32).reshape(1, MAX_SEQ_LEN)
    input_segments = np.asarray(segments, dtype=np.int32).reshape(1, MAX_SEQ_LEN)
    return input_ids, input_masks, input_segments

albert_model = get_model()
input_ids, input_masks, input_segments = get_albert_vec(s)
pool_embs, word_embs = albert_model.predict([input_ids, input_masks, input_segments])

To quickly check the quality of the sentence embeddings, let’s run the model on a small set of examples and compute the similarity score of each sentence pair as the dot product of their normalized sentence embedding vectors.
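
The scoring step itself can be sketched independently of the model. The snippet below is a minimal illustration on made-up 2-dimensional vectors rather than real 768-dimensional ALBERT embeddings; the helper names `l2_normalize` and `similarity_matrix` are hypothetical. The dot product of two L2-normalized vectors is their cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(embeddings):
    """Pairwise dot products of the L2-normalized embeddings."""
    normed = [l2_normalize(v) for v in embeddings]
    return [
        [sum(a * b for a, b in zip(u, v)) for v in normed]
        for u in normed
    ]


# Toy vectors standing in for sentence embeddings (not real model output).
toy_embeddings = [
    [3.0, 4.0],   # same direction as the next vector
    [6.0, 8.0],
    [-4.0, 3.0],  # orthogonal to the first two
]

sim = similarity_matrix(toy_embeddings)
for row in sim:
    print([round(score, 3) for score in row])
```

Scores close to 1 indicate vectors pointing in the same direction regardless of their length, and scores close to 0 indicate unrelated directions. With the model above, each row of pool_embs would play the role of one toy vector.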

sentences = [
    # Smartphones
    "I like my phone",
    "My phone is not good.",
    # Weather
    "Recently a lot of hurricanes have hit the US",
    "Global warming is real",
    # Asking about age
    "How old are you?",
    "what is your age?"]