Transfer Learning using ELMO Embedding

Source: Deep Learning on Medium

Last year, the major developments in Natural Language Processing (NLP) were about transfer learning. Transfer learning is the process of training a model on a large-scale dataset and then reusing that pre-trained model as the starting point for a different target task. It became popular in NLP thanks to the state-of-the-art performance of approaches such as ULMFiT, Skip-Gram, ELMo and BERT.

ELMo embeddings, developed by AllenNLP, come from a state-of-the-art pre-trained model that is available on TensorFlow Hub. They are learned from the internal states of a bidirectional LSTM and represent contextual features of the input text. ELMo has been shown to outperform earlier pre-trained word embeddings such as word2vec and GloVe on a wide variety of NLP tasks, including question answering, named entity extraction and sentiment analysis.

Elmo Embedding using Tensorflow-hub

There is a pre-trained ELMo embedding module available in tensorflow-hub. The module accepts either raw text strings or tokenized text strings as input. It outputs fixed embeddings at each LSTM layer, a learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input (for sentences). To use the module, let's first download it to a local folder.

# Create a local folder so the module can be reused without downloading it again
!mkdir -p module/module_elmo2
# Download the module, and uncompress it to the destination folder.
!curl -L "" | tar -zxvC module/module_elmo2

This module exposes 4 trainable scalar weights for layer aggregation. The output dictionary contains:

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512].
  • lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].
  • lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].
  • elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
  • default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].
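
To picture what "the weighted sum of the 3 layers" means, here is a small NumPy sketch. It follows the formulation in the ELMo paper, where one scalar per layer is softmax-normalized and a single scalar gamma rescales the result (three layer weights plus gamma would account for the 4 trainable scalars). The function names and the random inputs are illustrative only, and the sketch assumes all three layers already share the width 1024; it is not the module's actual code.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def scalar_mix(layers, layer_weights, gamma):
    """Weighted sum of equally shaped layer outputs.

    layers:        list of arrays, each [batch_size, max_length, dim]
    layer_weights: one unnormalized scalar per layer (trainable in the module)
    gamma:         a single scalar that rescales the mixed vector
    """
    s = softmax(np.asarray(layer_weights, dtype=np.float64))
    mixed = sum(weight * layer for weight, layer in zip(s, layers))
    return gamma * mixed

# Toy stand-ins for word_emb, lstm_outputs1 and lstm_outputs2 (random, not real ELMo values).
rng = np.random.default_rng(0)
batch_size, max_length, dim = 2, 6, 1024
layers = [rng.standard_normal((batch_size, max_length, dim)) for _ in range(3)]

# Zero weights give a uniform softmax, so the mix is the plain average of the layers.
elmo_like = scalar_mix(layers, layer_weights=[0.0, 0.0, 0.0], gamma=1.0)
print(elmo_like.shape)  # (2, 6, 1024)
```

During fine-tuning, only these few scalars need to change for a downstream task to favor one layer over another, which is why the elmo output is the usual choice when word-level vectors are needed.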

To pass raw strings as input:

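import tensorflow as tf
import tensorflow_hub as hub
elmo = hub.Module("module/module_elmo2/", trainable=False)
embeddings = elmo(
    ["the cat is on the mat", "what are you doing in evening"],
    signature="default", as_dict=True)["elmo"]
with tf.Session() as session:
    # Initialize the module's variables and lookup tables before running it
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embeddings)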

The output message_embeddings has shape (2, 6, 1024): there are 2 sentences with a maximum length of 6 words, and a vector of length 1024 is generated for each word. The module tokenizes the input internally by splitting on spaces. If a string with fewer than 6 words had been supplied, it would have been padded internally to the maximum length.
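
The splitting and padding can be imitated in plain Python to see where the max_length dimension comes from. This is only a sketch of the idea: tokenize_and_pad is a hypothetical helper, and using an empty string as the padding token is an assumption, not something taken from the module's internals.

```python
def tokenize_and_pad(sentences, pad_token=""):
    """Split each sentence on whitespace and pad to the longest one.

    Returns (tokens, lengths): tokens is a list of equal-length token lists,
    lengths holds the number of real tokens per sentence.
    """
    tokenized = [sentence.split() for sentence in sentences]
    lengths = [len(tokens) for tokens in tokenized]
    max_length = max(lengths)
    padded = [tokens + [pad_token] * (max_length - len(tokens)) for tokens in tokenized]
    return padded, lengths

tokens, lengths = tokenize_and_pad(["the cat is on the mat", "i am sambit"])
print(lengths)    # [6, 3]
print(tokens[1])  # ['i', 'am', 'sambit', '', '', '']
```

The two return values have the same structure as the tokens and sequence_len inputs that the module accepts when it is given tokenized text.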

We can also supply tokenized strings to the module as shown below:

elmo = hub.Module("sentence_wise_email/module/module_elmo2/", trainable=False)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["what", "are", "you", "doing", "in", "evening"]]
tokens_length = [6, 6]  # number of real tokens in each sentence
embeddings = elmo(
    inputs={"tokens": tokens_input, "sequence_len": tokens_length},
    signature="tokens", as_dict=True)["elmo"]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embeddings)

The output will be similar.

When the embedding is used in a REST API or a backend that handles many inputs, initializing a session for each call is an unnecessary overhead. A more efficient approach is to build the graph and the session once and reuse them:

def embed_elmo2(module):
    # Build the graph and session once; the returned function reuses them
    with tf.Graph().as_default():
        sentences = tf.placeholder(tf.string)
        embed = hub.Module(module)
        embeddings = embed(sentences)
        session = tf.train.MonitoredSession()
    return lambda x: session.run(embeddings, {sentences: x})

embed_fn = embed_elmo2('module/module_elmo2')
embed_fn(["i am sambit"]).shape

Here the module returns its default output: a vector of size 1024 for each sentence, which is a fixed mean-pooling of all contextualized word representations. This is the output we would normally feed into a classifier.
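
Mean-pooling turns a variable number of word vectors into one fixed-size sentence vector. The NumPy sketch below shows the idea on toy numbers. masked_mean_pool is a hypothetical helper, and leaving padded positions out of the average is an assumption made here for illustration, not a statement about how the module is implemented.

```python
import numpy as np

def masked_mean_pool(word_vectors, lengths):
    """Average word vectors over the real (non-padded) positions only.

    word_vectors: array of shape [batch_size, max_length, dim]
    lengths:      number of real tokens per sentence, shape [batch_size]
    """
    word_vectors = np.asarray(word_vectors, dtype=np.float64)
    lengths = np.asarray(lengths)
    max_length = word_vectors.shape[1]
    # mask[i, t] is 1.0 for real tokens and 0.0 for padding
    mask = (np.arange(max_length)[None, :] < lengths[:, None]).astype(np.float64)
    summed = (word_vectors * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Toy input: 2 sentences, up to 3 tokens, 4-dimensional vectors (not real ELMo values).
vectors = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
pooled = masked_mean_pool(vectors, lengths=[3, 2])
print(pooled.shape)  # (2, 4)
```

With real ELMo vectors the last dimension would be 1024, giving one 1024-dimensional vector per sentence regardless of its length.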

ELMO Embedding in a Simple Neural Network Classifier

Data Input

We will use the First GOP Debate Twitter Sentiment data, which contains around 14K tweets about the first 2016 GOP presidential debate. Since we are building a binary classifier, we ignore tweets with neutral sentiment.

import pandas as pd

df = pd.read_csv("sentence_wise_email/Sentiment.csv", encoding="latin")
df = df[df["sentiment"] != "Neutral"]  # keep only positive and negative tweets

Data Processing

We will perform some cleaning on the data, such as expanding contractions like “I’ll” and “It’s”. We will also remove numbers, links, punctuation and e-mail addresses. (Names should be removed as well, but that is not done here: this model was originally developed for another purpose and I got lazy about changing it for this case :-) )

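import re
import numpy as np

def cleanText(text):
    # Flatten line breaks, then normalize contractions, links and numbers
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = replace_contraction(text)
    text = replace_links(text, "link")
    text = remove_numbers(text)
    # Strip punctuation and lowercase
    text = re.sub(r'[,!@#$%^&*)(|/><";:.?\'\\}{]', "", text)
    text = text.lower()
    return text

X = np.array(df["text"].apply(cleanText))
# Encode labels as 1 (Positive) and 0 (Negative) to match the sigmoid output
y = np.array(df["sentiment"] == "Positive", dtype=int)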
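
The helpers replace_contraction, replace_links and remove_numbers are not shown in this post. The versions below are hypothetical stand-ins, written only so that cleanText can be tried out; the contraction table is deliberately tiny and the regular expressions are simple guesses at what such helpers might do.

```python
import re

# Hypothetical stand-ins for the helpers used in cleanText; the original post
# does not show its own implementations.
CONTRACTIONS = {
    "i'll": "i will",
    "it's": "it is",
    "can't": "cannot",
    "don't": "do not",
}

def replace_contraction(text):
    """Expand a few common English contractions (case-insensitive)."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def replace_links(text, filler=" "):
    """Replace http(s) and www URLs with a filler word."""
    return re.sub(r"(https?://\S+|www\.\S+)", filler, text)

def remove_numbers(text):
    """Drop standalone digit sequences."""
    return re.sub(r"\b\d+\b", "", text)

sample = "I'll read it's 14 pages at http://example.com/page later"
cleaned = remove_numbers(replace_links(replace_contraction(sample), "link"))
print(cleaned)
```

Links are replaced before numbers are removed so that digits inside a URL disappear together with the URL instead of leaving fragments behind.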

Classifier Model Building

First, we need to import the necessary modules. Then we define a function that applies the pre-trained ELMo embedding to the inputs.

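import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("module/module_elmo2")
def ELMoEmbedding(x):
    # Squeeze only axis 1 so that a batch of one string stays a 1-D batch
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]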

Next, we build the architecture. We use the high-level Keras API because it is easier to work with, and the functional approach to build a simple feed-forward neural network with L2 regularization to reduce over-fitting.

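# Assumes the standalone Keras package on a TensorFlow 1.x backend
import keras
from keras.layers import Dense, Input, Lambda
from keras.models import Model

def build_model():
    # One raw string per example goes in; ELMo turns it into a 1024-d vector
    input_text = Input(shape=(1,), dtype="string")
    embedding = Lambda(ELMoEmbedding, output_shape=(1024,))(input_text)
    dense = Dense(256, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001))(embedding)
    pred = Dense(1, activation='sigmoid')(dense)
    model = Model(inputs=[input_text], outputs=pred)
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

model_elmo = build_model()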

The model summary is:

Layer (type)            Output Shape     Param #
input_2 (InputLayer)    (None, 1)        0
lambda_2 (Lambda)       (None, 1024)     0
dense_3 (Dense)         (None, 256)      262400
dense_4 (Dense)         (None, 1)        257

Total params: 262,657
Trainable params: 262,657
Non-trainable params: 0
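
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; dense_params is just an illustrative helper name.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(1024, 256)  # dense_3: 1024-d ELMo vector -> 256 units
output = dense_params(256, 1)     # dense_4: 256 units -> 1 sigmoid unit
print(hidden, output, hidden + output)  # 262400 257 262657
```

Only the two dense layers are counted in the summary; the Lambda layer that wraps the embedding is listed with 0 parameters.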

Now that the compiled model and the data are both ready, it is time to train the model and save the trained weights.

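with tf.Session() as session:
    keras.backend.set_session(session)
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    history = model_elmo.fit(X, y, epochs=5, batch_size=256, validation_split=0.2)
    model_elmo.save_weights("model_elmo_weights.h5")  # file name is illustrative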
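
It helps to see what validation_split=0.2 does to the data. Keras holds out the last 20% of the samples for validation before any shuffling. The sketch below reproduces the sizes that appear in the training log at the end of this post, assuming the split index is obtained by truncating to an integer; the total of 10729 is simply the sum of the two counts in that log, and split_sizes is a hypothetical helper.

```python
def split_sizes(n_samples, validation_split):
    """Sizes of the training and validation parts when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

n_train, n_val = split_sizes(10729, 0.2)
print(n_train, n_val)  # 8583 2146

# Mini-batches per epoch with batch_size=256 (the last batch is smaller)
steps = -(-n_train // 256)
print(steps)  # 34
```

Because the held-out part is taken from the end of the arrays, it is worth shuffling the data beforehand if the file is ordered in any meaningful way.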

To see how training progressed in terms of accuracy and loss, we can plot the recorded history:

import matplotlib.pyplot as plt
%matplotlib inline

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'g', label='Training Acc')
plt.plot(epochs, val_acc, 'b', label='Validation Acc')
plt.title('Training and validation Acc')
plt.legend()  # show the labels passed to plot()
plt.show()

After the 4th epoch there is not much change in accuracy. You can draw the same plot for the loss values. To get a better idea of when the model converges, it has to be trained for more epochs (which means added cost, as training is very computationally intensive and requires a GPU :-) ).

Prediction with the trained model

Now to predict with the trained model, we need to first process the text and make an array of it.

new_text = ['RT @FrankLuntz: Before the #GOPDebate, 14 focus groupers said they did not have favorable view of Trump.',
            'Chris Wallace(D) to be the 2nd worst partisan pontificating asshole "moderating" #GOPDebate @megynkelly']
# The texts should go through cleanText as well
# Shape (n, 1): one string per row, matching the model's Input(shape=(1,))
new_text_pr = np.array(new_text, dtype=object)[:, np.newaxis]

Now we can start a TensorFlow session, in which we first rebuild the model architecture and then load the weights from the saved file. Calling predict gives us the sentiment probability for each text. The lower the score, the more negative the sentiment expressed in the sentence.

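import time

with tf.Session() as session:
    keras.backend.set_session(session)
    model_elmo = build_model()
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # Load after initialization so the trained weights are not overwritten
    model_elmo.load_weights("model_elmo_weights.h5")  # file name is illustrative
    t = time.time()
    predicts = model_elmo.predict(new_text_pr)
    print("time: ", time.time() - t)
    print(predicts)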

The output is:

time:  0.6370120048522949
[0.3037635 ]]
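
To turn such a score into a label, a cut-off has to be chosen. The sketch below uses 0.5, which is a common default for a sigmoid output but is an assumption here and not something fixed by the model; to_label is a hypothetical helper.

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label (threshold is an assumed cut-off)."""
    return "Positive" if score >= threshold else "Negative"

print(to_label(0.3037635))  # Negative
```

A score of 0.3037635 therefore reads as negative sentiment, in line with the rule that lower scores mean more negative text.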

I printed the time just to show how long prediction takes: 0.63 seconds for two sentences on a Tesla K80 GPU, against 14.3 seconds on an i5 CPU. Because ELMo's architecture is highly complex and the computation is very intensive, an accelerator is required for real-time use.

As the training log below shows, we reached a validation accuracy of 0.8094 with ELMo embeddings, while pre-trained word2vec, GloVe and online embeddings gave accuracies of 0.7821, 0.7432 and 0.7213 respectively. These results were obtained with the same data processing after 5 epochs.

Train on 8583 samples, validate on 2146 samples
Epoch 1/5
8583/8583 [==============================] - 63s 7ms/step - loss: 0.8087 - acc: 0.7853 - val_loss: 0.6919 - val_acc: 0.7819
Epoch 2/5
8583/8583 [==============================] - 62s 7ms/step - loss: 0.6015 - acc: 0.8265 - val_loss: 0.6359 - val_acc: 0.7651
Epoch 3/5
8583/8583 [==============================] - 62s 7ms/step - loss: 0.5377 - acc: 0.8371 - val_loss: 0.5407 - val_acc: 0.8169
Epoch 4/5
8583/8583 [==============================] - 62s 7ms/step - loss: 0.4946 - acc: 0.8401 - val_loss: 0.5016 - val_acc: 0.8071
Epoch 5/5
8583/8583 [==============================] - 63s 7ms/step - loss: 0.4836 - acc: 0.8396 - val_loss: 0.4995 - val_acc: 0.8094
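
The validation numbers in this log can also be inspected without a plot. The sketch below copies them by hand and looks for the best epoch and the epoch-to-epoch change; best_epoch and epoch_deltas are hypothetical helpers written for this illustration.

```python
# Validation metrics copied from the five epochs in the training log.
val_acc = [0.7819, 0.7651, 0.8169, 0.8071, 0.8094]
val_loss = [0.6919, 0.6359, 0.5407, 0.5016, 0.4995]

def best_epoch(values, higher_is_better=True):
    """Return the 1-based epoch with the best value."""
    pick = max if higher_is_better else min
    return values.index(pick(values)) + 1

def epoch_deltas(values):
    """Change from each epoch to the next."""
    return [round(b - a, 4) for a, b in zip(values, values[1:])]

print(best_epoch(val_acc))                           # 3
print(best_epoch(val_loss, higher_is_better=False))  # 5
print(epoch_deltas(val_loss))
```

The validation loss improves by only about 0.002 between epochs 4 and 5, which matches the observation that the curves flatten out after the 4th epoch.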