Source: Deep Learning on Medium
“Is artificial intelligence less than our intelligence?” — Spike Jonze
Now, in this part, we will discuss the ML model and the deployment from the cloud infrastructure perspective.
Our ML model, which we call the [Crocodile Model] :), consists of artificial neural networks, specifically a sequence model: a Recurrent Neural Network based on LSTM (Long Short-Term Memory), with a CRF (Conditional Random Field) on top as the probabilistic model. Reference: https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html
Broadly, it can be divided into three parts:
1. Word Representation: First of all, we need to represent each word in the search query as a feature vector. For this we use GloVe, a pre-trained word embedding model developed as an open-source project at Stanford. It provides a 300-d embedding vector for each word. We plan to concatenate ELMo embeddings with it as well.
2. Contextual Word Representation: To capture the contextual information between the words in the query, the word vectors are passed through a bidirectional LSTM followed by a dense layer whose output dimension equals the number of tags.
3. Decoding: Now that we have a vector of tag scores for each word, we apply a CRF to find the best possible combination of tags.
There are four variants of this state-of-the-art model, listed below:
1. LSTM-CRF: Consists of the same three parts as described above.
2. CharLSTM-LSTM-CRF: In addition, we build a character-level embedding of each word using an LSTM, because some words, such as brand names, may not be present in the vocabulary of the pre-trained word embedding model, i.e. GloVe.
3. CharConv-LSTM-CRF: The same as the model above, the only difference being that the character-level embedding is generated by a CNN instead of an LSTM.
4. CharLSTM-LSTM-CRF with ELMo embeddings: The second model, with ELMo embeddings concatenated to the word representation.
We used the second model, CharLSTM-LSTM-CRF, for our use case, as it performed better than the other models. We are working on implementing Model 4 as well.
Implementation of Model 2 [Crocodile Model] in TensorFlow:
The TensorFlow tf.data API is a good candidate for feeding data to your model when working with a high-level API like Estimator. It introduces the tf.data.Dataset class, which creates an input pipeline to read the data. Its from_generator method builds a dataset from the elements yielded by a Python generator. You can also use the map method for feature engineering.
# generator_fn is a function that returns a generator yielding one example per line read from the csv file
dataset = tf.data.Dataset.from_generator(
    functools.partial(generator_fn, words, tags),
    output_shapes=shapes, output_types=types)
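The generator passed to from_generator is not shown here. Below is a minimal sketch of what it might look like. It assumes, hypothetically, that words and tags come from two parallel sources with one space-separated sentence per line. The file layout, the parse_fn helper, the byte encoding and the tag assignment in the usage lines are all assumptions for illustration, not the original code.

```python
import functools
import io


def parse_fn(line_words, line_tags):
    """Turn one line of words and one line of tags into (features, labels)."""
    words = [w.encode() for w in line_words.strip().split()]
    tags = [t.encode() for t in line_tags.strip().split()]
    assert len(words) == len(tags), "each word needs exactly one tag"
    # Features are the words plus the sentence length (usable as sequence_length).
    return (words, len(words)), tags


def generator_fn(words_file, tags_file):
    """Yield one parsed example per line; accepts paths or open file objects."""
    f_words = open(words_file) if isinstance(words_file, str) else words_file
    f_tags = open(tags_file) if isinstance(tags_file, str) else tags_file
    with f_words, f_tags:
        for line_words, line_tags in zip(f_words, f_tags):
            yield parse_fn(line_words, line_tags)


# Usage with in-memory files standing in for the real data files.
# The tag assignment below is illustrative only.
words_io = io.StringIO("nut free chocolate\n")
tags_io = io.StringIO("NV-S PR-S BQ-S\n")
examples = list(functools.partial(generator_fn, words_io, tags_io)())
```

Wrapping the generator in functools.partial, as in the snippet above, gives from_generator the zero-argument callable it expects.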
After getting the dataset, we can go further:
# shuffle using a buffer of 100 elements, then repeat the dataset 5 times, which can be used to iterate over 5 epochs
dataset = dataset.shuffle(100).repeat(5)
# create batches of 2500 examples, padded to `shapes` with the values in `defaults`, and prefetch up to 2500 batches for the next iterations
dataset = dataset.padded_batch(2500, shapes, defaults).prefetch(2500)
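To make the effect of padded batching concrete, here is a plain-Python sketch of what padding a batch means. It does not use TensorFlow; pad_batch, the "<pad>" word, the "O" tag and the T1..T3 tags are illustrative names, not part of the original pipeline.

```python
def pad_batch(examples, pad_word="<pad>", pad_tag="O"):
    """Pad every sentence in a batch to the length of the longest one."""
    max_len = max(len(words) for words, _ in examples)
    words_batch, nwords_batch, tags_batch = [], [], []
    for words, tags in examples:
        missing = max_len - len(words)
        words_batch.append(words + [pad_word] * missing)
        tags_batch.append(tags + [pad_tag] * missing)
        nwords_batch.append(len(words))  # true length, before padding
    return words_batch, nwords_batch, tags_batch


batch = [
    (["nut", "free", "chocolate"], ["T1", "T2", "T3"]),
    (["milk"], ["T3"]),
]
words, nwords, tags = pad_batch(batch)
```

Keeping the true lengths next to the padded batch is what later lets the LSTM and the CRF ignore the padded positions through their sequence-length argument.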
Another class, tf.data.TextLineDataset, extends tf.data.Dataset; it takes the csv file name as an argument and reads the file line by line for you. This API does a lot of memory management for you when you're using its file-based datasets. You can, for example, read dataset files much larger than memory, or read multiple files by passing a list as the argument.
The TensorFlow Estimator API (with a custom Estimator) and tf.data have been used to write all the training, modeling and evaluation code. A custom Estimator is an instance of the tf.estimator.Estimator class, which wraps a model specified by a model_fn. The tf.estimator.train_and_evaluate utility function trains, evaluates, and (optionally) exports the model using the given estimator. The model_fn for the model is as below:
1. Word Representation: First, generate the character embeddings:
#For each sentence and words, we have a list of characters
#We find the index of character present in the dictionary of all characters
char_ids = vocab_chars.lookup(chars) #[sentence, words, chars]
#Create a variable of shape [total_number_of_chars, char_embedding_dim=100] holding the embeddings of all characters, initialized with random floating point numbers
variable = tf.get_variable(
    'chars_embeddings', [num_chars, params['dim_chars']], tf.float32)
#Lookup the embeddings of the chars in char_ids
char_embeddings = tf.nn.embedding_lookup(variable, char_ids, validate_indices=False) #[sentence, word, chars, char_dim=100]
#Adding a dropout in the layer
char_embeddings = tf.layers.dropout(char_embeddings, rate=dropout,
                                    training=(mode == tf.estimator.ModeKeys.TRAIN))  # assumption: dropout is active only in TRAIN mode
#max number of words per sentence in the batch
dim_words = tf.shape(char_embeddings)[1]
#max number of characters per word in the batch
dim_chars = tf.shape(char_embeddings)[2]
flat = tf.reshape(char_embeddings, [-1, dim_chars, params['dim_chars']]) #[sentence*max_words_in_sentence, max_chars_in_all_words, char_dim=100]
#convert from batch-major to time-major, as required by the fused cells in tf.contrib.rnn
t = tf.transpose(flat, perm=[1, 0, 2])
#Initializing LSTM each having 25 units
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(25)
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(25)
#Creating backward dir LSTM
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
#The second return value is the final state (c, h); we keep h, of shape [sentence*max_words_in_sentence, char_embd_size=25]
#Here the time steps, i.e. sequence_length, are the number of chars in each word
_, (_, output_fw) = lstm_cell_fw(t, dtype=tf.float32, sequence_length=tf.reshape(nchars, [-1]))
#Final state of the backward LSTM
_, (_, output_bw) = lstm_cell_bw(t, dtype=tf.float32, sequence_length=tf.reshape(nchars, [-1]))
output = tf.concat([output_fw, output_bw], axis=-1) # [sentence*max_words_in_sentence, char_embd_size=25+25=50]
#Reshape to [num_of_sentences, max_num_of_words, 50]
char_embeddings = tf.reshape(output, [-1, dim_words, 50])
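The reshapes in this step are easy to lose track of. The following sketch only does the shape bookkeeping in plain Python, with no TensorFlow involved. The function name and dictionary keys are made up for illustration; the default sizes (100-d character embeddings, 25 LSTM units per direction) are the ones used in the snippet above.

```python
def char_lstm_shapes(n_sentences, max_words, max_chars, dim_chars=100, lstm_units=25):
    """Return the tensor shape after each step of the character-level encoder."""
    shapes = {}
    shapes["char_embeddings"] = (n_sentences, max_words, max_chars, dim_chars)
    # Merge the sentence and word axes: every word becomes one sequence of characters.
    shapes["flat"] = (n_sentences * max_words, max_chars, dim_chars)
    # Time-major layout expected by the fused LSTM cells.
    shapes["time_major"] = (max_chars, n_sentences * max_words, dim_chars)
    # Final hidden state of each direction: one vector per word.
    shapes["final_state_one_direction"] = (n_sentences * max_words, lstm_units)
    # Forward and backward final states concatenated on the last axis.
    shapes["concat"] = (n_sentences * max_words, 2 * lstm_units)
    # Split the merged axis back into sentences and words.
    shapes["word_level"] = (n_sentences, max_words, 2 * lstm_units)
    return shapes


shapes = char_lstm_shapes(n_sentences=2, max_words=3, max_chars=9)
```

The key point is that each word is treated as its own short sequence of characters, so the batch dimension seen by the character LSTM is sentences times words.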
Now, generate the word embeddings as well:
#For each sentence, we have a list of words
#We find the index of words present in the dictionary of all words
word_ids = vocab_words.lookup(words) #[sentence, words]
#Getting the glove embeddings of all the words
glove = np.load(params['glove'])['embeddings']
#Append an extra all-zero embedding, returned if a word is not found
variable = np.vstack([glove, [[0.] * params['dim']]])
variable = tf.Variable(variable, dtype=tf.float32, trainable=False)
#Look up the word embeddings in the dictionary we created as non-trainable
word_embeddings = tf.nn.embedding_lookup(variable, word_ids) #[sentence, word, glove_word_dim = 300]
# Concatenate Word and Char Embeddings
embeddings = tf.concat([word_embeddings, char_embeddings], axis=-1)
#[sentence, word, 300+50=350]
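The role of the extra all-zero row can be illustrated with a toy lookup in plain Python. This sketch assumes the vocabulary table maps every out-of-vocabulary word to the index right after the last known word, which is an assumption about how vocab_words was built. The 3-d and 2-d vectors stand in for the real 300-d GloVe and 50-d character-level vectors, and "somebrand" is a made-up unknown word.

```python
def build_lookup(vocab):
    """Map each known word to its row; unknown words map to one extra row."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    oov_id = len(vocab)  # index of the extra row appended after the pre-trained rows
    return lambda word: word_to_id.get(word, oov_id)


# Toy 3-d "pre-trained" vectors standing in for the 300-d GloVe vectors.
vocab = ["nut", "free", "chocolate"]
pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
table = pretrained + [[0.0] * 3]  # extra all-zero row for out-of-vocabulary words

lookup = build_lookup(vocab)
word_vec = table[lookup("chocolate")]
oov_vec = table[lookup("somebrand")]

# A toy 2-d character-level vector standing in for the 50-d Char-LSTM output.
char_vec = [0.5, -0.5]
combined = oov_vec + char_vec  # list concatenation, analogous to tf.concat on the last axis
```

For an unknown word the word-level part is all zeros, so the character-level part is the only signal the downstream LSTM receives; this is the motivation given earlier for the character embeddings.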
2. Contextual Word Representation
#Time major, input shape= [sentences, words, 350]
t = tf.transpose(embeddings, perm=[1, 0, 2])
#Forward and Backward lstm each of 100 units
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(100)
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(100)
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
# time steps, i.e. sequence_length, are the number of words in each sentence
output_fw, _ = lstm_cell_fw(t, dtype=tf.float32, sequence_length=nwords) #[words, sentence, 100] (time-major)
output_bw, _ = lstm_cell_bw(t, dtype=tf.float32, sequence_length=nwords) #[words, sentence, 100] (time-major)
# Concatenate the forward and backward encodings
output = tf.concat([output_fw, output_bw], axis=-1) #[words, sentence, 100+100=200]
#Transpose back to the original batch-major shape
output = tf.transpose(output, perm=[1, 0, 2]) #[sentence, words, 200]
#Create a dense layer to reduce the output to num of tags
logits = tf.layers.dense(output, num_tags) # [sentence, word, num_of_tag=6]
3. Decoding using CRF:
#Create a [num_tags, num_tags] variable holding the transition scores from one tag to the next, used when scoring a particular combination of tags
crf_params = tf.get_variable("crf", [num_tags, num_tags], dtype=tf.float32)
# Determine the best tag for every word of each sentence; the sequence lengths are the number of words per sentence
pred_ids, _ = tf.contrib.crf.crf_decode(logits, crf_params, nwords) # [sentence, words]
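Conceptually, CRF decoding searches for the tag sequence with the highest total score (per-word tag scores plus tag-to-tag transition scores), which is done with the Viterbi algorithm. The following is a small plain-Python sketch of that idea for a single sentence. It is not the TensorFlow implementation, and the toy scores are made up.

```python
def viterbi_decode(scores, transitions):
    """Return the highest-scoring tag sequence and its score.

    scores[t][j]      : score of tag j for word t (the dense-layer output)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(scores[0])
    best = list(scores[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for j in range(num_tags):
            # Best previous tag i for the current tag j.
            i_best = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + scores[t][j])
        best = new_best
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag.
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example: 3 words, 2 tags. Moving from tag 0 to tag 1 is heavily penalized.
scores = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[0.0, -3.0], [0.0, 0.5]]
path, score = viterbi_decode(scores, transitions)
```

In this toy example, picking the best tag for each word independently would start with tag 0, but the transition penalty makes the all-tag-1 sequence the better choice overall. This is the benefit of decoding with a CRF rather than taking a per-word argmax.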
Calculating Loss and optimizing it:
#Use the negative log-likelihood as the loss function; the sequence lengths are the number of words per sentence
log_likelihood, _ = tf.contrib.crf.crf_log_likelihood(logits, correct_tags, nwords, crf_params)
loss = tf.reduce_mean(-log_likelihood)
#Use the Adam optimizer to minimize the loss
if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_or_create_global_step())
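For intuition, the CRF log-likelihood of a tag sequence is its score minus the log of the summed exponentiated scores of all possible tag sequences. The sketch below computes this for one sentence in plain Python with the forward algorithm and checks the result against brute-force enumeration. It is an illustration with made-up toy scores, not the TensorFlow implementation.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(tags, scores, transitions):
    """Unary scores of the chosen tags plus the transition scores between them."""
    total = sum(scores[t][tag] for t, tag in enumerate(tags))
    total += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return total


def log_partition(scores, transitions):
    """Log of the summed exp-scores of all tag sequences (forward algorithm)."""
    num_tags = len(scores[0])
    alpha = list(scores[0])
    for t in range(1, len(scores)):
        alpha = [
            scores[t][j] + logsumexp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return logsumexp(alpha)


def neg_log_likelihood(tags, scores, transitions):
    """Loss for one sentence: log-partition minus the score of the correct tags."""
    return log_partition(scores, transitions) - sequence_score(tags, scores, transitions)


# Toy example: 3 words, 2 tags.
scores = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[0.0, -3.0], [0.0, 0.5]]
brute_force = logsumexp([
    sequence_score(seq, scores, transitions)
    for seq in itertools.product(range(2), repeat=3)
])
loss_good = neg_log_likelihood([1, 1, 1], scores, transitions)
loss_bad = neg_log_likelihood([0, 1, 0], scores, transitions)
```

The loss is small when the correct tag sequence already has a high score relative to all alternatives, so minimizing it pushes both the per-word scores and the transition matrix towards the labeled sequences.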
The trained estimator can be exported as a SavedModel (the inference graph together with the model parameters) into a given directory using the export_saved_model method; the result can then be loaded by TensorFlow Serving.
This method takes the directory path where the model will be saved and a serving_input_receiver_fn, a function that returns a ServingInputReceiver in which the input features are passed as a dict.
The SavedModel is exported as described above after training on datasets created for several scenarios. The same model is loaded into TensorFlow Serving (Docker version), which exposes a REST API and returns the tags predicted for a particular search query.
docker run -p 8501:8501 \
--mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
-t tensorflow/serving --model_config_file=/models/models.config
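Once the container is running, a query can be sent to the REST predict endpoint on port 8501. The sketch below uses only the Python standard library. The model name "crocodile" and the input names "words" and "nwords" are assumptions; they must match the name in models.config and the feature dict of the exported serving signature.

```python
import json
import urllib.request


def build_predict_request(query, model_name, host="localhost", port=8501):
    """Build the URL and JSON body for TensorFlow Serving's REST predict API."""
    words = query.split()
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    # "words" and "nwords" are assumed input names (see the note above).
    body = {"instances": [{"words": words, "nwords": len(words)}]}
    return url, json.dumps(body).encode("utf-8")


def predict(query, model_name):
    """Send the query to a running TensorFlow Serving container."""
    url, data = build_predict_request(query, model_name)
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["predictions"]


# Building the request does not need a running server; "crocodile" is a hypothetical model name.
url, data = build_predict_request("nut free chocolate", model_name="crocodile")
```

Calling predict("nut free chocolate", "crocodile") would then return the "predictions" field of the server's JSON response, provided the server is reachable and the names match.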
The TF-Serving model response for the query “nut free chocolate” looks something like the following:
The tag “NV-S” represents the nutrition tag, “PR-S” the preposition tag, and “BQ-S” the base query.
The discussion of the deployment part will continue in Part III.