End to End Chatbot using Sequence to Sequence Architecture

Source: Deep Learning on Medium

By Sai Sandeep

Ever felt bored when you are all alone? Ever wished you could talk to someone who gives witty replies? If so, why not train one yourself? I mean a deep learning model. Over the past half-decade, deep learning has grown enormously powerful, driven by state-of-the-art architectures and algorithms that have emerged from research happening around the world. With these advances, artificial intelligence is moving in the right direction in catching up with human cognitive abilities. One such ability is the ability to talk, or to reply.

Chatbots have been around for quite a while now. ELIZA was the first chatbot, built way back in 1966 at the MIT Artificial Intelligence Laboratory. It was based on pattern matching techniques and involved a lot of hard-coded rules to make it simulate a human conversation. Since then, many chatbots have been developed, but they could not bring semantic meaning to their conversations until the advent of deep learning. A lot of deep learning based chatbots, including Siri, Google Assistant, Cortana and Microsoft’s controversial Tay, have narrowed a gap that had remained unfilled for decades.

To understand how these deep learning based chatbots work, I have implemented a chatbot using the encoder-decoder architecture.

Dataset and pre-processing

When building a chatbot, the nature of the dataset we choose plays a very important role, as it defines the characteristics of the chatbot. As hinted in the introduction, I wanted to build a chatbot that could give sarcastic replies just like Chandler from the FRIENDS series. So I went through many datasets and fortunately found the dialogue dataset of FRIENDS itself here. After processing, this dataset had 57 thousand questions and answers.

Training an effective deep learning model requires feeding it a lot of data, so I had to search for other datasets that were large and publicly available. Unfortunately, I could not find a single dataset with a huge number of question-answer pairs, but I did find smaller ones. So I stacked them into one big dataset to have enough data for the model.

The following is the list of datasets I have used:

  1. FRIENDS dialogue Dataset
  2. Cornell Movie dialogue Dataset
  3. Question Answer jokes Dataset
  4. South park Dataset
  5. BNCSpitwordCorpus Dataset

All the above datasets were in different formats. I have restructured them so that each line contains a question and its corresponding answer.


Before performing any preprocessing, let’s look at how the sentences are in the dataset so we can get an idea of what we could do to these sentences.

Questions sample from the dataset

As we can see from the above result, the sentences seem to be well structured, so only standard pre-processing, such as removing the special characters and expanding the contractions, needs to be performed.


Be it machine learning or deep learning, a model cannot work with words or letters directly. We have to convert the words into numbers. One simple and straightforward approach is to tokenize the sentences so that each unique word is represented by a number. This way, we build the vocabulary of words in the sentences. Besides building the vocabulary, we must also be able to convert words to tokens and tokens back to words. For this purpose, we can define the following class:

An example of tokenization
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    # Add every word of a sentence to the vocabulary
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    # Register a new word, or increase the count of a known one
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1
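
To make the word-to-token mapping concrete, here is a minimal, self-contained sketch of the same idea using plain dictionaries. The helper names build_vocab, encode and decode are illustrative and are not part of the project code; only the reserved PAD/SOS/EOS indices mirror the constants above.

```python
PAD_TOKEN, SOS_TOKEN, EOS_TOKEN = 0, 1, 2  # same reserved indices as above


def build_vocab(sentences):
    """Assign the next free integer to every previously unseen word."""
    word2index = {}
    index2word = {PAD_TOKEN: "PAD", SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word2index:
                index = len(index2word)
                word2index[word] = index
                index2word[index] = word
    return word2index, index2word


def encode(sentence, word2index):
    """Words -> token ids, terminated by the end-of-sentence token."""
    return [word2index[w] for w in sentence.split(" ")] + [EOS_TOKEN]


def decode(tokens, index2word):
    """Token ids -> words, dropping the reserved tokens."""
    reserved = {PAD_TOKEN, SOS_TOKEN, EOS_TOKEN}
    return " ".join(index2word[t] for t in tokens if t not in reserved)


word2index, index2word = build_vocab(["how are you", "how you doing"])
tokens = encode("how you doing", word2index)
print(tokens)                      # [3, 5, 6, 2]
print(decode(tokens, index2word))  # how you doing
```

The round trip works because word2index and index2word are kept in sync: every new word receives the next unused index in both dictionaries.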

Now we can write a function that takes each sentence from the dataset, pre-processes it and sends it to Voc for building the vocabulary. Besides the pre-processing, we also drop those question-answer pairs whose question or answer contains rare words (words that occur fewer than 5 times in the entire corpus). This can be seen as a hack that helps the model converge faster.

import re
import unicodedata

# MAX_LENGTH (the maximum sentence length to keep) and good_prefixes (a dict that maps
# contractions to their expanded forms) are assumed to be defined earlier in the project.

# Strip accents by dropping the Unicode combining marks
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, expand contractions, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = " ".join([good_prefixes[each] if each in good_prefixes else each for each in s.split()])
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    with open(datafile, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('<CoSe>')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using filterPair condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs
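
The rare-word trimming described earlier is not part of the listing above. A minimal sketch of one way to do it follows; MIN_COUNT matches the threshold of 5 mentioned in the text, while the function name trim_rare_pairs and the toy data are assumptions made for illustration.

```python
from collections import Counter

MIN_COUNT = 5  # words seen fewer than this many times are treated as rare


def trim_rare_pairs(pairs, min_count=MIN_COUNT):
    """Keep only the pairs whose question and answer contain no rare word."""
    counts = Counter(
        word for pair in pairs for sentence in pair for word in sentence.split(" ")
    )

    def is_common(sentence):
        return all(counts[word] >= min_count for word in sentence.split(" "))

    return [pair for pair in pairs if is_common(pair[0]) and is_common(pair[1])]


# Toy corpus: "hi" occurs six times, "hello" five times, "zyzzyva" only once.
pairs = [["hi", "hello"]] * 5 + [["hi", "zyzzyva"]]
kept = trim_rare_pairs(pairs)
print(len(pairs), "->", len(kept))  # 6 -> 5
```

Dropping whole pairs, rather than deleting individual words, keeps every remaining question and answer grammatically intact.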

Batchwise input and data preparation

In my initial attempts, I tried building the model by sending a single sentence at a time as input. This was unsuccessful: the model did not converge even after running for many epochs. So I switched to a batch-wise approach, where we send multiple sentences into the model at once.

As sentences vary in length, we pick batch_size sentences at random and pad each of them with the PAD token so that every sentence is as long as the longest sentence in that batch. This way, every batch becomes a rectangular matrix of fixed size.

Padding the sentences

But there is a problem with the batch built above. For RNNs, we feed the input one time step at a time: at time step T, we have to send the T-th word of every sentence as input. For this purpose, we have to transpose the matrix we got in the previous step.
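
The padding and transposition steps can be sketched together in a few lines of plain Python. The token ids below are made up for illustration. Because itertools.zip_longest reads the batch column by column, it pads the short sentences and produces the time-major layout in a single pass.

```python
import itertools

PAD_TOKEN = 0

# Three already-tokenized sentences of different lengths (ids are invented).
batch = [
    [5, 8, 3, 2],
    [7, 2],
    [4, 9, 2],
]

# Row-major padding: one row per sentence, shape batch_size x max_length.
max_length = max(len(sentence) for sentence in batch)
padded = [sentence + [PAD_TOKEN] * (max_length - len(sentence)) for sentence in batch]

# Time-major layout: one row per time step, shape max_length x batch_size.
time_major = [list(step) for step in itertools.zip_longest(*batch, fillvalue=PAD_TOKEN)]

print(padded)      # [[5, 8, 3, 2], [7, 2, 0, 0], [4, 9, 2, 0]]
print(time_major)  # [[5, 7, 4], [8, 2, 9], [3, 0, 2], [2, 0, 0]]
```

Row T of time_major now holds the T-th token of every sentence, which is exactly what the RNN consumes at time step T.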

We are sending a max_length * batch_size tensor as input to the model. So, we must also represent the output, i.e. the reply sentence, as a tensor of the same form, and this is used for calculating the loss. For a better understanding, let’s look at the complete flow of how the sentences are converted into tokens and then into tensors with the following code and its output.

import itertools
import random
import torch

# Convert a sentence to a list of word indexes, terminated by EOS
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

# Pad every sequence to the longest one; zip_longest also transposes the batch
def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

# Build a 0/1 matrix that marks the non-PAD positions
def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    # Boolean mask: recent PyTorch versions expect bool (not byte) masks in masked_select
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    # Sort by input length (longest first), as required for packing the sequences later
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        print("Input sentence-", pair[0])
        print("Output sentence-", pair[1])
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len

# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches
print("input_variable:", input_variable)
print("input shape:", input_variable.shape)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("Output shape:", target_variable.shape)
Illustration of sentences to tensors

As shown in the above image, the first column of the input variable corresponds to the first input sentence. As it is the longest one, all the other sentences are padded with the <PAD> token, which is represented by 0. We end up with two tensors of shape max_length * batch_size, one for the input sentences and one for the target sentences.


What will be the input and output for a chatbot? The input is a query sentence we send into the model and the output is a reply sentence we get from the model. Breaking it down further, we send in multiple words as input and expect multiple words as output. The important point to note is that the input and output sentences are of variable length, and the two are not necessarily of the same length. For example, when we give the input “Hi how are you?” we might get the output “Hello I am fine.”

Types of RNN

So when we are dealing with this kind of problem, where the input and output are both of variable length, the obvious choice is the RNN. Of the different RNN types shown in the above image, we use the many-to-many one (4th from left), also called a sequence-to-sequence model.

We can divide the architecture of the model into 3 main parts –

  1. Encoder
  2. Attention Layer
  3. Decoder


We can see the encoder as a simple RNN that takes the words of the question as input and returns a single vector representing all the words of the question without losing the sequential information. Let’s dive deep into how this works and the layers associated with it.

Basic Overview of Encoder

As seen from the above figure, at every time step we pass one word of the sentence as input to the model, and after the last token (EOS) we get a vector representing the encoder state. For a more detailed analysis, let’s inspect each layer of the encoder.

Layers in Encoder


Why do we need an embedding layer in a neural network? Why can’t we send the tokens directly into the model? The obvious reason is that tokenizing implicitly imposes a numeric order on the words, which is not meaningful. The next idea would be to one-hot encode the words. However, we have a very large vocabulary, and one-hot encoding would give every word as many dimensions as there are words, leaving us with a massive sparse dataset that would not work well with our neural network. To overcome this problem, we embed each word as a point in an N-dimensional space such that words that are semantically close to each other are grouped together, as shown in the figure.

Word embedding representation
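
The difference between the two representations can be sketched in plain Python. The vocabulary size, embedding size and random vectors below are made-up values for illustration; in a real model the embedding table is learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_size = 6, 3

# Embedding table: one small dense vector per token id (random here; learned in practice).
embedding = [[random.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(vocab_size)]


def one_hot(token, size):
    """Sparse representation: all zeros except a 1 at the token's index."""
    return [1.0 if i == token else 0.0 for i in range(size)]


def embed_by_lookup(token):
    """Dense representation: simply pick the token's row from the table."""
    return embedding[token]


def embed_by_matmul(token):
    """The same result written as one-hot vector times embedding matrix."""
    vec = one_hot(token, vocab_size)
    return [sum(vec[i] * embedding[i][j] for i in range(vocab_size)) for j in range(embedding_size)]


token = 4
print(len(one_hot(token, vocab_size)))  # 6, grows with the vocabulary
print(len(embed_by_lookup(token)))      # 3, fixed by embedding_size
```

An embedding layer is therefore equivalent to multiplying a one-hot vector by a weight matrix, but it skips the large sparse vector entirely and just reads one row.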

Now that we have an N-dimensional representation of the words, we need to pass it to the next layer, i.e. the GRU.

Gated Recurrent Unit –

GRUs are a simplified variant of the LSTM and are computationally less expensive. A GRU comprises two gates, the update gate and the reset gate. The basic structure of a GRU is given as follows

Structure of GRU

I don’t want to go into the details of how a GRU works, as there are many articles, including Christopher Olah’s blog, that provide a great understanding. From a bird’s-eye view, the update gate decides what information to throw away and what new information to add, while the reset gate decides how much past information to forget. We use a slightly more complex version of this called the bi-directional GRU.
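
For readers who prefer code to diagrams, the two gates can be sketched for a scalar input and a scalar hidden state. This is a toy illustration that follows the usual GRU equations (in the convention PyTorch documents for nn.GRU, with biases omitted); the weight names and values are invented, and a real layer uses weight matrices instead of single numbers.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, w):
    """One GRU step for a scalar input and a scalar hidden state.

    `w` holds hypothetical weights; a real layer uses matrices and biases.
    """
    r = sigmoid(w["xr"] * x + w["hr"] * h_prev)          # reset gate
    z = sigmoid(w["xz"] * x + w["hz"] * h_prev)          # update gate
    n = math.tanh(w["xn"] * x + r * (w["hn"] * h_prev))  # candidate state
    return (1.0 - z) * n + z * h_prev                    # blend old and new


weights = {"xr": 0.5, "hr": -0.3, "xz": 0.8, "hz": 0.1, "xn": 1.2, "hn": 0.7}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, weights)
    print(round(h, 4))
```

When the update gate z is close to 1 the old state is carried over almost unchanged, and when the reset gate r is close to 0 the candidate state ignores the past.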

Bi-directional GRU-

Bidirectional RNN

The idea behind a bidirectional RNN is to feed the input sequence in normal order to the first GRU and in reverse order to the second GRU. The outputs of the two GRUs can either be concatenated or summed. From the above figure, we can see that X0 is the first input for GRU A and the last input for GRU A’. For this specific problem, we sum the outputs of both GRUs, which results in the tensor y0. Besides this, we also take the last hidden state Si, which is passed to the next layers.
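
The forward and backward passes and the final summation can be sketched with a toy recurrent cell. The cell below (a tanh of the input plus a scaled previous state) stands in for a GRU purely for illustration, and it shares one weight between both directions, whereas a real bidirectional GRU learns separate weights for each direction.

```python
import math


def run_rnn(inputs, weight=0.5):
    """Toy recurrent cell: h_t = tanh(x_t + weight * h_{t-1}). Returns every h_t."""
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(x + weight * h)
        outputs.append(h)
    return outputs


def bidirectional_sum(inputs):
    """Run the cell left-to-right and right-to-left, then sum per time step."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]  # re-align with the original order
    return [f + b for f, b in zip(forward, backward)]


sequence = [0.2, -0.4, 0.9, 0.1]
outputs = bidirectional_sum(sequence)
print(len(outputs))  # 4, one summed output per time step
```

Each summed output combines what the forward pass has seen up to that position with what the backward pass has seen after it, so every time step has context from both sides.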

Code and flow of data-

To build the encoder, which comprises the embedding layer and the GRU and returns the outputs and the hidden state, we can use the following code

import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        # because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden

It’s always better to keep track of the shape of tensors after every layer. This will help in understanding the layer in a better way.

Shapes of tensors in Encoder

In the data preparation step discussed above, the shape of the input tensor was 12 * 5, signifying a batch size of 5 with max_length = 12. Now we pass this to the embedding layer with some embedding size, say 50. This means that every word or token in the tensor is converted to a 50-dimensional point. Hence our embedded input tensor has shape 12,5,50, denoting max_length * batch_size * embedding_size. This is then passed to a GRU layer whose hidden_size equals the embedding size, which results in an output tensor of shape 12,5,100. Here the last dimension is 100 rather than 50 because the bi-directional RNN concatenates the outputs of its two directions. We sum the two halves to get a final output tensor of shape 12,5,50.

Output — shape=(max_length, batch_size, hidden_size)=12,5,50

Hidden — shape =(n_layers x num_directions, batch_size, hidden_size)=2,5,50

Attention Layer and Decoder

What is the Attention Layer? Why do we need to have it for our architecture? Let’s take an example of image captioning. When we are given an image and asked to give a caption for it, we give more attention to specific parts of the image for adding specific words to caption.

Attention Mechanism Example

In the same way, while replying to a question, we pay attention to certain words of the question. Without attention, all the information from the encoder reaches the decoder only through a single context vector, and preserving all the information in it is difficult. Hence we use the attention mechanism.

In the image shown above, the context vector Ct is calculated by taking the weighted sum of all the encoder outputs. This is used in calculating the decoder’s current hidden state Ht, which in turn is used for calculating the output.
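
The weighted sum can be sketched in plain Python using the simplest scoring function, a dot product between the decoder state and each encoder output. The vectors below are invented for illustration, and the helper names are not taken from the project code.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_hidden, encoder_outputs):
    """Return (attention weights, context vector) for one decoder time step."""
    scores = [sum(h * e for h, e in zip(decoder_hidden, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    size = len(decoder_hidden)
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_outputs)) for j in range(size)]
    return weights, context


# Three encoder outputs (one per input word) and the current decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_hidden = [2.0, 0.0]

weights, context = dot_attention(decoder_hidden, encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first encoder output points in the same direction as the decoder state, so it receives the largest weight and dominates the context vector.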

The decoder output at every time step depends on 3 vectors

  1. Previous decoder output (y t-1)
  2. The hidden vector of the decoder (h t)
  3. The context vector that is coming from the attention layer(c t)

Teacher Forcing and Gradient Clipping-

As discussed earlier, the output of the decoder at time step t-1 is sent as input to the decoder at time step t. Alternatively, we can send the real target to the decoder instead of the decoder’s own prediction; this is called teacher forcing. It ensures faster convergence but comes with a problem: with teacher forcing, you are explicitly making the model do rote learning. Although this gives good results on the training data, where the real targets are available at every time step, the model struggles on test data, where no targets are provided. One way to overcome this problem is to apply teacher forcing only part of the time: with a ratio of 0.5, we send the target to the decoder 50% of the time and the decoder’s predicted value the remaining 50% of the time. This way we can strike a balance between faster convergence and rote learning.
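
The ratio can be sketched as a simple random choice. The function name and the string placeholders are assumptions for illustration; whether the decision is made once per training iteration or at every time step is a design choice that this sketch leaves to the caller.

```python
import random


def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the ground-truth token with probability `teacher_forcing_ratio`."""
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher_forcing else predicted_token


rng = random.Random(42)  # fixed seed so the run is repeatable
choices = [next_decoder_input("target", "prediction", 0.5, rng) for _ in range(10000)]
print(choices.count("target") / len(choices))  # close to 0.5
```

A ratio of 1.0 always feeds the real target and a ratio of 0.0 always feeds the model's own prediction, so the single parameter covers both extremes.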

Also, exploding gradients are a major problem in recurrent neural networks. To overcome this problem, we cap the gradients at a fixed threshold, a technique called gradient clipping.
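
One common form of gradient clipping rescales the whole gradient vector when its L2 norm exceeds a threshold, which is what PyTorch's torch.nn.utils.clip_grad_norm_ does. The sketch below shows that variant in plain Python; it is an assumption that this is the variant used in the project, and the gradient values are invented.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients)  # already small enough, leave untouched
    return [g * max_norm / total_norm for g in gradients]


exploding = [300.0, -400.0]            # L2 norm is 500
print(clip_by_norm(exploding, 50.0))   # [30.0, -40.0]
print(clip_by_norm([3.0, 4.0], 50.0))  # [3.0, 4.0]
```

Rescaling keeps the direction of the update and only shortens the step, which is why it is usually preferred over clamping each component independently.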

Code and flow of data-

The attention layer, which is part of the decoder network, can be implemented using the following code. Here we use a variant called Luong attention.

import torch
import torch.nn.functional as F

# Luong attention layer
class Attn(torch.nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = torch.nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = torch.nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = torch.nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)
        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()
        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

As attention layer is part of the decoder, we call the attention layer from the decoder to get the attention weights. The following code is used for implementing the decoder.

class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()
        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden

Now let’s look at the shape of outputs we get after passing through every layer

Decoder tensors shape

In every iteration, the tensors corresponding to time step t are sent as input. For the first iteration, the input tensor contains all 1’s, as we pass the <SOS> (start of sentence) token as input. After embedding, the tensor has the shape [1, batch_size, embedding dimension]. This loop iterates max_length times. The final output of the decoder at each step has the shape [batch_size, size of the vocabulary], and the word with the highest probability is returned as the output of that particular time step.
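
At inference time, picking the most probable word at every step and feeding it back in is known as greedy decoding. The loop can be sketched as follows; the step function is a hypothetical stand-in for one call to the decoder, and the toy probability table exists only to make the example runnable.

```python
SOS_TOKEN, EOS_TOKEN = 1, 2


def greedy_decode(step, max_length):
    """Feed the most probable token back in until EOS or max_length is reached.

    `step(token)` is a hypothetical stand-in for one decoder call; it returns a
    list of probabilities, one per vocabulary entry.
    """
    token, reply = SOS_TOKEN, []
    for _ in range(max_length):
        probabilities = step(token)
        token = max(range(len(probabilities)), key=probabilities.__getitem__)  # argmax
        if token == EOS_TOKEN:
            break
        reply.append(token)
    return reply


def toy_step(token):
    """Toy decoder over a 5-word vocabulary: SOS -> 3 -> 4 -> EOS."""
    table = {
        SOS_TOKEN: [0.0, 0.0, 0.1, 0.8, 0.1],
        3: [0.0, 0.0, 0.2, 0.1, 0.7],
        4: [0.0, 0.0, 0.9, 0.05, 0.05],
    }
    return table[token]


print(greedy_decode(toy_step, max_length=10))  # [3, 4]
```

The max_length bound guarantees that decoding stops even if the model never produces the end-of-sentence token.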

Loss Function

At its core, the problem is one of classifying words, where the number of classes equals the number of words in the vocabulary. For this classification problem, we use the standard cross-entropy loss, which is calculated from the probabilities predicted by the model. The loss is given as follows,

Multi-class log loss
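
For reference, the multi-class log loss is usually written as below. The notation is the standard one and is assumed to match the figure: N is the number of predictions, M is the size of the vocabulary, y_ij is 1 when word j is the correct class for prediction i (and 0 otherwise), and p_ij is the probability the model assigns to word j for prediction i.

```latex
\[
\text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \, \log p_{ij}
\]
```

Since y_ij is non-zero only for the correct word, the inner sum reduces to the negative log of the probability assigned to the correct word.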

As we are using batches in which the outputs are padded with PAD after the sentence is complete, it would not be right to include the PAD positions in the loss. Hence we introduce a masked loss, where we calculate the log loss only for the actual outputs and leave out the ones with the PAD token.

def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()
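
The same computation can be followed by hand in plain Python for a single time step. The probabilities and targets below are invented, and the function name masked_nll is illustrative; it derives the mask from the PAD token directly instead of receiving it as an argument.

```python
import math

PAD_TOKEN = 0


def masked_nll(probabilities, targets):
    """Average -log p(target) over the positions whose target is not PAD."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(probabilities, targets)
        if target != PAD_TOKEN
    ]
    return sum(losses) / len(losses), len(losses)


# One time step for a batch of three sentences over a 4-word vocabulary.
probabilities = [
    [0.1, 0.2, 0.5, 0.2],      # target 2, predicted with probability 0.5
    [0.25, 0.25, 0.25, 0.25],  # target is PAD, so this row is ignored
    [0.1, 0.1, 0.1, 0.7],      # target 3, predicted with probability 0.7
]
targets = [2, PAD_TOKEN, 3]

loss, n_total = masked_nll(probabilities, targets)
print(round(loss, 4), n_total)  # 0.5249 2
```

Only two of the three rows contribute, so the padded sentence neither rewards nor penalizes the model.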

Training the model and evaluation

There are many decisions that we have to make while training the model. From the number of epochs to teacher forcing and the learning rate, every parameter we set affects the model’s performance. After trying a lot of combinations, I have chosen to go with the following parameters for training the model.

# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 0.9
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 500000
print_every = 100
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    # Restore the optimizer states saved in the checkpoint; the *_sd names are
    # assumed to hold the state dicts loaded from loadFilename earlier
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Run training iterations
print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)

We can also store the weights of the model so that we can resume training from where we left off.

The loss is

and when we evaluate the model we get the following outputs

Productionizing the model-

As stated in the title of this blog, we are not restricting our model to a Jupyter notebook. We are going to take it one step further and productionize our deep learning model by developing an interactive web-based application around it.

The front end is developed using HTML and JavaScript. Node.js is used to build a server that listens on a port, takes requests from the webpage and redirects them to the Python program where the output is generated. The following is the server code written in Node.js

Socket programming is used for connecting the web server with the Python program.

The following snippet of code gets the list of users that are logged in and sends a message asking the user to put #gobot at the start of a sentence to chat with the bot.

Once a message prefixed with #gobot is received, the server redirects it to port 5000, where the chatbot application is running.
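
The routing decision itself is simple enough to sketch. The project's server is written in Node.js, so the Python below only illustrates the logic and is not the author's code; the function name extract_bot_query and the case-insensitive matching are assumptions.

```python
BOT_PREFIX = "#gobot"


def extract_bot_query(message):
    """Return the text after the prefix, or None when the bot is not addressed."""
    text = message.strip()
    if not text.lower().startswith(BOT_PREFIX):
        return None
    return text[len(BOT_PREFIX):].strip()


print(extract_bot_query("#gobot how are you?"))  # how are you?
print(extract_bot_query("hello everyone"))       # None
```

Messages that return None would be treated as ordinary chat, while any other result would be forwarded to the chatbot application.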

Now, to start the application, open a terminal in the directory where the server.js file is present and run the following command

npm start

In app.py, where our model lives, we use Flask, a web framework for Python, to start a server on another port, 5000, and receive the requests through the socket interface as shown below.

Now we have to start the Python server as well by going to the directory where app.py is located and typing the following command

python app.py

Now that we have both servers running on different ports, they can interact through socket programming. Open the browser and navigate to port 3000 to get the following screen.

Once you enter your name, you are done with all the hassles!

Why wait? Download the project from my GitHub and try it yourself!

Github — https://github.com/saisandeep97/Chat-botV2

Linkedin- https://www.linkedin.com/in/naraparajusaisandeep