Teach seq2seq models to learn from their mistakes using deep curriculum learning (Tutorial 8)

Source: Deep Learning on Medium

Scheduled sampling helps a seq2seq model learn from its own mistakes

This tutorial is the eighth in a series of tutorials that will help you build an abstractive text summarizer using TensorFlow.

Today we will use curriculum learning to solve a major problem that seq2seq models suffer from.

A seq2seq model is trained by maximizing the likelihood of the next token given BOTH

  1. the previous token (from the previous LSTM step)
  2. the ground truth summary

while in inference (testing), it can only depend on

  1. the previous token

since no ground truth summary can be provided at test time.

The seq2seq model has been trained to depend on outside help, yet at test time it is forced to depend only on itself, something it has never been raised to do!!

This causes a major problem: a discrepancy between training and inference (testing), known as the exposure bias problem.

There have been multiple approaches to solving this problem. One of them is to make the model begin learning to depend on itself while still in training, by exposing it to its own mistakes so that it tries to correct them (i.e., it learns from its mistakes during the training phase). This is called scheduled sampling, a form of curriculum learning that we will use to help our seq2seq models.

This model has been implemented using TensorFlow; the code can be found here, in a Jupyter notebook built to run on Google Colab and connect seamlessly with Google Drive, so there is no need to run code on your machine or download the data, as everything can be done on Google Colab for free (more on this).

This tutorial builds on the concepts introduced by Bengio, Vinyals, Jaitly, and Shazeer from Google in their paper (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks).

The code comes from yasterk; I have modified it to run on Google Colab (my code).

0 - About the Series

This is a series of tutorials that will help you build an abstractive text summarizer using TensorFlow through multiple approaches. We call it abstractive because we teach the neural network to generate words, not merely copy them.

We have covered so far (code for this series can be found here):

  0. Overview on the free ecosystem for deep learning (how to use Google Colab with Google Drive)
  1. Overview on the text summarization task and the different techniques for the task
  2. Data used and how it can be represented for our task (prerequisites for this tutorial)
  3. What seq2seq is for text summarization and why
  4. Multilayer Bidirectional LSTM/GRU
  5. Beam Search & Attention for text summarization
  6. Building a seq2seq model with attention & beam search
  7. Combination of Abstractive & Extractive methods for Text Summarization
EazyMind free AI-as-a-service for text summarization

You can actually try generating your own summaries from the output of this series through eazymind (and see what you will eventually be able to build yourself). You can also call it through simple API calls or through a Python package, so that text summarization can be easily integrated into your application without the hassle of setting up the TensorFlow environment. You can register for free and enjoy using this API for free.

So let's begin!!

1 - Exposure bias problem (the model has never been raised to depend on itself!!)

seq2seq models are trained to depend on

  1. the output from the previous node of the decoder (i.e., the output of the previous state)
  2. and the ground truth summary (fed as decoder input)

But the problem arises in the inference (testing) step, where the model is not provided with the ground truth summary; it depends only on

  1. the output from the previous node (the previous LSTM decoder step)

This causes a difference between how the model is trained and how it runs in inference (testing); this problem is called exposure bias.
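To make the mismatch concrete, here is a minimal sketch of the two decoding modes: teacher forcing during training versus free-running during inference. The `decoder_step` stub here is purely illustrative (a real decoder would be an LSTM cell producing a distribution over the vocabulary); the names are assumptions, not from the tutorial's codebase.

```python
# Toy decoder "step": stands in for one LSTM decoder step.
# This stub just maps the previous token to (token + 1) mod 10.
def decoder_step(prev_token, hidden):
    return (prev_token + 1) % 10, hidden

def decode(target, teacher_forcing):
    """Run the decoder either with teacher forcing (training mode)
    or free-running on its own outputs (inference mode)."""
    hidden = None
    prev = 0  # start-of-sequence token
    outputs = []
    for t in range(len(target)):
        pred, hidden = decoder_step(prev, hidden)
        outputs.append(pred)
        # training: the next input is the ground-truth token
        # inference: the next input is the model's own prediction
        prev = target[t] if teacher_forcing else pred
    return outputs

target = [3, 7, 2, 5]
print(decode(target, teacher_forcing=True))   # inputs come from the ground truth
print(decode(target, teacher_forcing=False))  # inputs come from the model itself
```

The two calls produce different sequences because the decoder is fed different inputs at each step; this is exactly the train/test mismatch the tutorial describes.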

2 - How does the exposure bias problem affect our model!!

In the inference (testing) phase, as we have just said, the model depends only on the previous step, which means it depends entirely on itself.

The problem arises when the model produces a bad output at time step (t-1): that bad output affects all the following steps, leading the model into an entirely different state space from the one it saw during training, so it simply won't know what to DO!! This results in an accumulation of bad output decisions.

3 - Let's solve it with curriculum learning

A solution to this problem, suggested by Bengio et al. from Google Research, is to gradually shift the model's reliance from being totally dependent on the ground truth supplied to it, to depending on itself (i.e., depending only on the tokens it generated in previous decoder time steps).

The concept of making the learning path more difficult over time (i.e., gradually making the model depend only on itself) is called curriculum learning.

Their technique for implementing this is truly ingenious; they call it scheduled sampling.

They built a simple sampling mechanism that randomly chooses (during training) where to sample from, either

  1. the ground truth (with probability ei, where i is the batch number)
  2. the model itself (with probability 1 - ei)

So let's flip a coin:

when it's heads (with probability ei) → we use the ground truth summary

when it's tails (with probability 1 - ei) → we use the output from the previous time step
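The coin flip above can be sketched in a few lines of Python (the function and variable names here are illustrative, not taken from the tutorial's codebase):

```python
import random

def choose_decoder_input(ground_truth_token, model_token, epsilon):
    """Scheduled sampling: with probability epsilon feed the ground-truth
    token, otherwise feed the model's own previous prediction."""
    if random.random() < epsilon:
        return ground_truth_token   # "heads": teacher forcing
    return model_token              # "tails": model depends on itself

random.seed(0)
# early in training (epsilon high) the ground truth dominates
picks = [choose_decoder_input("truth", "model", epsilon=0.9)
         for _ in range(1000)]
print(picks.count("truth"))  # roughly 900 of the 1000 picks
```

In a real training loop this choice is made at every decoder time step of every batch, so the model sees a mix of gold tokens and its own (possibly wrong) predictions.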


Intuitively, we can do even better: instead of a constant e, it can be variable. At the beginning of training we favor using the ground truth summaries, while toward the end of training we favor using the output from the model itself, since by then the model will have learned much more. So let's schedule the decay of the probability e.

The decay of e itself can be a function of the number of training iterations.

Decay schedules for e (graph from Bengio et al., Google Research)

This is where the name scheduled sampling comes from.
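The scheduled-sampling paper proposes three decay schedules for ei as a function of the batch index i: linear, exponential, and inverse sigmoid. Here is a minimal sketch; the constants (c, k, the floor) are illustrative choices, not the paper's tuned values.

```python
import math

# epsilon_i = probability of feeding the ground truth at batch i.

def linear_decay(i, eps0=1.0, c=1e-4, floor=0.0):
    """Linear decay: eps0 - c*i, clipped at a minimum value."""
    return max(floor, eps0 - c * i)

def exponential_decay(i, k=0.9999):
    """Exponential decay: k**i, with k < 1."""
    return k ** i

def inverse_sigmoid_decay(i, k=500.0):
    """Inverse sigmoid decay: k / (k + exp(i/k)), with k >= 1.
    Stays near 1 early in training, then falls off smoothly."""
    return k / (k + math.exp(i / k))

for i in (0, 5000, 50000):
    print(i,
          round(linear_decay(i), 3),
          round(exponential_decay(i), 3),
          round(inverse_sigmoid_decay(i), 3))
```

All three start near 1 (mostly teacher forcing) and decay toward 0 (the model feeds on its own outputs); the inverse sigmoid is often preferred because it keeps e high for longer before the handover.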

4 - Implementing scheduled sampling in TensorFlow

Yasterk built a great TensorFlow library that lets you implement multiple text summarization papers, one of which is (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks). I have modified it to run on Google Colab (my code).

The library can be adjusted to implement different papers just by modifying the flags. Here (in my Jupyter notebook) I have set the required flags and also enabled a decoder variant called the intradecoder (to limit word repetition), so you can simply run the example with the flags already set.

We work on the CNN / Daily Mail news dataset, which is widely used for this task. You can also copy the dataset directly from my Google Drive to your own Google Drive (without needing to download and then re-upload it) and connect it seamlessly to your Google Colab (more about this).

Next time, if GOD wills, we will go through combining reinforcement learning with deep learning to solve the exposure bias problem and other problems that seq2seq models suffer from.

I truly hope you have enjoyed reading this tutorial and that I have made these concepts clear. All the code for this series of tutorials can be found here; you can simply use Google Colab to run it. Please review the tutorial and the code and tell me what you think. Don't forget to try out eazymind for free text summarization generation. Hope to see you again!