Is the race over for Seq2Seq models?

Original article was published by Thushan Ganegedara on Artificial Intelligence on Medium

The idea of Seq2Seq, or sequence-to-sequence, models was introduced by Ilya Sutskever et al. in the paper “Sequence to Sequence Learning with Neural Networks”. They are essentially a particular arrangement of deep sequential models (a.k.a. RNN-based models, e.g. LSTMs/GRUs) [1], discussed later. The main type of problem addressed by these models is,

mapping an arbitrary length sequence to another arbitrary length sequence

Where might we come across such problems? Pretty much anywhere. Applications such as,

  • Machine translation
  • Text summarization
  • Question answering

are a few examples that can capitalize on such a model. These applications share a distinctive problem formulation: they require mapping an arbitrarily long source sequence to an arbitrary-length target sequence. For example, in English-to-French translation there is no one-to-one mapping between the words of the two languages. Often, translating from one language to another requires learning many complex features (one-to-many, many-to-one and many-to-many mappings, lexical dependencies, word alignment [2], etc.).

This is drastically different from image classification (i.e. fixed-size input → class/label) or sentiment analysis (i.e. arbitrary-length input → class/label).

Bonjour! Welcome to Machine Translation 101

Before sinking our teeth in further, it’s imperative that you have a clear understanding of how machine translation is formulated as a machine learning problem.

You have data belonging to two languages: the source language (the language you translate from) and the target language (the language you translate to). For example, if you want to translate from English to French, English is the source language and French is the target language.

Next, you have a text sequence (e.g. a sentence) of n_s elements, drawing words from a vocabulary of size V_s (English). During model training, you also have a text sequence of n_t elements, drawing words from a vocabulary of size V_t (French). Each word is represented as a vector of size d_w, obtained either through one-hot encoding or from word vectors (e.g. Word2vec or GloVe). Finally, during prediction, the model makes n_t sequential predictions over the V_t vocabulary. The following diagram illustrates the process.

The machine translation process
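To make the notation concrete, here is a minimal sketch of the input side of the problem. It assumes one-hot encoding (so d_w equals V_s) and uses a made-up five-word English vocabulary and a hypothetical helper called one_hot_encode; none of these come from the original diagram.

```python
import numpy as np

# Hypothetical toy source vocabulary (V_s = 5), chosen only for illustration.
source_vocab = ["<unk>", "i", "like", "green", "apples"]
word_to_id = {word: idx for idx, word in enumerate(source_vocab)}

def one_hot_encode(sentence, word_to_id):
    """Map a whitespace-tokenized sentence to an (n_s, V_s) one-hot matrix."""
    tokens = sentence.lower().split()
    # Words outside the vocabulary fall back to the <unk> id.
    ids = [word_to_id.get(tok, word_to_id["<unk>"]) for tok in tokens]
    encoded = np.zeros((len(ids), len(word_to_id)))
    encoded[np.arange(len(ids)), ids] = 1.0
    return encoded

x = one_hot_encode("I like apples", word_to_id)
print(x.shape)  # (n_s, V_s) = (3, 5)
```

With word vectors instead of one-hot encoding, the second dimension would be the embedding size d_w rather than V_s.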

Drilling down to the Seq2Seq model

Before jumping ahead to learn the fate of these models, let’s understand what they do in more detail. Here, I draw most of the visual aids and concepts from my Machine Translation in Python course at DataCamp. If you want to learn more on this topic, I invite you to try the course.

If you look at a seq2seq model and squint your eyes to blur out the details, you will see that it in fact has two components: an encoder and a decoder. You use the encoder-decoder concept in your day-to-day life more than you realize. Take the simple analogy of a teacher explaining what an elephant looks like.

The encoding process is analogous to a teacher explaining to you what an elephant looks like while you create a mental image of it. The decoding process takes place when a friend of yours asks what an elephant looks like. (Source: Machine Translation in Python)

While you listen to the teacher, the encoding takes place and you encode the mental image of an elephant. Then when your friend who missed the class asks what an elephant is, you will start decoding that mental image by either verbally explaining it to the friend or perhaps by drawing a picture.

From a more technical lens, here’s what a seq2seq model used for machine translation looks like.

Overview of an Encoder Decoder model (Source: Machine Translation in Python)

The encoder takes in an English sentence and creates a context vector (also called a thought vector), and then the decoder uses the context vector to decode the correct French translation.
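The encoder-decoder flow can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses a plain tanh RNN instead of a GRU or LSTM, the weights are random and untrained (so the output ids are meaningless), and the sizes and the SOS/EOS token ids are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V_s, V_t, hidden = 6, 7, 8   # hypothetical vocabulary and state sizes
SOS, EOS = 0, 1              # assumed start/end token ids in the target vocabulary

# Randomly initialized (untrained) parameters.
enc_in = rng.normal(scale=0.1, size=(V_s, hidden))
enc_rec = rng.normal(scale=0.1, size=(hidden, hidden))
dec_in = rng.normal(scale=0.1, size=(V_t, hidden))
dec_rec = rng.normal(scale=0.1, size=(hidden, hidden))
dec_out = rng.normal(scale=0.1, size=(hidden, V_t))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(source_ids):
    """Read the source word by word; the final state is the context vector."""
    h = np.zeros(hidden)
    for idx in source_ids:
        # enc_in[idx] equals a one-hot word vector multiplied by enc_in.
        h = np.tanh(enc_in[idx] + h @ enc_rec)
    return h

def decode(context, max_len=10):
    """Greedy decoding: start from the context vector, predict one word at a time."""
    h, prev, output = context, SOS, []
    for _ in range(max_len):
        h = np.tanh(dec_in[prev] + h @ dec_rec)
        probs = softmax(h @ dec_out)   # distribution over the target vocabulary
        prev = int(probs.argmax())
        if prev == EOS:
            break
        output.append(prev)
    return output

context = encode([2, 4, 3])      # a 3-word source sentence, given as word ids
translation = decode(context)    # a list of target word ids
print(context.shape, translation)
```

Note that the whole source sentence reaches the decoder only through the fixed-size context vector.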

Quick tour of GRUs

What do the encoder and the decoder consist of? Each comprises a deep sequential model (or several layers of such models). We will quickly gloss over the details of one such model: the Gated Recurrent Unit (GRU). The idea is,

h(0) = {0}^n  # A zero vector
for each i^th word w in sequence (e.g. sentence):
    h(i) = GRU(f(w), h(i-1))

f(w) is some numerical representation of the word (e.g. one-hot encoding or word vectors). I won’t explain what takes place inside the GRU function in the code. The important thing is that the GRU cell takes the current input and the previous output and produces the current output. I strongly recommend reading the provided resources to understand that in depth [1][3]. The figure below illustrates how GRUs process text sequences.

How Gated Recurrent Unit (GRU) works (Source: Machine Translation in Python)
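For readers who want to see the recurrence run, below is a minimal NumPy sketch of one common GRU formulation (an update gate, a reset gate and a candidate state). Gate conventions differ slightly between papers and libraries, so treat it as one possible version rather than a reference implementation. The function names, sizes and random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random weights for the update (z), reset (r) and candidate (c) paths."""
    params = {}
    for gate in ("z", "r", "c"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
        params["U_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def gru_step(x, h_prev, p):
    """One GRU step: combine the current input x with the previous output h_prev."""
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])         # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])         # reset gate
    c = np.tanh(x @ p["W_c"] + (r * h_prev) @ p["U_c"] + p["b_c"])   # candidate state
    return (1.0 - z) * h_prev + z * c                                # blend old and new

rng = np.random.default_rng(0)
d_w, n = 4, 3                          # hypothetical word-vector and state sizes
params = init_gru(d_w, n, rng)
sentence = rng.normal(size=(5, d_w))   # stand-in for f(w) of a 5-word sentence

h = np.zeros(n)                        # h(0): a zero vector
for word_vector in sentence:
    h = gru_step(word_vector, h, params)   # h(i) = GRU(f(w), h(i-1))
print(h.shape)  # (3,)
```

The loop mirrors the pseudocode: each step needs the output of the previous one, which is why the computation cannot be spread across time steps.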

Grand entrance of Transformers

Recurrent Neural Networks (RNNs) like LSTMs and GRUs were basking in their well-earned reputation for quite a while until they were recently challenged by a new kid on the block; something called a Transformer.

The Transformer model was introduced in [5]. It’s a very innovative concept that addresses two major weaknesses of RNNs:

  • RNNs cannot be parallelized across time steps, as the output of the t^th step depends on the output of the (t-1)^th step (thus the term recurrent neural networks)
  • RNNs struggle to preserve long-term dependencies in language, as each step only sees the memory passed on from the previous step

To understand the Transformer model better, we will assume a translation task from English to French. The task the model is trained to do is, given an English sentence, find the correct French translation.

We will now look at the main bells and whistles of the Transformer model. Note that I will not be discussing all the intricacies of the Transformer model, but just enough to understand how it differs from the Seq2Seq model. The transformer is also an encoder-decoder model.

Abstracting out the details, the Transformer looks very similar to an abstract Seq2Seq model

The encoder has several layers and the decoder has several layers. Each layer consists of two types of sublayers,

  • Self-attention layer
  • Fully-connected layer

The final decoder layer needs to include a softmax layer as it needs to produce probabilities over target language vocabulary for each position.

The detailed Transformer model. The encoder consists of several layers. Each layer consists of two sublayers: a self-attention sublayer and a fully-connected sublayer. The decoder also consists of several layers, where each layer consists of two attention sublayers (a masked self-attention sublayer and an attention sublayer over the encoder outputs) and a fully-connected sublayer. The diagram also shows what sort of connections are created between inputs and the sublayers. The self-attention sublayer looks at all the words at a given time, whereas the fully-connected sublayer processes words individually.

Self-attention layer

The self-attention layer is the groundbreaking concept of the Transformer model. Basically, the self-attention layer, while processing a single word in the sequence, enables the model to look at all the other words. Now why is this important? Imagine the following sentence,

The dog ran across the road to get its ball

Now visualize a model going from one word to another sequentially. When the model sees the word “its”, it helps to know that “its” refers to the dog. This goes for any NLP task, be it machine translation, dependency parsing or language modelling.

The self-attention layer enables the Transformer to do exactly that. While processing the word “its”, the model can look at all the other words and decide for itself which words are important to “mix” into the output, so that the Transformer can solve the task effectively. Additionally, this is a “weighted mix”, and the weights are learned during the training process. The following image visualizes this process.

How self-attention works when processing the word “its”. The attention layer has weights for each word, enabling the layer to create a “weighted mix” of words as the output. Essentially, the gray box encodes information about the words “its” and “dog”.
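The “weighted mix” can be written down as scaled dot-product self-attention, the form used in the Transformer paper [5]. The sketch below is a single attention head in NumPy with random, untrained projection matrices and random word vectors, so the printed weights are arbitrary; in a trained model one would hope to see a large weight between “its” and “dog”. The sizes and names are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, so each row of weights sums to 1."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of every word pair
    weights = softmax(scores)                 # weights[i, j]: how much word i attends to word j
    return weights @ v, weights               # each output row is a weighted mix of values

rng = np.random.default_rng(0)
words = "the dog ran across the road to get its ball".split()
d_model = 8                                   # hypothetical model dimension
x = rng.normal(size=(len(words), d_model))    # stand-in word representations
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
its = words.index("its")
# Attention weights used when building the output for "its".
print(list(zip(words, weights[its].round(2).tolist())))
```

Every row of the output is computed from the same matrices, so all positions can be processed in one batch of matrix multiplications.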

Note: You can also see that there is a masked self-attention layer in the decoder. It is there to mask any look-ahead that would otherwise happen during model training (that would be cheating). In other words, the decoder shouldn’t know what lies ahead of what it has seen so far. For more information on this, refer to the original paper.
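One common way to implement this masking is to set the attention scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. The sketch below assumes that approach; the helper names and the random scores are made up for illustration.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix: True where position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)   # future positions get -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))               # stand-in attention scores for 4 target words
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))                        # upper triangle is all zeros
```

The first target position can only attend to itself, the second to the first two positions, and so on.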

Fully-connected layer

There’s not much enigma around the fully-connected layer. It takes the separate self-attention outputs and produces a latent (i.e. hidden) representation for each word. Within a layer, the same fully-connected weights are shared across all timesteps. However, each layer has its own set of fully-connected weights.
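Weight sharing across timesteps simply means that the same small network is applied to every word’s vector independently. The sketch below assumes a two-layer network with a ReLU in between, as in the Transformer paper [5], with hypothetical sizes and random weights. It checks that processing all words at once gives the same result as processing them one by one.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Two-layer fully-connected network with ReLU, applied to each row (word) of x."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n_words, d_model, d_hidden = 5, 8, 16          # hypothetical sizes
attention_out = rng.normal(size=(n_words, d_model))
w1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
b2 = np.zeros(d_model)

# All positions in a single matrix multiplication...
all_at_once = feed_forward(attention_out, w1, b1, w2, b2)
# ...or one word at a time with the very same weights.
one_by_one = np.stack([feed_forward(row, w1, b1, w2, b2) for row in attention_out])
print(np.allclose(all_at_once, one_by_one))    # True
```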

Advantages of the Transformer

As you can see, none of the sublayers contain sequential computations that wait on the output of the previous time step (as LSTMs/GRUs do). This removes the need for the model to maintain a state/memory like LSTMs. Consequently, the Transformer can compute outputs in parallel, for all time steps at once (at least during training, when the whole target sequence is available).

Furthermore, at a given timestep, the self-attention sublayer sees all the other inputs. For this reason, preserving long-term dependencies in long text sequences becomes much easier.

Final verdict: Is there still hope for Seq2Seq models?

Now let’s come to the burning question. Are Seq2Seq models going to be obsolete very soon? Personally, I think not, for several reasons.

Yes: Seq2Seq models are still a good option for low-resource environments

Transformer-based models are quite large (e.g. BERT, GPT, XLNet). This limits the ability to use these models in restrictive environments like embedded or IoT devices. You can have a simple LSTM/GRU model at a fraction of the memory taken up by these massive models.

Note: It is worth highlighting that there have been, and still are, attempts to come up with smaller models that deliver performance comparable to the original models, notably DistilBERT. But these are still quite large compared to a simple RNN model (e.g. DistilBERT has around 66M parameters).
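A rough back-of-the-envelope comparison shows the scale of the gap. The sketch below counts the weights of a single-layer GRU encoder and decoder using the standard three-gate formula with one bias per gate; some library implementations add a second bias. The input and hidden sizes are hypothetical, and embedding and output layers are deliberately left out, even though they can dominate the total when vocabularies are large.

```python
def gru_parameter_count(input_dim, hidden_dim):
    """One GRU layer: 3 paths, each with input weights, recurrent weights and one bias."""
    return 3 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

input_dim, hidden_dim = 128, 256    # hypothetical word-vector and state sizes
encoder = gru_parameter_count(input_dim, hidden_dim)
decoder = gru_parameter_count(input_dim, hidden_dim)
recurrent_total = encoder + decoder

distilbert = 66_000_000             # approximate figure quoted in the note above
print(recurrent_total)              # 591360
print(distilbert / recurrent_total) # roughly two orders of magnitude
```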

Yes: Easy to prototype/understand

Say you are given an NLP problem and asked to assess the feasibility of using a Seq2Seq/Transformer model. You can get a Seq2Seq model up and running more quickly than a Transformer, as Seq2Seq models are much simpler and easier to understand. If you’re keen on learning how to implement Seq2Seq models and understand how they work, you can try my course “Machine Translation in Python” on DataCamp.

Yes: RNNs are evolving

Research is continually being done to improve RNN-based models and their ability to preserve long-term dependencies. One particular example is found in the paper Mogrifier LSTM [6].

No: The Transformer model delivers better performance across many NLP tasks

It’s a no-brainer: Transformer models have been shown time and again to outperform sequential models on most NLP tasks. Therefore, if you are all about performance and don’t need to worry about memory, Transformers would be the go-to solution.

No: Transformer models are more robust to adversarial attacks

Making models robust against adversarial attacks is an important field of research in machine learning. Research has assessed the ability of Transformers and RNNs to withstand adversarial attacks [7]. It appears that Transformer models are more robust against adversarial attacks.

Does it have to be yes or no?

We shouldn’t forget that it doesn’t have to be yes or no; we can in fact leverage the best of both worlds. While Transformer-based models have superior performance, RNN-based models are low in memory consumption. Maybe it’s possible to reach a trade-off between the two by combining them into a single model.

Hopefully, we’ll see some exciting research in the future trying to combine these two superpowers into one single awesome model!