Introduction to Encoder-Decoder Models — ELI5 Way

Source: Deep Learning on Medium

Hi All, welcome to my blog “Introduction to Encoder-Decoder Models — ELI5 Way”. My name is Niranjan Kumar and I’m a Senior Consultant, Data Science at Allstate India.

In this article, we will discuss the basic concepts of Encoder-Decoder models and their applications in tasks such as language modeling, image captioning, text entailment, and machine transliteration.

Citation Note: The content and the structure of this article are based on my understanding of the deep learning lectures from One-Fourth Labs — PadhAI.

Before we discuss the concepts of Encoder-Decoder models, we will start by revisiting the task of language modeling.

Language Modeling — Recap

Language Modeling is the task of predicting what word/letter comes next. Unlike FNNs and CNNs, in sequence modeling the current output depends on the previous inputs, and the length of the input is not fixed.

Given the first ‘t-1’ words, we are interested in predicting the tᵗʰ word based on the previous words or information. Let’s see how we solve language modeling using Recurrent Neural Networks.

Language Modeling — RNN

Let’s look at the problem of auto-complete in WhatsApp. As soon as you open the keyboard to type, you notice the letter “I” suggested as the first character of the message. In this problem, whenever we type a character, the network tries to predict the next possible character based on the previously typed characters.

The input to the network at time step t is denoted in orange and represented as xₜ. The weights associated with the input are denoted by the matrix U, and the hidden representation sₜ of the word is computed as a function of the output of the previous time step and the current input, along with a bias. The hidden representation sₜ is given by the following equation,

sₜ = σ(U·xₜ + W·sₜ₋₁ + b)

where W denotes the weights associated with the previous hidden state and b is the bias.

Once we compute the hidden representation of the input, the final output yₜ from the network is a softmax function (represented as O) of the hidden representation and the weights associated with it, along with a bias,

yₜ = O(V·sₜ + c)

where V and c are the output layer’s weights and bias, and O is the softmax function.
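The two update equations above can be sketched in a few lines of NumPy. The vocabulary size, hidden size, random weight initialization, and the one-hot character input below are illustrative assumptions, not the exact setup from the lectures.

```python
import numpy as np

# Minimal sketch of one RNN time step for character-level language
# modeling, following the article's notation: U, W are the input and
# recurrent weights, V the output weights, b and c the biases.
vocab_size, hidden_size = 27, 16            # assumed sizes (a-z + space)
rng = np.random.default_rng(0)

U = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output
b = np.zeros(hidden_size)                           # hidden bias
c = np.zeros(vocab_size)                            # output bias

def rnn_step(x_t, s_prev):
    """s_t = sigma(U x_t + W s_{t-1} + b);  y_t = softmax(V s_t + c)."""
    s_t = np.tanh(U @ x_t + W @ s_prev + b)
    logits = V @ s_t + c
    y_t = np.exp(logits - logits.max())     # numerically stable softmax
    y_t /= y_t.sum()
    return s_t, y_t

x = np.zeros(vocab_size)
x[8] = 1.0                                  # one-hot for the letter 'i'
s0 = np.zeros(hidden_size)                  # initial hidden state
s1, y1 = rnn_step(x, s0)
print(y1.sum())                             # the softmax output sums to 1
```

The softmax turns the raw scores into a valid probability distribution over the next character, which is exactly what auto-complete needs.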

Encoder-Decoder Model — Language Modeling

In this section, we will see how we have been using the Encoder-Decoder model in the problem of language modeling without even realizing it.

In language modeling, we are interested in finding the probability distribution of the tᵗʰ word based on the previous information.

Encoder Model

  • In an RNN, the hidden state of each time step is fed, along with the next input, into the following time step.
  • At each time step, the hidden representation (sₜ) of the word is computed as a function of the previous hidden state (sₜ₋₁) and the current input, along with a bias.
  • The final hidden state vector (sₜ) contains all the encoded information from the previous hidden representations and previous inputs.
  • Here, the Recurrent Neural Network is acting as an Encoder.

Decoder Model

  • The encoded vector is passed to the output layer, which decodes it into a probability distribution over the next possible word.
  • The output layer is a softmax function; it takes the hidden state representation, the weights associated with it, and a bias as inputs.
  • Since the output layer performs only a linear transformation and a bias operation followed by softmax, it can be viewed as a simple feed-forward neural network.
  • This Feed-Forward Neural Network is acting as a Decoder.
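Putting the two halves together, here is a minimal NumPy sketch: an RNN “encoder” folds the typed prefix into a hidden state, and a one-layer feed-forward softmax “decoder” turns that state into next-character probabilities. The toy vocabulary, sizes, and untrained random weights are assumptions for illustration.

```python
import numpy as np

# Encoder-decoder view of character-level language modeling:
# the RNN encodes the prefix, the softmax output layer decodes it.
vocab = list("abcdefghijklmnopqrstuvwxyz ")   # toy character vocabulary
V, H = len(vocab), 8
rng = np.random.default_rng(1)
U = rng.normal(0, 0.1, (H, V))                # input weights
W = rng.normal(0, 0.1, (H, H))                # recurrent weights
Wo = rng.normal(0, 0.1, (V, H))               # output-layer weights
b, c = np.zeros(H), np.zeros(V)               # biases

def encode(prefix):
    """RNN encoder: fold each character into the hidden state."""
    s = np.zeros(H)
    for ch in prefix:
        x = np.zeros(V)
        x[vocab.index(ch)] = 1.0              # one-hot character
        s = np.tanh(U @ x + W @ s + b)        # s now summarizes the prefix
    return s

def decode(s):
    """Feed-forward decoder: linear transform + bias, then softmax."""
    logits = Wo @ s + c
    p = np.exp(logits - logits.max())
    return p / p.sum()

probs = decode(encode("hello "))
print(vocab[int(probs.argmax())])             # most likely next character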

Encoder-Decoder Applications

In this section, we will discuss some applications of the Encoder-Decoder model.

Image Captioning

Image captioning is the task of automatically generating a caption describing what is shown in an image.

  • In image captioning, we pass the image through a Convolutional Neural Network, which extracts features from the image in the form of a feature representation vector.
  • After pre-processing, the feature representation vector is passed through an RNN or LSTM to generate the caption.
  • The CNN is used to encode the image.
  • The RNN is then used to decode a sentence (the caption) from the embedding.
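The captioning pipeline above can be sketched as follows. Note the heavy assumptions: `cnn_encode` is a stand-in mean-pool rather than a real CNN, the five-word vocabulary is invented, and the weights are untrained, so the generated caption is arbitrary; only the data flow (CNN features seed the RNN decoder's state) matches the description.

```python
import numpy as np

# Image-captioning sketch: a (stand-in) CNN encodes the image into a
# feature vector, which initializes the RNN decoder's hidden state.
rng = np.random.default_rng(2)
vocab = ["<start>", "a", "dog", "runs", "<end>"]  # toy caption vocabulary
V, H, F = len(vocab), 6, 6                        # vocab, hidden, feature sizes

def cnn_encode(image):
    # Placeholder for a convolutional feature extractor: here we just
    # average-pool each channel of the image into an F-dim vector.
    return image.reshape(F, -1).mean(axis=1)

U = rng.normal(0, 0.1, (H, V))                    # word -> hidden weights
W = rng.normal(0, 0.1, (H, H))                    # recurrent weights
Wo = rng.normal(0, 0.1, (V, H))                   # hidden -> word weights

def generate_caption(image, max_len=4):
    s = cnn_encode(image)                         # image features seed the state
    word, caption = "<start>", []
    for _ in range(max_len):
        x = np.zeros(V)
        x[vocab.index(word)] = 1.0                # feed previous word back in
        s = np.tanh(U @ x + W @ s)                # RNN decoder step
        word = vocab[int((Wo @ s).argmax())]      # greedy word choice
        if word == "<end>":
            break
        caption.append(word)
    return caption

image = rng.random((F, 4, 4))                     # fake 4x4 image, F channels
caption = generate_caption(image)
print(caption)
```

In a real system `cnn_encode` would be a pretrained network (and the feature vector is often projected before seeding the RNN), but the encoder/decoder split is the same.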

Text Entailment

Text entailment is the task of determining whether a given piece of text T entails another piece of text called the “hypothesis”.

For example,

Input: It is raining outside.

Output: The ground is wet.

In this problem, both the input and the output are sequences of characters, so both the encoder and the decoder networks are RNNs or LSTMs.

Machine Transliteration

Transliteration — “writing the same word in another language or script”. Translation tells you the meaning of words in another language; transliteration doesn’t tell you the meaning of the words, but it helps you pronounce them.

Input: INDIA

Output: इंडिया


  • Each character of the input is converted into a one-hot vector representation and fed into the encoder RNN.
  • At the last time step of the encoder, the final hidden representation, which summarizes all the previous inputs, is passed as the input to the decoder.


  • The decoder model, which can be an RNN or LSTM network, decodes the state representation vector and gives a probability distribution over the characters at each step.
  • The softmax function generates the probability distribution vector for each character, which in turn helps to generate the complete transliterated word.
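The steps above can be sketched as a character-level seq2seq model in NumPy. The tiny source/target alphabets, the `<s>`/`</s>` stand-in tokens, and the untrained random weights are all assumptions; the output here is arbitrary, and only the encoder-to-decoder hand-off mirrors the description.

```python
import numpy as np

# Character-level seq2seq sketch for transliteration: the encoder RNN
# consumes one-hot source characters; its final hidden state seeds the
# decoder RNN, whose softmax emits a distribution over target characters.
rng = np.random.default_rng(3)
src_chars = list("INDA")                          # toy source alphabet
tgt_chars = ["<s>", "</s>", "इ", "ं", "ड", "ि", "य", "ा"]
S, T, H = len(src_chars), len(tgt_chars), 8

Ue = rng.normal(0, 0.1, (H, S))                   # encoder input weights
We = rng.normal(0, 0.1, (H, H))                   # encoder recurrent weights
Ud = rng.normal(0, 0.1, (H, T))                   # decoder input weights
Wd = rng.normal(0, 0.1, (H, H))                   # decoder recurrent weights
Wo = rng.normal(0, 0.1, (T, H))                   # decoder output weights

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def transliterate(word, max_len=8):
    s = np.zeros(H)
    for ch in word:                               # --- encoder ---
        s = np.tanh(Ue @ one_hot(src_chars.index(ch), S) + We @ s)
    out, tok = [], "<s>"                          # final s is handed over
    for _ in range(max_len):                      # --- decoder ---
        s = np.tanh(Ud @ one_hot(tgt_chars.index(tok), T) + Wd @ s)
        p = softmax(Wo @ s)                       # distribution over chars
        tok = tgt_chars[int(p.argmax())]          # greedy character choice
        if tok == "</s>":
            break
        out.append(tok)
    return "".join(out)

result = transliterate("INDIA")
print(result)
```

With trained weights, feeding INDIA through the encoder and greedily decoding would produce इंडिया; here the skeleton only demonstrates how the final encoder state becomes the decoder's starting point.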