Encoder-Decoder Model for Multi-Step Time Series Forecasting Using PyTorch

Original article was published on Deep Learning on Medium


The input to the encoder network has shape (sequence length, n_values), so each item in the sequence is made up of n values. In constructing these values, different types of features are treated differently.

Time-dependent features — These are the features that vary with time, such as sales and DateTime features. In the encoder, each sequential time-dependent value is fed into an RNN cell.

Numerical features — Static features that do not vary with time, such as the yearly autocorrelation of the series. These features are repeated across the length of the sequence and fed into the RNN. The repeating and merging of these values is handled in the Dataset.

Categorical features — Features such as store id and item id can be handled in multiple ways; the implementation of each method can be found in encoders.py. For the final model, the categorical variables were one-hot encoded, repeated across the sequence, and fed into the RNN; this is also handled in the Dataset.
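Putting the three feature types together, the per-step input vector can be assembled roughly as follows. This is a minimal NumPy sketch; the function name and column layout are illustrative, not the repository's actual Dataset code:

```python
import numpy as np

def build_encoder_input(series, static_feats, cat_id, n_categories):
    """Assemble a (seq_len, n_values) encoder input from the three feature types."""
    seq_len = len(series)
    # Time-dependent values: one value per time step (e.g. daily sales).
    ts = np.asarray(series, dtype=np.float32).reshape(seq_len, 1)
    # Static numerical features (e.g. yearly autocorrelation): repeated across the sequence.
    static = np.tile(np.asarray(static_feats, dtype=np.float32), (seq_len, 1))
    # Categorical feature: one-hot encoded, then repeated across the sequence.
    onehot = np.zeros(n_categories, dtype=np.float32)
    onehot[cat_id] = 1.0
    cat = np.tile(onehot, (seq_len, 1))
    # Merge all features into one value vector per time step.
    return np.concatenate([ts, static, cat], axis=1)

x = build_encoder_input([1.0, 2.0, 3.0], [0.7], cat_id=2, n_categories=4)
# x has shape (3, 6): 1 time-dependent + 1 static + 4 one-hot columns
```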

The input sequence with these features is fed into the recurrent network — a GRU. The code of the encoder network used is given below.
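A minimal sketch of such a GRU encoder — class and parameter names here are assumptions, not the original listing:

```python
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    """GRU encoder over (batch, seq_len, n_values) inputs; the final hidden
    state serves as the context vector passed to the decoder."""
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.gru = nn.GRU(
            input_size, hidden_size, num_layers,
            batch_first=True,
            # nn.GRU only applies dropout between stacked layers
            dropout=dropout if num_layers > 1 else 0.0,
        )

    def forward(self, x):
        outputs, hidden = self.gru(x)
        # outputs: (batch, seq_len, hidden_size); hidden: (num_layers, batch, hidden_size)
        return outputs, hidden

enc = RNNEncoder(input_size=6, hidden_size=32)
x = torch.randn(8, 180, 6)  # batch of 8 sequences, 180 days, 6 values per step
outputs, hidden = enc(x)
```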


The decoder receives the context vector from the encoder; in addition, the inputs to the decoder are the future DateTime features and lag features. The lag feature used in the model was the previous year’s value. The intuition behind using lag features is that, since the input sequence is limited to 180 days, providing important data points from beyond this timeframe will help the model.

Unlike the encoder, in which a recurrent network (GRU) is used directly, the decoder is built by looping through a decoder cell. This is because the forecast obtained from each decoder cell is passed as an input to the next decoder cell. Each decoder cell is made of a GRUCell whose output is fed into a fully connected layer that produces the forecast. The forecasts from the decoder cells are combined to form the output sequence.
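The decoding loop described above can be sketched like this. Names, the feature shapes, and the zero-seeded first forecast are assumptions for the sake of a self-contained example:

```python
import torch
import torch.nn as nn

class DecoderCell(nn.Module):
    """One decoding step: a GRUCell followed by a fully connected layer
    that maps the hidden state to a single forecast value."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x, hidden):
        hidden = self.cell(x, hidden)
        return self.fc(hidden), hidden

def decode(decoder_cell, context, future_feats):
    """Loop over the horizon, feeding each step's forecast into the next cell.
    `future_feats`: (batch, horizon, n_feats) future DateTime/lag features."""
    batch, horizon, _ = future_feats.shape
    hidden = context
    prev_y = torch.zeros(batch, 1)  # seed value for the first step (assumption)
    outputs = []
    for t in range(horizon):
        # Each step sees its future features plus the previous step's forecast.
        step_in = torch.cat([future_feats[:, t, :], prev_y], dim=1)
        prev_y, hidden = decoder_cell(step_in, hidden)
        outputs.append(prev_y)
    return torch.stack(outputs, dim=1)  # (batch, horizon, 1)

dec = DecoderCell(input_size=3 + 1, hidden_size=32)
y = decode(dec, torch.zeros(8, 32), torch.randn(8, 90, 3))
```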

Encoder-Decoder Model

The encoder-decoder model is built by wrapping the encoder and decoder cell in a Module that handles the communication between the two.
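A compact sketch of such a wrapper Module, with the encoder and decoder cell inlined so the example runs on its own (all names and the choice of seeding the decoder with the last observed value are illustrative):

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Wrapper that encodes the input sequence into a context vector,
    then unrolls a GRUCell-based decoder over the forecast horizon."""
    def __init__(self, enc_feats, dec_feats, hidden_size):
        super().__init__()
        self.encoder = nn.GRU(enc_feats, hidden_size, batch_first=True)
        self.decoder_cell = nn.GRUCell(dec_feats + 1, hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x_enc, x_dec):
        # Encode: the final hidden state is the context vector.
        _, hidden = self.encoder(x_enc)
        hidden = hidden.squeeze(0)          # (batch, hidden_size)
        # Seed the loop with the last observed time-dependent value (assumption).
        prev_y = x_enc[:, -1, :1]
        preds = []
        for t in range(x_dec.size(1)):
            step_in = torch.cat([x_dec[:, t], prev_y], dim=1)
            hidden = self.decoder_cell(step_in, hidden)
            prev_y = self.head(hidden)
            preds.append(prev_y)
        return torch.stack(preds, dim=1)    # (batch, horizon, 1)

model = EncoderDecoder(enc_feats=6, dec_feats=3, hidden_size=32)
y = model(torch.randn(4, 180, 6), torch.randn(4, 90, 3))
```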

Model Training

The performance of the model highly depends on the training decisions taken around optimization, learning rate schedule, etc. I’ll briefly cover each of them.

  1. Validation Strategy — The cross-sectional train-validation-test split does not work since our data is time-dependent. A time-dependent train-validation-test split poses its own problem: the model is not trained on the recent validation data, which affects its performance on the test data.
    To combat this, a model is trained on 3 years of past data, from 2014 to 2016, and predicts the first 3 months of 2017, which is used for validation and experimentation. The final model is trained on data from 2014 to 2017 and predicts the first 3 months of 2018. The final model is trained in blind mode, without validation, based on learnings from training the validation model.
  2. Optimizer — The optimizer used is AdamW, which has provided state-of-the-art results in many learning tasks. A more detailed analysis of AdamW can be found in fastai. Another optimizer explored was the COCOBOptimizer, which does not set the learning rate explicitly. When training with the COCOBOptimizer, I observed that it converged faster than AdamW, especially in the initial iterations. But the best result was obtained using AdamW with one-cycle learning.
  3. Learning Rate Scheduling — A 1cycle learning rate scheduler was used. The maximum learning rate in the cycle was determined by using the learning rate finder for cyclic learning. The implementation of the learning rate finder used is from the library pytorch-lr-finder.
  4. The loss function used was mean squared error (MSE) loss, which is different from the competition loss, SMAPE. MSE loss provided a more stable convergence than using SMAPE.
  5. Separate optimizer and scheduler pairs were used for the encoder and decoder networks, which gave an improvement in results.
  6. In addition to weight decay, dropout was used in both encoder and decoder to combat overfitting.
  7. A wrapper was built to handle the training process with the capability to handle multiple optimizers and schedulers, checkpointing, and Tensorboard integration. The code for this can be found in trainer.py.
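The optimizer and scheduler choices above — AdamW with weight decay, a 1cycle schedule, MSE loss, and a separate optimizer/scheduler pair per sub-network — can be sketched as follows. The networks are toy stand-ins, and the max_lr and step counts are placeholders, not the values found with the LR finder:

```python
import torch
import torch.nn as nn

# Toy encoder/decoder stand-ins so the sketch runs on its own.
encoder = nn.GRU(6, 32, batch_first=True)
decoder = nn.Linear(32, 1)
loss_fn = nn.MSELoss()  # optimized instead of SMAPE for more stable convergence

steps_per_epoch, epochs = 10, 2
# One AdamW optimizer and OneCycleLR scheduler per sub-network.
opt_enc = torch.optim.AdamW(encoder.parameters(), lr=1e-3, weight_decay=1e-2)
opt_dec = torch.optim.AdamW(decoder.parameters(), lr=1e-3, weight_decay=1e-2)
sched_enc = torch.optim.lr_scheduler.OneCycleLR(
    opt_enc, max_lr=3e-3, steps_per_epoch=steps_per_epoch, epochs=epochs)
sched_dec = torch.optim.lr_scheduler.OneCycleLR(
    opt_dec, max_lr=3e-3, steps_per_epoch=steps_per_epoch, epochs=epochs)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        x = torch.randn(8, 180, 6)       # dummy batch
        target = torch.randn(8, 1)
        _, hidden = encoder(x)
        pred = decoder(hidden[-1])
        loss = loss_fn(pred, target)
        opt_enc.zero_grad(); opt_dec.zero_grad()
        loss.backward()
        # Step both optimizers and both schedulers every batch.
        opt_enc.step(); opt_dec.step()
        sched_enc.step(); sched_dec.step()
```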


The following plot shows the forecast made by the model for the first 3 months of 2018, for a single item from a store.