
The Amazing Effectiveness of Sequence-to-Sequence Models for Time Series

In this tutorial, I am excited to showcase examples of building a time series forecasting model with seq2seq in TensorFlow. The purpose of this post is to give an intuitive as well as technical understanding of the implementation, and to demonstrate two useful features under the hood:

  1. Multivariate input and output signals
  2. Variable input and output sequence lengths

After that, I will demonstrate how to apply what we have covered to something very interesting: predicting extreme events / outliers.

Finally, we will end this tutorial by applying what we have learnt to a real-world case: forecasting Beijing PM2.5 pollution.

At any time, please feel free to jump to the Python notebook on my GitHub if you want to skip the reading.

The tutorial is organised as follows:

  1. Univariate case
  2. Multivariate case
  3. Predict extreme events / outliers
  4. Case study on real world data set — forecasting Beijing PM2.5 pollution

Univariate case

1. Data generation approach

Let’s say we want to learn the pattern of a sinusoidal wave like below:

However, the real world data might be way more noisy than this, as shown below:

So we will sample training data in batches, with each sample being an input-output sequence pair. The input and output lengths can be different as well. After training, we will feed the model the test input sequence and let it produce the predicted output sequence.

The code for generating the training data is as follows:
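
Here is a minimal sketch of it, assuming a noisy 2·sin(x) signal on a 130-point grid; the exact constants and the noise level are illustrative, not necessarily the notebook’s values.

    import numpy as np

    input_seq_len = 15
    output_seq_len = 20

    x = np.linspace(0, 40, 130)
    # Reserve the last 20 points of x for generating test labels later.
    train_data_x = x[:(len(x) - output_seq_len)]

    def true_signal(xs):
        # The clean underlying pattern the model should uncover.
        return 2 * np.sin(xs)

    def generate_y_values(xs, noise_factor=0.5):
        return true_signal(xs) + np.random.randn(len(xs)) * noise_factor

    def generate_train_samples(x=train_data_x, batch_size=10):
        total_start_points = len(x) - input_seq_len - output_seq_len
        start_idx = np.random.choice(range(total_start_points), batch_size)
        input_xs = [x[i:(i + input_seq_len)] for i in start_idx]
        output_xs = [x[(i + input_seq_len):(i + input_seq_len + output_seq_len)]
                     for i in start_idx]
        batch_x = np.array([generate_y_values(xs) for xs in input_xs])
        batch_y = np.array([generate_y_values(xs) for xs in output_xs])
        # shapes: (batch_size, 15) and (batch_size, 20)
        return batch_x, batch_y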

Please note that we chose the input and output sequence lengths to be 15 and 20, respectively. The last 20 elements of the array ‘x’ are reserved to generate the test labels later on, after we make the predictions, so we are not using them for training.

To see what the training data looks like, we can sample and visualise one pair of them:

2. The seq2seq model and how we train it

What we will do next is similar to what’s depicted above. The seq2seq model contains two RNNs, e.g. LSTMs: the ‘encoder’ and the ‘decoder’. Each box here is one RNN cell. The encoder first runs through the whole input sequence (the ‘A’, ‘B’, ‘C’ here), encodes it into a fixed-length vector (the last hidden state of the encoder LSTM), and passes that vector to the decoder, which decodes it into the output sequence (the ‘W’, ‘X’, ‘Y’, ‘Z’ here).

The training approach we adopt here is called ‘guided’ training: during the decoding steps, we first feed a ‘GO’ token as the initial input to the decoder (this can be a zeros vector), and subsequently we provide the correct input to the decoder at every time step, even if the decoder made a mistake before. During test or inference time, however, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.

The code is below:
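
Shown here is a minimal sketch of it, adapted from TensorFlow’s legacy seq2seq code; the notebook has the full version.

    import copy
    import tensorflow as tf  # TensorFlow 1.x
    from tensorflow.contrib import legacy_seq2seq

    def _loop_function(prev, _):
        # Project the previous decoder output back into the input dimensions,
        # so it can be fed as the next decoder input at inference time.
        # weights['out'] and biases['out'] are defined later, in the graph build.
        return tf.matmul(prev, weights['out']) + biases['out']

    def _basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
                           feed_previous, dtype=tf.float32, scope=None):
        with tf.variable_scope(scope or 'basic_rnn_seq2seq'):
            # Run the encoder over the whole input sequence and keep only its
            # final state, which summarises the inputs (the deepcopy keeps the
            # encoder and decoder weights separate, as in the official repo).
            enc_cell = copy.deepcopy(cell)
            _, enc_state = tf.nn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
            if feed_previous:
                # Inference: feed the output at time t back as the input at t+1;
                # only decoder_inputs[0] (the 'GO' token) is actually consumed.
                return legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell,
                                                  loop_function=_loop_function)
            # Training ('guided'): the true decoder_inputs are used at each step.
            return legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)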

This is basically the same function as in the official seq2seq repo. However, I did make a change to the basic_rnn_seq2seq function by adding the ‘feed_previous’ feature, which has not been implemented for it in TensorFlow. This feeds the decoder output at time t as the input for time t+1, as discussed. So during decoding at inference time, only the first element of decoder_inputs will be used.

During the training phase, however, decoder_inputs will still be used as the input at each time step t.

The _loop_function simply serves as a utility to project the decoder cell output back into the input dimensions, as the LSTM cell may have different input and output dimensions. We will define weights[‘out’] and biases[‘out’] later on, when we build the model.

Now, with the core _basic_rnn_seq2seq function implemented, we can move on to building the whole graph, including steps such as variable & placeholder declarations, formatting inputs & outputs, computing the loss, and defining the training op.

Here’s the whole graph building:
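
A sketch is below; the hyperparameter values (hidden size, learning rate, etc.) are my assumptions rather than the notebook’s exact settings.

    input_seq_len = 15
    output_seq_len = 20
    input_dim = 1
    output_dim = 1
    hidden_dim = 64
    learning_rate = 0.01

    feed_previous = False  # rebuild the graph with True for inference

    tf.reset_default_graph()

    # legacy_seq2seq expects lists of per-time-step tensors.
    enc_inp = [tf.placeholder(tf.float32, [None, input_dim], name='enc_%d' % t)
               for t in range(input_seq_len)]
    target_seq = [tf.placeholder(tf.float32, [None, output_dim], name='y_%d' % t)
                  for t in range(output_seq_len)]
    # Decoder inputs: a zeros 'GO' token, then the targets shifted by one.
    dec_inp = [tf.zeros_like(target_seq[0], name='GO')] + target_seq[:-1]

    # Output projection, used by _loop_function and on every decoder output.
    weights = {'out': tf.get_variable('w_out', [hidden_dim, output_dim])}
    biases = {'out': tf.get_variable('b_out', [output_dim],
                                     initializer=tf.zeros_initializer())}

    cell = tf.contrib.rnn.BasicLSTMCell(hidden_dim)
    dec_outputs, _ = _basic_rnn_seq2seq(enc_inp, dec_inp, cell, feed_previous)
    preds = [tf.matmul(o, weights['out']) + biases['out'] for o in dec_outputs]

    # Mean squared error over all output time steps, plus the training op.
    loss = tf.reduce_mean(tf.stack(
        [tf.reduce_mean(tf.square(p - y)) for p, y in zip(preds, target_seq)]))
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)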

The ‘feed_previous‘ will be set to True during inference phase, and False during training.

The input and output dims are both 1 for our univariate case, and the input and output sequences can have different lengths.

3. Train the model

We will use the ‘guided’ training approach as discussed previously:
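
A minimal training loop could look like the sketch below; the iteration count, batch size and checkpoint path are assumptions.

    total_iterations = 1000
    batch_size = 16

    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(total_iterations):
            batch_x, batch_y = generate_train_samples(batch_size=batch_size)
            # Feed each time step separately, reshaped to (batch, dim).
            feed_dict = {enc_inp[t]: batch_x[:, t].reshape(-1, input_dim)
                         for t in range(input_seq_len)}
            feed_dict.update({target_seq[t]: batch_y[:, t].reshape(-1, output_dim)
                              for t in range(output_seq_len)})
            _, loss_t = sess.run([train_op, loss], feed_dict)
            if step % 10 == 0:
                print(loss_t)
        save_path = saver.save(sess, './univariate_ts_model0')
        print('Checkpoint saved at:', save_path)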

And we see our training loss decrease as expected:

41.1176
15.1229
38.0447
17.5147
10.7143
15.86
11.0018
9.36498
8.26214
...
6.23474
5.99408
6.12859
5.7535
5.7275
6.19146
Checkpoint saved at: /home/weimin/seq2seq/univariate_ts_model0

4. Inference and visualization

For prediction, we will use the last sequence of 15 values from train_data_x as the input signal, and let the model predict without feeding in the true labels (by setting ‘feed_previous’ to True).
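
A sketch of the inference step, assuming the graph above has been rebuilt with feed_previous = True before the checkpoint is restored:

    test_seq_input = generate_y_values(train_data_x[-input_seq_len:])

    with tf.Session() as sess:
        saver = tf.train.Saver()
        saver.restore(sess, './univariate_ts_model0')
        feed_dict = {enc_inp[t]: test_seq_input[t].reshape(1, input_dim)
                     for t in range(input_seq_len)}
        # Targets are fed as zeros; with feed_previous=True only the 'GO'
        # token is consumed and predictions are fed back step by step.
        feed_dict.update({target_seq[t]: np.zeros([1, output_dim], dtype=np.float32)
                          for t in range(output_seq_len)})
        final_preds = np.concatenate(sess.run(preds, feed_dict), axis=0)
        # final_preds has shape (output_seq_len, output_dim)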

We will then visualize the predictions against the true labels below:

Through a few iterations of learning, the model has already figured out the hidden signal from the noisy training data, which is amazing.

Multivariate case

This one should be more exciting!

So what if we have more than one signal for input and output? This is especially relevant for real-world scenarios. Let’s say we would like to predict supply & demand for the next 30 minutes; what affects the results may be not only past supply & demand, but also things like weather, rain, temperature, app usage, traffic signals, whether it is a public holiday, etc.

1. Simulate input and output signals

We start with a cosine and a sine signal, x1 and x2; these are the outputs we want to predict! From them, we derive three additional signals, y1, y2 and y3, as our input sequences, using some random formulas as below:
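
A sketch of this simulation is below; the mixing coefficients are arbitrary placeholders, as the text suggests.

    import numpy as np

    x = np.linspace(0, 40, 130)
    x1 = 2 * np.sin(x)   # hidden output signal 1
    x2 = 2 * np.cos(x)   # hidden output signal 2

    def noise(scale=0.5):
        return np.random.randn(len(x)) * scale

    # Arbitrary mixing formulas; any reasonable combination works here.
    y1 = 1.6 * x1 - 2.0 * x2 + noise()
    y2 = 0.5 * x1 * x2 + 1.2 * x2 + noise()
    y3 = -1.0 * x1 + 0.8 * x2 ** 2 + noise()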

So overall, the structure is 3-in-2-out.

We can also visualize the inputs as follows:

Put another way, we want our model to uncover the hidden signals from the observed ones, as plotted below:

This is more interesting, as we have now reached the stage of being flexible in both the number of signals and the length of the sequences.

You are free to explore different numbers of input and output sequences, if you want to.

2. Sampling training data

Let’s rewrite the data sampling mechanism:
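
A minimal sketch, stacking the three inputs and two outputs on a trailing feature axis:

    input_dim, output_dim = 3, 2

    X = np.stack([y1, y2, y3], axis=1)   # (130, 3) observed inputs
    Y = np.stack([x1, x2], axis=1)       # (130, 2) hidden outputs

    def generate_train_samples(X=X, Y=Y, batch_size=10,
                               input_seq_len=15, output_seq_len=20):
        total_start_points = len(X) - input_seq_len - output_seq_len
        start_idx = np.random.choice(range(total_start_points), batch_size)
        batch_x = np.array([X[i:(i + input_seq_len)] for i in start_idx])
        batch_y = np.array([Y[(i + input_seq_len):(i + input_seq_len + output_seq_len)]
                            for i in start_idx])
        # shapes: (batch_size, 15, 3) and (batch_size, 20, 2)
        return batch_x, batch_y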

As usual, we can visualize one training sample:

3. Build the model

Luckily, we don’t need to change much of the graph-building code. All we need to change are input_dim = 3 and output_dim = 2.

I will skip pasting the same code here.

4. Train the model

Training is also not much different. We will again set ‘feed_previous’ to False for ‘guided’ training, feeding in the correct decoder input at each time step. So here’s the code:
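
A sketch of the loop, reusing the graph and sampler above (now built with input_dim = 3 and output_dim = 2); only the feed changes, as each time step now carries 3 input features and 2 output features.

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(total_iterations):
            batch_x, batch_y = generate_train_samples(batch_size=16)
            feed_dict = {enc_inp[t]: batch_x[:, t, :]
                         for t in range(input_seq_len)}
            feed_dict.update({target_seq[t]: batch_y[:, t, :]
                              for t in range(output_seq_len)})
            _, loss_t = sess.run([train_op, loss], feed_dict)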

5. Inference and visualisation

Once training is done, we can visualise the prediction performance on the test output sequence.

To put them together, we can also visualize both the input and output sequences at the same time (I scaled up the output sequences just to make them more obvious):

Outliers / Extreme events

1. What could be an outlier / extreme event case?

This will be very useful for companies dealing with real-time traffic, for example. Many factors can trigger anomalous events: sudden rain during the day could lead to a surge in demand for taxis on the road, and public holidays could lead to increased demand for the whole day.

2. Case & data simulation

Let’s assume our traffic patterns are like below, where the y-axis is the traffic volume and the x-axis is the day.

Imagine that on public holidays, an additional 2 units of traffic are added on top of what we would have for that day, as reflected by the sharp peaks in the graph.

Of course, the real data will have noise in the traffic on both normal days and public holidays (which means the true effect of a public holiday, 2 units, is unknown to us, and the model needs to learn it from the data). What we do know, however, is whether each day is a public holiday (1) or not (0). Here’s the code for data generation:
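
A minimal sketch, with illustrative constants (the holiday probability, series length and noise level are assumptions):

    import numpy as np

    total_days = 400
    # Seasonal base traffic plus noise.
    base_traffic = 2 * np.sin(np.linspace(0, 40, total_days))
    noise = np.random.randn(total_days) * 0.5

    # Each day is a public holiday with a small probability; this bool
    # vector is known to us for both training and test windows.
    is_holiday = (np.random.rand(total_days) < 0.05).astype(np.float32)

    # Holidays add 2 extra units on top of the normal (noisy) traffic.
    traffic = base_traffic + noise + 2.0 * is_holiday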

As always, we can sample a training data to see what it looks like:

It is quite noisy, so the extreme points are hard to see. However, you can simply turn off the noise factor (set it to zero) to see the extreme events clearly. I will skip this verification step here to save space.

3. If we train the same seq2seq as before …

Now if we use the simulated data to train the same seq2seq model as before, we won’t get a very good result: the outlier in the test data will not be captured by the model. (This is easy to understand, as the extreme events are totally random, and the model has no idea which days they will fall on, in either the training or the test phase.)

So if you test the model, you should see something similar to this:

  1. The model fits the test data poorly.
  2. The model has no awareness of / differentiation for the outlier.

4. Re-design the seq2seq

How do we redesign the model to deal with extreme events?

Since we already have the knowledge of whether each day is a public holiday, we can pass that ‘knowledge’ to the decoder at each time step t, to let the decoder learn the differentiation between normal days and public holidays.

In summary, here are the steps:

  1. Get the bool vector for the output sequence (‘1’ for a public holiday, ‘0’ for a normal day)
  2. During decoding, at each time step t, concatenate the bool value to the original decoder input to form the new input
  3. Feed the new input to the decoder to generate the output at t
  4. Continue until the decoding phase is done (a sketch follows below)
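
Here is a minimal sketch of those steps on top of the earlier graph code; holiday_seq and the helper name are my own hypothetical additions, and the notebook may wire this up differently.

    # New per-step placeholders for the known holiday indicator ('1' / '0')
    # over the output window.
    holiday_seq = [tf.placeholder(tf.float32, [None, 1], name='hol_%d' % t)
                   for t in range(output_seq_len)]

    # Step 2: 'guided' decoder inputs get the bool concatenated per time step.
    # These replace dec_inp in the graph code above.
    dec_inp_new = [tf.concat([inp, hol], axis=1)
                   for inp, hol in zip(dec_inp, holiday_seq)]

    def _loop_function_outlier(prev, i):
        # Steps 2-4 at inference (replacing _loop_function): project the
        # previous prediction back to input dims, then append the known
        # holiday bool for step i.
        projected = tf.matmul(prev, weights['out']) + biases['out']
        return tf.concat([projected, holiday_seq[i]], axis=1)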

We won’t change the rest, such as the ‘guided’ training, and we also won’t touch the encoding phase. With these steps implemented, we should be able to see improved results on the test data:

Overall it fits well, and most importantly, the outlier point was perfectly captured!

Please refer to my GitHub page for the complete code and notebook.

Case study on Beijing PM2.5 pollution forecasting

1. The case introduction.

Finally, we will apply what we have learnt to a real-world data set: an hourly data set containing the PM2.5 readings of the US Embassy in Beijing.

You can download the data from here. The complete list of features included in the raw data set is:

  1. No: row number
  2. year: year of data in this row
  3. month: month of data in this row
  4. day: day of data in this row
  5. hour: hour of data in this row
  6. pm2.5: PM2.5 concentration
  7. DEWP: Dew Point
  8. TEMP: Temperature
  9. PRES: Pressure
  10. cbwd: Combined wind direction
  11. Iws: Cumulated wind speed
  12. Is: Cumulated hours of snow
  13. Ir: Cumulated hours of rain

Our task here, however, is only to predict the pollution factor, pm2.5, for the next few hours, based on the historical trends of the other features. In short, it’s going to be a multivariate case with n-input-1-output.

2. Visualisation and pre-processing of data

A quick df.head() function plus simple plots give us an intuitive understanding of the data set.

All the features plotted above are numeric / continuous. However, there’s one more, cbwd, which is categorical, as it refers to the wind direction. We will one-hot encode it. I will not be using year, month, day or hour as features. However, you are encouraged to add those in as well, to see if they help your performance.

Furthermore, there are no NA values in any feature except pm2.5, where 2067 out of 43824 values are NA. We will simply fill the NAs with 0.

Regarding the train and test split: I will use the last month of data for testing, and all data before that as my training sample. Feel free to split according to your preferences.

Finally, I will do a simple processing step to normalize all the data using z-scores.

The overall script is below:
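
A sketch of these steps is below; the CSV file name follows the UCI data set, and the exact column handling is my assumption.

    import pandas as pd

    # File name as published in the UCI repository.
    df = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')

    df['pm2.5'] = df['pm2.5'].fillna(0)          # fill the 2067 NAs with 0
    df = pd.get_dummies(df, columns=['cbwd'])    # one-hot the wind direction

    # Drop the calendar columns; keep pm2.5 plus the weather features.
    feature_cols = [c for c in df.columns
                    if c not in ('No', 'year', 'month', 'day', 'hour')]

    test_size = 31 * 24                          # last month of hourly data
    train_df, test_df = df.iloc[:-test_size], df.iloc[-test_size:]

    # z-score everything with statistics from the training portion only.
    mean, std = train_df[feature_cols].mean(), train_df[feature_cols].std()
    X_train = ((train_df[feature_cols] - mean) / std).values
    X_test = ((test_df[feature_cols] - mean) / std).values
    y_train = X_train[:, feature_cols.index('pm2.5')].reshape(-1, 1)
    y_test = X_test[:, feature_cols.index('pm2.5')].reshape(-1, 1)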

3. Transforming training and test data into 3-D formats.

Now that we have X_train, y_train, X_test and y_test as our train and test data sets, we need to transform them into the 3-D format for time series: (batch_size, time_steps, feature_dim).

We will prepare two util functions to do so:
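
A sketch of the two functions, assuming input_seq_len = 30 and output_seq_len = 5 (consistent with the shapes printed below):

    import numpy as np

    input_seq_len, output_seq_len = 30, 5

    def generate_train_samples(x=X_train, y=y_train, batch_size=10):
        total_start_points = len(x) - input_seq_len - output_seq_len
        start_idx = np.random.choice(range(total_start_points), batch_size)
        in_batch = np.array([x[i:(i + input_seq_len)] for i in start_idx])
        out_batch = np.array([y[(i + input_seq_len):(i + input_seq_len + output_seq_len)]
                              for i in start_idx])
        return in_batch, out_batch   # (batch, 30, 11) and (batch, 5, 1)

    def generate_test_samples(x=X_test, y=y_test):
        total_samples = len(x) - input_seq_len - output_seq_len
        in_batch = np.array([x[i:(i + input_seq_len)]
                             for i in range(total_samples)])
        out_batch = np.array([y[(i + input_seq_len):(i + input_seq_len + output_seq_len)]
                              for i in range(total_samples)])
        return in_batch, out_batch   # (709, 30, 11) and (709, 5, 1)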

So generate_train_samples will randomly sample batches from the training data, and generate_test_samples will simply convert all the test data into the 3-D format, in sequence. We can print out the shapes of the processed data:

(10, 30, 11) (10, 5, 1)
(709, 30, 11) (709, 5, 1)

4. Train the model and visualize the results on test.

The model will be no different from the multivariate one we used before, so I will not duplicate that part here.

The training part is also similar: we set feed_previous to False for ‘guided’ training.

After training is done, we will visualize the results over the last month to evaluate our prediction performance, as below:

Well, slightly off but not too bad 🙂 The orange line is the prediction and the blue line is the actual value for each hour. I believe that with some tweaks to the hyperparameters or feature creation, one might get even better results!

Any thoughts?

In this tutorial, we have demonstrated how to use the sequence-to-sequence architecture to learn and predict time series signals. We learnt that the seq2seq model really has the power to sift through noisy information, uncover the true patterns, and ultimately discover the hidden signals that actually govern the data generation mechanism.

We have seen that the model performs equally well on univariate and multivariate cases, with the additional capability of encoding and decoding variable-length signals. We have also witnessed the flexibility of re-designing the seq2seq model to take care of the extreme-event scenario, which is powerful for predicting the ‘unpredictability’ of the real world.

What other problems may be more challenging and useful to tackle in your business/research settings? Please briefly share your ideas or thoughts, and we may see how to design the seq2seq to fit your needs.

References

TensorFlow official repo: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py

Another excellent blog about seq2seq for time series: https://github.com/guillaume-chevalier/seq2seq-signal-prediction

If you would like to try Keras: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
