# Weather Forecasting Using Multilayer Recurrent Neural Network

Original article was published on Deep Learning on Medium

# Some theory

Well, actually, there are plenty of useful resources like this or this that explain in detail how GRU/LSTM architectures work, all the math behind them, and so on, but I think the explanations I’ve seen before are somewhat misleading.

Many articles show pictures like these:

But when we use a Keras RNN layer like GRU(units=128, return_sequences=True), what exactly does this mean? What is a unit?

It took me a little while to figure out that I was thinking about LSTMs wrong. You (like me) might be visualizing an LSTM cell as something with a scalar (1D) hidden cell state c and a scalar output h, so that it takes a vector input x and gives a scalar output. If you think of LSTMs this way, it is tempting to interpret ‘number of units = d’ as taking d such LSTMs and running them in parallel, with a total of d hidden states and d output states.

This is not a good way to think about it, though. The number of units is a parameter of the LSTM layer that sets the dimensionality of the hidden state and of the output (they must be equal). One LSTM or GRU cell comprises the entire layer. There is crosstalk between the components of the hidden state via the recurrent weight matrices, so it’s not correct to picture d independent LSTMs running in parallel. Put another way, an ‘unrolled’ LSTM looks just like a feedforward neural network in which every layer has the same number of units.
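To make the shapes concrete, here is a minimal NumPy sketch of a single GRU step. The dimensions mirror what a Keras GRU(units=d) layer allocates, but the weight values are random placeholders and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d = 5, 128               # input features, units (= hidden-state size)

x = rng.normal(size=n)      # one timestep of input
h = np.zeros(d)             # previous hidden state

# Each gate has a (d, n) input kernel and a (d, d) recurrent kernel.
# The (d, d) recurrent matrices are where the "crosstalk" between the
# d units happens -- every unit sees every other unit's previous state.
W_z, W_r, W_h = (rng.normal(size=(d, n)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(size=(d, d)) for _ in range(3))

z = sigmoid(W_z @ x + U_z @ h)             # update gate
r = sigmoid(W_r @ x + U_r @ h)             # reset gate
h_cand = np.tanh(W_h @ x + U_h @ (r * h))  # candidate state
h_new = (1 - z) * h + z * h_cand           # next hidden state == output

print(h_new.shape)  # (128,): units sets both hidden and output size
```

Note that the hidden state and the output are the same d-dimensional vector, which is why the two dimensionalities “must be equal.”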

## What about multiple layers in RNN?

Well, given that these RNNs operate on sequences of time-series data, adding layers adds levels of abstraction over the input observations through time: in effect, chunking observations over time, or representing the problem at different time scales.

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this line of work was whether RNNs could also benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks.

So the main purpose of using multilayer RNNs is to learn more sophisticated conditional distributions. In a single-layer RNN, one hidden state is doing all the work. If you are modeling a sequence such as text, then the internal parameters are learning, say, that `a` is more likely to follow `c` than `o` is. By introducing multiple layers, you allow the RNN to capture more complicated structures. The first layer might learn that some characters are vowels and others are consonants. The second layer would build on this to learn that a vowel is more likely to follow a consonant.
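Mechanically, stacking means that layer 1 emits a hidden state at every timestep (this is what return_sequences=True requests in Keras) and layer 2 consumes that whole sequence as its input. The sketch below uses a plain tanh RNN cell instead of a GRU to keep the loop short; the shapes and layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_layer(inputs, units, rng):
    """Run a simple tanh RNN over a (timesteps, features) sequence and
    return the hidden state at every timestep -- the analogue of a
    Keras recurrent layer with return_sequences=True."""
    T, n = inputs.shape
    W = rng.normal(size=(units, n)) * 0.1      # input kernel
    U = rng.normal(size=(units, units)) * 0.1  # recurrent kernel
    h = np.zeros(units)
    outputs = []
    for t in range(T):
        h = np.tanh(W @ inputs[t] + U @ h)
        outputs.append(h)
    return np.stack(outputs)                   # (timesteps, units)

x = rng.normal(size=(24, 5))            # e.g. 24 hourly readings, 5 features
h1 = rnn_layer(x, units=128, rng=rng)   # layer 1: lower-level temporal features
h2 = rnn_layer(h1, units=64, rng=rng)   # layer 2: builds on layer 1's sequence

print(h1.shape, h2.shape)  # (24, 128) (24, 64)
```

Because layer 2 sees layer 1’s output at every timestep rather than only its final state, it can re-summarize those features at a coarser time scale, which is the “depth in space” discussed above.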

# Dataset preparation

You can download my data from here, or follow these steps to get the custom data file:

• Original data was taken from the official NCDC website.
• Choose United States, then click Access Data/Products.
• Choose Surface Data, Hourly Global.
• Choose Continue With SIMPLIFIED options.
• Select Retrieve data for: New York.
• Select country: United States.
• Then find and select the required weather station, for example:
  `Selected UNITED STATES stations: EAST HAMPTON AIRPORT……………….. 72209864761 01/2006 to 03/2020`