Original article was published on Deep Learning on Medium

# Weather Forecasting Using Multilayer Recurrent Neural Network

# Some theory

Well, actually, there are plenty of useful resources that explain in detail how **GRU/LSTM** architectures work, all the math behind them, and so on, **but I think all the explanations I’d seen before were somewhat misleading**.

**Many articles illustrate these architectures with the standard cell diagrams.**

But when we use a Keras RNN layer like **GRU(units=128, return_sequences=True)** — what exactly does this mean? What is a **unit**?

**It took me a little while to figure out that I had been thinking about LSTMs wrong.** You (like me) might be visualizing an LSTM cell as something with a scalar (1D) hidden cell state **c** and a scalar output **h**, so that it takes a vector input **x** and gives a scalar output. If you think of LSTMs this way, then it is tempting to read the **‘number of units = d’** as taking **d serial LSTMs** and running them in parallel, with a total of **d hidden states and d output states**.

**This is not a good way to think about it, though.** The number of units is a parameter of the LSTM, referring to the dimensionality of the hidden state and the dimensionality of the output state (they must be equal). **One LSTM or GRU cell comprises an entire layer.** There is crosstalk between the hidden states via the weight matrix, so it’s not correct to think of it as **d** serial LSTMs running in parallel. Put another way, an **‘unrolled’** LSTM looks just like a normal feedforward neural network in which every layer has the same number of units.
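A quick shape check makes the meaning of **units** concrete. This sketch (array sizes are illustrative assumptions) shows that one GRU layer with 128 units emits a single 128-dimensional vector per time step, not 128 independent 1D outputs:

```python
import numpy as np
from tensorflow.keras.layers import GRU

# A batch of 4 sequences, each 10 time steps long, with 3 features per step.
x = np.random.rand(4, 10, 3).astype("float32")

# One GRU *layer* whose hidden state (and output) is 128-dimensional.
# return_sequences=True emits the hidden state at every time step.
y = GRU(units=128, return_sequences=True)(x)
print(y.shape)  # (4, 10, 128): one 128-dim output vector per time step

# With return_sequences=False (the default) only the final state is returned.
last = GRU(units=128)(x)
print(last.shape)  # (4, 128)
```

The hidden state at each step is a single vector that mixes all 128 dimensions through the layer’s weight matrices — exactly the crosstalk described above.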

**What about multiple layers in RNN?**

Well, given that RNNs operate on sequences of time-series data, the addition of layers adds levels of abstraction over the input observations through time. **In effect, it chunks observations over time, representing the problem at different time scales.**

From **Speech Recognition with Deep Recurrent Neural Networks, 2013**:

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks.

**So the main purpose of using multilayer RNNs is to learn more sophisticated conditional distributions.** In a single-layered RNN, one hidden state is doing all the work. If you are modeling a sequence such as text, then the internal parameters are learning that **a** is more likely to follow **c** than **o**. **By introducing multiple layers, you allow the RNN to capture more complicated structure.** The first layer might learn that some characters are vowels and others are consonants. The second layer would build on this to learn that a vowel is more likely to follow a consonant.
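Stacking recurrent layers in Keras only requires that every layer except the last return full sequences, so each layer feeds the next one step by step. A minimal sketch (layer sizes, the 24-step/5-feature input shape, and the single regression target are illustrative assumptions for a weather-style task):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense

model = Sequential([
    # First layer must return the whole sequence so the second GRU
    # receives one vector per time step, not just the final state.
    GRU(64, return_sequences=True, input_shape=(24, 5)),
    # Second layer builds higher-level temporal features and returns
    # only its final hidden state.
    GRU(32),
    # E.g. one regression target, such as the next hour's temperature.
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

Forgetting `return_sequences=True` on the lower layer is the classic mistake here: the second GRU would then receive a 2D tensor and raise a shape error.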

# Dataset preparation

**You can download my data from here, or follow these steps to get the custom data file:**

- The original data was taken from the official **NCDC** website.
- Choose **United States**, then click **Access Data/Products**.
- Choose **Surface Data, Hourly Global**.
- Choose **Continue With SIMPLIFIED options**.
- Select **Retrieve data for: New York**.
- Select country: **United States**.
- Then find and select the required weather station, for example: **Selected UNITED STATES stations: EAST HAMPTON AIRPORT……………….. 72209864761 01/2006 to 03/2020**
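Once the station file arrives, the usual first step is to parse the timestamps and put the series on a regular hourly grid. The snippet below is only a sketch: it uses a tiny hand-made sample standing in for the download, and the `DATE`/`TEMP` field names are assumptions — the real NCDC export has many more columns.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded station file (field names assumed).
raw = io.StringIO(
    "DATE,TEMP\n"
    "2006-01-01 00:00,34.0\n"
    "2006-01-01 01:00,33.1\n"
)

# Parse timestamps and index by them so time-based operations work.
df = pd.read_csv(raw, parse_dates=["DATE"]).set_index("DATE").sort_index()

# Hourly observations often have gaps or duplicates; resampling to a
# regular hourly grid makes the series easier to window for an RNN.
hourly = df.resample("1h").mean()
print(hourly.shape)  # (2, 1)
```

For the real file, replace the `io.StringIO` sample with the path to your download and adjust the column names to match its header.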