Weather Forecasting Using Multilayer Recurrent Neural Network

Original article was published on Deep Learning on Medium


Some theory

Well, actually, there are plenty of useful resources like this or this that explain in detail how GRU/LSTM architectures work and all the math behind them, but I think all the explanations I'd seen before were somewhat misleading.

Many articles show pictures like these:

Recurrent neural network with Gated Recurrent Unit
Vanilla RNN architecture

But when we use a Keras RNN layer like GRU(units=128, return_sequences=True), what exactly does this mean? What is a unit?

It took me a little while to figure out that I was thinking of LSTMs wrong. You (like me) might be visualizing an LSTM cell as something with a scalar (1D) hidden cell state c and a scalar output h, so that it takes a vector input x and gives a scalar output. If you think of LSTMs this way, it is tempting to think of 'number of units = d' as taking d such LSTMs and running them in parallel, with a total of d hidden states and d output states.

This is not a good way to think about it, though. The number of units is a parameter of the LSTM, referring to the dimensionality of the hidden state and the dimensionality of the output state (they must be equal). One LSTM or GRU cell comprises an entire layer. There is crosstalk between the hidden states via the weight matrix, so it's not correct to think of it as d separate LSTMs running in parallel. Put another way, an 'unrolled' LSTM looks just like a normal feedforward neural network in which every layer has the same number of units.
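To make "units = dimensionality of the hidden state" concrete, here is a minimal GRU-style cell in plain NumPy (a sketch, not the Keras implementation; the weight matrices are random placeholders). Note how the U matrices mix all components of h together at every step, which is exactly the crosstalk mentioned above:

```python
import numpy as np

def gru_cell(x, h, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU step: input x of shape (n_in,), hidden state h of shape (units,)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W_z @ x + U_z @ h)               # update gate, shape (units,)
    r = sigmoid(W_r @ x + U_r @ h)               # reset gate, shape (units,)
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h))   # candidate state, shape (units,)
    return (1 - z) * h + z * h_tilde             # new hidden state, shape (units,)

n_in, units = 4, 128
rng = np.random.default_rng(0)
W = [rng.normal(size=(units, n_in)) * 0.1 for _ in range(3)]   # input weights
U = [rng.normal(size=(units, units)) * 0.1 for _ in range(3)]  # recurrent weights
h = np.zeros(units)
x = rng.normal(size=n_in)
h = gru_cell(x, h, W[0], U[0], W[1], U[1], W[2], U[2])
print(h.shape)  # (128,)
```

One GRU(units=128) layer carries a single 128-dimensional hidden vector, not 128 independent scalar GRUs.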

Speech Recognition with Deep Recurrent Neural Networks, 2013

What about multiple layers in an RNN?

Well, given that these RNNs operate on sequences of time-series data, adding layers adds levels of abstraction over the input observations through time; in effect, it chunks observations over time, representing the problem at different time scales.

Speech Recognition with Deep Recurrent Neural Networks, 2013:

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks.

So the main purpose of using multilayer RNNs is to learn more sophisticated conditional distributions. In a single layered RNN, one hidden state is doing all the work. If you are modeling a sequence such as text, then the internal parameters are learning that a is more likely to follow c than o. By introducing multiple layers, you allow the RNN to capture more complicated structures. The first layer might learn that some characters are vowels and others are consonants. The second layer would build on this to learn that a vowel is more likely to follow a consonant.
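The stacking itself can be sketched with a plain tanh RNN in NumPy (hypothetical toy sizes, random weights): the second layer's input at every timestep is the first layer's hidden state at that timestep, which is what return_sequences=True feeds it in Keras:

```python
import numpy as np

def rnn_layer(xs, W, U):
    """Run a simple tanh RNN over a sequence xs -> sequence of hidden states."""
    h = np.zeros(U.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return np.stack(out)  # shape (timesteps, units)

rng = np.random.default_rng(1)
T, n_in, u1, u2 = 10, 4, 8, 8
xs = rng.normal(size=(T, n_in))
# Layer 1 reads the raw input sequence...
h1 = rnn_layer(xs, rng.normal(size=(u1, n_in)), rng.normal(size=(u1, u1)))
# ...layer 2 reads layer 1's entire hidden-state sequence
h2 = rnn_layer(h1, rng.normal(size=(u2, u1)), rng.normal(size=(u2, u2)))
print(h1.shape, h2.shape)  # (10, 8) (10, 8)
```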

Dataset preparation

You could download my data from here or follow these steps to get the custom data file:

  • Original data was taken from official NCDC website.
  • Choose United States, then click Access Data/Products
  • Choose Surface Data, Hourly Global
  • Choose Continue With SIMPLIFIED options
  • Select Retrieve data for: New York
  • Select country: United States
  • Then find and select the required weather station, for example:
    Selected UNITED STATES stations: EAST HAMPTON AIRPORT……………….. 72209864761 01/2006 to 03/2020
  • Then select the required period of time
  • Click Continue and follow the instructions to download the requested data
  • After that, the raw data will look like this:
Unprocessed NCDC weather data

We are only interested in two columns from this file: TEMP and YR--MODAHRMN.

The second column is a datetime value in YYYYMMDDHHMM format.
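For illustration, a single raw timestamp in that format can be parsed with the same %Y%m%d%H%M pattern that pandas uses below (the sample value here is made up):

```python
from datetime import datetime

# One hypothetical raw "YR--MODAHRMN" value: 1200 UTC on 1 Jan 2006
ts = datetime.strptime("200601011200", "%Y%m%d%H%M")
print(ts)  # 2006-01-01 12:00:00
```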

Then we need to extract the required columns from the unprocessed text file, remove invalid values and save the result to the Pickle file:

import os
import pandas as pd

download_dir = "dataset/"
path = os.path.join(download_dir, "3317898170868.pkl")
if os.path.exists(path):
    df = pd.read_pickle(path)
else:
    # The simplified NCDC text files are whitespace-delimited
    df = pd.read_csv(download_dir + "3317898170868.txt", sep=r"\s+")
    df = df.filter(["YR--MODAHRMN", "TEMP"])
    df = df.rename(columns={df.columns[0]: "time",
                            df.columns[1]: "temp"})
    df["time"] = pd.to_datetime(df["time"], format="%Y%m%d%H%M")
    df = df.set_index(pd.DatetimeIndex(df["time"]))
    df = df.drop(["time"], axis=1)
    # Remove invalid readings, which are marked as '****'
    df = df[df["temp"] != "****"]
    df = df.astype(int)
    df.to_pickle(path)

Extract Day, Hour and Minute into separate columns for better results:

Processed dataset with all necessary date-time columns
df['Various', 'Day'] = df.index.dayofyear
df['Various', 'Hour'] = df.index.hour
df['Various', 'Minute'] = df.index.minute

All we need to do now is prepare the training and test data from our dataset:

# How many days ahead we want to predict
shift_days = 1
# Each step is one hour
shift_steps = shift_days * 24
# List of signals that we want to predict
# It is 'temperature' in our case
target_names = ['temp']
# Shift the target data shift_steps into the past, so that the
# target at time t is the temperature at time t + shift_steps
df_targets = df[target_names].shift(-shift_steps)
# Input is all values except the last shift_steps rows
x_data = df.values[0:-shift_steps]
# Target data, also excluding the last shift_steps rows
# (they are NaN after the shift)
y_data = df_targets.values[:-shift_steps]
num_data = len(x_data)
# 80% is training set, 20% is test set
train_split = 0.8
# Calculate the exact number of training and test steps
num_train = int(train_split * num_data)
num_test = num_data - num_train
# How many input signals we have (4) and how many we predict (1)
num_x_signals = x_data.shape[1]
num_y_signals = y_data.shape[1]
# Then we need to scale data to the range from 0 to 1
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
# Split data and scale it (scalers are fitted on the training set only)
x_train_scaled = x_scaler.fit_transform(x_data[0:num_train])
x_test_scaled = x_scaler.transform(x_data[num_train:])
y_train_scaled = y_scaler.fit_transform(y_data[0:num_train])
y_test_scaled = y_scaler.transform(y_data[num_train:])
print('Training data:')
# x_data: (62455, 4)
print('x_data: ', x_train_scaled.shape)
# y_data: (62455, 1)
print('y_data: ', y_train_scaled.shape)
print('Test data:')
# x_data: (15614, 4)
print('x_data: ', x_test_scaled.shape)
# y_data: (15614, 1)
print('y_data: ', y_test_scaled.shape)
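The reason the scalers are fitted on the training portion only (fit_transform on train, plain transform on test) can be seen with a minimal stand-in for MinMaxScaler; this is a sketch, not the sklearn implementation, and the numbers are made up:

```python
import numpy as np

class SimpleMinMaxScaler:
    """Minimal stand-in for sklearn's MinMaxScaler (default [0, 1] range)."""
    def fit_transform(self, a):
        self.min_ = a.min(axis=0)
        self.range_ = a.max(axis=0) - self.min_
        return self.transform(a)

    def transform(self, a):
        # Uses the *training* min/max, so unseen data may fall outside [0, 1]
        return (a - self.min_) / self.range_

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[25.0], [35.0]])
s = SimpleMinMaxScaler()
print(s.fit_transform(train).ravel())  # [0.  0.5 1. ]
print(s.transform(test).ravel())       # [0.75 1.25] -- note the value > 1
```

Fitting on the full dataset instead would leak information about the test period into training.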

Batch generation

# Generator for creating random batches of training data
def batch_generator(batch_size, sequence_length):
    while True:
        # Allocate a new array filled with zeroes
        # for the batch of input signals
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)
        # Allocate a new array filled with zeroes
        # for the batch of output signals
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)
        for i in range(batch_size):
            # Pick a random starting point in our time-series data
            idx = np.random.randint(num_train - sequence_length)
            # Copy the sequences of data starting from this point
            x_batch[i] = x_train_scaled[idx:idx + sequence_length]
            y_batch[i] = y_train_scaled[idx:idx + sequence_length]
        yield x_batch, y_batch
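The generator can be sanity-checked in isolation with dummy data; the array sizes below are made up, only the shapes matter:

```python
import numpy as np

# Tiny stand-in data so the generator can be exercised on its own
num_train, num_x_signals, num_y_signals = 100, 4, 1
x_train_scaled = np.random.rand(num_train, num_x_signals)
y_train_scaled = np.random.rand(num_train, num_y_signals)

def batch_generator(batch_size, sequence_length):
    while True:
        x_batch = np.zeros((batch_size, sequence_length, num_x_signals),
                           dtype=np.float16)
        y_batch = np.zeros((batch_size, sequence_length, num_y_signals),
                           dtype=np.float16)
        for i in range(batch_size):
            idx = np.random.randint(num_train - sequence_length)
            x_batch[i] = x_train_scaled[idx:idx + sequence_length]
            y_batch[i] = y_train_scaled[idx:idx + sequence_length]
        yield x_batch, y_batch

gen = batch_generator(batch_size=8, sequence_length=16)
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (8, 16, 4) (8, 16, 1)
```

Each batch is (batch_size, timesteps, signals), which is exactly the 3D input shape the GRU layers below expect.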

Network architecture

Essentially, this architecture was designed through trial and error. After many experiments I came to the following conclusions regarding GRU performance on this dataset:

  • A greater number of units in the first layer better captures short-term trends
  • The presence of a second layer, and a greater number of units in it, better captures long-term trends
  • A third Dense layer helps to smooth the curve in general (eliminating small spikes)
  • Adding a second layer produces better results than increasing the number of units in the first layer
# Model: "sequential"
# _________________________________________________________________
# Layer (type) Output Shape Param #
# =================================================================
# gru (GRU) (None, None, 128) 51456
# _________________________________________________________________
# gru_1 (GRU) (None, None, 128) 99072
# _________________________________________________________________
# time_distributed (TimeDistri (None, None, 100) 12900
# _________________________________________________________________
# time_distributed_1 (TimeDist (None, None, 1) 101
# =================================================================
# Total params: 163,529
# Trainable params: 163,529
# Non-trainable params: 0
# _________________________________________________________________
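The parameter counts in this summary can be verified by hand. For Keras's GRU with the default reset_after=True, each layer has 3 * units * (n_in + units + 2) weights, and a Dense layer has n_in * units + units:

```python
def gru_params(n_in, units):
    # 3 gates, each with input weights (n_in), recurrent weights (units)
    # and two bias vectors (reset_after=True keeps separate biases)
    return 3 * units * (n_in + units + 2)

print(gru_params(4, 128))    # 51456  (first GRU layer, 4 input signals)
print(gru_params(128, 128))  # 99072  (second GRU layer)
print(128 * 100 + 100)       # 12900  (TimeDistributed Dense, 100 units)
print(100 * 1 + 1)           # 101    (TimeDistributed Dense, 1 unit)
```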
batch_size = 32
# I will use a sequence length of 2688, which means that each random
# sequence contains observations for 16 weeks. One timestep
# corresponds to one hour, so 24 x 7 timesteps corresponds to a
# week, and 24 x 7 x 16 corresponds to 16 weeks
sequence_length = 24 * 7 * 16
generator = batch_generator(batch_size=batch_size,
                            sequence_length=sequence_length)

validation_data = (np.expand_dims(x_test_scaled, axis=0),
                   np.expand_dims(y_test_scaled, axis=0))

model = Sequential()
model.add(GRU(units=128, return_sequences=True,
              input_shape=(None, num_x_signals)))
model.add(GRU(units=128, return_sequences=True))
# The two TimeDistributed Dense layers shown in the summary above;
# the activations are my assumption (the output is scaled to [0, 1])
model.add(TimeDistributed(Dense(100, activation='relu')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='mse', optimizer=Adam())

Training process

I've added a few callbacks to better control the training process.

  • ModelCheckpoint callback is used to save the weights whenever the validation loss is the lowest so far
  • EarlyStopping callback checks at the end of every epoch whether the validation loss is still decreasing. Once it has not decreased for 3 epochs straight, the training process is terminated.
  • ReduceLROnPlateau callback monitors the validation loss, and if no improvement is seen for 3 epochs, the learning rate is reduced by a factor of 1.25 (i.e. multiplied by 0.8)
  • TensorBoard callback logs events for training-process visualization
callback_checkpoint = ModelCheckpoint(filepath='checkpoint.keras',
                                      monitor='val_loss', save_best_only=True)
callback_early_stopping = EarlyStopping(monitor='val_loss', patience=3)
# factor=0.8 multiplies the learning rate by 0.8 (a 1.25x reduction)
callback_reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                                       factor=0.8, patience=3)
callback_tensorboard = TensorBoard(log_dir='.\\weather_logs\\')
callbacks = [callback_checkpoint, callback_early_stopping,
             callback_reduce_lr, callback_tensorboard]
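To make the learning-rate schedule concrete, here is the arithmetic of three consecutive plateau reductions at factor 0.8 (the starting learning rate below is a made-up example, not a value from the training run):

```python
lr = 1e-3  # hypothetical initial learning rate
for plateau in range(3):
    # Each plateau of 3 stagnant epochs multiplies the rate by 0.8
    lr *= 0.8
    print(f"after plateau {plateau + 1}: lr = {lr:.6f}")
```

After three plateaus the rate has dropped to about half (0.8^3 = 0.512) of its starting value.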


On training data:

On test data:

Full project is available here on my Github profile.