Source: Deep Learning on Medium

*In this post we will understand a variation of RNN called the Gated Recurrent Unit (GRU): why we need GRU, how it works, the differences between LSTM and GRU, and finally wrap up with an example that uses both LSTM and GRU.*

*Prerequisites*

*Optional read*

Multivariate-time-series-using-RNN-with-keras

*What is Gated Recurrent Unit- GRU?*

- GRU is an improved version of the Recurrent Neural Network (RNN)
- It addresses the vanishing gradient problem of RNNs
- GRU is capable of learning long-term dependencies

RNNs are neural networks with loops that help persist information. RNNs suffer from either the exploding gradient or the vanishing gradient problem.

*What is Exploding and Vanishing gradients?*

The gradients of a neural network are calculated during backpropagation.

With deeper layers in an RNN and weights shared across the RNN cells, we sum up the gradients at each time step. As gradients go through **continuous matrix multiplication due to the chain rule**, they either **shrink exponentially to small values, called vanishing gradients**, or they **blow up to very large values, referred to as exploding gradients**.
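A tiny numeric illustration of this (my own sketch, not from any library): treat the recurrent weight as a single number and multiply it into the gradient once per time step, as the chain rule does. Anything below 1 decays exponentially; anything above 1 explodes.

```python
def backprop_factor(weight, steps):
    """Product of `steps` copies of `weight`, as in a scalar RNN gradient."""
    grad = 1.0
    for _ in range(steps):
        grad *= weight  # one chain-rule multiplication per time step
    return grad

vanishing = backprop_factor(0.5, 50)   # |w| < 1 -> shrinks toward 0
exploding = backprop_factor(1.5, 50)   # |w| > 1 -> grows without bound
print(vanishing)  # ~8.9e-16
print(exploding)  # ~6.4e+08
```

With 50 time steps the gradient is already either numerically negligible or enormous, which is exactly why plain RNNs struggle with long sequences.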

*How can we resolve the problem of Vanishing or Exploding gradients?*

Exploding gradients can be resolved using **gradient clipping**. In gradient clipping we set a pre-determined threshold, and when the gradients exceed this threshold we scale them down to the threshold.
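A minimal sketch of clipping by norm (my own illustration): if the gradient vector's norm exceeds the threshold, rescale the whole vector so its norm equals the threshold; otherwise leave it alone.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Scale `grad` down so its L2 norm does not exceed `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])      # norm = 5
print(clip_by_norm(g, 1.0))   # rescaled to norm 1 -> [0.6 0.8]
```

In Keras you rarely write this by hand: optimizers accept a `clipnorm` argument, e.g. `Adam(clipnorm=1.0)`, which applies the same rescaling to each gradient.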

Vanishing gradients are addressed by using either the Long Short Term Memory (LSTM) or the Gated Recurrent Unit (GRU).

We will discuss GRU here.

*How does GRU address the Vanishing Gradient problem?*

For this we need to first understand how GRU works.

**GRU, like LSTM, is capable of learning long-term dependencies.**

**GRU and LSTM both have a gating mechanism to regulate the flow of information**, like remembering the context over multiple time steps. They keep track of what information from the past can be kept and what can be forgotten. To achieve this,

*GRU uses an Update gate and a Reset gate.*

*What is the functionality of the Update and Reset gates in remembering long-term dependencies?*

**Update Gate**

- Decides how much of the previous memory to keep around: what to keep and what to throw away
- Decides how much of the state to update
- Has a value between 0 and 1
- If the value of the update gate is close to 0, we remember the previous state
- If the value of the update gate is 1 or close to 1, we forget the previous state and store the new value
- The update gate acts similar to the input and forget gates of LSTM

**Reset Gate, also known as the Relevance Gate**

- Decides how much of the information to forget
- Allows the model to drop information that is irrelevant for the future
- Determines how much of the previous memory to keep around

Let’s go step by step and understand how GRU works

*Step 1:* **Drop irrelevant information for the future**

**The Reset gate takes the input xt and the previous hidden state ht-1 and applies a sigmoid activation function.**

The Reset gate determines whether the current state will carry the new information or still carry the previous information.

If the Reset gate has a value close to 0, the previous hidden state is ignored. This means that the previous information is irrelevant, and we want to drop it and store the new information.
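As a concrete sketch of this step (my own illustration; `Wr`, `Ur` and `br` are random placeholders standing in for the learned parameters), the reset gate in NumPy looks like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4

# learned parameters of the reset gate (random placeholders here)
Wr = rng.normal(size=(n_hid, n_in))
Ur = rng.normal(size=(n_hid, n_hid))
br = np.zeros(n_hid)

x_t = rng.normal(size=n_in)        # current input
h_prev = rng.normal(size=n_hid)    # previous hidden state

# r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)
r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)
print(r_t)  # every entry lies in (0, 1)
```

The sigmoid squashes each entry into (0, 1), so each unit of `r_t` acts as a soft switch on the corresponding unit of the previous hidden state.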

*Step 2:* **How much of the previous memory to store**

**The Update gate takes the input xt and the previous hidden state ht-1 and applies a sigmoid activation function.**

The Update gate determines how much of the previous memory to keep around; it decides what to keep and what to throw away.

If the value of the update gate is close to 0, we remember the previous state.

*Step 3:* **Final memory to be stored**

When the Reset gate **rt** is close to 0, the previous hidden state is ignored and reset with the current input **xt** only.

The hidden state will drop any information that is found to be irrelevant for the future. This is a compact representation.

Update gate controls how much information from the previous hidden state will carry over to the current hidden state.

If the value of the update gate is close to 0, we remember the previous hidden state. If it is 1 or close to 1, we forget the previous hidden state and store the new value.

Because GRU has separate reset and update gates, each unit can learn to capture dependencies over different time scales. Units that learn to capture **short-term dependencies will tend to have reset gates that are frequently active**. Units that capture **longer-term dependencies will have update gates that are mostly active**.
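Putting the three steps together, one GRU forward step can be sketched in NumPy as follows. This follows the standard GRU equations, not any particular library's code, and the weight matrices are random placeholders for the learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: reset gate, update gate, candidate, final blend."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # Step 2: update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # Step 1: reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))  # candidate state
    # Step 3: z_t close to 0 keeps the previous state,
    # close to 1 replaces it with the new candidate
    return (1.0 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# input-to-hidden matrices at even indices, hidden-to-hidden at odd
params = [rng.normal(size=(n_hid, n_in)) if i % 2 == 0
          else rng.normal(size=(n_hid, n_hid)) for i in range(6)]

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # run 5 time steps
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

Note the final line: the new hidden state is a convex combination of the old state and the candidate, which is exactly the additive update discussed below.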

*Now that we have understood the working of GRU we revisit our question of how GRU solves vanishing gradient issue.*

With vanishing gradients, the gradients become small or zero and quickly die out.

The gating mechanism in GRU and LSTM helps resolve the vanishing gradient issue. Shutting the update gate essentially skips layers when calculating the gradient. The gates hold information in memory as long as it is required and update it with new information only when necessary.

Using the combination of gates to either pass or block information, the network can preserve the gradients no matter how deep the network or how long the input sequence is.

Intuitively, the error is additive instead of multiplicative and hence is easier to keep in a reasonable range. The forget gate in LSTM and the update gate in GRU help with long-term dependencies.

*Let’s understand the commonality and difference between LSTM and GRU?*

**Commonality between LSTM and GRU**

- LSTM and GRU both have update units with an additive component from t to t + 1, which is lacking in the traditional RNN.
- LSTM units and GRU both **keep the existing content and add the new content on top of it**.
- The Update gate in GRU and the Forget gate in LSTM take the **linear sum between the existing state and the newly computed state**.
- LSTM and GRU both **address the vanishing gradient issue** present in RNNs.

#### Differences between LSTM and GRU

- **GRU has two gates, reset and update. LSTM has three gates: input, forget and output.** GRU does not have an output gate like LSTM; the update gate in GRU does the work of the input and forget gates of LSTM.
- **GRU has fewer parameters**, so it is **computationally more efficient** and needs less data to generalize than LSTM.
- **LSTM maintains an internal memory state c**, while GRU does not have a separate memory cell.
- **GRU does not have any mechanism to control the degree to which its state or memory content is exposed**; it exposes the whole state each time. LSTM can control how much of its memory content it wants to expose.

*Finally we wrap up with an example that will use LSTM as well as GRU*

Here I have used the Electric power consumption dataset.

Importing required libraries

```python
import pandas as pd
import numpy as np
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, GRU
import tensorflow as tf
from datetime import datetime
```

Reading the dataset, parsing the dates and inferring the datetime format. We also fill the NaNs with 0.

```python
# use a raw string so the backslashes in the Windows path
# are not treated as escape sequences
dataset = read_csv(r"c:\data\power_consumption.csv",
                   parse_dates={'dt': ['Date', 'Time']},
                   infer_datetime_format=True,
                   index_col=0, na_values=['nan', '?'])
dataset.fillna(0, inplace=True)

values = dataset.values
# ensure all data is float
values = values.astype('float32')
```

Looking at some sample data from the dataset

```python
dataset.head(4)
```

As the input features are on different scales, we need to normalize them. We are using the MinMax scaler.

```python
# normalizing input features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
scaled = pd.DataFrame(scaled)
```

Looking at the data after it is normalized

```python
scaled.head(4)
```

We define a function to create the time series dataset. We can specify the lookback interval and the predicted column.

```python
def create_ts_data(dataset, lookback=1, predicted_col=1):
    # inputs: all rows except the last `lookback`
    temp = dataset.copy()
    temp["id"] = range(1, len(temp) + 1)
    temp = temp.iloc[:-lookback, :]
    temp.set_index('id', inplace=True)
    # target: the predicted column shifted forward by `lookback`
    predicted_value = dataset.copy()
    predicted_value = predicted_value.iloc[lookback:, predicted_col]
    predicted_value = pd.DataFrame(predicted_value)
    predicted_value.columns = ["Predicted"]
    predicted_value["id"] = range(1, len(predicted_value) + 1)
    predicted_value.set_index('id', inplace=True)
    final_df = pd.concat([temp, predicted_value], axis=1)
    return final_df
```

We now create the time series dataset, looking back one time step.

```python
reframed_df = create_ts_data(scaled, 1, 0)
reframed_df.fillna(0, inplace=True)
reframed_df.columns = ['var1(t-1)', 'var2(t-1)', 'var3(t-1)', 'var4(t-1)',
                       'var5(t-1)', 'var6(t-1)', 'var7(t-1)', 'var1(t)']
print(reframed_df.head(4))
```

Splitting the dataset into train and test sets

```python
# split into train and test sets
values = reframed_df.values
training_sample = int(len(dataset) * 0.7)
train = values[:training_sample, :]
test = values[training_sample:, :]
# split into inputs and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
```

Reshaping the dataset to 3D: sample size, lookback time steps and the input features.

```python
# reshape input to be 3D [samples, time steps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```

We now create the LSTM model with three LSTM layers and one Dense layer. We compile the model using the Adam optimizer. Loss is calculated using mean absolute error (MAE).

```python
model_lstm = Sequential()
model_lstm.add(LSTM(75, return_sequences=True,
                    input_shape=(train_X.shape[1], train_X.shape[2])))
model_lstm.add(LSTM(units=30, return_sequences=True))
model_lstm.add(LSTM(units=30))
model_lstm.add(Dense(units=1))
model_lstm.compile(loss='mae', optimizer='adam')
```

Let’s look at the LSTM model summary

```python
model_lstm.summary()
```

Fitting the LSTM model

```python
# fit network
history_lstm = model_lstm.fit(train_X, train_y, epochs=10, batch_size=64,
                              validation_data=(test_X, test_y), shuffle=False)
```

We now create a GRU model with layers similar to the LSTM.

```python
model_gru = Sequential()
model_gru.add(GRU(75, return_sequences=True,
                  input_shape=(train_X.shape[1], train_X.shape[2])))
model_gru.add(GRU(units=30, return_sequences=True))
model_gru.add(GRU(units=30))
model_gru.add(Dense(units=1))
model_gru.compile(loss='mae', optimizer='adam')
```

Let’s look at the GRU model summary

```python
model_gru.summary()
```

We can see that the LSTM and GRU models have the same architecture, but the number of parameters in the LSTM is 44,971 whereas in the GRU it is 33,736. GRU is a simpler model with two gates, compared to the three gates of LSTM. As GRU has fewer parameters, it is computationally more efficient than LSTM.
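The summary counts can be verified by hand. Per recurrent layer, Keras allocates gates × (input_dim × units + units² + units) parameters: 4 gates for LSTM, 3 for GRU. This is my own arithmetic check, assuming 7 input features and the older GRU formulation (`reset_after=False`); newer Keras versions add a second recurrent bias per GRU gate, which would give a slightly higher count.

```python
def rnn_layer_params(gates, input_dim, units):
    # input weights + recurrent weights + one bias vector, per gate
    return gates * (input_dim * units + units * units + units)

layers = [(7, 75), (75, 30), (30, 30)]   # (input_dim, units) for each layer
dense = 30 * 1 + 1                        # final Dense(1): weights + bias

lstm_total = sum(rnn_layer_params(4, i, u) for i, u in layers) + dense
gru_total = sum(rnn_layer_params(3, i, u) for i, u in layers) + dense
print(lstm_total, gru_total)  # 44971 33736
```

The 3:4 gate ratio is why the GRU's recurrent parameter count is exactly three quarters of the LSTM's.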

Fitting the GRU model

```python
# fit network (note: fit the GRU model, not the LSTM)
gru_history = model_gru.fit(train_X, train_y, epochs=10, batch_size=64,
                            validation_data=(test_X, test_y), shuffle=False)
```

To understand how the loss varied across LSTM and GRU, we plot the losses.

```python
pyplot.plot(history_lstm.history['loss'], label='LSTM train', color='red')
pyplot.plot(history_lstm.history['val_loss'], label='LSTM test', color='green')
pyplot.plot(gru_history.history['loss'], label='GRU train', color='brown')
pyplot.plot(gru_history.history['val_loss'], label='GRU test', color='blue')
pyplot.legend()
pyplot.show()
```

**What did I learn while creating the model?**

Bad data with null values caused the accuracy and loss to be NaN. To resolve that, ensure you do not have any nulls in the data.
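A quick sanity check for this (a generic pandas idiom, shown on a small made-up frame rather than the power consumption data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"power": [1.2, np.nan, 3.4],
                   "voltage": [230.0, 231.0, np.nan]})

print(df.isnull().sum())   # NaN count per column, before cleaning
df = df.fillna(0)          # as done above: replace NaNs with 0
print(df.isnull().sum().sum())  # 0
```

Running `df.isnull().sum()` before training makes the NaN problem visible immediately, instead of surfacing later as a NaN loss.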

#### Read it, share it and give claps if you liked the post.

#### References:

- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
- Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano (www.wildml.com)
- https://www.cs.toronto.edu/~guerzhoy/321/lec/W09/rnn_gated.pdf