Advanced Use of Recurrent Neural Networks: Part 6

The fourth section in a series of posts on deep learning with Python.

Using Recurrent Dropout to Fight Overfitting

It’s evident from the training and validation curves that the model is overfitting: the training and validation losses start to diverge considerably after a few epochs. You’re already familiar with a classic technique for fighting this phenomenon: dropout, which randomly zeros out input units of a layer in order to break happenstance correlations in the training data that the layer is exposed to. But how to correctly apply dropout in recurrent networks isn’t a trivial question.

It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with regularization. In 2015, Yarin Gal, as part of his PhD thesis on Bayesian deep learning*, determined the proper way to use dropout with a recurrent network: the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. What’s more, in order to regularize the representations formed by the recurrent gates of layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time; a temporally random dropout mask would disrupt this error signal and be harmful to the learning process.

See Yarin Gal, “Uncertainty in Deep Learning (PhD Thesis),” October 13, 2016
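To make the distinction concrete, here’s a minimal sketch in plain Python (the sizes and dropout rate are arbitrary, chosen for illustration) contrasting a temporally constant dropout mask, reused at every timestep, with the naive alternative of redrawing a fresh mask at each step:

```python
import random

random.seed(0)
timesteps, units, rate = 5, 8, 0.2

# Temporally constant dropout: draw one mask of kept/dropped units
# and reuse it at every timestep, as Gal's scheme prescribes.
mask = [1.0 if random.random() >= rate else 0.0 for _ in range(units)]
constant = [mask for _ in range(timesteps)]

# Naive alternative: a fresh random mask at each timestep, which
# disrupts the error signal propagated through time.
varying = [[1.0 if random.random() >= rate else 0.0 for _ in range(units)]
           for _ in range(timesteps)]

# In the constant scheme, every timestep sees the identical mask.
assert all(row == constant[0] for row in constant)
```

In a real recurrent layer the mask would multiply the input (or recurrent) activations at each step; the point here is only that the pattern of dropped units stays fixed across time.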

Yarin Gal did his research using Keras and helped build this mechanism directly into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units. Let’s add dropout and recurrent dropout to the GRU layer and see how doing so impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, you’ll train the network for twice as many epochs.

>>> from keras.models import Sequential
>>> from keras import layers
>>> from keras.optimizers import RMSprop
>>> model = Sequential()
>>> model.add(layers.GRU(
...     32,
...     dropout=0.2,
...     recurrent_dropout=0.2,
...     input_shape=(None, float_data.shape[-1])))
>>> model.add(layers.Dense(1))
>>> model.compile(optimizer=RMSprop(), loss='mae')
>>> history = model.fit_generator(
...     train_gen,
...     steps_per_epoch=500,
...     epochs=40,
...     validation_data=val_gen,
...     validation_steps=val_steps)

Plot the training and validation losses:

>>> import matplotlib.pyplot as plt
>>> loss = history.history['loss']
>>> val_loss = history.history['val_loss']
>>> epochs = range(1, len(loss) + 1)
>>> plt.figure()
>>> plt.plot(epochs, loss, 'bo', label='Training loss')
>>> plt.plot(epochs, val_loss, 'b', label='Validation loss')
>>> plt.title('Training and validation loss')
>>> plt.legend()
>>> plt.show()

Success! You’re no longer overfitting during the first 30 epochs. But although you have more stable evaluation scores, your best scores aren’t much lower than they were previously.
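If you want to read the best (lowest) validation MAE off the curve programmatically rather than eyeballing the plot, you can take the minimum of `history.history['val_loss']`. The list below is a hypothetical stand-in for that curve, just to show the pattern:

```python
# Hypothetical validation-loss values standing in for
# history.history['val_loss'] from the run above.
val_loss = [0.42, 0.35, 0.31, 0.29, 0.30, 0.28, 0.29, 0.31]

# Index of the lowest loss; +1 converts to a 1-based epoch number.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_score = val_loss[best_epoch - 1]
print(f"Best validation MAE {best_score:.2f} at epoch {best_epoch}")
```

Comparing this best score with the one from the unregularized run makes the “more stable but not much lower” observation precise.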

Source: Deep Learning on Medium