Intuit and Implement: Batch Normalization



In this article I will review the usefulness of Ioffe and Szegedy’s batch normalization. I will also implement batch normalization in Keras, and demonstrate substantial gains in training performance. The original post can be found on my website, and the code can be found on my GitHub.

An Intuitive Explanation of Batch Normalization

Problems in Training

Problem 1: As a network trains, weights in early layers change and as a result the inputs of later layers vary wildly. Each layer must readjust its weights to the varying distribution of every batch of inputs. This slows model training. If we could make layer inputs more similar in distribution, the network could focus on learning the difference between classes.

Another effect of varied batch distribution is vanishing gradients. The vanishing gradient problem is a big deal, particularly for the sigmoid activation function. If g(x) represents the sigmoid activation function, as |x| increases, g′(x) tends to zero.

Sigmoid Function and its Derivative
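To see this numerically, here is a small NumPy sketch (illustrative only) of how the sigmoid gradient g′(x) = g(x)(1 − g(x)) collapses as |x| grows:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    #derivative of the sigmoid: g'(x) = g(x) * (1 - g(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
#0.0  -> 0.25
#2.0  -> 0.105
#5.0  -> 0.0066
#10.0 -> 0.000045

Past |x| ≈ 5 there is essentially no gradient left to pass back.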

Problem 2: When the input distribution varies, so does neuron output. This results in neuron outputs that occasionally fluctuate into the sigmoid function’s saturating regions. Once there, neurons can neither update their own weights nor pass a gradient back to prior layers. How can we keep neuron outputs from drifting into saturating regions?

If we can restrict neuron output to the area around zero, we can ensure that each layer will pass back a substantial gradient during back propagation. This will lead to faster training times, and more accurate results.

The Sigmoid Sweet Spot

Batch Norm as a Solution

Batch normalization mitigates the effects of varied layer inputs. By normalizing the outputs of neurons, the activation function will only receive inputs close to zero. This ensures a non-vanishing gradient, solving the second problem.

Batch normalization transforms layer outputs into a unit gaussian distribution. As these outputs are fed through an activation function, layer activations will also become more normally distributed.

Since the output of one layer is the input of the next, layer inputs will now have significantly less variation from batch to batch. By reducing the varied distribution of layer inputs we solve the first problem.

Mathematical Explanation

With batch normalization we seek a zero-centered, unit-variance distribution of inputs for each activation function. During training we take an activation input x and subtract the batch mean μ from it to achieve a zero-centered distribution.

Next we divide by the square root of the batch variance plus a small constant ϵ that prevents division by zero: x̂ = (x − μ) / √(σ² + ϵ). This ensures that all activation input distributions have unit variance.

Lastly we pass x̂ through a linear transformation with a learnable scale γ and shift β, which scales and shifts the output of batch normalization. This ensures that the normalizing effect is maintained despite the changes in the network during back propagation.
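Written out in the notation of the original paper, for a mini-batch of size m the full transform is:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta.$$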

When testing the model we do not use the batch mean or variance, as this would break the model. (Hint: what are the mean and variance of a single observation?) Instead, we use moving averages of the batch means and variances computed during training as estimates of the population statistics.
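To tie the pieces together, here is a minimal NumPy sketch of what a batch norm layer computes in both modes (the function name is my own; the momentum and eps defaults mirror Keras’ BatchNormalization, which of course handles all of this internally):

import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.99, eps=1e-3):
    """Per-feature batch normalization for x of shape (batch, features)."""
    if training:
        mu = x.mean(axis=0)        #batch mean
        var = x.var(axis=0)        #batch variance
        #moving estimates of the population statistics, used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        #test time: use the population estimates gathered during training
        mu, var = running_mean, running_var

    x_hat = (x - mu) / np.sqrt(var + eps)   #zero-centered, unit variance
    return gamma * x_hat + beta, running_mean, running_var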

Benefits of Batch Normalization

The benefits of batch normalization are the following.

1. Helps prevent vanishing gradients in networks with saturating nonlinearities (sigmoid, tanh, etc.)

With batch normalization we ensure that the inputs to any activation function do not drift into saturating regions. Batch normalization transforms the distribution of those inputs to be unit gaussian (zero-centered, with unit variance).

2. Regularizes the model

Maybe. Ioffe and Szegedy make this claim but don’t write extensively on the issue. Perhaps this comes as a result of normalizing layer inputs?

3. Allows for Higher Learning Rates

By preventing vanishing gradients during training, we can afford to set higher learning rates. Batch normalization also reduces the dependence of gradients on parameter scale: large learning rates can increase the scale of layer parameters, which would normally amplify the gradients as they are passed back during back propagation, but batch normalization makes a layer’s output invariant to that scaling. I need to read more about this.
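That scale invariance is easy to check numerically: normalizing (aW)u gives the same result as normalizing Wu, because the constant a cancels in both the mean and the standard deviation. A toy sketch (the shapes and names here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 32))   #a batch of layer inputs
W = rng.normal(size=(32, 16))    #layer weights

def normalize(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

z1 = normalize(x @ W)            #normalized Wu
z2 = normalize(x @ (10.0 * W))   #normalized (aW)u with a = 10
print(np.abs(z1 - z2).max())     #tiny (~1e-6): the factor of 10 cancels, up to eps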

Implementation in Keras

Imports

import tensorflow as tf
import numpy as np
import os
import time

import keras
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from keras.models import Model, Sequential
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import (Input, BatchNormalization, GlobalAveragePooling2D,
                          Activation, Conv2D, MaxPooling2D, Dense, Dropout, Flatten)

Data Load and Preprocessing

In this notebook we use the CIFAR-100 dataset, as it is reasonably challenging and won’t take forever to train. The only preprocessing is zero-centering the images and generating augmented variations with an ImageDataGenerator.

from keras.datasets import cifar100
from keras.utils import np_utils

(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')

#convert to float and zero-center both sets using the training-set mean
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

train_mean = x_train.mean()
x_train -= train_mean
x_test -= train_mean

#onehot encode the target classes
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)


train_datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_datagen.fit(x_train)

train_generator = train_datagen.flow(x_train,
                                     y=y_train,
                                     batch_size=80)

Constructing the Model in Keras

Our architecture will consist of stacked 3×3 convolutions followed by a max pool and dropout. There are 5 convolutional blocks in each network. The final layer is a fully connected layer with 100 nodes and softmax activation.

We will construct 4 different convolutional networks, each using either sigmoid or ReLU activations, with or without batch normalization. We will compare the validation loss of each network against the others.

def conv_block_first(model, bn=True, activation="sigmoid"):
    """
    The first convolutional block in each architecture. Only separate so we can
    specify the input shape.
    """
    #First stacked convolution
    model.add(Conv2D(60, 3, padding="same", input_shape=x_train.shape[1:]))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    #Second stacked convolution
    model.add(Conv2D(60, 3, padding="same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    model.add(MaxPooling2D())
    model.add(Dropout(0.15))
    return model

def conv_block(model, bn=True, activation="sigmoid"):
    """
    Generic convolutional block with 2 stacked 3x3 convolutions, max pooling,
    dropout, and optional batch normalization.
    """
    model.add(Conv2D(60, 3, padding="same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    model.add(Conv2D(60, 3, padding="same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    model.add(MaxPooling2D())
    model.add(Dropout(0.15))
    return model

def conv_block_final(model, bn=True, activation="sigmoid"):
    """
    I bumped up the number of filters in the final block. I made this separate
    so that I might be able to integrate Global Average Pooling later on.
    """
    model.add(Conv2D(100, 3, padding="same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    model.add(Conv2D(100, 3, padding="same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))

    model.add(Flatten())
    return model

def fn_block(model):
    """
    I'm not going for a very deep fully connected block, mainly so I can save on memory.
    """
    model.add(Dense(100, activation="softmax"))
    return model

def build_model(blocks=3, bn=True, activation="sigmoid"):
    """
    Builds a sequential network based on the specified parameters.

    blocks: number of convolutional blocks in the network, must be greater than 2.
    bn: whether to include batch normalization or not.
    activation: activation function to use throughout the network.
    """
    model = Sequential()

    model = conv_block_first(model, bn=bn, activation=activation)

    for block in range(1, blocks - 1):
        model = conv_block(model, bn=bn, activation=activation)

    model = conv_block_final(model, bn=bn, activation=activation)
    model = fn_block(model)

    return model

def compile_model(model, optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]):
    """
    Compiles a neural network.

    model: the network to be compiled.
    optimizer: the optimizer to use.
    loss: the loss to use.
    metrics: a list of keras metrics.
    """
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=metrics)
    return model
#COMPILING THE 4 MODELS
sigmoid_without_bn = build_model(blocks = 5, bn=False, activation = "sigmoid")
sigmoid_without_bn = compile_model(sigmoid_without_bn)

sigmoid_with_bn = build_model(blocks = 5, bn=True, activation = "sigmoid")
sigmoid_with_bn = compile_model(sigmoid_with_bn)


relu_without_bn = build_model(blocks = 5, bn=False, activation = "relu")
relu_without_bn = compile_model(relu_without_bn)

relu_with_bn = build_model(blocks = 5, bn=True, activation = "relu")
relu_with_bn = compile_model(relu_with_bn)

Model Training

Sigmoid without Batch Normalization

Training gets stuck here. With 100 classes, this model never achieves better performance than random guessing (1% accuracy).

#ModelCheckpoint callback shared by all four runs (the filepath here is a placeholder)
model_checkpoint = ModelCheckpoint("weights.h5", save_best_only=True)

history1 = sigmoid_without_bn.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    epochs=20,
    verbose=0,
    validation_data=(x_test, y_test),
    callbacks=[model_checkpoint])

Sigmoid with Batch Normalization

Unlike without batch normalization, this model gets off the ground during training. This may be a result of batch normalization’s mitigation of the vanishing gradient.

history2 = sigmoid_with_bn.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    verbose=0,
    epochs=20,
    validation_data=(x_test, y_test),
    callbacks=[model_checkpoint])

ReLU Without Batch Normalization

Implementing ReLU without batch norm led to some initial gains, then convergence to a non-optimal local minimum.

history3 = relu_without_bn.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    epochs=20,
    verbose=0,
    validation_data=(x_test, y_test),
    callbacks=[model_checkpoint])

ReLU with Batch Normalization

As with the sigmoid models, batch normalization improves the training capabilities of this network.

history4 = relu_with_bn.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    verbose=0,
    epochs=20,
    validation_data=(x_test, y_test),
    callbacks=[model_checkpoint])

Comparing Architectures

We clearly see the benefit of batch normalization here. Both the ReLU and sigmoid models without batch normalization failed to maintain their training performance gains, possibly as a result of vanishing gradients. Architectures with batch normalization trained faster and performed better than those without.
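For reference, a comparison plot like this can be produced directly from the four history objects above (a quick sketch using the matplotlib import from earlier; 'val_loss' is the relevant key in each history dictionary):

histories = {"Sigmoid, no BN": history1,
             "Sigmoid + BN": history2,
             "ReLU, no BN": history3,
             "ReLU + BN": history4}

plt.figure(figsize=(10, 6))
for label, history in histories.items():
    #each history object stores per-epoch metrics in a dictionary
    plt.plot(history.history["val_loss"], label=label)
plt.xlabel("Epoch")
plt.ylabel("Validation loss")
plt.title("Validation loss with and without batch normalization")
plt.legend()
plt.show()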

Conclusion

Batch normalization reduced training time and boosted the stability of our networks. This effect held for both the sigmoid and ReLU activation functions.

Resources

Further reading

Below are some more recent research papers that extend Ioffe and Szegedy’s work.

[1] How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)

[2] Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

[3] Layer Normalization

[4] Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

[5] Group Normalization
