The Bread and Butter from Deep Learning by Andrew Ng — Course 2: Improving Deep Neural Networks

Source: Deep Learning on Medium


Deep learning is a highly iterative search for the right set of hyper-parameters. It is important to keep cycling through having an idea, coding it, and running the experiment. To do so, we need to split the dataset properly into train, dev, and test sets. The train set is used to train the model. The dev and test sets hold data that the model has never seen and are used to evaluate it. Traditionally, the rule of thumb was a 6:2:2 split of Train:Dev:Test. With big data, however, that ratio allocates far too much data to the dev and test sets. Because deep learning is very data hungry, it is wiser to use most of the data for training and keep only around 10,000 samples each for the dev and test sets. It is also okay to have only a dev set. If we have both dev and test sets, it is extremely important that they come from the same distribution. For example, one cannot consist of high-definition images while the other consists of low-definition images.
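A minimal sketch of such a split, assuming the data sits in NumPy arrays with one example per column (the convention the later snippets use). The function name `split_dataset` and its default sizes are illustrative choices, not something taken from the course.

```python
import numpy as np

def split_dataset(X, Y, n_dev=10000, n_test=10000, seed=0):
    """Shuffle the examples (columns) and carve out dev and test sets.

    X has shape (n_features, m) and Y has shape (n_outputs, m).
    """
    m = X.shape[1]
    assert n_dev + n_test < m, "dev and test sets must leave room for training data"
    perm = np.random.default_rng(seed).permutation(m)
    # Shuffling once before splitting draws all three sets from the same distribution.
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))
```

With one million examples and the defaults, this would leave 980,000 examples for training.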

The reason we need separate train and dev sets is to analyze the model’s bias and variance. A model with high bias is not flexible enough: it lacks the complexity to classify even the training data properly. A model with high variance, on the other hand, is too flexible and overfits: it specializes in the train set and does poorly on any data it has not seen before. Therefore, when the dev error is high while the train error is low, we have a variance problem to fix, and when the train error is high and the dev error is about the same, we have a bias problem to fix. If the train error is already high and the dev error is higher still, we have both bias and variance problems to solve. Eventually, we want both the train error and the dev error to be low with a very small gap between them, that is, low bias and low variance. With high bias, we need to make the network more flexible/complex or simply make it learn more, so we can build a bigger network, which almost never hurts, or train for longer. With high variance, we want to prevent overfitting by getting more training data or adding regularization to the model.
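The decision rule above can be written down as a small helper. This is only a sketch: the function name `diagnose`, the `target_error` baseline, and the `tolerance` threshold are hypothetical knobs chosen for illustration, and in practice what counts as a "high" error depends on the task.

```python
def diagnose(train_error, dev_error, target_error=0.0, tolerance=0.02):
    """Label a model from its train and dev error rates (fractions in [0, 1]).

    target_error is the error we consider achievable, and tolerance is the
    largest gap we are willing to accept; both are assumptions of this sketch.
    """
    high_bias = (train_error - target_error) > tolerance
    high_variance = (dev_error - train_error) > tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"
```

For example, a 1% train error with an 11% dev error would be labeled high variance, while 15% and 16% would be labeled high bias.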

There are several ways to do regularization. One of them, weight decay, adds a penalty on the size of the weights to the cost function, so that large weights (both negative and positive) are penalized. Because our goal is to minimize the cost, the weights are pushed closer to 0, which lets the network learn a much simpler mapping function. If the penalty is the sum of the squared weights, it is L2 regularization; if it is the sum of their absolute values, it is L1 regularization. The sum is multiplied by λ / (2N), where N is the number of training examples and λ is a new hyper-parameter. It is okay to leave the biases out of the penalty because there are so few of them compared to the weights that they make little difference.

In Python:

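import numpy as np

# cross_entropy_loss holds the per-example losses; N is the number of examples
loss = cross_entropy_loss
cost = np.mean(loss)
# L2 regularization: add lambd/(2N) times the sum of squared weights of every layer
for l in range(L):
    W = parameters['W'+str(l+1)]
    cost += lambd/(2*N) * np.sum(np.square(W))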
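
The name "weight decay" comes from what the penalty does during the update: the term λ/(2N) times the sum of squared weights adds (λ/N)·W to the gradient, so every step first shrinks W by a factor of (1 − learning_rate·λ/N). A minimal sketch for a single weight matrix, with the hypothetical helper name `l2_update`:

```python
import numpy as np

def l2_update(W, dW_data, lambd, N, learning_rate):
    """One gradient step on the L2-regularized cost for a single weight matrix.

    dW_data is the gradient of the unregularized cost; the penalty
    lambd/(2N) * sum(W**2) contributes (lambd/N) * W on top of it.
    """
    dW = dW_data + (lambd / N) * W
    return W - learning_rate * dW
```

When the data gradient is zero, the update reduces to multiplying W by (1 − learning_rate·λ/N), which is the decay.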

Another regularization method is dropout. With a certain probability, we shut some neurons off, forcing the network to learn a much simpler function. It is important to drop a different random set of neurons on every iteration. We might want to use a different dropout probability for each layer, because layers with more parameters are more prone to overfitting than the others. For those layers we would normally drop more neurons by using a lower keep_prob, which is the fraction of neurons to keep. We usually do not use dropout on the input layer, and never at test time. The intuition behind dropout is that the network learns a simpler mapping function and also spreads out its weights, so that it does not come to rely on just a few features. One downside of dropout is that the cost function is no longer well defined, because a different network is evaluated on every iteration, which makes the training curve harder to interpret.

In Python:

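# dropout in layer 2 (training only)
D2 = np.random.rand(*A2.shape) < keep_prob2   # mask: True for the units we keep
A2 = A2*D2
A2 /= keep_prob2                              # rescale so the expected activation is unchanged
# dropout backward in layer 2: reuse the same mask
dA2 = dA2 * D2
dA2 /= keep_prob2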

Lastly, early stopping is worth trying. Weights that are initialized to small random numbers grow as we train for longer, and at some point the network starts to overfit. So we monitor both the train and the dev error during training and stop when the dev error is at its lowest. However, we mentioned that training for longer reduces bias, which means that cutting training short affects bias as well. Early stopping therefore goes against orthogonalization, which means dealing with bias and variance separately, and is not an ideal option.
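One common way to decide when "just right" has passed is a patience counter on the dev error. The sketch below only picks the stopping epoch from a list of recorded dev errors; the function name `early_stopping_epoch` and the `patience` parameter are assumptions made for illustration.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose parameters we would keep.

    dev_errors is the dev error recorded after each epoch. We stop looking
    once the dev error has failed to improve for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # dev error has stopped improving
    return best_epoch
```

In a real training loop we would also save a copy of the parameters at the best epoch so that they can be restored after stopping.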

Collecting more data can be very difficult and expensive. Data augmentation is a technique that creates new data by modifying the original data, for example by flipping or randomly cropping an image, just slightly enough that its meaning does not change.
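A minimal sketch of the two augmentations mentioned above, assuming an image stored as a NumPy array of shape (height, width, channels). The function name `augment` and the 50% flip probability are illustrative choices.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Return a randomly cropped, randomly mirrored copy of one image.

    image has shape (height, width, channels); crop_size is (crop_h, crop_w).
    """
    h, w, _ = image.shape
    crop_h, crop_w = crop_size
    # Pick the top-left corner so that the crop stays inside the image.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    patch = image[top:top + crop_h, left:left + crop_w, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]  # horizontal flip
    return patch.copy()
```

Calling `augment(image, (28, 28), np.random.default_rng(0))` on a 32×32 image yields a 28×28 variant; whether a flip preserves the meaning depends on the data (it would not for images of text, for instance).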

Normalizing the inputs is almost always a good idea because it speeds up learning. Normalization centers every input feature at 0 and gives it a variance of 1. To achieve this, we first subtract the mean (µ) of X from X, then divide by the standard deviation, which is the square root of the variance (σ²). This gives a rounder cost function that is easier to optimize. Normalization matters most when the input features are on very different scales. For example, with x₁ ranging from -1.0 to 1.0 and x₂ ranging from -100 to 100, we really want to normalize.

In Python:

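# X has shape (n_features, n_examples); keepdims keeps the statistics broadcastable
X_mean = np.mean(X, axis=1, keepdims=True)
X_var = np.mean((X - X_mean)**2, axis=1, keepdims=True)
X_norm = (X - X_mean) / np.sqrt(X_var + 1e-8)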
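
The dev and test sets should be scaled with the mean and variance computed on the train set, so that all data goes through the same transformation. A minimal sketch of that separation, with the hypothetical helper names `fit_normalizer` and `apply_normalizer`:

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation; X_train is (n_features, m)."""
    mean = np.mean(X_train, axis=1, keepdims=True)
    std = np.sqrt(np.var(X_train, axis=1, keepdims=True) + eps)
    return mean, std

def apply_normalizer(X, mean, std):
    """Normalize any split with statistics fitted on the train set."""
    return (X - mean) / std
```

The statistics are fitted once on the train set and then reused unchanged for the dev set, the test set, and any data seen after deployment.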

Vanishing gradients have been a problem for deeper networks. If all the weights are smaller than 1 in magnitude, the gradients flowing back during back propagation get smaller and smaller as they pass through the layers. One way to alleviate this problem is to initialize the weights properly. Just as we multiplied by 0.01 to get small random weights for sigmoid activations, for ReLU activations we scale the random weights so that their variance is 2 / (number of inputs to the layer). This method is called He initialization.

In Python:

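def he_initialization(n_units):
    # n_units[l] is the number of units in layer l (n_units[0] is the input size)
    parameters = {}
    for l in range(1, len(n_units)):
        # standard normal samples scaled to variance 2 / (number of inputs)
        parameters['W'+str(l)] = (np.random.randn(n_units[l], n_units[l-1])
                                  * np.sqrt(2/n_units[l-1]))
        parameters['b'+str(l)] = np.zeros((n_units[l], 1))
    return parameters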

There are several ways to optimize the model. Methods that we have already seen are stochastic gradient descent and mini-batch gradient descent. Stochastic gradient descent updates the parameters using one sample at a time, while mini-batch gradient descent uses a group of several samples, called a mini-batch, at a time. We can think of stochastic gradient descent as mini-batch gradient descent with a mini-batch size of 1. One full cycle through the training data is called an epoch. Because each step averages over more samples, mini-batch gradient descent is less noisy than stochastic gradient descent. In other words, it takes a more direct route towards the minimum. The mini-batch size is another hyper-parameter that we have to find empirically. However, powers of two such as 64, 128, 256, and 512, which fit well with how computer memory is laid out, are known to work well.
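A minimal sketch of how one epoch's mini-batches could be built, assuming one example per column. The function name `random_mini_batches` and the fixed seed are illustrative choices.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and split them into mini-batches.

    One pass over the returned list is one epoch. The last mini-batch is
    smaller when batch_size does not divide the number of examples.
    """
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [(X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```

Setting `batch_size=1` gives stochastic gradient descent, and setting it to the number of examples gives ordinary (batch) gradient descent.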

By using exponentially weighted averages of the gradients, we can take an even more direct route. An exponentially weighted average is computed by mixing the average so far (vₜ₋₁) with the current value (θₜ): vₜ = β·vₜ₋₁ + (1 − β)·θₜ. β is another hyper-parameter, and 0.9 usually works well. Because the average starts from v₀ = 0, the formula shows that vₜ is far too small for low t and not close to the actual θₜ. We can use bias correction to fix this: after computing vₜ as usual, we divide it by (1 − βᵗ). Overall, the exponentially weighted average gives a smoother curve than the original values. Oscillations during gradient descent make it hard to use large learning rates. To solve this problem, we can use gradient descent with momentum, which updates the parameters with exponentially weighted averages of the dW’s and db’s (VdW and Vdb). Things almost always work better with momentum.
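The effect of bias correction is easiest to see on a constant sequence. The sketch below is a plain NumPy illustration; the function name `ewa` is an assumption.

```python
import numpy as np

def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a 1-D sequence, starting from v_0 = 0."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta**t) removes the bias toward the zero start.
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(averages)
```

For a sequence that is always 5.0, the uncorrected first value is 0.1 × 5.0 = 0.5, while the corrected value is 0.5 / (1 − 0.9) = 5.0.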

In Python:

# initialize the velocity terms on the first iteration
if t == 1:
    momentum = {}
    for l in range(L):
        momentum['VdW'+str(l+1)] = 0
        momentum['Vdb'+str(l+1)] = 0
# update the exponentially weighted averages of the gradients
for l in range(L):
    momentum['VdW'+str(l+1)] = (beta*momentum['VdW'+str(l+1)]
                                + (1-beta)*gradient['dW'+str(l+1)])
    momentum['Vdb'+str(l+1)] = (beta*momentum['Vdb'+str(l+1)]
                                + (1-beta)*gradient['db'+str(l+1)])
# update parameters
for l in range(L):
    parameters['W'+str(l+1)] -= learning_rate * momentum['VdW'+str(l+1)]
    parameters['b'+str(l+1)] -= learning_rate * momentum['Vdb'+str(l+1)]

RMSprop is an optimization method built on a similar idea, except that it takes the exponentially weighted average of the squared gradients (SdW and Sdb) instead of the gradients themselves. To update the parameters, it uses dW/√(SdW) and db/√(Sdb), with a small constant added to the denominator so that we never divide by zero. Another optimization method that uses exponentially weighted averages is Adam. It computes the exponentially weighted average of both the gradients and the squared gradients, so we need two extra hyper-parameters, β₁ and β₂. 0.9 works well for β₁, and 0.999 works well for β₂. With Adam, we update the parameters with VdW/√(SdW) and Vdb/√(Sdb).

In Python:

# RMSprop
if t == 1:
    rms_prop = {}
    for l in range(L):
        rms_prop['SdW'+str(l+1)] = 0
        rms_prop['Sdb'+str(l+1)] = 0
for l in range(L):
    # exponentially weighted averages of the squared gradients
    rms_prop['SdW'+str(l+1)] = (beta2*rms_prop['SdW'+str(l+1)]
                                + (1-beta2)*gradient['dW'+str(l+1)]**2)
    rms_prop['Sdb'+str(l+1)] = (beta2*rms_prop['Sdb'+str(l+1)]
                                + (1-beta2)*gradient['db'+str(l+1)]**2)
    # scale each gradient by the root of its running average
    parameters['W'+str(l+1)] -= (learning_rate * gradient['dW'+str(l+1)]
                                 / np.sqrt(rms_prop['SdW'+str(l+1)]+1e-5))
    parameters['b'+str(l+1)] -= (learning_rate * gradient['db'+str(l+1)]
                                 / np.sqrt(rms_prop['Sdb'+str(l+1)]+1e-5))

# Adam
if t == 1:
    adam = {}
    for l in range(L):
        adam['VdW'+str(l+1)] = 0
        adam['Vdb'+str(l+1)] = 0
        adam['SdW'+str(l+1)] = 0
        adam['Sdb'+str(l+1)] = 0
for l in range(L):
    # momentum-style averages of the gradients (beta1)
    adam['VdW'+str(l+1)] = (beta1*adam['VdW'+str(l+1)]
                            + (1-beta1)*gradient['dW'+str(l+1)])
    adam['Vdb'+str(l+1)] = (beta1*adam['Vdb'+str(l+1)]
                            + (1-beta1)*gradient['db'+str(l+1)])
    # RMSprop-style averages of the squared gradients (beta2)
    adam['SdW'+str(l+1)] = (beta2*adam['SdW'+str(l+1)]
                            + (1-beta2)*gradient['dW'+str(l+1)]**2)
    adam['Sdb'+str(l+1)] = (beta2*adam['Sdb'+str(l+1)]
                            + (1-beta2)*gradient['db'+str(l+1)]**2)
    # update parameters
    parameters['W'+str(l+1)] -= (learning_rate * adam['VdW'+str(l+1)]
                                 / np.sqrt(adam['SdW'+str(l+1)]+1e-5))
    parameters['b'+str(l+1)] -= (learning_rate * adam['Vdb'+str(l+1)]
                                 / np.sqrt(adam['Sdb'+str(l+1)]+1e-5))

The last thing we have to know about optimization is learning rate decay. As its name suggests, we reduce the learning rate as training progresses, so that the steps get smaller as we get closer to the minimum and converging to it becomes easier. We can use several different functions to decay the original learning rate.
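Two commonly used decay functions, sketched under the assumption that the learning rate is updated once per epoch. The function names and the default `decay_rate` and `base` values are illustrative, not prescribed.

```python
def inverse_time_decay(alpha0, epoch, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = alpha0 * base**epoch"""
    return alpha0 * base ** epoch
```

With an initial learning rate of 0.2 and `decay_rate=1.0`, the first function gives 0.2 at epoch 0 and 0.1 at epoch 1. The decay rate itself becomes one more hyper-parameter to tune.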

We have discovered many new hyper-parameters. How do we tune them? First, we should try to find a good learning rate, because it is the most important hyper-parameter. Then, we should work on the number of hidden units, the mini-batch size, and the β’s. How do we find good values? We should choose random points within certain ranges of the hyper-parameters. As we try them out, we should find the region that works well, narrow the ranges to that region, sample it more densely, and keep searching from coarse to fine. There are two search scales we can use: linear scale and log scale. Linear scale samples evenly within the range, while log scale samples evenly across the orders of magnitude. For example, linear scale will sample 0.0001 to 1.0 evenly, so most points land between 0.1 and 1.0, but log scale will put the same number of points in each order of magnitude: 0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1, and 0.1 to 1.0. In practice, even after finding good hyper-parameters for our model, we should re-tune them once in a while.
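The difference between the two scales can be sketched in a few lines of NumPy. The function names `sample_linear` and `sample_log` are assumptions made for illustration.

```python
import numpy as np

def sample_linear(low, high, size, rng):
    """Sample uniformly between low and high."""
    return rng.uniform(low, high, size)

def sample_log(low, high, size, rng):
    """Sample so that every order of magnitude between low and high is equally likely."""
    exponents = rng.uniform(np.log10(low), np.log10(high), size)
    return 10.0 ** exponents
```

Sampling a learning rate with `sample_log(1e-4, 1.0, ...)` puts roughly a quarter of the points in each of the four decades, whereas `sample_linear` puts about 90% of them between 0.1 and 1.0.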

Batch normalization is like applying input normalization at every layer: it normalizes the values going into each layer so that all the parameters train faster. In practice we normalize Z rather than A, which has much the same effect. We first normalize Z into Znorm. Then, we compute Znew = γ*Znorm + β. γ and β are learnable parameters (this β is not the β used in momentum) that allow Znew to have whatever mean and variance suit the activation function best. Because Znorm is centered at 0 with a variance of 1, the values start out close to the sweet spot of the activation function, and γ and β can shift them from there. Batch norm makes the model more robust to covariate shift, which is when the distribution of the inputs changes. Moreover, batch norm has a slight regularization effect: the normalization cancels out the scale of large W’s, and because the mean and variance are estimated on each mini-batch, it adds some noise. The regularization effect gets weaker as the batch size gets larger because there is less noise. If batch norm is not beneficial, the model can always learn γ to be the standard deviation of Z and β to be the mean of Z, which cancels out batch norm and makes Znew equal to Z. During back propagation, we need to compute dγ and dβ to update γ and β. Each layer has its own γ and β, and their shapes are (number of units in the layer, 1) because γ and β have to multiply and add to every single z in the layer. Naturally, dγ and dβ have the same shapes. At test time, we use exponentially weighted averages of the means and variances seen across the mini-batches during training, kept separately for each layer. This is as if we were learning a general mean and variance during training.

In Python:

# batch norm in layer 2; Z2 has shape (n_units, N) for a mini-batch of N examples
N = Z2.shape[1]
Z2_mean = np.mean(Z2, axis=1, keepdims=True)
Z2_var = np.var(Z2, axis=1, keepdims=True)
Z2_norm = (Z2 - Z2_mean) / np.sqrt(Z2_var+1e-8)
Z2_new = gamma2*Z2_norm + beta2

# batch norm backward in layer 2
Z2_mu = Z2 - Z2_mean
std_inv = 1. / np.sqrt(Z2_var + 1e-8)

dZ2_norm = dZ2_new * gamma2
dZ2_var = np.sum(dZ2_norm * Z2_mu, axis=1, keepdims=True) * -.5 * std_inv**3
dZ2_mean = (np.sum(dZ2_norm * -std_inv, axis=1, keepdims=True)
            + dZ2_var * np.mean(-2. * Z2_mu, axis=1, keepdims=True))

dZ2 = ((dZ2_norm * std_inv) + (dZ2_var * 2 * Z2_mu / N)
       + (dZ2_mean / N))
# gamma2 and beta2 have shape (n_units, 1), so their gradients keep that shape
dgamma2 = np.sum(dZ2_new * Z2_norm, axis=1, keepdims=True)
dbeta2 = np.sum(dZ2_new, axis=1, keepdims=True)
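
The test-time behavior described above can be sketched as two small helpers: one that keeps the running statistics of a layer during training and one that uses them at test time. The function names and the `momentum` value of 0.9 are assumptions made for illustration.

```python
import numpy as np

def update_running_stats(running_mean, running_var, Z, momentum=0.9):
    """Fold one mini-batch's statistics into the running averages of a layer.

    Z has shape (n_units, batch_size); the running statistics have shape (n_units, 1).
    """
    batch_mean = np.mean(Z, axis=1, keepdims=True)
    batch_var = np.var(Z, axis=1, keepdims=True)
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var

def batch_norm_inference(Z, gamma, beta, running_mean, running_var, eps=1e-8):
    """Normalize test-time pre-activations with the stored statistics."""
    Z_norm = (Z - running_mean) / np.sqrt(running_var + eps)
    return gamma * Z_norm + beta
```

Because the stored statistics do not depend on the current batch, `batch_norm_inference` also works on a single example, where a batch mean and variance would be meaningless.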

We used a sigmoid activation function in the output layer for our binary classifier. What if we have more than two classes? We use the Softmax function, which is really good for picking the most probable class out of many. As its name suggests, it is a soft version of the max function. Instead of selecting the maximum, it assigns every class a probability between 0 and 1 so that the maximum gets the largest share and the minimum gets the smallest, which makes the Softmax function suitable for multi-class classification. Those probabilities add up to 1. For the loss function, we use cross-entropy loss. For backprop, dL/dZ of the output layer beautifully works out to be A – Y. To implement the Softmax function, there is a naive way that directly follows the mathematical definition. Because exponentials can easily explode beyond float64 capacity, the naive version is numerically unstable. Therefore, we generally implement the stable version of the Softmax function, which subtracts the maximum from the values before putting them through the exponentials.

In Python:

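# Z and A have shape (n_classes, n_examples); Y is one-hot with the same shape
def softmax(Z):
    exps = np.exp(Z)
    return exps / np.sum(exps, axis=0, keepdims=True)

def stable_softmax(Z):
    # subtracting each column's maximum leaves the result unchanged
    # but keeps np.exp from overflowing
    Z = Z - np.max(Z, axis=0, keepdims=True)
    exps = np.exp(Z)
    return exps / np.sum(exps, axis=0, keepdims=True)

def cross_entropy_loss(A, Y):
    # one loss value per example
    loss = np.sum(-Y*np.log(A), axis=0)
    return loss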
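
To see the instability concretely, the sketch below feeds both versions a column of large logits. It redefines the two functions so that it runs on its own, and it assumes one example per column; the name `naive_softmax` is used here only to keep the two versions apart.

```python
import numpy as np

def naive_softmax(Z):
    exps = np.exp(Z)
    return exps / np.sum(exps, axis=0, keepdims=True)

def stable_softmax(Z):
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=0, keepdims=True)

Z = np.array([[1000.0], [1001.0], [1002.0]])  # one example, three classes

# np.exp(1000) overflows to inf, and inf / inf is nan (warnings silenced here).
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(Z)

# The stable version effectively computes the softmax of [-2, -1, 0].
stable = stable_softmax(Z)
```

Here `naive` contains only nan values, while `stable` is a valid probability distribution whose largest entry belongs to the third class.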