Source: Deep Learning on Medium

**Deep learning** is a highly iterative search for the right set of **hyper-parameters**: we repeat a cycle of having an idea, coding it, and running an experiment. To do that well, we need to split the dataset properly into **train**, **dev**, and **test sets**. The **train set** is used to train the model. The **dev** and **test sets** hold data that the model has never seen before and are used to evaluate it. Traditionally, the golden ratio was 6:2:2 = Train:Dev:Test. With **big data**, however, that ratio allocates far more data to the **dev** and **test sets** than evaluation needs. Since **deep learning** is very data hungry, it is wiser to use most of the data for **training** and keep only around 10,000 samples each for the **dev** and **test sets**. It is also okay to have only a **dev set**. If we have both **dev** and **test sets**, it is extremely important that they come from the same distribution. For example, one cannot consist of high-definition images while the other consists of low-definition images.
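
As a rough sketch of such a split, the snippet below shuffles the samples once and then carves out fixed-size **dev** and **test sets**, so both come from the same distribution. The function name `split_dataset`, its parameters, and the one-column-per-sample layout are assumptions made for this example.

```python
import numpy as np

def split_dataset(X, Y, n_dev=10_000, n_test=10_000, seed=0):
    """Shuffle the samples and split them into train, dev, and test sets.

    X has shape (n_features, m) and Y has shape (n_outputs, m),
    with one column per sample (an assumed layout).
    """
    m = X.shape[1]
    if n_dev + n_test >= m:
        raise ValueError("dev and test sets must leave room for training data")
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)                  # one shared shuffle for all three sets
    dev_idx = order[:n_dev]
    test_idx = order[n_dev:n_dev + n_test]
    train_idx = order[n_dev + n_test:]
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))

# Toy usage: 1,000 samples, 100 each for dev and test
X = np.arange(3000, dtype=float).reshape(3, 1000)
Y = np.arange(1000).reshape(1, 1000)
train, dev, test = split_dataset(X, Y, n_dev=100, n_test=100)
```

With a real dataset of a few million samples, the defaults of 10,000 each would leave almost everything for training.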

The reason we need separate **train** and **dev sets** is to analyze the model's **bias** and **variance**. A model with **high bias** is not flexible enough and lacks the complexity to classify properly. On the other hand, a model with **high variance** is too flexible and **overfits** the **training data**: it specializes in the **train set** and does poorly on any data it has not seen before. Therefore, when the **dev error** is high while the **train error** is low, we fix the **variance** problem, and when the **train error** is high and the **dev error** is about the same, we fix the **bias** problem. If the **train error** is already high and the **dev error** is higher still, we have both **bias** and **variance** problems to solve. Eventually, we want both the **train error** and the **dev error** to be low with a very small gap between them: low **bias** and low **variance**. When we have **high bias**, we need to make the network more flexible or simply make it learn more. So for a **bias** problem, we can build a bigger network, which almost never hurts, or train for longer. With **high variance**, we want to prevent **overfitting** by adding more **training data** or applying **regularization** to the model.
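
The decision rule above can be sketched as a small helper. The name `diagnose`, the `target_error` we would be happy with, and the `tolerance` for what counts as a large gap are all hypothetical choices for illustration, not fixed rules.

```python
def diagnose(train_error, dev_error, target_error=0.0, tolerance=0.02):
    """Return the problems suggested by the train and dev errors.

    target_error and tolerance are assumed knobs for this sketch;
    sensible values depend on the task.
    """
    problems = []
    if train_error - target_error > tolerance:
        problems.append("high bias")       # the model underfits the train set
    if dev_error - train_error > tolerance:
        problems.append("high variance")   # the model does not generalize
    return problems or ["looks fine"]

print(diagnose(0.01, 0.11))   # low train error, high dev error
print(diagnose(0.15, 0.16))   # high train error, similar dev error
print(diagnose(0.15, 0.30))   # high train error, even higher dev error
```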

There are several ways to do **regularization**. One of them, **weight decay**, adds a penalty on the size of the **weights** to the **loss function**, so that large **weights** (both negative and positive) are penalized. Because our goal is to minimize the **loss function**, the **weights** are pushed closer to 0, which lets us learn a much simpler mapping function. If the penalty is the sum of the squared **weights**, it is **L2 regularization**; if it is the sum of their absolute values, it is **L1 regularization**. The sum is multiplied by **λ / 2N**, where **N** is the number of training samples and **λ** is a new **hyper-parameter**. It is okay to leave the **biases** out of the penalty because they make up only a small fraction of the parameters.

In Python:

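    # Per-sample cross-entropy loss, averaged over the N training samples
    loss = cross_entropy_loss(A, Y)
    cost = np.mean(loss)

    # L2 regularization: add lambd / (2N) times the sum of squared weights
    for l in range(L):
        W = parameters['W' + str(l + 1)]
        cost += lambd / (2 * N) * np.sum(np.square(W))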
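
To make the difference between the two penalties concrete, here is a small sketch that computes both for a toy weight matrix. The helper names `l1_penalty` and `l2_penalty` are made up for this example, and the `parameters` dictionary follows the layout used in the snippet above.

```python
import numpy as np

def l1_penalty(parameters, L, lambd, N):
    """lambd / (2N) times the sum of the absolute values of all weights."""
    total = 0.0
    for l in range(L):
        total += np.sum(np.abs(parameters['W' + str(l + 1)]))
    return lambd / (2 * N) * total

def l2_penalty(parameters, L, lambd, N):
    """lambd / (2N) times the sum of the squares of all weights."""
    total = 0.0
    for l in range(L):
        total += np.sum(np.square(parameters['W' + str(l + 1)]))
    return lambd / (2 * N) * total

# Toy network with a single 2x2 weight matrix
parameters = {'W1': np.array([[1.0, -2.0], [3.0, -4.0]])}
l1 = l1_penalty(parameters, L=1, lambd=0.1, N=10)   # 0.1 / 20 * (1 + 2 + 3 + 4)
l2 = l2_penalty(parameters, L=1, lambd=0.1, N=10)   # 0.1 / 20 * (1 + 4 + 9 + 16)
```

The L2 penalty grows much faster for the large weights, which is why it punishes them harder than the L1 penalty does.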

Another **regularization** method is **dropout**. With a certain probability, we shut some **neurons** off, forcing the network to learn a much simpler function. It is important to drop a different random set of **neurons** in every iteration. We might want to use a different **dropout** rate for each layer, because **earlier layers**, which are often the larger ones, tend to be more prone to **overfitting** than the **later layers**. So we would normally drop more **neurons** in the **earlier layers** by using a lower **keep_prob**, the fraction of **neurons** to keep. We usually do not use **dropout** for the **input layer**, and we never use it at **test time**. The intuition behind **dropout** is to learn a simpler mapping function, but also to spread out the **weights** so that the network does not rely on just a few features. One downside of **dropout** is that the **loss function** is less well defined, because a different network is evaluated in every iteration.

In Python:

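    # Dropout forward in layer 2 (inverted dropout)
    D2 = np.random.rand(*A2.shape) < keep_prob2   # mask: True where a neuron is kept
    A2 = A2 * D2                                  # shut off the dropped neurons
    A2 /= keep_prob2                              # keep the expected activation unchanged

    # Dropout backward in layer 2: reuse the same mask and scaling
    dA2 = dA2 * D2
    dA2 /= keep_prob2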

Lastly, **early stopping** is worth trying. **Weights** are initialized to small random numbers and grow as training goes on, so training for too long lets them grow large enough to **overfit**. So we monitor both the **train** and **dev error** during training and stop when the **dev error** is at its lowest. However, we mentioned that training for longer reduces **bias**, which means that cutting training short affects **bias** as well. Therefore, **early stopping** is not ideal for **orthogonalization**, which means dealing with **bias** and **variance** separately.
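
One common way to decide when to stop is to wait a few epochs for the **dev error** to improve before giving up. The sketch below assumes we already have one **dev error** per epoch; the function name `early_stopping_epoch` and the `patience` parameter are illustrative choices.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return (best_epoch, best_error), stopping the scan once the dev error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error   # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                   # no recent improvement: stop
    return best_epoch, best_error

# Dev error falls, then rises as the model starts to overfit
dev_errors = [0.40, 0.30, 0.25, 0.24, 0.26, 0.27, 0.29, 0.20]
best_epoch, best_error = early_stopping_epoch(dev_errors)
```

In a real training loop we would also save a copy of the **weights** at the best epoch and restore it after stopping. Note that the last value in the toy list is never reached, which is the price of stopping early.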

Collecting more data can be very difficult and expensive. **Data augmentation** is a technique where we create new data by applying **flipping**, **random cropping**, and similar transformations that modify the original data only slightly, so that its meaning does not change.
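
Both transformations can be sketched with plain NumPy slicing. The helper names and the (height, width, channels) image layout are assumptions for this example.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an image of shape (height, width, channels) left to right."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of the image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # stand-in for a real 32x32 RGB image
flipped = horizontal_flip(image)
cropped = random_crop(image, 28, 28, rng)
```

Whether a transformation keeps the meaning depends on the data: flipping a cat photo is fine, while flipping a handwritten digit or a road sign may not be.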

**Normalizing inputs** is always a good idea because it speeds up learning. **Normalization** centers the inputs at the origin with a **variance** of **1** in every direction. To achieve this, we first subtract the **mean** (**µ**) of **X** from **X**, then divide the result by the standard deviation, the square root of the **variance** (**σ²**). This gives a rounder **cost function** that is easier to optimize. **Normalization** becomes more important when the input features are on very different scales. For example, with **x₁** ranging from **-1.0** to **1.0** and **x₂** ranging from **-100** to **100**, we really want to **normalize**.

In Python:

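    # X has shape (n_features, n_samples); keepdims lets the statistics broadcast
    X_mean = np.mean(X, axis=1, keepdims=True)
    X_var = np.mean((X - X_mean) ** 2, axis=1, keepdims=True)
    X_norm = (X - X_mean) / np.sqrt(X_var + 1e-8)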

**Vanishing gradients** have been a problem for **deeper networks**. If all the **weights** are smaller than **1** in magnitude, the **gradients** flowing back during **back propagation** get smaller and smaller as they pass through the **layers**. One way to alleviate this problem is to initialize the weights properly. Just as we multiplied by **0.01** to initialize the **weights** to small random numbers for **sigmoid functions**, we set the **variance** of the **weights** to **2 / (number of inputs)** for **ReLU activation**. This method is called **He initialization**.

In Python:

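    def he_initialization(n_units):
        # n_units lists the layer sizes, starting with the input layer
        parameters = {}
        for l in range(1, len(n_units)):
            # Standard normal samples scaled so that Var(W) = 2 / (number of inputs)
            parameters['W' + str(l)] = np.random.randn(
                n_units[l], n_units[l - 1]) * np.sqrt(2 / n_units[l - 1])
            parameters['b' + str(l)] = np.zeros((n_units[l], 1))
        return parameters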

There are several ways to **optimize** the model. Methods that we have already seen are **stochastic gradient descent** and **mini-batch gradient descent**. **Stochastic gradient descent** updates the **parameters** using **one sample** at a time, and **mini-batch gradient descent** uses a group of **multiple samples**, called a **mini-batch**, at a time. We can think of **stochastic gradient descent** as **mini-batch gradient descent** with a **mini-batch size** of 1. One full pass through the training data is called an **epoch**. Because each update averages over more samples, **mini-batch gradient descent** is less noisy than **stochastic gradient descent**. In other words, **mini-batch gradient descent** takes a more direct route towards the **minimum** than **stochastic gradient descent**. **Mini-batch size** is another **hyper-parameter** that we have to figure out empirically. However, powers of two such as **64**, **128**, **256**, and **512**, which fit computer memory layouts well, are known to work well as the **mini-batch size**.
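
A minimal sketch of cutting one **epoch** into **mini-batches** is shown below. The function name `random_mini_batches` and the one-column-per-sample layout are assumptions for this example.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns (samples) of X and Y and cut them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)                    # same shuffle for X and Y
    X_shuffled, Y_shuffled = X[:, order], Y[:, order]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size                  # the last batch may be smaller
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

X = np.random.randn(5, 200)
Y = np.random.randn(1, 200)
batches = random_mini_batches(X, Y, batch_size=64)
sizes = [X_batch.shape[1] for X_batch, _ in batches]   # three full batches and a remainder
```

Setting `batch_size=1` gives **stochastic gradient descent**, and setting it to the number of samples gives plain batch gradient descent.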

By using **exponentially weighted averages** of the **gradients**, we can take an even more direct route. An **exponentially weighted average** is computed by mixing the average so far (**vₜ₋₁**) with the **current value** (**θₜ**): **vₜ = βvₜ₋₁ + (1-β)θₜ**. **β** is another **hyper-parameter**, and **0.9** usually works well. Because **v₀** starts at 0, the formula gives a **vₜ** that is well below **θₜ** when **t** is small. We can use **bias correction** to fix this: after computing **vₜ** as usual, we divide it by **(1-βᵗ)**. Overall, the **exponentially weighted average** gives a smoother curve than the original values. Oscillations during **gradient descent** make it hard for us to use large **learning rates**. To solve this problem, we can use **gradient descent with momentum**, which uses the **exponentially weighted averages** of the **dW**'s and **db**'s (**VdW** and **Vdb**) to update the **parameters**. Things almost always work better with **momentum**.

In Python:

    # Initialize the moving averages on the first iteration
    if t == 1:
        momentum = {}
        for l in range(L):
            momentum['VdW' + str(l + 1)] = 0
            momentum['Vdb' + str(l + 1)] = 0

    # Update the exponentially weighted averages of the gradients
    for l in range(L):
        momentum['VdW' + str(l + 1)] = (beta * momentum['VdW' + str(l + 1)]
                                        + (1 - beta) * gradient['dW' + str(l + 1)])
        momentum['Vdb' + str(l + 1)] = (beta * momentum['Vdb' + str(l + 1)]
                                        + (1 - beta) * gradient['db' + str(l + 1)])

    # Update parameters
    for l in range(L):
        parameters['W' + str(l + 1)] -= learning_rate * momentum['VdW' + str(l + 1)]
        parameters['b' + str(l + 1)] -= learning_rate * momentum['Vdb' + str(l + 1)]
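
The momentum snippet above leaves out **bias correction**. To see what the correction does, the sketch below runs an **exponentially weighted average** over a constant signal, with and without dividing by **(1-βᵗ)**. The function name `ewa` is made up for this example.

```python
def ewa(values, beta=0.9, bias_correction=False):
    """Exponentially weighted average of a sequence, starting from v = 0."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Early averages are pulled towards the initial 0; dividing by
        # (1 - beta**t) undoes that pull.
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages

values = [10.0] * 5
raw = ewa(values)                               # starts far below 10 and creeps up
corrected = ewa(values, bias_correction=True)   # stays at about 10 from the start
```

As **t** grows, **βᵗ** shrinks towards 0 and the correction stops mattering, which is why it is often skipped for long training runs.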

**RMSProp** is an **optimization** method built on a similar idea, except that it takes the **exponentially weighted average** of the **squared gradients** (**SdW** and **Sdb**) instead of the **gradients** themselves. To update the **parameters**, it uses **dW/√(SdW)** and **db/√(Sdb)**. Another **optimization method** that uses **exponentially weighted averages** is **Adam**. It computes an **exponentially weighted average** of both the **gradients** and the **squared gradients**, so we need two hyper-parameters, **β₁** and **β₂**. **0.9** works well for **β₁**, and **0.999** works well for **β₂**. With **Adam optimization**, we update the **parameters** with **VdW/√(SdW)** and **Vdb/√(Sdb)**.

In Python:

    # RMSProp: initialize the moving averages of the squared gradients
    if t == 1:
        rms_prop = {}
        for l in range(L):
            rms_prop['SdW' + str(l + 1)] = 0
            rms_prop['Sdb' + str(l + 1)] = 0

    for l in range(L):
        rms_prop['SdW' + str(l + 1)] = (beta2 * rms_prop['SdW' + str(l + 1)]
                                        + (1 - beta2) * gradient['dW' + str(l + 1)]**2)
        rms_prop['Sdb' + str(l + 1)] = (beta2 * rms_prop['Sdb' + str(l + 1)]
                                        + (1 - beta2) * gradient['db' + str(l + 1)]**2)
        # Divide each gradient by the root of its running squared average;
        # 1e-5 guards against division by zero
        parameters['W' + str(l + 1)] -= (learning_rate * gradient['dW' + str(l + 1)]
                                         / np.sqrt(rms_prop['SdW' + str(l + 1)] + 1e-5))
        parameters['b' + str(l + 1)] -= (learning_rate * gradient['db' + str(l + 1)]
                                         / np.sqrt(rms_prop['Sdb' + str(l + 1)] + 1e-5))

    # Adam: initialize the moving averages of the gradients and squared gradients
    if t == 1:
        adam = {}
        for l in range(L):
            adam['VdW' + str(l + 1)] = 0
            adam['Vdb' + str(l + 1)] = 0
            adam['SdW' + str(l + 1)] = 0
            adam['Sdb' + str(l + 1)] = 0

    for l in range(L):
        # Momentum-style averages of the gradients (beta1)
        adam['VdW' + str(l + 1)] = (beta1 * adam['VdW' + str(l + 1)]
                                    + (1 - beta1) * gradient['dW' + str(l + 1)])
        adam['Vdb' + str(l + 1)] = (beta1 * adam['Vdb' + str(l + 1)]
                                    + (1 - beta1) * gradient['db' + str(l + 1)])
        # RMSProp-style averages of the squared gradients (beta2)
        adam['SdW' + str(l + 1)] = (beta2 * adam['SdW' + str(l + 1)]
                                    + (1 - beta2) * gradient['dW' + str(l + 1)]**2)
        adam['Sdb' + str(l + 1)] = (beta2 * adam['Sdb' + str(l + 1)]
                                    + (1 - beta2) * gradient['db' + str(l + 1)]**2)
        # Update parameters
        parameters['W' + str(l + 1)] -= (learning_rate * adam['VdW' + str(l + 1)]
                                         / np.sqrt(adam['SdW' + str(l + 1)] + 1e-5))
        parameters['b' + str(l + 1)] -= (learning_rate * adam['Vdb' + str(l + 1)]
                                         / np.sqrt(adam['Sdb' + str(l + 1)] + 1e-5))

The last thing we have to know about **optimization** is **learning rate decay**. As its name suggests, we reduce the **learning rate** as we get closer to the **minimum** so that it becomes easier to converge to it. We can use several different functions to decay the original **learning rate**.
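
Three decay functions that are often used are sketched below. The function names and default constants are illustrative assumptions; the right schedule and constants have to be found empirically.

```python
def inverse_time_decay(lr0, epoch, decay_rate=1.0):
    """Learning rate shrinks as 1 / (1 + decay_rate * epoch)."""
    return lr0 / (1 + decay_rate * epoch)

def exponential_decay(lr0, epoch, base=0.95):
    """Learning rate is multiplied by `base` after every epoch."""
    return lr0 * base ** epoch

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Learning rate is multiplied by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

# Compare the three schedules for a few epochs
for epoch in (0, 1, 10, 25):
    print(epoch,
          inverse_time_decay(0.2, epoch),
          exponential_decay(0.2, epoch),
          step_decay(0.2, epoch))
```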

We have discovered many new **hyper-parameters**. How do we tune them? First, we should try to find a good **learning rate**, because it is the most important **hyper-parameter**. Then, we should work on the **number of hidden units**, the **mini-batch size**, and the **β**'s. How do we find good values for the **hyper-parameters**? We should sample random points within certain ranges of the **hyper-parameters**. As we try them out, we find the region that works well, narrow the ranges around it, and sample more densely there, searching from coarse to fine. There are two search scales we can use: **linear scale** and **log scale**. **Linear scale** samples evenly within the range, while **log scale** samples evenly across the orders of magnitude. For example, **linear scale** samples 0.0001 to 1.0 evenly, so about 90% of the samples land between 0.1 and 1.0, but **log scale** samples each order of magnitude equally often: 0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1, and 0.1 to 1.0. In practice, even after finding good **hyper-parameters** for our model, we should re-tune them once in a while.
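
The difference between the two scales is easy to see by sampling. In this sketch, `sample_linear` and `sample_log` are hypothetical helper names; the log-scale version samples the exponent uniformly and then raises 10 to that power.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear(low, high, size):
    """Sample evenly between low and high."""
    return rng.uniform(low, high, size)

def sample_log(low, high, size):
    """Sample evenly across the orders of magnitude between low and high."""
    exponents = rng.uniform(np.log10(low), np.log10(high), size)
    return 10.0 ** exponents

linear = sample_linear(1e-4, 1.0, 10_000)
log = sample_log(1e-4, 1.0, 10_000)

# Fraction of samples that fall below 0.1
frac_linear = np.mean(linear < 0.1)   # roughly 0.10
frac_log = np.mean(log < 0.1)         # roughly 0.75 (three of the four decades)
```

For a **learning rate**, where 0.0001 and 0.001 can behave very differently, the log scale explores the small values far better.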

**Batch normalization** is like having **input normalization** for every **layer**. It normalizes the activations so that every **parameter** **trains** faster. In practice, we normalize **Z** rather than **A**, which has much the same effect. We first normalize **Z** into **Z norm**. Then, we compute **Z new = 𝛄 Z norm + β**. **𝛄** and **β** are **learnable parameters** that allow **Z new** to have a different **mean** and **variance** that are more suitable for the **activation functions**. Because **Z norm** is centered at 0 with **variance** 1, and **𝛄** and **β** can rescale and shift it from there, **Z new** is very likely to hit the sweet spot of the **activation functions**. **Batch norm** makes the model more robust to **covariate shift**, when the distribution of the input changes. Moreover, **batch norm** has a slight **regularization** effect because it can cancel out the effect of large **W**'s and adds some noise. Its **regularization** effect gets weaker as the **batch size** gets larger, because a larger batch means less noise. If **batch norm** is not beneficial, the model can always learn **𝛄** to be the standard deviation of **Z** and **β** to be the **mean of Z**, which cancels out **batch norm** and makes **Z new** equal to **Z**. During **back propagation**, we need to compute **d𝛄** and **dβ** to update **𝛄** and **β**. Each layer has its own **𝛄** and **β**, and their shapes are **(number of units in the layer, 1)** because **𝛄** and **β** have to multiply and add to every single **z** from the **layer**. Naturally, **d𝛄** and **dβ** have the same shapes. At **test time**, we use, for each **layer**, the **exponentially weighted averages** of the **means** and **variances** seen across the **mini-batches** during **training**. This is as if we were learning a general **mean** and **variance** during training.

In Python:

    # Batch norm forward in layer 2; Z2 has shape (n_units, N)
    Z2_mean = np.mean(Z2, axis=1, keepdims=True)
    Z2_var = np.var(Z2, axis=1, keepdims=True)
    Z2_norm = (Z2 - Z2_mean) / np.sqrt(Z2_var + 1e-8)
    Z2_new = gamma2 * Z2_norm + beta2

    # Batch norm backward in layer 2
    N = Z2.shape[1]
    Z2_mu = Z2 - Z2_mean
    std_inv = 1. / np.sqrt(Z2_var + 1e-8)
    dZ2_norm = dZ2_new * gamma2
    dZ2_var = np.sum(dZ2_norm * Z2_mu, axis=1, keepdims=True) * -.5 * std_inv**3
    dZ2_mean = (np.sum(dZ2_norm * -std_inv, axis=1, keepdims=True)
                + dZ2_var * np.mean(-2. * Z2_mu, axis=1, keepdims=True))
    dZ2 = (dZ2_norm * std_inv) + (dZ2_var * 2 * Z2_mu / N) + (dZ2_mean / N)
    dgamma2 = np.sum(dZ2_new * Z2_norm, axis=1, keepdims=True)
    dbeta2 = np.sum(dZ2_new, axis=1, keepdims=True)
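
The snippet above covers training only. For **test time**, one possible sketch keeps running averages of the **mini-batch** statistics during training and uses them in place of the batch statistics afterwards. The function names and the `momentum` parameter (the **β** of the exponentially weighted average) are assumptions for this example.

```python
import numpy as np

def update_running_stats(running_mean, running_var, Z, momentum=0.9):
    """Blend the statistics of the current mini-batch into the running averages."""
    batch_mean = np.mean(Z, axis=1, keepdims=True)
    batch_var = np.var(Z, axis=1, keepdims=True)
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var

def batch_norm_test(Z, gamma, beta, running_mean, running_var, eps=1e-8):
    """Batch norm at test time: normalize with the running statistics."""
    Z_norm = (Z - running_mean) / np.sqrt(running_var + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
n_units = 4
running_mean = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# Simulated training: mini-batches of 64 with mean 3 and standard deviation 2
for _ in range(200):
    Z_batch = 3.0 + 2.0 * rng.standard_normal((n_units, 64))
    running_mean, running_var = update_running_stats(running_mean, running_var, Z_batch)

# At test time even a single example can be normalized
gamma = np.ones((n_units, 1))
beta = np.zeros((n_units, 1))
Z_test = 3.0 + 2.0 * rng.standard_normal((n_units, 1))
Z_test_new = batch_norm_test(Z_test, gamma, beta, running_mean, running_var)
```

The running averages are needed because a single test example has no meaningful batch **mean** or **variance** of its own.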

We used the **sigmoid activation function** in the **output layer** of our **binary classifier**. What if we have more than two **classes**? We use the **Softmax function**, which is really good at picking the most probable class out of many. As its name suggests, it is a soft version of the **max function**. Instead of selecting the maximum, it assigns every class a probability between 0 and 1 so that the maximum gets the largest share and the minimum gets the smallest, which makes the **Softmax function** suitable for **multi-class classification**. Those probabilities add up to 1. For the **loss function**, we use **cross-entropy loss**. For **backprop**, **dL/dZ** of the output layer beautifully works out to be **A – Y**. To implement the **Softmax function**, there is a naive way that directly follows the mathematical definition. Because exponentials can easily overflow float64, the naive version is numerically unstable. Therefore, we generally implement a stable version of the **Softmax function** that subtracts the maximum from the values before putting them through the exponentials.

In Python:

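    # Z has shape (n_samples, n_classes): one row per sample
    def softmax(Z):
        # Naive version: np.exp overflows for large entries of Z
        exps = np.exp(Z)
        return exps / np.sum(exps, axis=1, keepdims=True)

    def stable_softmax(Z):
        # Subtracting each row's maximum leaves the result unchanged
        # but keeps every exponent at or below 0
        Z = Z - np.max(Z, axis=1, keepdims=True)
        exps = np.exp(Z)
        return exps / np.sum(exps, axis=1, keepdims=True)

    def cross_entropy_loss(A, Y):
        # Y is one-hot; returns one loss value per sample
        loss = np.sum(-Y * np.log(A), axis=1)
        return loss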
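
As a sanity check on the two claims above, the sketch below feeds very large values through a stable Softmax and compares the **A – Y** gradient with a finite-difference estimate. It assumes one row per sample and one-hot labels; the helper `cross_entropy` and the test values are made up for this example.

```python
import numpy as np

def stable_softmax(Z):
    """Softmax over each row of Z, where Z has shape (n_samples, n_classes)."""
    shifted = Z - np.max(Z, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(Z, Y):
    """Total cross-entropy loss of softmax(Z) against one-hot labels Y."""
    return -np.sum(Y * np.log(stable_softmax(Z)))

# np.exp(1000.0) overflows float64, but the shifted version stays finite
Z_big = np.array([[1000.0, 1001.0, 1002.0]])
A_big = stable_softmax(Z_big)

# Compare the analytic gradient A - Y with a central finite difference
Z = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, -1.0]])
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
analytic = stable_softmax(Z) - Y
numeric = np.zeros_like(Z)
h = 1e-6
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Z_plus, Z_minus = Z.copy(), Z.copy()
        Z_plus[i, j] += h
        Z_minus[i, j] -= h
        numeric[i, j] = (cross_entropy(Z_plus, Y) - cross_entropy(Z_minus, Y)) / (2 * h)
```

The two gradients should agree to several decimal places, which is a handy check whenever a **backprop** formula is derived by hand.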