Source: Deep Learning on Medium

# Understanding Convolutional Neural Networks - Part II

This article is a continuation of Part I. If you haven’t read Part I, I would strongly advise you to read it first.

In Part II, we will build the network described in Figure 1.1.

Let’s define some terms which we will use in this article:

- f = filter size
- n_filters = number of filters
- p = padding
- s = stride
- m = number of training examples

**Data**

We will be using a digit-signs dataset which contains images of shape 64 x 64 x 3. The training set consists of 1080 images and the test set of 120 images. There are 6 classes, corresponding to the digit signs 0, 1, 2, 3, 4 and 5.
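Later sections call a `convert_to_one_hot` helper to prepare the labels for the softmax loss; the article does not show its body, so here is a minimal sketch of what such a helper could look like (the implementation details are an assumption):

```python
import numpy as np

def convert_to_one_hot(labels, n_classes):
    # labels: numpy array of shape (1, m) holding class indices 0..n_classes-1
    # returns: one-hot matrix of shape (m, n_classes)
    return np.eye(n_classes)[labels.reshape(-1)]

labels = np.array([[0, 3, 5]])        # hypothetical labels for three images
one_hot = convert_to_one_hot(labels, 6)
print(one_hot.shape)  # (3, 6)
```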

We will use the concepts covered in Part I to build a convolutional neural network from scratch.

**Convolutional Neural Network**

The network described in Figure 1.1 consists of a *convolution layer*, *relu layer*, *max pool layer*, *flatten layer* and *dense layer*, followed by a *softmax activation function*, since we have more than one class. We use the *Adam optimizer* for training our network. In all layers, weights are initialised using He initialisation [He et al. 2015], as the network is ReLU activated.

**Forward Propagation**

**Convolution Layer**

The hyperparameters of the convolution layer are:

- *p* is 2
- *s* is 2
- *f* is 3 x 3
- *n_filters* is 10

Forward propagation in the convolution layer consists of three steps:

- Pad the image with zeros by amount *p*.

```python
def zero_pad(self, X, pad):
    """
    Pad with zeros all images of the dataset X.
    Zeros are added around the border of each image.

    Parameters:
    X -- images -- numpy array of shape (m, n_H, n_W, n_C)
    pad -- padding amount -- int

    Returns:
    X_pad -- images padded with zeros around height and width -- numpy array of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), 'constant')
    return X_pad
```

- Get the image window based on stride *s*.

```python
def get_corners(self, height, width, filter_size, stride):
    """
    Get the corners of the current window relative to the stride.

    Parameters:
    height -- row index of the output position -- int
    width -- column index of the output position -- int
    filter_size -- size of filter -- int
    stride -- amount by which the filter shifts -- int

    Returns:
    vert_start -- a scalar value, top edge of the window.
    vert_end -- a scalar value, bottom edge of the window.
    horiz_start -- a scalar value, left edge of the window.
    horiz_end -- a scalar value, right edge of the window.
    """
    vert_start = height * stride
    vert_end = vert_start + filter_size
    horiz_start = width * stride
    horiz_end = horiz_start + filter_size
    return vert_start, vert_end, horiz_start, horiz_end
```

- Apply the convolution operation: an element-wise product of the image window with the filter of size *f*.

```python
def convolve(self, image_slice, W, b):
    """
    Apply a filter defined by W on a single slice of an image.

    Parameters:
    image_slice -- slice of input data -- numpy array of shape (f, f, n_C_prev)
    W -- weight parameters contained in a window -- numpy array of shape (f, f, n_C_prev)
    b -- bias parameter contained in a window -- numpy array of shape (1, 1, 1)

    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on image_slice
    """
    s = np.multiply(image_slice, W)
    z = np.sum(s)
    Z = z + float(b)
    return Z
```

Bringing it all together, forward propagation in the convolution layer over the complete training data looks as follows.

```python
def forward(self, A_prev):
    """
    Forward propagation for convolution.

    Takes the activations of the previous layer and convolves them
    with a filter defined by W with bias b.

    Parameters:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- convolution output, numpy array of shape (m, n_H, n_W, n_C)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    filter_size, filter_size, n_C_prev, n_C = self.params[0].shape
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.zeros((m, self.n_H, self.n_W, self.n_C))
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)
    for i in range(m):                 # loop over training examples
        a_prev_pad = A_prev_pad[i, :, :, :]
        for h in range(self.n_H):      # loop over output height
            for w in range(self.n_W):  # loop over output width
                for c in range(n_C):   # loop over filters
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    Z[i, h, w, c] = self.convolve(
                        a_slice_prev, self.params[0][:, :, :, c], self.params[1][:, :, :, c])
    assert (Z.shape == (m, self.n_H, self.n_W, self.n_C))
    return Z
```

Output of the convolution layer would be of shape (m, 33, 33, 10). The general formula for the height and width of the convolution output is `floor((n + 2p - f) / s) + 1`, where n is the input dimension. This is useful for checking matrix shapes during implementation. The output shape then becomes (m, height, width, number of channels), where the number of channels equals n_filters.
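As a quick sanity check, the formula can be evaluated in a few lines for this layer's hyperparameters (an illustrative helper, not part of the network code):

```python
import math

def conv_output_dim(n, f, p, s):
    # floor((n + 2p - f) / s) + 1
    return math.floor((n + 2 * p - f) / s) + 1

# 64 x 64 input with f = 3, p = 2, s = 2
print(conv_output_dim(64, f=3, p=2, s=2))  # 33
```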

**Relu layer**

Forward propagation in ReLU:

```python
def forward(self, Z):
    """
    Forward propagation of relu layer.

    Parameters:
    Z -- input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    A -- activations of relu layer -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    """
    self.Z = Z
    A = np.maximum(0, Z)  # element-wise
    return A
```

Input to the relu layer is the output of the convolution layer. Since relu does not change matrix dimensions, the output shape remains the same.

## Maxpool Layer

The hyperparameters of the max pool layer are a filter size *f* of 2 x 2 and a stride *s* of 1 (values consistent with the 32 x 32 output shape below).

Forward propagation in the maxpool layer consists of two steps:

- Get the input window based on stride *s*.

```python
def get_corners(self, height, width, filter_size, stride):
    """
    Get the corners of the current window relative to the stride.

    Parameters:
    height -- row index of the output position -- int
    width -- column index of the output position -- int
    filter_size -- size of filter -- int
    stride -- amount by which the filter shifts -- int

    Returns:
    vert_start -- a scalar value, top edge of the window.
    vert_end -- a scalar value, bottom edge of the window.
    horiz_start -- a scalar value, left edge of the window.
    horiz_end -- a scalar value, right edge of the window.
    """
    vert_start = height * stride
    vert_end = vert_start + filter_size
    horiz_start = width * stride
    horiz_end = horiz_start + filter_size
    return vert_start, vert_end, horiz_start, horiz_end
```

- Apply the maxpool operation on each input window, over the complete training data.

```python
def forward(self, A_prev):
    """
    Forward propagation of the pooling layer.

    Parameters:
    A_prev -- input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- output of the pool layer, a numpy array of shape (m, n_H, n_W, n_C)
    """
    self.A_prev = A_prev
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.empty((m, self.n_H, self.n_W, n_C_prev))
    for i in range(m):
        a_prev = self.A_prev[i]
        for h in range(self.n_H):
            for w in range(self.n_W):
                for c in range(self.n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev[
                        vert_start:vert_end, horiz_start:horiz_end, c]
                    Z[i, h, w, c] = np.max(a_slice_prev)
    assert(Z.shape == (m, self.n_H, self.n_W, n_C_prev))
    return Z
```

Output of the maxpool layer would be of shape (m, 32, 32, 10). The general formula for the height and width of the maxpool output, shown in Figure 1.6, is `floor((n - f) / s) + 1`. The output shape then becomes (m, height, width, number of channels), where the number of channels equals the last axis of the input.
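The pooling dimensions can be sanity-checked the same way; a 2 x 2 filter with stride 1 (an assumption consistent with the stated output shape) maps the 33 x 33 conv output to 32 x 32:

```python
def pool_output_dim(n, f, s):
    # floor((n - f) / s) + 1
    return (n - f) // s + 1

print(pool_output_dim(33, f=2, s=1))  # 32
```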

## Flatten layer

Forward propagation in the flatten layer:

```python
def forward(self, A_prev):
    """
    Forward propagation of flatten layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- flattened numpy array of shape (m, n_H_prev * n_W_prev * n_C_prev)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    output = np.prod(self.A_prev.shape[1:])
    m = self.A_prev.shape[0]
    self.out_shape = (self.A_prev.shape[0], -1)
    Z = self.A_prev.ravel().reshape(self.out_shape)
    assert (Z.shape == (m, output))
    return Z
```

Output shape is (m, 10240).

## Dense Layer

This is a fully connected neural network layer.

```python
def forward(self, A_prev):
    """
    Forward propagation of Dense layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, input_dim)

    Returns:
    Z -- numpy array of shape (m, output_dim)
    """
    np.random.seed(self.seed)
    m = A_prev.shape[0]
    self.A_prev = A_prev
    Z = np.dot(self.A_prev, self.params[0]) + self.params[1]
    assert (Z.shape == (m, self.output_dim))
    return Z
```

This layer is typically used right before the loss function. Since the number of classes is 6, the output dimension of this layer is 6.

**Softmax Loss**

Since it is a multi-class classification problem, we use the softmax loss function, also called categorical cross-entropy loss. The softmax activation function generates a probability for each individual class, with all probabilities summing to one, as shown in Figure 1.6.

```python
def softmax(z):
    """
    :param z: output of previous layer, of shape (m, 6)
    :return: probabilities of shape (m, 6)
    """
    # subtract the row-wise max for numerical stability
    z = z - np.expand_dims(np.max(z, axis=1), 1)
    z = np.exp(z)
    ax_sum = np.expand_dims(np.sum(z, axis=1), 1)
    # finally: divide element-wise to normalise each row
    A = z / ax_sum
    return A
```

The softmax function is prone to two numerical issues: **overflow** and **underflow**.

**Overflow**: in case of exploding gradients, the weights (and hence the inputs to softmax) can grow so large that np.exp overflows, which makes the probabilities useless.

**Underflow**: in case of vanishing gradients, the inputs can become so negative that np.exp rounds them to zero, and classes end up sharing the same degenerate probability.

To combat these issues when computing softmax, a common trick is to shift the input vector by *subtracting the maximum element from all elements*. For the input z, this is:

`z = z - np.expand_dims(np.max(z, axis=1), 1)`
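A small illustration (not from the original code) of why the shift matters: without it, np.exp overflows for large inputs and yields nan probabilities, while the shifted version stays finite.

```python
import numpy as np

z = np.array([[1000.0, 1001.0]])

with np.errstate(over='ignore', invalid='ignore'):
    # naive softmax: np.exp(1000) overflows to inf, giving nan
    naive = np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

# shifted softmax: subtract the row-wise max first
shifted = z - np.max(z, axis=1, keepdims=True)   # [[-1., 0.]]
stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)

print(np.isnan(naive).any())  # True
print(np.round(stable, 4))    # [[0.2689 0.7311]]
```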

The loss function over *m* training examples is defined as:

```python
def softmaxloss(x, labels):
    """
    :param x: output of previous layer, of shape (m, 6)
    :param labels: class labels of shape (1, m)
    :return: loss (scalar) and gradient of shape (m, 6)
    """
    one_hot_labels = convert_to_one_hot(labels, 6)
    predictions = softmax(x)
    # clip so that np.log never receives an exact zero
    epsilon = 1e-12
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    loss = -np.sum(one_hot_labels * np.log(predictions + 1e-9)) / N
    grad = predictions.copy()
    grad[range(N), labels] -= 1
    grad /= N
    return loss, grad
```

In the loss function we clip the softmax predictions with *epsilon*, so that np.log never receives a value of exactly zero, which would produce an infinite loss and numerical instability.

```python
epsilon = 1e-12
predictions = np.clip(predictions, epsilon, 1. - epsilon)
```

**Backward Propagation**

## Softmax Loss

Now we propagate our gradients back to the first layer. First, we compute the derivative of the cross-entropy loss with softmax with respect to the dense layer output (the input to softmax).

```python
grad = predictions.copy()
grad[range(N), labels] -= 1
grad /= N
```

## Dense Layer

In the Dense layer, we receive as input the gradient of the cross-entropy loss with respect to the dense layer output. We then compute dW, db and dA_prev. *Note: we use the terms loss and cost interchangeably*.

```python
def backward(self, dA):
    """
    Backward propagation for Dense layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the Dense layer, same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the Dense layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    m = self.A_prev.shape[0]
    dW = np.dot(self.A_prev.T, dA)
    db = np.sum(dA, axis=0, keepdims=True)
    dA_prev = np.dot(dA, self.params[0].T)
    assert (dA_prev.shape == self.A_prev.shape)
    assert (dW.shape == self.params[0].shape)
    assert (db.shape == self.params[1].shape)
    return dA_prev, [dW, db]
```

## Flatten layer

The flatten layer has no parameters to train, so we do not compute dW and db. To propagate the gradient backward, it just reshapes dA from (m, 10240) back to the shape of A_prev.

```python
def backward(self, dA):
    """
    Backward propagation of flatten layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the flatten layer, same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the flatten layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    dA_prev = dA.reshape(self.A_prev.shape)
    assert (dA_prev.shape == self.A_prev.shape)
    return dA_prev, []
```

## Maxpool Layer

Before doing backpropagation, we create a function which keeps track of where the maximum of the matrix is. True (1) indicates the position of the maximum in the matrix; the other entries are False (0).

```python
def create_mask_from_window(self, image_slice):
    """
    Get a mask from an image_slice to identify the max entry.

    Parameters:
    image_slice -- numpy array of shape (f, f, n_C_prev)

    Returns:
    mask -- array of the same shape as the window, containing True at the position of the max entry of image_slice.
    """
    mask = np.max(image_slice)
    mask = (image_slice == mask)
    return mask
```

We keep track of the maximum value in the matrix because this is the input value that ultimately influenced the output, and therefore the cost. Backprop computes gradients with respect to the cost, so anything that influences the final cost should have a non-zero gradient. Back propagation will therefore “propagate” the gradient back to this particular input value.
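To make the mask concrete, here is a tiny standalone example (illustrative, with a made-up 2 x 2 window) of the same masking logic:

```python
import numpy as np

window = np.array([[1.0, 4.0],
                   [2.0, 3.0]])

mask = (window == np.max(window))  # True only at the maximum entry
print(mask)
# [[False  True]
#  [False False]]

# routing an incoming gradient of 5.0 back through the window:
print(mask * 5.0)
# [[0. 5.]
#  [0. 0.]]
```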

Since maxpool has no parameters, we do not compute dW and db.

```python
def backward(self, dA):
    """
    Backward propagation of the pooling layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the pooling layer, same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the pooling layer, same shape as A_prev
    """
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    m, n_H, n_W, n_C = dA.shape
    dA_prev = np.zeros((m, n_H_prev, n_W_prev, n_C_prev))
    for i in range(m):
        a_prev = self.A_prev[i]
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                    mask = self.create_mask_from_window(a_prev_slice)
                    # route the gradient only to the position of the max entry
                    dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += mask * dA[i, h, w, c]
    assert(dA_prev.shape == self.A_prev.shape)
    return dA_prev, []
```

## Relu Layer

Backward propagation in relu is shown in Figure 2.2:

```python
def backward(self, dA):
    """
    Backward propagation of relu layer.

    f'(x) = 1 if x > 0
            0 otherwise

    Parameters:
    dA -- gradient of cost with respect to the output of the relu layer, same shape as A

    Returns:
    dZ -- gradient of cost with respect to the input of the relu layer, same shape as Z
    """
    Z = self.Z
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    assert (dZ.shape == self.Z.shape)
    return dZ, []
```

## Convolution Layer

In the convolution layer we compute three gradients: dA_prev, dW and db.

In Figure 2.3, *Wc* is a filter and *dZhw* is a scalar corresponding to the gradient of the cost with respect to the output of the convolution layer *Z* at the hth row and wth column (corresponding to the dot product taken at the ith stride left and jth stride down). Note that each time, we multiply the same filter *Wc* by a different *dZ* when updating *dA*. We do so because, during forward propagation, each filter is dotted and summed with a different a_slice; therefore, when computing the backprop for *dA*, we are just adding up the gradients of all the a_slices. The formula in Figure 2.3 translates to the following code in back propagation:

`da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += self.params[0][:, :, :, c] * dZ[i, h, w, c]`

*a_slice_prev* corresponds to the slice which was used to generate the activation Zij. Hence, this ends up giving us the gradient for W with respect to that slice. Since it is the same W, we simply add up all such gradients to get dW.

The formula in Figure 2.4 translates to the following code in backpropagation:

`dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]`

The formula in Figure 2.4 translates to the following code in backpropagation.

`db[:, :, :, c] += dZ[i, h, w, c]`

Bringing it all together, the backward propagation of convolution layer is given below:

```python
def backward(self, dZ):
    """
    Backward propagation for convolution.

    Parameters:
    dZ -- gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv layer (A_prev), numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv layer (W), numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv layer (b), numpy array of shape (1, 1, 1, n_C)
    """
    np.random.seed(self.seed)
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    f, f, n_C_prev, n_C = self.params[0].shape
    m, n_H, n_W, n_C = dZ.shape
    dA_prev = np.zeros(self.A_prev.shape)
    dW = np.zeros(self.params[0].shape)
    db = np.zeros(self.params[1].shape)
    # Pad A_prev and dA_prev
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)
    dA_prev_pad = self.zero_pad(dA_prev, self.pad)
    for i in range(m):
        a_prev_pad = A_prev_pad[i, :, :, :]
        da_prev_pad = dA_prev_pad[i, :, :, :]
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[
                        vert_start:vert_end, horiz_start:horiz_end, :]
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += self.params[0][:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]
                    db[:, :, :, c] += dZ[i, h, w, c]
        # strip the padding to recover the gradient w.r.t. the unpadded input
        dA_prev[i, :, :, :] = da_prev_pad[self.pad:-self.pad, self.pad:-self.pad, :]
    assert(dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))
    return dA_prev, [dW, db]
```

**Gradient checking**

Gradient checking is very useful for verifying your back propagation, i.e. that you have computed the gradients correctly. It uses a two-sided difference to numerically approximate the gradients. We randomly select 2 data points from the training data and run gradient check on them. NOTE: since gradient checking is very slow, do not use it during training.

```python
def grad_check():
    train_set_x, train_set_y, test_set_x, test_set_y, n_class = load_data()
    # randomly select 2 data points from training data
    n = 2
    index = np.random.choice(train_set_x.shape[0], n)
    train_set_x = train_set_x[index]
    train_set_y = train_set_y[:, index]
    cnn = make_model(train_set_x, n_class)
    print(cnn.layers)
    A = cnn.forward(train_set_x)
    loss, dA = softmaxloss(A, train_set_y)
    assert (A.shape == dA.shape)
    grads = cnn.backward(dA)
    grads_values = grads_to_vector(grads)
    initial_params = cnn.params
    parameters_values = params_to_vector(initial_params)  # initial parameters
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    print('number of parameters: ', num_parameters)
    epsilon = 1e-7
    assert (len(grads_values) == len(parameters_values))
    for i in tqdm(range(0, num_parameters)):
        thetaplus = copy.deepcopy(parameters_values)
        thetaplus[i][0] = thetaplus[i][0] + epsilon
        new_param = vector_to_param(thetaplus, initial_params)
        difference = compare(new_param, initial_params)
        # make sure only one parameter is changed
        assert (difference == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_plus[i], _ = softmaxloss(A, train_set_y)
        thetaminus = copy.deepcopy(parameters_values)
        thetaminus[i][0] = thetaminus[i][0] - epsilon
        new_param = vector_to_param(thetaminus, initial_params)
        difference = compare(new_param, initial_params)
        # make sure only one parameter is changed
        assert (difference == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_minus[i], _ = softmaxloss(A, train_set_y)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
    numerator = np.linalg.norm(gradapprox - grads_values)
    denominator = np.linalg.norm(grads_values) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    return difference
```

If your backpropagation works, it outputs a message such as the one in Figure 2.6. If there is some mistake in your back propagation, one approach is to compare individual values of the approximate gradients and the original gradients, check where the difference is large, and inspect the implementation of those gradients. One more thing to note: we can encounter *kinks*, which can be a source of inaccuracy and cause grad check to fail. Kinks are non-differentiable parts of the objective function, introduced by functions such as ReLU (*max(0, x)*). For instance, consider gradient checking at *x = -1e-8*. As you may recall from the ReLU backward propagation in Figure 2.2, since x < 0, it computes a zero gradient. However, with epsilon given as *1e-7*, the two-sided difference evaluates ReLU at *x + epsilon = 9e-8*, which is positive and therefore introduces a non-zero gradient.
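The kink effect is easy to reproduce numerically (an illustrative sketch): at x = -1e-8 the analytic ReLU gradient is 0, but the two-sided difference with epsilon = 1e-7 gives a clearly non-zero value.

```python
def relu(x):
    return max(0.0, x)

x, eps = -1e-8, 1e-7

analytic_grad = 1.0 if x > 0 else 0.0
numeric_grad = (relu(x + eps) - relu(x - eps)) / (2 * eps)

print(analytic_grad)           # 0.0
print(round(numeric_grad, 2))  # 0.45 -- disagrees because of the kink at 0
```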

## Adam Optimizer

The Adam optimizer is used to minimize the loss function. It is known to work well across a variety of problems. The hyperparameters of Adam are: *learning rate*, *beta1*, *beta2* and *epsilon*. The default choice for *beta1* is 0.9 and for *beta2* is 0.999. The choice of *epsilon* does not matter very much, and it is set to 1e-08. Generally, all other hyperparameters are kept at their default values and only the learning rate is tuned. *beta1* controls the exponentially weighted average of the gradients, called the first moment, and *beta2* controls the exponentially weighted average of the squared gradients, called the second moment. Adam has relatively low memory requirements and usually works well with little tuning of hyperparameters other than the learning rate. [Kingma et al. 2014]

```python
class Adam(object):

    def __init__(self, model, X_train, y_train,
                 learning_rate, epoch, minibatch_size, X_test, y_test):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.learning_rate = learning_rate
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-08
        self.epoch = epoch
        self.X_test = X_test
        self.y_test = y_test
        self.num_layer = len(self.model.layers)
        self.minibatch_size = minibatch_size

    def initialize_adam(self):
        VdW, Vdb, SdW, Sdb = [], [], [], []
        for param_layer in self.model.params:
            if len(param_layer) != 2:  # layers which have no parameters
                VdW.append(np.zeros_like([]))
                Vdb.append(np.zeros_like([]))
                SdW.append(np.zeros_like([]))
                Sdb.append(np.zeros_like([]))
            else:
                VdW.append(np.zeros_like(param_layer[0]))
                Vdb.append(np.zeros_like(param_layer[1]))
                SdW.append(np.zeros_like(param_layer[0]))
                Sdb.append(np.zeros_like(param_layer[1]))
        assert len(VdW) == self.num_layer
        assert len(Vdb) == self.num_layer
        assert len(SdW) == self.num_layer
        assert len(Sdb) == self.num_layer
        return VdW, Vdb, SdW, Sdb

    def update_parameters(self, VdW, Vdb, SdW, Sdb, grads, t):
        VdW_corrected = [np.zeros_like(v) for v in VdW]
        Vdb_corrected = [np.zeros_like(v) for v in Vdb]
        SdW_corrected = [np.zeros_like(s) for s in SdW]
        Sdb_corrected = [np.zeros_like(s) for s in Sdb]
        # gradients were collected from output to input; reverse to match layer order
        grads = list(reversed(grads))
        for i in range(len(grads)):
            if len(grads[i]) != 0:  # layer which contains weights and biases
                # Moving average of the gradients (Momentum)
                VdW[i] = self.beta1 * VdW[i] + (1 - self.beta1) * grads[i][0]
                Vdb[i] = self.beta1 * Vdb[i] + (1 - self.beta1) * grads[i][1]
                # Moving average of the squared gradients (RMSprop)
                SdW[i] = self.beta2 * SdW[i] + (1 - self.beta2) * np.power(grads[i][0], 2)
                Sdb[i] = self.beta2 * Sdb[i] + (1 - self.beta2) * np.power(grads[i][1], 2)
                # Compute bias-corrected first moment estimate
                den = 1 - (self.beta1 ** t)
                VdW_corrected[i] = np.divide(VdW[i], den)
                Vdb_corrected[i] = np.divide(Vdb[i], den)
                # Compute bias-corrected second raw moment estimate
                den = 1 - (self.beta2 ** t)
                SdW_corrected[i] = np.divide(SdW[i], den)
                Sdb_corrected[i] = np.divide(Sdb[i], den)
                # weight update
                den = np.sqrt(SdW_corrected[i]) + self.epsilon
                self.model.params[i][0] = self.model.params[i][0] - self.learning_rate * np.divide(VdW_corrected[i], den)
                # bias update
                den = np.sqrt(Sdb_corrected[i]) + self.epsilon
                self.model.params[i][1] = self.model.params[i][1] - self.learning_rate * np.divide(Vdb_corrected[i], den)

    def minimize(self):
        costs = []
        t = 0
        np.random.seed(1)
        VdW, Vdb, SdW, Sdb = self.initialize_adam()
        for i in tqdm(range(self.epoch)):
            start = time.time()
            epoch_loss = 0
            minibatches = get_minibatches(self.X_train,
                                          self.y_train,
                                          self.minibatch_size)
            for minibatch in tqdm(minibatches):
                # select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                # forward and backward propagation
                loss, grads = self.model.fit(minibatch_X, minibatch_Y)
                epoch_loss += loss
                t = t + 1  # Adam counter
                # weight update
                self.update_parameters(VdW, Vdb, SdW, Sdb, grads, t)
            # print the cost every epoch
            end = time.time()
            epoch_time = end - start
            train_acc = accuracy(self.model.predict(self.X_train), self.y_train)
            val_acc = accuracy(self.model.predict(self.X_test), self.y_test)
            print("Cost after epoch %i: %f" % (i, epoch_loss),
                  'time (s):', epoch_time,
                  'train_acc:', train_acc,
                  'val_acc:', val_acc)
            costs.append(epoch_loss)
        print('total_cost', costs)
        return self.model, costs
```

We have now reached the end of the article, and hopefully you have followed along. If you liked it, don’t forget to give it a thumbs up 🙂

You can add me on LinkedIn: https://www.linkedin.com/in/mustufain-abbas/

The code for the article can be found at: https://github.com/Mustufain/Convolution-Neural-Network-

# References

Andrew Ng’s course on Coursera: https://www.coursera.org/learn/convolutional-neural-networks-tensorflow

He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.” *Proceedings of the IEEE international conference on computer vision*. 2015

Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” *arXiv preprint arXiv:1412.6980*(2014).

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” *European conference on computer vision*. Springer, Cham, 2014.