# Understanding Convolutional Neural Networks - Part II

Source: Deep Learning on Medium

In Part II, we will build the network described in Figure 1.1.

Let’s define some terms which we will be using in this article.

f = filter size

n_filters = number of filters

s = stride

m = # of training examples

# Data

We will be using the digit signs dataset, which contains images of shape 64 x 64 x 3. The training set consists of 1080 images and the test set of 120 images. There are 6 classes, corresponding to the digit signs 0, 1, 2, 3, 4 and 5.
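For reference, the one-hot encoding the loss function will rely on later can be sketched like this (`convert_to_one_hot` here is a minimal stand-in written for illustration; the article's actual helper may differ):

```python
import numpy as np

def convert_to_one_hot(labels, n_classes):
    """Turn a (1, m) row of integer class labels into an (m, n_classes) one-hot matrix."""
    return np.eye(n_classes)[labels.reshape(-1)]

y = np.array([[0, 5, 2]])            # labels for three images
one_hot = convert_to_one_hot(y, 6)   # shape (3, 6), one 1 per row
```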

We will make use of the concepts covered in Part I to build a convolutional neural network from scratch.

# Convolutional Neural Network

The network described in Figure 1.1 consists of a convolution layer, a ReLU layer, a max pool layer, a flatten layer, and a dense layer followed by a softmax activation function, since we have more than one class. We will use the Adam optimizer to train the network. In all layers, weights are initialised using He initialisation [He et al. 2015], as the network is ReLU-activated.

# Forward Propagation

## Convolution Layer

The hyperparameters of the convolution layer are:

• p is 2
• s is 2
• f is 3 x 3
• n_filters is 10

Forward propagation in the convolution layer consists of three steps:

• Pad zeros to image with amount p
```python
def zero_pad(self, X, pad):
    """
    Pad all images of the dataset X with zeros around the border.

    Parameters:
    X -- images -- numpy array of shape (m, n_H, n_W, n_C)
    pad -- padding amount -- int

    Returns:
    X_pad -- images padded with zeros around width and height --
             numpy array of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), 'constant')
    return X_pad
```
• Get image window based on s
```python
def get_corners(self, height, width, filter_size, stride):
    """
    Get the corners of the current window relative to stride.

    Parameters:
    height -- vertical index of the output position -- int
    width -- horizontal index of the output position -- int
    filter_size -- size of the filter -- int
    stride -- amount by which the filter shifts -- int

    Returns:
    vert_start, vert_end -- scalars, vertical bounds of the window
    horiz_start, horiz_end -- scalars, horizontal bounds of the window
    """
    vert_start = height * stride
    vert_end = vert_start + filter_size
    horiz_start = width * stride
    horiz_end = horiz_start + filter_size
    return vert_start, vert_end, horiz_start, horiz_end
```
• Apply the convolution operation: an element-wise product of the image window with the filter.
```python
def convolve(self, image_slice, W, b):
    """
    Apply a filter defined by W on a single slice of an image.

    Parameters:
    image_slice -- slice of input data -- numpy array of shape (f, f, n_C_prev)
    W -- weight parameters of a window -- numpy array of shape (f, f, n_C_prev)
    b -- bias parameter of a window -- numpy array of shape (1, 1, 1)

    Returns:
    Z -- a scalar, the result of convolving the sliding window (W, b)
         on image_slice
    """
    s = np.multiply(image_slice, W)
    Z = np.sum(s) + float(b)
    return Z
```

Bringing it all together, the forward propagation of the convolution layer over the complete training data is the following.

```python
def forward(self, A_prev):
    """
    Forward propagation for convolution. This takes activations from the
    previous layer and convolves them with the filters W with biases b.

    Parameters:
    A_prev -- output activations of the previous layer,
              numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- convolution output, numpy array of shape (m, n_H, n_W, n_C)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    filter_size, filter_size, n_C_prev, n_C = self.W.shape
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.zeros((m, self.n_H, self.n_W, self.n_C))
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)
    for i in range(m):
        a_prev_pad = A_prev_pad[i, :, :, :]
        for h in range(self.n_H):
            for w in range(self.n_W):
                for c in range(n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    # convolve the slice with the c-th filter and its bias
                    Z[i, h, w, c] = self.convolve(
                        a_slice_prev, self.W[:, :, :, c], self.b[:, :, :, c])
    assert (Z.shape == (m, self.n_H, self.n_W, self.n_C))
    return Z
```

Output of the convolution layer is of shape (m, 33, 33, 10). The general formula for the height and width of the convolution output is `floor((n + 2p - f) / s) + 1`, where n is the input dimension. This is useful for checking matrix shapes during implementation. The output shape is then (m, height, width, number of channels), where the number of channels equals n_filters.
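As a quick sanity check of this formula, a small helper (hypothetical, written just for this check) reproduces the 64 → 33 shape used here:

```python
def conv_output_dim(n, f, p, s):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 64x64 input, 3x3 filter, pad 2, stride 2 -> 33
print(conv_output_dim(64, f=3, p=2, s=2))  # 33
```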

## ReLU Layer

Forward propagation in ReLU:

```python
def forward(self, Z):
    """
    Forward propagation of the ReLU layer.

    Parameters:
    Z -- input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    A -- activations of the ReLU layer -- numpy array of the same shape as Z
    """
    self.Z = Z
    A = np.maximum(0, Z)  # element-wise
    return A
```

Input to the ReLU layer is the output of the convolution layer. ReLU does not change matrix dimensions, so the output shape remains (m, 33, 33, 10).

## Maxpool Layer

The hyperparameters of the maxpool layer are:

• f is 2
• s is 1

Forward propagation in the maxpool layer consists of two steps:

• Get the input window based on s, reusing the same `get_corners` method shown in the convolution layer.
• Apply the maxpool operation on each input window, over the complete training data.
```python
def forward(self, A_prev):
    """
    Forward propagation of the pooling layer.

    Parameters:
    A_prev -- input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- output of the pool layer, numpy array of shape (m, n_H, n_W, n_C)
    """
    self.A_prev = A_prev
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.empty((m, self.n_H, self.n_W, n_C_prev))
    for i in range(m):
        a_prev = self.A_prev[i]
        for h in range(self.n_H):
            for w in range(self.n_W):
                for c in range(n_C_prev):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                    Z[i, h, w, c] = np.max(a_slice_prev)
    assert (Z.shape == (m, self.n_H, self.n_W, n_C_prev))
    return Z
```

Output of the maxpool layer is of shape (m, 32, 32, 10). The general formula for the height and width of the maxpool output, shown in Figure 1.6, is `floor((n - f) / s) + 1`. The output shape is then (m, height, width, number of channels), where the number of channels is the last axis of the input.
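Again as a sanity check, the pooling formula (which is the convolution formula with p = 0) can be verified with a tiny illustrative helper:

```python
def pool_output_dim(n, f, s):
    """Output height/width of a pooling layer: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

# 33x33 input, 2x2 window, stride 1 -> 32
print(pool_output_dim(33, f=2, s=1))  # 32
```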

## Flatten layer

Forward propagation in the flatten layer:

```python
def forward(self, A_prev):
    """
    Forward propagation of the flatten layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- flattened numpy array of shape (m, n_H_prev * n_W_prev * n_C_prev)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    output = np.prod(self.A_prev.shape[1:])
    m = self.A_prev.shape[0]
    self.out_shape = (m, -1)
    Z = self.A_prev.ravel().reshape(self.out_shape)
    assert (Z.shape == (m, output))
    return Z
```

Output shape is (m, 10240).
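This shape is easy to verify directly in NumPy (with a dummy batch of m = 4):

```python
import numpy as np

A_prev = np.zeros((4, 32, 32, 10))        # a dummy maxpool output for m = 4
Z = A_prev.reshape(A_prev.shape[0], -1)   # flatten everything except the batch axis
print(Z.shape)  # (4, 10240)
```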

## Dense Layer

This is a fully connected neural network layer.

```python
def forward(self, A_prev):
    """
    Forward propagation of the Dense layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, input_dim)

    Returns:
    Z -- numpy array of shape (m, output_dim)
    """
    np.random.seed(self.seed)
    m = A_prev.shape[0]
    self.A_prev = A_prev
    Z = np.dot(self.A_prev, self.W) + self.b
    assert (Z.shape == (m, self.output_dim))
    return Z
```

The dense layer is typically the last layer before the loss function. Since the number of classes is 6, the output shape of this layer is (m, 6).

## Softmax Loss

Since this is a multi-class classification problem, we use the softmax loss function, also called categorical cross-entropy loss. The softmax activation function generates a probability for each individual class, with all probabilities summing to one, as shown in Figure 1.6.

```python
def softmax(z):
    """
    :param z: output of the previous layer, of shape (m, 6)
    :return: probabilities of shape (m, 6)
    """
    # shift for numerical stability
    z = z - np.expand_dims(np.max(z, axis=1), 1)
    z = np.exp(z)
    ax_sum = np.expand_dims(np.sum(z, axis=1), 1)
    # finally: divide element-wise
    A = z / ax_sum
    return A
```

The softmax function is prone to two issues: overflow and underflow.

Overflow: in case of exploding gradients, weights (and hence logits) can grow very large, and the exponential then overflows, which makes the probabilities useless.

Underflow: in case of vanishing gradients, weights can be close to zero, so the classes end up sharing nearly the same probability.

To combat these issues in the softmax computation, a common trick is to shift the input vector by subtracting its maximum element from all elements. For the input vector z, redefine z such that:

`z = z - np.expand_dims(np.max(z, axis=1), 1)`
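To see why this shift matters, compare softmax on large logits with and without it (a small illustration; `np.errstate` is used to silence the expected overflow warning from the naive version):

```python
import numpy as np

z = np.array([[1000.0, 1001.0, 1002.0]])

# Naive softmax overflows: exp(1000) is inf, so every probability becomes nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

# Shifting by the row max makes the largest exponent exp(0) = 1: no overflow.
shifted = z - np.max(z, axis=1, keepdims=True)   # [[-2., -1., 0.]]
stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
```

The shifted version returns the mathematically identical result, since softmax is invariant to adding a constant to all logits.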

The loss function over m training data is defined as:

```python
def softmaxloss(x, labels):
    """
    :param x: output of the previous layer, of shape (m, 6)
    :param labels: class labels of shape (1, m)
    :return: loss and gradient with respect to x
    """
    one_hot_labels = convert_to_one_hot(labels, 6)
    predictions = softmax(x)
    epsilon = 1e-12
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    loss = -np.sum(one_hot_labels * np.log(predictions)) / N
    grad = predictions.copy()
    grad[np.arange(N), labels.flatten()] -= 1
    grad /= N
    return loss, grad
```

In the loss function we clip the softmax predictions by epsilon to prevent extreme values that could lead to numerical instability.

```python
epsilon = 1e-12
predictions = np.clip(predictions, epsilon, 1. - epsilon)
```

# Backward Propagation

## Softmax Loss

Now we propagate the gradients back to the first layer. First we compute the derivative of the cross-entropy loss with softmax with respect to the dense layer output (the input to softmax).

```python
grad = predictions.copy()
grad[np.arange(N), labels.flatten()] -= 1
grad /= N
```
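A small numeric check of this gradient (predictions minus the one-hot labels, averaged over the batch) for a single example whose true class is 2:

```python
import numpy as np

predictions = np.array([[0.1, 0.2, 0.7]])   # softmax output for one example (m = 1)
labels = np.array([2])                      # true class index
N = predictions.shape[0]

grad = predictions.copy()
grad[np.arange(N), labels] -= 1             # subtract 1 at the true class
grad /= N
# grad is now [[0.1, 0.2, -0.3]]: wrong classes are pushed down,
# the true class is pulled up
```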

## Dense Layer

In the Dense layer, we receive as input the gradient of the cross-entropy loss with respect to the dense layer output. We then compute dW, db and dA_prev. Note that we use the terms loss and cost function interchangeably.

```python
def backward(self, dA):
    """
    Backward propagation for the Dense layer.

    Parameters:
    dA -- gradient of the cost with respect to the output of the Dense
          layer, same shape as Z

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the
               Dense layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    dW = np.dot(self.A_prev.T, dA)
    db = np.sum(dA, axis=0, keepdims=True)
    dA_prev = np.dot(dA, self.W.T)
    assert (dA_prev.shape == self.A_prev.shape)
    assert (dW.shape == self.W.shape)
    assert (db.shape == self.b.shape)
    return dA_prev, [dW, db]
```

## Flatten layer

The flatten layer has no parameters to train, so we do not compute dW and db. To propagate the gradient backward, it simply reshapes dA from (m, 10240) back to the shape of A_prev.

```python
def backward(self, dA):
    """
    Backward propagation of the flatten layer.

    Parameters:
    dA -- gradient of the cost with respect to the output of the flatten
          layer, same shape as Z

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the
               flatten layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    dA_prev = dA.reshape(self.A_prev.shape)
    assert (dA_prev.shape == self.A_prev.shape)
    return dA_prev, []
```

## Maxpool Layer

Before doing backpropagation we create a function which keeps track of where the maximum of the matrix is: True (1) indicates the position of the maximum, and the other entries are False (0).

```python
def create_mask_from_window(self, image_slice):
    """
    Get a mask from image_slice identifying its max entry.

    Parameters:
    image_slice -- numpy array of shape (f, f, n_C_prev)

    Returns:
    mask -- array of the same shape as image_slice, containing True at
            the position of the max entry of image_slice.
    """
    mask = (image_slice == np.max(image_slice))
    return mask
```

We keep track of the maximum value in the matrix because this is the input value that ultimately influenced the output, and therefore the cost. Backprop computes gradients with respect to the cost, so anything that influences the cost should have a non-zero gradient. Back propagation therefore "propagates" the gradient back to this particular input value.
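A quick demo of the mask on a 2x2 window (a standalone version of the method above, for illustration):

```python
import numpy as np

def create_mask_from_window(image_slice):
    """True at the position of the max entry, False elsewhere."""
    return image_slice == np.max(image_slice)

window = np.array([[1.0, 4.0],
                   [3.0, 2.0]])
mask = create_mask_from_window(window)
# [[False  True]
#  [False False]] -- only the 4.0 entry receives gradient
```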

Since maxpool has no parameters, we do not compute dW and db.

```python
def backward(self, dA):
    """
    Backward propagation of the pooling layer.

    Parameters:
    dA -- gradient of the cost with respect to the output of the pooling
          layer, same shape as Z

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the
               pooling layer, same shape as A_prev
    """
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    m, n_H, n_W, n_C = dA.shape
    dA_prev = np.zeros((m, n_H_prev, n_W_prev, n_C_prev))
    for i in range(m):
        a_prev = self.A_prev[i]
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                    mask = self.create_mask_from_window(a_prev_slice)
                    # route the gradient only through the max entry
                    dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += mask * dA[i, h, w, c]
    assert (dA_prev.shape == self.A_prev.shape)
    return dA_prev, []
```

## Relu Layer

Backward propagation in ReLU is shown in Figure 2.2.

```python
def backward(self, dA):
    """
    Backward propagation of the ReLU layer.

    f'(x) = 1 if x > 0, 0 otherwise

    Parameters:
    dA -- gradient of the cost with respect to the output of the ReLU
          layer, same shape as A

    Returns:
    dZ -- gradient of the cost with respect to the input of the ReLU
          layer, same shape as Z
    """
    Z = self.Z
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    assert (dZ.shape == self.Z.shape)
    return dZ, []
```

## Convolution Layer

In the convolution layer we compute three gradients: dA, dW and db.

In Figure 2.3, Wc is a filter and dZhw is a scalar: the gradient of the cost with respect to the output of the convolution layer Z at the h-th row and w-th column (corresponding to the dot product taken at the i-th stride left and j-th stride down). Note that each time, we multiply the same filter Wc by a different dZ when updating dA. We do so mainly because when computing the forward propagation, each filter is dotted and summed with a different a_slice. Therefore when computing the backprop for dA, we are just adding the gradients of all the a_slices. The formula in Figure 2.3 translates to the following code in back propagation:

```python
da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += self.W[:, :, :, c] * dZ[i, h, w, c]
```

a_slice corresponds to the slice that was used to generate the activation Zij. Hence, this ends up giving us the gradient of W with respect to that slice. Since it is the same W, we simply add up all such gradients to get dW.

The formula in Figure 2.4 translates to the following code in backpropagation:

```python
dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]
```

The formula in Figure 2.4 translates to the following code in backpropagation.

```python
db[:, :, :, c] += dZ[i, h, w, c]
```

Bringing it all together, the backward propagation of convolution layer is given below:

```python
def backward(self, dZ):
    """
    Backward propagation for convolution.

    Parameters:
    dZ -- gradient of the cost with respect to the output of the conv
          layer (Z), numpy array of shape (m, n_H, n_W, n_C)

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv
               layer (A_prev), numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv
          layer (W), numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv
          layer (b), numpy array of shape (1, 1, 1, n_C)
    """
    np.random.seed(self.seed)
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    f, f, n_C_prev, n_C = self.W.shape
    m, n_H, n_W, n_C = dZ.shape
    dA_prev = np.zeros(self.A_prev.shape)
    dW = np.zeros(self.W.shape)
    db = np.zeros(self.b.shape)  # shape (1, 1, 1, n_C)
    # Pad A_prev and dA_prev
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)
    dA_prev_pad = self.zero_pad(dA_prev, self.pad)
    for i in range(m):
        a_prev_pad = A_prev_pad[i, :, :, :]
        da_prev_pad = dA_prev_pad[i, :, :, :]
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += \
                        self.W[:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]
                    db[:, :, :, c] += dZ[i, h, w, c]
        dA_prev[i, :, :, :] = da_prev_pad[self.pad:-self.pad, self.pad:-self.pad, :]
    assert (dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))
    return dA_prev, [dW, db]
```

## Gradient Check

Gradient checking is very useful for verifying your backpropagation, i.e. that you have computed the gradients correctly. It uses a two-sided difference to numerically approximate the gradients. We randomly select 2 data points from the training data and run the gradient check on them. NOTE: since gradient checking is very slow, do not use it during training.
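The two-sided difference itself is easy to illustrate on a scalar function: for J(theta) = theta^2, the approximation (J(theta + eps) - J(theta - eps)) / (2 * eps) recovers the analytic gradient 2 * theta:

```python
def grad_approx(J, theta, epsilon=1e-7):
    """Two-sided (central) difference approximation of dJ/dtheta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

J = lambda theta: theta ** 2
print(grad_approx(J, 3.0))  # ~6.0, matching the analytic gradient 2 * theta
```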

```python
def grad_check():
    train_set_x, train_set_y, test_set_x, test_set_y, n_class = load_data()
    # randomly select 2 data points from the training data
    n = 2
    index = np.random.choice(train_set_x.shape[0], n)
    train_set_x = train_set_x[index]
    train_set_y = train_set_y[:, index]
    cnn = make_model(train_set_x, n_class)
    print(cnn.layers)
    A = cnn.forward(train_set_x)
    loss, dA = softmaxloss(A, train_set_y)
    assert (A.shape == dA.shape)
    grads = cnn.backward(dA)
    grads_values = grads_to_vector(grads)
    initial_params = cnn.params
    parameters_values = params_to_vector(initial_params)  # initial parameters
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    print('number of parameters: ', num_parameters)
    epsilon = 1e-7
    assert (len(grads_values) == len(parameters_values))
    for i in tqdm(range(num_parameters)):
        thetaplus = copy.deepcopy(parameters_values)
        thetaplus[i] = thetaplus[i] + epsilon
        new_param = vector_to_param(thetaplus, initial_params)
        # make sure only one parameter is changed
        assert (compare(new_param, initial_params) == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_plus[i], _ = softmaxloss(A, train_set_y)

        thetaminus = copy.deepcopy(parameters_values)
        thetaminus[i] = thetaminus[i] - epsilon
        new_param = vector_to_param(thetaminus, initial_params)
        # make sure only one parameter is changed
        assert (compare(new_param, initial_params) == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_minus[i], _ = softmaxloss(A, train_set_y)

        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
    numerator = np.linalg.norm(gradapprox - grads_values)
    denominator = np.linalg.norm(grads_values) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! "
              "difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! "
              "difference = " + str(difference) + "\033[0m")
    return difference
```

If your backpropagation works, it will output a message such as the one in Figure 2.6. If there is a mistake in your backpropagation, one approach is to compare individual values of the approximate gradients and the analytic gradients, check where the difference is large, and inspect the implementation of those gradients. One more thing to note: we can encounter kinks, which are a source of inaccuracy and can make the gradient check fail. Kinks are non-differentiable parts of the objective function, introduced by functions such as ReLU (max(0, x)). For instance, consider gradient checking at x = -1e-8. As you may recall from the ReLU backward propagation in Figure 2.2, since x < 0, it computes a zero gradient. However, when computing the two-sided difference, x + epsilon with epsilon = 1e-7 gives 9e-8 > 0, which introduces a non-zero gradient.
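The ReLU kink can be reproduced directly: the analytic gradient at x = -1e-8 is 0, but the two-sided difference is not:

```python
def relu(x):
    return max(0.0, x)

x, eps = -1e-8, 1e-7
# relu(x + eps) = relu(9e-8) = 9e-8, relu(x - eps) = relu(-1.1e-7) = 0
numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps)
print(numeric)  # ~0.45, not the analytic 0: the window straddles the kink
```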

## Adam Optimizer

The Adam optimizer is used as the optimization algorithm to minimize the loss function. It is known to work well on a variety of problems. The hyperparameters for Adam are: learning rate, beta1, beta2, and epsilon. The default choice for beta1 is 0.9 and for beta2 is 0.999. The choice of epsilon does not matter very much, and it is set to 1e-08. Generally all the other hyperparameters are kept at their default values and only the learning rate is tuned. Beta1 controls an exponentially weighted average of the derivatives, called the first moment, and beta2 controls an exponentially weighted average of their squares, called the second moment. Adam has relatively low memory requirements and usually works well with little tuning of hyperparameters other than the learning rate. [Kingma et al. 2014]
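To make the update rule concrete, here is a minimal single-parameter Adam loop (a sketch of the same equations, not the class used in this article), minimizing J(w) = w² from w = 5:

```python
import numpy as np

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    v, s = 0.0, 0.0                              # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g          # momentum (first moment)
        s = beta2 * s + (1 - beta2) * g ** 2     # RMSprop (second moment)
        v_hat = v / (1 - beta1 ** t)             # bias correction
        s_hat = s / (1 - beta2 ** t)
        w -= lr * v_hat / (np.sqrt(s_hat) + eps)
    return w

# dJ/dw = 2w for J(w) = w^2; Adam drives w close to the minimum at 0
w = adam_minimize(lambda w: 2 * w, 5.0)
```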

```python
class Adam(object):
    def __init__(self, model, X_train, y_train, learning_rate, epoch,
                 minibatch_size, X_test, y_test):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.learning_rate = learning_rate
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-08
        self.epoch = epoch
        self.X_test = X_test
        self.y_test = y_test
        self.num_layer = len(self.model.layers)
        self.minibatch_size = minibatch_size

    def initialize_adam(self):
        VdW, Vdb, SdW, Sdb = [], [], [], []
        for param_layer in self.model.params:
            if len(param_layer) != 2:  # layers with no learnable parameters
                VdW.append(np.zeros_like([]))
                Vdb.append(np.zeros_like([]))
                SdW.append(np.zeros_like([]))
                Sdb.append(np.zeros_like([]))
            else:
                W, b = param_layer
                VdW.append(np.zeros_like(W))
                Vdb.append(np.zeros_like(b))
                SdW.append(np.zeros_like(W))
                Sdb.append(np.zeros_like(b))
        assert len(VdW) == self.num_layer
        assert len(Vdb) == self.num_layer
        assert len(SdW) == self.num_layer
        assert len(Sdb) == self.num_layer
        return VdW, Vdb, SdW, Sdb

    def update_parameters(self, VdW, Vdb, SdW, Sdb, grads, t):
        # gradients come back in reverse layer order
        grads = list(reversed(grads))
        for i in range(len(grads)):
            if len(grads[i]) != 0:  # layer with weights and biases
                dW, db = grads[i]
                # Moving average of the gradients (Momentum)
                VdW[i] = self.beta1 * VdW[i] + (1 - self.beta1) * dW
                Vdb[i] = self.beta1 * Vdb[i] + (1 - self.beta1) * db
                # Moving average of the squared gradients (RMSprop)
                SdW[i] = self.beta2 * SdW[i] + (1 - self.beta2) * np.power(dW, 2)
                Sdb[i] = self.beta2 * Sdb[i] + (1 - self.beta2) * np.power(db, 2)
                # Bias-corrected first moment estimate
                VdW_corrected = VdW[i] / (1 - self.beta1 ** t)
                Vdb_corrected = Vdb[i] / (1 - self.beta1 ** t)
                # Bias-corrected second raw moment estimate
                SdW_corrected = SdW[i] / (1 - self.beta2 ** t)
                Sdb_corrected = Sdb[i] / (1 - self.beta2 ** t)
                # weight update
                self.model.params[i][0] -= self.learning_rate * VdW_corrected / (
                    np.sqrt(SdW_corrected) + self.epsilon)
                # bias update
                self.model.params[i][1] -= self.learning_rate * Vdb_corrected / (
                    np.sqrt(Sdb_corrected) + self.epsilon)

    def minimize(self):
        costs = []
        t = 0
        np.random.seed(1)
        VdW, Vdb, SdW, Sdb = self.initialize_adam()
        for i in tqdm(range(self.epoch)):
            start = time.time()
            epoch_loss = 0
            minibatches = get_minibatches(self.X_train, self.y_train,
                                          self.minibatch_size)
            for minibatch in tqdm(minibatches):
                (minibatch_X, minibatch_Y) = minibatch
                # forward and backward propagation
                loss, grads = self.model.fit(minibatch_X, minibatch_Y)
                epoch_loss += loss
                t = t + 1  # Adam counter
                self.update_parameters(VdW, Vdb, SdW, Sdb, grads, t)
            # Print the cost every epoch
            end = time.time()
            epoch_time = end - start
            train_acc = accuracy(self.model.predict(self.X_train), self.y_train)
            val_acc = accuracy(self.model.predict(self.X_test), self.y_test)
            print("Cost after epoch %i: %f" % (i, epoch_loss), 'time (s):',
                  epoch_time, 'train_acc:', train_acc, 'val_acc:', val_acc)
            costs.append(epoch_loss)
        print('total_cost', costs)
        return self.model, costs
```

Now we have reached the end of the article; hopefully you have followed along. If you liked it, don’t forget to give it a thumbs up 🙂