Source: Deep Learning on Medium
In this guide we will build a deep neural network with as many layers as you want. The network can be applied to supervised learning problems with binary classification.
Superscript [l] denotes a quantity associated with the lᵗʰ layer.
Superscript (i) denotes a quantity associated with the iᵗʰ example.
Subscript i denotes the iᵗʰ entry of a vector.
This article assumes that the reader is already familiar with the concept of a neural network. If you are not, I recommend reading this introduction first: https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6
A neuron computes a linear function (z = Wx + b) followed by an activation function. We generally say that the output of a neuron is a = g(Wx + b) where g is the activation function (sigmoid, tanh, ReLU, …).
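As a small illustration of a = g(Wx + b), here is a sketch of a single neuron with three inputs and a sigmoid activation. The weights, bias and input values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: one neuron with three inputs.
W = np.array([[0.2, -0.5, 0.1]])       # weights, shape (1, 3)
x = np.array([[1.0], [2.0], [3.0]])    # one input example, shape (3, 1)
b = np.array([[0.5]])                  # bias, shape (1, 1)

z = W @ x + b      # linear part: 0.2 - 1.0 + 0.3 + 0.5, which is 0 up to rounding
a = sigmoid(z)     # activation, so a is about 0.5
```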
Let’s assume that we have a very large dataset of weather records with temperature, humidity, atmospheric pressure and the probability of rain. The dataset consists of:
- a training set of m_train weather records labeled as rain (1) or no rain (0)
- a test set of m_test weather records labeled as rain or no rain
- each weather record consists of x1 = temperature, x2 = humidity, x3 = atmospheric pressure
One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array.
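A minimal sketch of this preprocessing step, following the description above (mean and standard deviation of the whole array). The array X_train and its values are hypothetical; rows are features and columns are examples.

```python
import numpy as np

# Hypothetical raw data: rows are temperature, humidity, pressure; columns are examples.
X_train = np.array([
    [21.0, 25.0, 18.0, 30.0],
    [0.60, 0.80, 0.55, 0.90],
    [1012.0, 1005.0, 1020.0, 998.0],
])

def standardize(X, mean, std):
    """Center X with the given mean and scale it with the given standard deviation."""
    return (X - mean) / std

mean = X_train.mean()   # mean of the whole array
std = X_train.std()     # standard deviation of the whole array
X_train_std = standardize(X_train, mean, std)
```

When features live on very different scales, as they do here, a common alternative is to standardize each feature separately with `X_train.mean(axis=1, keepdims=True)` and `X_train.std(axis=1, keepdims=True)`. In both cases the test set should be transformed with the statistics computed on the training set.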
General methodology (building the parts of our algorithm)
We will follow the Deep Learning methodology to build the model:
1. Define the model structure (such as number of input features)
2. Initialize parameters and define hyperparameters:
- number of iterations
- number of layers L in the neural network
- size of the hidden layers
- learning rate α
3. Loop for num_iterations:
- Forward propagation (calculate current loss)
- Compute cost function
- Backward propagation (calculate current gradient)
- Update parameters (using parameters, and grads from backprop)
4. Use trained parameters to predict labels
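The four steps above can be sketched on the smallest possible model: a single sigmoid neuron with no hidden layers, so the hyperparameters for depth and hidden layer size do not appear yet. The function names, the toy data and the cross-entropy cost are assumptions made for this illustration, not the article's final implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate=0.1, num_iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    n_x, m = X.shape                          # step 1: structure (n_x input features)
    W = rng.standard_normal((1, n_x)) * 0.01  # step 2: initialize parameters
    b = 0.0
    costs = []
    for _ in range(num_iterations):           # step 3: loop for num_iterations
        A = sigmoid(W @ X + b)                # forward propagation
        cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # cross-entropy cost
        dZ = A - Y                            # backward propagation
        dW = dZ @ X.T / m
        db = np.sum(dZ) / m
        W = W - learning_rate * dW            # update parameters
        b = b - learning_rate * db
        costs.append(cost)
    return W, b, costs

def predict(X, W, b):
    """Step 4: use the trained parameters to predict labels."""
    return (sigmoid(W @ X + b) > 0.5).astype(int)

# Hypothetical toy data: 3 features, 200 examples, label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))
Y = (X[0:1] + X[1:2] > 0).astype(int)         # shape (1, 200)

W, b, costs = train(X, Y)
accuracy = np.mean(predict(X, W, b) == Y)
```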
The initialization for a deeper L-layered neural network is more complicated because there are many more weight matrices and bias vectors. I provide the tables below in order to help you keep the right dimensions of the structures.
Table 2 helps us prepare correct dimensions for the matrices of our example neural network architecture from Figure 1.
Initializing the parameters with small random numbers is a simple approach, and it usually gives a good enough starting point for our algorithm.
- Different initialization techniques such as Zero, Random, He or Xavier lead to different results
- Random initialization makes sure different hidden units can learn different things (initializing all the weights to zero means that every neuron in a layer will learn the same thing)
- Don’t initialize to values that are too large
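A minimal sketch of this kind of initialization, assuming the architecture is described by a list of layer sizes. The name layer_dims, the dictionary keys and the 0.01 scale are assumptions for the example. The weight matrix of layer l has shape (units in layer l, units in layer l-1) and the bias vector has shape (units in layer l, 1).

```python
import numpy as np

# Hypothetical architecture: 3 input features, two hidden layers of 4 units, 1 output unit.
layer_dims = [3, 4, 4, 1]

def initialize_parameters(layer_dims, seed=1):
    """Small random weights and zero biases for every layer."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # W[l] has shape (n[l], n[l-1]); scaling by 0.01 keeps the initial values small.
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # b[l] has shape (n[l], 1); zeros are fine here because W already breaks the symmetry.
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters(layer_dims)
```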
Activation functions give the neural networks non-linearity. In our example, we will use sigmoid and ReLU.
Sigmoid outputs a value between 0 and 1 which makes it a very good choice for binary classification. You can classify the output as 0 if it is less than 0.5 and classify it as 1 if the output is more than 0.5.
In Snippet 2 you can see the vectorized implementation of the activation functions and their derivatives (https://en.wikipedia.org/wiki/Derivative). This code is used in the later calculations.
During forward propagation, the forward function for a layer l needs to know which activation function the layer uses (sigmoid, tanh, ReLU, etc.). Given the input signal from the previous layer, we compute Z and then apply the selected activation function.
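For reference, one possible vectorized version of sigmoid, ReLU and their derivatives is sketched here. The function names are assumptions, and the derivative of ReLU at exactly 0 is taken to be 0 by convention.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def sigmoid_derivative(Z):
    """d(sigmoid)/dZ = sigmoid(Z) * (1 - sigmoid(Z)), element-wise."""
    s = sigmoid(Z)
    return s * (1 - s)

def relu_derivative(Z):
    """1 where Z > 0 and 0 elsewhere (the value at Z == 0 is a convention)."""
    return (Z > 0).astype(float)

Z = np.array([[-2.0, 0.0, 3.0]])
A_sigmoid = sigmoid(Z)
A_relu = relu(Z)
```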
The linear forward module (vectorized over all the examples) computes the following equations:
We use a “cache” (a Python dictionary that contains the A and Z values computed for particular layers) to pass variables computed during forward propagation to the corresponding backward propagation step. It contains the values that backward propagation needs to compute derivatives.
In order to monitor the learning process, we need to calculate the value of the cost function. We will use the formula below to calculate the cost.
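A sketch of forward propagation with a cache, assuming each layer computes Z = W·A_prev + b followed by A = g(Z), with ReLU in the hidden layers and sigmoid in the output layer. The signature of linear_activation_forward, the dictionary keys and the example network are assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def linear_activation_forward(A_prev, W, b, activation):
    """One layer: linear step followed by the chosen activation."""
    Z = W @ A_prev + b      # b has shape (n_l, 1) and broadcasts over the m examples
    if activation == "sigmoid":
        A = sigmoid(Z)
    elif activation == "relu":
        A = relu(Z)
    else:
        raise ValueError("unsupported activation: " + activation)
    return A, Z

def forward_propagation(X, parameters):
    L = len(parameters) // 2            # parameters holds one W and one b per layer
    cache = {"A0": X}                   # the input counts as the activation of layer 0
    A = X
    for l in range(1, L + 1):
        activation = "sigmoid" if l == L else "relu"
        A, Z = linear_activation_forward(
            A, parameters["W" + str(l)], parameters["b" + str(l)], activation
        )
        cache["Z" + str(l)] = Z         # stored for the backward pass
        cache["A" + str(l)] = A
    return A, cache

# Hypothetical 3-4-1 network applied to 5 examples.
rng = np.random.default_rng(0)
parameters = {
    "W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))
AL, cache = forward_propagation(X, parameters)
```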
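A minimal sketch of the cost computation, assuming the cost is the binary cross-entropy averaged over the m examples, which is the usual choice with a sigmoid output. The function name and the example values are hypothetical.

```python
import numpy as np

def compute_cost(AL, Y):
    """Binary cross-entropy averaged over the m examples (columns)."""
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

AL = np.array([[0.9, 0.2, 0.7]])   # hypothetical network outputs for 3 examples
Y = np.array([[1, 0, 1]])          # true labels
cost = compute_cost(AL, Y)         # -(ln 0.9 + ln 0.8 + ln 0.7) / 3, about 0.228
```

The cost is small when the outputs are close to the labels and grows quickly when a confident prediction is wrong.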
Backpropagation is used to calculate the gradient of the loss function with respect to the parameters. The algorithm applies the “chain rule” from differential calculus recursively, layer by layer, starting from the output.
Equations used in backpropagation calculation:
The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions built by nesting one function inside another.
It is difficult to calculate the derivative of the loss without the “chain rule” (equation 5 is an example).
The first step in backpropagation for our neural network model is to calculate the derivative of our loss function with respect to Z from the last layer. Equation 6 consists of two components, the derivative of the loss function from equation 2 (with respect to the activation function) and the derivative of the activation function “sigmoid” with respect to Z from the last layer.
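This first step can be checked numerically for a single example. The sketch assumes the loss is the cross-entropy L = -(y·log(a) + (1 - y)·log(1 - a)) with a = sigmoid(z). Under that assumption the product of the two components simplifies to a - y, and a finite-difference estimate should agree with it. The values of z and y are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Cross-entropy loss of a sigmoid output for one example."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.3, 1.0                          # hypothetical pre-activation and label
a = sigmoid(z)

dL_da = -(y / a) + (1 - y) / (1 - a)     # derivative of the loss w.r.t. the activation
da_dz = a * (1 - a)                      # derivative of sigmoid w.r.t. z
dL_dz = dL_da * da_dz                    # chain rule; simplifies to a - y

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
```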
The result from equation 6 can be used to calculate the derivatives from equation 3:
The derivative of the loss function with respect to the activation function from the third layer (equation 7) is used in the further calculation.
The result from equation 7 and the derivative of the activation function “ReLU” from the third layer are used to calculate the derivatives from equation 8 (the derivative of the loss function with respect to Z). Following this, we make a calculation for equation 3.
We make similar calculations for equations 9 and 10.
The general idea:
The derivative of the loss function with respect to Z of the lᵗʰ layer is used to calculate the derivative of the loss function with respect to A of the (l-1)ᵗʰ layer (the previous layer). That result is then multiplied by the derivative of the activation function to get the derivative with respect to Z of the (l-1)ᵗʰ layer, and the process repeats.
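The backward step for one layer can be sketched as follows. The signature of linear_activation_backward and the example shapes are assumptions, and the 1/m factor assumes the cost is averaged over the m examples.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def linear_activation_backward(dA, Z, A_prev, W, activation):
    """Given dA for layer l, return dA for layer l-1 plus dW and db for layer l."""
    m = A_prev.shape[1]
    if activation == "relu":
        dZ = dA * (Z > 0)                 # ReLU derivative is 1 where Z > 0, else 0
    elif activation == "sigmoid":
        s = sigmoid(Z)
        dZ = dA * s * (1 - s)             # sigmoid derivative is s * (1 - s)
    else:
        raise ValueError("unsupported activation: " + activation)
    dW = dZ @ A_prev.T / m                # gradient of the cost w.r.t. W of this layer
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                    # passed on to the previous layer
    return dA_prev, dW, db

# Hypothetical layer with 4 units, fed by a layer with 3 units, on 5 examples.
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))
W = rng.standard_normal((4, 3))
Z = W @ A_prev                            # bias left out of this toy example
dA = rng.standard_normal((4, 5))
dA_prev, dW, db = linear_activation_backward(dA, Z, A_prev, W, "relu")
```

Each gradient has the same shape as the quantity it belongs to: dW matches W, db matches b, and dA_prev matches A_prev.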
The goal of this function is to update the parameters of the model using gradient descent.
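A minimal sketch of the gradient descent update, where every parameter moves a small step against its gradient. The dictionary layout ("W1", "b1", "dW1", "db1", ...) and the numbers are assumptions for the example.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """Gradient descent step: W = W - α·dW and b = b - α·db for every layer."""
    L = len(parameters) // 2
    updated = {}
    for l in range(1, L + 1):
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

# Hypothetical one-layer example.
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[0.5, -0.5]]), "db1": np.array([[1.0]])}
new_parameters = update_parameters(parameters, grads, learning_rate=0.1)
# new W1 is [[0.95, 2.05]] and new b1 is [[-0.1]]
```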
The full implementation of the neural network model consists of the methods provided in snippets.
In order to make a prediction, you only need to run a full forward propagation using the learned weights and biases and a set of test data.
You can modify nn_architecture in Snippet 1 to build a neural network with a different number of layers and different sizes of the hidden layers. If you want to use other activation functions, implement them and their derivatives (Snippet 2), then use them in the linear_activation_forward method in Snippet 3 and the linear_activation_backward method in Snippet 5.
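A sketch of prediction, assuming ReLU in the hidden layers, sigmoid in the output layer and the 0.5 threshold mentioned earlier. The function name, the parameter layout and the hand-picked weights are hypothetical.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def predict(X, parameters, threshold=0.5):
    """Run a full forward pass and turn the sigmoid outputs into 0/1 labels."""
    L = len(parameters) // 2
    A = X
    for l in range(1, L + 1):
        Z = parameters["W" + str(l)] @ A + parameters["b" + str(l)]
        A = sigmoid(Z) if l == L else np.maximum(0, Z)   # ReLU in the hidden layers
    return (A > threshold).astype(int)

# Hand-picked parameters for a tiny 2-2-1 network (not trained, for illustration only).
parameters = {
    "W1": np.eye(2), "b1": np.zeros((2, 1)),
    "W2": np.array([[1.0, -1.0]]), "b2": np.zeros((1, 1)),
}
X_test = np.array([[2.0, 0.0],
                   [0.0, 3.0]])           # two examples, one per column
predictions = predict(X_test, parameters)  # output-layer Z is [2, -3], so labels are [[1, 0]]
```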
You may run into the “overfitting” problem if the training dataset is not big enough. It means that the learned network doesn’t generalize to new examples that it has never seen. You can use regularization methods such as L2 regularization (it consists of appropriately modifying your cost function) or dropout (it randomly shuts down some neurons in each iteration).
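Both ideas can be sketched in a few lines. The L2 version below assumes a cross-entropy cost and adds λ/(2m) times the sum of the squared weights. The dropout version is the common "inverted dropout" variant, which rescales the surviving activations by 1/keep_prob. The function names and the values of lambd and keep_prob are assumptions.

```python
import numpy as np

def compute_cost_with_l2(AL, Y, parameters, lambd):
    """Cross-entropy cost plus an L2 penalty on all weight matrices."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2
    squared_weights = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return float(cross_entropy + lambd / (2 * m) * squared_weights)

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(A.shape) < keep_prob
    return A * mask / keep_prob, mask

# Hypothetical outputs, labels and weights (only used to evaluate the cost).
AL = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1, 0, 1]])
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.zeros((1, 1))}
plain = compute_cost_with_l2(AL, Y, parameters, lambd=0.0)
regularized = compute_cost_with_l2(AL, Y, parameters, lambd=0.3)   # plain + 0.3 / 6 * 5

rng = np.random.default_rng(0)
A = np.ones((4, 5))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
```

With L2 regularization the gradient of each weight matrix also gains a term (λ/m)·W. With dropout the same mask has to be applied again in the backward pass, and dropout is switched off at prediction time.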
We used Gradient Descent to update the parameters and minimize the cost. You can learn more advanced optimization methods that can speed up learning and even get you to a better final value for the cost function, for example:
- Mini-batch gradient descent
- Adam optimizer
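As an illustration of the first option, here is a sketch of how a dataset could be split into shuffled mini-batches. Each batch would then go through one forward pass, backward pass and parameter update. The function name, the batch size and the toy data are assumptions.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples (columns) and cut them into consecutive mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    # The same permutation is applied to X and Y so examples stay paired with their labels.
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

# Hypothetical data: 3 features, 10 examples; the label is the parity of the first feature.
X = np.arange(30, dtype=float).reshape(3, 10)
Y = (X[0:1] % 2).astype(int)
batches = random_mini_batches(X, Y, batch_size=4)   # batch sizes 4, 4 and 2
```

The last batch is smaller whenever the number of examples is not a multiple of the batch size.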