Original article was published on Deep Learning on Medium
where zᵢ refers to the output from the previous layer (l-1), l=2, 3, …, L and k = 1, 2, 3, …, Nₗ as shown in Fig. 3. The parameters ωₖᵢ are weights and bₖ biases for kᵗʰ neuron of the layer l.
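Written out in this notation, the summed input aₖ to neuron k of layer l (referred to again later in the derivation) is:

```latex
a_k^{(l)} = \sum_{i=1}^{N_{l-1}} \omega_{ki} \, z_i^{(l-1)} + b_k
```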
The output of the neuron is then
Thus for the output layer L, we have
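With h(·) denoting the activation function (the article does not fix a particular choice for the hidden layers), these two statements read:

```latex
z_k^{(l)} = h\left(a_k^{(l)}\right), \qquad z_k^{(L)} = h\left(a_k^{(L)}\right)
```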
The network assigns an input pattern vector x to class cₖ if output neuron k has the largest output value; that is, if zₖ⁽ᴸ⁾ > zⱼ⁽ᴸ⁾ for all j = 1, 2, …, Nʟ with j ≠ k.
Thus, the output of the last layer L is what matters, as it determines the final correspondence between an input vector and its class membership. This is why the activation function of the output layer sometimes differs from that of the hidden layers. The output activation function is chosen to have a probabilistic nature (the sum of all the outputs is 1), so that the output neuron k with the highest value represents the corresponding class cₖ for a particular input vector x.
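One common output activation with exactly this property (the article does not name one, so this is only an example) is the softmax:

```latex
z_k^{(L)} = \frac{e^{a_k^{(L)}}}{\sum_{j=1}^{N_L} e^{a_j^{(L)}}}, \qquad \sum_{k=1}^{N_L} z_k^{(L)} = 1
```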
Well, now might be the right time to sip that juice🍹 and refresh yourself as we are about to move on to step 2 of the algorithm.
Backpropagation of Error to Train Neural Network
So far we have discussed how a feedforward neural network carries out the computations that produce its output. But after supplying all the input vectors (one at a time) to the network and calculating the activations of all the hidden and output units by successive application of the above equations, we are left with the task of updating the weights so that they converge to their final, correct values.
We know the desired response of every output neuron of a multilayer neural net, but we have no way of knowing what the outputs of the hidden neurons should be. So, to evaluate the error, we rely on the desired values at the output layer, and then we propagate this error back into the network to compute the changes required to update the weights and biases. The weights are updated until the error reaches an acceptable level.
Let us define the error function for the network
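Summing the per-pattern errors over the training set (with N denoting the number of pattern vectors, a symbol introduced here), this is:

```latex
E = \sum_{n=1}^{N} E_n
```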
Here Eₙ is the error for a single pattern vector: xₙ and is defined as,
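A sum-of-squares error is consistent with the (zₖ − rₖ) factor that appears later in the derivation:

```latex
E_n = \frac{1}{2} \sum_{k=1}^{N_L} \left( z_k^{(L)} - r_k \right)^2
```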
k = 1,2,3,…, Nʟ;
zₖ is the output value of the kᵗʰ neuron of the output layer L, and
rₖ is the desired response for the kᵗʰ neuron of the output layer L.
We will use the gradient descent algorithm to compute the changes in the weights to be made to approach the optimized weights, that is, we will adjust the weights in proportion to the partial derivative of the error function,
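With a learning rate α (a symbol assumed here, not taken from the article), the gradient-descent correction is:

```latex
\Delta \omega_{ki} = -\alpha \, \frac{\partial E_n}{\partial \omega_{ki}}
```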
where k represents a neuron of layer l (k=1, 2, …, Nₗ) and i represents a neuron of layer (l-1) (i=1,2,..Nₗ₋₁).
Thus, we first need to compute the gradient of the error function Eₙ with respect to the weights ωₖᵢ. The error Eₙ depends on the weight ωₖᵢ only via the summed input aₖ to neuron k (eq 3). We can, therefore, apply the chain rule for partial derivatives to get
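In symbols, this chain-rule step is:

```latex
\frac{\partial E_n}{\partial \omega_{ki}} = \frac{\partial E_n}{\partial a_k} \, \frac{\partial a_k}{\partial \omega_{ki}}
```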
We now introduce a useful notation
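Namely, the error of neuron k is defined as:

```latex
\delta_k \equiv \frac{\partial E_n}{\partial a_k}
```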
where the δ’s are often referred to as errors. Using (3), we can write
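Since aₖ is a weighted sum of the previous layer's outputs, the derivative of aₖ with respect to ωₖᵢ is simply the corresponding input:

```latex
\frac{\partial a_k}{\partial \omega_{ki}} = z_i^{(l-1)}
```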
Substituting (10) and (11) into (9), we have
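This gives the gradient in a compact product form:

```latex
\frac{\partial E_n}{\partial \omega_{ki}} = \delta_k \, z_i^{(l-1)}
```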
For the output layer, we put l=L and we obtain,
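That is, for the weights feeding the output layer:

```latex
\frac{\partial E_n}{\partial \omega_{ki}^{(L)}} = \delta_k^{(L)} \, z_i^{(L-1)}
```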
It is easy to compute δₖ⁽ᴸ⁾ for the output layer using chain rule:
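Namely, splitting the derivative through the neuron's output:

```latex
\delta_k^{(L)} = \frac{\partial E_n}{\partial a_k^{(L)}} = \frac{\partial E_n}{\partial z_k^{(L)}} \, \frac{\partial z_k^{(L)}}{\partial a_k^{(L)}}
```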
Using (7) and (4) we arrive at the following result:
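With a sum-of-squares Eₙ and output activation h, this result is:

```latex
\delta_k^{(L)} = \left( z_k^{(L)} - r_k \right) h'\left( a_k^{(L)} \right)
```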
But for the hidden layers (l = L − 1, L − 2, …, 1), computing δₖ⁽ˡ⁾ is slightly trickier. We will again make use of the chain rule of partial derivatives,
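That is, summing over the neurons m of layer l + 1 that aₖ feeds into:

```latex
\delta_k^{(l)} = \frac{\partial E_n}{\partial a_k^{(l)}} = \sum_m \frac{\partial E_n}{\partial a_m^{(l+1)}} \, \frac{\partial a_m^{(l+1)}}{\partial a_k^{(l)}}
```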
where the sum runs over all neurons m of layer l + 1 to which neuron k of layer l sends connections. Here we have made use of the fact that variations in aₖ give rise to variations in the error function only through variations in the variables aₘ. Carrying out the differentiation, we get
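The result is the backpropagation formula for the hidden-layer errors:

```latex
\delta_k^{(l)} = h'\left( a_k^{(l)} \right) \sum_m \omega_{mk}^{(l+1)} \, \delta_m^{(l+1)}
```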
The terms on the R.H.S. of the above equation can all be computed, so we have found a way to propagate the error back into the network, starting with the error at the output layer. We may now summarize the whole procedure as follows. For any two layers l and l − 1, we modify the weights ωₖᵢ that connect the two layers using the equation
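With t indexing the iteration and α the learning rate (both symbols assumed here), the update is:

```latex
\omega_{ki}(t+1) = \omega_{ki}(t) - \alpha \, \delta_k^{(l)} \, z_i^{(l-1)}
```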
If k represents a neuron of the output layer (l = L), δₖ⁽ᴸ⁾ is
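As derived above, the output-layer error is computed directly from the desired responses:

```latex
\delta_k^{(L)} = \left( z_k^{(L)} - r_k \right) h'\left( a_k^{(L)} \right)
```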
If k instead represents a neuron of a hidden layer l, with m running over the neurons of layer l + 1, then
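the hidden-layer error is obtained from the errors of the layer above:

```latex
\delta_k^{(l)} = h'\left( a_k^{(l)} \right) \sum_m \omega_{mk}^{(l+1)} \, \delta_m^{(l+1)}
```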
A similar procedure is followed for the biases. Alternatively, you may add one neuron to each layer that corresponds to the bias term, append an element '1' to the pattern vector, and follow the above procedure.
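Concretely, since ∂aₖ/∂bₖ = 1, the bias gradient is just δₖ, so the bias update mirrors the weight update:

```latex
\frac{\partial E_n}{\partial b_k} = \delta_k, \qquad b_k(t+1) = b_k(t) - \alpha \, \delta_k
```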
Thus, the network tries to find the optimum weights for a given set of pattern vectors by minimizing the error function and this process is referred to as training. During a successful training session, the error decreases with the number of iterations and the procedure converges to a stable set of weights.
After the system has been trained, it can classify patterns using the parameters established during the training phase. In normal operation, all feedback paths are disconnected. Then any input pattern vector is allowed to pass forward through the network and is then assigned a class based on the output value of the neurons of the output layer.
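The whole procedure, per-pattern gradient-descent training followed by feedforward classification by the largest output, can be sketched in NumPy. This is a minimal illustration only: the 2-3-2 layer sizes, the sigmoid activation, the learning rate, and the toy patterns are all assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# Hypothetical 2-3-2 architecture: W[l] maps layer l-1 outputs to layer l inputs.
sizes = [2, 3, 2]
W = [rng.normal(0.0, 0.5, (n_out, n_in))
     for n_in, n_out in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n_out) for n_out in sizes[1:]]
alpha = 0.5  # learning rate (symbol assumed)

def train_step(x, r):
    """One forward pass, one backward pass, one weight update for pattern (x, r)."""
    zs, pre = [x], []
    for Wl, bl in zip(W, b):
        a = Wl @ zs[-1] + bl       # summed input a_k
        pre.append(a)
        zs.append(sigmoid(a))      # neuron output z_k = h(a_k)
    err = 0.5 * np.sum((zs[-1] - r) ** 2)            # E_n before the update
    delta = (zs[-1] - r) * sigmoid_prime(pre[-1])    # output-layer error delta_k
    for l in range(len(W) - 1, -1, -1):
        grad_W = np.outer(delta, zs[l])  # dE_n/dw_ki = delta_k * z_i
        grad_b = delta.copy()            # dE_n/db_k  = delta_k
        if l > 0:                        # propagate delta to the layer below
            delta = sigmoid_prime(pre[l - 1]) * (W[l].T @ delta)
        W[l] -= alpha * grad_W
        b[l] -= alpha * grad_b
    return err

def classify(x):
    """Feedforward only: assign x to the class of the largest output neuron."""
    z = x
    for Wl, bl in zip(W, b):
        z = sigmoid(Wl @ z + bl)
    return int(np.argmax(z))

# Toy XOR-style patterns with one-hot desired responses r
patterns = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),
            (np.array([0.0, 1.0]), np.array([0.0, 1.0])),
            (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
            (np.array([1.0, 1.0]), np.array([1.0, 0.0]))]

first_err = None
for epoch in range(5000):
    total = sum(train_step(x, r) for x, r in patterns)
    if first_err is None:
        first_err = total
print(f"error fell from {first_err:.3f} to {total:.3f}")
```

Each call to train_step performs one forward pass, one backward sweep of the δ's, and one weight update, in exactly the order described above; classify then runs only the forward pass, as in normal operation with all feedback paths disconnected.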