Source: Deep Learning on Medium
In the first part we saw how back propagation is derived so as to minimize the cost function. In this article we look at the implementation, along with some best practices for avoiding common pitfalls.
We are still in the simple mode, where inputs are handled one sample at a time.
Consider a fully connected neural network such as in the figure below.
Each layer will be modelled by a Layer object containing the weights, the activation values (output of the layer), the gradient dZ (not represented in the image), the cumulative error delta (𝚫), as well as the activation function f(x) and its derivative f’(x). The intermediate values are stored to avoid recomputing them each time they are needed.
Advice: It is better to organize the code around a few classes than to cram everything into arrays, where it is very easy to get lost.
Note that the input layer won’t be represented by a Layer object since it consists only of a vector.
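import numpy as np

class Layer:
    def __init__(self, dim, id, act, act_prime,
                 isoutputLayer=False):
        # dim is the shape of the weight matrix: (inputs, outputs)
        self.weight = 2 * np.random.random(dim) - 1  # uniform in [-1, 1)
        self.delta = None   # cumulative error, set by backward()
        self.A = None       # activation (output), set by forward()
        self.dZ = None      # f'(z), set by forward()
        self.activation = act
        self.activation_prime = act_prime
        self.isoutputLayer = isoutputLayer
        self.id = id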
The constructor of the Layer class takes as parameters:
- dim: dimensions of the weight matrix,
- id: integer id of the layer,
- act, act_prime: the activation function and its derivative,
- isoutputLayer: True if this layer is the output, False otherwise.
It initializes the weights randomly to numbers between -1 and +1, and sets the different variables to be used inside the object.
The Layer object has three methods:
- forward, to compute the layer output.
- backward, to propagate the error between the target and the output back through the network.
- update, to update the weights by gradient descent.
    def forward(self, x):
        z = np.dot(x, self.weight)
        self.A = self.activation(z)
        # derivative taken at z, not at the output A
        self.dZ = self.activation_prime(z)
        return self.A
The forward function takes the input x, computes z = x·W, stores the output A = activation(z), and returns it. It also computes and stores dZ = activation_prime(z), which is the derivative of the output with respect to z.
The backward function takes two parameters: the target y, and rightLayer, the layer immediately to the right of the current one (closer to the output). The output layer has no right layer, so None is passed for it.
It computes the cumulative error delta that propagates from the output leftward to the beginning of the network.
IMPORTANT: a common mistake is to think of backward propagation as some kind of loopback in which the output is injected back into the network. So instead of using dZ = self.activation_prime(z), some people use self.activation_prime(A). This is wrong, simply because what we are trying to figure out is how the output A varies with respect to the input z. By the chain rule, this means computing the derivative ∂a/∂z = ∂f(z)/∂z = f’(z).
This mistake may come from the fact that for the sigmoid activation function a = 𝜎(z), the derivative is 𝜎’(z) = 𝜎(z)*(1-𝜎(z)) = a*(1-a). This gives the illusion that the output is injected back into the network, while in truth we are computing 𝜎’(z).
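The pitfall can be made concrete with a small sketch. The names sigmoid and sigmoid_prime mirror the ones the network constructor refers to, but their bodies are an assumption here, since the article does not list them. Because the derivative is written as a function of z, feeding it the activation a gives a different number, which a finite-difference estimate exposes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative with respect to z, expressed through s = sigmoid(z)
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([2.0])
a = sigmoid(z)

correct = sigmoid_prime(z)   # what forward() should store as dZ
wrong = sigmoid_prime(a)     # the "loopback" mistake: derivative evaluated at a

# central finite difference as an independent reference for d(sigmoid)/dz
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(correct, wrong, numeric)
```

Only the value computed from z agrees with the numerical estimate.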
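    def backward(self, y, rightLayer):
        if self.isoutputLayer:
            error = self.A - y
            self.delta = np.atleast_2d(error * self.dZ)
        else:  # hidden layer: pull delta back through the right layer's weights
            self.delta = np.atleast_2d(
                rightLayer.delta.dot(rightLayer.weight.T) * self.dZ)
        return self.delta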
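What the backward function does is compute and return the delta, based on the formula:
output layer: 𝚫 = (A - y) * f’(z)
hidden layer: 𝚫 = (𝚫_right · W_rightᵀ) * f’(z)
where * is the element-wise product, and 𝚫_right and W_right are the delta and the weights of rightLayer.
Finally, the update function uses gradient descent to update the weights of the current layer.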
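    def update(self, learning_rate, left_a):
        # left_a is the activation of the layer to the left (this layer's input)
        a = np.atleast_2d(left_a)
        d = np.atleast_2d(self.delta)
        # outer product: gradient of the cost with respect to the weights
        ad = a.T.dot(d)
        self.weight -= learning_rate * ad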
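One hedge against the traps described in this article is a numerical gradient check. The sketch below is not part of the original code. It assumes a single tanh output layer and a cost of half the squared error, for which error * dZ is the delta, and compares the a.T.dot(d) product used in update against central finite differences. The names cost, analytic_grad and numeric_grad are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 2 * rng.random((3, 2)) - 1      # same init range as the Layer class
x = np.array([1.0, 0.5, -0.3])      # first entry plays the role of the bias unit
y = np.array([0.2, -0.4])

def cost(W):
    # assumed cost: half the squared error of a tanh layer
    A = np.tanh(np.dot(x, W))
    return 0.5 * np.sum((A - y) ** 2)

# analytic gradient, computed the way backward() and update() do it
z = np.dot(x, W)
A = np.tanh(z)
dZ = 1.0 - np.tanh(z) ** 2
delta = np.atleast_2d((A - y) * dZ)
analytic_grad = np.atleast_2d(x).T.dot(delta)

# numerical gradient by central differences, one weight at a time
eps = 1e-6
numeric_grad = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        step = np.zeros_like(W)
        step[r, c] = eps
        numeric_grad[r, c] = (cost(W + step) - cost(W - step)) / (2 * eps)

max_diff = np.max(np.abs(analytic_grad - numeric_grad))
print(max_diff)  # should be tiny if the two gradients agree
```

If the derivative were evaluated at A instead of z, the two gradients would no longer match, which makes this kind of check a quick way to catch that mistake.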
As one might guess, layers form a network, so the class NeuralNetwork is used to organize and coordinate the layers.
Its constructor takes the configuration of the layers: an array whose length determines the number of layers in the network, and whose elements give the number of nodes in each corresponding layer.
For example, [2, 4, 5, 1] means that the network has 4 layers: the input layer has 2 nodes, the two hidden layers have 4 and 5 nodes respectively, and the output layer has 1 node. The second parameter is the type of activation function to use for all layers.
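To make the configuration concrete, the following sketch lists the weight-matrix shapes produced by a configuration with 2 inputs, hidden layers of 4 and 5 nodes, and 1 output. It follows the +1 bias arithmetic used in the constructor shown further down. weight_shapes is a hypothetical helper, not part of the original code.

```python
def weight_shapes(layers_dim):
    """Weight-matrix shape for each Layer object (the input layer has none)."""
    shapes = []
    # hidden layers: one extra row for the incoming bias unit,
    # one extra column for the extra unit passed on to the next layer
    for i in range(1, len(layers_dim) - 1):
        shapes.append((layers_dim[i - 1] + 1, layers_dim[i] + 1))
    # output layer: extra row only, no extra column
    shapes.append((layers_dim[-2] + 1, layers_dim[-1]))
    return shapes

print(weight_shapes([2, 4, 5, 1]))  # [(3, 5), (5, 6), (6, 1)]
```

Three Layer objects are created for four configured layers, since the input layer is only a vector.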
The fit function is where all the training happens. It starts by selecting one input sample at random and computes the forward pass over all the layers. It then computes the error between the output of the network and the target value, and propagates this error through the network by calling the backward function of each layer in reverse order, from the last layer to the first.
Finally, the update function is called for each layer to update the weights.
These steps are repeated a number of times determined by the parameter epochs.
After the training is complete, the predict function can be called to test an input. The predict function is simply a feed forward through the whole network.
class NeuralNetwork:
    def __init__(self, layersDim, activation='tanh'):
        # sigmoid, tanh, relu and their *_prime derivatives are assumed
        # to be defined elsewhere as functions of z
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime
        elif activation == 'relu':
            self.activation = relu
            self.activation_prime = relu_prime
        else:
            raise ValueError('unknown activation: ' + activation)

        self.layers = []
        # hidden layers: +1 on both sides accounts for the bias unit
        for i in range(1, len(layersDim) - 1):
            dim = (layersDim[i - 1] + 1, layersDim[i] + 1)
            self.layers.append(Layer(dim, i, self.activation, self.activation_prime))
        # output layer: bias on the input side only
        dim = (layersDim[-2] + 1, layersDim[-1])
        self.layers.append(Layer(dim, len(layersDim) - 1, self.activation, self.activation_prime, True))

    # train the network
    def fit(self, X, y, learning_rate=0.1, epochs=10000):
        # Add column of ones to X
        # This is to add the bias unit to the input layer
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)

        for k in range(epochs):
            # pick one random sample
            i = np.random.randint(X.shape[0])
            a = X[i]

            # compute the feed forward
            for l in range(len(self.layers)):
                a = self.layers[l].forward(a)

            # compute the backward propagation
            delta = self.layers[-1].backward(y[i], None)
            for l in range(len(self.layers) - 2, -1, -1):
                delta = self.layers[l].backward(delta, self.layers[l+1])

            # update weights; each layer needs the activation of the layer to its left
            a = X[i]
            for layer in self.layers:
                layer.update(learning_rate, a)
                a = layer.A

    # predict input
    def predict(self, x):
        # prepend the bias unit, then feed forward through all layers
        a = np.concatenate((np.ones(1).T, np.array(x)), axis=0)
        for l in range(0, len(self.layers)):
            a = self.layers[l].forward(a)
        return a
Running The Network
To run the network we take as an example the approximation of the XOR function.
We try several network configurations, using different learning rates and numbers of epochs.
Results are listed below:
Result with tanh
[0 0] [-0.00011187]
[0 1] [ 0.98090146]
[1 0] [ 0.97569382]
[1 1] [ 0.00128179]
Result with sigmoid
[0 0] [ 0.01958287]
[0 1] [ 0.96476513]
[1 0] [ 0.97699611]
[1 1] [ 0.05132127]
Result with relu
[0 0] [ 0.]
[0 1] [ 1.]
[1 0] [ 1.]
[1 1] [ 4.23272528e-16]
It is advisable to try different configurations and see for yourself which one gives the best and most stable results.
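One simple way to compare configurations is to threshold each output at 0.5 and count how many rows match the XOR truth table. The sketch below does this for the three result sets listed above. matches_xor is a hypothetical helper, and the 0.5 threshold is an assumption rather than something the original code uses.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_target = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

# outputs copied from the results listed in this article
results = {
    'tanh':    [-0.00011187, 0.98090146, 0.97569382, 0.00128179],
    'sigmoid': [0.01958287, 0.96476513, 0.97699611, 0.05132127],
    'relu':    [0.0, 1.0, 1.0, 4.23272528e-16],
}

def matches_xor(outputs, threshold=0.5):
    # number of rows whose thresholded output equals the XOR target
    predicted = (np.asarray(outputs) > threshold).astype(int)
    return int(np.sum(predicted == xor_target))

for name, outputs in results.items():
    print(name, matches_xor(outputs), 'of', len(xor_target))
```

With a trained network, the same helper could be fed the values returned by predict for each row of X.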
The Source Code
The full code can be downloaded here.
Back propagation can be confusing and tricky to implement. You might have the illusion that you have grasped it through the theory, but the truth is that when implementing it, it is easy to fall into many traps. You should be patient and persistent, as back propagation is a cornerstone of neural networks.