Implementing A Simple Artificial Neural Network from Scratch in Python
Unveiling the Math and Logic Behind It.
What’s a Neural Network?
In layman's terms, a neural network is just a mathematical function: you feed in a vector of values, those values are transformed by the parameters inside the function, and a value (or vector of values) comes out as the output.
Now, getting back to the world of data science: a neural network mimics the structure of the human brain. It consists of simple but highly interconnected nodes, called neurons, organized in layers that process information received from external inputs and, through dynamic learning, send out the desired outputs. So, basically, we have a set of inputs and a set of target values, and we try to predict outputs that match those targets as closely as possible.
A neural network's architecture consists of:
- An input layer, x
- An arbitrary number of hidden layers
- A set of weights, W, and biases, b, between the layers
- A choice of activation function, 𝞼, for each hidden layer
- An output layer, ŷ
Let's Get Our Hands Dirty with Math
I'm assuming you know the fancy terminology behind neural networks and have a little knowledge of calculus.
Let's start implementing a simple 2-layer artificial neural network.
Let the input be X = [x1, x2] = [0.1, 0.3] and the target be Y = [1]. The activation function for the hidden layer is ReLU, and for the output layer it is Sigmoid.
Okay, it’s time to train our Neural Network:
Step 1: Initialize W and b as random values.
[[w1,w2],[w3,w4]] = [[-0.1,0.2],[0.2,-0.3]]
[b11,b12] = [0,0]
[[w5],[w6]] = [[0.3],[-0.1]]
[b21] = [0]
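In code, this initialization might look like the following minimal sketch (NumPy and the variable names are my choice, not part of the original walk-through). Note that for the matrix product W·X, row i of the weight matrix must hold the weights feeding neuron i, i.e. [w1, w3] and [w2, w4]; with these particular numbers that is the same matrix as written above. In practice the weights would be drawn randomly rather than hard-coded.

```python
import numpy as np

# Parameters from the worked example.
# Row i of W1 holds the weights into hidden neuron i:
# z1 uses [w1, w3], z2 uses [w2, w4].
W1 = np.array([[-0.1, 0.2],    # [w1, w3]
               [ 0.2, -0.3]])  # [w2, w4]
b1 = np.array([0.0, 0.0])      # [b11, b12]
W2 = np.array([[0.3, -0.1]])   # [[w5, w6]]
b2 = np.array([0.0])           # [b21]

X = np.array([0.1, 0.3])       # inputs [x1, x2]
y = 1.0                        # target
```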
Step 2: Feed Forward Propagation
Z = W.X + b
A = activation_function(Z)
For the 1st layer,
z1 = w1*x1 + w3*x2 + b11 = -0.1*0.1 + 0.2*0.3 + 0 = 0.05
a1 = ReLU(0.05) = max(0,0.05) = 0.05
z2 = w2*x1 + w4*x2 + b12 = 0.2*0.1 + -0.3*0.3 + 0 = -0.07
a2 = ReLU(-0.07) = max(0,-0.07) = 0.0
a1, a2 are the inputs to the 2nd layer.
For the 2nd layer,
z3 = w5*a1 + w6*a2 + b21 = 0.3*0.05 + -0.1*0.0 + 0 = 0.015
a3 = Sigmoid(0.015) = 1/(1+e^(-z3)) = 0.504
ŷ = a3
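As a minimal NumPy sketch (continuing the variables above; the function names are illustrative), the forward pass reproduces these numbers:

```python
def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z_hidden = W1 @ x + b1      # [z1, z2] = [0.05, -0.07]
    a_hidden = relu(z_hidden)   # [a1, a2] = [0.05, 0.0]
    z_out = W2 @ a_hidden + b2  # [z3] = [0.015]
    y_hat = sigmoid(z_out)      # [a3] ≈ [0.504]
    return z_hidden, a_hidden, z_out, y_hat
```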
Step 3: Compute Error
The cost function we use is binary cross-entropy. The general cross-entropy loss is
E = -Σ (i=1 to C) yi log(ŷi)
where C is the number of classes, y is the target value, and ŷ is the predicted value. Since our classification is binary, C = 2 and the cross-entropy function becomes:
Error E = -y log(ŷ) - (1-y) log(1-ŷ)
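A direct translation into code (the clipping constant eps is my addition to guard against log(0); it is not part of the formula):

```python
def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```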
Step 4: Backward Propagation
For the Output Layer,
dE/dw5 = dE/da3 * da3/dz3 * dz3/dw5 →eq.1
dE/da3 = dE/dŷ = d(-y log(ŷ) - (1-y) log(1-ŷ))/dŷ
= -(y/ŷ) + ((1-y)/(1-ŷ))
= -(1/0.504) + 0 = -1.985 →eq.2 (the second term vanishes because y = 1)
da3/dz3 = d(1/(1+e^(-z3))) / dz3 = e^(-z3) / (1+e^(-z3))²
= 0.249 →eq.3
dz3/dw5 = d(w5*a1 + w6*a2 + b21)/dw5 = a1 = 0.05 →eq.4
dE/dw5 = 𝚫w5 = -1.985 * 0.249 * 0.05 = -0.0247 (from eq.1)
Similarly, for w6 and b21:
dE/dw6 = dE/da3 * da3/dz3 * dz3/dw6 →eq.5
dz3/dw6 = d(w5*a1 + w6*a2 + b21)/dw6 = a2 = 0.0 →eq.6
dE/dw6 = 𝚫w6 = -1.985 * 0.249 * 0.0 = 0.0 (from eq.5, with eq.2, eq.3, eq.6)
dE/db21 = dE/da3 * da3/dz3 * dz3/db21
dz3/db21 = d(w5*a1 + w6*a2 + b21)/db21 = 1
dE/db21 = 𝚫b21 = -1.985 * 0.249 * 1 = -0.4942
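All three gradients share the factor dE/da3 * da3/dz3 = dE/dz3, so an implementation can compute it once and reuse it. A sketch under the same assumptions as before (note that e^(-z3)/(1+e^(-z3))² simplifies to ŷ(1-ŷ), and full precision gives about -0.496 where the rounded arithmetic above gives -0.4942):

```python
def output_layer_grads(y, y_hat, a_hidden):
    dE_da3 = -(y / y_hat) + (1.0 - y) / (1.0 - y_hat)  # eq.2
    da3_dz3 = y_hat * (1.0 - y_hat)                    # eq.3, since y_hat = sigmoid(z3)
    delta_out = dE_da3 * da3_dz3                       # shared factor dE/dz3
    dW2 = np.outer(delta_out, a_hidden)                # [dE/dw5, dE/dw6] (eq.1, eq.5)
    db2 = delta_out                                    # dE/db21
    return delta_out, dW2, db2
```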
For the Hidden Layer,
dE/dw1 = dE/da1 * da1/dz1 * dz1/dw1 →eq.7
dE/da1 = dE/dz3 * dz3/da1 = (dE/da3 * da3/dz3) * dz3/da1 →eq.8
(If there are more nodes in the output layer, then the error propagated to a node in the preceding layer must be summed over all present-layer nodes connected to it. For example, if there are two nodes in the output layer, with errors and outputs E1, ŷ1 and E2, ŷ2 respectively, then the error propagated to node z1 is dE/da1 = dE1/da1 + dE2/da1.)
dz3/da1 = d(w5*a1 + w6*a2 + b21)/da1 = w5 = 0.3 →eq.9
dE/da1 = -1.985*0.249*0.3 = -0.1482 →eq.10 (from eq.2, eq.3, eq.9)
da1/dz1 = d(max(0,z1))/dz1 = 1.0 (since z1 > 0) →eq.11
dz1/dw1 = d(w1*x1 + w3*x2 + b11)/dw1 = x1 = 0.1 →eq.12
dE/dw1 = 𝚫w1 = -0.1482 * 1.0 * 0.1 = -0.01482 (from eq.7)
dE/dw2 = dE/da2 * da2/dz2 * dz2/dw2 →eq.13
dE/da2 = dE/dz3 * dz3/da2 = (dE/da3 * da3/dz3) * dz3/da2
dz3/da2 = d(w5*a1 + w6*a2 + b21)/da2 = w6 = -0.1 →eq.14
dE/da2 = -1.985 * 0.249 * -0.1 = 0.0494 →eq.15 (from eq.2, eq.3, eq.14)
da2/dz2 = d(max(0,z2))/dz2 = 0.0 (since z2 < 0) →eq.16
dz2/dw2 = d(w2*x1 + w4*x2 + b12)/dw2 = x1 = 0.1 →eq.17
dE/dw2 = 𝚫w2 = 0.0494 * 0.0 * 0.1 = 0.0 (from eq.13)
We can calculate similarly for w3 and w4 (here dz1/dw3 = x2 and dz2/dw4 = x2):
dE/dw3 = 𝚫w3 = -0.1482 * 1.0 * 0.3 = -0.04446
dE/dw4 = 𝚫w4 = 0.0494 * 0.0 * 0.3 = 0.0
For b11 and b12:
dE/db11 = dE/da1 * da1/dz1 * dz1/db11
dz1/db11 = d(w1*x1 + w3*x2 + b11)/db11 = 1
dE/db11 = 𝚫b11 = -0.1482 * 1.0 * 1 = -0.1482 (from eq.10, eq.11)
dE/db12 = dE/da2 * da2/dz2 * dz2/db12
dz2/db12 = d(w2*x1 + w4*x2 + b12)/db12 = 1
dE/db12 = 𝚫b12 = 0.0494 * 0.0 * 1 = 0.0 (from eq.15, eq.16)
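The hidden-layer gradients follow the same pattern: propagate delta_out back through w5 and w6 (eq.9, eq.14), gate it with the ReLU derivative (eq.11, eq.16), and multiply by the inputs (eq.12, eq.17). A sketch, again with illustrative names:

```python
def hidden_layer_grads(x, z_hidden, W2, delta_out):
    dE_da = W2.T @ delta_out              # eq.9, eq.14: back through [w5, w6]
    drelu = (z_hidden > 0).astype(float)  # eq.11, eq.16: ReLU derivative
    delta_hidden = dE_da * drelu          # [dE/dz1, dE/dz2]
    dW1 = np.outer(delta_hidden, x)       # [[dE/dw1, dE/dw3], [dE/dw2, dE/dw4]]
    db1 = delta_hidden                    # [dE/db11, dE/db12]
    return dW1, db1
```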
Step 5: Update parameters W and b
Let learning rate, 𝞮 = 0.01
W = W - 𝞮(𝚫W)
b = b - 𝞮(𝚫b)
w1 = w1 - 𝞮(𝚫w1) = -0.1 - 0.01*(-0.01482) = -0.0999
w2 = w2 - 𝞮(𝚫w2) = 0.2 - 0.01*(0) = 0.2
w3 = w3 - 𝞮(𝚫w3) = 0.2 - 0.01*(-0.04446) = 0.2004
w4 = w4 - 𝞮(𝚫w4) = -0.3 - 0.01*(0) = -0.3
w5 = w5 - 𝞮(𝚫w5) = 0.3 - 0.01*(-0.0247) = 0.3002
w6 = w6 - 𝞮(𝚫w6) = -0.1 - 0.01*(0) = -0.1
b11 = b11 - 𝞮(𝚫b11) = 0 - 0.01*(-0.1482) = 0.001482
b12 = b12 - 𝞮(𝚫b12) = 0 - 0.01*(0) = 0
b21 = b21 - 𝞮(𝚫b21) = 0 - 0.01*(-0.4942) = 0.004942
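In code, the whole update step is one subtraction per parameter array (using the gradients returned by the two sketch functions above):

```python
lr = 0.01  # learning rate 𝞮
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
```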
Now, repeat the process from Step 2 until the error is minimized.
After training for enough iterations, you will have a 2-layer neural network model for a binary classification task, built without any deep-learning libraries. Now you can start making predictions.
Assembling code chunks
Let’s put all our code together:
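Below is one way to do that: a minimal, self-contained NumPy sketch assembling the steps above into a small class (the class and method names are my own choices). It hard-codes the initial parameters from Step 1, so the first iteration reproduces the numbers in the walk-through.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

class SimpleNN:
    """2-2-1 network: ReLU hidden layer, sigmoid output, binary cross-entropy loss."""

    def __init__(self):
        # Step 1: the initial values from the worked example
        self.W1 = np.array([[-0.1, 0.2], [0.2, -0.3]])  # [[w1, w3], [w2, w4]]
        self.b1 = np.zeros(2)                           # [b11, b12]
        self.W2 = np.array([[0.3, -0.1]])               # [[w5, w6]]
        self.b2 = np.zeros(1)                           # [b21]

    def forward(self, x):
        # Step 2: feed-forward propagation
        self.z1 = self.W1 @ x + self.b1        # hidden pre-activations [z1, z2]
        self.a1 = relu(self.z1)                # hidden activations [a1, a2]
        self.z2 = self.W2 @ self.a1 + self.b2  # output pre-activation [z3]
        self.y_hat = sigmoid(self.z2)          # prediction ŷ
        return self.y_hat

    def backward(self, x, y):
        # Step 4: back-propagate the error with the chain rule
        dE_da3 = -(y / self.y_hat) + (1.0 - y) / (1.0 - self.y_hat)
        delta_out = dE_da3 * self.y_hat * (1.0 - self.y_hat)    # dE/dz3
        dW2 = np.outer(delta_out, self.a1)                      # [dE/dw5, dE/dw6]
        db2 = delta_out                                         # dE/db21
        delta_hidden = (self.W2.T @ delta_out) * (self.z1 > 0)  # [dE/dz1, dE/dz2]
        dW1 = np.outer(delta_hidden, x)
        db1 = delta_hidden
        return dW1, db1, dW2, db2

    def train(self, x, y, lr=0.01, epochs=5000):
        for _ in range(epochs):
            self.forward(x)
            dW1, db1, dW2, db2 = self.backward(x, y)
            # Step 5: gradient-descent update
            self.W1 -= lr * dW1
            self.b1 -= lr * db1
            self.W2 -= lr * dW2
            self.b2 -= lr * db2

net = SimpleNN()
x = np.array([0.1, 0.3])
net.train(x, y=1.0)
print(net.forward(x))  # the prediction moves toward the target, 1.0
```

After training, the same forward pass serves as the prediction function; with this single training example, the output climbs from about 0.504 toward the target of 1.0.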
But Why From Scratch?
There are many libraries in Python that let you create a neural network without getting your hands dirty. Still, it's worth having an intuition for how neural networks work and the logic behind them; that understanding is essential for designing effective models.