Deep Learning from scratch

Source: Deep Learning on Medium

Deep Learning from scratch

This is a quick tutorial to show you all the underlining mystery of a DL framework. It introduces basic theory that you can further extend to make a real use case while not getting too much details into the software engineering complexity of how the DL framework is implemented.

Although the goal is to do something end to end without external dependencies, we will use numpy as the tensor computation library. An introduction on how to implement numpy from scratch could be another good topic but that’s not the focus of this post.

A bit theorm that you can skip

Deep learning is nothing but a series of complex tensor operations. Tensor is a linear algebra object that holds numeric values. Tensor could have different dimensions. Scalar is Tensor with 0 dimension, and Vector has dimension 1, while Matrix has dimension 2. Tensor could have different data type (dtype in some framworks), such as float, int, etc. All elements in the Tensor are having the same data type. Different data type has different precisions. In gerneral, as precisions goes up, the required computation power goes with it.

Deep learning network could be thought of as a numeric function, which takes in an input Tensor (often called x) and returns another Tensor (often called prediction or y). A DL network is defined by its instrinstic properties: the network structure and the weights.

When the network structure is given, training a deep learning network means simply finding the network weights for a criterion. AutoML will be the tool to look for if the network structure also needs to be found. We will not cover AutoML here.

If the criterion can be expressed numerically, DL training is to find the weights to minimize or maximize the criterion.

In this example, we want our neural network to do some linear regression and want to minimize the different between the predicted results and the golden results (often called label).

More formally, given 𝑓(𝑥;𝜃) where f is the network structure, x is the input to the model and 𝜃 represents the model weights, we want to find the best 𝜃 so that 𝑙𝑜𝑠𝑠(𝑦,𝑦̂ ) is minimized where y is the predicted result and 𝑦̂ is the label (expected outcome) for giving x.

Note that in this tutorial we limit our scope so the loss function only returns a scalar, but in multi-task learning, the loss function could have higher dimensions or there could be multiple loss functions.

Gradient descent

The most common way of training a deep learning network is using gradient descent. Consider a multi-variable function 𝑓, the gradient of 𝑓 at a point 𝑥 (i.e. ∇𝑓(𝑥)) is a vector showing the direction in which the function rises most quickly.

A tiny movement towards the opposite direction will cause a decrease, i.e.


The value 𝜆 (often called 𝑙𝑟, learning rate) is a positive scalar and is chosen so that 𝑓 is monotonic between 𝑥 and 𝑥−𝜆∇𝑓(𝑥)

For a single variable function, the can be think of as finding a side which causes 𝑓(𝑥) to decrease.

In our case, we’re interested in finding 𝜃 A simple optimization strategy (simple version of SGD) is to make


Adam is another optimization strategy that worth looking into.

To make it easy to type, let’s use w for 𝜃 and dw for ∇𝜃𝑙𝑜𝑠𝑠

for x, y_label in dataset:
y_pred = f(x, w)
loss = criterion(y_label, y_pred)
w -= lr * dw

Note that we often do computation in batches, so

x.shape = (batch, input_features)
y_{label,pred}.shape = (batch, output_features)

Chain rule

Let’s say the network we’re working on is a sequential network, which means that 𝑓 could be composed of multiple sub-functions (sub-layers), i.e.


The chain rule is specially complex and sometimes confusing since the variables are dimensional. How do we calculate them in practise?

Let’s start from loss function. Say we’re using MSE (mean-square-error), then

Simple, isn’t it?

Let’s have a look at a more complicated example:


What would be 𝑑𝑤, 𝑑𝑥 and 𝑑𝑏 ?

Easy to spot that 𝑑𝑏=𝑑𝑧

𝑑𝑤 is more or less the same

With this, we should be able to do train a really simple neural network, here is the draft code:

def Dense(x, w, b):
# (ny, nx) * (b, nx, 1) = (b, ny, 1)
x = np.expand_dims(x, -1)
matmul = np.matmul(w, x)
matmul = np.squeeze(matmul, -1)
return matmul + b

def dDense(dy, y, x, w, b):
db = dy
dw = np.matmul(np.expand_dims(dy, -1), np.expand_dims(x, -2))
dx = np.matmul(np.moveaxis(w, -1, -2), np.expand_dims(y, -1))
dx = np.squeeze(dx, -1)
# Mean over batch dim
return dx, np.mean(dw, 0), np.mean(db, 0)

def Relu(x):
return x * (x>0)

def dRelu(dy, y, x):
return dy * (y>0)

def MSE(y_pred, y_label):
return np.mean(np.square(y_pred - y_label))

# Actually it's dy, y, x, y_label
def dMSE(dy, y, y_pred, y_label):
# Do you know why we can throw away the constant 2/N ? Have a guess.
return y_pred - y_label

To start with, let’s first define some constants:

input_features = 10
output_features = 2
batch = 512
epoch = 10000

w_real = np.random.rand(output_features, input_features)
b_real = np.random.rand(output_features)

Then we could write the training loop as the following. Note that x is coming in batch, so we need to do reduce_mean in a few different places. Also a homework is left to you. Why do we need to throw away the constant in dMSE? Leave me a message in comment box below. Have fun!

w = np.random.rand(*w_real.shape)
b = np.zeros(*b_real.shape)

lr = 0.01

step = 0
while step < epoch:

x = np.random.rand(batch, input_features)
y_expected = Dense(x, w_real, b_real)

z1 = Dense(x, w, b)
z2 = Relu(z1)
loss = MSE(z2, y_expected)

if step % 100 == 0:
print('loss = {}'.format(loss))
dloss = 1
dz2 = dMSE(dloss, loss, z2, y_expected)
dz1 = dRelu(dz2, z2, z1)
dx, dw, db = dDense(dz1, z1, x, w, b)

b = b - lr * db
w = w - lr * dw

step += 1

print('loss = {}'.format(loss))
print('lr = {}'.format(lr))
print('calculated w = {}, b = {}'.format(w, b))

print('difference w-w_real={}'.format(np.abs(w-w_real)))
print('difference b-b_real={}'.format(np.abs(b-b_real)))