Deep Learning from Scratch

Original article was published by Boulevard Consulting on Deep Learning on Medium

Deep Learning from Scratch

Engineering neural networks from scratch with Python

Back in 2018, I first saw The Coding Train’s series on engineering neural networks from scratch with javascript. I felt it was extremely helpful and have since wanted to make my own hacky, from-scratch implementation with python.

As I go along, I’ll attempt to explain my thought process using layman’s terms and as little math as possible.

# exponential function
from math import exp
# a very simple iteration operation
from itertools import tee

# library for n-dimensional arrays... I had made my own classes for vectors and matrices but, without hardware-accelerated operations for dot products and transposition, they relied on very slow loops that proved to be pretty impractical.
import numpy as np
# some standard helper functions
## example datasets
from sklearn.datasets import make_circles, make_blobs
## to randomly split data
from sklearn.model_selection import train_test_split
## to shuffle the data randomly
from sklearn.utils import shuffle
## to score classification tasks
from sklearn.metrics import roc_auc_score, roc_curve, auc

# lastly, a nice plotting module for visualizing my rambles
import matplotlib.pyplot as plt
## a colormap functio
nfrom import get_cmap as cmap

I like to think of machine learning (specifically supervised learning tasks) as an automatic calculator. It is but a series of mathematical operations on some inputs to provide a targeted output. Machine learning models can be as simple as a linear regression, but sometimes more complex models are needed for more complicated issues. Neural networks are a complicated model that can mimic the functionality of linear regression, but also may also scale up to be far more complicated.

A common problem in machine learning is that of nonlinearity. That is, complex interactions among the inputs to a model. Below is a very simple example of what a nonlinear interaction looks like. The meaning of vertical axis in relation to the color is dependent on the horizontal axis. Simply put, there is no line that can separate the two colors. It has linear inseparability.

X, y = make_circles(n_samples=500, noise=0.1, factor=0.1)
plt.scatter(X.T[0], X.T[1], c=y)

Neural networks are extremely generalizable and are excellent at learning nonlinear interactions. This does come at a cost though. Neural networks can be quite computationally expensive to train and they are virtually uninterpretable. In other words, there is no easy way to tell how a neural net made a prediction.

Neural networks are surprisingly similar to linear regression models. I like to think of a “layer” of a neural networks as just many multiple regression models stacked on top one another, each with their own outputs. A neural network is then just a sequence of these layers. The neural network works by iterating over each layer, propagating the outputs of each of them to the next, until the final layer outputs its own final answers.

Before we can get to that, I’m just going to put together some miscellaneous general helper functions:

# this is a handy function to go over the pairs within an iterable object
def pairwise(iterable):
a, b = tee(iterable)
next(b, None)
return zip(a, b)

# for example...
for i, j in pairwise(range(1,6)):
print(f'{i} precedes {j}')
1 precedes 2
2 precedes 3
3 precedes 4
4 precedes 5

Now I need to make a basic activation function. These are used to transform the output of all node. While I am using the sigmoid for uniformity (the output of the final layer is to be constrained between 0 and 1 for my classification problem), the ReLU is much more common these days for many reasons.

The purpose of activation functions is to introduce nonlinearity. I think of activation functions, (especially ReLU), as being analogous to the use of options (financial derivatives): having the access to nonlinearity allows a creative mind — in this case, an artificial one — to design extremely complex, nonlinear outputs much in the same way multiple options can be used strategically to create a spread to hedge financial risks.

def sigmoid(x):
if x >= 0:
z = exp(-x)
return 1 / (1 + z)
z = exp(x)
return z / (1 + z)
# derivative of sigmoid function (needed for calculus in of back-propagation).
prime = lambda x: sigmoid(x) * (1 - sigmoid(x))

# vectorizing a function just means that I can now easily apply it element-wise to a numpy ndarray
sigmoid = np.vectorize(sigmoid)
prime = np.vectorize(prime)

x = np.arange(-10,10,0.1)

plt.plot(x, sigmoid(x))
plt.title("Sigmoid Function")

plt.plot(x, prime(x))
plt.title("Derivative of Sigmoid Function")

As I had mentioned above, the essential unit of a neural network is the layer. In my implementation layer holds two key parameters: biases and the weights.

In the same way a layer is just a stack of linear regression models, biases are a stack of “y-intercepts” for each model. In other words, I implement biases as a vector of β₀, one for each node in the layer.

Similarly, weights is simply a list of coefficients for all the inputs in each linear regression model. Because we have multiple linear models per layer, all of which will have the same number of betas (the number of inputs), weights can be implemented as a matrix!

All of the elements in both these objects are initialized to a standard, normal distribution.

We also have a “setting” (hyperparameter): the “activation function”. In the same way the output of a logistic regression model is applied to a sigmoid function, we will also apply the sigmoid function to the output of each node in out layer. This is a standard implementation that create additional nonlinearity. The sigmoid function is an arbitrary decision here, and there are far better ones with an easier mathematical implementation (namely ReLU!), but sigmoid is actually easier because it fits the desired range we want for classification (constrained between 0 and 1). For the purposes of correcting errors with calculus, we need to store the derivative of the activation function as well.

I’ve found that a really simple but effective Layer class is very important for the technical implementation of a neural network.

class Layer:

def __init__(self, ncol, nrow):

# number of inputs & outputs
self.shape = (nrow, ncol)

# values of the weights
self.weights = np.random.randn(nrow, ncol)

# values of biases
self.biases = np.random.randn(nrow, 1)

# placeholder for unactivated output of layer
self.outputs = np.ones((nrow, 1))

# activation function (currently fixed as sigmoid)
self.activator = np.vectorize(sigmoid)

# derivative of activation function
self.primer = np.vectorize(prime)

layer = Layer(3, 5)

print(f'\n weights: shape = {layer.weights.shape} \n {layer.weights}')
print(f'\n biases: shape = {layer.biases.shape} \n {layer.biases}')
print(f'\n sigmoid(3) = {layer.activator(3):.4}')

weights: shape = (5, 3)
[[ 0.58360291 -1.04477239 2.01744448]
[ 0.80362262 0.70918266 2.00654262]
[ 0.98185151 0.09150979 0.96289542]
[ 0.01239768 -0.12943097 0.41800989]
[-0.07103671 -0.39461366 0.9333896 ]]

biases: shape = (5, 1)
[[ 0.97393686]
[ 0.77734602]
[ 0.81846104]]

sigmoid(3) = 0.9526

The first and last layers of a neural network is always called the input layer and output layer, respectively. All layers in between are called hidden layers as they are beyond our interpretation. As the information flows through them, they exist in latent space.

The final step is actually stitching these layers together with the ability to communicate with one another. Layers need to communicate in two key ways: (1) sending forward through our hidden layers & (2) sending back blame to correct themselves. These two processes are called forward propagation and backward propagation, respectively.

Forward propagation is simple enough. For each hidden layer, there is our stack of linear regression models. Each linear regression model takes all outputs of the previous layer, multiplies them by its vector of weights, and then adds its bias and is passed through the layer’s activation function. This makes one of the potentially many outputs of that layer, all of which collectively then repeat the same process through the next layer until the output layer is reached. Forward propagation works just like a bunch of linear models hooked up together human-centipede style.

Backward propagation is the hard part. Once the final prediction is made from the output layer, that prediction is compared to the real answer. The difference between the two is the error for the final layer. Backward propagation is all about each linear regression model in each layer trying to figure out how responsible it is for this final error, and then, using some calculus, how responsible each of its respective weights (and bias!) is before then sending the aggregated blame back to the previous layer. In layman’s terms, backward propagation is just each layer learning from their own mistakes before telling the previous layer how to learn of their own.

class NeuralNetwork:

def __init__(self, *shape, epochs = 50, lr = 0.001):

# stores number of nodes at each layer
self.shape = shape

# placeholder to store pairs of layers
self.layers = []

# static hyperparameters
self.epochs = epochs = lr

# create layers as pairs of inputs
for inputs, outputs in pairwise(shape):
self.layers.append(Layer(inputs, outputs))

# placeholder to store scores for each epoch
self.scores = []

def forward(self, x):

# reshape input to fit succeeding layers
tensor = np.atleast_2d(x).transpose()

# for each layer
for layer in self.layers:

# multiply inputs by their weights and add their bias
tensor = layer.outputs = + layer.biases

# activate output
tensor = layer.activator(tensor)

# return final activated output
return tensor

def backward(self, error):

# error is already initialized for the final output layer!

# for each layer in reverse order
for index, layer in reversed(list(enumerate(self.layers))):

# find preceding layer
predecessor = self.layers[index - 1]

# define delta as the partial derivative of layers input with respect to its weights
delta = error @ predecessor.activator(predecessor.outputs).transpose()

# change weights by their assigned blame dampened by the learning rate
layer.weights -= delta *

# change biases by their subsequent node's blame dampened by the learning rate
layer.biases -= error *

# update error as the blame for preceding layer
error = layer.weights.transpose().dot(error) * predecessor.primer(predecessor.outputs)

def predict(self, X):

# predict based on each observation in X
predictions = [self.forward(obs) for obs in X]

# return array of predictions
return np.array(predictions).squeeze()

def score(self, inputs, answer, verbose=True):

# find predictions of all inputs
prediction = self.predict(inputs)

# find comparison of predictions with the actual answers
auc = roc_auc_score(answer, prediction)

if verbose:
print(f'auc = [{auc:.1%}].')

return auc

def fit(self, X, y, target = None):

self.decision_boundary = np.average(y)

# for each epoch
for epoch in range(self.epochs):

# shuffle data
X, y = shuffle(X, y)

# append score
score = self.score(X, y, verbose=False)

if target is not None:
if score > target:

# for each record in shuffled data
for inputs, answer in zip(X, y):

# predict answer
prediction = self.forward(inputs)

# find error of output layer
error = prediction - answer

# propagate that backwards

def plot_errors(self, X, y):

# this guy plots the first two dimensions of a dataset in a scatterplot
# each point is colored green if it was correctly predicted, otherwise it is printed red

predictions = self.predict(X) > self.decision_boundary
incorrect = np.logical_xor(predictions, y)
plt.scatter(X.T[0][incorrect], X.T[1][incorrect], c='red')
plt.scatter(X.T[0][~incorrect], X.T[1][~incorrect], c='green')

def plot_scores(self):

# plots the AUC for each epoch

x = list(range(len(self.scores)))
y = self.scores
plt.plot(x, y)
plt.ylim(top=1.0, bottom=0.0)
plt.title("In-sample AUC score prior to each epoch")

def graph(self):

colormap = cmap('RdYlGn', 2)

for index, layer in enumerate(self.layers):
# initialize set of nodes
nodes = set()
output_size, feed_size = layer.shape
for output_index, output in enumerate(range(output_size)):

# arrows for output layer
if index == len(self.layers) - 1:
plt.arrow(index + 1.1, output - output_size/2, 0.05, 0, head_width = 0.4,
head_length = 0.05, color="steelblue", length_includes_head =True)

# add all output nodes (if unique!)
nodes.add((index + 1, output - output_size/2))

# plot weights between biases and hidden/output layers
height = max(self.shape)/2
value = layer.biases[output_index, 0]
plt.plot([index, index+1], [height, output - output_size/2],
c = colormap(value > 0), linewidth = min(8, abs(value)+0.25))

# plot biases
plt.scatter(index, height, s=150, c='white', edgecolors='black', zorder = 10)
plt.text(index, height, s='1', zorder = 100,
horizontalalignment='center', verticalalignment='center')

for feed_index, feeder in enumerate(range(feed_size)):

# arrows for input layer
if index == 0:
plt.arrow(index - 0.15, feeder - feed_size/2, 0.05, 0, head_width = 0.4,
head_length = 0.05, color="steelblue", length_includes_head =True)

# add all input nodes (if unique!)
nodes.add((index, feeder - feed_size/2))

# plot weights between input and output nodes
value = layer.weights[output_index, feed_index]
plt.plot([index, index+1], [feeder - feed_size/2, output - output_size/2],
c = colormap(value > 0), linewidth = min(8, abs(value)+0.25))

# plot set of nodes
for node in nodes:
plt.scatter(*node, s=150, c='black', zorder = 10)

plt.xlim(xmin=-0.2, xmax=len(self.layers) + 0.3)
plt.ylim(ymin= -(max(self.shape)/2) - 1)

Now that the hard part is done, let’s check it out.

First, let’s build a neural network with only one hidden layer.

nn = NeuralNetwork(2,3,1, epochs=50)

Let’s get some (linear) data to play with. This should be easily solvable with the equivalent of a logistic regression model. That being said, because linear models don’t have to worry about nonlinear interactions, they train significantly faster than neural networks with approximately equal performance. Since we’re using a neural network prepared for nonlinear relationships, we still are running over the same data multiple times. This is very computationally inefficient and won’t perform significantly better. It’s good to confirm that my implementation will be able to solve it though.

X, y = make_blobs(n_samples=1000, centers=2, cluster_std=3)

plt.scatter(X.T[0], X.T[1], c=y)

X_train, X_test, y_train, y_test = train_test_split(X, y)

The data here looks easy enough to solve. It has linear separability., y_train, target=0.99)
nn.plot_errors(X_test, y_test)

The neural network learned to use its degrees of freedom to place a decision boundary between the two clusters.

With our test data, the neural net was able to predict the colors of almost all the points. Any mistakes it made are now colored in red.

An AUC of 100% is perfect. Ours is almost there, but still made a few mistakes. But that is to be expected — some of the colors overlapped.

Now let’s look back at our nonlinear problem.

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.1)
plt.scatter(X.T[0], X.T[1], c=y)

X_train, X_test, y_train, y_test = train_test_split(X, y)

We are using this data generator for the simplicity of visualizing its two dimensions — in practice, data hardly looks like this. Furthermore, because the rings are centered at (0,0), we could simply use some feature engineering to make a very simple third variable equivalent to the euclidean distance of any point to center. This would give a single variable perfectly able to explain the color.

X₁ = √(x1² + X₂²)

Regardless, without such feature engineering, there is no way to place a line anywhere to divide the two colors using a linear model. A “logistic regression” model, (a neural network without any hidden layers) is doomed to fail here.

nn = NeuralNetwork(2, 1, epochs = 20)

If we use a neural network without any hidden layers, we won’t be able to pick up on the nonlinear interactions. It tries to place a line between all of the data and will inevitably fail., y_train)
nn.plot_errors(X_test, y_test)

An AUC of 50% is equivalent to guessing randomly. This model offers no value.

Lets use that exact same data but with a more complex neural network. This neural network has enough degrees of freedom to make latent variables on its own. This means it can learn that the problem is nonlinear and will, of its own accord, learn how to place some nonlinear decision boundary between the colors.

nn = NeuralNetwork(2, 9, 6, 1, epochs=500)
nn.graph(), y_train, target=0.975)
nn.plot_errors(X_test, y_test)

The model is significantly better than guessing randomly, but still misses a section of the outer ring. Looks like it formed some sort of slanted parabola around the center, but I cut it off before it had time to tie the parabola into a circle around the center. Based on the high AUC I get the impression it was pretty unsure about those mistakes too.

Nice, so the neural network almost learned how to solve this very simple problem.

Here are but a few of the many shortcomings of my implementation:

  1. It doesn’t use batch learning. That is, it is running backwards propagation for each record it sees. This is very inefficient and computationally expensive. It is standard to aggregate errors before backpropagating them to save time.
  2. It is dependent on loops and is not able to be compiled to run on GPUs. This is perhaps one of the biggest drawbacks and will slow down computation time by well over 100x… It is a limitation of using Python to write this instead of a lower-level language such as C or Rust. Since neural networks use matrix multiplication for forwards and backwards propagation, they are able to run extremely fast if accelerated.
  3. My learning rate is static — this will lead to diminishing increases in accuracy as the neural network approaches its parameters reach their optimal states.

Libraries such as Torch and Tensorflow are built to provide all of this functionality, including a ton more, all of which is way more computationally efficient, accurate and reliable with just a few lines of code. That being said, I feel as though recreating their most basic functions helped me learn alot about what goes on underneath the hood.

About the Author

Grantham Taylor is a Consultant at Boulevard with a focus in data science. He recently graduated with a Masters in Finance and Quantitative Management from Duke University. His interests focus on deep learning, machine vision, natural language processing, non-linear optimization, and anomaly detection.

About Boulevard

The Boulevard Consulting Group, a Native American-owned, Small Business Administration (SBA) 8(a) small business, is a modern management and technology consulting firm that helps clients leverage the power of data science, operational transformation, and cutting-edge technology to build a better future. We develop highly scalable, impactful solutions by combining the latest technology trends with time-tested improvement methodologies, and with the personalized attention of a smaller firm. For more information about Boulevard and its services, visit our website at