Neural Networks: For beginners. By beginners.

Source: Deep Learning on Medium

Hold up! Why should you read an article written by a beginner? The answer is simple: I decided to write an article about neural networks in language simple enough that even a beginner like me can understand it, while still being thorough enough to help somebody get a good grasp of this enormous subject.


You need to have basic knowledge in:

  • Linear Algebra
  • Python
  • NumPy

No need for you to excel in these, but it will be much easier if you have used them before.


I have put every snippet of code that you need throughout the article, but if you want to have the whole piece by your side here is the Jupyter Notebook:


So, neural nets. They are the first thing that comes to mind for most programmers when they hear the buzzwords artificial intelligence or machine learning. Although they are not the most fundamental material in the book, they are actually not a bad starting point if explained in beginner-friendly language.

Throughout this article I will take you on a journey starting from the very beginning of the idea behind neural networks, take you through the core modern principles that make them learn, and finally show you a step-by-step implementation of a neural network model from scratch featuring Fully Connected, Activation, Flatten, Convolution and Pooling layers. This implementation is heavily based on and inspired by this amazing article by Omar Aflak, which is a must-read for everyone who wants to learn more about the mathematical background of neural networks.

Understanding Neural Networks

The history of neural networks traces back to 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts modeled a human brain neuron with a simple electronic circuit that took a set of inputs, multiplied them by weighted values and put them through a threshold gate, which output 0 or 1 depending on the threshold value. This model is known as the McCulloch-Pitts neuron.

McCulloch-Pitts neuron | Source: Wikimedia Commons

This idea was taken further by the psychologist Frank Rosenblatt, who created the mathematical model of the perceptron and later built it in hardware as the Mark I Perceptron. It was based on the McCulloch-Pitts model and was one of the first attempts to make a machine learn. The perceptron model also took a set of binary inputs, which were then multiplied by weighted values (representing the synapse strength). Then a bias was added (an offset, often implemented as an extra input fixed at 1 with its own weight, which ensures that more functions are computable with the same input) and once again the output was set to 0 or 1 based on a threshold value. The input mentioned above is either the input data or other perceptrons’ outputs.

While the McCulloch-Pitts model was groundbreaking research at the time, it lacked a good learning mechanism, which made it unsuitable for the area of AI.

Rosenblatt took inspiration from Donald Hebb’s thesis that learning occurs in the human brain through the formation and change of synapses between neurons, and came up with the idea to replicate it in his own way. He thought of a perceptron which takes a training set of input-output examples and forms (learns) a function by changing the weights of the perceptron.

The implementation took four steps:

  1. Initialize a perceptron with random weights
  2. For each example in the training set, compute the output
  3. If the output should have been 1 but was 0, increase the weights of the inputs that were 1. Conversely, if the output was 1 but should have been 0, decrease the weights of the inputs that were 1.
  4. Repeat steps 2 and 3 for each example until the perceptron outputs the correct values
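
The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not Rosenblatt's original procedure: the function names, the step size of 1, the `max_epochs` cap and the trick of appending a constant input of 1 to learn the bias are my own assumptions.

```python
import numpy as np


def train_perceptron(inputs, targets, max_epochs=100, seed=0):
    """Train one threshold perceptron with the update rule described above."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial weights; the last weight plays the role of the bias.
    weights = rng.uniform(-1.0, 1.0, size=inputs.shape[1] + 1)
    # Append a constant input of 1 so the bias is learned like any other weight.
    padded = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(padded, targets):
            # Step 2: compute the thresholded output (0 or 1).
            output = 1 if np.dot(weights, x) > 0 else 0
            # Step 3: (target - output) is +1 or -1 on a mistake, so the weights
            # of the active inputs are increased or decreased accordingly.
            if output != target:
                weights += (target - output) * x
                mistakes += 1
        # Step 4: stop once every example is classified correctly.
        if mistakes == 0:
            break
    return weights


def predict_perceptron(weights, inputs):
    """Apply the trained perceptron to a batch of inputs."""
    padded = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    return (padded @ weights > 0).astype(int)


# Logical AND is linearly separable, so the rule is able to learn it.
and_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_targets = np.array([0, 0, 0, 1])
and_weights = train_perceptron(and_inputs, and_targets)
print(predict_perceptron(and_weights, and_inputs))
```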

This set of instructions is what modern perceptrons are based on. Thanks to a significant increase in computing power, however, we can now work with many more perceptrons grouped together to form a neural network.

Source: mlxtend

However, they are not just randomly put in the network but are actually part of another building block — a layer.


A layer is made of perceptrons which are linked to the perceptrons of the previous and the next layers, if such layers exist. Every layer defines its own functionality and therefore serves its own purpose. Neural networks consist of an input layer (takes the initial data), an output layer (returns the overall result of the network), and hidden layers (one or many layers with different sizes, i.e. numbers of perceptrons, and functionality).


In order for the network to be able to learn and produce results, each layer has to implement two functions: forward propagation and backward propagation (backpropagation for short).

Base Layer Class

Imagine a train travelling between point A (input) and point B (output) which changes direction each time it reaches one of the points. The A to B course takes one or more samples from the input layer and carries them through the forward propagation functions of all hidden layers consecutively, until point B is reached (and a result is produced). Backpropagation is basically the same thing in the opposite direction: the course takes the data through the backpropagation methods of all layers in reverse order until it reaches point A. What distinguishes the two courses, though, is what happens inside these methods.

Forward propagation is only responsible for running the input through a function and returning the result. No learning, only calculations. Backpropagation is a bit trickier because it is responsible for doing two things:

  • Update the parameters of the layer in order to improve the accuracy of the forward propagation method.
  • Compute the derivative of the error with respect to the layer’s input (using the derivative of the forward propagation function) and return it, so that the previous layer can use it.
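
A base class for this idea might look like the sketch below. The class and method names (`Layer`, `forward_propagation`, `backward_propagation`) are assumptions chosen to match the wording of this article; every concrete layer would override both methods.

```python
class Layer:
    """Base class: concrete layers are expected to override both methods."""

    def __init__(self):
        self.input = None   # remembered during the forward pass
        self.output = None  # result of the forward pass

    def forward_propagation(self, input_data):
        """Run the input through the layer's function and return the result."""
        raise NotImplementedError

    def backward_propagation(self, output_error, learning_rate):
        """Update the layer's parameters (if any) and return the error
        with respect to the layer's input."""
        raise NotImplementedError
```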

So how and why does that happen exactly? The mystery unravels at point B, before the train changes direction and goes through the backpropagation of all the layers. In order to tune our model we need to answer two questions:

  • How good is the model’s result compared with the actual output?
  • How do we minimize this difference?

The process of answering the first question is known as calculating the error. To do that we use cost functions (also known as loss functions).

Cost Functions

There are different types of cost functions that do completely different calculations but all serve the same purpose: to show our model how far it is from the actual result. The choice of cost function is closely tied to the purpose of the model, but in this article we will only use one of the most popular ones, Mean Squared Error (MSE).

Formula for Mean Squared Error(MSE) | Source:

It is a pretty straightforward function: we square the differences between the actual output and the model’s output, sum them, and take the mean. But implementing MSE alone isn’t going to help our model much. We must implement its derivative as well.
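
In NumPy, MSE and its derivative with respect to the model's output can be sketched as follows. The function names `mse` and `mse_prime` and the argument order are my own choice, not something fixed by any library.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return np.mean(np.power(y_true - y_pred, 2))


def mse_prime(y_true, y_pred):
    """Derivative of MSE with respect to each predicted value."""
    return 2 * (y_pred - y_true) / y_true.size
```

The derivative is positive where the prediction is too high and negative where it is too low, which is exactly the information the model needs in order to know in which direction to correct itself.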

But why do we need this? Because of the infamous…

Gradient Descent

The last thing we need to do here is to show our model how to minimize the error. To do that we need an optimization algorithm (optimizer). Once again, there are many kinds of optimizers, all serving the same purpose, but to keep things simple yet meaningful we are going to use the most widely used one, on which many other optimization algorithms are based. Behold the mighty Gradient Descent:

Graphical representation of Gradient Descent | Source: Medium

Doesn’t look as scary as it sounds, does it? Good news everybody, it is a relatively simple concept. The gradient is essentially the multi-variable version of the derivative: it describes the rate of change of a function with respect to each of its parameters.

3D representation of Gradient | Source: OReilly

So let’s imagine our model is a ball. The surface represents the error for every possible combination of parameters, and the gradient (derivative) tells us how steep the surface is and in which direction. We want the ball to roll down the surface (descent) as low as possible in order to decrease the altitude (the error). In math terms, we need to reach a global (or at least a good enough local) minimum.

In order to make the ball move, though, we need to update our parameters at a certain rate, called the learning rate. This is a predefined parameter that we pass to our model before we run it. These kinds of parameters are called hyperparameters and play a huge role in our model’s performance. Here is what I mean:

Significance of Learning Rate | Source:

If we choose a learning rate that is too big, the parameters will change drastically and we might skip the minimum. If our learning rate is too small, on the other hand, it will take too much time, and hence computing power, to reach a satisfying result. That’s why tuning this parameter by testing the model with different values of the learning rate is rather important. A common starting point is a learning rate of 0.1 or 0.01, tuned from there.
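
To see this effect on a toy problem, here is a small sketch that minimizes f(x) = x² (whose derivative is 2x) with three different learning rates. The function, the starting point and the number of steps are arbitrary choices made only for illustration.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(x) = x**2 by repeatedly stepping against its derivative 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


too_small = gradient_descent(0.001)  # barely moves away from the starting point
reasonable = gradient_descent(0.1)   # ends up close to the minimum at 0
too_large = gradient_descent(1.1)    # overshoots on every step and diverges

print(too_small, reasonable, too_large)
```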

Back to back(propagation)

Now we need to update the model’s parameters layer by layer by passing the appropriate data to the backpropagation methods. The backpropagation method takes two parameters: the output error and the learning rate. The output error is either the result of the derivative of the cost function (for the last layer) or the value returned by the backpropagation of the previous layer (previous when looking from point B to point A). As written above, backpropagation should return the derivative of the error with respect to the layer’s input, which it computes using the derivative of the forward propagation function. By doing this each layer shows its predecessor its error.

So in other words if for some reason we had a Sine Layer it would look something like this:
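
A minimal sketch of such a layer, assuming a hypothetical `SineLayer` class with the two methods described earlier. Since the derivative of sin(x) is cos(x), the backward pass multiplies the incoming error by cos of the stored input.

```python
import numpy as np


class SineLayer:
    """Hypothetical layer that applies sin() element-wise; it has no weights."""

    def forward_propagation(self, input_data):
        # Remember the input: the backward pass needs it for the derivative.
        self.input = input_data
        return np.sin(input_data)

    def backward_propagation(self, output_error, learning_rate):
        # Nothing to update here, so learning_rate is unused.
        # Chain rule: dE/dX = dE/dY * cos(X).
        return output_error * np.cos(self.input)
```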

Now that we’ve got the two parameters needed, the backpropagation should update the layer’s weights (if any are present). Since every type of layer is different, it defines its own logic for parameter tuning, something which we will cover in a bit.

Wrapping up Gradient Descent

When each layer’s backpropagation is complete and our train arrives at point A, it takes the next sample(or set of samples) and starts its course through the hidden layers’ forward propagation functions once again — only this time, they should perform a bit better. This process continues on and on until training is completed and/or an error minimum has been reached.

Now that we’ve explained all the theory behind gradient descent, here is how it should look in code:
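
A minimal sketch of the loop, assuming each layer object exposes the `forward_propagation` and `backward_propagation` methods discussed above. The function name `train` and its parameters are hypothetical, and the loop handles one sample at a time for simplicity.

```python
def train(layers, loss, loss_prime, x_train, y_train, epochs, learning_rate):
    """Run gradient descent over the data, one sample at a time."""
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for x, y in zip(x_train, y_train):
            # A -> B: push the sample through every layer in order.
            output = x
            for layer in layers:
                output = layer.forward_propagation(output)
            # At point B: measure how far off the result is.
            total_error += loss(y, output)
            # B -> A: send the error back through the layers in reverse order.
            error = loss_prime(y, output)
            for layer in reversed(layers):
                error = layer.backward_propagation(error, learning_rate)
        # Average error of this epoch, handy for checking that it goes down.
        history.append(total_error / len(x_train))
    return history
```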

I hope this snippet makes the algorithm itself much clearer. The only thing that we haven’t fully covered yet is what types of layers we can use in a network and how to implement them.

Basic Layers

Although there are many kinds of layers to choose from, for a start the well-known Fully-Connected Layer is undoubtedly the best choice.


The Fully-Connected Layer is the most widely used layer type. Its working principles are based on the Rosenblatt model and are as follows:

  1. Every single perceptron from the previous layer is linked to every single perceptron of this layer.
  2. Each link has a weighted value (weight).
  3. A bias is added to the results.
  4. The layer’s weights are held in a 2D array of size m x n (where m is the number of perceptrons in the previous layer and n is the number of perceptrons in this layer). They are initialized with random values.
  5. The layer’s bias is held in a 1D array of size n. It is initialized with random values.

Visual representation of Fully-Connected (FC) Layer | Source:

Now let’s head to our implementation:
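
Here is a minimal sketch of such a layer. The class name `FCLayer`, the initialization range of [-0.5, 0.5) and the convention that a sample is a row vector of shape (1, m) are assumptions on my part.

```python
import numpy as np


class FCLayer:
    """Fully-connected layer: output = input @ weights + bias."""

    def __init__(self, input_size, output_size):
        # m x n weight matrix and a 1D bias of size n, both random.
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(output_size) - 0.5

    def forward_propagation(self, input_data):
        # input_data is assumed to have shape (1, input_size).
        self.input = input_data
        return np.dot(self.input, self.weights) + self.bias

    def backward_propagation(self, output_error, learning_rate):
        # Error with respect to the input, computed before the weights change.
        input_error = np.dot(output_error, self.weights.T)
        # Error with respect to the weights.
        weights_error = np.dot(self.input.T, output_error)
        # Gradient descent step on both parameter sets.
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error.sum(axis=0)
        return input_error
```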

As you can see, the implementation of our two methods here is not too complicated as long as you know basic linear algebra. And although relatively simple, this is a fully functional layer implementation which we will easily put to use later.

The only problem with Fully-Connected Layers, though, is that they are linear. In fact, many layer types have completely linear logic. A linear function is a polynomial of degree one, and stacking several linear layers still produces a linear function overall. Using only such functions hinders the model’s ability to learn complex functional mappings, hence learning is limited. That’s why (by convention) it is good to add non-linear functionality after every linear layer using activation layers.

Activation Layer

Activation layers are just like any other type of layer except they don’t have weights but use a non-linear function over the input instead.

A good example of such an activation function is tanh, which stands for hyperbolic tangent.

Tanh compared to sinh and cosh | Source: Wikipedia

Since we are going to need it when we begin building our model, we need to implement it:
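
A minimal sketch, assuming a generic `ActivationLayer` class that receives the activation function and its derivative as arguments (these names are my own). The derivative of tanh(x) is 1 - tanh²(x).

```python
import numpy as np


def tanh(x):
    return np.tanh(x)


def tanh_prime(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x) ** 2


class ActivationLayer:
    """Applies a non-linear function element-wise; it has no weights."""

    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward_propagation(self, input_data):
        self.input = input_data
        return self.activation(self.input)

    def backward_propagation(self, output_error, learning_rate):
        # No parameters to update, so learning_rate is unused.
        return self.activation_prime(self.input) * output_error
```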

Now that we have our two most important layers implemented, let’s proceed to implementing the whole Neural Network class.

Neural Network Implementation

There are several methods that need to be implemented:

  • a constructor: here we pass the hyperparameters (the learning rate and the number of epochs, i.e. the number of times our network will run over the input dataset) and initialize the necessary fields
  • add layer method: pass an instance of a layer; used for model construction; can (and should) be used several times in order to add several layers
  • use cost function method: specify the cost function to be used when training the model
  • fit method: a standard name for the method that performs the training process; here is where we will place the gradient descent snippet from earlier
  • predict method: a standard name for the method that is used to calculate results only; it is useful once the training process is complete

And here goes the code:
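
What follows is a minimal sketch of such a class rather than a definitive implementation. The names `NeuralNetwork`, `add_layer` and `use` are assumptions, and in this sketch `predict` returns the results instead of `self`, since that is what we want back from it.

```python
class NeuralNetwork:
    def __init__(self, learning_rate=0.1, epochs=100):
        # Hyperparameters and the fields we fill in later.
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.layers = []
        self.loss = None
        self.loss_prime = None

    def add_layer(self, layer):
        """Append a layer; call several times to build the model."""
        self.layers.append(layer)
        return self

    def use(self, loss, loss_prime):
        """Choose the cost function and its derivative."""
        self.loss = loss
        self.loss_prime = loss_prime
        return self

    def predict(self, input_data):
        """Forward propagation only; returns one output per sample."""
        results = []
        for sample in input_data:
            output = sample
            for layer in self.layers:
                output = layer.forward_propagation(output)
            results.append(output)
        return results

    def fit(self, x_train, y_train):
        """Train with gradient descent, one sample at a time."""
        if self.loss_prime is None:
            raise ValueError("call use(loss, loss_prime) before fit()")
        for _ in range(self.epochs):
            for x, y in zip(x_train, y_train):
                # Forward pass through every layer.
                output = x
                for layer in self.layers:
                    output = layer.forward_propagation(output)
                # Backward pass in reverse order.
                error = self.loss_prime(y, output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, self.learning_rate)
        return self
```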

You may have noticed the return self statement being present at the end of every method. The reason I put this is that it allows us to do method chaining. If you are not sure what I am talking about, you are going to see a good example in a bit.

Now let’s put it to work. We are going to use the MNIST database for classifying handwritten digits. You can download it from here, or you can easily import it from Keras.

Since the pixel values are in the range [0, 255], we are going to scale them down to the range [0.0, 1.0].

Another thing we did is make y (the results) a little more convenient (note keras.utils.to_categorical). It represents each numeric label as a one-hot vector:

5 => [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

This is helpful because our network’s output layer is going to consist of 10 nodes, each holding an output value. Since the ideal case output would be a correct one-hot vector it is now easier for the cost function to do its job.
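
The scaling and one-hot steps described above can be sketched with plain NumPy. The helper names `scale_pixels` and `one_hot` are hypothetical; the second one does by hand what `keras.utils.to_categorical` does for integer labels.

```python
import numpy as np


def scale_pixels(images):
    """Map pixel values from [0, 255] to [0.0, 1.0]."""
    return images.astype("float32") / 255.0


def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot rows, e.g. 5 -> [0,0,0,0,0,1,0,0,0,0]."""
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Put a 1 in the column that matches each label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded
```

If you load the data with `keras.datasets.mnist.load_data()`, the returned image arrays can be passed to `scale_pixels` and the label arrays to `one_hot` (or to `keras.utils.to_categorical`).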

Now let’s construct our first network by putting some FC and activation layers:

The reason why we use only the first 1000 samples is that training would take too long if we used all of them. Using more samples in this case will give you better results, so you can try a bigger range if you have the time.

Before we run this, though, we need to add one final touch: another activation function placed just before the output, Softmax. It normalizes an array of n elements into a probability distribution of n probabilities proportional to the exponentials of the input numbers. Simply put, it calculates the probability that the sample belongs to each class.

Softmax formula | Source:

And the implementation:
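
A minimal sketch of a Softmax layer, assuming a hypothetical `SoftmaxLayer` class that works on one sample of shape (1, n). Subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import numpy as np


class SoftmaxLayer:
    """Turns n raw scores into n probabilities that sum to 1."""

    def forward_propagation(self, input_data):
        # Shift by the maximum so that exp() cannot overflow.
        exps = np.exp(input_data - np.max(input_data))
        self.output = exps / np.sum(exps)
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # No parameters to update, so learning_rate is unused.
        s = self.output.reshape(-1)
        # Jacobian of softmax: d s_i / d x_j = s_i * (delta_ij - s_j).
        jacobian = np.diag(s) - np.outer(s, s)
        return np.dot(output_error, jacobian)
```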

Let’s try it once again, only this time we are going to have the Softmax activation as our final layer.

Now that we’ve trained our model, let’s evaluate it.


Keep in mind that we have implemented a small model for educational purposes. It is not going to produce a very high accuracy. I would highly recommend playing around with it in order to get better results.

In order to evaluate our results, we are going to use a simple utility from sklearn which shows us the accuracy of our model.
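
To show what that accuracy number means, here is a NumPy-only sketch that computes it by hand. The helper names `to_labels` and `accuracy` are hypothetical; `sklearn.metrics.accuracy_score` returns the same fraction when given the true and predicted label arrays.

```python
import numpy as np


def to_labels(outputs):
    """Pick the index of the largest value in each one-hot or probability vector."""
    return np.array([np.argmax(output) for output in outputs])


def accuracy(y_true_one_hot, predictions):
    """Fraction of samples whose predicted class matches the true class."""
    true_labels = to_labels(y_true_one_hot)
    predicted_labels = to_labels(predictions)
    return float(np.mean(true_labels == predicted_labels))
```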

Now that you know the basics of neural network construction, we can proceed to the more advanced stuff…