Original article was published by Arjun Pandey on Artificial Intelligence on Medium

# Neural Network for Beginners

The past decade has seen incredible advancements in Deep Learning. It has opened so many new paradigms for Artificial Intelligence and taken our capability of making intelligent machines to a whole new level. Today, I will be giving you a brief introduction and history of Deep Learning.

Deep Learning is as close as we have gotten to mirroring the functionality of the brain in computationally efficient systems. A slide from an MIT Deep Learning lecture outlines the current stage of this advancement perfectly.

Deep Learning is largely built on the idea of neural networks, which are loosely modeled on the workings of our brain. For now, you can understand neural networks as something with a strong biological inspiration. So before diving into the concept, let's discuss why neural networks and deep learning have been so successful recently, even though the core ideas were introduced ages ago: we finally have the huge datasets, the cheap parallel compute (GPUs), and the mature software tooling that these models need.

Now that we have a good idea of why Deep Learning is exploding, let's get into the functioning of a neural network. Like I said, they are inspired by the human brain, and the basic building block of the brain is the neuron, which looks a lot like this:

This neuron has a set of dendrites, a cell body, and a set of axon terminals. If you think about it, the dendrites receive information, the cell body understands and processes the information, and the axon terminals pass the processed information forward. Mimicking how such a neuron works, computer scientists developed an artificial 'neuron', the perceptron, that can process data to find various patterns. And thus a system like this was born:

Here a computer is given a set of input values, performs some mathematical calculations, and outputs the result for the next neuron to do the same. Turns out, these calculations are nothing but basic arithmetic: the cross symbol in the diagram below denotes multiplication and the other one is simple addition:

So, as you can probably guess, each input value is multiplied by its triangle. In Deep Learning language, this triangle denotes the 'weight' attached to that input. The weight multiplications are summed, and then the square, called the 'bias' of the neuron, is added. If that rings a bell, we are basically following y = mx + c in a slightly more complicated manner. We can also bring in the activation function here: the output of each such calculation is passed through an activation function. So suppose my set of calculations looks like this:

After going through the basic multiplication and addition steps, my answer comes out to be 4.2. Then I apply a chosen activation function to produce the output that will be transferred to the next neuron. In this case, I chose the max(0, x) function, which returns 0 for negative inputs and the number itself for positive inputs. Since we get +4.2, the max function spits out 4.2.

This process of making calculations and processing the activation is called ‘forward propagation’, a rather fancy name for something so basic.
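The whole forward-propagation step can be sketched in a few lines of plain Python. The input values, weights, and bias below are made up for illustration, chosen so the weighted sum lands on the 4.2 from the example:

```python
# A single artificial neuron: weighted sum of inputs, plus bias,
# passed through an activation function (here max(0, x), i.e. ReLU).
inputs = [1.0, 2.0, 3.0]    # made-up input values
weights = [0.5, 0.8, 0.6]   # the 'triangles' in the diagram
bias = 0.3                  # the 'square' in the diagram

# Forward propagation: sum of input*weight products, plus the bias.
z = sum(x * w for x, w in zip(inputs, weights)) + bias
output = max(0.0, z)        # activation: ReLU

print(output)               # → 4.2 (up to floating-point rounding)
```

Had the weighted sum come out negative, the max(0, x) activation would have passed a 0 to the next neuron instead.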

To give you more knowledge let’s see some of the common activation functions:

The ReLU (right) is nothing more than the max function we just discussed. The sigmoid (left) squashes whatever value it receives into the range between 0 and 1, and is hence extremely useful for giving out class-specific probabilities, a concept you will soon learn. The hyperbolic tangent is again a non-linear function, this time squashing our outputs into [-1, 1]. Note: it is important for these functions to be non-linear, because while we could use no activation function at all, the network we build would then collapse into a linear model with limited capabilities. Linear models certainly won't be able to understand and derive complex mappings from input data and would hence be rendered useless. Since that is clear, let's take a look at how you can compute these functions in code. Please note that these are TensorFlow executions; if you want to know the math behind the derivations, there are some resources linked below.

Code Snippets (just in case)
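For reference, here is a minimal NumPy sketch of the three activations; TensorFlow ships equivalents as tf.nn.relu, tf.math.sigmoid, and tf.math.tanh:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), element-wise.
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes any real number into (-1, 1).
    return np.tanh(x)

x = np.array([-2.0, 0.0, 4.2])
print(relu(x))      # negative values clipped to 0
print(sigmoid(x))   # every value mapped into (0, 1)
print(tanh(x))      # every value mapped into (-1, 1)
```

Notice that sigmoid(0) is exactly 0.5, the midpoint of its output range, which is why it is a natural fit for two-class probabilities.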

A Neural Network, by definition, is a stacked set of the neurons we saw above, whose values are updated in a way that optimizes the output. Let's see what this means. This is a sample neural network:

As you might have guessed, it's just a bunch of the neurons we saw, arranged in layers and fully connected to each neuron in the next layer. The four neurons on the left are called the 'input layer' and the two neurons on the right are called the 'output layer'. The two layers in between are often referred to as 'hidden layers'. Now you might notice that each neuron in subsequent layers is 'fully connected', meaning it holds a link to each neuron ahead of and behind it. These layers, in neural network terminology, are called 'Dense' layers. In many places such a network might also be referred to as a Multi-Layer Perceptron model or MLP, so don't get confused! Let's code some Dense layers from scratch to give you a sense of how they work:
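The walk-through below refers to code along these lines, a sketch of a custom Dense layer following TensorFlow's standard custom-layer pattern (the exact units and shapes here are illustrative, not taken from the notebook):

```python
import tensorflow as tf

class Dense(tf.keras.layers.Layer):
    def __init__(self, units=32):
        # Inherit the core properties of a Keras layer.
        super().__init__()
        self.units = units  # number of neurons in this layer

    def build(self, input_shape):
        # Weights start as small random numbers, biases as zeros;
        # training will move them toward better values.
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="random_normal",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,),
            initializer="zeros",
            trainable=True,
        )

    def call(self, inputs):
        # Forward pass: matrix multiplication plus bias.
        return tf.matmul(inputs, self.w) + self.b

layer = Dense(units=32)
out = layer(tf.ones((2, 4)))  # a batch of 2 samples, 4 features each
print(out.shape)              # (2, 32): 32 neurons per sample
```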

This might be some new code, so let's walk through it. By now you should be fairly familiar with what TensorFlow is, and for this code it is the only library you will need. TensorFlow allows you to build custom layers, and that is exactly what we are doing here.

We define a class *Dense* and use the *Layer* module in TensorFlow to inherit the core properties of a neural network layer. The **initialization function** should be familiar to you from past coding experiences. We use the super() function to give our class access to the parent, which in this case is the 'Layer' module. We then define the units, which are the number of neurons we want in our layer; for good measure, you can initialize them to 32. In the **build function**, the library provides some really convenient helpers to initialize weights. You will notice that the values are initialized as either random numbers or zeros, because we have no concrete point to start from. In fact, initialization is a very big area of research in neural network optimization, and you should definitely look into it if it fascinates you. Our end goal is to find optimal values of these weights and biases that truly capture the data we are given, so the initial values are of little concern; just keep reading if you are confused for now and everything will make sense. The **call function** computes the layer's output using the matrix multiplication function. In a neural network, this happens repeatedly over the training data, for a set number of iterations defined by us.

This was a very high-level overview, so if you have a general sense of what's going on you can read further. Behind the scenes, neural networks are nothing but a series of matrix multiplications interwoven with the activations we discussed.

Let's take a moment to discuss the biological comparison as well. Neural networks are dense, complex systems that try to map out non-linear, complicated relations in data. The artificial neuron is only a **coarse model** of its biological counterpart. Our brain is far more complex and uses 'synapses' to pass and process information; these synapses are themselves complex, non-linear dynamical systems, and that richness can somewhat be linked to the success of neural networks today.

**Neural Network Training**

**Overview**

Now that we have a good idea of what a layer of a neural network looks like, let's discuss how networks are trained.

Neural network training, on a high level, is not so complicated, but a few of the functions around it are. Our goal, as with any other model, is to minimize loss and thus optimize outputs. If you remember the concept of gradient descent, we will do exactly that here, with a calculus-based procedure called backpropagation. Below is a graph that represents the loss function we want to descend with gradient descent when training a neural network:

The function J is just mathematical notation for the cost, or loss, of the model. When training a neural network, we define a set number of epochs, or iterations, which is the number of times the network will pass over the training data. During each epoch, we use a function to update our weights and biases in a bid to improve the model, because if you think about it, those are the only things we can control other than the data. This function is called backpropagation; and if it isn't clear by now, calculating the output values through the weights and biases of the network was forward propagation, or the feed-forward process. Backpropagation was explained in our intro to machine learning math module, but if you don't have any clue, that's fine. Just remember it as a method from calculus that computes partial derivatives to update values using something called the learning rate. The learning rate defines our step size for gradient descent: it determines how big a step we take in terms of weights and biases, so the step we take in the graph above is set by the learning rate. I know it might be a bit confusing, but all this function does is go back through the network and adjust a few values to give the network the opportunity for better performance. This is what I meant by optimizing neural networks above.
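As a toy illustration of gradient descent (just the weight-update step that backpropagation feeds, not the full multi-layer algorithm), here is a NumPy sketch that recovers y = mx + c from made-up data by repeatedly stepping against the gradient of a squared-error loss; the data and learning rate are assumptions for the sketch:

```python
import numpy as np

# Made-up data generated from y = 2x + 1, the values our model should recover.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # arbitrary starting point for weight and bias
learning_rate = 0.05   # size of each gradient-descent step

for epoch in range(2000):
    y_pred = w * x + b                 # forward propagation
    error = y_pred - y
    loss = np.mean(error ** 2)         # the cost J we want to minimize
    # Partial derivatives of the loss with respect to w and b
    # (the kind of calculus backpropagation performs for every layer).
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step downhill: the learning rate controls how big the step is.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Try a learning rate of 1.0 instead and the updates overshoot the minimum and diverge, which is exactly the "how big a step" trade-off described above.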

**Additional Concepts:**

Now before coding the network we will touch on a few key topics for in-depth knowledge.

**Weight Initialization**: As you might have guessed, we need a starting point for our weights and biases. Many theories have been discussed and argued over the years, but only a few tend to work. One way to initialize weights is to draw completely random small numbers. The intuition is that since we are looking for non-linearities and all neurons start random and unique, they will compute individually distinct updates and integrate into a connected network. What this means is that there is an expectation for the loss to converge and the weights to optimize, hence the starting points barely matter. Over the years there have been many refinements, like Batch Normalization and calibrating the weight variance with the square root of the number of inputs. You can read more about them in this article. But for now, our intuition of random initialization is sufficient.
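Both ideas fit in a few lines of NumPy; the layer sizes here are made-up examples:

```python
import numpy as np

n_inputs, n_neurons = 4, 32   # made-up layer sizes

# Small random numbers: every neuron starts unique, so each one
# computes a distinct update during training.
w_small = 0.01 * np.random.randn(n_inputs, n_neurons)

# Calibrated variant: dividing by sqrt(number of inputs) keeps the
# variance of a neuron's output independent of how many inputs it has.
w_calibrated = np.random.randn(n_inputs, n_neurons) / np.sqrt(n_inputs)

# Biases have no symmetry problem, so zeros are a common choice.
b = np.zeros(n_neurons)
```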

**Regularization**: In neural network theory there is a concept of overfitting, meaning that the network gets overly dependent on the training data and can't generalize the patterns it's learning. Regularization can help us prevent that. One way to do this is Dropout. It might sound funny, but in Dropout we just freeze a certain percentage of neurons during training and compute the model again. This checks how robust the network really is if certain information is lost. Again, there are many methods of doing this. For more information check this out: https://cs231n.github.io/neural-networks-2/.
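A minimal NumPy sketch of the idea ("inverted" dropout, the variant commonly described in the CS231n notes linked above):

```python
import numpy as np

def dropout(activations, p=0.5):
    # Inverted dropout: randomly zero a fraction p of the activations
    # during training, and scale the survivors by 1/(1-p) so the
    # expected value of the layer's output stays the same.
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask

a = np.ones((1, 10))
print(dropout(a, p=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0
```

At test time nothing is dropped; thanks to the 1/(1-p) scaling during training, the layer needs no adjustment when dropout is switched off.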

**Coding a Neural Network**

**OPEN THE NOTEBOOK LINKED BELOW**

If you want raw code, please do check out the tutorial above. But we are going to show you an easy way that fits most needs. First, let's start by showing you how it's done in Scikit-learn:
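The three lines discussed below boil down to something like this; the dataset here is a tiny made-up XOR-style problem standing in for the notebook's data, and the layer sizes are illustrative:

```python
from sklearn.neural_network import MLPClassifier

# Tiny made-up dataset: XOR, a classic problem no linear model can solve.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10
y = [0, 1, 1, 0] * 10

# Two hidden layers of 8 neurons each; max_iter caps the training loops.
clf = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
clf.fit(X, y)

print(clf.predict([[0, 1], [1, 1]]))
```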

Now, it might be just three lines of code, but it doesn't give you some of the functionality that more advanced libraries like Keras do. MLP stands for multi-layer perceptron classifier, and if you think about it, a neural network is in fact a multi-layer perceptron model. All code and cell-specific explanations are given in our notebook, but as an introduction: the hidden_layer_sizes parameter defines the number of layers and the number of neurons per layer, and the max_iter parameter decides the number of times our model will loop through the training data, also called 'epochs'. Like I said, Scikit-learn isn't the most flexible tool out there, so let's get a sneak peek at how this is done in Keras:
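A hedged sketch of the Keras version described below; the 784-feature input, the layer widths, and the 10-class softmax output are assumptions standing in for the notebook's dataset:

```python
import tensorflow as tf

# A stack of Dense layers with a Dropout layer to fight overfitting.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                       # assumed input size
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),                       # drop 20% of neurons in training
    tf.keras.layers.Dense(10, activation="softmax"),    # assumed 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=10)  # 10 epochs, as in the article

out = model(tf.zeros((1, 784)))
print(out.shape)  # (1, 10): one probability per class
```

The softmax output row sums to 1, which is what makes it readable as class-specific probabilities.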

As usual, you will find the explanation and implementation in the notebook. Just to be clear, for these tutorials Keras is way more functional and allows you to better visualize your model. The code above just defines a bunch of Dense layers and runs for 10 epochs. It also adds a Dropout layer to avoid overfitting. But we can achieve a maximum accuracy of only 49.7%; there has to be a better way! And as you must have guessed, there is: Convolutional Neural Networks.

**Future Experiments**

· Code different networks with different layers and neurons, and tell us what combination achieves an accuracy of > 80%

· Try different dropout rates and research other Neural Network layers

· Check out the Tensorflow playground for better visualization. Link at the end of the notebook

Check out this notebook for all the code.

**References:**

· **CS 231n by Stanford**: lectures on Neural Networks