Source: Deep Learning on Medium
Demystifying neural networks for complete starters
Neural Network! Deep learning! Artificial Intelligence!
Anyone living in 2019 has probably heard these words more than once. And you have probably seen the awesome work being done in areas such as image classification, computer vision, and speech recognition.
So are you also interested in building those cool AI projects but still have no idea what an artificial neural network is? There are already hundreds of articles explaining the concept of the artificial neural network with names like “a beginner’s guide on back propagation in ANN” or “A gentle introduction of the artificial neural network.” They are really great already, but I found they can still be hard for someone who is not comfortable with mathematical expressions.
Today, I’m going to explain the basics of the artificial neural network (ANN) with the least amount of maths. I’ll aim for the easiest and most intuitive explanation I can, so if you’re a math hater or having trouble with linear algebra, come and take your piece! Today’s keywords are forward propagation, activation functions, backpropagation, gradient descent and weight updating. I’ll also leave additional resources, which can be your next steps after you finish this post. Sounds good? Let’s get it!
So what on earth is a neural network?
You might have seen lots of articles that start with what neurons are and how they are structured. Yes, the ‘neural’ in artificial neural network comes from the neurons of the human brain. In 1943, Warren McCulloch and Walter Pitts made the first attempt to create a computational model of neural networks. They wanted to apply the biological processes in the brain to mathematical algorithms, and from that point research on neural networks split into two directions. Today’s neural networks in AI have taken a somewhat different route from cognitive science. So I’d rather think of an ANN as a kind of structure or diagram than as a neuron, because it doesn’t have that much to do with the biological neurons of the brain.
Let’s start with this picture. There are starting points (input layer) and ending points (output layer). Let’s say these are islands and we are traveling from the ‘input-layer’ islands to the ‘output-layer’ islands. We can take various routes from the start to the end, and each route is worth a different number of points. When we reach the destination, we sum up all the scores and determine which island is the one we were looking for. So it’s like sending out our scouting boat teams to find the perfect island for our next vacation.
One interesting part here is that when the boat teams reach the output layer, they come back to the input layer. Then we repeat this process of sending them toward the output layer and calling them back to the input layer. Each trial produces an outcome score, and we use these scores to calculate how accurate the prediction is, just like what we do with RMSE or MAE in linear regression.
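As a rough sketch of how such a score could be computed, here are MAE and RMSE in plain Python. The actual and predicted values are made-up numbers for illustration only; they do not come from the pictures in this post.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: punishes large misses more heavily."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical values, chosen only to show the calculation.
actual = [20.0, 10.0, 5.0]
predicted = [16.0, 12.0, 5.0]

print(mae(actual, predicted))   # 2.0
print(rmse(actual, predicted))
```

The lower either number is, the closer the predictions are to the actual values.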
Forward Propagation and Weight
The metaphor above describes what a neural network does. Let’s go one step further with some real computation this time. This is a more simplified diagram of a neural network.
Let’s say our input data is 5 and 2. We are going to pass these values on toward the output layer. Let’s start with 5. As you can see, there are two possible routes, each with different points. If 5 takes the upper route, the point will be 10. If 5 takes the lower route, it will be -10. Then what about the input value 2? Yes: 6 for the upper route, 2 for the lower route. So if we sum up the values arriving on each route, the values at the hidden layer will be like below: 10 + 6 = 16 for the upper node and -10 + 2 = -8 for the lower node.
We can easily get the final value in the same way. You probably get the idea of what’s going on here. This is called forward propagation. It moves from left to right. The point here is that the result of the left layer is inserted as the input values of the next layer to the right.
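The hidden-layer arithmetic above can be sketched in a few lines of Python. The weights are not stated explicitly in the text; they are inferred from the route scores (5 becoming 10 means a multiplier of 2, and so on), and the variable names are only illustrative.

```python
# Inputs from the example above.
inputs = [5, 2]

# Weights inferred from the route scores in the text:
#   upper route: 5 -> 10 (times 2),  2 -> 6 (times 3)
#   lower route: 5 -> -10 (times -2), 2 -> 2 (times 1)
hidden_weights = [
    [2, 3],    # weights into the upper hidden node
    [-2, 1],   # weights into the lower hidden node
]

def weighted_sum(values, weights):
    """Multiply each value by its weight and add up the results."""
    return sum(v * w for v, w in zip(values, weights))

hidden = [weighted_sum(inputs, w) for w in hidden_weights]
print(hidden)  # [16, -8]
```

Each hidden value is simply the sum of input times weight over the routes arriving at that node.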
The circles in the picture are called nodes, which I described as islands. The multiplying values we used are called weights. Weight is a very frequently used term in data science. We use it in the sense of the strength of certain features or samples. So if a feature gets a high weight, that feature will have a great impact on the outcome. By giving different weights to the features, we can train our model for better prediction. The word might sound unfamiliar, but we have already been using weights in other machine learning algorithms such as lasso regression or boosting, and when controlling the coefficients of features in other regression models.
There is another new concept that you might not have heard of so far, which is the activation function. The activation function applies a non-linear change to the values before they are passed on as the outcome. Why do we need that? If we just use linear calculations without activation functions, like we’ve done above, the hidden layer gives our model no extra effect: it will not be much different from other regression models. To ‘activate’ the real power of neural networks, we need to apply an ‘activation function,’ because activation functions help the model capture non-linearities within the data.
There are several activation functions, and we need to choose a proper one depending on the problem. The equations can be found on Wikipedia, but I want you to look at the graph of each function before the equations and see what kind of shape or characteristics each function has, because this understanding will give you clues for what to choose.
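To see why a purely linear network adds nothing, here is a small sketch that reuses the inputs 5 and 2 from the example. The hidden weights are inferred from the route scores above, and the output weights of 1 and 1 are an assumption made only for this illustration. Without an activation function, the two layers can be folded into a single set of weights that gives exactly the same answer.

```python
inputs = [5, 2]
hidden_weights = [[2, 3], [-2, 1]]   # inferred from the example above
output_weights = [1, 1]              # assumed for illustration

def dot(values, weights):
    return sum(v * w for v, w in zip(values, weights))

# Two linear layers with no activation function in between.
hidden = [dot(inputs, w) for w in hidden_weights]
two_layer_output = dot(hidden, output_weights)

# Fold both layers into one weight per input.
combined = [
    sum(output_weights[j] * hidden_weights[j][i] for j in range(2))
    for i in range(2)
]
one_layer_output = dot(inputs, combined)

print(combined)                            # [0, 4]
print(two_layer_output, one_layer_output)  # 8 8
```

Since the folded version is just a weighted sum of the inputs, it behaves like an ordinary linear regression, which is why a non-linear step is needed between layers.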
The sigmoid function is appropriate for binary classification. It squeezes values into the range between 0 and 1. The higher the input value is, the closer the output gets to 1. The smaller the input value is, the closer it gets to 0. The tanh function (hyperbolic tangent) is similar to the sigmoid function, but its lower limit is -1 this time. Because its output is centered at 0, tanh is often preferred over the sigmoid function. The threshold function and ReLU (Rectifier) have a certain point from which the value changes. Because of its slope, ReLU is efficient to use in most cases. But why the slope? What’s the relation between the neural network and the slope? That’s where gradient descent comes into play.
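Before we get to that, here is a rough sketch of the four functions in plain Python, so you can try a few values yourself. The threshold function is written with its step at 0, which is one common convention and an assumption here.

```python
import math

def sigmoid(x):
    """Squeezes any value into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Like sigmoid, but the output ranges from -1 to 1."""
    return math.tanh(x)

def threshold(x):
    """Jumps from 0 to 1 at x = 0 (assumed step position)."""
    return 1.0 if x >= 0 else 0.0

def relu(x):
    """Passes positive values through and turns negative values into 0."""
    return max(0.0, x)

for x in (-8.0, 0.0, 16.0):
    print(x, sigmoid(x), tanh(x), threshold(x), relu(x))
```

Running it with large positive and negative inputs shows how sigmoid and tanh flatten out at their limits while ReLU keeps growing.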
Gradient descent is a method for finding the minimum value of a function. Let’s suppose the cost function has a convex shape like the picture below. Our goal is to make its value as small as possible.
Suppose the first trial puts the error somewhere on that curve, say at point 1. Then we have another trial. As we keep trying and comparing the outcomes, the point goes down the hill like in the picture and finally approaches the lowest point (though the error does not have to be exactly 0).
How far we move the point is determined by dW, the slope of the loss function, and by α, a learning rate we have to choose. So how fast the weight is updated depends on the product of dW and α. If the learning rate is too big, the updates overshoot the lowest point; if it is too small, the point barely moves. Either way, the model fails to learn proper values for the weights. Therefore it’s important to give the right values and regularization to gradient descent.
There is a lot more to say here, but I don’t think it’s good to take it all in at once. For anyone who wants to go more in-depth, I’ll leave other advanced resources at the end.
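Here is a minimal sketch of that update rule. The cost function is a made-up convex curve with its lowest point at w = 3, and the learning rates and step count are hypothetical values picked to show the three behaviors.

```python
def cost(w):
    # Hypothetical convex cost with its minimum at w = 3.
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of the cost above: dW = 2 * (w - 3).
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        dW = slope(w)
        w = w - learning_rate * dW   # step against the slope
    return w

good = gradient_descent(w=0.0, learning_rate=0.1, steps=50)
too_big = gradient_descent(w=0.0, learning_rate=1.1, steps=50)
too_small = gradient_descent(w=0.0, learning_rate=0.0001, steps=50)

print(good)       # ends up very close to 3
print(too_big)    # overshoots further on every step and runs away
print(too_small)  # has hardly moved from the starting point
```

With the same number of steps, only the middle-sized learning rate gets near the bottom of the curve.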
Back Propagation and Updates
Okay, let’s get back to our story again with these two concepts: the activation function and gradient descent. If we apply the ReLU activation function to our data, the output layer will be 16. If the actual value was 20, the error of our prediction would be 4 (20 - 16).
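As a sketch, this is the same forward pass with ReLU applied at the hidden layer. The hidden weights are inferred from the earlier route scores, and the output weights of 1 and 1 are an assumption that happens to reproduce the output of 16. The error is computed here as the plain difference between the actual and predicted values.

```python
def relu(x):
    return max(0, x)

inputs = [5, 2]
hidden_weights = [[2, 3], [-2, 1]]   # inferred from the earlier example
output_weights = [1, 1]              # assumed; reproduces the output of 16

# Weighted sums at the hidden layer, then ReLU.
hidden_raw = [sum(x * w for x, w in zip(inputs, ws)) for ws in hidden_weights]
hidden = [relu(h) for h in hidden_raw]

# Weighted sum at the output layer.
prediction = sum(h * w for h, w in zip(hidden, output_weights))

actual = 20
error = actual - prediction

print(hidden_raw, hidden, prediction, error)  # [16, -8] [16, 0] 16 4
```

ReLU turns the -8 at the lower hidden node into 0, so only the upper node contributes to the prediction.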
From now on, we are going to move backward and change the weight values. This is called backpropagation. This time we are moving from right to left, from the output layer to the input layer. What does going backward mean? It means getting the gradients for each step. And with those slopes, we will update the weights as we discussed above.
What we’ve gone through so far is stages 1 to 4. We input the X values and get the predicted value ŷ by forward propagation. There is a term b called the bias, but we just set it to 0 here. If you are familiar with the chain rule, you’ll get the concept of backpropagation easily. If not, don’t worry. It’s like biting the tail of the previous equation. I want you to look at the outcome of dW₂. It is the product of the error, the slope of the activation function at the given point, and the input value.
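To make dW₂ concrete, here is a sketch of one update of the output weights. Several things are assumptions: the output weights of 1 and 1, ReLU as the activation at the output node, the learning rate of 0.001, and measuring the error as prediction minus actual so that stepping against dW₂ reduces the loss.

```python
def relu(x):
    return max(0.0, x)

def relu_slope(x):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

hidden = [16.0, 0.0]          # hidden-layer outputs after ReLU, from the example
output_weights = [1.0, 1.0]   # assumed W2, consistent with a prediction of 16
actual = 20.0
learning_rate = 0.001         # hypothetical value of alpha

# Forward step for the output node (bias b = 0, as in the text).
z = sum(h * w for h, w in zip(hidden, output_weights))
prediction = relu(z)

# Error as prediction minus actual (sign convention assumed here).
error = prediction - actual

# dW2 = error * slope of the activation at z * input to that weight.
dW2 = [error * relu_slope(z) * h for h in hidden]

# Gradient descent update, then a fresh prediction with the new weights.
new_weights = [w - learning_rate * g for w, g in zip(output_weights, dW2)]
new_prediction = relu(sum(h * w for h, w in zip(hidden, new_weights)))

print(dW2[0])          # -64.0
print(new_prediction)  # a little closer to 20 than 16 was
```

The weight attached to the hidden node that output 0 gets no update, because its input value is one of the factors in the product.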
Just like forward propagation passes the input data through the hidden layer and toward the output layer, backpropagation takes the error from the output layer and propagates it backward through the hidden layers. And in each step, the weight values are updated.
By repeating the whole process many times, the model finds the optimized weights and the predictions with the best accuracy. And when the metrics converge to a certain level, our model is ready to make predictions.
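Putting the pieces together, the repeated process can be sketched as a small training loop on the single example used in this post. This is only an illustration under assumptions: the output weights, ReLU at both layers, a learning rate of 0.001 and 50 repetitions are all hypothetical choices, and the biases stay at 0.

```python
def relu(x):
    return max(0.0, x)

def relu_slope(x):
    return 1.0 if x > 0 else 0.0

def forward(x, W1, W2):
    """Input -> hidden (ReLU) -> output (ReLU), with all biases fixed at 0."""
    hidden_raw = [sum(xi * w for xi, w in zip(x, row)) for row in W1]
    hidden = [relu(h) for h in hidden_raw]
    z = sum(h * w for h, w in zip(hidden, W2))
    return hidden_raw, hidden, z, relu(z)

def train_step(x, actual, W1, W2, learning_rate):
    hidden_raw, hidden, z, prediction = forward(x, W1, W2)
    error = prediction - actual

    # Output layer: error * activation slope * input to each weight.
    delta_out = error * relu_slope(z)
    dW2 = [delta_out * h for h in hidden]

    # Hidden layer: pass the error backward through W2 (chain rule).
    delta_hidden = [delta_out * w * relu_slope(hr)
                    for w, hr in zip(W2, hidden_raw)]
    dW1 = [[d * xi for xi in x] for d in delta_hidden]

    # Update every weight against its slope.
    new_W2 = [w - learning_rate * g for w, g in zip(W2, dW2)]
    new_W1 = [[w - learning_rate * g for w, g in zip(row, grad_row)]
              for row, grad_row in zip(W1, dW1)]
    return new_W1, new_W2, prediction

x, actual = [5.0, 2.0], 20.0
W1 = [[2.0, 3.0], [-2.0, 1.0]]   # inferred from the example in this post
W2 = [1.0, 1.0]                  # assumed

history = []
for epoch in range(50):
    W1, W2, prediction = train_step(x, actual, W1, W2, learning_rate=0.001)
    history.append(prediction)

final_prediction = forward(x, W1, W2)[3]
print(history[0], final_prediction)  # starts at 16.0 and ends close to 20
```

A real model would loop over many examples and track a metric such as RMSE on held-out data, but the rhythm of forward pass, backward pass and update is the same.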
Let’s recap what we’ve discussed so far. We talked about 5 steps in ANN.
While passing data through the hidden layers and updating the weights of each layer, what a neural network is doing is building an internal representation of the patterns in the data. Other deep learning algorithms start from this basic structure. They are modified versions that add special techniques. Therefore a solid understanding of the basic structure of a neural network should be the first and foremost thing for beginners.
This post focused on building the fundamental intuition of neural networks with a minimum amount of maths. The case here is a little simplified, but I hope it works well for starters. If you are ready for the advanced level, I’d recommend reading many other resources as well. These are my picks.
- Such a great post by Shirin Glander. This covers how to build an ANN in R with the additional concept of regularization and optimization: ‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm
- Maybe the most popular article on this topic, by James Loy on Towards Data Science. It’s been a while, but it’s still great for understanding deep learning from scratch: https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6
- Excellent course for getting the intuition of neural networks on DataCamp. Even for someone who does not know linear algebra and derivatives: https://www.datacamp.com/courses/deep-learning-in-python
Thank you for reading and hope you found this post interesting. If you’d like to encourage an aspiring data scientist, please hit 👏 👏 👏! I’m always open to talk so feel free to leave comments below and share your thoughts. I also post other valuable DS resources weekly on LinkedIn, so please follow me, contact me and reach out. I’ll come back with another exciting project next year. Until then, happy machine learning!