DeepLearning series: Neural Networks and Deep Learning

In this blog I will start with the basic definition of a neural network, then move on to deep learning concepts.

To cover the basics of a neural network, I will use logistic regression, an algorithm used for binary classification (when the output is 0 or 1), since its implementation is easy to follow.

From there we will move to the representation of a neural network with one hidden layer, which will lay the foundation for a more complex one, a deep neural network.

Okay, let’s dive deep (yeah, I know, … couldn’t help it!) into this exciting material.

In a previous blog I already wrote about the neural network, but I want to add some intuition regarding the basic concepts. Additionally, I wanted to write about the amazing deep learning networks that I’ve been working on thus far, so I decided to take a step back and start from the “beginning”.

So, what’s a neural network?

You might hear or read a lot about the fact that a neural network is inspired by the human brain’s architecture. . . Well, that’s sexy, but it sounds very complicated and really not that intuitive.

Simply put, a neural network tries to learn a function “f” that can connect the input X to the output Y (simpler, but not as sexy, right?). It can be represented as:

The “neuron” here computes this function “f”, which can be linear or non-linear.

LOGISTIC REGRESSION

Let’s move to the use of a neural network for a binary classification. In this case the output Y can take the value of 0 or 1.

Given an input feature vector X, I want an algorithm that can output a prediction, ŷ, which is the estimate of Y.

Essentially, ŷ is the probability that Y is equal to 1, given the input features X.

How do we generate this output ŷ? In other words, what function can we use to connect ŷ to X through the use of some parameters w and b?

We could think of using a linear function, such as ŷ = w*X + b (technically it should be w transpose, wT, as X and w are vectors of the same dimension, but let’s just skip this “detail” for the sake of simplicity). This won’t work for binary classification: since ŷ is a linear function of X, it can take any real value, not just values between 0 and 1.

Instead, to get a result that ranges from 0 to 1, which is the range ŷ is bounded to, we can use a logistic (sigmoid) function applied to the linear expression of the data.

The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

which is represented as:

(ref: https://en.wikipedia.org/wiki/Sigmoid_function)

Let’s call z the linear expression we mentioned before, so

z = wT*X + b

and therefore, applying the sigmoid function to it we get:

ŷ = σ (z) = σ(wT*X + b)

Great, so we got to the logistic regression model!
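To make the model concrete, here is a minimal sketch in plain Python. The function names (sigmoid, predict) and the parameter values are my own choices for illustration, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Return y_hat = sigmoid(w . x + b) for one example (w and x are lists)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration.
w = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]
y_hat = predict(w, x, b)  # here z = 1.0 - 1.0 + 0.1 = 0.1
print(y_hat)
```

Whatever the value of z, the prediction stays strictly between 0 and 1, which is what lets us read it as a probability. For simplicity this sketch ignores the overflow that math.exp raises for very large negative z.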

To recap, this is what we have defined so far:

We have an input, defined by a set of training examples X with their labels Y, and an output ŷ computed as the sigmoid of a linear function (ŷ = σ(wT*X + b)).

We need to learn the parameters w and b so that, on the training set, the output ŷ (the prediction on the training set) will be close to the ground truth labels Y.

To learn these parameters we need to train the network, which means we need to define a cost function and minimize it, so as to obtain the w and b that get the predictions as close as possible to the ground truth.

Let’s first define the loss function (that computes the error for one training example), which is then generalized to the cost function. The latter encompasses the whole training set.

We could use the squared error between the label (y) and the prediction (ŷ), but this is not used in logistic regression: combined with the sigmoid it makes the cost function non-convex, so gradient descent can get stuck in a local minimum instead of reaching the global one.

What is generally used for the loss function is this formula:

L(ŷ, y) = -(y*log(ŷ) + (1-y)*log(1-ŷ))

If y=1 then

L(ŷ, y) = -log(ŷ)

Since we need to minimize the loss, we want log ŷ to be large. This means that we want ŷ to be large and, since ŷ is the output of a sigmoid and cannot exceed 1, as close to 1 as possible.

On the other hand, if y=0 then

L(ŷ, y) = -log(1-ŷ)

To minimize the loss we want log(1-ŷ) to be large, which means we want ŷ to be small, therefore close to 0.

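Finally, the cost function on the entire training set is defined as:

J(w, b) = (1/m) * Σ L(ŷ(i), y(i))   (sum over i = 1, …, m)

where m is the number of training examples and (i) denotes the i-th example.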
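In code, the per-example loss and the averaged cost could be sketched as follows. The function names and the sample predictions are made up for illustration, and I am assuming the predictions are strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Average of the per-example losses over the whole training set."""
    return sum(loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / len(ys)

# Made-up predictions for a positive example (y = 1).
confident_right = loss(0.9, 1)   # prediction close to the label: small loss
confident_wrong = loss(0.1, 1)   # prediction far from the label: large loss
print(confident_right, confident_wrong)
print(cost([0.9, 0.2, 0.6], [1, 0, 1]))
```

Notice how the loss punishes a confident wrong prediction much harder than it rewards a confident right one.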

What we have done so far is compute what is called the “forward propagation”, which moves from the input through the neural network and computes the loss, that is, the error between our prediction and the ground truth.

At this point, we want to minimize this error so we use “back propagation” to propagate back what we have learned to adjust the parameters.

Once we have done that, we update the parameters w and b using gradient descent, which uses the derivatives of the cost function J with respect to w and b:

w := w - α*dJ/dw
b := b - α*dJ/db

where α is the learning rate.

Okay, time for an example! Let’s use a network with two inputs.

Our inputs are x1, x2 with their associated parameters w1, w2 and b. We can then compute “z” as a linear function of those and then apply a sigmoid function to it (called “a” in the graph below) and finally calculate the loss function of “a” with respect to y.

What we have depicted is the forward pass of the logistic regression.

Now, we want to calculate the derivatives of the cost function with respect to each parameter. We do this in order to calculate how much a change in each parameter affects the final loss.

Finally, we run the gradient descent algorithm, which, at every iteration, updates the parameters using the derivatives and the learning rate.

Thus, we find the parameters w1, w2 and b that minimize the error between our prediction and the ground truth.

The computational graph for forward and back propagation is depicted below.

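The formula for forward propagation is:

z = w1*x1 + w2*x2 + b
a = σ(z)
L(a, y) = -(y*log(a) + (1-y)*log(1-a))

For back propagation:

da = dL/da = -y/a + (1-y)/(1-a)
dz = dL/dz = a - y
dw1 = x1*dz
dw2 = x2*dz
db = dz

Finally, the gradient descent step, which computes the updates, is:

w1 := w1 - α*dw1
w2 := w2 - α*dw2
b := b - α*db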
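Putting the three steps together, here is a minimal sketch of a training loop for the two-input case. The function name train, the learning rate alpha, the number of iterations and the toy data set are all assumptions I made for illustration. The backward pass uses the standard simplification dL/dz = a - y for the sigmoid combined with the log loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, alpha=0.5, iterations=3000):
    """Fit w1, w2, b by batch gradient descent on ((x1, x2), y) examples."""
    w1, w2, b = 0.0, 0.0, 0.0   # zeros are fine here: there is no hidden layer
    m = len(examples)
    for _ in range(iterations):
        dw1 = dw2 = db = 0.0
        for (x1, x2), y in examples:
            # Forward pass
            z = w1 * x1 + w2 * x2 + b
            a = sigmoid(z)
            # Backward pass: dL/dz simplifies to a - y for this loss
            dz = a - y
            dw1 += x1 * dz
            dw2 += x2 * dz
            db += dz
        # Gradient descent update with the gradients averaged over the set
        w1 -= alpha * dw1 / m
        w2 -= alpha * dw2 / m
        b -= alpha * db / m
    return w1, w2, b

# Toy data set (made up): the label is 1 when x1 + x2 is large.
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
        ((1.0, 1.0), 1), ((2.0, 1.0), 1), ((1.0, 2.0), 1)]
w1, w2, b = train(data)
predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for (x1, x2), _ in data]
print(predictions)
```

The inner loop is the forward and backward pass for each example; the three lines after it are the gradient descent update.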

________________________

NEURAL NETWORK:

In the logistic regression we saw how this model

corresponds to the following computational graph:

A neural network (with one hidden layer) looks like this:

where each node (“neuron”) corresponds to the previous two-step calculation of “z” and “a”.

So think of something like this:

Finally, the computational graph for the neural network depicted above is:

The process is the same as the one for logistic regression: after the forward propagation (as written above in the computational graph), we compute the derivatives at each step, which is the back propagation.

Finally, we apply gradient descent to acquire the parameters that minimize the loss.

In this case the parameters we are optimizing are the weights w and biases b of the hidden layer, plus the weights w and bias b of the output layer:

I won’t sketch the back propagation steps on the computational graph as it will be as messy as a toddler’s drawing, but you get the idea.
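The forward pass, at least, is short enough to sketch in code. This is an illustration under my own assumptions: the helper names (neuron, forward), the layer sizes and the parameter values are hypothetical, and I use the sigmoid in the hidden layer because it is the only activation we have met so far.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs, activation):
    """One node: the two-step computation z = w . x + b, then a = g(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def forward(x, hidden_layer, output_unit):
    """Forward pass through one hidden layer followed by a single sigmoid output."""
    hidden = [neuron(w, b, x, sigmoid) for w, b in hidden_layer]
    w_out, b_out = output_unit
    return neuron(w_out, b_out, hidden, sigmoid)

# Hypothetical parameters: 2 inputs, 3 hidden units, 1 output unit.
hidden_layer = [([0.1, -0.2], 0.0), ([0.4, 0.3], 0.0), ([-0.5, 0.2], 0.0)]
output_unit = ([0.3, -0.1, 0.2], 0.0)
y_hat = forward([1.0, 2.0], hidden_layer, output_unit)
print(y_hat)
```

Each hidden unit is just the logistic regression unit from before, and the output unit treats the hidden activations as its inputs.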

Before moving to a more complex neural network (the deep neural network), I want to expand on the activation function and the initialization of the parameters w and b.

_ _ _ _ _ _ _

Activation function:

So far we have been using the sigmoid function as an activation function (that’s when we do a = σ(z), remember?) which, as you recall, is defined by this curve:

In a binary classification, where the output is a value between 0 and 1, this curve works pretty well. On the other hand, because its outputs are not “centered” around zero, it is not a good choice as the activation function for the hidden layers.

For those, in fact, we have a better option: the tanh(z) function:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

It is “centered” around 0 and takes values between -1 and 1:

(ref: http://mathworld.wolfram.com/HyperbolicTangent.html)

That said, for the hidden layers it is most common to use the ReLU (rectified linear unit) function:

ReLU(z) = max(0, z)

This follows the identity line for positive values and is 0 everywhere else:

(ref https://www.vaetas.cz/blog/introduction-artificial-neural-networks/#relu-function)

A variant of that is the LeakyReLU function:

LeakyReLU(z) = max(0.01*z, z)   (the 0.01 slope is a common choice, not a fixed rule)

This allows a small, non-zero gradient when the unit is not active:

(ref: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
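The four activation functions above are one-liners in code. Here is a small sketch in plain Python. The function names are mine, and the 0.01 slope in leaky_relu is an assumed, commonly used default rather than a required value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # slope is the small gradient kept for negative inputs (assumed default)
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Printing a few values side by side is a quick way to see the differences: only sigmoid is never negative, and only leaky_relu keeps a small non-zero output for negative inputs.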

Okay, so now you know that there are several activation functions that you could use. If you are confused about which one to select, the main take-away here is this:

For the hidden units, never use a linear activation, otherwise the output would be a linear function of the input, which doesn’t do us any good — actually in that case it wouldn’t even matter how many layers we have in the network… which defeats the purpose of having a neural network.
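To see why, here is a tiny sketch with scalar weights (the function names and the numbers are arbitrary, for illustration only). Two stacked layers with a linear activation give exactly the same output as a single linear layer whose weight and bias are obtained by multiplying things out.

```python
def two_linear_layers(x, w1, b1, w2, b2):
    """Two stacked layers with a linear (identity) activation."""
    a1 = w1 * x + b1        # hidden layer, no non-linearity
    return w2 * a1 + b2     # output layer

def single_linear_layer(x, w, b):
    return w * x + b

# Arbitrary illustrative parameters.
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5
w_merged = w2 * w1          # weight of the equivalent single layer
b_merged = w2 * b1 + b2     # bias of the equivalent single layer

for x in (-1.0, 0.0, 4.0):
    assert two_linear_layers(x, w1, b1, w2, b2) == single_linear_layer(x, w_merged, b_merged)
```

The same algebra holds with vectors and matrices, which is why extra layers add nothing unless a non-linear activation sits between them.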

Always use non-linear functions. Researchers in the past have used sigmoid, but recently the ReLU function is the most common for the hidden layers.

For the output layer, instead, the sigmoid function works great if you are doing a binary classification; for a regression problem you can use a linear activation function (since y can take any real value).

_ _ _ _ _ _ _

Parameter initialization:

You saw in the previous computation graphs that, apart from the inputs of the network, there are also the parameters w and b, and you might be wondering what values those should be initialized with.

Great question! The short answer is: random initialization.

Longer answer:

Let’s get the easy one out of the way: the bias b can be initialized to zero. It wouldn’t matter.

For the weights w, instead, things get tricky (or even problematic!) if we don’t initialize them correctly.

In fact, we need to initialize the weights to small random values!

If the weights are too big, learning becomes very slow, because gradient descent can only take tiny steps.

To make this more intuitive, let’s look at the curve of the tanh function. A large input z (in absolute terms) gives an output close to either +1 or -1. Looking at the graph, you see that in that region the slope is very close to zero. Remember what the derivative of a function is? It’s the slope of the tangent to the curve. And the derivative is what we use in gradient descent to update our parameters.

So, a slope ~0 means a derivative ~0, which means a very small parameter update, which means slow learning!
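A few numbers make the point. The sketch below uses the fact that the derivative of tanh(z) is 1 - tanh(z)^2; the helper name tanh_slope and the sample values of z are my own.

```python
import math

def tanh_slope(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 0.5, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   tanh(z) = {math.tanh(z):.6f}   slope = {tanh_slope(z):.6f}")
```

The slope is 1 at z = 0 and shrinks quickly as z grows. Large weights tend to produce large z, which is exactly the flat region we want to avoid.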

On the other hand, if we initialize the weights to 0, all the hidden units are computing the same function during forward propagation (and the same derivatives during back propagation). Therefore, they have the same influence on the output unit. Thus, the hidden units are symmetric, and no matter how many iterations you perform, they still compute the same function, so it wouldn’t matter how many hidden units you have in the network.
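Here is a minimal sketch of both points, written under my own assumptions: the helper names (init_layer, hidden_activations) are hypothetical, and the 0.01 scale is a commonly used value for “small”, not a rule stated above.

```python
import math
import random

def init_layer(n_inputs, n_units, scale=0.01, seed=None):
    """Small random weights (Gaussian, std = scale) and zero biases for one layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [0.0] * n_units
    return weights, biases

def hidden_activations(x, weights, biases):
    """tanh activations of every unit in the layer for one input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

x = [1.0, 2.0]

# All-zero weights: every hidden unit computes exactly the same value.
zero_weights = [[0.0, 0.0] for _ in range(3)]
print(hidden_activations(x, zero_weights, [0.0] * 3))

# Small random weights: the units start out different, so they can learn different things.
weights, biases = init_layer(n_inputs=2, n_units=3, seed=0)
print(hidden_activations(x, weights, biases))
```

With zero weights the three units are indistinguishable, so their gradients are identical too and they stay identical after every update. The random start is what breaks the symmetry.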

_______________________

If you got this far — thank you!

The last piece is to build a Deep Neural Network. Finally!

DEEP NEURAL NETWORK:

You might have predicted by now that a deep neural network is a network with many hidden layers.

The logistic regression we started with was basically just a “shallow” network.

The following is a deep neural network with a total of 4 layers (3 hidden layers and one output layer):

Here is a bit of nomenclature:

As before, we compute the forward propagation, the back propagation and the gradient descent update for each layer “l” in order to find the parameters w and b (in each layer) that minimize the final cost function.
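The forward part of that loop can be sketched in a few lines. As with the earlier examples, this is only an illustration: the function names, the layer sizes and the parameter values are hypothetical, and I am assuming ReLU in the hidden layers and a sigmoid at the output.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, weights, biases, activation):
    """Compute one layer's activations from the previous layer's activations."""
    return [activation(sum(w * a for w, a in zip(row, a_prev)) + b)
            for row, b in zip(weights, biases)]

def deep_forward(x, layers):
    """layers is a list of (weights, biases); ReLU on hidden layers, sigmoid on the output."""
    a = x
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        a = layer_forward(a, weights, biases, sigmoid if is_output else relu)
    return a

# Hypothetical 4-layer network: 2 inputs -> 3 -> 2 -> 2 -> 1 output.
layers = [
    ([[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1]], [0.0, 0.0, 0.0]),
    ([[0.1, 0.2, -0.2], [0.3, -0.1, 0.2]], [0.0, 0.0]),
    ([[0.5, -0.4], [0.2, 0.1]], [0.0, 0.0]),
    ([[0.3, -0.2]], [0.0]),
]
y_hat = deep_forward([1.0, 2.0], layers)[0]
print(y_hat)
```

The activations of one layer simply become the inputs of the next, so adding depth only means adding entries to the layers list.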

In an image recognition problem, a deep neural network works very well. Each layer captures different features; for example, the first layers capture simpler features (such as lines), while the later layers can detect more complex ones.

One final note: in neural-network literature, the term “deep” refers to the depth of the network, that is, the number of hidden layers. The term “small”, instead, refers to the number of hidden units.

There is a theory, called “circuit theory”, that states:

“There are functions you can compute with a small deep neural network that, instead, a shallower network would require exponentially more hidden units to compute.”

A good heads up when you are building your own neural network model!

In the next blog, we’ll discuss the many hyper-parameters that are involved in a deep neural network, such as the learning rate, the number of iterations, the number of hidden layers and units, and the choice of activation function.

Till the next time!

Source: Deep Learning on Medium