Neural Networks | Fundamentals

Source: Deep Learning on Medium

Definition

An artificial neural network (ANN) is a series of algorithms that aims at recognizing underlying relationships in a set of data through a process that mimics the way the human brain operates. Such a system “learns” to perform tasks by analyzing examples, generally without being programmed with task-specific rules.

Global Architecture

Neural networks are organized in various layers:

  • Input layer: the input layer neurons receive the features that describe the problem to be analyzed;
  • Hidden layer: the hidden layer is an intermediate layer allowing neural networks to model nonlinear phenomena. It is said to be “hidden” because it has no direct contact with the outside world. The outputs of each hidden layer are the inputs of the following layer’s units;
  • Output layer: the output layer is the last layer of the network; it produces the result, the prediction.

Perceptron

The perceptron is the first and simplest neural network model, a supervised learning algorithm invented in 1957 by Frank Rosenblatt, a notable psychologist in the field of artificial intelligence.

This network is said to be simple because it only has two layers: an input layer and an output layer. This structure involves only one matrix of weights, and all the units of the input layer are connected to those of the output layer.

The perceptron is a linear classifier for binary predictions; in other words, it can classify or separate the data into two categories.

3D representation of linearly separable data

Perceptron operation

First of all, the simple perceptron takes n input values (x1, x2, …, xn) and is also defined by n+1 constants:

  • n synaptic coefficients (or weights: w1, w2, …, wn);
  • A bias: a special neuron whose output is always equal to 1. Like the other neurons, the bias is connected to the units of the next layer through a weight, usually called the threshold.

Each input value must then be multiplied by its respective weight (wixi), and these products must be added together to obtain a weighted sum. The neuron then generates one of two possible values, depending on whether the weighted sum is below or above the threshold θ.

The weighted sum can be written as a dot product of two vectors, w (weights) and x (inputs), where w · x = ∑wixi; the inequality w · x ≥ θ can then be simplified by moving the threshold θ to the other side.
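
To make this concrete, here is a minimal sketch in Python (NumPy) of the perceptron decision rule just described; the AND-gate weights and threshold are illustrative values, not anything prescribed by the original model:

```python
import numpy as np

def perceptron_predict(x, w, theta):
    """Output 1 if the weighted sum w · x reaches the threshold, else 0."""
    weighted_sum = np.dot(w, x)  # w · x = sum of w_i * x_i
    return 1 if weighted_sum >= theta else 0

# Example: a 2-input perceptron implementing a logical AND
w = np.array([1.0, 1.0])
theta = 1.5
print(perceptron_predict(np.array([1, 1]), w, theta))  # 1
print(perceptron_predict(np.array([1, 0]), w, theta))  # 0
```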

Completing all these steps produces the perceptron architecture: inputs, weights, a weighted sum, a threshold comparison, and a binary output.

A neural network generates a prediction after passing all inputs through all layers, up to the output layer. This process is called forward propagation.

Neural networks work the same way as a perceptron. So, in order to make sense of neural networks, the perceptron must be understood!
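
As an illustration of forward propagation, here is a minimal sketch of a network with one hidden layer. The layer sizes and random weights are arbitrary, and the sigmoid used here is just one common choice of activation (introduced in the next section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Pass the input through the hidden layer, then the output layer."""
    hidden = sigmoid(W1 @ x + b1)       # hidden layer: weighted sums + activation
    output = sigmoid(W2 @ hidden + b2)  # output layer produces the prediction
    return output

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3])                 # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # 4 hidden units -> 1 output
print(forward(x, W1, b1, W2, b2))
```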

Activation Functions

The activation function, also known as the transfer function, is an essential component of the neural network. In addition to introducing the non-linearity concept into the network, it aims at converting the signal entering a unit (neuron) into an output signal (response).

The function’s name comes from its biological equivalent, the “action potential”: the excitation threshold which, once reached, results in a neuron’s response.

It is worth noting that thanks to the bias, it is possible to shift the activation function curve up or down, which means greater learning opportunities for the network.

There are two types of activation functions: linear and non-linear.

Linear Activation Function

This is a simple function of the form f(x) = x. The input passes to the output with little or no modification; this is simply a case of proportionality.

Graphical representation of a linear function

Non-Linear Activation Functions

The non-linear functions are the most commonly used and make it possible to separate data that is not linearly separable. A non-linear equation governs the correspondence between inputs and outputs.

Main non-linear functions:

  • Sigmoid (or logistic function):

The sigmoid function is an “S”-shaped curve that generates an output between 0 and 1, which can be interpreted as a probability.

Sigmoid function definition: σ(z) = 1 / (1 + e^(−z))
Graphical representation of the sigmoid function

Being a “smoother” version, it is preferred to the Heaviside step function, but it is not free of defects. Indeed, the sigmoid function is not zero-centered, so negative inputs might generate positive outputs. Moreover, for large positive or negative inputs the output gets very close to 1 or 0, where the gradient is almost zero: the neurons saturate and learning slows down. Finally, the exponential function makes the process computationally expensive.

  • Hyperbolic Tangent (TanH):

The hyperbolic tangent is a sigmoidal function like the previous one; however, it usually produces better results than the logistic function due to its symmetry. Indeed, the difference is that the result of the TanH function is mapped between -1 and 1. It is generally preferred to the sigmoid function because it is zero-centered. This function is well suited to multilayer perceptrons, especially for hidden layers.

Hyperbolic tangent definition: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Graphical representation of the TanH function

Apart from this aspect, the Tanh function shares the same drawbacks as the Sigmoid function.

  • Rectified Linear Unit (ReLU):

The ReLU function helps solve the saturation problem of the above functions. It is the most frequently used.

ReLU function definition: f(z) = max(0, z)
Graphical representation of the ReLU function

If the input z is negative, the output is 0, whereas if it is positive, the output is z itself. This activation function significantly speeds up the network’s convergence and does not saturate for positive inputs.

However, the ReLU function is not perfect either. If a neuron’s input remains negative, its output (and gradient) stays at 0; consequently, its weights are no longer updated and that part of the network no longer learns.
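
The three functions above take only a few lines of NumPy each; the comments point at the behaviors just discussed (saturation, dying ReLU). A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Saturates: for large |z| the output is ~0 or ~1 and the gradient vanishes.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Zero-centered variant, mapped between -1 and 1; still saturates.
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs ("dying ReLU" risk), identity for positive ones.
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # [~0.007, 0.269, 0.5, 0.731, ~0.993]
print(tanh(z))     # [~-1.0, -0.762, 0.0, 0.762, ~1.0]
print(relu(z))     # [0.0, 0.0, 0.0, 1.0, 5.0]
```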

Why is this activation function needed?

Without a non-linear activation function, an artificial neural network, no matter how many layers it has, will behave like a simple perceptron, because composing linear layers only results in another linear function.
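
A quick numerical check of that claim: composing two linear (activation-free) layers is always equivalent to a single linear layer. The matrices here are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)  # two stacked linear layers, no activation
one_layer = (W2 @ W1) @ x   # one layer with the combined weight matrix
print(np.allclose(two_layers, one_layer))  # True: the extra depth adds nothing
```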

Which function should be used?

Is it a regression or a classification problem? In the latter case, is it binary classification? Ultimately, there is no single best activation function; it depends on the task to be handled.

Cost Function

To learn, the perceptron must know that it has made a mistake, as well as the answer it should have given: this is supervised learning. To do this, it is necessary to use a cost function, whose purpose is to compute the error, in other words to quantify the gap between the prediction y_hat and the expected value y. Training a neural network then consists in minimizing this cost function down to its optimum.

To define the cost function J, the mean squared error can be used:

Mean squared error (MSE): J = (1/m) ∑ (y_hat − y)², summed over the m training examples

where:

  • m is the number of training examples;
  • y is the expected value;
  • y_hat is the predicted value.
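
A direct translation of the formula above, as a minimal sketch (real frameworks provide this as a built-in loss; the example values are made up):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between expected values y and predictions y_hat."""
    return np.mean((y_hat - y) ** 2)

y = np.array([1.0, 0.0, 1.0, 1.0])      # expected values
y_hat = np.array([0.9, 0.2, 0.8, 0.6])  # predicted values
print(mse(y, y_hat))  # (0.01 + 0.04 + 0.04 + 0.16) / 4 = 0.0625
```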

Once the comparison between the prediction and the expected value is made, the information must be returned to the neural network: the error signal makes the return trip through the synapses, and the weights are updated. This is neither more nor less than the reverse path of the forward propagation mentioned earlier. It is called backpropagation.

As mentioned earlier, the purpose of such a machine learning algorithm is to find the combination of weights that minimizes the cost function.


Gradient descent

In a case with many weights, and thus a high-dimensional space, a problem arises: the curse of dimensionality. A naive approach like brute force can no longer be used. It is therefore necessary to use a viable method to minimize the cost function: gradient descent, one of the most popular optimization algorithms.

Let’s assume that the cost function J is convex:

Three-dimensional representation of a convex function

The horizontal axes represent the parameter space, weights and bias, while the cost function J is an error surface above them. The blue circle is the initial cost value. All that remains is to go down, but what is the best way down from here?

To answer that question, some parameters have to be changed, namely the weights and bias. This involves the gradient of the cost function: the gradient vector points in the direction of the steepest ascent, so moving in the opposite direction gives the steepest descent. It is important to know that the input values are fixed, so the weights (and bias) are the only adjustment variables that can be controlled.

Now, imagine that a ball is dropped inside a rounded bucket (the convex function): it just has to reach the bottom of it. This is the optimization. In the case of gradient descent, the ball will have to move left or right in order to optimize its position.

Starting from an initial position, look at the tilt angle in order to draw the tangent at this point: this means computing a derivative. If the slope is negative, the ball goes to the right; if it is positive, it goes to the left.

But something is missing, and not the least of things, since it is a hyperparameter: the learning rate (α). The slope indicates the direction to take, but it doesn’t say how far the ball should go in that direction. This is the learning rate’s role: it determines the size of each step towards a minimum.

Putting everything together, the gradient descent can be defined as follows:

Canonical formula for gradient descent: θ := θ − α ∇J(θ)

where:

  • θ is the model’s parameters (vector of weights);
  • ∇J(θ) is the gradient of the cost function J, in other words the vector containing each of its partial derivatives, obtained by differentiating the function once;
  • α is the learning rate (step size), set before the learning process.
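
Here is a minimal sketch of that update rule on a one-dimensional convex function, J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the function, starting point, and learning rate are arbitrary choices for illustration:

```python
def gradient(theta):
    # Gradient of J(theta) = (theta - 3)^2, which is minimized at theta = 3.
    return 2.0 * (theta - 3.0)

theta = 0.0  # initial position (the "ball" at the top of the bucket)
alpha = 0.1  # learning rate: the size of each step
for _ in range(100):
    theta = theta - alpha * gradient(theta)  # theta := theta - alpha * grad J(theta)
print(theta)  # ~3.0, the minimum of the cost function
```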

Choosing the right value for the learning rate is important because it impacts the learning speed on the one hand, and the chance of reaching the optimum (convergence) on the other. This value, if poorly chosen, may favor two of the main causes of poor performance in predictive models:

  • Overfitting: the algorithm adjusts to the training data set, too well in fact, and that is a problem because it will no longer be able to generalize to new data. With a high learning rate, more distance can be covered at every step, but at the risk of overshooting the lowest point because the slope is constantly changing. Simply put, the loss function fluctuates around the minimum and may even diverge;
  • Underfitting: setting a very low learning rate makes it possible to move confidently in the direction of the negative gradient. A low α is more precise, but computing the slope so often takes a lot of time and leads to painfully slow convergence.
Left: α too small; Middle: decent α; Right: α too big.

With a good learning rate and after a few iterations, an appropriate minimum should be found; the ball can then no longer go down.
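
Reusing the one-dimensional sketch above, the same loop with three different learning rates reproduces the three behaviors in the figure (the values are chosen purely for illustration):

```python
def run(alpha, steps=50, theta=0.0):
    for _ in range(steps):
        theta -= alpha * 2.0 * (theta - 3.0)  # gradient of (theta - 3)^2
    return theta

print(run(0.001))  # too small: still far from 3 after 50 steps
print(run(0.1))    # decent: converges to ~3
print(run(1.1))    # too big: overshoots the minimum and diverges
```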

Finally, the best weights to optimize the network are determined.

This is what gradient descent is all about: knowing each weight’s contribution to the total error of the network, and thus converging towards an optimized weight configuration.

Stochastic gradient descent

There remains, however, a problem: gradient descent needs the cost function to be convex. In other words, the curve lies entirely above each of its tangents, and the derivative of such a function is non-decreasing over its interval. As illustrated earlier, it takes this form: ∪. But what about a function that is not convex?

This time, let’s assume that the cost function J is non-convex:

Three-dimensional representation of a non-convex function

In a case like this, it is no longer enough to follow the steepest slope. The error surface becomes visually more complex to understand and has specific features such as local minima and possible saddle points.

The risk is therefore that iterative algorithms get stuck, drastically slowing backpropagation, and end up at a position that is not the smallest overall value (the global minimum).

In order to overcome this problem, it is possible to use the stochastic gradient descent (SGD).

Mathematical formula for SGD: θ := θ − α ∇J(θ; x(i), y(i)), where (x(i), y(i)) is a single training example

Despite appearances, this one is faster. It produces more fluctuations in the weights, which increases the chances of reaching the global minimum without getting stuck in a local minimum.

As a matter of fact, it is not necessary to load the entire data set into memory and adjust the weights only at the end: stochastic gradient descent does it after each training example, making the process computationally lighter. Furthermore, the standard (or batch) gradient descent is deterministic: if the starting conditions (weights) are always the same, then the result will always be the same as well, whereas the random order of examples makes each SGD run different.
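
A minimal sketch of the per-example update, fitting a single weight by least squares; the synthetic data, learning rate, and epoch count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)  # true weight is ~2

w = 0.0      # initial weight
alpha = 0.05  # learning rate
for epoch in range(20):
    for i in rng.permutation(len(X)):            # random example order each epoch
        grad_i = 2.0 * (w * X[i] - y[i]) * X[i]  # gradient on ONE example only
        w -= alpha * grad_i                      # update after each example
print(w)  # ~2.0
```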
