Source: Deep Learning on Medium


Disclaimer: The content of this post (and of those to come) is drawn from various courses and books; I will list the references at the end.

Activation Functions

Why are activation functions important?

Consider a network with no non-linear activation functions (such as the sigmoid).

Here h2 is the output of the activation function applied to the pre-activation a2 → h2 = sigmoid(a2)

Now the question is: what happens if there are no non-linear activation functions in the network?

If there is no non-linear function, the output reduces to a composition of linear maps:

y = W3(W2(W1 x)) = (W3 W2 W1) x = W x

So the network can only represent linear relations (lines and planes) between x and y; it will not learn any non-linearity. In this case the Universal Approximation Theorem does not hold.
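To see this collapse concretely, here is a tiny pure-Python sketch (the 2×2 weights and the input are made-up numbers for illustration): composing two linear layers gives exactly the same output as the single matrix W = W2·W1.

```python
# A deep network with no activations collapses to one linear map.
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # layer-1 weights (made up)
W2 = [[3.0, 0.0], [1.0, 1.0]]   # layer-2 weights (made up)
x  = [[1.0], [2.0]]             # input column vector

# y = W2 (W1 x)  ==  (W2 W1) x : one linear transform, regardless of depth
y_deep   = matmul(W2, matmul(W1, x))
y_single = matmul(matmul(W2, W1), x)
print(y_deep == y_single)  # True
```

No matter how many weight matrices we stack, without a non-linearity in between they multiply out to a single matrix.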

The representation power of a deep NN is due to its non-linear activation functions

Some popular non-linear activation functions are

Saturation in logistic neuron

Let’s look at the logistic function: f(x) = 1 / (1 + e^(−x))

The following figure illustrates the logistic function [plot of the sigmoid and its derivative]

In the neural network, h1 is the output of the activation function applied to the pre-activation a1.

Now let’s try to understand what a saturated logistic neuron is. A logistic neuron is said to be saturated when its output flattens at its extreme values (near 0 or 1), which happens for inputs that are very large positive or very large negative.
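A quick way to see saturation numerically (a minimal sketch using only the math module): the derivative σ(x)(1 − σ(x)) peaks at 0.25 at x = 0 and collapses toward 0 as the input moves into the flat regions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the logistic function

print(sigmoid_grad(0.0))   # 0.25 — the maximum, at the center
print(sigmoid_grad(5.0))   # already tiny
print(sigmoid_grad(20.0))  # effectively zero: the neuron is saturated
```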

Why do we care about saturated logistic neurons?

When we calculate the gradient w.r.t. a weight associated with a saturated neuron, that neuron’s derivative is 0, which makes the entire gradient 0

This is because the term associated with the saturated neuron in the chain rule for gradient calculation becomes 0, thus making the entire gradient 0

Due to this, the weights are not updated.

This is called the Vanishing Gradient Problem, because the gradient vanishes or becomes 0 due to the presence of a saturated neuron.

Let’s discuss the vanishing gradient problem in more detail



Similarly, if the derivatives are large, their product becomes very large, which can prevent convergence.

Suppose we have an activation function whose derivative can be greater than 1. Multiplying many such derivatives together produces a huge value; the weights then never converge and keep changing wildly. This is known as the exploding gradient problem.
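Both effects can be sketched in a few lines. Treating each layer as contributing one local derivative factor (a simplification of the full chain rule, with made-up factor values), the product either vanishes or explodes with depth:

```python
# Chain rule across many layers multiplies per-layer derivatives.
# If each local derivative is < 1 the product vanishes; if > 1 it explodes.
def gradient_through_layers(local_deriv, n_layers):
    g = 1.0
    for _ in range(n_layers):
        g *= local_deriv
    return g

print(gradient_through_layers(0.25, 20))  # ~9.1e-13 : vanishes
print(gradient_through_layers(1.5, 20))   # ~3325    : explodes
```

0.25 is the best case for a sigmoid neuron (its maximum derivative), so with sigmoid layers the vanishing side is the one we actually hit.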


Why do logistic neurons saturate?

Consider the pre-activation function

a = w1x1 + w2x2 + … + wnxn + b

And the activation function

h = logistic(a) = 1 / (1 + e^(−a))

In cases where the weights are initialized to very high or very low values, the weighted summation term a will become very large or very small (very negative).

This could result in the neuron attaining saturation

Remember to initialize the weights to small values
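As a toy illustration of why small initial weights matter (fixed inputs of 0.5 and identical weights across 100 connections, chosen only for simplicity): small weights keep the pre-activation near 0, where the logistic neuron is sensitive, while large weights push it straight into saturation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [0.5] * 100                        # 100 normalized inputs (made up)

def neuron_derivative(w_scale):
    a = sum(w_scale * xi for xi in x)  # every weight set to w_scale
    s = sigmoid(a)
    return s * (1.0 - s)               # local gradient of the logistic neuron

print(neuron_derivative(0.01))  # a = 0.5  -> gradient ~0.235, neuron learns
print(neuron_derivative(1.0))   # a = 50.0 -> gradient underflows to 0.0, saturated
```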

Another shortcoming with the logistic function is that it is not zero centered

Zero centered: the function’s outputs are spread out around the 0 point, i.e. it takes both positive and negative values.

The logistic function ranges from 0 to 1

The tanh function is a zero-centered sigmoid function

Consider the simple neural network with logistic sigmoid neurons

Why is a function that is not zero-centered problematic?

Consider the following gradients

The bracketed terms are common

Both h21 and h22 are outputs of the logistic function, so they are always positive (i.e. ranging from 0 to 1)

Due to this, delta w1 and delta w2 always have the same sign — both positive or both negative. They cannot differ in sign, since the bracketed part is common to both and the logistic function outputs are always positive.

The gradients w.r.t all the weights connected to the same neuron are either all +ve or all -ve

Thus, this limits the directions in which the weights can be updated

The figure above shows that the vector (delta w1, delta w2) can only lie in the all-positive or all-negative quadrant, so the possible update directions are restricted.

Thus, gradient descent cannot move directly toward the minimum; it has to zig-zag, which slows convergence.
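A small sketch of this argument (the bracketed chain-rule term and the two neuron outputs here are made-up values): whatever sign the common term takes, both gradients inherit it, because the logistic outputs multiplying it are always positive.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical outputs of two logistic neurons feeding weights w1 and w2.
h21, h22 = sigmoid(0.3), sigmoid(-1.2)   # always in (0, 1), so always positive

common = -0.7   # the shared bracketed chain-rule term (its sign can vary)

grad_w1 = common * h21
grad_w2 = common * h22
# Both gradients carry the sign of `common`: always the same sign.
print(grad_w1 < 0 and grad_w2 < 0)  # True
```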

Also, the logistic function is computationally expensive because of the e^x term

So, we have the following 3 problems with the sigmoid neuron:

— — — — — — — — — — — — — — — — — — —

1. Saturated neurons cause the gradients to vanish

2. Its outputs are not zero-centered

3. It is computationally expensive (because of e^x)

— — — — — — — — — — — — — — — — — — —

What are the other alternatives to the Logistic function?


tanh

The following figure illustrates the tanh function [plot of tanh and its derivative]

The tanh function ranges from -1 to +1, whereas the logistic function ranges from 0 to 1

It is a zero centered function.

The function saturates at f(x) = −1 or +1; at those extremes the curve is flat, so the slope is zero, the derivative goes to zero, and the gradients vanish.

tanh is computationally expensive because of e^x

However, it is still preferred over the logistic function, because its outputs are zero-centered.
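A minimal check of these properties, using the standard identity tanh(x) = 2·sigmoid(2x) − 1 (which is why tanh is called a rescaled, shifted sigmoid):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is the logistic function rescaled from (0, 1) to (-1, 1)
for x in (-2.0, 0.0, 2.0):
    print(x, math.tanh(x), 2 * sigmoid(2 * x) - 1)

print(math.tanh(0.0))  # 0.0 — zero-centered, outputs span (-1, 1)
```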


ReLU

The following figure illustrates the ReLU function [plot of ReLU and its derivative]

ReLU outputs the input value itself if it is positive, else it outputs zero: f(x) = max(0, x), e.g. f(1) = 1, f(−1) = 0

It does not saturate in the positive region: the output grows without bound and the derivative stays at 1.

It is not zero centered

Easy to compute (no expensive e^x)
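The function and its derivative are each a one-liner (a plain-Python sketch):

```python
def relu(x):
    return x if x > 0 else 0.0     # f(x) = max(0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # derivative: 1 in the positive region, else 0

print(relu(1.0), relu(-1.0))            # 1.0 0.0
print(relu_grad(5.0), relu_grad(-5.0))  # 1.0 0.0
```

No exponential anywhere — just a comparison — which is why it is so cheap to compute.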

Tanh and ReLU Activation Functions

Is there any caveat in using ReLU?

Consider the following deep neural network that uses the ReLU activation function

What happens if b takes on a large negative value due to a large negative update delta b at some point?

Then a1 = w1x1 + w2x2 + b < 0, and therefore h1 = ReLU(a1) = 0 [a dead neuron]

Which means derivative of h1 w.r.t. a1 = 0

This zero derivative is involved in the chain rule for computing the gradient w.r.t delta w1

delta w1 becomes 0 leading to the weight not being updated, as in the case of a saturated neuron.

The same applies to delta w2 and delta b: those parameters are not updated either, and hence b remains a large negative value.

Here, x1 and x2 have been normalized, so they lie between 0 and 1 and are therefore unable to counterbalance the large negative value of b

This means that once a neuron has died, it remains dead forever, and weights associated will not get updated, as no new input would be large enough to counter the negative b value.

Thus, there is a very real problem of saturation of a ReLU neuron in the negative region.
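A toy demonstration of a dead neuron (the weights and the large negative bias are made-up numbers): with inputs normalized to [0, 1], the pre-activation can never climb above zero, so both the output and the gradient stay stuck at 0.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

w1, w2, b = 0.5, 0.5, -10.0   # b has been driven to a large negative value

# Normalized inputs: w1*x1 + w2*x2 is at most 1.0, never enough to beat b.
for x1, x2 in [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]:
    a = w1 * x1 + w2 * x2 + b
    print(relu(a), relu_grad(a))   # always 0.0 0.0 -> no gradient, no update
```

Since the gradient is 0 for every input, w1, w2 and b can never change again: the neuron is dead for good.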

In practice, if there is a large number of ReLU neurons, a large fraction (up to 50%) may die during operation if the learning rate is set too high

How do we take care of the ReLU problem?

– It is advised to initialize the bias to a positive value

– Using other variants of ReLU is recommended. A good alternative is the Leaky ReLU

The following figure illustrates the leaky ReLU function

Leaky ReLU outputs the input value itself if it is positive, else it outputs a small fraction of the input value, i.e. f(2) = 2, f(−2) = −0.02

It does not saturate in the positive or negative region

It will not die: the 0.01x slope ensures that at least a small gradient always flows through, so the derivative is never 0 and the weights always get updated.

It is easy to compute (no expensive e^x)

Close to zero-centered outputs
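A minimal sketch of Leaky ReLU with the commonly used slope of 0.01 for negative inputs:

```python
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x     # small non-zero output when x <= 0

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha       # never exactly zero: gradient always flows

print(leaky_relu(2.0))          # 2.0
print(leaky_relu(-2.0))         # -0.02
print(leaky_relu_grad(-10.0))   # 0.01
```

Because the negative-side gradient is alpha rather than 0, the dead-neuron trap of plain ReLU cannot occur: every input leaves at least a small path for updates.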