Source: Deep Learning on Medium
ACTIVATION FUNCTION in DEEP LEARNING
Disclaimer: The content of this post, and of many to come, is drawn from different courses & books; I will list the references at the end.
Why are activation functions important?
Consider a network where there are no non-linear activation functions like sigmoid etc.
Here h2 is the activation: h2 = sigmoid(a2)
Now the question is: what happens if there are no non-linear activation functions in the network?
Without a non-linear function, the network reduces to a composition of linear transformations, which is itself just a single linear transformation.
It can therefore only represent linear relations between x and y, so it will not learn any non-linearity. The Universal Approximation Theorem no longer applies; the network can only learn lines and planes.
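This collapse is easy to see numerically. Below is a toy numpy sketch (shapes and values are arbitrary, chosen only for illustration): two stacked linear layers with no activation in between compute exactly the same function as a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layer weight matrices with no activation function in between
# (shapes and values are arbitrary, for illustration only).
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

deep_output = W2 @ (W1 @ x)     # "two-layer" network, no non-linearity
shallow_output = (W2 @ W1) @ x  # one equivalent linear layer

print(np.allclose(deep_output, shallow_output))  # True
```

No matter how many linear layers we stack, the product of the weight matrices is just one matrix, so the depth buys nothing.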
The representation power of a deep NN is due to its non-linear activation functions
Some popular non-linear activation functions are
Saturation in logistic neuron
Let’s look at the logistic function
The following figure illustrates the logistic function
In this neural network, h1 is the activation
Now let’s understand what a saturated logistic neuron is: a logistic neuron is said to be saturated when its output is stuck at one of its extreme values, which happens for inputs that are very large in the positive or negative direction.
Why do we care about saturated logistic neurons?
When we calculate the gradient w.r.t. a weight associated with a saturated neuron, the saturated neuron’s derivative is (nearly) 0.
Since this derivative appears as a factor in the chain rule for the gradient calculation, it drives the entire gradient to 0.
Due to this, the weights are not updated.
This is called the Vanishing Gradient Problem, because the gradient vanishes or becomes 0 due to the presence of a saturated neuron.
Let’s discuss more about vanishing gradient problem
VANISHING GRADIENT PROBLEM
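A minimal numpy sketch of why saturation kills the gradient: the derivative of the logistic function is σ(x)(1 − σ(x)), which peaks at 0.25 for x = 0 and is vanishingly small at extreme inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, at x = 0

# At extreme inputs the neuron saturates and its derivative vanishes
print(sigmoid_derivative(0.0))    # 0.25
print(sigmoid_derivative(10.0))   # ~4.5e-05
print(sigmoid_derivative(-10.0))  # ~4.5e-05
```

Any weight whose gradient is multiplied by one of these near-zero derivatives effectively stops learning.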
EXPLODING GRADIENT PROBLEM
Similarly, if the per-layer derivatives are large, their product becomes very large, which can prevent convergence.
Suppose the activation function’s derivative can be greater than 1. Multiplying many such values together produces a huge gradient; the weights then never converge and keep changing erratically.
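Both problems come from the same mechanism: the chain rule multiplies one derivative per layer. A toy calculation (with made-up per-layer derivative values and depth) shows how quickly this product shrinks or blows up:

```python
# Effect of multiplying many per-layer derivatives in the chain rule.
# The factors and depth below are made-up illustrative numbers.
vanish_factor = 0.25   # maximum derivative of the logistic function
explode_factor = 1.5   # a hypothetical derivative greater than 1
depth = 50

print(vanish_factor ** depth)   # ~7.9e-31: gradient vanishes
print(explode_factor ** depth)  # ~6.4e+08: gradient explodes
```

With 50 layers, even the *best-case* logistic derivative of 0.25 per layer leaves essentially no gradient at the early layers, while a factor of 1.5 per layer produces an astronomically large update.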
ZERO CENTERED FUNCTION
Why do logistic neurons saturate?
Consider a pre-activation function:
And the activation function
In cases where the weights are initialized to very high or very low values, the weighted summation term (a) will become very large or very small (very negative).
This could result in the neuron attaining saturation
Remember to initialize the weights to small values
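The effect of the initialization scale can be sketched with numpy (the weight scales 0.01 and 10.0 below are made-up values for illustration): small weights keep the pre-activation near 0, on the steep part of the sigmoid, while large weights push it into the flat, saturated regions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)  # normalized inputs
w = rng.standard_normal(100)

a_small = (0.01 * w) @ x  # small weights: pre-activation near 0
a_large = (10.0 * w) @ x  # large weights: pre-activation far from 0

print(sigmoid(a_small))  # near 0.5, on the steep part of the curve
print(sigmoid(a_large))  # typically pushed towards 0 or 1: saturated
```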
Another shortcoming with the logistic function is that it is not zero centered
Zero centered: the function is symmetric around 0, i.e. it takes positive and negative values in equal measure.
The logistic function ranges from 0 to 1
The tanh function is a zero centred sigmoid function
Consider the simple neural network with logistic sigmoid neurons
Why is a function that is not zero centered problematic?
Consider the following gradients
The bracketed terms are common
Both h21 and h22 are outputs of the logistic function, so they are always positive (i.e. ranging from 0 to 1)
Due to this, delta w1 and delta w2 always have the same sign, either both positive or both negative. They cannot differ, since the bracketed part is common to both and the logistic outputs multiplying it are always positive.
The gradients w.r.t all the weights connected to the same neuron are either all +ve or all -ve
Thus, this limits the directions in which the weights can be updated
The figure above shows that the delta w1 and delta w2 vectors can only lie in the all-positive or all-negative quadrant, so the possible update directions are restricted.
Thus, since we cannot move in all directions, we cannot reach the minimum as quickly as possible.
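A small sketch of this sign argument, with made-up values for the hidden outputs and for the common bracketed term (called `delta` here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Outputs of logistic neurons are always positive, whatever the input
h21, h22 = sigmoid(-2.0), sigmoid(3.0)

# 'delta' plays the role of the common bracketed term in the chain rule
for delta in (0.7, -0.7):
    grad_w1 = delta * h21  # gradient w.r.t. w1
    grad_w2 = delta * h22  # gradient w.r.t. w2
    # both gradients always share the sign of delta
    print(np.sign(grad_w1), np.sign(grad_w2))
```

Whatever sign `delta` takes, both gradients take that same sign, so the update vector is confined to two quadrants.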
Also, logistic function is computationally expensive because of e^x
So, we have the following 3 problems with the sigmoid neuron: it saturates (causing vanishing gradients), it is not zero centered, and it is computationally expensive.
— — — — — — — — — — — — — — — — — — —
INTRODUCING TANH & RELU FUNCTIONS
— — — — — — — — — — — — — — — — — — —
What are the other alternatives to the Logistic function?
The following figure illustrates the tanh function
The tanh function ranges from -1 to +1, whereas the logistic function ranges from 0 to 1
It is a zero centered function.
The function saturates at f(x) = -1 or +1: the curve is flat there, so the slope, and therefore the derivative, is zero, and the gradients vanish.
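These two properties are easy to check numerically; the derivative of tanh is 1 − tanh²(x):

```python
import numpy as np

xs = np.array([-10.0, 0.0, 10.0])

print(np.tanh(xs))           # ~[-1, 0, 1]: zero centered, range (-1, +1)
print(1.0 - np.tanh(xs)**2)  # derivative ~[0, 1, 0]: flat at the extremes
```

So tanh fixes the zero-centering problem, but it saturates just like the logistic function.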
tanh is computationally expensive because of e^x
However, since it is zero centered, it is still preferred over the logistic function.
The following figure illustrates the ReLU function
ReLU outputs the input value itself if it is positive, else it outputs zero, i.e. f(1) = 1, f(-1) = 0
It does not saturate in the positive region, since the output grows without bound.
It is not zero centered
Easy to compute (no expensive e^x)
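ReLU is simple enough to define in one line; a minimal numpy sketch:

```python
import numpy as np

def relu(x):
    # max(0, x): identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

print(relu(np.array([1.0, -1.0, 3.5])))
```

Note that the only operation involved is a comparison, which is why it is so much cheaper than sigmoid or tanh.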
Tanh and ReLU Activation Functions
Is there any caveat in using ReLU?
Consider the following deep neural network that uses the ReLU activation function
What happens if b takes on a large negative value due to a large negative update delta b at some point?
Therefore h1=0 [dead neuron]
This means the derivative of h1 w.r.t. a1 is 0.
This zero derivative enters the chain rule when computing the gradient w.r.t. w1, so delta w1 becomes 0 and the weight is not updated, just as with a saturated neuron.
The same applies to delta w2 and delta b: those parameters are not updated either, and b remains a large negative value.
Here, x1 and x2 have been normalized, so they range between 0–1 and are therefore unable to counterbalance any large negative value b
This means that once a neuron has died, it remains dead forever, and weights associated will not get updated, as no new input would be large enough to counter the negative b value.
Thus, there is a very real problem of saturation of a ReLU neuron in the negative region.
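A toy numpy sketch of a dead ReLU neuron (the weights, inputs, and bias below are made-up values for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (np.asarray(x) > 0).astype(float)

w = np.array([1.0, 1.0])    # made-up weights
x = np.array([0.3, 0.8])    # normalized inputs in [0, 1]
b = -20.0                   # large negative bias after a bad update

a1 = w @ x + b              # strongly negative pre-activation
h1 = relu(a1)               # 0: the neuron is dead
grad_w = relu_grad(a1) * x  # all zeros: w will never be updated again
print(h1, grad_w)
```

Since the inputs are bounded in [0, 1], no input can make a1 positive again, so the zero gradient is permanent.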
In practice, if a network has a large number of ReLU neurons and the learning rate is set too high, a large fraction (up to 50%) of them may die during training
How do we take care of the ReLU problem?
– It is advised to initialize the bias to a positive value
– Using other variants of ReLU is recommended. A good alternative is the Leaky ReLU
The following figure illustrates the leaky ReLU function
Leaky ReLU outputs the input value itself if it is positive; otherwise it outputs a small fraction of the input, i.e. f(2) = 2, f(-2) = -0.02
It does not saturate in the positive or negative region
It will not die: the 0.01x slope in the negative region ensures that a small gradient always flows through, so the derivative is never zero and the weights keep being updated.
It is easy to compute (no expensive e^x)
Close to zero centered outputs
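Leaky ReLU is as cheap to implement as ReLU; a minimal numpy sketch with the usual 0.01 negative slope:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identity for positive x, small slope alpha for negative x
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([2.0, -2.0])))
```

The negative branch outputs a small non-zero value (e.g. f(-2) = -0.02), so its derivative is alpha rather than 0 and gradients keep flowing.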