Different Types of Activation Functions
Sigmoid, tanh, ReLU & Leaky ReLU
The sigmoid function's range is (0, 1).
It is a good choice for the output layer in binary classification, since it maps any input to a value between 0 and 1 that can be read as a probability.
It is often avoided in hidden layers, however: as |x| grows, the curve flattens and the gradient becomes very small (the vanishing-gradient problem), which can slow the learning of our model.
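To make the vanishing gradient concrete, here is a minimal NumPy sketch (the article itself contains no code, so the function names here are my own) that evaluates the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), output strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 (value 0.25) and vanishes for large |x|
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~4.5e-5, nearly zero
```

Note how the gradient at x = 10 is already four orders of magnitude smaller than at the origin, which is exactly why learning stalls for saturated units.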
The tanh function's range is (-1, 1).
In hidden layers it often works better than sigmoid because it is steeper around zero, so the gradient is larger for small inputs and learning is faster.
Since it outputs values between -1 and 1, it is not suitable as the last-layer activation for binary classification, where we need an output between 0 and 1.
Moreover, just like the sigmoid function, its gradient becomes very small for large |x|, which again slows learning.
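The same saturation behaviour can be checked numerically; a small sketch (function names are my own, not from the article):

```python
import numpy as np

def tanh_grad(x):
    # derivative: d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

# Steeper than sigmoid near zero: the gradient is 1.0 at x = 0
# (versus 0.25 for sigmoid), but it still vanishes for large |x|
print(np.tanh(0.0), tanh_grad(0.0))  # 0.0 1.0
print(tanh_grad(5.0))                # ~1.8e-4
```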
Nowadays, the ReLU activation is the most widely used: its gradient is significant and stays constant at 1 for positive inputs, no matter how large they are.
For negative values, however, the gradient drops to 0, which can make learning significantly slow. So ReLU works best when most of the input values for a given layer are positive.
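A minimal sketch of ReLU and its gradient (again, my own code, not the article's):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for positive inputs, 0 for negative ones
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 100.0])
print(relu(x))       # [  0.    0.    0.5 100. ]
print(relu_grad(x))  # [0. 0. 1. 1.]  <- stays 1 even for very large inputs
```

The gradient of 1 for all positive inputs is what avoids the vanishing-gradient issue of sigmoid and tanh; the flat 0 on the negative side is the weakness the next variant addresses.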
Leaky ReLU is a modified version of the ReLU function.
For negative values it has a small slope instead of a flat zero, which keeps the gradient from dropping to 0, so the learning of the model is not much affected for those inputs.
This is not a major drawback as such, but the slope of the function in the negative region is a hyperparameter that sometimes needs to be fine-tuned.
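A sketch of Leaky ReLU, assuming the common default slope of 0.01 for the negative region (the article does not fix a value, so treat `alpha` as the tunable hyperparameter it mentions):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small slope alpha for negative inputs keeps the gradient nonzero
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # gradient is 1 for positive inputs, alpha (not 0) for negative ones
    return np.where(x > 0, 1.0, alpha)

x = np.array([-10.0, -1.0, 2.0])
print(leaky_relu(x))       # [-0.1  -0.01  2.  ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.  ]
```

Unlike plain ReLU, the gradient on the negative side is 0.01 rather than 0, so units that receive negative inputs can still learn, just more slowly.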