Complete Guide to Activation Functions



A practical guide to the benefits, problems, and comparison of activation functions such as sigmoid, tanh, ReLU, Leaky ReLU, and Maxout


Have you ever been in a dilemma about which activation function to use after a layer? What are the pros and cons of that choice? Is it necessary to use an activation function at all? What is the logic behind using a particular activation function in a given place?

I’ll try to answer all of these questions below. Still, if you have any other doubts related to activation functions, post them in the comments section.

Why We Need Activation Functions

Ideally, we would like to provide a set of training examples and let the computer adjust the weights and biases in such a way that the errors produced in the output are minimized.

Now let’s suppose we have some images that contain humans and others that do not. While the computer processes these images, we would like our neurons to adjust their weights and biases so that fewer and fewer images are wrongly recognized. This requires that a small change in the weights (and/or biases) causes only a small change in the output.

Unfortunately, a network of plain perceptrons does not show this little-by-little behavior. A perceptron outputs either 0 or 1; that is a big jump, and it does not help learning. We need something different, something smoother: a function that progressively changes from 0 to 1 with no discontinuity.

Mathematically, this means that we need a continuous function that allows us to compute the derivative.

Hence, we add “activation functions” for this purpose: to take the value produced by a neuron and decide whether outside connections should consider this neuron as “fired”, or rather “activated”, or not.

Why is the derivative/differentiation used?

When updating the weights during training, we need to know in which direction and by how much to change them; the slope (the derivative) tells us exactly that.

Sigmoid or Logistic Activation Function

The sigmoid function is defined as sigmoid(x) = 1 / (1 + e^(-x))

The sigmoid activation function translates input in the range (-Inf, +Inf) to output in the range (0, 1).

Terminology alert: a more generalized logistic activation function, used for multiclass classification, is called the softmax function.

The softmax is a popular choice for the output-layer activation:

  • It mimics one-hot encoded labels better than raw values do.
  • If we used absolute (modulus) values we would lose information, while the exponential intrinsically takes care of this.
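As a rough sketch (NumPy is assumed here and the function names are purely illustrative), sigmoid and softmax can be written as:

    import numpy as np

    def sigmoid(x):
        # Squashes any real input into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(z):
        # Subtract the max for numerical stability, then normalize the
        # exponentials so the outputs form a probability distribution.
        z = z - np.max(z)
        e = np.exp(z)
        return e / np.sum(e)

    print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.12, 0.50, 0.88]
    print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.09, 0.24, 0.67]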

Problems With Sigmoid Function

  • The exp( ) function is computationally expensive.
  • It suffers from the vanishing gradient problem (explained below)
  • It is not useful for regression tasks either; a simple linear unit, f(x) = x, should be used there

Comparing the tanh and sigmoid functions

tanh Activation Function

The tanh function is defined as tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), which is just a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1

  • It is nonlinear in nature, so we can stack layers
  • It is bounded to the range (-1, 1)
  • The gradient is stronger for tanh than for sigmoid (the derivatives are steeper)
  • Like sigmoid, tanh also has a vanishing gradient problem.

In practice, optimization is easier with tanh, so it is generally preferred over the sigmoid function for hidden layers. It is also common to use the tanh function in state-to-state transition models (recurrent neural networks).
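A minimal NumPy sketch (illustrative only) showing tanh written from its definition, and the fact that it is just a rescaled sigmoid:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Equivalent to np.tanh(x); spelled out to show the definition
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    x = np.array([-2.0, 0.0, 2.0])
    print(tanh(x))                    # ~[-0.96, 0.00, 0.96], bounded in (-1, 1)
    print(2 * sigmoid(2 * x) - 1)     # same values: tanh(x) = 2*sigmoid(2x) - 1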

What is the Vanishing Gradient Problem?

We have seen that both the tanh and sigmoid functions suffer from something called the vanishing gradient problem. Let’s try to understand it.

  • The problem of vanishing gradients arises from the nature of the backpropagation optimization
  • Gradients tend to get smaller and smaller as we move backward through the layers
  • This implies that neurons in the earlier layers learn very slowly compared to neurons in the last layers

The vanishing gradient problem results in lower prediction accuracy and in models that take a long time to train.
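A rough numerical illustration (a toy chain of sigmoid layers, not a real training setup): the derivative of the sigmoid is at most 0.25, and backpropagation multiplies one such factor per layer, so even in the best case the gradient reaching the early layers shrinks geometrically.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

    print(sigmoid_grad(0.0))  # 0.25, the best case

    # One factor of at most 0.25 per sigmoid layer during backpropagation:
    for depth in (1, 5, 10, 20):
        print(depth, 0.25 ** depth)
    # 1: 0.25 | 5: ~9.8e-04 | 10: ~9.5e-07 | 20: ~9.1e-13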

ReLU Activation Function

The ReLU (Rectified Linear Unit) function is defined as f(x) = max(0, x)

ReLU is linear (identity) for all positive values, and zero for all negative values
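A one-line NumPy sketch of ReLU (illustrative only):

    import numpy as np

    def relu(x):
        # Identity for positive inputs, zero for negative inputs
        return np.maximum(0.0, x)

    print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]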

Benefits of ReLU

  • It is cheap to compute, as there is no complicated math, and hence easier to optimize
  • It converges faster: it accelerates the convergence of SGD compared to sigmoid and tanh (by around 6 times)
  • It does not have the vanishing gradient problem of tanh or sigmoid, since the gradient is constant for positive inputs
  • It is capable of outputting a true zero value, allowing the activations of hidden layers to contain one or more exact zeros; this is called representational sparsity

Problems with ReLU

  • The downside of being zero for all negative values is that once a neuron lands in the negative region it outputs zero, receives zero gradient, and is unlikely to recover. This is called the “dying ReLU” problem
  • If the learning rate is too high, the weights may jump to values at which the neuron is never activated (and hence never updated) for any data point again
  • ReLU is generally not used in RNNs because its outputs are unbounded, so the activations are far more likely to explode than with units that have bounded values

ReLU is used only within the hidden layers of neural network models. For the output layer we use softmax, tanh, or a linear activation instead.

Variants Of ReLU

Leaky ReLU

Leaky ReLUs attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU applies a small slope (for example 0.01) to negative inputs.
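A minimal sketch (the slope of 0.01 is just a common default, not part of the definition):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # A small non-zero slope for negative inputs keeps the gradient alive
        return np.where(x > 0, x, alpha * x)

    print(leaky_relu(np.array([-3.0, 0.0, 2.0])))  # [-0.03  0.    2.  ]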

PReLU

PReLU (Parametric ReLU) gives the neurons the ability to learn which slope is best in the negative region. With particular values of α it reduces to a ReLU (α = 0) or a leaky ReLU (a small fixed α).
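PReLU uses the same formula as the leaky ReLU above, except that α is a parameter learned by backpropagation instead of a fixed constant. A sketch of the forward pass and of the gradient with respect to α (assuming a single shared α; real frameworks handle the bookkeeping for you):

    import numpy as np

    def prelu(x, alpha):
        return np.where(x > 0, x, alpha * x)

    def prelu_grad_alpha(x):
        # d(output)/d(alpha) is x for negative inputs and 0 otherwise;
        # this is what allows alpha itself to be trained.
        return np.where(x > 0, 0.0, x)

    x = np.array([-2.0, 1.0, 3.0])
    print(prelu(x, alpha=0.25))   # [-0.5  1.   3. ]
    print(prelu_grad_alpha(x))    # [-2.  0.  0.]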

ELU

The Exponential Linear Unit (ELU) has been reported to give better classification results than the traditional ReLU. It follows the same rule as ReLU for x >= 0, and for x < 0 it takes the value α(e^x − 1), which curves smoothly from −α up toward zero.

ELU tries to push the mean activation closer to zero, which speeds up training.
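A sketch of ELU (α = 1.0 is the usual default; illustrative only):

    import numpy as np

    def elu(x, alpha=1.0):
        # Identity for x >= 0; smooth exponential approach to -alpha for x < 0
        return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

    print(elu(np.array([-3.0, -1.0, 0.0, 2.0])))  # ~[-0.95 -0.63  0.    2.  ]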

Maxout Activation

The Maxout activation function (with two linear pieces) is defined as max(w1·x + b1, w2·x + b2)

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function.

  • It is a piecewise-linear function that returns the maximum of several affine functions of the input, designed to be used in conjunction with the dropout regularization technique. A sketch follows this list.
  • Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
  • However, it doubles the number of parameters for each neuron, and hence a higher total number of parameters needs to be trained.
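A sketch of a single Maxout unit with two linear pieces (the weights here are random placeholders; in practice they are learned):

    import numpy as np

    rng = np.random.default_rng(0)

    def maxout(x, W, b):
        # W has shape (k, d) and b has shape (k,): k affine pieces over a
        # d-dimensional input. The unit outputs the maximum of the k pieces.
        return np.max(W @ x + b)

    d, k = 4, 2               # two pieces -> twice the parameters of one linear unit
    W = rng.normal(size=(k, d))
    b = rng.normal(size=k)
    x = rng.normal(size=d)
    print(maxout(x, W, b))

    # Setting one piece to the constant zero function, max(w1·x + b1, 0),
    # recovers a ReLU unit, which is why ReLU is a special case of Maxout.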

Best Practices

  • Use ReLU for hidden-layer activations, but be careful with the learning rate and monitor the fraction of dead units.
  • If ReLU is giving problems, try Leaky ReLU, PReLU, or Maxout. Do not fall back to sigmoid.
  • Normalize the data to achieve higher validation accuracy, and standardize it if you need results faster.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
