Activation Functions in Deep Learning

Original article was published by Ranjit Singh Rathore on Deep Learning on Medium

Hey everyone, I am back with another blog related to deep learning. Have you ever wondered why activation functions are so important and useful in our neural networks, or what would happen if no activation function were used in our network? In this blog we are going to learn about activation functions in deep learning.

We are going to cover the following topics in this blog:

  1. What are activation functions?
  2. Why do we use activation functions?
  3. Types of activation functions
  4. Which activation function to use for our neural network

1. What are Activation Functions?

An activation function is a mathematical gate between the input feeding into the current neuron and its output going to the next layer. It basically decides whether the current neuron should be activated or not.

In simple terms, a neuron calculates a weighted sum of its inputs (Wi · Xi) and adds a bias, and the activation function then decides whether the neuron should be fired or not.
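As a rough illustration, here is a minimal NumPy sketch of a single neuron (the inputs, weights, bias, and the choice of a sigmoid activation are purely illustrative assumptions, not something fixed by this blog):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias...
    z = np.dot(w, x) + b
    # ...passed through the activation function, which decides how strongly the neuron "fires"
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative)
w = np.array([0.4, 0.7, -0.2])   # weights (illustrative)
b = 0.1                          # bias (illustrative)
print(neuron(x, w, b))           # a value between 0 and 1
```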

2. Why do we use Activation Functions?

If we do not have an activation function, then the weights and bias would simply perform a linear transformation.

A linear equation is simple to solve but is limited in its capacity to solve complex problems and has less power to learn complex functional mappings from data.

A neural network without an activation function is just a linear regression model. Therefore neural networks use non-linear activation functions, which help the network learn complex data, compute and approximate almost any function, and provide accurate predictions.

Why use a non-linear activation function?

If we were to use a linear activation function, our neural network would output a linear function of the input. So no matter how many layers our neural network has, it will still behave just like a single-layer network, because composing these layers gives us another linear function, which is not powerful enough to model complex data.

Properties an Activation Function should hold:

Monotonic: the activation function should be monotonically increasing or decreasing, i.e. it should not change direction.

Derivative or differential: the change along the y-axis w.r.t. the change along the x-axis, also known as the slope. The function should be differentiable so that gradients can be computed during backpropagation.

Types of Activation functions

Activation functions are mainly divided into two types:

  • Linear or Identity Activation function
  • Non Linear Activation function

Linear Activation Function

The Linear Activation Function takes the form:

y = Wx + b

It takes the inputs (Xi's), multiplies them by the weights (Wi's) for each neuron, and creates an output proportional to the input. In simple terms, the output is proportional to the weighted sum of the inputs.

As mentioned above, an activation function should hold certain properties, which the linear function fails to satisfy.

Problems with the linear activation function are:

Differential result is constant: the derivative of a linear function is a constant and has no relation to the input. This implies that the weights and bias will still be updated during backpropagation, but the updating factor (the gradient) would be the same regardless of the input.

All layers of the neural network collapse into one: with linear activation functions, no matter how many layers the neural network has, the last layer is still a linear function of the input to the first layer, meaning the whole stack behaves like a single layer.

A Neural Network with Linear Activation function is just a linear regression model.
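To make the collapse concrete, here is a small NumPy sketch (the layer sizes and random values are purely illustrative) showing that two stacked layers with a linear activation are equivalent to a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" whose activation is the identity (i.e. a linear activation)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Passing the input through both layers...
out_two_layers = W2 @ (W1 @ x + b1) + b2

# ...gives exactly the same result as one collapsed linear layer
W = W2 @ W1
b = W2 @ b1 + b2
out_one_layer = W @ x + b

print(np.allclose(out_two_layers, out_one_layer))  # True
```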

Non Linear Activation Function

Modern neural networks use non-linear activation functions, since they allow the model to create a complex mapping between the network's inputs and outputs, which is essential for learning and modeling complex data, such as images, video, audio, and data sets that are non-linear or have high dimensionality.

Non-linear activation functions overcome the problems that occur with linear activation functions:

  1. They allow backpropagation because their derivative is a function of the input.
  2. They allow stacking of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with a high level of accuracy.

Some commonly used Non-Linear Activation functions

Non-linear activation functions are classified on the basis of their range and curves.

1. Sigmoid/Logistic Activation function: The sigmoid function, f(x) = 1 / (1 + e^-x), takes real values as input and maps them to values between 0 and 1. The sigmoid function creates an 'S'-shaped curve and maps the predicted values to probabilities.

  • The output of the sigmoid function ranges from 0 to 1, normalizing the output of each neuron.
  • The derivative of the sigmoid function (f'(x)) lies between 0 and 0.25.
  • The sigmoid activation function is very popular in classification tasks.
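A minimal NumPy sketch of the sigmoid and its derivative (the sample inputs are illustrative) that also confirms the 0-to-1 output range and the 0-to-0.25 range of the derivative:

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); its maximum is 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))             # all values lie between 0 and 1
print(sigmoid_derivative(x))  # all values lie between 0 and 0.25; nearly 0 for large |x|
```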

Advantage of Sigmoid Activation function

  1. The function is differentiable, which means we can find the slope of the sigmoid curve at any point.
  2. Output values bound between 0 and 1, normalizing the output of each neuron.

Disadvantage of Sigmoid Activation function

  1. Vanishing gradient: for very high or very low values of X, there is almost no change in the prediction (the gradient is nearly zero), which causes the vanishing gradient problem.
  2. Due to the vanishing gradient problem, sigmoid has slow convergence.
  3. The output is not zero-centered.
  4. Computationally expensive (it uses the exponential function).

2. Tan-h / Hyperbolic Tangent:

Tanh is a rescaled version of the sigmoid function and hence has similar properties, but its output ranges between -1 and 1.
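One way to see the connection is the identity tanh(x) = 2·sigmoid(2x) - 1, i.e. tanh is a shifted and rescaled sigmoid; here is a quick NumPy check (the inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# tanh is a zero-centered, rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
print(np.tanh(x))  # outputs range between -1 and 1
```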

Advantages of Tan-h Activation function :

  1. Zero centered — making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  2. The function is monotonic and differentiable everywhere.
  3. Works better than sigmoid function.

Disadvantage of Tan-h Activation function:

  1. It also suffers from the vanishing gradient problem and hence converges slowly.

“Tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction”

3. ReLU (Rectified Linear Unit):

ReLU is a non-linear activation function that has gained a lot of popularity in AI. The ReLU function is defined as f(x) = max(0, x).

ReLU Activation function (max(x,0))
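A minimal NumPy sketch of ReLU and its derivative (illustrative inputs); note the zero gradient for negative inputs, which is what causes the dying ReLU problem discussed below:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): passes positive values through, zeroes out negatives
    return np.maximum(0.0, x)

def relu_derivative(x):
    # The gradient is 1 for positive inputs and 0 for negative inputs
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```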

Advantages of ReLU :

  1. Computationally efficient — allows the network to converge very quickly
  2. Non-linear — although it looks like a linear function, ReLU has a derivative function and allows for back-propagation.

Disadvantages of ReLU:

  1. The dying ReLU problem: when inputs are near zero or negative, the gradient of the function becomes zero, so back-propagation cannot update the affected neurons and they stop learning.

4. Leaky ReLU Activation function

The Leaky ReLU function is an improved version of the ReLU function that introduces a small constant slope for negative inputs.

Leaky Relu Activation (blue), Derivative (orange)

Advantage of Leaky ReLU:

  1. Prevents dying ReLU problem — this variation of ReLU has a small positive slope in the negative area, so it does enable back-propagation, even for negative input values.

Disadvantage of Leaky ReLU :

  1. Results are not consistent: leaky ReLU does not provide consistent predictions for negative input values.
  2. During training, if the learning rate is set very high, the updates can overshoot and kill the neuron.

The idea of leaky ReLU can be extended even further. Instead of multiplying x by a fixed constant, we can make the slope a learnable parameter, which often works better than leaky ReLU. This extension of leaky ReLU is known as Parametric ReLU (PReLU).
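A minimal NumPy sketch of Leaky ReLU, with the negative-side slope exposed as an argument so the same function also illustrates the Parametric ReLU idea (the 0.01 default is a common choice, not something fixed by this blog; in PReLU the slope would be learned during training):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through unchanged; negative inputs keep a small slope alpha
    # (alpha is a fixed constant for Leaky ReLU, and a learnable parameter for Parametric ReLU)
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))             # [-0.03  -0.005  0.     0.5    3.   ]
print(leaky_relu(x, alpha=0.2))  # a larger, PReLU-style slope: [-0.6 -0.1  0.   0.5  3. ]
```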

5. Softmax Activation function

The softmax function calculates a probability distribution over 'n' different events (classes).
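A minimal NumPy sketch of softmax over the raw scores (logits) of n classes (the scores are illustrative; subtracting the maximum is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability (the output is unchanged)
    e = np.exp(z - np.max(z))
    # Normalize so all outputs are positive and sum to 1
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])  # raw scores for 3 classes (illustrative)
probs = softmax(logits)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```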

Advantages of Softmax Activation function:

  1. Able to handle multiple classes, whereas most other activation functions handle only one class: softmax normalizes the output for each class between 0 and 1 and divides by their sum, giving the probability of the input belonging to a specific class.
  2. Useful for output neurons — typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.

Which Activation function to use in our Neural Network

So far, we have learned about some of the common activation functions, although many more activation functions are available to us.

But a question arises: which activation function should we use in our neural network?

There are many factors that decide which activation function to use in a neural network, such as:

  • How difficult it is to compute the derivative (and whether the function is differentiable at all).
  • How quickly the network converges with the activation function.
  • How smooth it is, whether it satisfies the conditions of the universal approximation theorem, and whether it preserves normalization.
  • And so on.

But here are some facts and tips which may help you select the activation function for your neural network:

  • The sigmoid activation function and its combinations generally work better for classification problems.
  • Sigmoid and Tanh activation functions are sometimes avoided because of the vanishing gradient problem.
  • The ReLU activation function is widely used and is the default choice, as it gives better results most of the time.
  • ReLU should only be used in the hidden layers of the network.
  • If we encounter dead neurons in our network, then Leaky ReLU is the best choice of activation function.
  • Tanh is avoided most of the time due to the vanishing gradient problem.
  • A linear activation function can be used in the output layer for regression problems.
  • Softmax can be used in the output layer for multiclass classification problems.

Now we have come to the end of this blog. I hope you have all learned a lot about activation functions from it. Thank you very much for investing your time in reading this blog.