Vanishing and Exploding Gradients


In this blog, I will explain how a sigmoid activation can suffer from both the vanishing and the exploding gradient problem. Vanishing and exploding gradients are among the biggest problems a neural network faces. A vanishing gradient makes learning extremely slow because the weights barely change, while an exploding gradient causes excessively large weight updates, which is usually undesirable. Because of these two problems, the network may fail to converge.

This blog is about vanishing and exploding gradients with the sigmoid activation. So, let's first discuss what a sigmoid function is.

Sigmoid Function

The sigmoid is a mathematical function that has an 'S'-shaped curve (also called a sigmoid curve). It can be defined as:

Figure — 2: Sigmoid Function
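For reference, Figure 2 shows the standard logistic sigmoid:

\sigma(x) = \frac{1}{1 + e^{-x}}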

The derivative of this function will be —

Figure — 3: The derivative of Sigmoid Function
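In symbols, the derivative in Figure 3 is the standard identity:

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))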

This is the beauty of this function: its derivative can be written in terms of the function itself. The derivative is symmetric about the y-axis and reaches its maximum value of 0.25 at x = 0, so the derivative of the sigmoid lies in (0, 0.25]. Let's see the plot of both the sigmoid and its derivative.

Figure — 4: The sigmoid function and its derivative
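Here is a minimal sketch that reproduces this plot and checks the maximum of the derivative, assuming NumPy and Matplotlib are available (the function names are my own, not from the article):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative expressed in terms of the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 500)
plt.plot(x, sigmoid(x), label="sigmoid(x)")
plt.plot(x, sigmoid_derivative(x), label="sigmoid'(x)")
plt.legend()
plt.show()

print(sigmoid_derivative(0.0))  # 0.25 -- the maximum value of the derivative
```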

Vanishing Gradient

With a vanishing gradient, the gradient becomes very small (close to 0), which leads to a very small change in the weights, or almost none at all. No change in the weights (i.e. gradient = 0) is a termination condition for training, but this is NOT a converged solution. Let's see how and why this vanishing gradient happens.
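To see the effect of the 0.25 bound numerically, here is a minimal sketch (my simplification, ignoring the weight factors in the chain rule): the gradient reaching an early layer contains one sigmoid-derivative factor for every layer it passes through, so it is scaled by at most 0.25 per layer.

```python
# Each sigmoid layer contributes a derivative factor of at most 0.25,
# so a gradient flowing backward through n layers is scaled by <= 0.25**n
# (weight factors are ignored in this illustration).
max_sigmoid_grad = 0.25
for depth in (1, 5, 10, 20):
    print(depth, max_sigmoid_grad ** depth)
# 1 0.25
# 5 0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
```

Even at a modest depth of 20 layers, the best-case scaling factor is already around 1e-12, which is why the early layers barely learn.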

Let's see how data flows through a neuron:

Here, x is the input to the neuron and O is the output of the neuron. In the same way, we can stack many such neurons into layers to define a deeper architecture. A small code sketch of this forward pass is given below, followed by a figure showing the neural network architecture.
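As a minimal sketch of this data flow, here is a single sigmoid neuron in NumPy; the weights w and bias b are hypothetical values for illustration, not taken from the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    # Pre-activation: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # O: output of the neuron after the sigmoid activation
    return sigmoid(z)

# Hypothetical example values
x = np.array([0.5, -1.2])   # x: input to the neuron
w = np.array([0.8, 0.3])    # weights (illustrative)
b = 0.1                     # bias (illustrative)
O = neuron_forward(x, w, b)
print(O)
```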