Original article was published on Deep Learning on Medium.

In this blog, I will explain how a sigmoid activation can suffer from both the vanishing and the exploding gradient problem. Vanishing and exploding gradients are among the biggest problems that neural networks face. A vanishing gradient leads to slow convergence, while an exploding gradient causes excessively large weight updates, which are usually undesirable. Either problem can prevent a neural network from converging.

Since the discussion centers on the sigmoid activation, let's first look at what the sigmoid function is.

# Sigmoid Function

The sigmoid is a mathematical function with a characteristic 'S'-shaped curve (the sigmoid curve). It can be defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The derivative of this function is:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
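For readers who want to see where the derivative comes from, here is a short derivation sketch. It assumes the standard logistic form $\sigma(x) = 1/(1 + e^{-x})$ and uses only the chain rule:

$$
\begin{aligned}
\sigma'(x) &= \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} \\
&= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\end{aligned}
$$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$.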

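This is the beauty of this function: its derivative can be expressed in terms of the function itself. The derivative is symmetric about the y-axis, and its maximum value is 0.25. Because the sigmoid output s lies strictly between 0 and 1, the product s(1 - s) is largest when s = 0.5, which happens at x = 0. The derivative approaches 0 as x grows large in either direction but never reaches it, so it lies in (0, 0.25]. Let’s see the plot of both sigmoid and its derivative.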
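The properties of the derivative can also be checked numerically. The snippet below is a minimal sketch in plain Python, assuming the standard logistic definition of the sigmoid. The helper names `sigmoid` and `sigmoid_derivative` and the grid from -10 to 10 are illustrative choices, not something taken from the original article.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_derivative(x: float) -> float:
    """Derivative expressed through the sigmoid itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Evaluate the derivative on an illustrative grid from -10 to 10 in steps of 0.1.
xs = [i / 10 for i in range(-100, 101)]
derivs = [sigmoid_derivative(x) for x in xs]

# Locate the peak of the derivative on this grid.
peak_x = xs[derivs.index(max(derivs))]
print(peak_x, max(derivs))  # 0.0 0.25
```

On this grid the derivative peaks at x = 0 with value 0.25 and stays strictly positive. For very large |x|, floating-point rounding can make the computed value exactly 0 even though the true derivative never is.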