Source: Deep Learning on Medium
The Deep Learning Story: How It Evolved
The universal approximation theorem told us that no matter how complex the relationship between input and output, y = f(x), we can find a neural network whose output (ŷ) is as close to the true output as we want. Backpropagation then told us how to actually train deep neural networks: it is gradient descent applied with the chain rule. Both of these results are from 1989–1991, while gradient descent itself dates back to Cauchy in 1847.
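To make the theorem concrete, here is a toy illustration (my sketch, not from the article): a single hidden layer of sigmoid units with randomly chosen input weights, where only the output weights are fitted by least squares, already approximates a non-trivial function like sin(2πx) quite well.

```python
# Toy demo of universal approximation: one hidden sigmoid layer,
# random input weights, output weights fitted by least squares.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x)                  # the "true" f(x) to approximate

hidden = 50
a = rng.normal(0, 10, hidden)              # random input weights
b = rng.uniform(-10, 10, hidden)           # random biases
H = sigmoid(np.outer(x, a) + b)            # hidden activations, shape (200, 50)

w, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit only the output layer
y_hat = H @ w

print("max error:", np.max(np.abs(y - y_hat)))
```

With just 50 hidden units the fit is already close; the theorem says that by adding enough units, the error can be driven as low as we like.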
There were, however, challenges in training deep neural networks. In backpropagation we apply the chain rule of derivatives, so the gradient for a weight deep in the network, say w₁, is a long product of terms (reconstructing the formula the text alludes to):

∂L/∂w₁ = (∂L/∂ŷ) · (∂ŷ/∂hₙ) · (∂hₙ/∂hₙ₋₁) · … · (∂h₁/∂w₁)
Each quantity in this product is small, so the product itself becomes very small. When we try to update the weight, it changes by only a tiny amount: wherever we are in weight space, we barely move, and training makes no progress. This is the vanishing gradient problem.
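A minimal sketch (my own, not from the article) of why this happens: the derivative of the sigmoid is at most 0.25, so a product of one such term per layer shrinks geometrically with depth.

```python
# Vanishing gradients: multiply one sigmoid-derivative term per layer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
depth = 20
grad = 1.0
for _ in range(depth):
    z = rng.normal()                          # hypothetical pre-activation
    local = sigmoid(z) * (1 - sigmoid(z))     # sigmoid derivative, at most 0.25
    grad *= local

print(f"gradient after {depth} layers: {grad:.2e}")
```

Even in the best case (0.25 per layer), twenty layers leave a factor below 10⁻¹², which is why the early layers of a deep sigmoid network barely learned.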
When people used very deep neural networks in practice, they found them very hard to train: they did not converge, meaning the loss did not drop quickly when the weights were initialized randomly, so training took a very long time. There were also no supercomputers at the time, and memory and other resources were scarce. That is why deep learning was not so popular. Still, some important architectures were developed in this period, such as RNNs, LSTMs, and CNNs.
In 2006, Hinton and others did some interesting work and published a paper showing that it is possible to train a very deep neural network using a technique known as unsupervised pre-training: layers are first trained one at a time on unlabeled data, and the resulting weights initialize the full network before supervised fine-tuning.
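The following is an assumed, simplified sketch of the greedy layer-wise idea (not the exact method of Hinton's paper, which used restricted Boltzmann machines): each layer is trained as a small tied-weight autoencoder on the previous layer's codes, and the learned weights then initialize the stack.

```python
# Greedy layer-wise pre-training with tied-weight sigmoid autoencoders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(data, hidden, steps=200, lr=0.5):
    """Train W so the hidden code reconstructs `data` (tied weights)."""
    n, d = data.shape
    W = rng.normal(0, 0.1, (d, hidden))
    for _ in range(steps):
        h = sigmoid(data @ W)                 # encode
        recon = sigmoid(h @ W.T)              # decode with the same weights
        err = recon - data
        d_recon = err * recon * (1 - recon)   # backprop through decoder sigmoid
        d_h = (d_recon @ W) * h * (1 - h)     # backprop into the code
        grad = data.T @ d_h + d_recon.T @ h   # both uses of the tied W
        W -= lr * grad / n
    return W, sigmoid(data @ W)

X = rng.uniform(size=(100, 8))                # toy unlabeled data
W1, H1 = pretrain_layer(X, 6)                 # layer 1: trained on raw input
W2, H2 = pretrain_layer(H1, 4)                # layer 2: trained on layer-1 codes
# W1, W2 now initialize an 8-6-4 network for supervised fine-tuning.
```

The key point is that each layer gets a sensible starting point from unlabeled data alone, so the subsequent supervised training no longer has to fight random initialization all the way down the stack.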
Once people knew that deep neural networks could be trained, research picked up on several fronts. Better learning algorithms were proposed; earlier we had only plain gradient descent. Better initialization schemes were developed, because experiments showed that neural networks are very sensitive to it: if the weights are not initialized properly, the network does not train properly. Activation functions were studied as well; people analyzed sigmoid, tanh, and other non-linearities and proposed better ones. Techniques like dropout were introduced for better regularization, since without it deep networks tend to overfit the training data. On top of all this, we now have much more data, and compute is far cheaper. Together, these developments revived people's interest in Deep Learning.
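Of the techniques above, dropout is simple enough to sketch in a few lines. This is a minimal illustration of "inverted" dropout (my sketch, not from the article): during training each activation is zeroed with probability p and the survivors are scaled by 1/(1−p), so at test time the layer is just the identity.

```python
# Inverted dropout: zero activations at random during training,
# rescale survivors so no correction is needed at test time.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                    # identity at test time
    mask = rng.random(activations.shape) >= p # keep each unit with prob 1 - p
    return activations * mask / (1.0 - p)

h = np.ones((4, 8))                           # pretend hidden-layer activations
h_train = dropout(h, p=0.5)                   # roughly half the units zeroed, rest doubled
h_test = dropout(h, training=False)           # unchanged
```

Because each training step sees a different random subnetwork, no single unit can rely on any other being present, which is what curbs overfitting.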