Original article was published by MBenedetto on Deep Learning on Medium
The error on the test set is still around three times larger than the irreducible error estimated in Part I from the classical Machine Learning solution, so there is still room for improvement.
A two-layer network
Adding more layers to make a deeper network does not, in principle, increase the approximation power over a single-layer NN: by the universal approximation theorem, one hidden layer can approximate any continuous function provided it is wide enough. Why use more layers, then? It turns out that a deeper network uses its weights exponentially more efficiently than a shallow, wide one. Intuitively, composing non-linearities makes it easier to fit complex behavior than merely linearly combining ReLUs, even though in theory both approaches can achieve an arbitrarily small approximation error.
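A classical way to see this efficiency gap is the tent-map construction used in depth-separation results: a tiny "hat" function built from two ReLUs, composed with itself k times, yields a piecewise-linear function with 2^k pieces, while a single hidden layer would need on the order of 2^k units to produce the same number of pieces. A minimal sketch (the grid and depth are arbitrary choices for illustration):

```python
import numpy as np

def hat(x):
    # Tent ("hat") function built from two ReLUs:
    # hat(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1].
    relu = lambda z: np.maximum(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

# Composing the hat with itself k times is a depth-k ReLU network with
# only two units per layer, yet it produces 2**k linear pieces.
k = 4
xs = np.linspace(0.0, 1.0, 1025)  # dyadic grid, so the arithmetic is exact
y = xs.copy()
for _ in range(k):
    y = hat(y)

# Count the linear pieces by counting sign changes of the discrete slope.
slope_signs = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(np.diff(slope_signs))
print(pieces)  # 2**4 = 16 linear pieces from a handful of weights per layer
```

Matching those 16 pieces with one hidden layer would require roughly 16 ReLU units, and the gap doubles with every extra composed layer.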
Keeping track of an explicit expression for the fit that the model produces becomes unwieldy very quickly, as this small (3, 2) network shows:
The difference between a deeper and a wider approach is made clear by the following graphics, which compare a 1-layer network with 15 units against a model with 2 layers of 5 units each. The number of parameters is the same for both (46).
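The parameter counts can be verified with a quick helper, assuming dense layers with biases and a scalar input and output:

```python
def n_params(sizes):
    # For each dense layer: (inputs x outputs) weights plus one bias
    # per output unit.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(n_params([1, 15, 1]))    # wide: one hidden layer of 15 units -> 46
print(n_params([1, 5, 5, 1]))  # deep: two hidden layers of 5 units -> 46
```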