Why Does Deep and Cheap Learning Work So Well?


We discussed this paper in our weekly reading group here.

This paper talks about the effectiveness of deep learning. In particular, it raises the question: why does deep learning work so well? And not only does it work well, it does so with an extremely small number of parameters compared to the huge space of all possible input combinations. The authors refer to this as “Cheap Learning”. We will try to answer some of these questions later in this post.

This paper focuses mainly on the expressibility and efficiency of deep NNs. What do these terms mean? The paper discusses these and other factors that determine the quality of any machine learning model:

  • Expressibility: What class of functions can the neural network express? We know that even a neural network with a single hidden layer and a non-linearity (e.g. sigmoid) can approximate any continuous function to within any desired error ε [Cybenko] (a toy sketch follows this list).
  • Efficiency: How many resources (neurons, parameters, etc.) does the neural network require to approximate a given function? Cybenko’s paper doesn’t give the exact number of hidden units required; it only says that with enough hidden units plus a suitable activation you can approximate any function.
  • Learnability: How rapidly can the neural network learn good parameters for approximating a function? In deep learning, people use different optimizers such as SGD and its variants to iteratively find the best parameters. What are the convergence rates of such optimization processes for different types of models?
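As a toy illustration of the expressibility point above (my own sketch, not code from the paper), a single hidden layer of sigmoid units with randomly chosen input weights, plus a least-squares fit of the output weights, already approximates a smooth 1-D function quite well:

```python
# Toy illustration of expressibility / universal approximation (not from the paper):
# one hidden sigmoid layer with random input weights, output weights fit by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]   # inputs
y = np.sin(x)                                  # target function to approximate

n_hidden = 50
W = rng.normal(scale=2.0, size=(1, n_hidden))  # random input-to-hidden weights
b = rng.normal(scale=2.0, size=n_hidden)       # random hidden biases
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))         # sigmoid hidden activations

w_out, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit only the output layer
y_hat = H @ w_out

print("max abs error:", np.abs(y_hat - y).max())  # small -- typically well below 0.1
```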

The million dollar question: “How can neural networks approximate functions well in practice, when the set of possible functions is exponentially larger than the set of practically possible networks?”

For an 8-bit image of size 1000 × 1000 there are 256¹⁰⁰⁰⁰⁰⁰ possible images. Consider a classification task of estimating p(y|x), where x is the image and y is the output class: an arbitrary function p(y|x) would need on the order of 256¹⁰⁰⁰⁰⁰⁰ parameters, one for every possible image. To put things into perspective, 256¹⁰⁰⁰⁰⁰⁰ is an astronomically large number, far greater than the number of atoms in our universe (about 10⁷⁸).
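A quick back-of-the-envelope check of that scale (a plain Python sketch, not from the paper):

```python
# How many decimal digits does 256**1_000_000 have?
import math

digits = math.floor(1_000_000 * math.log10(256)) + 1
print(digits)  # about 2,408,240 digits, versus only 79 digits for 10**78 atoms
```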

One really interesting connection the paper makes is to quantities that are already well known in physics.

Using Bayes’ theorem for classification, we can write the probability of an image x belonging to a class y as p(y|x):
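$$
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
$$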

The two probabilities on the right-hand side, p(x|y) and p(y), usually enter machine learning through their negative logarithms, which the paper defines (roughly) as below:
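$$
H_y(x) \equiv -\ln p(x \mid y), \qquad \mu_y \equiv -\ln p(y)
$$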

In physics, the negative log-probability −ln p is known as the self-information or surprisal, and H_y(x) is known as the Hamiltonian, i.e. the energy of x given y. With these definitions, the Bayes’ rule expression above can be rewritten as below, where N(x) simply normalizes the distribution over classes:
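$$
p(y \mid x) \;=\; \frac{1}{N(x)}\, e^{-\left[H_y(x) + \mu_y\right]},
\qquad
N(x) \;=\; \sum_{y'} e^{-\left[H_{y'}(x) + \mu_{y'}\right]}
$$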

The paper further simplifies the notation by collecting these quantities over all classes into vectors, as below:
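$$
\mathbf{p}(x) \;=\; \frac{1}{N(x)}\, e^{-\left[\mathbf{H}(x) + \boldsymbol{\mu}\right]}
$$

(the exponential is applied elementwise).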

This looks just like a familiar function, the softmax:
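$$
\boldsymbol{\sigma}(\mathbf{z})_i \;=\; \frac{e^{z_i}}{\sum_j e^{z_j}}
$$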

So finally, the probability vector p(x) can be written as:
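$$
\mathbf{p}(x) \;=\; \boldsymbol{\sigma}\!\left(-\mathbf{H}(x) - \boldsymbol{\mu}\right)
$$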

So the probability vector can be computed from the Hamiltonian vector H(x), and the self-information term μ simply becomes the bias term in the final (softmax) layer of the NN.

So why are Hamiltonians interesting?

Hamiltonians can be expanded as a power series, i.e. a sum of constant, linear, quadratic and higher-order terms. It turns out that many real-world physical phenomena are described well by low-order polynomials, typically of degree 2 to 4. If low-order polynomials are enough to express physics, it is plausible that they are also enough for much of the everyday data we feed into our machine learning problems.
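Written out (roughly following the paper’s expansion), the Hamiltonian as a power series in the components x_i of the data looks like:

$$
H_y(x) \;=\; h \;+\; \sum_i h_i\, x_i \;+\; \sum_{i \le j} h_{ij}\, x_i x_j \;+\; \sum_{i \le j \le k} h_{ijk}\, x_i x_j x_k \;+\; \cdots
$$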

To demonstrate the capacity of NNs, the authors consider multiplication of two numbers u and v. They show that multiplication to arbitrary precision can be done with only 4 neurons.

Only 4 neurons are sufficient to multiply two numbers up to arbitrary precision, which can be controlled by changing λ or β.
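Here is a small numerical sketch of that 4-neuron multiplication gate (my reconstruction of the idea, not the authors’ code). It assumes a smooth non-linearity whose second derivative at 0 is non-zero, so softplus is used below; the names multiply and SIGMA_PP0 are just placeholders chosen for this sketch:

```python
# 4-neuron multiplication gate (reconstruction): for a smooth non-linearity f
# with f''(0) != 0, a Taylor expansion gives
#   f(a) + f(-a) - f(b) - f(-b) ~ f''(0) * (a**2 - b**2),
# so with a = lam*(u+v), b = lam*(u-v) the bracket is ~ 4*f''(0)*lam**2*u*v.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

SIGMA_PP0 = 0.25  # softplus''(0); the logistic sigmoid has second derivative 0 at 0, so it would not work here

def multiply(u, v, lam=1e-2):
    """Approximate u*v with 4 softplus 'neurons'; exact in the limit lam -> 0."""
    s = (softplus(lam * (u + v)) + softplus(-lam * (u + v))
         - softplus(lam * (u - v)) - softplus(-lam * (u - v)))
    return s / (4 * lam**2 * SIGMA_PP0)

print(multiply(3.0, -7.0))            # close to -21 (error ~1e-2)
print(multiply(3.0, -7.0, lam=1e-3))  # closer still; error shrinks like lam**2
```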

It is often found that the simplest explanations are preferred for any given observation (Ockham’s razor). In machine learning, even though the number of possible input combinations is huge, there exist some simple functions that can explain the most common observations we encounter in nature. When we imagine a zebra, we almost certainly imagine black-and-white stripes and not a polka-dotted coat. It is for this reason that we get away with potentially very simple models. This explains the “Cheap Learning” part for deep neural networks. The next question is: why do we need “deep” models?

One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

As we stack more layers in our deep neural network, each performing a relatively simple (mostly local) operation on the activation vector, the overall composed function f(x) gains the ability to represent much more complex functions.
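In symbols (just restating this idea, not a formula taken from the paper), the deep network computes a composition of simpler layer functions,

$$
f(x) \;=\; f_L \circ f_{L-1} \circ \cdots \circ f_1(x),
$$

which mirrors the sequential, hierarchical way the data itself was generated.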

This really was an interesting paper that connects deep learning to perspectives from physics. Please let me know what you think.