Neural networks as non-leaky mathematical abstraction

Source: Deep Learning on Medium

So, mathematics ends up being “leaky” because it has no such thing as an “observation”. The fact that 2 * 3 = 6 is not observed, it’s simply “known”. Or… is it?

The statement 95233345745213 * 4353614555235239 = 414609280180109235394973160907 is just as true as 2 * 3 = 6.

Tell a well-educated ancient Greek, “ 95233345745213 times 4353614555235239 is always equal to 414609280180109235394973160907, this must be fundamentally true within our system of mathematics,” and he will look at you like you’re a complete nut.

Tell that same ancient Greek,“ 2 times 3 is always equal to 6, this must be fundamentally true within our system of mathematics,” and he might think you are a bit pompous, but overall he will nod and agree with your rather banal statement.

To a modern person there is little difference in how “obviously true” the two statements seem, because a modern person has a calculator.

But before the advent of more modern techniques for working with numbers, such calculations were beyond the reach of most if not all. There would be no “obvious” way of checking the truth of that first statement… people would be short 414609280180109235394973160897 fingers.

So really, there is something that serves a similar role to observations in mathematics; that which most can agree on as being “obviously true”. Obviously there’s cases where this concept breaks down a bit (e.g. Ramanujan summation), but these are the exception rather than the rule.

Computers basically allow us to raise the bar for “obviously true” really high. So high that “this is obviously false” brute force approaches can be used to disprove sophisticated conjectures. As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain.

Computers also allow us to build functional mathematical abstraction. Because suddenly we can “use” those abstractions without understanding them. Being able to use a function and understanding what a function does are different things in a modern age, but this is a very recent development.

For the most part computers have been used to run leaky mathematical abstractions. Leaky by design, made for a world where one had to “build up” their knowledge to use them from the most obvious of truths.

However, I think that non-leaky abstractions are slowly showing up to the party, and they have a lot of potential. In my view, the best specimen of such an abstraction is the neural network.

Neural Networks as mathematical abstractions

As far as dimensionality reduction (DR) methods go, an autoencoder (AE) is basically one of the easiest ones me to explain, implement and understand. Despite being quite sophisticated compared to most other nonlinear DR methods.

The way I’d explain it (maybe a bit more rigorous than necessary) is:

1. Build a network that gets the data as an input and has to predict the same data as an output.

2. Somewhere in that network, preferably close to the output layer, add a layer E of size n, where n is the number of dimensions you want to reduce the data to.

3. Train the network.

4. Run your data through the network and extract the output from the layer E as the reduced dimension representation of said data.

5. Since we know from training the network that the values generated by E are good enough to reconstruct the input values, they must be a good representation.

This is a very simple explanation compared to that of most “classic” nonlinear DR algorithms. Even more importantly, it uses no mathematics whatsoever. Or rather, it uses a heap of mathematics, but it’s all abstracted away behind the word “train”.

Consider for a moment the specific case of DR. It could be argued that a subset of AEs are basically performing PCA (1) (2).

In the case of most “real” AEs however, they are nonlinear. We could see them as doing the same thing as a kernel PCA, but removing the need for us to choose the kernel.

AEs are also impressive because they “work well”. Most people working on a hard DR problem will probably use some brand of AE (e.g. a VAE).

Even more than that, AEs are fairly “generic” compared to other nonlinear DR algorithms, and there are a lot of these algorithms. So, one can strive to understand when and why to apply various DR algorithms, or one can strive to understand AEs. The end result will be similar in efficacy when applied to various problems, but understanding AEs is much quicker than understanding a few dozen other DR algorithms.

Even better, in order to understand AEs the only “hard” part is understanding neural networks, but neural networks are a very broad tool so many people may already possess some understanding of them.

To understand neural networks, you need three other concepts: automatic differentiation as used to compute the gradients, loss computation and optimization.

Automatic differentiation is hard, to the extent that I assume most people otherwise skilled in coding and ML wouldn’t be able to replicate Google’s jax or Pytorch’s autograd. Luckily for us, automatic differentiation is very easy to abstract away and explain: “Based on a loss function computed between the real values and your outputs (the error), we use {magic} to estimate how much every weight and bias was responsible for that error and whether it’s influence increased or decreased the error. Furthermore, we can use {magic} to do this efficiently in batches”.

This sort of explanation is not that satisfactory, but I doubt most people go through life acquiring a much deeper understanding than this. I have no strong evidence that this is true, but empirically I notice new papers about new optimizers and loss functions pop up every day on /r/MachineLearning. I can’t remember the last time I saw a paper proposing significant improvements to common autograd methods or scrapping them altogether for something new.

Optimizers are… not that hard. Explaining something like Adam or random gradient descent to someone is fairly intuitive. Since the optimization itself is applied to rather trivial 2d functions and the process is just repeated a lot of times, it’s not that hard to conceptualize how it works. Alas, you can probably construct a “just so” story for why optimizers work similar to the one above, tell people to use AdamW and the world will be just fine.

Loss functions are certainly a trivial concept and if you stick to the heuristic behind basic loss functions and don’t try to factor them into crazy inferences, I’d argue anyone could probably write their own loss functions the same way they write their own training loops.

It’s really tough to argue how much “harder” it is to understand these three concepts to the degree where you can understand how to design neural networks, versus how hard it is to understand the kernel trick. I would certainly say they are probably in the same difficulty and time ballpark.

However, the advantages of neural networks are:

a) You don’t actually have to deeply understand the concept of automatic differentiation and optimization in order to design a network that does roughly what you want.

b) Once you learn the three basic concepts, you have the foundation on which to understand anything neural network related. Be it AEs, RNNs, CNNs, Residual blocks/nets or transformers.