Neural network in C++ from scratch and backprop-free optimizer

The original article was published by Hyugen AI in Deep Learning on Medium.


4. Exotic backpropagation-free optimizer: ShakingTree optimizer

Here I’ll introduce a new kind of exotic optimizer which I named the “ShakingTree” optimizer.

Why “ShakingTree”? Because we’re going to randomly change the weights, as if we were shaking the computation graph like a tree, or randomly shaking some branches so that only the weaker ones get replaced by better ones.

Theory

The goal is of course to do better than the backpropagation algorithm. How does backprop work? (1) do the forward pass, (2) compute the loss, (3) do the backward pass to compute the gradients, (4) update the weights and loop.
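
To make that loop concrete, here is a tiny self-contained C++ stand-in: a 2-2-1 network with sigmoid activations trained by plain backprop on a single toy sample. It only illustrates steps (1) to (4); the network, data and learning rate are made up for this example and are not the implementation from the earlier sections of this article.

```cpp
// Minimal illustration of the backprop loop: forward, loss, gradients, update.
// All values here are toy values chosen only for the example.
#include <cmath>
#include <cstdio>

static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

int main() {
    // weights: hidden layer (2x2 + 2 biases) and output layer (2 weights + 1 bias)
    double w1[2][2] = {{0.1, -0.2}, {0.3, 0.4}}, b1[2] = {0.0, 0.0};
    double w2[2] = {0.5, -0.6}, b2 = 0.0;
    const double x[2] = {1.0, 0.5};   // toy input
    const double target = 1.0;        // toy label
    const double lr = 0.5;

    for (int iter = 0; iter < 1000; ++iter) {
        // (1) forward pass
        double h[2], a[2];
        for (int j = 0; j < 2; ++j) {
            h[j] = b1[j] + w1[j][0] * x[0] + w1[j][1] * x[1];
            a[j] = sigmoid(h[j]);
        }
        double out = sigmoid(b2 + w2[0] * a[0] + w2[1] * a[1]);

        // (2) compute the loss (squared error)
        double loss = 0.5 * (out - target) * (out - target);

        // (3) backward pass: the chain rule gives the gradients
        double dOut = (out - target) * out * (1.0 - out);
        double dW2[2], dH[2];
        for (int j = 0; j < 2; ++j) {
            dW2[j] = dOut * a[j];
            dH[j] = dOut * w2[j] * a[j] * (1.0 - a[j]);
        }

        // (4) update the weights, then loop
        for (int j = 0; j < 2; ++j) {
            w2[j] -= lr * dW2[j];
            b1[j] -= lr * dH[j];
            for (int i = 0; i < 2; ++i) w1[j][i] -= lr * dH[j] * x[i];
        }
        b2 -= lr * dOut;
        if (iter % 200 == 0) std::printf("iter %d loss %.6f\n", iter, loss);
    }
    return 0;
}
```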

Can you change the weights of a neural network without doing a forward pass to compute the loss? Well, at some point you need to evaluate what you’re doing and how it impacts the loss, so (1)+(2) stay; the main thing I wanted to remove is the gradient computation (3).

But I still need a way to change the weights. OK, some people have tried genetic algorithms; they work, but they’re quite random, and of course the more weights there are, the harder it gets to find the right values…but that’s also true with vanishing gradients. We know that gradients give a great estimation of how to change the weights of a neural network. The main idea of ShakingTree is to try to approximate the gradients (or something that would be at least as good as the gradients) from random values, without explicitly computing the gradients.

So, is it worth it to quickly try random values? Or is it better to slowly compute gradients? Is there a way to be quick and compute “random” values that end up being better than the gradients? That’s what I tried to answer.

I tried two methods, a “basic” one and a “complex” one.

  1. Basic method
  • Evaluate the score with current weights on a part of the dataset
  • Save current weights
  • Apply new random weights (amplitude, distribution, number of weights, etc.: many hyperparameters)
  • Re-evaluate the score on (another?) part of the dataset
  • If the new score is worse, re-apply the old weights (see the C++ sketch after this list)
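
As a concrete illustration, here is a minimal C++ sketch of one iteration of the basic method. It assumes the network’s weights are exposed as a flat std::vector<double> and that an evaluate callback returns the loss on a part of the dataset; the names and default hyperparameters are illustrative, not taken from the actual implementation.

```cpp
// One iteration of the basic ShakingTree method (sketch): perturb a few weights
// at random, keep the change only if the evaluated loss did not get worse.
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

void shakingTreeBasicStep(std::vector<double> &weights,
                          const std::function<double(const std::vector<double> &)> &evaluate,
                          std::mt19937 &rng,
                          double amplitude = 0.1,
                          std::size_t numWeightsToShake = 1) {
    std::uniform_int_distribution<std::size_t> pick(0, weights.size() - 1);
    std::normal_distribution<double> noise(0.0, amplitude);

    double oldScore = evaluate(weights);                 // score with current weights
    std::vector<double> backup = weights;                // save current weights

    for (std::size_t k = 0; k < numWeightsToShake; ++k)  // apply random changes
        weights[pick(rng)] += noise(rng);

    double newScore = evaluate(weights);                 // re-evaluate (same or another batch)
    if (newScore > oldScore)                             // worse loss: restore the old weights
        weights = backup;
}
```

Re-evaluating on another part of the dataset, as suggested above, only requires the evaluate callback to sample a different batch on each call.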

  2. Complex method

  • Evaluate the score with current weights
  • Generate a delta to be applied to all weights (this delta is meant to act like the gradients)
  • Apply the weight delta, evaluate it, and save the delta score together with the weight delta
  • Now you technically have samples of a kind of function f(delta_weight) = delta_score (that’s what the gradients aim to estimate!)
  • Every N iterations, average the delta_weights that reduced the score (delta_score < 0) and apply that averaged delta (see the sketch below)
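
Here is a minimal sketch of one possible reading of the complex method, with the same flat-weight and evaluate-callback assumptions as above: each candidate delta is scored and then reverted, and every N iterations the deltas that reduced the loss are averaged and applied for real.

```cpp
// Complex ShakingTree method (sketch): score random deltas, keep the ones that
// reduced the loss, and apply their average every N iterations.
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

void shakingTreeComplex(std::vector<double> &weights,
                        const std::function<double(const std::vector<double> &)> &evaluate,
                        std::mt19937 &rng,
                        int totalIterations = 1000,
                        int N = 10,
                        double amplitude = 0.01) {
    std::normal_distribution<double> noise(0.0, amplitude);
    std::vector<std::vector<double>> goodDeltas;              // deltas with delta_score < 0

    for (int it = 1; it <= totalIterations; ++it) {
        double baseScore = evaluate(weights);                 // score with current weights

        std::vector<double> delta(weights.size());            // generate a delta for all weights
        for (std::size_t i = 0; i < weights.size(); ++i) delta[i] = noise(rng);

        for (std::size_t i = 0; i < weights.size(); ++i) weights[i] += delta[i];
        double deltaScore = evaluate(weights) - baseScore;    // f(delta_weight) = delta_score
        for (std::size_t i = 0; i < weights.size(); ++i) weights[i] -= delta[i];

        if (deltaScore < 0.0) goodDeltas.push_back(delta);    // keep deltas that reduced the loss

        if (it % N == 0 && !goodDeltas.empty()) {             // every N iterations, average and apply
            for (std::size_t i = 0; i < weights.size(); ++i) {
                double avg = 0.0;
                for (const std::vector<double> &d : goodDeltas) avg += d[i];
                weights[i] += avg / goodDeltas.size();
            }
            goodDeltas.clear();
        }
    }
}
```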

That’s all for the theory. Now that you know the idea, what are the results?

In practice

The dataset contains two 2D Gaussian distributions, so the input is of size 2 and the output of size 1 (< 0.5 means the first Gaussian, ≥ 0.5 the second one), which is far from any real-world use case for a deep learning algorithm. I don’t tune the hyperparameters much, meaning I’m pretty sure any of these curves could end up higher or lower than the others with tuning, but since we’re not in a real-world use case, it doesn’t really matter. The neural network contains five 10-neuron hidden layers with sigmoid activations (again, far from a real-world use case).
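
For reference, here is a minimal sketch of how such a toy dataset can be generated with the standard <random> header; the means, standard deviations and sample counts are illustrative, not the exact values behind the plots below.

```cpp
// Toy dataset sketch: two 2D Gaussians, labeled 0 and 1.
#include <cstddef>
#include <random>
#include <vector>

struct Point2D { double x, y; double label; };

std::vector<Point2D> makeTwoGaussians(std::size_t nPerClass, std::mt19937 &rng) {
    std::normal_distribution<double> g0x(-1.0, 0.5), g0y(-1.0, 0.5);  // first Gaussian, label 0
    std::normal_distribution<double> g1x(+1.0, 0.5), g1y(+1.0, 0.5);  // second Gaussian, label 1

    std::vector<Point2D> data;
    data.reserve(2 * nPerClass);
    for (std::size_t i = 0; i < nPerClass; ++i) {
        data.push_back({g0x(rng), g0y(rng), 0.0});
        data.push_back({g1x(rng), g1y(rng), 1.0});
    }
    return data;
}
```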

The plots:

Loss on test set = f(seconds) on a dedicated machine

Why is it noisy? Because it’s random, so it can sometimes quickly get to bad results. The fact that it goes from bad results back to good results partially shows how easy the task is (but changing the final output layer’s parameters can also quickly give bad results, and changing them back can quickly give good results again). minimizeBasicLarger2 and minimizeBasic are the same basic method of ShakingTree run with two different hyperparameter configurations. Backpropagation is in blue and the complex version of ShakingTree is in red.

The first hyperparameters I chose for minimizeBasicLarger gave much more random results, so I changed them once or twice, giving the green curve. I didn’t change the parameters of the other algorithms on that particular task.

Keeping only the best iteration:

Best loss on test set = f(seconds) on a dedicated machine

Limitations

Just to be clear: these algorithms perform badly outside of their comfort zone, which, again, to be honest, would also be the comfort zone of many other algorithms that don’t scale. For example, the minimizeBasic algorithm changes one weight per iteration, which is fine in a neural network with 500 weights, but you certainly can’t use that version for a neural network with 5M parameters. And if you randomly change 500,000 weights at once, you’re less likely to find a great configuration by chance, meaning you have to use smarter metrics.

With their worst hyperparameters, these ShakingTree algorithms would probably look like a random search, but the added hyperparameters are there precisely to avoid being too close to a random search. They could probably also be approximated by genetic algorithms, but the same argument applies: it’s not exactly the same algorithm.

We’re not doing much better than any algorithm that already performs well on MNIST, and proving that these algorithms work on such datasets is far from proving that they work on real-life datasets. But maybe, with some hyperparameter tuning and a more tensor-oriented approach, this could be applied to convolutional filters in a more standard framework like PyTorch.

This article is mainly a proposal for a new backpropagation-free algorithm. There are many other backprop-free algorithms that could perform better on small datasets with small networks, but which are probably always worse than backprop on larger datasets with larger networks.