My Experiments In Replacing Deep Learning Backpropagation (SGD) With A Genetic Algorithm

This article describes experiments I ran to replace stochastic gradient descent (SGD) and backpropagation in deep learning models with a genetic algorithm (GA).

As a spoiler, let me say that these early experiments did not come anywhere close to SGD’s performance; however, I say more about this in the conclusions.

Context

For the purposes of this article, I’ll use the following diagram to illustrate the salient flow used today when training deep learning models.

The cost function calculates an error amount (E) by comparing the training results with the expected results (ground truth). In turn, SGD applies changes to the model parameters in a direction that should improve the next training result. Of note, the functions used in the model must be differentiable in order for SGD to search for optimal solutions.

The parameter update cycle continues until the desired accuracy (not shown in the diagram) is met.

I’ll use W to represent the set of model parameters. A deep learning model can easily contain tens (if not hundreds) of millions of parameters. After training is done, W will form the basis of the trained model.
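As a concrete illustration of this training loop, here is a minimal TF1-style sketch. The one-layer toy model, learning rate, and random stand-in batch are placeholders of my own, not the model used later in this article:

import numpy as np
import tensorflow as tf  # TF1-style graph/session API

# Toy one-layer model; x and y_true stand in for training data and ground truth.
x = tf.placeholder(tf.float32, [None, 784])
y_true = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))  # the model parameters, W
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

# The cost function compares training results with ground truth, producing E.
E = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=logits))
# SGD changes W in a direction that should reduce E (hence differentiability).
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(E)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    batch_x = np.random.rand(64, 784).astype(np.float32)  # stand-in batch
    batch_y = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, 64)]
    session.run(train_step, feed_dict={x: batch_x, y_true: batch_y})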

Outline of a Genetic Algorithm Approach

The following diagram shows the SGD being replaced by a GA, and the single W being replaced by a population of Ws, which I’ll represent as [W].

The cost function was removed and replaced by an accuracy function, which provides the fitness function in GA terminology. The fitness value expresses how fit a member of the population is against some criteria. Here, fitness is a value between 0 and 1, with 0 being most fit (in effect an error measure: 1 minus accuracy). I’ll use [F] to represent the fitness of each member of [W].

Note that in a GA approach the model functions do not have to be differentiable, as the GA will not be using function derivatives to converge upon a solution.

Genetic Algorithms Overview

There are many scholarly and detailed books on GAs, for example, “Genetic Algorithms in Search, Optimization and Machine Learning” by Goldberg. Here, I’ll stay at a (very) high level.

Given a population [W], and a fitness value for each member of the population [F], a GA will determine the makeup of the population in the next generation.

[W]next = GA( [W]current, [F]current )

A simple GA can be described as follows (a minimal code sketch appears after the list):

  1. Selection — Pair up members of the population, using fitness as a bias
  2. Combination — Combine the selected pairs to create the next generation. This is also known as crossover, whereby the genes of the parents are crossed over in the offspring.
  3. Mutation — Before unleashing the new population, apply (a small amount of) random mutation
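As mentioned above, here is a minimal sketch of one generation of such a GA, operating on flat numpy parameter vectors. The pairing scheme and mutation size are illustrative assumptions on my part, not a canonical recipe:

import numpy as np

def ga_step(population, fitness, mutation_std=0.01):
    # One generation: selection, combination (crossover), mutation.
    # This follows the article's convention that a fitness of 0 is most
    # fit, so selection weights are (1 - fitness).
    N = len(population)
    weights = 1.0 - np.asarray(fitness)
    probs = weights / weights.sum()

    next_gen = []
    while len(next_gen) < N:
        # 1. Selection: draw a pair, with fitter members more likely.
        ia, ib = np.random.choice(N, size=2, replace=False, p=probs)
        a, b = population[ia], population[ib]
        # 2. Combination: cross the parents' genes element-wise.
        mask = np.random.randint(0, 2, size=a.shape)
        child = a * mask + b * (1 - mask)
        # 3. Mutation: small random perturbation before the next generation.
        next_gen.append(child + mutation_std * np.random.randn(*child.shape))
    return next_gen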

The Experiment

I started with a simple XOR problem to test out the flow of the GA, and the GA did converge upon a solution relatively quickly. For this article, however, I will describe a more substantive problem.

The experiment is based upon the TensorFlow MNIST Deep example (mnist_deep.py, as found in the TensorFlow GitHub repo). This is a deep convolutional neural net for recognizing the MNIST handwritten digit dataset. The model contains millions of parameters; I thought it would be a good non-toy example.

I externalized all of the model parameters (weights and biases) into TensorFlow variables so that it was easy to manipulate the values outside of the TensorFlow tensors. The parameters, W, were loaded into the TensorFlow model graph using “feed_dict”.
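A sketch of what that externalization might look like; the parameter names and shapes below are illustrative stand-ins, not the exact variables from mnist_deep.py:

import tensorflow as tf  # TF1-style graph/session API

# Placeholders stand in for the externalized weights and biases, so any
# candidate parameter set can be fed into the graph via feed_dict.
param_shapes = {
    "W_conv1": (5, 5, 1, 32),  # illustrative shapes only
    "b_conv1": (32,),
    "W_fc1": (7 * 7 * 64, 1024),
    "b_fc1": (1024,),
}
param_ph = {name: tf.placeholder(tf.float32, shape)
            for name, shape in param_shapes.items()}

def make_feed_dict(W):
    # Map one population member (a dict of numpy arrays) onto the graph.
    return {param_ph[name]: value for name, value in W.items()}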

The population size is parameterized as N. Each of the N starting models is initialized using numpy.random.randn (a Gaussian distribution with a mean of 0 and a variance of 1).
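Reusing the illustrative param_shapes from the sketch above, initializing the population might look like this:

import numpy as np

N = 200  # population size, matching the run shown in the results

def init_member():
    # Each parameter matrix is drawn from a standard normal distribution
    # (mean 0, variance 1), as numpy.random.randn provides.
    return {name: np.random.randn(*shape)
            for name, shape in param_shapes.items()}

W = [init_member() for _ in range(N)]  # the population, [W]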

Then, for each W in [W], a fitness value is computed. This repeats for G generations (or until the solution has converged):

for generation in range(G):
    for index in range(N):
        # make_feed_dict is the sketch from above; it maps one member's
        # parameter matrices onto the model graph.
        F[index] = session.run(accuracy, feed_dict=make_feed_dict(W[index]))

The process of calculating the next generation of [W] is:

  1. Random pairing of [W], biased towards fitter members
  2. Crossover of each pair in [W]
  3. Random mutation of each member of [W]
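Crossover and mutation are covered in the next two sections. For the pairing step, a fitness-biased draw might look like the following sketch; the exact weighting used in the experiment (including the boost for top performers mentioned in the conclusions) is not spelled out, so this is an assumption:

import numpy as np

def pair_members(F):
    # Randomly pair population indices, biased towards fitter members.
    # A fitness of 0 is most fit, so selection weights are (1 - F).
    N = len(F)
    weights = 1.0 - np.asarray(F)
    probs = weights / weights.sum()
    return [tuple(np.random.choice(N, size=2, replace=False, p=probs))
            for _ in range(N // 2)]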

Crossover

Each W is formed by a set of matrices where each matrix represents a piece of the machine learning model (weights and biases). Given two members of a pair:

Wa -> [a1, a2, a3, ...] # where each a is a matrix
Wb -> [b1, b2, b3, ...] # where each b is a matrix

There are many ways one could cross over Wa and Wb. For this experiment, I chose to randomly cross the elements of corresponding matrices. For example:

mask = numpy.random.randint(0, 2, size=a1.shape)  # random 0s and 1s in the shape of a1
mask_inv = 1 - mask                               # the opposite of mask
new_a1 = a1 * mask + b1 * mask_inv
new_b1 = b1 * mask + a1 * mask_inv
# ... and so on for (a2, b2), (a3, b3), etc.
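Wrapped up as a function over full parameter sets (a sketch, assuming each W is a dict of numpy matrices as in the initialization sketch above):

import numpy as np

def crossover(Wa, Wb):
    # Element-wise crossover of each pair of corresponding matrices.
    new_Wa, new_Wb = {}, {}
    for name in Wa:
        mask = np.random.randint(0, 2, size=Wa[name].shape)
        new_Wa[name] = Wa[name] * mask + Wb[name] * (1 - mask)
        new_Wb[name] = Wb[name] * mask + Wa[name] * (1 - mask)
    return new_Wa, new_Wb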

Mutation

As a GA hyper-parameter there is a “mutation amount”, but the amount itself varies based upon how close the current solution is to converging. Ultimately, I found mutation to be a challenge to understand and apply.
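One convergence-aware variant might look like the following sketch; scaling the mutation by the current best fitness is my assumption, not necessarily what the experiment did:

import numpy as np

def mutate(W, base_amount, best_fitness):
    # Add Gaussian noise to every parameter matrix. Because a fitness of
    # 0 is most fit, the noise shrinks as the population converges.
    scale = base_amount * best_fitness
    return {name: value + scale * np.random.randn(*value.shape)
            for name, value in W.items()}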

Results

The first diagram shows a population size of 200 run for 300 generations.

The above diagram shows the best result in each generation (orange). Recall that the MNIST data has 10 digits, so a random choice would score 0.9 on this scale. There is a marked leveling-off as the run progresses. For each generation, the best model was also run against validation data (data set aside and never seen during training).

Despite the clearly not-state-of-the-art accuracy, the genetic algorithm was, to some extent, able to sift through millions of parameters and create better models over time, all without backpropagation.

From the same run, the following diagram depicts 100 numeric values of the best model in each generation. The diagram shows initial volatility and then a stabilization of values:

The challenge is that early stabilization leads to a ceiling in performance, while no stabilization at all would mean a random walk over the solution space and no convergence. The level of mutation should be a contributing factor in striking the required balance.

Conclusions and Discussion

It would be fair to say that population sizes of 200 or even 1000 are likely too small to achieve meaningful results. Perhaps population sizes should be in the millions.

Aside from the population size, there are several other hyper-parameters that could be considered. Some examples are:

  1. Ranking and pairwise selection. How much bias is given to fitter members? In my experiments, I boosted the top performers so that they had more influence than strictly dictated by fitness.
  2. Mutation. How much mutation? How is it applied? Does it change depending on the convergence?
  3. Crossover. How are two models crossed? I chose to cross over evenly at the granularity of individual matrix elements. Should the stronger of the pair retain more values? Should the matrices be sliced and then joined? Should crossover instead swap entire matrices?
  4. Batch size. I tried a variety of batch sizes, settling on a batch size of 64.

For myself, the next steps are undetermined, as the leveling-off of convergence has been a stubborn barrier to overcome (at least with the computing power available to me, a 2018 MacBook Pro).

If you have made it this far, thank you for your attention and time, and I hope you have enjoyed the article!