# 5-minute Paper Review: Evolutionary Stochastic Gradient Descent

Source: Deep Learning on Medium


## How evolutionary algorithms can speed up optimization of deep neural networks

This paper proposes an evolutionary version of stochastic gradient descent for training deep neural networks. Stochastic gradient descent (SGD) was first proposed by Robbins and Monro in their paper “A Stochastic Approximation Method.” It is an iterative optimization method that randomly draws a single sample at iteration k and uses it to compute a stochastic gradient. That gradient is then used to update the deep network’s weights given a step size (or learning rate). In the original SGD algorithm, the weights being updated are treated as a single parameter vector θ.
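The single-sample update described above can be sketched in a few lines. This is a minimal toy illustration, not code from the paper; the quadratic loss and its gradient are assumptions chosen so the example is self-contained.

```python
import numpy as np

def sgd_step(theta, grad_fn, sample, lr=0.01):
    """One SGD update: compute the gradient on a single random sample
    and move theta against it by the learning rate."""
    return theta - lr * grad_fn(theta, sample)

# Toy problem: minimize the empirical risk E[(theta - x)^2] over samples x,
# whose per-sample gradient is 2 * (theta - x).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.5, size=1000)

theta = np.array(0.0)
for k in range(2000):
    x = rng.choice(data)  # randomly draw a single sample at iteration k
    theta = sgd_step(theta, lambda t, s: 2.0 * (t - s), x, lr=0.01)

print(theta)  # converges near the data mean (~3.0)
```

Each iteration sees only one sample, so individual steps are noisy, but on average they follow the true gradient of the empirical risk.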

In their introduction, the authors suggest that the complementary relationship between SGD and evolutionary algorithms (EAs) is worth investigating. Unlike SGD, EAs are generally gradient-free. The authors hypothesize that this gradient-free nature makes EAs better suited than SGD to complex optimization problems, and that combining EAs with SGD may help implement optimization on large, distributed networks.

The proposed approach uses the empirical risk of the parameter vector θ as a fitness function and searches for the minimum fitness value over a population of parameter vectors. In their methodology and implementation sections, the authors give detailed descriptions of the math behind the proposed Evolutionary Stochastic Gradient Descent (ESGD) algorithm. The main idea is that, given a population of randomly initialized parameter vectors, the algorithm searches for those that yield the lowest empirical risks. The best offspring are selected through the m-elitist average fitness, which is the average fitness of the best m individuals after ranking the population in ascending order of fitness. For example, when m = 3, the m-elitist average fitness is the average fitness of the best 3 individuals in the entire population.
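The m-elitist average fitness described above reduces to a sort and an average; a short sketch (function name is illustrative, not from the paper):

```python
import numpy as np

def m_elitist_average_fitness(fitness_values, m):
    """Average fitness of the best m individuals: sort fitness values
    in ascending order (lower empirical risk = fitter) and average
    the first m of them."""
    ranked = np.sort(np.asarray(fitness_values))
    return float(ranked[:m].mean())

# Example: a population of 6 individuals with these empirical risks.
population_fitness = [0.9, 0.2, 0.5, 0.1, 0.7, 0.3]
print(m_elitist_average_fitness(population_fitness, m=3))  # ≈ (0.1 + 0.2 + 0.3) / 3 = 0.2
```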

In each generation of ESGD, the initial population is generated through random initialization of the parameter vectors. Using predetermined optimizers (including regular SGD and the adaptive-moment Adam optimizer) and a set of fixed hyperparameters, each individual is updated until its fitness degrades. The final fitness values of the population are then ranked, and the top m individuals plus µ − m additional individuals are selected as parents. Offspring for the next generation are produced by intermediate recombination of these parents plus zero-mean Gaussian noise. When the empirical risk of the population reaches its minimum, the parameter vector producing the minimum fitness value is chosen as the model’s final parameters.
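One generation of this loop can be sketched as follows. This is a hedged toy version, assuming a simple quadratic risk and a single SGD step per individual in place of the paper's full optimizer schedules; the names `esgd_generation`, `sgd_update`, and `fitness` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(theta):
    # Hypothetical empirical risk: squared distance to an optimum at 2.0.
    return float(np.mean((theta - 2.0) ** 2))

def sgd_update(theta, lr):
    # Gradient step for the toy risk above; stands in for per-individual SGD/Adam updates.
    return theta - lr * 2.0 * (theta - 2.0)

def esgd_generation(population, m, mu, lr=0.05, sigma=0.01):
    """One sketched ESGD generation: SGD phase, ranking, parent
    selection (m elites plus mu - m others), and reproduction by
    intermediate recombination plus zero-mean Gaussian noise."""
    # SGD phase: each individual takes an optimizer step.
    updated = [sgd_update(theta, lr) for theta in population]
    # Rank by fitness (ascending: lower risk is fitter).
    ranked = sorted(updated, key=fitness)
    # Parents: the top m individuals plus mu - m randomly chosen others.
    elites, rest = ranked[:m], ranked[m:]
    extra = rng.choice(len(rest), size=mu - m, replace=False)
    parents = elites + [rest[i] for i in extra]
    # Reproduction: average two random parents, then add Gaussian noise.
    offspring = []
    for _ in range(len(population)):
        a, b = rng.choice(len(parents), size=2, replace=False)
        child = 0.5 * (parents[a] + parents[b]) + rng.normal(0.0, sigma, size=parents[a].shape)
        offspring.append(child)
    return offspring

# Evolve a small population of 4-dimensional parameter vectors.
population = [rng.normal(0.0, 1.0, size=4) for _ in range(8)]
for gen in range(30):
    population = esgd_generation(population, m=3, mu=4)

best = min(population, key=fitness)
print(fitness(best))  # shrinks toward zero over the generations
```

The interplay shown here is the point of the method: gradient steps refine each individual locally, while selection and recombination share progress across the population.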

Multiple experiments are performed with ESGD on well-known benchmark datasets, including BN-50, SWB300, CIFAR10 and PTB. These datasets cover speech recognition (BN-50 and SWB300), image recognition (CIFAR10) and language modeling (PTB). The three deep learning models chosen for evaluation were a deep neural network (DNN), ResNet-20 (a residual network), and an LSTM (long short-term memory network). Using the fitness values of two additional baseline models as references, the authors show that the fitness values produced by their proposed algorithm are always non-increasing across the aforementioned datasets.

In their discussion, the authors stress the importance of maintaining good diversity in ESGD’s populations. If the population becomes homogeneous in its fitness values, the algorithm reaches premature convergence and fails to produce ideal results. The m-elitist selection strategy is therefore introduced in ESGD as a measure to induce diversity within the population. Beyond population diversity, the authors also point out the importance of the complementary optimizers that arise in ESGD. It is well known that Adam can reach convergence much faster than conventional SGD, yet SGD tends to catch up and reach a better optimum in longer runs. The authors note that ESGD selects not only the best parameters but also their associated optimizers, since those optimizers are responsible for producing the better fitness values.

In conclusion, the ESGD algorithm provides a new approach to selecting the best parameters for deep learning models. Its effectiveness has been demonstrated across 4 datasets and 3 different deep learning models. Overall, ESGD can be viewed as a coevolution in which individuals evolve together to generate fitter offspring for the next generation.