Source: Deep Learning on Medium
5-minute Paper Review: Evolutionary Stochastic Gradient Descent
How evolutionary algorithms can speed up optimization of deep neural networks
This paper proposes a new evolutionary version of stochastic gradient descent in deep neural networks. Stochastic gradient descent (abbreviated as SGD) was first proposed by Robins and Monro in their paper “A Stochastic Approximation Method.” It is essentially an iterative optimization method which randomly draws a single sample during iteration k, and uses it to computer its gradient. The resulting stochastic gradient is then used to update the deep network’s weights given a step size (or learning rate.) These weights being updated are considered as a single parameter vector θ in the original SGD algorithm.
In their introduction, the authors suggest that the complementary relationship between SGD and Evolutionary Algorithms (EAs) is worth investigating. Unlike SGD, EAs are generally gradient-free. The authors hypothesize that EAs are more advantageous when dealing with complex optimization problems than SGD due to their gradient-free nature. Their proposal of combining EAs and SGD may help implementing optimization on large, distributed networks.
The new approach that the authors propose uses empirical risk of the parameter vector θ as a fitness function, and searches for the minimum fitness values given a population of parameters θs. In their methodology and implementation sections, the authors give detailed descriptions of the math behind the proposed Evolutionary Stochastic Gradient Descent (ESGD) algorithm. The main idea is given a population of randomly initialized parameter vectors, the algorithm searches for the ones which give the lowest empirical risks. The best offsprings are selected through m-elitist average fitness, which is essentially the average fitness of the best m individuals from ranking them in ascending order. For example, when m = 3 then the m-elitist average fitness gives the fitness of the best 3 individuals out of the entire population.
In each generation of ESGD, the initial population is generated through random vectorization of parameters. Using predetermined optimizers (including regular SGD and Adaptive Moment- ADAM optimizers) and a set of fixed hyperparameters, each individual is updated until its fitness degrades. The final fitness values of the population are then ranked and the top m individuals plus µ-m additional individuals are selected as parents. Offsprings for the next generation are produced by intermediate recombination of these top-performing individuals, and the addition of random zero-mean Gaussian noise. When the empirical risk of the population is at a minimum, the set of parameters which produce the minimum fitness value are chosen as real parameters of the model.
Multiple experiments are performed with ESGD on well-known benchmark datasets including BN-50, SWB300, CIFAR10 and PTB. These datasets cover tasks involving speech recognition (BN-50 and SWB300), image recognition (CIFAR10) and language modeling (PTB). The three deep learning models chosen for evaluation were Deep Neural Network (DNN), ResNet-20 (Residual Network), and LSTM (Long Short-Term Memory Network). Using fitness values of two additional baseline models as references, the authors show that the fitness values produced by their proposed algorithm are always non-increasing throughout the aforementioned datasets.
In their discussion, the authors suggest the importance of maintaining a good diversity in ESGD’s populations. This is due to the fact that if the population were to become homogeneous in terms of its fitness values, the algorithm will reach premature convergence and will not produce the ideal results. Therefore, the m-elitist selection strategy is introduced in ESGD as a measure induce diversity within the population. Besides population diversity, the authors also point out the importance of complementary optimizers arising from ESGD. It has been a known fact that Adam optimizers can reach convergence much quicker than conventional SGD. However, SGD tends to catch up and reach a better optimum point than Adam in longer runs. The authors suggest that in ESGD not only the best parameters are chosen, but their complementary optimizers as well since they are responsible of producing better fitness values.
In conclusion, the ESGD algorithm provides a new approach to select the best parameters for deep learning algorithms. Effectiveness of the algorithm has been demonstrated across 4 datasets and 3 different deep learning models. Overall, the SGD algorithm can be viewed as a coevolution where individuals evolve together to generate fitter offsprings for the next generation.
This is a well-written, and a particularly interesting research paper to me since optimization is one of the most important aspects in my research. It actually has an implementation section on top of its methodology section. This helps practical minds such as mine to better understand how the proposed math is actually implemented in practice. I suppose this paper has undergone stringent reviews since it was selected as an entry in the 2018 NeuroIPS (NIPS) conference. Therefore, it is no surprise to me that this article is so articulate and well-formatted. However, there is ambiguity particularly involving the actual family of optimizers used for optimization in the ESGD population. The authors can provide more clarity by listing all the optimizers and hyperparameters used in their algorithm, instead of merely listing several examples and move on to the next part. I am also curious why the authors choose to use conventional SGD and ADAM as choices of optimizers over other optimizers such as RMSProp, Adagrad and Adadelta. There seems to be no explanation for why SGD and ADAM are specifically chosen to be optimizers for the population.
The experimental section provides a good summary of the results obtained from multiple experiments run on the single baseline model, top-15 baseline model and ESGD. However, I find the 4th graph in Figure 1 quite confusing. This graph contains the LSTM network results for the PTB dataset. Even though the authors explain the increase in perplexity in the ESGD model, they do not address the flatlining of the fitness values in the top-15 model.
My favorite part of the article is perhaps the discussion section. The authors provide many meaningful discussions regarding the significance of their algorithm. I am happy to see the authors discussing about the computational efficiency of their algorithm. They conclude that given parallel computating capabilities of modern GPUs, the ESGD algorithm should take about the same time to compute end-to-end as the regular SGD algorithm. However, the authors do not provide any future steps or possible extensions for ESGD. One possible extension I’d like to see is the coevolution of not only network parameters and optimizers, but with the actual network architecture as well.