How NAS was improved. From days to hours in search time.

Original article can be found here (source): Deep Learning on Medium

Neural Architecture Search (NAS) has revolutionized the process of constructing new neural network architectures. With this technique it is possible to automatically find an optimal neural network architecture for a specific problem. What’s more, the definition of optimal can be adjusted to model a tradeoff between multiple objectives, such as the size of the network and its accuracy[1]. Even more impressive is that NAS can now be performed within a few hours on a single GPU instead of 28 days on 800 GPUs. This leap in performance took only two years, and you no longer need to be a Google employee to use NAS.

But how have researchers been able to achieve this leap in performance? In this article I’ll go through the new ideas that helped pave the way for this success story.

The Catalyst

The story of NAS started back in 1988 with the idea of self-organizing networks[2], but it wasn’t until 2017 that the first major breakthrough was made: the idea of training a recurrent neural network (RNN) to generate neural network architectures.

Figure 1: an overview of the iterative process of training the NAS controller.

In simple terms, the process closely resembles how a human would try to find the best architecture. Based on a defined search space of the most promising operations and hyperparameters, the controller tests different neural network configurations. In this context, testing a configuration means assembling, training and evaluating a neural network in order to observe its performance. After many iterations, the controller learns which configurations make up the best neural networks within the search space. Unfortunately, just as for a human, the number of iterations required to find the best architecture within a search space is extremely large, which makes the process slow. This is partly because the search space suffers from combinatorial explosion: the number of possible networks grows multiplicatively with every component added to the search space. Nevertheless, this approach was able to find a state-of-the-art (SOTA) network, now commonly known as NASNet[3], but doing so required 28 days on 800 GPUs. Such high computational costs make the search algorithm impractical for most people.
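To make the loop and the combinatorial explosion concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the operation list, the widths, the function names and the scoring function are made up, and a uniform random sampler stands in for the RNN controller, which in the real method learns from the observed performance instead of guessing blindly.

```python
import random

# Hypothetical search space: each layer picks one operation and one width.
OPERATIONS = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]
WIDTHS = [32, 64, 128]


def search_space_size(num_layers: int) -> int:
    """Number of distinct networks when every layer chooses independently."""
    choices_per_layer = len(OPERATIONS) * len(WIDTHS)
    return choices_per_layer ** num_layers


def sample_architecture(num_layers: int, rng: random.Random) -> list:
    """Stand-in for the controller: samples uniformly instead of learning."""
    return [(rng.choice(OPERATIONS), rng.choice(WIDTHS)) for _ in range(num_layers)]


def evaluate(architecture: list) -> float:
    """Placeholder for 'assemble, train and evaluate'.

    Returns a made-up score in [0, 1], not a real accuracy.
    """
    useful_width = sum(width for op, width in architecture if op != "identity")
    return useful_width / (max(WIDTHS) * len(architecture))


def search(num_layers: int, iterations: int, seed: int = 0):
    """Test many configurations and keep the best one seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(iterations):
        arch = sample_architecture(num_layers, rng)
        score = evaluate(arch)  # in real NAS this step dominates the cost
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


if __name__ == "__main__":
    for layers in (1, 5, 10):
        print(layers, "layers:", search_space_size(layers), "possible networks")
    print(search(num_layers=5, iterations=100))
```

With only 12 choices per layer, the space already grows from 12 networks at one layer to 248,832 at five layers and about 62 billion at ten, which is why exhaustively testing configurations is not an option.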

So how can this idea be improved to make it more accessible? In the NAS process, most of the time is spent training and evaluating the networks suggested by the controller. Using multiple GPUs makes it possible to train models in parallel, but each individual model still takes a long time to train. Reducing the computational cost of training and evaluating the neural networks would therefore have a big impact on the total search time of NAS.
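A rough back-of-the-envelope sketch shows why parallelism alone is not enough. The candidate counts and per-candidate training times below are hypothetical numbers chosen for illustration. Only the 28 days and 800 GPUs come from the article.

```python
import math


def total_gpu_hours(num_candidates: int, hours_per_candidate: float) -> float:
    """Total compute spent, regardless of how many GPUs share the work."""
    return num_candidates * hours_per_candidate


def wall_clock_hours(num_candidates: int, hours_per_candidate: float, num_gpus: int) -> float:
    """Elapsed time when each GPU trains one candidate at a time."""
    rounds = math.ceil(num_candidates / num_gpus)
    return rounds * hours_per_candidate


# Budget quoted in the article: 28 days on 800 GPUs.
reported_gpu_days = 28 * 800  # 22,400 GPU-days

# Hypothetical search: 1,000 candidates at 2 hours of training each.
slow = wall_clock_hours(1000, 2.0, num_gpus=800)
# Same search if each candidate could be evaluated in half an hour.
fast = wall_clock_hours(1000, 0.5, num_gpus=800)

if __name__ == "__main__":
    print("reported budget:", reported_gpu_days, "GPU-days")
    print("total compute:", total_gpu_hours(1000, 2.0), "GPU-hours")
    print("wall clock, slow vs. fast evaluation:", slow, "vs.", fast, "hours")
```

Adding GPUs shortens the wall-clock time but leaves the total GPU-hours unchanged, whereas cutting the cost per candidate reduces both.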

This leads to the question: how does one reduce the computational cost of training and evaluating neural networks without negatively impacting the NAS algorithm?