Microsoft Efficient Forward Architecture Search

Source: Deep Learning on Medium


We have to figure out how many layers we need and more → but we do not know this information beforehand. We usually just sit there and try new things → however, this is not the best approach. So how can we make this easier?

The thing is, nobody really knows → how to set all those hyper-parameters. (those are just magic numbers LOL) → so how should we even start?

The game is to take the hand-engineering part out of model design → this is critical.

The problem itself is not hard to state → we want to find an architecture that meets some criteria → such as accuracy and more. (this is the same as a search problem).

There is a macro search and a micro search → knowing what the correct search space is also matters. (and we are able to inject some priors → for example, ruling out batch normalization right after a convolution).

The search is pretty expensive → this is a problem → after a certain architecture is found we need to get its statistics → training → evaluating model accuracy is VERY HARD and takes a LONG time. (we do not want to train one model for two days → we want some way to evaluate the model much faster).

Cell Search Space → have some pre-defined cells → and we connect those cells with a single connection. (this method is very popular → since it greatly reduces the search space). (so we are going to take advantage of this information)
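Just to make the cell idea concrete, here is a minimal PyTorch-style sketch (my own toy example, not code from any of the papers): a cell is a tiny graph whose edges pick from a fixed menu of operations, and the searched network is just a stack of such cells.

```python
import torch
import torch.nn as nn

# Hypothetical menu of candidate operations for one edge of a cell.
# The names and choices are illustrative, not taken from a specific paper.
CANDIDATE_OPS = {
    "conv3x3": lambda c: nn.Conv2d(c, c, 3, padding=1),
    "conv5x5": lambda c: nn.Conv2d(c, c, 5, padding=2),
    "maxpool": lambda c: nn.MaxPool2d(3, stride=1, padding=1),
    "identity": lambda c: nn.Identity(),
}

class Cell(nn.Module):
    """A tiny cell: a few edges, each assigned one chosen operation."""
    def __init__(self, channels, op_names):
        super().__init__()
        self.ops = nn.ModuleList([CANDIDATE_OPS[name](channels) for name in op_names])

    def forward(self, x):
        # Sum the outputs of the chosen edges (a common cell pattern).
        return sum(op(x) for op in self.ops)

# The "architecture" is now just a short list of op names per cell,
# which is a much smaller search space than arbitrary graphs.
cell = Cell(channels=16, op_names=["conv3x3", "identity"])
out = cell(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```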

The opposite of this would be a completely general search space where we search for anything and everything. (this field is pretty hot).

There is a website that collects all of the papers related to NAS → super cool → this really does seem to be the future direction. (there are 50 to 60 papers on arXiv)

But thankfully, a lot of them are noise and there are only a few key ideas.

And one of the key ideas → is RL combined with deep learning → have some controller → that searches for another model → this is so cool.

Make good networks from another network.

This is a process of generating a new model → the only problem is this is SOOOO expensive.
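Roughly, the controller loop looks like this (a heavily simplified sketch of the RL idea in my own words; `build_child_model` and `train_and_evaluate` are hypothetical stand-ins for the expensive build/train/validate step):

```python
import torch
import torch.nn as nn

def build_child_model(actions):
    # Hypothetical stand-in: in reality this builds a network from the
    # sampled op choices.
    return actions

def train_and_evaluate(child):
    # Hypothetical stand-in for the expensive train-then-validate step;
    # returns a fake "accuracy" just so the sketch runs.
    return torch.rand(1).item()

class Controller(nn.Module):
    """Samples one op index per edge; a real controller is usually an RNN."""
    def __init__(self, num_edges, num_ops):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_edges, num_ops))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        actions = dist.sample()                    # one op choice per edge
        return actions, dist.log_prob(actions).sum()

controller = Controller(num_edges=4, num_ops=4)
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)

for step in range(100):                            # tiny illustrative budget
    actions, log_prob = controller.sample()
    child = build_child_model(actions)
    reward = train_and_evaluate(child)             # the expensive part
    loss = -log_prob * reward                      # REINFORCE-style update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```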

Then another paper came out → needing much less GPU time → much more practical.

We are still doing the same thing → but back-prop for every candidate is the very expensive part → so what if we do not do that → the candidate models look pretty similar to each other → so let them share the weights.

Share the weights → so one forward and backward pass updates all of them → this is very effective → and yet we are able to search over multiple model architectures. (this is a simple idea → but very efficient)
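A minimal illustration of the weight-sharing trick (my own toy version, assuming an ENAS-style setup): all candidate ops keep one persistent set of weights, and each sampled "architecture" just routes through some of them, so a single forward/backward pass updates the shared weights.

```python
import random
import torch
import torch.nn as nn

class SharedEdge(nn.Module):
    """All candidate ops hold one shared, persistent set of weights."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])

    def forward(self, x, op_index):
        # Only the sampled op runs, but its weights persist across samples.
        return self.ops[op_index](x)

edge = SharedEdge(channels=16)
optimizer = torch.optim.SGD(edge.parameters(), lr=0.01)
x = torch.randn(4, 16, 8, 8)

for step in range(10):
    op_index = random.randrange(len(edge.ops))   # "sample" an architecture
    out = edge(x, op_index)
    loss = out.mean()                            # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```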

The number of GPU days went down → and we were still able to optimize most of the parameters. (and quite new architectures were found).

This was one of the most epic papers → DARTS → here is the idea → the connections inside the cell already exist → and each connection carries a set of candidate operations.

The connections are already there → but the weights on those connections can change → and that is exactly what we optimize. (so some connections will fade out → while other connections get stronger)

Hence at the end of the day → we get a specific model architecture. (and it takes less than a day → but one problem is → the search space is limited → we cannot create new nodes → also we have to train a massive supergraph → this is not memory efficient).
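Here is a tiny sketch of the DARTS-style mixed operation (illustrative only, not the paper's code): every candidate op on an edge runs, learnable architecture weights decide how much each contributes, and after search only the strongest op is kept → which also shows why the supergraph gets so memory-hungry.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Every candidate op runs; learnable alphas weight their outputs."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # All ops run every step -- this is why the supergraph is big and
        # memory-hungry, as noted above.
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(channels=16)
out = edge(torch.randn(2, 16, 8, 8))
# "Discretize" after search: keep only the strongest op on this edge.
best_op = edge.ops[edge.alpha.argmax().item()]
print(type(best_op).__name__)
```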

To fix the GPU memory issue → another paper came out → here we only keep a portion of the graph on the GPU → and only train that part → this is such an interesting approach.
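One way to picture that memory-saving idea (my own sketch, assuming a path-sampling style approach rather than any specific paper's method): sample a single op per edge each step and only compute that path, so only a slice of the supergraph lives on the GPU at any time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledEdge(nn.Module):
    """Only one sampled op per step is computed, instead of all of them."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)
        idx = torch.multinomial(probs, 1).item()   # pick one op this step
        # Only the chosen op is computed, so only that part of the graph
        # needs to sit in GPU memory. (Updating alpha from such discrete
        # samples needs an extra estimator, which is omitted here.)
        return self.ops[idx](x)

edge = SampledEdge(channels=16)
out = edge(torch.randn(2, 16, 8, 8))
```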

There are many optimizations we can do when it comes to NAS.

Also, there is a Bayesian approach to optimizing the model architecture. (for DARTS → there might be compilation failures)

Most of the methods are doing a backward search → here we are going to do a forward search! (how does that work?)

We are able to grow the model as we go → hey, do we need another layer? Another activation function? And more??

Some of the methods borrow ideas from other statistical prediction methods.

First, we start small → then we have some candidates → and we want to know which ones are useful. (and this process is repeated over and over again).
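In pseudocode-flavoured Python, that forward-search loop reads roughly like this (my own paraphrase; all the helper callables are hypothetical stand-ins for the pieces described above):

```python
def forward_architecture_search(initial_model, propose_candidates,
                                score_candidate, apply_candidate, num_rounds):
    """All arguments are hypothetical callables standing in for the pieces
    described above: candidate proposal, cheap scoring, and model growth."""
    model = initial_model
    for _ in range(num_rounds):
        candidates = propose_candidates(model)     # e.g. new layers / connections
        scored = [(score_candidate(model, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        model = apply_candidate(model, best)       # grow the model and repeat
    return model
```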

A bit of a complicated process → but a very novel idea → in general, we are not changing the parent model's gradients → this is critical. (since we do not know beforehand which candidates are the best).

So from a possible set of operations → slowly choose the ones that increase the performance of the model. (interesting idea)
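And here is one way to picture the "not changing the gradient" point (my own sketch of a stop-gradient / stop-forward style attachment; the actual method has more machinery): the candidate sees the parent's features through a detach, and its output is added in a way that is numerically zero in the forward pass, so the parent's predictions and gradients stay untouched while the candidate still gets a training signal.

```python
import torch
import torch.nn as nn

class ParentWithCandidate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.parent = nn.Conv2d(channels, channels, 3, padding=1)
        self.candidate = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        parent_out = self.parent(x)
        # Stop-gradient: the candidate cannot push gradients back into the parent.
        cand_out = self.candidate(parent_out.detach())
        # Stop-forward: adds zero numerically, but lets the loss gradient
        # reach the candidate so we can judge whether it would help.
        return parent_out + (cand_out - cand_out.detach())

model = ParentWithCandidate(channels=8)
x = torch.randn(2, 8, 16, 16)
loss = model(x).pow(2).mean()
loss.backward()
# Parent gradients are exactly what they would be without the candidate,
# yet the candidate also receives a gradient to train on.
print(model.parent.weight.grad.norm(), model.candidate.weight.grad.norm())
```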

It’s like an evolution tree → slowly morphing into the best model. (putting a lot of these things together).

And we can even do this in an asynchronous manner LOL.

Loss is the signal we rely on → not the accuracy → the loss value is what we take advantage of.

Depending on the loss value → we do not have to keep exploring candidates that do not give good results. (data augmentation, as well as hyper-parameter search, is not taken into account) → hence reproducibility is a problem.

Quite good results on ImageNet as well, but this is not a solved problem.