Why and How Is Neural Architecture Search Biased?

Original article was published by Devansh on Artificial Intelligence on Medium

What does this mean for their performance?

Neural Architecture Search (NAS) is being touted as one of Machine Learning’s big breakthroughs. It is a technique for automating the design of neural networks. As someone interested in automation and machine learning, I have been following it for a while. Recently, a paper titled “Understanding the wiring evolution in differentiable neural architecture search” by Sirui Xie et al. caught my attention. It asks whether “neural architecture search methods discover wiring topology effectively”. The paper provides a framework for evaluating bias by proposing “a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization”. It shows that differentiable NAS is biased when designing networks, and it examines 3 common patterns of that bias.

A quick overview of the NAS. The paper looks at Differentiable (calculus-based) NAS
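To make “differentiable” concrete, here is a minimal sketch of one common relaxation: the softmax mixture used in DARTS-style methods, where every edge outputs a weighted sum of all its candidate operations. The operation names, the toy scalar operations, and the `alpha` values are my own assumptions for illustration; they are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on a single edge (toy scalar versions).
OPS = {
    "none": lambda x: 0.0,       # the None operation outputs zero
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
}

def mixed_edge(x, alpha):
    """Edge output: sum of all candidate ops weighted by softmax(alpha)."""
    weights = softmax([alpha[name] for name in OPS])
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

# Equal architecture parameters give every operation a weight of 1/3.
alpha = {"none": 0.0, "identity": 0.0, "double": 0.0}
y = mixed_edge(3.0, alpha)
```

Because the output is a smooth function of `alpha`, the architecture parameters can be trained with gradients, just like ordinary weights.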

In this article, I will explain the types of biases, why they exist, and how they are detected. Once you understand the techniques, you should be able to apply them to evaluate your own NAS setup (and other related techniques). Please leave your feedback on this article, and share it if you find it useful. NOTE: I write NAS throughout, but this paper and article are specific to differentiable NAS. Experiments on other kinds of NAS have not been done yet.

A Tale of 3 Biases

Read like a clock.

The team did a thorough investigation of the 3 common patterns found in differentiable Networks created through NAS. In the words of the team: “Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization”. Figure 1 is an illustration from the paper showing the first 2 in a concise manner.

The team provides possible reasons for each pattern, as well as validations for their theories. I will be explaining each of them in detail.

Pattern 1: Growing instead of Pruning

Pruning as an example

Those familiar with trees and backpropagation will recognize the term pruning. Pruning means removing redundant or unhelpful edges from a tree (or from a graph in general). It is used to simplify decision trees and to make algorithms cheaper to run. Since a neural network has the structure of a directed, weighted graph, pruning can also be applied to reduce the network’s cost, and it sometimes improves results by removing the error that low-quality nodes introduce.
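As a rough illustration of pruning on a weighted graph, the sketch below drops every edge whose absolute weight falls under a threshold. Magnitude is only one possible pruning criterion, and the edge names, weights, and threshold are made up for this example.

```python
def prune_edges(edges, threshold):
    """Keep only edges whose absolute weight is at least `threshold`."""
    return {edge: w for edge, w in edges.items() if abs(w) >= threshold}

# Hypothetical weighted, directed graph: (source, target) -> weight.
edges = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.01,   # near-zero weight, contributes little
    ("b", "d"): -0.7,
    ("c", "d"): 0.02,   # near-zero weight, contributes little
}

pruned = prune_edges(edges, threshold=0.05)
```

Note the direction of travel: pruning starts from a full graph and removes edges, which is the opposite of what the paper observes in differentiable NAS.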

A quick Demo of how effective it can be

In the case of differentiable NAS frameworks, though, we see something else happen. Instead of snipping off the low-quality edges in the network, the search first drops all edges. It then proceeds to pick the ones that have the best scores. This might not be a problem by itself, but it leads to some sticky situations. The proof involves a lot of math, with details and nuances that would require an entire series to break down. If you are interested, they are on pages 5–7 of the paper. In my annotated version of the paper (linked at the end of the article), I have highlighted the important aspects. They should help you follow the flow a bit better. Here I will be attaching the graphs that clearly show a tendency to grow.

“Surprisingly, for all operations except None, cost is inclined towards positive at initialization (Fig.4(a)). Similarly, we estimate the cost mean statistics after updating weight parameters for 150 epochs with architecture parameters still fixed. As shown in Fig.4(b), most of the cost becomes negative. It then becomes apparent that None operations are preferred in the beginning as they minimize these costs. While after training, the cost minimizer would prefer operations with the smallest negative cost.” None operations have a cost of 0, so at initialization, when every other operation has a positive cost, choosing None is the easiest way to lower the cost. As training proceeds, the costs of the other operations shift from positive to negative, and those operations start to win. This is an indication that the cell wiring topology is in fact growing.
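The local cost-minimization view can be sketched in a few lines: on each edge, pick whichever operation has the lowest cost, with None fixed at a cost of 0. The operation names and cost values below are invented for illustration and are not the paper’s measurements.

```python
def pick_operation(costs):
    """Return the operation with the smallest cost; 'none' always costs 0."""
    candidates = dict(costs)
    candidates["none"] = 0.0
    return min(candidates, key=candidates.get)

# Made-up mean costs on one edge at two points in training.
cost_at_init = {"conv3x3": 0.8, "skip": 0.3, "pool": 0.5}            # all positive
cost_after_training = {"conv3x3": -0.6, "skip": -0.2, "pool": 0.1}   # mostly negative

choice_at_init = pick_operation(cost_at_init)
choice_after_training = pick_operation(cost_after_training)
```

With all costs positive, None wins and the edge is dropped. Once an operation’s cost goes negative, it beats None and the edge grows back.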

Pattern 2: Preference for Width Over Depth

This one is slightly easier to understand. The proof builds on the data gathered for the first pattern (NAS grows instead of pruning). Put simply, we want to find out whether NAS-created networks favor wide neural nets over deep ones. To understand the distinction, look at the figure below. Wide networks have fewer layers but more neurons per layer, while deep networks have more layers but fewer neurons per layer.

A standard Neural Net

This bias shows itself in a simple way. Remember how NAS networks tend to drop all edges starting out? While growing, the network shows a clear preference for recovering edges (connections) to input neurons before edges to the intermediate (hidden) neurons. To understand width bias, we need to establish 2 things: 1) NAS makes a distinction between input and intermediate neurons; 2) it favors the former. We also need to show that these are problems caused by bias in the NAS.
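One way to put a number on “wide versus deep” is to measure the longest path through the wiring. The sketch below does that for two toy cells. The node numbering (0 is the input, 3 is the output) and both edge lists are assumptions made for this example.

```python
def cell_depth(edges, output):
    """Length, in edges, of the longest path ending at `output`.

    Assumes every edge (src, dst) has src < dst, so processing edges in
    order of dst visits nodes in topological order.
    """
    longest = {}
    for src, dst in sorted(edges, key=lambda e: e[1]):
        longest[dst] = max(longest.get(dst, 0), longest.get(src, 0) + 1)
    return longest.get(output, 0)

# Wide: both intermediate nodes (1 and 2) read directly from the input.
wide_cell = [(0, 1), (0, 2), (1, 3), (2, 3)]
# Deep: the nodes form a chain, each reading from the previous one.
deep_cell = [(0, 1), (1, 2), (2, 3)]

wide_depth = cell_depth(wide_cell, output=3)
deep_depth = cell_depth(deep_cell, output=3)
```

A search that recovers input edges first tends to produce wiring like `wide_cell`, where the longest path stays short.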

Cell refers to neurons

The paper hypothesizes that bias occurs because intermediate cells (neurons) are less trained. Taking an example from the paper: “Note that in A every input must be followed by an output edge. Reflected in the simplified cell, 0;1 and 0;2 are always trained as long as they are not sampled as None. Particularly, 0;1 is updated with gradients from two paths (3–2–1–0) and (3–1–0). When None is sampled on edge(1; 2), 0;1 can be updated with gradient from path (3–1–0). However, when a None is sampled on edge(0; 1), 1;2 cannot be updated because its input is zero. Even if None is not included in edge(0; 1), there are more model instances on path (3–2–1–0) than path (3–2–0) and (3–1–0) that share the training signal.”
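The argument in that quote can be checked by brute force. The sketch below is my own reading of the simplified cell: node 0 is the input, nodes 1 and 2 are intermediate, and both always feed the output node 3. It assumes each searchable edge independently samples either None or a real operation, which is a simplification that does not come from the paper. An edge counts as trained only if it sampled a real operation and its source node carries a non-zero signal.

```python
from itertools import product

# Searchable edges of the simplified cell, listed in topological order.
EDGES = [(0, 1), (0, 2), (1, 2)]

def trained_edges(sampled):
    """Return the edges that receive a training signal.

    `sampled` maps each edge to True (real operation) or False (None).
    """
    active = {0}      # node 0 is the cell input and always carries a signal
    trained = set()
    for src, dst in EDGES:
        if sampled[(src, dst)] and src in active:
            trained.add((src, dst))
            active.add(dst)   # dst now has a non-zero input
    return trained

# Count how often each edge is trained over all 8 sampling combinations.
counts = {edge: 0 for edge in EDGES}
for choice in product([True, False], repeat=len(EDGES)):
    sampled = dict(zip(EDGES, choice))
    for edge in trained_edges(sampled):
        counts[edge] += 1
```

Under these assumptions, the two input edges are each trained in 4 of the 8 combinations, while the edge between the intermediate nodes is trained in only 2. That gap is the unequal training the paper points to.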

It validates this through the following experiment:

By showing that training can shift the preference from width to depth, the experiment points to unequal training as the cause of the bias.

Pattern 3: No edge selected in bi-level optimization

Bi-level optimization visualized

Bi-level optimization is a special kind of optimization where one problem is embedded (nested) within another. The outer optimization task is commonly referred to as the upper-level optimization task, and the inner optimization task is commonly referred to as the lower-level optimization task. In the paper’s experiments, bi-level optimization does not mesh with differentiable NAS: the search ends up selecting no edges at all.
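For reference, bi-level optimization in differentiable NAS is usually written as below. The symbols are my own notation rather than the paper’s: α stands for the architecture parameters, w for the network weights, and the two losses are measured on the search (held-out) set and the training set.

```latex
\min_{\alpha} \; \mathcal{L}_{\text{search}}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{\text{train}}(w, \alpha)
```

The lower-level problem fits the weights on the training set, and the upper-level problem picks the architecture using data the weights never saw.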

The paper does not go into great detail on the reason or the proof. It explains the pattern by stating that “Fig.11(b) shows the comparison of L and H in the training set and the search set. For correct classification, L and H are almost comparable in the training set and the search set. But for data classified incorrectly, the classification loss L is much larger in the search set. That is, data in the search set are classified poorly. This can be explained by overfitting … In sum, subnetworks are erroneously confident in the held-out set, on which their larger L actually indicates their misclassification. As a result, the cost sum in bi-level optimization becomes more and more positive. None operation is chosen at all edges.”

If that was a bit much, here’s the summary: there are signs of overfitting (the loss on misclassified data is much larger in the held-out search set). This pushes the cost in bi-level optimization up, which causes NAS to choose None at every edge.


This paper does a great job of looking under the hood of how neural networks are created by differentiable Neural Architecture Search methods. An analysis of other approaches (evolutionary algorithms, etc.) would be interesting. Other than that, the paper is very comprehensive.

Highlighted Paper

Below is the paper. I have highlighted what I thought was important and added definitions for some important concepts. I hope it helps. The paper has a lot of mathematical detail that you might find interesting.