Adaptive Computation Time (ACT) in Neural Networks [2/3]

Source: Deep Learning on Medium

Adaptive Computation Time (ACT) in Neural Networks [2/3]

Part 2: ACT in Residual Networks

The next step in ACT development was done by Michael Figurnov (then an intern at Google and Ph.D. student in Higher School of Economics, now a researcher in DeepMind).

“Spatially Adaptive Computation Time for Residual Networks” by
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov
Paper: https://arxiv.org/abs/1612.02297
Code: https://github.com/mfigurnov/sact
Video: https://www.youtube.com/watch?v=xp5lLiA-hA8

#1. ResNets with ACT: SACT

The idea is clever: not only we can adaptively choose the network depth during the inference, but we can do choose the depth individually for every part of the image, following the intuition that some parts of the image are “harder” and/or more important than others.

This flavor of ACT is called Spatially Adaptive Computation Time (SACT).

The main difference from the original paper of ACT is that now we apply ACT to feed-forward neural networks (FFN) instead of RNNs. OK, the special type of FFNs called Residual Networks (or ResNets). ResNets are famous for their huge depth (another important but much less known type of very deep network are Highway Networks, they even appeared earlier than ResNets).

ResNets comprise of blocks, each consisting of several residual units. A residual unit has a form F(x) = x + f(x), where the first term is called a shortcut connection and the second term is a residual function. A residual function consists of three convolutional layers: 1×1 layer that reduces the number of channels, 3×3 layer that has an equal number of input and output channels and 1 × 1 layer that restores the number of channels.

A branch is added to the outputs of each residual unit which predicts a halting score, a scalar value in the range [0, 1]. As long as the sum of halting scores for the particular position does not reach one (actually 1.0 minus a small epsilon), computation continues. When the sum reaches this value, the following units of the block are skipped. Other blocks are not skipped, they perform similar computations.

ResNet with 101 convolutional layers. Each residual unit contains three convolutional layers. ACT is applied to each block of ResNet to learn an image-dependent policy of stopping the computation.

The authors set the halting distribution to be the evaluated halting scores with the last value replaced by a remainder. This ensures that the distribution over the values of the halting scores sums to one. The output of the block is then re-defined as a weighted sum of the outputs of residual units, where the weight of each unit is given by the corresponding probability value.

Finally, a ponder cost is introduced that is the number of evaluated residual units plus the remainder value. Minimizing the ponder cost increases the halting scores of the non-last residual units making it more likely that the computation would stop earlier. The ponder cost is then multiplied by a constant τ (so you still need to tune τ, a time penalty hyperparameter) and added to the original loss function. ACT is applied to each block of ResNet independently with the ponder costs summed.

Adaptive Computation Time (ACT) for one block of residual units. The computation halts as soon as the cumulative sum of the halting score reaches 1. The remainder is R = 1 − h1 − h2 − h3 = 0.6, the number of evaluated units N = 4, and the ponder cost is ρ = N + R = 4.6

ACT has several important advantages. First, it adds very few parameters and computations to the base model. Second, it allows calculating the output of the block “on the fly” without storing all the intermediate residual unit outputs and halting scores in memory (this would not be possible if the halting distribution were a softmax of halting scores, as done in soft attention). Third, we can recover a block with any constant number of units l ≤ L by setting h_1 = · · · = h_{l−1} = 0, h_l = 1. Therefore, ACT is a strict generalization of standard ResNet.

Then ACT is applied to each spatial position of the block. Active positions are the spatial locations where the cumulative halting score is less than one (and the computation continues), inactive positions are the positions where computations halted. The evaluation of a block can be stopped completely as soon as all the positions become inactive.

Note that SACT is a more general model than ACT, and, consequently, than standard ResNet.

SACT for one block of residual units. We apply ACT to each spatial position of the block. As soon as the position’s cumulative halting score reaches 1, we mark it as inactive.

The ACT and SACT models are general and can be applied to any ResNet architecture (classification, detection, segmentation, etc).

SACT looks like an attention mechanism incorporated into ResNets — we stop processing the part of the image when its features become “good enough”. The SACT maps can be used for interpretation and gaining insights of how does a neural network work.