Neural network architectures

Source: Deep Learning on Medium

A review of a few important neural network architectures, including VGG, ResNet, GoogleNet (Inception), and MobileNet.

Since AlexNet was published in 2012, many architectures have been developed to improve accuracy, increase the depth of neural networks, and reduce both model size and the number of operations. Here I study and review a few important developments.

An analysis of deep neural network models for practical applications

Let’s first get a big picture of these architectures in terms of accuracy, model size, operation count, inference time, and power usage. The paper is from 2016, so it does not cover MobileNet or other later developments.

Figure 1 shows 1-crop top-1 accuracies of the most relevant entries submitted to the ImageNet challenge, from AlexNet (Krizhevsky et al., 2012), on the far left, to the best performing Inception-v4 (Szegedy et al., 2016). The newest ResNet and Inception architectures surpass all other architectures by a significant margin of at least 7%. Note that 1-crop, 5-crop, or 10-crop refers to the number of crops taken from each image at test time.
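To make the crop terminology concrete, here is a minimal sketch of how 5-crop and 10-crop views are commonly laid out. It assumes the convention described in the AlexNet paper (four corners plus the center, each with and without a horizontal flip). The function names and the 256/224 sizes are my own illustrative choices, not something taken from the analysis paper.

```python
def five_crop_boxes(height, width, crop):
    """Return five (top, left, bottom, right) boxes: four corners and the center."""
    if crop > height or crop > width:
        raise ValueError("crop must fit inside the image")
    center_top = (height - crop) // 2
    center_left = (width - crop) // 2
    origins = [
        (0, 0),                            # top-left
        (0, width - crop),                 # top-right
        (height - crop, 0),                # bottom-left
        (height - crop, width - crop),     # bottom-right
        (center_top, center_left),         # center
    ]
    return [(t, l, t + crop, l + crop) for t, l in origins]


def ten_crop_views(height, width, crop):
    """Each of the five boxes, once as-is and once horizontally flipped."""
    boxes = five_crop_boxes(height, width, crop)
    return [(box, flipped) for flipped in (False, True) for box in boxes]


boxes = five_crop_boxes(256, 256, 224)
views = ten_crop_views(256, 256, 224)
print(boxes[-1])    # center crop box
print(len(views))   # number of views whose predictions get averaged
```

With 1-crop testing only the center box is evaluated. With 5-crop or 10-crop testing the model is run on every view and the predictions are averaged, which usually helps accuracy at the cost of 5 or 10 forward passes per image.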

Figure 2 shows model size (# of parameters) and the amount of operations required for a single forward pass (inference) in addition to the top-1 accuracy. The first thing that is very apparent is that VGG, even though it is widely used in many applications, is by far the most expensive architecture — both in terms of computational requirements and number of parameters. Its 16- and 19-layer implementations are in fact isolated from all other networks. The other architectures form a steep straight line, that seems to start to flatten with the latest incarnations of Inception and ResNet. This might suggest that models are reaching an inflection point on this data set. At this inflection point, the costs — in terms of complexity — start to outweigh gains in accuracy.

Figure 3 reports inference time per image on each architecture, as a function of image batch size (from 1 to 64; this is the batch size used at inference time, not during training). We notice that VGG processes one image in a fifth of a second, making it a less likely contender in real-time applications on an NVIDIA TX1.

In Figure 7, for a batch of 16 images, there is a linear relationship between operation count and inference time per image. Therefore, at design time, we can put a constraint on the number of operations to keep processing speed in a usable range for real-time applications or resource-limited deployments.


The main contribution of VGG is to increase the depth to 16–19 layers by using small (3*3) convolution filters.


  • simple generic structure
  • deeper, depth matters
  • a stack of small filters covers the same receptive field as one larger filter, with more non-linearity and fewer parameters
  • multi-scaling to augment images
  • fully conv network
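The point about small filters can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions, no bias terms, and the same number of channels going in and out of every layer. The helper names and the choice of 64 channels are mine, for illustration only.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 kernel x kernel convs."""
    return 1 + num_layers * (kernel - 1)


def conv_params(kernel, channels, num_layers=1):
    """Weights in num_layers conv layers with `channels` in and out, no bias."""
    return num_layers * kernel * kernel * channels * channels


channels = 64  # illustrative channel count
for num_layers, big_kernel in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(num_layers)
    stacked = conv_params(3, channels, num_layers)
    single = conv_params(big_kernel, channels)
    print(f"{num_layers} x (3*3): receptive field {rf}*{rf}, "
          f"{stacked} params vs {single} for one {big_kernel}*{big_kernel}")
```

Three stacked 3*3 layers see the same 7*7 window as a single 7*7 layer but need 27*C*C weights instead of 49*C*C, and they apply three non-linearities instead of one.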


  • VGG-16 and VGG-19 cost more computation than ResNet-152 despite having far fewer layers
  • a large number of parameters, model size is large


GoogleNet + Inception

GoogleNet is also called Inception-v1. It was later developed into Inception v2, v3, and v4. The Inception-v4 paper also introduces Inception-ResNet variants, which combine the Inception block with the residual block. In contrast to ResNet, GoogleNet makes the network “wider”: an Inception block applies convolution filters of several sizes in parallel and concatenates the resulting feature maps.
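Here is a rough NumPy sketch of an Inception-v1 style block, to show what "concatenating feature maps from multiple scales" means in practice. It assumes a channels-last layout, stride 1, and "same" padding. The helpers `conv_same` and `max_pool_same`, the parameter names, and the tiny channel counts are all my own toy choices, and the loops are written for readability rather than speed.

```python
import numpy as np


def relu(t):
    return np.maximum(t, 0.0)


def conv_same(x, w):
    """Naive stride-1 'same' conv. x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    p = k // 2
    h, wd, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out


def max_pool_same(x, k=3):
    """Stride-1 max pooling that keeps the spatial size."""
    p = k // 2
    h, wd, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    windows = [xp[i:i + h, j:j + wd, :] for i in range(k) for j in range(k)]
    return np.max(windows, axis=0)


def inception_block(x, params):
    """Four parallel branches, concatenated along the channel axis."""
    b1 = relu(conv_same(x, params["1x1"]))
    b2 = relu(conv_same(relu(conv_same(x, params["3x3_reduce"])), params["3x3"]))
    b3 = relu(conv_same(relu(conv_same(x, params["5x5_reduce"])), params["5x5"]))
    b4 = relu(conv_same(max_pool_same(x), params["pool_proj"]))
    return np.concatenate([b1, b2, b3, b4], axis=-1)


rng = np.random.default_rng(0)
c_in = 16  # toy sizes, far smaller than the real network
params = {
    "1x1": rng.standard_normal((1, 1, c_in, 4)),
    "3x3_reduce": rng.standard_normal((1, 1, c_in, 6)),
    "3x3": rng.standard_normal((3, 3, 6, 8)),
    "5x5_reduce": rng.standard_normal((1, 1, c_in, 2)),
    "5x5": rng.standard_normal((5, 5, 2, 2)),
    "pool_proj": rng.standard_normal((1, 1, c_in, 2)),
}
x = rng.standard_normal((8, 8, c_in))
out = inception_block(x, params)
print(out.shape)  # spatial size unchanged, channels = 4 + 8 + 2 + 2
```

Every branch keeps the spatial size, so the outputs can be stacked along the channel axis. The 1*1 "reduce" convolutions shrink the channel count before the more expensive 3*3 and 5*5 filters.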

More reading (in Mandarin):

Changes in network structure from Inception V1 to Inception V4


ResNet is a milestone that increases the depth of neural nets to 50, 100, or even 1000 layers while keeping reasonable training and test accuracy. Before ResNet, VGGNet and GoogleNet had about 20 layers.

The ResNet paper points out a paradox: deeper plain networks can have higher training error than shallower ones.

The training error is larger for the 56-layer NN compared to the 20-layer.

If only the test error were higher, the cause would probably be overfitting. But it turns out the training error is also higher for deeper networks. In theory, adding more layers to a network is like adding more polynomial terms to an equation: the goodness of fit on the training data should never get worse. As stated in the paper, “Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model.” ResNet solves this paradox by adding the residual block.

Residual block

Right, problem solved. But here are two questions to think about: (1) what is the cause of the paradox? (2) how can we explain that ResNet solves it?

The first question: what is the cause? Is it vanishing or exploding gradients in deeper networks? As the paper states, “this problem, however, has been largely addressed by normalized initialization and intermediate normalization layers.” Examples are Xavier initialization for linear activations, He initialization for ReLU, and batch normalization to normalize the activations. Thus, the degradation is attributed to the difficulty of optimizing a network with more layers. Although in theory the new layers could simply be identity mappings, it is not easy for a complex, stochastic optimization process to fit them exactly as identity mappings. This brings us to the second question. The paper hypothesizes that it is easier to optimize the residual mapping than the original mapping: by fitting the residual, the optimizer more easily finds a comparably good or even better solution.
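A toy example helps me see why the residual form makes the identity easy to reach. The sketch below is a simplification under my own assumptions: fully connected layers instead of convolutions, no batch normalization, and an input that is already non-negative (as it would be after a previous ReLU). The function names are hypothetical.

```python
import numpy as np


def relu(t):
    return np.maximum(t, 0.0)


def plain_block(x, w1, w2):
    """Two stacked layers that must learn the whole mapping H(x)."""
    return relu(w2 @ relu(w1 @ x))


def residual_block(x, w1, w2):
    """Same two layers, but they only learn the residual F(x) = H(x) - x."""
    return relu(x + w2 @ relu(w1 @ x))


rng = np.random.default_rng(0)
d = 8
x = relu(rng.standard_normal(d))   # non-negative, like a post-ReLU activation
zeros = np.zeros((d, d))
eye = np.eye(d)

residual_with_zero_weights = residual_block(x, zeros, zeros)
plain_with_zero_weights = plain_block(x, zeros, zeros)
plain_with_identity_weights = plain_block(x, eye, eye)

print(np.allclose(residual_with_zero_weights, x))   # residual block passes x through
print(np.allclose(plain_with_zero_weights, 0.0))    # plain block wipes the signal out
print(np.allclose(plain_with_identity_weights, x))  # plain block needs exact identity weights
```

In the residual block the identity mapping corresponds to all-zero weights, so the optimizer only has to push the weights toward zero. The plain block has to land on exact identity matrices to get the same behavior, which is a much more specific target.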

Compared to VGG-16 or VGG-19, the memory and computation complexity of ResNet-152 is lower, even though it is far deeper. This is because VGG uses so many conv filters. But the training time for ResNet is still long.

More readings about what problem ResNet solves:

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

Identity mapping in deep residual networks





MobileNet replaces the standard convolution with the depthwise separable convolution. Given an image, a depthwise separable convolution keeps the same input and output dimensions as a regular convolution but has fewer parameters in the conv layer. Here is a good visualization that compares the standard and depthwise separable conv.

The standard convolution
Depthwise convolution
Pointwise convolution
Computation reduction
# input image: 32*32*3, output feature map: 32*32*16
# H*W: filter size (3*3), C: input channels (3), F: number of filters (16), N*N: feature map size (32*32)
# regular conv
# filter shape: H*W*C*F
parameters: H*W*C*F = 3*3*3*16 = 432
calculations: (H*W*C)*(N*N)*F = (3*3*3)*(32*32)*16 = 442368
# depthwise separable conv
# depthwise conv - H*W*C, pointwise conv - 1*1*C*F
parameters: H*W*C + 1*1*C*F = (H*W+F)*C = 3*3*3 + 1*1*3*16 = 75
calculations: H*W*(N*N)*C + C*(N*N)*F = (H*W+F)*(N*N)*C = 3*3*32*32*3 + 3*32*32*16 = 76800
# reduction ratio (depthwise separable / regular)
parameters: (H*W+F)*C / (H*W*C*F) = (H*W+F) / (H*W*F) = 25/144 ≈ 0.17
calculations: (H*W+F)*(N*N)*C / (H*W*C*N*N*F) = (H*W+F) / (H*W*F) = 25/144 ≈ 0.17
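The same numbers can be reproduced with a small NumPy sketch of a depthwise separable convolution. It assumes a square channels-last input, stride 1, "same" padding, and no bias. The function names are my own, and the naive loops are for illustration, not performance.

```python
import numpy as np


def depthwise_conv_same(x, w_dw):
    """One k x k filter per input channel. x: (N, N, C), w_dw: (k, k, C)."""
    k = w_dw.shape[0]
    p = k // 2
    n = x.shape[0]
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + n, j:j + n, :] * w_dw[i, j]  # channels stay separate
    return out


def pointwise_conv(x, w_pw):
    """1*1 conv that mixes channels. x: (N, N, C), w_pw: (C, F)."""
    return x @ w_pw


def conv_costs(k, c, f, n):
    """Parameter and multiply-add counts for a k x k conv on an n x n map."""
    standard = {"params": k * k * c * f, "mult_adds": k * k * c * n * n * f}
    separable = {"params": k * k * c + c * f,
                 "mult_adds": k * k * c * n * n + c * n * n * f}
    return standard, separable


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))
w_dw = rng.standard_normal((3, 3, 3))   # depthwise weights: H*W*C
w_pw = rng.standard_normal((3, 16))     # pointwise weights: 1*1*C*F
y = pointwise_conv(depthwise_conv_same(x, w_dw), w_pw)

standard, separable = conv_costs(k=3, c=3, f=16, n=32)
print(y.shape)                                    # (32, 32, 16)
print(w_dw.size + w_pw.size)                      # 75
print(standard, separable)
print(separable["params"] / standard["params"])   # 25/144, about 0.17
```

The output has the same 32*32*16 shape a regular 3*3 convolution with 16 filters would produce, while the weight arrays hold 75 values instead of 432.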

Question: But with fewer parameters, how can depthwise separable conv achieve the same accuracy as regular conv?

Beyond reducing the number of parameters and calculations compared to standard convolutions, the depthwise separable convolution has a further benefit. Explanation from the paper: “It is not enough to simply define networks in terms of a small number of Mult-Adds. It is also important to make sure these operations can be efficiently implementable. For instance, unstructured sparse matrix operations are not typically faster than dense matrix operations until a very high level of sparsity. Our model structure puts nearly all of the computation into dense 1*1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. … 1*1 convolutions do not require this reordering in memory and can be implemented directly with GEMM which is one of the most optimized numerical linear algebra algorithms. MobileNet spends 95% of its computation time in 1*1 convolutions which also has 75% of the parameters as can be seen in Table 2.”
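To see the GEMM argument in code, here is a small NumPy sketch. It assumes a contiguous channels-last array and "valid" padding for the 3*3 case. The function names are mine, and this `im2col` is a naive illustration of the idea rather than the routine any particular framework uses.

```python
import numpy as np


def conv1x1_as_gemm(x, w):
    """x: (H, W, C), w: (C, F). One matrix multiply on the (H*W, C) view of x."""
    h, wd, c = x.shape
    return (x.reshape(h * wd, c) @ w).reshape(h, wd, -1)


def im2col(x, k):
    """Copy every k x k x C patch into one row of a new matrix."""
    h, wd, c = x.shape
    oh, ow = h - k + 1, wd - k + 1
    cols = np.empty((oh * ow, k * k * c))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + k, j:j + k, :].ravel()
    return cols, (oh, ow)


def conv_kxk_as_gemm(x, w):
    """w: (k, k, C, F). The patches must be gathered before the GEMM."""
    k, _, c, f = w.shape
    cols, (oh, ow) = im2col(x, k)
    return (cols @ w.reshape(k * k * c, f)).reshape(oh, ow, f)


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))
w1 = rng.standard_normal((4, 5))
w3 = rng.standard_normal((3, 3, 4, 5))

y1 = conv1x1_as_gemm(x, w1)
y3 = conv_kxk_as_gemm(x, w3)
cols3, _ = im2col(x, 3)

print(y1.shape, y3.shape)
print(cols3.size / x.size)                         # im2col duplicates overlapping pixels
print(np.shares_memory(x, x.reshape(-1, 4)))       # the 1*1 case reuses x's memory
```

For the 3*3 convolution, `im2col` builds a new matrix several times larger than the input before the multiply can start. For the 1*1 convolution, the input is already in the right shape, so the multiply runs directly on the existing memory.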

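Optional reading (translated from Mandarin):

“Replacing traditional convolution with depthwise convolution combined with 1×1 convolution is not only more efficient in theory; because 1×1 convolutions are used so heavily, the operation can be carried out directly with highly optimized math libraries. Taking Caffe as an example, using these libraries requires first rearranging the data with im2col so that it matches the input format the libraries expect, but 1×1 convolution needs no such preprocessing.” im2col is an operation used to optimize convolution. In other words, computing a regular conv requires an im2col-like step, while a 1*1 conv does not. “In MobileNet, 95% of the computation and 75% of the parameters belong to 1×1 convolutions.”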