Review: PolyNet — 2nd Runner Up in ILSVRC 2016 (Image Classification)

Source: Deep Learning on Medium

By Using the PolyInception Module, Better Than Inception-ResNet-v2


In this story, PolyNet, by CUHK and SenseTime, is reviewed. A building block called the PolyInception module is introduced, and a Very Deep PolyNet is composed from this module. Compared to Inception-ResNet-v2, PolyNet reduces the single-crop Top-5 validation error from 4.9% to 4.25%, and the multi-crop error from 3.7% to 3.45%.


As a result, PolyNet (under the team name CU-DeepLink) obtained 2nd Runner Up in the ILSVRC 2016 classification task, as shown below, and was published as a 2017 CVPR paper. (SH Tsang @ Medium)

Compared with ResNet (the ILSVRC 2015 winner), which achieved a 3.57% Top-5 error, PolyNet achieved 3.04%, as shown below:

ILSVRC 2016 Classification Ranking (Team Name: CU-DeepLink, Model Name: PolyNet)

This is a relative improvement of about 15%, which is not trivial!

What Are Covered

  1. Brief Review of Inception-ResNet-v2 (IR-v2)
  2. PolyInception Modules
  3. Ablation Study
  4. Results

1. Brief Review of Inception-ResNet-v2 (IR-v2)

With the success of ResNet and GoogLeNet (Inception-v1), Inception-ResNet-v2 (IR-v2) was introduced to combine both:

Inception-ResNet-v2: Stem (Leftmost), Inception-A (2nd Left), Inception-B (2nd Right), Inception-C (Rightmost)

As shown above, there is a skip connection, and there are several parallel convolution paths, a design that originates from GoogLeNet. Multiple Inception-A, Inception-B, and Inception-C modules are cascaded at different levels. Finally, Inception-ResNet-v2 (IR-v2) obtains high classification accuracy.

The Inception module can be formulated as an abstract residual unit, as shown below:

Inception Module (Left), Abstract Residual Unit Denoted by F (Right)

The output becomes x + F(x), the same form as a residual block.
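The residual formulation above can be sketched numerically, treating the Inception block as an abstract operator F. The function below is a toy scalar illustration of my own, not the paper's implementation:

```python
# Minimal numeric sketch of the abstract residual unit, with a toy
# scalar function standing in for an Inception block F.
def residual(f, x):
    """Residual unit: output = x + F(x)."""
    return x + f(x)

# Toy stand-in for an Inception block: F(x) = 2x.
f = lambda x: 2 * x
print(residual(f, 3))  # 3 + F(3) = 3 + 6 = 9
```

In a real network, F would be a full Inception block whose output shape matches its input, so the addition is well defined.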

2. PolyInception Modules

To increase the accuracy, a polynomial composition is used: a second-order term is simply added, giving (I + F + F²)·x = x + F(x) + F(F(x)).

With the second-order term, the block becomes a PolyInception module.

Different Types of PolyInception Module ((a) and (b) are the same)

Different Types of PolyInception Module are suggested:

  • (a) poly-2: The first path is a skip connection (identity path). The second path is the first-order Inception block. The third path is the second-order term, which consists of two Inception blocks.
  • (b) poly-2: Since the first Inception F is used in both the first-order and second-order paths, it can be shared. By sharing the parameters as well, the second Inception F becomes the same as the first one. This increases the representation power without introducing additional parameters.
    We might also view it as a kind of recurrent neural network (RNN): on the second-order path, the output of Inception F is fed back into Inception F again. This becomes I + F + F².
  • (c) mpoly-2: If the second Inception G does not share parameters with F, we get mpoly-2, i.e. I + F + GF.
  • (d) 2-way: This is a first-order PolyInception, I + F + G.

This concept can be extended to higher-order PolyInception modules: poly-3 (I + F + F² + F³), mpoly-3 (I + F + GF + HGF), and 3-way (I + F + G + H).
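The second-order compositions above can be made concrete with toy scalar functions standing in for the Inception blocks F and G. This is only an illustration of the algebra, not a network implementation:

```python
# Toy sketch of the three second-order PolyInception compositions,
# with scalar stand-ins for the Inception blocks F and G.
def poly_2(f, x):        # (I + F + F^2) x, shared parameters
    return x + f(x) + f(f(x))

def mpoly_2(f, g, x):    # (I + F + GF) x, unshared second block
    return x + f(x) + g(f(x))

def two_way(f, g, x):    # (I + F + G) x, first-order only
    return x + f(x) + g(x)

f = lambda x: 2 * x      # F(x) = 2x
g = lambda x: 3 * x      # G(x) = 3x
print(poly_2(f, 1))      # 1 + 2 + 4 = 7
print(mpoly_2(f, g, 1))  # 1 + 2 + 6 = 9
print(two_way(f, g, 1))  # 1 + 2 + 3 = 6
```

The difference between poly-2 and mpoly-2 is visible here: poly-2 applies the same F twice on the second-order path, while mpoly-2 applies a distinct G after F.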

3. Ablation Study

There are many choices in our hands now, but which combinations are the best? The authors tried many combinations to find out.

Inception-ResNet-v2 (IR-v2, IR 5–10–5) (Top), PolyNet (Bottom)

Inception-ResNet-v2 (IR-v2) is denoted as IR 5–10–5 meaning that it has 5 Inception-A modules (IR-A) at stage A, 10 Inception-B modules (IR-B) at stage B, and 5 Inception-C modules (IR-C) at stage C.

3.1. Single Type Replacement

To speed up the experiments, a scaled-down version, IR 3–6–3, is used as the baseline. Each time, the Inception modules at one stage (Inception-A, Inception-B, or Inception-C) are replaced with one of the six PolyInception module types, as below.

Top-5 Accuracy vs Time (Top), Top-5 Accuracy vs #Params (Bottom), with replacement at Inception-A (Left), Inception-B (Middle), and Inception-C (Right)

From above figures, we can find that:

  • Any second-order PolyInception module is better than the plain Inception module.
  • Enhancing Inception-B leads to the largest gain, and mpoly-3 seems to be the best, but poly-3 gives competitive results with only 1/3 of the parameter size of mpoly-3.
  • For the other stages, A and C, 3-way performs slightly better than mpoly and poly.

3.2. Mixed Replacement

IR 6–12–6 is used as the baseline. The 12 Inception-B modules are the focus, since stage B showed the largest improvement in the previous study. Only one type of mixed PolyInception (mixed B) is tested, i.e. (3-way > mpoly-3 > poly-3) × 4.

Top-5 Error with different Inception module at Stage B

Mixed B has the lowest Top-5 error.

3.3. Final Model

  • Stage A: 10 2-way PolyInception modules
  • Stage B: 10 mixtures of poly-3 and 2-way (20 in total)
  • Stage C: 5 mixtures of poly-3 and 2-way (10 in total)

Some modifications are made to fit GPU memory, reducing cost while maintaining depth.
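The final stage layout above can be written down as a simple configuration list. The list names below are my own shorthand, not identifiers from the paper:

```python
# Hypothetical sketch of the final Very Deep PolyNet layout described
# above, as a flat list of module types per stage.
stage_a = ["2-way"] * 10             # 10 2-way modules
stage_b = ["poly-3", "2-way"] * 10   # 10 mixtures -> 20 modules
stage_c = ["poly-3", "2-way"] * 5    # 5 mixtures -> 10 modules
print(len(stage_a), len(stage_b), len(stage_c))  # 10 20 10
```

This also explains the "10–20–10 mix" naming used in the results section: 10, 20, and 10 modules at stages A, B, and C respectively.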

4. Results

4.1. Some Training Details

Initialization by Insertion (Left), Path Dropped by Stochastic Depth (Right)

Initialization by Insertion: To speed up convergence, as shown above, the second-order Inception modules are removed first and the remaining interleaved modules are trained, so a smaller network is trained at the beginning. The removed modules are then inserted back for further training.

Stochastic Depth: By randomly dropping some paths of the network during training, overfitting can be reduced. It can be treated as a special case of dropout that drops all neurons of one path.
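Stochastic depth on a single residual unit can be sketched as follows. This is a toy scalar illustration; the keep-probability rescaling used in the original stochastic-depth technique is omitted for brevity:

```python
import random

# Toy sketch of stochastic depth on one residual unit: during training
# the residual branch F is dropped with probability drop_prob, leaving
# only the identity; at test time F is always kept.
def stochastic_residual(f, x, drop_prob, training=True):
    if training and random.random() < drop_prob:
        return x           # branch dropped: identity only
    return x + f(x)        # branch kept: usual residual unit

f = lambda x: 2 * x        # toy stand-in for an Inception block
random.seed(0)
outs = {stochastic_residual(f, 1, drop_prob=0.5) for _ in range(100)}
print(sorted(outs))        # both 1 (dropped) and 3 (kept) occur
```

At test time (`training=False`) the unit always computes x + F(x), so inference is deterministic.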

4.2. ImageNet

Single-Model Results on the 1000-Class ImageNet Dataset (Left), Top-5 Accuracy (Right)

  • Very Deep Inception-ResNet: Inception-ResNet-v2 scaled up to IR 20–56–20, obtaining 19.10% Top-1 Error and 4.48% Top-5 Error.
  • Very Deep PolyNet (10–20–10 mix): 18.71% Top-1 Error and 4.25% Top-5 Error are obtained.
  • With multi-crop, Very Deep PolyNet obtains 17.36% Top-1 Error and 3.45% Top-5 Error, which is consistently better than Very Deep Inception-ResNet.
  • Thus, the second-order PolyInception module does help to improve the accuracy.

Since image classification has only one objective, recognizing the single object within the image, a good image classification model usually becomes the backbone network for object detection, semantic segmentation, etc. Therefore, it is worth studying models for image classification. I will also study ResNeXt as well.