Review: DeepID-Net — Def-Pooling Layer (Object Detection)

In this story, DeepID-Net is briefly reviewed. It introduces a deformable part-based CNN, in which a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with a geometric constraint and penalty.

That means that, besides detecting the entire object directly, it is also crucial to detect object parts, which can then help to detect the entire object. DeepID-Net was the 1st runner-up in the ILSVRC 2014 object detection task. It was published in 2015 CVPR [1] and 2017 TPAMI [2] papers, with about 300 citations in total. (SH Tsang @ Medium)


In the pipeline diagram, the steps in black already existed in R-CNN. The steps in red did not appear in R-CNN.

I will go through each step in the diagram and present the results at the end of the story.


  1. Selective Search
  2. Box Rejection
  3. Pretrain Using Object-Level Annotations
  4. Def-Pooling Layer
  5. Context Modeling
  6. Model Averaging
  7. Bounding Box Regression

1. Selective Search

Selective Search
  1. First, color similarity, texture similarity, region size, and region fill are used for a non-object-based segmentation. This yields many small segmented areas, as shown at the bottom left of the image above.
  2. Then, a bottom-up approach merges the small segmented areas into larger segmented areas.
  3. Thus, about 2K region proposals (bounding box candidates) are generated, as shown in the image.
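
The bottom-up merging in the steps above can be sketched roughly as follows. This is a toy illustration, not the actual selective search implementation: the `Region` class, the `similarity` function, and the equal weighting of the cues are assumptions made here, the texture cue is omitted, and any two regions may be merged (the real algorithm only merges neighbouring regions).

```python
from dataclasses import dataclass
from itertools import combinations

import numpy as np


@dataclass
class Region:
    box: tuple          # (x0, y0, x1, y1), upper bounds exclusive
    size: int           # number of pixels in the region
    hist: np.ndarray    # L1-normalised colour histogram


def merge(a, b):
    """Union of two regions: enclosing box, summed size, size-weighted histogram."""
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    size = a.size + b.size
    hist = (a.size * a.hist + b.size * b.hist) / size
    return Region(box, size, hist)


def similarity(a, b, image_area):
    """Hypothetical equal-weight sum of colour, size, and fill cues."""
    colour = np.minimum(a.hist, b.hist).sum()        # histogram intersection
    size = 1.0 - (a.size + b.size) / image_area      # favour merging small regions
    m = merge(a, b)
    box_area = (m.box[2] - m.box[0]) * (m.box[3] - m.box[1])
    fill = 1.0 - (box_area - a.size - b.size) / image_area  # favour tight fits
    return colour + size + fill


def propose_boxes(regions, image_area):
    """Greedily merge the most similar pair; every region seen yields a proposal."""
    regions = list(regions)
    proposals = [r.box for r in regions]
    while len(regions) > 1:
        i, j = max(combinations(range(len(regions)), 2),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]], image_area))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged.box)
    return proposals
```

Starting from n initial regions, this loop produces 2n - 1 boxes in total, which is how a fine initial segmentation can lead to a couple of thousand proposals.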

2. Box Rejection

R-CNN is used to reject bounding boxes that are most likely to be background.
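
A minimal sketch of such a rejection step is shown below. It assumes that a box counts as background when none of its class scores reaches a cut-off; the function name `reject_background_boxes` and the `threshold` parameter are hypothetical, and the actual rule and value used by the authors are not given here.

```python
import numpy as np


def reject_background_boxes(boxes, class_scores, threshold):
    """Keep boxes whose best class score reaches the threshold.

    boxes:        (N, 4) array of candidate boxes
    class_scores: (N, C) array of per-class detection scores
    threshold:    hypothetical cut-off, to be tuned on validation data
    """
    keep = class_scores.max(axis=1) >= threshold
    return boxes[keep], class_scores[keep]
```

Rejecting boxes early means that the later, more expensive stages only have to process the surviving candidates.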

3. Pretrain Using Object-Level Annotations

Object-Level Annotation (Left), Image-Level Annotation (Right)

Usually, pretraining is done on image-level annotations. This works poorly when the object is small within the image, because at detection time the object should occupy a large area of the bounding box created by selective search.

Thus, pretraining is done on object-level annotations instead. The deep learning model can be any model, such as ZFNet, VGGNet, or GoogLeNet.
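
To illustrate the difference, the sketch below crops an image to an annotated object box, so that the pretraining sample is dominated by the object rather than by the background. The helper names `crop_object` and `object_fraction` are hypothetical, and the real pipeline presumably also resizes and augments the crops.

```python
import numpy as np


def crop_object(image, box):
    """Crop an (H, W, C) image to an annotated box (x0, y0, x1, y1).

    Upper bounds are exclusive; coordinates are clipped to the image.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    return image[y0:y1, x0:x1]


def object_fraction(image, box):
    """Fraction of the image area covered by the annotated object box."""
    crop = crop_object(image, box)
    return crop.shape[0] * crop.shape[1] / (image.shape[0] * image.shape[1])
```

A small `object_fraction` indicates an image where image-level pretraining would mostly see background, which is the case object-level annotations are meant to address.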

4. Def-Pooling Layer

Overall Architecture with More Details

Taking ZFNet as an example, the output of conv5 goes through the original FC layers fc6 and fc7, as well as through a set of conv layers and the proposed def-pooling layers.

Def-Pooling Layers (Deformation Constrained Pooling), High Activation Value for the Circle Center of Each Light
Def-Pooling Equations

For the def-pooling path, the output of conv5 goes through a conv layer, then a def-pooling layer, and then a max pooling layer.

To be brief, the summation of a_c multiplied by d_{c,n} is the 5×5 deformation penalty in the figure above. It is the penalty for displacing the object part from its assumed anchor position.

The def-pooling layers learn the deformation of object parts with different sizes and semantic meanings.

Once this def-pooling layer is trained, the parts of the object to be detected give a high activation value after the def-pooling layer if they are close to their anchor positions. This output is then connected to the 200-class scores to improve them.
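
The idea of pooling with a deformation penalty can be sketched in NumPy as below. This is a simplified illustration under several assumptions: a single part-detection map, stride 1 (every position acts as an anchor), and a fixed penalty grid instead of learned parameters. The quadratic form in `quadratic_penalty` and both function names are hypothetical choices for this sketch.

```python
import numpy as np


def def_pool(part_map, penalty):
    """Deformation-constrained pooling over a single part-detection map.

    part_map: (H, W) responses of one part filter
    penalty:  (K, K) deformation cost per displacement, K odd; the centre
              entry is the cost of zero displacement from the anchor
    """
    k = penalty.shape[0]
    r = k // 2
    h, w = part_map.shape
    # Pad with -inf so displacements outside the map never win the max.
    padded = np.full((h + 2 * r, w + 2 * r), -np.inf)
    padded[r:r + h, r:r + w] = part_map
    out = np.full((h, w), -np.inf)
    for dy in range(k):
        for dx in range(k):
            # Response at displacement (dy - r, dx - r) from each anchor.
            shifted = padded[dy:dy + h, dx:dx + w]
            out = np.maximum(out, shifted - penalty[dy, dx])
    return out


def quadratic_penalty(a, k=5):
    """Hypothetical penalty a1*dx + a2*dy + a3*dx^2 + a4*dy^2 on a k x k grid."""
    r = k // 2
    dy, dx = np.mgrid[-r:r + 1, -r:r + 1]
    return a[0] * dx + a[1] * dy + a[2] * dx ** 2 + a[3] * dy ** 2
```

With an all-zero penalty this reduces to ordinary 5×5 max pooling with stride 1; with a growing penalty, a strong part response contributes less the further it lies from the anchor.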

5. Context Modeling

The ILSVRC object detection task has only 200 classes. ILSVRC also has a classification task for classifying and localizing objects from 1000 classes, whose contents are more diverse than those of the object detection task. Hence, the 1000-class scores, obtained from a classification network, are used to refine the 200-class scores.
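
One simple way to picture this refinement is to concatenate both score vectors and apply a learned linear map, as sketched below. The linear form, the function name `refine_with_context`, and the `weights` and `bias` parameters are assumptions made for illustration; the exact classifier used for the refinement is not described here.

```python
import numpy as np


def refine_with_context(det_scores, cls_scores, weights, bias):
    """Refine detection scores using image-level classification scores.

    det_scores: (200,) detection scores for one candidate box
    cls_scores: (1000,) classification scores for the whole image
    weights:    (200, 1200) hypothetical learned linear weights
    bias:       (200,) hypothetical learned bias
    """
    features = np.concatenate([det_scores, cls_scores])  # (1200,)
    return weights @ features + bias
```

Intuitively, a positive weight between a detection class and a related classification class lets a confident whole-image prediction raise the score of a matching box.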

6. Model Averaging

Multiple models are used to increase accuracy, and the results from all models are averaged. This technique has been in use since LeNet and AlexNet.
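
In its simplest form, averaging means taking the mean of the per-class scores that each model assigns to the same boxes, as in the sketch below. The function name `average_models` is hypothetical, and equal weighting of the models is an assumption.

```python
import numpy as np


def average_models(score_list):
    """Average per-class scores from several models for the same boxes.

    score_list: list of (N, C) arrays, one per model
    """
    return np.mean(np.stack(score_list, axis=0), axis=0)
```

Averaging tends to help most when the models make different errors, for example because they use different architectures or training settings.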

7. Bounding Box Regression

Bounding box regression simply fine-tunes the bounding box location, as already done in R-CNN.
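
As a reminder of how this works in R-CNN, the sketch below applies predicted offsets to a proposal: the centre is shifted in units of the box size and the width and height are rescaled in log space. The function name `apply_box_deltas` and the corner-based box format are choices made for this sketch.

```python
import numpy as np


def apply_box_deltas(box, deltas):
    """Apply R-CNN-style regression offsets to one proposal.

    box:    (x0, y0, x1, y1) proposal corners
    deltas: (dx, dy, dw, dh) predicted by the regressor
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Shift the centre in units of box size; rescale size in log space.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * np.exp(dw), h * np.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)
```

All-zero offsets leave the proposal unchanged, so the regressor only has to learn small corrections.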


Incremental Results
  • R-CNN with selective search (Step 1): 29.9% mAP (mean average precision)
  • + bounding box rejection (Step 2): 30.9%
  • Changed from AlexNet to ZFNet (Step 3): 31.8%
  • Changed from ZFNet to VGGNet (Step 3): 36.6%
  • Changed from VGGNet to GoogLeNet (Step 3): 37.8%
  • + pretraining on object-level annotations (Step 3): 40.4%
  • + edge-based bounding box proposals from [Ref 60], to obtain more proposals: 42.7%
  • + Def-Pooling Layers (Step 4): 44.9%
  • + multi-scale training, as suggested in VGGNet: 47.3%
  • + context modeling (Step 5): 47.8%
  • + bounding box regression (Step 7): 48.2%
  • + model averaging (Step 6): 50.7%
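
To make the contribution of each step easier to compare, the small script below computes the gain in mAP points between consecutive rows of the list above. The numbers are copied from the list; the step labels are shortened, and the names `results` and `step_gains` are just for this sketch.

```python
results = [
    ("R-CNN with selective search", 29.9),
    ("bounding box rejection", 30.9),
    ("AlexNet to ZFNet", 31.8),
    ("ZFNet to VGGNet", 36.6),
    ("VGGNet to GoogLeNet", 37.8),
    ("object-level pretraining", 40.4),
    ("edge-based proposals", 42.7),
    ("def-pooling layers", 44.9),
    ("multi-scale training", 47.3),
    ("context modeling", 47.8),
    ("bounding box regression", 48.2),
    ("model averaging", 50.7),
]


def step_gains(results):
    """Return (step name, gain in mAP points) for each step after the first."""
    return [(name, round(score - prev, 1))
            for (_, prev), (name, score) in zip(results, results[1:])]


gains = step_gains(results)
total = round(results[-1][1] - results[0][1], 1)
for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"total: +{total}")
```

Note that each gain is measured on top of all earlier steps, so the values depend on the order in which the techniques were added.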

Compared with multi-model multi-crop GoogLeNet, DeepID-Net’s mAP is 6.1% higher. However, as the list shows, some of the gains come from techniques in other papers. Nevertheless, the two most novel ideas are pretraining on object-level annotations and the def-pooling layers.

Source: Deep Learning on Medium