Review: Hikvision — 1st Runner Up in ILSVRC 2016 (Object Detection)

Source: Deep Learning on Medium

1st Place for Single Model Results in ILSVRC 2016 Object Detection Challenge

By Sik-Ho Tsang
Hikvision CCTV Product

This time, the approach by Hikvision (海康威视) in the ILSVRC 2016 object detection challenge is briefly reviewed. Hikvision was founded in 2001 and is based in Hangzhou, China. It develops core technologies in audio and video encoding, video image processing, and related data storage, as well as forward-looking technologies such as cloud computing, big data, and deep learning.

Hikvision achieved top results in several ILSVRC 2016 tasks:

  • Object Detection: 2nd place, 65.27% mAP
  • Object Localization: 2nd place, 8.74% error
  • Scene Classification: 1st place, 9.01% error
  • Scene Parsing: 7th place, 53.5% (average of mean IoU and pixel accuracy)
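The scene parsing score above is the average of mean IoU and pixel accuracy. As a minimal sketch of that arithmetic, the snippet below computes both quantities from a per-pixel confusion matrix. The function name and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions for illustration, not the official challenge evaluation code.

```python
import numpy as np


def scene_parsing_score(conf: np.ndarray) -> tuple[float, float, float]:
    """Return (pixel accuracy, mean IoU, their average) from a confusion matrix.

    conf[i, j] counts pixels whose ground-truth class is i and predicted class is j.
    (Hypothetical helper, not the official evaluation script.)
    """
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                          # correctly labelled pixels per class
    pixel_acc = tp.sum() / conf.sum()           # fraction of all pixels labelled correctly
    # IoU_c = TP_c / (TP_c + FP_c + FN_c) = TP_c / (row_sum_c + col_sum_c - TP_c)
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    valid = union > 0                           # assumption: skip classes absent everywhere
    mean_iou = (tp[valid] / union[valid]).mean()
    score = (pixel_acc + mean_iou) / 2.0        # final score = average of the two metrics
    return float(pixel_acc), float(mean_iou), float(score)
```

For a two-class matrix `[[3, 1], [0, 4]]`, pixel accuracy is 7/8 = 0.875, the per-class IoUs are 0.75 and 0.8 (mean 0.775), so the combined score is 0.825.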

In this story, I focus only on the detection challenge. Although Hikvision achieved state-of-the-art results on the detection task, there is little technical innovation or novelty in the approach. Perhaps for this reason, they have not published any paper or technical report about it.

Instead, they only shared their approaches and results at the joint ImageNet and COCO workshop at ECCV 2016. (Sik-Ho Tsang @ Medium)

Hikvision VCA video analytics


  1. Cascaded RPN
  2. Global Context
  3. Other Techniques
  4. Summary of Object Detection Elements
  5. Results

1. Cascaded RPN

Cascaded RPN
  • A cascaded Region Proposal Network (RPN) is used to generate proposals.
  • Naïve RPN: batch size of 256, with a negative/positive (N/P) sample ratio usually > 10.
  • Cascaded RPN: batch size of 32, with a maximum N/P ratio of only 1.5.
  • With the cascaded RPN and the more balanced N/P ratio, recall improves.
  • This gives a 9.5% gain in Recall@0.7.
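The key difference between the two RPN settings above is how many negative anchors enter each minibatch relative to positives. The sketch below shows one way to sample a batch with a capped N/P ratio. It is an assumption-based illustration only: the label convention (1 positive, 0 negative, -1 ignored), the function name `sample_rpn_batch`, and the quota rule are hypothetical, and the cascade itself (feeding one RPN stage's proposals to the next) is not shown.

```python
import math

import numpy as np


def sample_rpn_batch(labels: np.ndarray, batch_size: int, max_np_ratio: float,
                     rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
    """Pick positive and negative anchor indices for one RPN minibatch.

    labels: 1 = positive anchor, 0 = negative anchor, -1 = ignored (assumed convention).
    The number of negatives never exceeds max_np_ratio * number of positives,
    so the batch may be smaller than batch_size when positives are scarce.
    """
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    # Largest positive count that still leaves room for max_np_ratio negatives each.
    pos_quota = math.ceil(batch_size / (1.0 + max_np_ratio))
    num_pos = min(len(pos), pos_quota)
    # Negatives fill the remaining slots, but are capped by the N/P ratio.
    num_neg = min(len(neg), batch_size - num_pos, math.floor(max_np_ratio * num_pos))
    pos_idx = rng.choice(pos, size=num_pos, replace=False)
    neg_idx = rng.choice(neg, size=num_neg, replace=False)
    return pos_idx, neg_idx
```

With batch size 32 and a maximum ratio of 1.5, an image with plenty of positives yields 13 positives and 19 negatives. An image with only 4 positive anchors yields 4 positives and 6 negatives, whereas a naïve 256-anchor batch filled with negatives would contain 252 negatives for the same 4 positives.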

2. Global Context

Global Context
  • With Global Context, global features are extracted from the whole image and concatenated with the RoI features to improve classification accuracy.
  • A 3.8% mAP gain is obtained.
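As a rough illustration of the concatenation described above, the sketch below builds a per-RoI descriptor from a (C, H, W) feature map and appends a whole-image descriptor to it. It assumes simple average pooling inside the box in place of real RoI pooling, integer feature-map coordinates, and hypothetical function names; the exact layers Hikvision used are not described in the source.

```python
import numpy as np


def roi_avg_pool(feat: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Average-pool a (C, H, W) feature map inside box = (x0, y0, x1, y1).

    Coordinates are in feature-map cells, end-exclusive. This is a simplified
    stand-in for RoI pooling, used here only for illustration.
    """
    x0, y0, x1, y1 = box
    return feat[:, y0:y1, x0:x1].mean(axis=(1, 2))


def roi_with_global_context(feat: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Concatenate a local RoI descriptor with a global (whole-image) descriptor."""
    roi_feat = roi_avg_pool(feat, box)             # local descriptor, shape (C,)
    global_feat = feat.mean(axis=(1, 2))           # whole-image descriptor, shape (C,)
    return np.concatenate([roi_feat, global_feat])  # shape (2C,), fed to the classifier
```

The classifier then sees both what is inside the box and a summary of the whole scene, which is the intuition behind the reported mAP gain.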

3. Other Techniques

  • Pre-training on ImageNet LOC: 0.5% mAP gain.
  • Balanced Sampling: 0.7% mAP gain on VOC 2007.
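The source does not say how the balanced sampling is done. One common form is class-balanced image sampling: first pick a class uniformly at random, then pick an image containing that class. The sketch below assumes that scheme; the data layout and function names are hypothetical.

```python
import random
from collections import defaultdict


def build_class_index(image_labels: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each class to the images that contain at least one instance of it."""
    index: dict[str, list[str]] = defaultdict(list)
    for image_id, classes in image_labels.items():
        for c in set(classes):
            index[c].append(image_id)
    return dict(index)


def balanced_sample(class_index: dict[str, list[str]], n: int,
                    rng: random.Random) -> list[str]:
    """Draw n image ids: choose a class uniformly, then an image of that class uniformly."""
    classes = sorted(class_index)  # fixed order so results depend only on the seed
    return [rng.choice(class_index[rng.choice(classes)]) for _ in range(n)]
```

With three "dog" images and one "cat" image, uniform image sampling shows the cat image 25% of the time, while this scheme shows it about 50% of the time, so rare classes are seen more often during training.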

4. Summary of Object Detection Elements

Object Detection Elements

5. Results

5.1. ILSVRC 2016 Detection Challenge

ILSVRC 2016 Detection Challenge
  • With a single model, Hikvision actually ranks 1st, ahead of the CUImage team, which used GBD-Net.
  • However, with model ensembles, CUImage (GBD-Net) obtains better results.

5.2. ILSVRC 2016 Localization Challenge

ILSVRC 2016 Localization Challenge
  • Hikvision ranks 2nd, with a classification error of 3.7% and a localization error of 8.7%.

5.3. PASCAL VOC 2012

  • Hikvision outperforms ResNet on PASCAL VOC 2012.

By combining these techniques, Hikvision obtained 2nd place in the ILSVRC 2016 detection challenge.


[2016 ECCV] [Hikvision] (Slides Only)
Towards Good Practices for Recognition & Detection

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [MSDNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [GBD-Net / GBD-v1 & GBD-v2] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3] [DRN]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net]

Instance Segmentation
[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution

Human Pose Estimation
[DeepPose] [Tompson NIPS’14]