Detecting Objects as Paired Keypoints

Original article was published on Deep Learning on Medium


Most commonly used object detection algorithms are currently anchor-based, such as the Faster R-CNN series, SSD, and YOLO (v2, v3).

These methods place tens of thousands of anchors of various sizes and aspect ratios to ensure sufficient overlap with ground-truth boxes, which leaves positive samples far outnumbered by negatives and slows down training.

The use of anchor boxes also introduces many hyperparameters, such as the number, size, and aspect ratio of the anchors. These choices are largely made through ad-hoc heuristics and become even more complex when combined with multi-scale architectures.

CornerNet [1] proposes a new single-stage object detection method that does away with anchor boxes. It uses corner information directly to locate targets: it detects and groups the top-left and bottom-right corners of each bounding box. Stacked hourglass networks predict heatmaps for the corners, and associative embeddings then group them.

Fig. 4 Overview of CornerNet. The backbone network is followed by two prediction modules, one for the top-left corners and the other for the bottom-right corners. Using the predictions from both modules, we locate and group the corners. Source [1]

Keypoint-based object detection is a class of methods that generate object bounding boxes by detecting and grouping keypoints, eliminating the need for anchors and providing a simplified detection framework.

Hourglass Network

The hourglass network is a fully convolutional neural network that outputs precise pixel positions of human-body keypoints from a single RGB image, using multi-scale features to capture the spatial position of each joint.

The network structure looks like an hourglass, repeating top-down and bottom-up processing to infer the positions of the body's joints.

Fig. 1. Our network for pose estimation consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference. Source [2]

To capture image features at multiple scales, a common practice is to process information at each scale in a separate pipeline and combine the features later in the network. The hourglass network instead uses skip layers to preserve spatial information at each scale.

Fig. 3. An illustration of a single “hourglass” module. Each box in the figure corresponds to a residual module as seen in Figure 4. The number of features is consistent across the whole hourglass. Source [2]

Convolution and max pooling are used to reduce the features to a very low resolution. At each max-pooling step, the network branches off and applies further convolutions at the pre-pooled resolution. Once the lowest resolution is reached, the network starts to upsample and combine features across scales.
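The recursive pool-process-upsample-merge pattern can be sketched in a few lines. This is a minimal illustration assuming PyTorch, with plain 3×3 convolutions standing in for the residual modules of the paper; the class name and channel choices are mine, not from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Minimal sketch of one recursive hourglass module."""
    def __init__(self, depth, channels):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)   # branch kept at the pre-pooled scale
        self.down = nn.Conv2d(channels, channels, 3, padding=1)   # processing after pooling
        self.inner = (Hourglass(depth - 1, channels) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))  # bottleneck at lowest scale
        self.up = nn.Conv2d(channels, channels, 3, padding=1)     # processing before upsampling

    def forward(self, x):
        skip = self.skip(x)                                # preserve spatial detail at this scale
        y = F.max_pool2d(x, 2)                             # halve the resolution
        y = self.inner(self.down(y))                       # recurse to lower scales
        y = F.interpolate(self.up(y), scale_factor=2, mode="nearest")
        return y + skip                                    # combine features across scales
```

Note how the skip branch is what lets the network recover spatial information lost by pooling: the output resolution equals the input resolution at every level.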

The output of the network is a set of heatmaps.

Fig. 2. Example output produced by our network. On the left we see the final pose estimate provided by the max activations across each heatmap. On the right we show sample heatmaps. (From left to right: neck, left elbow, left wrist, right knee, right ankle.) Source [2]

A heatmap encodes relationships among related joints and can be regarded as a graphical model. Using the heatmaps from the first hourglass as input to the next therefore lets the second hourglass exploit the relationships between joints, improving the accuracy of joint predictions.

Fig. 4. Left: Residual module [14] that we use throughout our network. Right: Illustration of the intermediate supervision process. The network splits and produces a set of heatmaps (outlined in blue) where a loss can be applied. A 1×1 convolution remaps the heatmaps to match the number of channels of the intermediate features. These are added together along with the features from the preceding hourglass. Source [2]

The features pass through a 1×1 convolution and are then split into an upper and a lower branch.

The lower branch first undergoes a 1×1 convolution to generate heatmaps, shown as the blue part of the figure. The blue box in the figure above is narrower than the other three because the depth of the heatmap tensor matches the number of joints in the training data, while the others have greater depth. The heatmaps then undergo another 1×1 convolution to adjust their depth to match the upper branch, and are finally merged with it to form the input to the next hourglass.

Stacking hourglass modules yields a stacked hourglass network. After each hourglass module, a prediction can be made and a loss computed, which serves as intermediate supervision.


Experiments confirm that this yields much better prediction accuracy than supervising only the final hourglass's prediction.
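Intermediate supervision can be sketched as follows, assuming PyTorch. A single 3×3 convolution stands in for each full hourglass, and the class and function names are hypothetical; the point is that every stage emits heatmaps, each of which receives its own loss, and the remapped heatmaps are added back into the features for the next stage.

```python
import torch
import torch.nn as nn

class StackedHourglass(nn.Module):
    """Sketch: every stack emits heatmaps that are supervised and fed back."""
    def __init__(self, num_stacks, channels, num_joints):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleDict({
                "hourglass": nn.Conv2d(channels, channels, 3, padding=1),  # stands in for a full hourglass
                "to_heatmap": nn.Conv2d(channels, num_joints, 1),          # the blue box in Fig. 4
                "remap": nn.Conv2d(num_joints, channels, 1),               # heatmaps back to feature depth
            }) for _ in range(num_stacks))

    def forward(self, x):
        heatmaps = []
        for stage in self.stages:
            feat = stage["hourglass"](x)
            hm = stage["to_heatmap"](feat)
            heatmaps.append(hm)                      # a loss is applied to every element
            x = x + feat + stage["remap"](hm)        # merge features and remapped heatmaps
        return heatmaps

def intermediate_loss(heatmaps, target):
    # Sum per-stage losses instead of supervising only the last stage.
    return sum(nn.functional.mse_loss(hm, target) for hm in heatmaps)
```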

Corner Pooling

Corner pooling is a pooling method proposed in CornerNet and used in its prediction modules.

To determine whether a pixel is a top-left corner, the network needs to look rightward along the horizontal direction and downward along the vertical direction. To determine whether a pixel is a bottom-right corner, it needs to look leftward horizontally and upward vertically.

Fig. 6 shows how top-left corner pooling is computed.

Fig. 6 The top-left corner pooling layer can be implemented very efficiently. We scan from right to left for the horizontal max-pooling and from bottom to top for the vertical max-pooling. We then add the two max-pooled feature maps. Source [1]

If we can detect boundary features of the same object along a row and along a column, then the intersection of that row and column is a corner. This is an indirect but effective way to find corners.
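The efficient scan described in the caption above amounts to two running maxima. A minimal sketch assuming PyTorch (the function name is mine):

```python
import torch

def top_left_corner_pool(x):
    """Top-left corner pooling on a (N, C, H, W) feature map:
    for each location, take the max of everything to its right and
    the max of everything below it, then add the two maps."""
    horiz = x.flip(-1).cummax(dim=-1).values.flip(-1)   # right-to-left running max
    vert = x.flip(-2).cummax(dim=-2).values.flip(-2)    # bottom-to-top running max
    return horiz + vert
```

For example, on the 2×2 map [[0, 1], [2, 3]], the horizontal pass gives [[1, 1], [3, 3]], the vertical pass gives [[2, 3], [2, 3]], and their sum is [[3, 4], [5, 6]].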

Embedding Vector For Grouping Corners

The top-left and bottom-right corners are detected separately on two feature maps. To form a bounding box, the top-left and bottom-right corners of the same target must be paired, and this is the role of the embedding vector.

Fig. 1 We detect an object as a pair of bounding box corners grouped together. A convolutional network outputs a heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The network is trained to predict similar embeddings for corners that belong to the same object. Source [1]

An embedding is an n-dimensional vector generated from each point on the heatmap and its surrounding features. Each corner yields one such vector, which encodes information about the target that corner belongs to.

If a top-left corner point and a bottom-right corner point belong to the same target, their embedding vectors should be very similar and the distance between them small, because they encode the same target. Conversely, the distance between embeddings of corners from different targets is large.
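The grouping step can be sketched as a nearest-embedding match. This is a simplified, hypothetical illustration (the function name, tuple layout, and threshold are mine), assuming scalar embeddings for brevity: each top-left corner is paired with the bottom-right corner whose embedding is closest, provided the distance is below a threshold.

```python
def group_corners(top_left, bottom_right, threshold=0.5):
    """Pair corners whose embeddings are close.
    top_left / bottom_right: lists of (x, y, embedding) tuples."""
    boxes = []
    for tx, ty, te in top_left:
        # find the bottom-right corner with the nearest embedding
        best = min(bottom_right, key=lambda c: abs(te - c[2]), default=None)
        if best is not None and abs(te - best[2]) < threshold:
            bx, by, _ = best
            boxes.append((tx, ty, bx, by))
    return boxes
```

For example, a top-left corner with embedding 0.1 pairs with a bottom-right corner at 0.12 rather than one at 0.88.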

CornerNet Structure

CornerNet is a one-stage detection method. First, a backbone network (the hourglass network) extracts features. The backbone is followed by two prediction modules, one detecting top-left corners and the other detecting bottom-right corners.

Finally, the two sets of corners are filtered, paired, and corrected so that each pair of corners locates an object's box.

Fig. 7 The prediction module starts with a modified residual block, in which we replace the first convolution module with our corner pooling module. The modified residual block is then followed by a convolution module. We have multiple branches for predicting the heatmaps, embeddings and offsets. Source [1]

The first half of the prediction module resembles a residual block. After the backbone's feature maps enter the prediction module, they are split into three branches.

The upper two branches each undergo a 3×3 convolution, corner pooling is performed, and the results are summed into one, followed by a 3×3 convolution and batch normalization.

The bottom branch performs a 1×1 convolution and batch normalization; its output is added to the upper path and fed into a ReLU. A 3×3 convolution is then applied to the feature maps, and finally three parallel groups of 3×3 convolutions with ReLU generate three sets of feature maps.
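The structure just described can be sketched in PyTorch. This is a simplified illustration, not the paper's implementation: `pool_h` and `pool_v` are placeholders for the directional corner pooling passes, and the channel sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    """Sketch of a CornerNet-style prediction head."""
    def __init__(self, channels, num_classes, pool_h, pool_v):
        super().__init__()
        self.pool_h, self.pool_v = pool_h, pool_v
        self.branch_a = nn.Conv2d(channels, channels, 3, padding=1)   # upper branch
        self.branch_b = nn.Conv2d(channels, channels, 3, padding=1)   # upper branch
        self.after_pool = nn.Sequential(                              # 3x3 conv + BN after summing
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.shortcut = nn.Sequential(                                # bottom branch: 1x1 conv + BN
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels))
        self.post = nn.Conv2d(channels, channels, 3, padding=1)

        def head(out_ch):                                             # 3x3 conv + ReLU, then 1x1 conv
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, out_ch, 1))
        self.heatmap = head(num_classes)
        self.embedding = head(1)   # one embedding value per location
        self.offset = head(2)      # (x, y) sub-pixel offset

    def forward(self, x):
        pooled = self.pool_h(self.branch_a(x)) + self.pool_v(self.branch_b(x))
        y = torch.relu(self.after_pool(pooled) + self.shortcut(x))    # residual-style merge
        y = self.post(y)
        return self.heatmap(y), self.embedding(y), self.offset(y)
```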

The heatmaps predict which points are most likely to be corners, the embeddings characterize the similarity of corners belonging to the same object, and the offsets correct the positions of the corners.
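The role of the offsets can be shown with a tiny example. Downsampling by a stride loses sub-pixel precision, and the predicted offset restores it when mapping a heatmap corner back to image coordinates. The function name is hypothetical; `stride=4` is the downsampling factor used in CornerNet.

```python
def corner_to_image(x_hm, y_hm, offset, stride=4):
    """Map heatmap coordinates back to image coordinates,
    applying the predicted (x, y) sub-pixel offset."""
    ox, oy = offset
    return ((x_hm + ox) * stride, (y_hm + oy) * stride)
```

For example, a corner at heatmap position (10, 20) with offset (0.5, 0.25) maps to image position (42.0, 81.0).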


CornerNet enables a simplified design without anchor boxes and achieves state-of-the-art accuracy on COCO among one-stage detectors.

However, a major disadvantage of CornerNet is its inference speed. Its average precision (AP) on COCO is 42.2%, but inference takes 1.147 s per image, which is too slow for video applications that require real-time or interactive rates.


  1. Hei Law, Jia Deng: CornerNet: Detecting Objects as Paired Keypoints.
  2. Alejandro Newell, Kaiyu Yang, Jia Deng: Stacked Hourglass Networks for Human Pose Estimation.