Self-supervised Keypoint Learning

Original article source: Deep Learning on Medium

Keypoint or interest point detection is the building block for many computer vision tasks, such as SLAM (simultaneous localization and mapping), SfM (structure from motion) and camera calibration. Interest point detection has a long history predating deep learning, and many widely deployed algorithms (such as FAST, SIFT and ORB) are based on hand-crafted features. As in many other computer vision tasks, people have been exploring how to use deep learning to outperform these hand-crafted algorithms.

The FAST corner detection algorithm. Every time I look at it, I wonder how people did not come across such an elegant and practical algorithm until 2006!
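To make the elegance concrete, here is a toy reimplementation of the FAST segment test in numpy. It is a minimal sketch for illustration only: the real FAST uses a fast rejection pre-test and a machine-learned decision tree for speed, and the threshold `t` and arc length `n` below are just typical FAST-9 defaults.

```python
import numpy as np

# Bresenham circle of radius 3: the 16 (dy, dx) offsets used by FAST
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def is_fast_corner(img, y, x, t=20, n=9):
    """Segment test: (y, x) is a corner if n contiguous circle pixels are
    all brighter than p + t or all darker than p - t (FAST-9 variant)."""
    p = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dy, dx in CIRCLE]
    for flags in ([v > p + t for v in ring], [v < p - t for v in ring]):
        run, best = 0, 0
        for f in flags + flags:  # doubled list handles wrap-around
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```

On a synthetic image with a bright square, the square's corner passes the test while interior points and straight edges do not, which is exactly the behavior you want from a corner detector.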

Before we move on, let us clear up some concepts. There are several types of keypoints in common use in computer vision. Keypoint can refer to facial keypoints/landmarks for facial recognition, or to keypoints for human/vehicle pose estimation. In these tasks, each keypoint bears a semantic meaning, such as left eye corner, right shoulder, or front left tire hub. We will refer to these as semantic keypoints hereafter.

Various kinds of keypoints in computer vision. This post addresses the last kind (d).

The keypoints we will focus on in this post are more low-level, such as a corner point or the ending point of a segment, and do not have a definite semantic meaning. This makes them fundamentally different from semantic keypoints, and we will refer to them as interest points.

Deep learning methods have dominated state-of-the-art semantic keypoint detection. Mask RCNN (ICCV 2017) and PifPaf (CVPR 2019) are two representative methods (one top-down and one bottom-up) for detecting semantic keypoints. Both are supervised and require a large amount of human labels, which for one thing can be expensive. Moreover, supervising interest points is unnatural: since interest points are semantically ill-defined, a human annotator cannot reliably and repeatably identify the same set of interest points. It is therefore impossible to formulate the task of interest point detection as a supervised learning problem. Enter self-supervised learning.

Self-supervised learning (or unsupervised learning, if you focus on the fact that it does not require explicit human annotation) is a re-emerging topic as of early 2020. Recent examples include MoCo by FAIR and SimCLR by Geoffrey Hinton’s team. After all, self-supervised learning is the génoise, the real cake, per Yann LeCun’s famous quote at NeurIPS 2016. This post will be on the special topic of keypoint learning. For more general trends in self-supervised learning, I would recommend Lilian Weng’s blog.

LeCun’s cake analogy for self-supervised learning (source)

Interest Point and Descriptor

Here is some quick background on interest point detection before we dive into the deep learning papers, as many of them still build upon the classical framework and use the same terminology.

The task of finding interest points consists of detection and description. Detection is the localization of interest points (or feature points, or keypoints, depending on the literature) in an image, and description assigns each detected point a vector (i.e., a descriptor). The overall goal is to find characteristic and stable visual features effectively and efficiently. Below, we will see how interest point learning tackles both detection and description.
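The detect-then-describe pipeline can be sketched in a few lines of numpy. This is a deliberately crude stand-in (gradient energy as the detection score, a normalized raw patch as the descriptor) just to show the two stages and their interface; real systems use Harris/FAST-style detectors and SIFT/ORB-style or learned descriptors.

```python
import numpy as np

def detect_and_describe(img, num_points=10, patch=5):
    """Toy detect-then-describe: score pixels by local gradient energy
    (a crude corner/edge measure), keep the top responses, and describe
    each with its L2-normalized surrounding patch."""
    gy, gx = np.gradient(img.astype(float))
    score = gx**2 + gy**2
    r = patch // 2
    # zero out the border so every keypoint has a full patch around it
    score[:r, :] = 0; score[-r:, :] = 0
    score[:, :r] = 0; score[:, -r:] = 0
    ys, xs = np.unravel_index(
        np.argsort(score.ravel())[::-1][:num_points], score.shape)
    descs = []
    for y, x in zip(ys, xs):
        d = img[y - r:y + r + 1, x - r:x + r + 1].astype(float).ravel()
        descs.append(d / (np.linalg.norm(d) + 1e-8))  # unit-norm descriptor
    return list(zip(ys, xs)), np.array(descs)
```

The output is the same shape of data every interest point method produces: a list of (y, x) locations plus an (N, D) descriptor matrix, which downstream tasks match by nearest neighbors.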

The blog post below is based on the notes I took when reading each paper for the first time. Stars/forks/comments are welcome!


SuperPoint: Self-Supervised Interest Point Detection and Description (CVPR 2018) is the seminal work on using self-supervised learning for interest point detection and description. In summary, it first pretrains an interest point detector on synthetic data and then learns the descriptors by generating image pairs related by a known homography.

This paper still follows the approach of many classical algorithms: detect first, then describe. How do we learn a robust detector in the first place? We can render 2D projections of 3D objects with known interest points, such as the corners of a cuboid or the end points of a line segment. The authors call this detector MagicPoint (a fitting name from the authors at Magic Leap).

Now you may say that this contradicts the fact that interest points are semantically ill-defined, but in practice it seems to work quite well. Of course, this leaves one area to improve for future work (such as UnsuperPoint, discussed below).

MagicPoint: Pretraining on synthetic data

From synthetic to real images: in order to bridge the sim2real gap, test time augmentation (TTA) is used to accumulate interest point detections. This intensive TTA (~100 augmentations) is called “homographic adaptation”. This step implicitly requires that MagicPoint yield high-precision detections with a low false positive rate; the aggregation step then increases recall and creates more interest points. Similar techniques are also used in the more recent work UR2KiD (which aggregates keypoints from different concept groups in a technique called Group-Concept Detect-and-Describe).

Homographic Adaptation: a TTA scheme to bridge the sim2real transfer gap
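The homographic adaptation loop can be sketched as follows. This is a simplified illustration under stated assumptions: the `detector` is any function returning a per-pixel heatmap, the homographies are restricted to mild affine perturbations for numerical simplicity (the paper samples more general homographies), and nearest-neighbor warping stands in for proper interpolation.

```python
import numpy as np

def random_homography(scale=0.1):
    """Identity plus a small random perturbation of the affine part.
    (A simplification: SuperPoint samples more general homographies.)"""
    H = np.eye(3)
    H[:2, :] += np.random.uniform(-scale, scale, (2, 3))
    return H

def warp_image(img, H):
    """Inverse warping: each output pixel looks up H^{-1} @ (x, y, 1),
    with nearest-neighbor sampling and border clamping."""
    h, w = img.shape
    Hinv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    sx, sy, sw = Hinv @ pts
    sx = np.clip(np.round(sx / sw), 0, w - 1).astype(int)
    sy = np.clip(np.round(sy / sw), 0, h - 1).astype(int)
    return img[sy, sx].reshape(h, w)

def homographic_adaptation(img, detector, n_aug=100):
    """TTA aggregation: detect on each warped copy of the image,
    un-warp the resulting heatmap, and average everything."""
    acc = detector(img).astype(float)
    for _ in range(n_aug):
        H = random_homography()
        heat = detector(warp_image(img, H))
        acc += warp_image(heat, np.linalg.inv(H))  # map back to original frame
    return acc / (n_aug + 1)
```

Averaging ~100 un-warped heatmaps is what boosts recall: a point the base detector fires on in only some warped views still accumulates a strong response in the aggregate.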

Now, from the standpoint of generative modeling, if we know the keypoints of one image, we can apply a homographic transformation to the image together with its keypoints. This generates tons of training data for learning descriptors. The authors used a contrastive loss (CVPR 2006, Yann LeCun’s group) to learn the descriptor, which basically includes a pulling term for paired points and a pushing term for unpaired points. Note that there are many terms in this loss, O(N²) where N is the number of points per image. This is yet another example of transferring knowledge of a well-defined math problem to neural nets.
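A minimal numpy version of this pull/push loss looks like the following. It is a sketch of the general hinge-style contrastive form, not the paper's exact loss (SuperPoint additionally weights the two terms and computes correspondences cell-wise); the margins `m_pos` and `m_neg` are illustrative values.

```python
import numpy as np

def contrastive_descriptor_loss(da, db, matches, m_pos=1.0, m_neg=0.2):
    """Hinge-style contrastive loss over all N^2 descriptor pairs.
    da, db: (N, D) L2-normalized descriptors from the two images.
    matches[i, j] = 1 if point i in image A corresponds to point j in B."""
    sim = da @ db.T  # (N, N) cosine similarities: every pair, hence O(N^2)
    pull = matches * np.maximum(0.0, m_pos - sim)        # pull matched pairs together
    push = (1 - matches) * np.maximum(0.0, sim - m_neg)  # push non-matches apart
    return (pull + push).mean()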

SuperPoint use the same encoder for both detector and descriptor for fast inference

One technique I found particularly interesting is that the detector uses classification together with a channel2image trick to achieve high precision detection. In summary, it warps each 8×8 pixels in the input image represented by each pixel in feature map into 64 channels, followed by one dustbin channel. If there is no interest point in the 8×8 region, dustbin has high activation. Otherwise, the 64 other channels pass through softmax to find the interest point in the 8×8 region.

The above steps largely summarizes the main ideas of SuperPoint.

  • MagicPoint: pretraining on synthetic data for keypoint detector
  • Homographic Adaptation: TTA on real images
  • SuperPoint: MagicPoint with descriptors trained with image pairs undergoing known homographic transformation. The descriptor is used for image matching tasks.