Source: Deep Learning on Medium
Taking into consideration the uniqueness of Human
In recent years, research related to “humans” in the computer vision community has become increasingly active because of the high demand for real-life applications, one of which is instance segmentation.
The standard approach to image instance segmentation is to perform object detection first and then segment the object within the detected bounding box. More recently, deep learning methods such as Mask R-CNN perform the two steps jointly. However, as human-related tasks such as human recognition and tracking become more common, one might wonder why the uniqueness of the “human” category is not taken into account.
The uniqueness of the “human” category can be well defined by the pose skeleton. Moreover, the human pose skeleton distinguishes heavily occluded instances better than bounding boxes do.
In this post, I am going to review “Pose2Seg: Detection Free Human Instance Segmentation”, which presents a new instance segmentation framework that separates human instances based on their pose.
In this post I’ll cover three things: first, an overview of the instance segmentation task; second, an overview of “Pose2Seg”; and third, the experiments and results.
We’re sharing the code here, including the dataset and the trained model. Follow along!
1. What is Instance Segmentation?
Instance segmentation is a task where we want to identify each object at the pixel level. This means that labels are both class-aware and instance-aware. For example, Figure 2(d) shows a separate label for sheep 1, sheep 2, etc.
Instance segmentation is considered the most challenging among the common use-cases:
- Classification: There is a person in this image. Figure 2(a)
- Object Detection: There are 5 sheep in this image at these locations. Figure 2(b)
- Semantic Segmentation: There are sheep, person and dog pixels. Figure 2(c)
- Instance Segmentation: There are 5 different sheep, 1 person and 1 dog at these locations. Figure 2(d)
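To make the difference between the last two items concrete, here is a tiny sketch using NumPy label maps. The arrays are toy data I made up for illustration (two sheep standing side by side), not anything taken from the paper.

```python
import numpy as np

# Toy 4x6 "image" with two sheep touching each other (made-up data).
# Semantic segmentation: one label per class (0 = background, 1 = sheep).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: same class, but each sheep gets its own id.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
])

# Count the non-background labels in each map.
num_classes_present = len(np.unique(semantic)) - 1
num_instances = len(np.unique(instance)) - 1

# A per-instance binary mask is just a comparison against the instance id.
sheep_1_mask = instance == 1
```

The semantic map only tells us "these pixels are sheep", while the instance map lets us pull out one mask per animal.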
2. Pose2Seg: Detection Free Human Instance Segmentation
The motivation behind Pose2Seg is that, while general object instance segmentation approaches work well, the majority are built on a powerful object detection baseline: they first generate a large number of proposal regions and then remove the redundant ones using Non-Maximum Suppression (NMS), as shown in Figure 3.
However, when two objects of the same category overlap heavily, NMS treats one of them as a redundant proposal region and eliminates it. This means that almost all detection-based methods cannot handle large overlaps.
However, when dealing mostly with the “human” category, instances can be well defined by the pose skeleton. As shown in Figure 1, human pose skeletons are better suited to distinguishing two heavily intertwined people, because they provide more distinctive information about a person than bounding boxes do, such as the location and visibility of different body parts.
The main idea of bottom-up pose estimation methods is to first detect the keypoints of each body part for all the people in the image, and then group or connect those parts to form individual human pose instances. This makes it possible to separate two intertwined human instances with a large overlap.
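The overlap problem is easy to reproduce with a minimal greedy NMS sketch. The boxes, scores and the 0.5 IoU threshold below are values I picked for illustration; real detectors differ in the details.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop everything overlapping it too much."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Made-up example: persons A and B are hugging, person C stands far away.
boxes = np.array([
    [10, 10, 110, 210],    # person A
    [30, 10, 130, 210],    # person B, IoU with A is about 0.67
    [300, 20, 380, 200],   # person C
], dtype=float)
scores = np.array([0.95, 0.90, 0.80])

kept = nms(boxes, scores)  # person B (index 1) is suppressed
```

Person B is a real, separate person, yet the box is discarded simply because it overlaps person A's box too much.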
2.2 Network Structure
The overall network structure is shown in Figure 4 below. The input to the network is the RGB image together with the human poses of all human instances present in it. First, a backbone network extracts features from the image. Then a module called Affine-Align aligns the RoIs to a uniform size (for consistency) based on the human pose. In addition, skeleton features are generated for each human instance.
Next, the RoIs and skeleton features are fused and passed to the segmentation module, called SegModule, to yield an instance segmentation per RoI. Finally, the matrices estimated in the Affine-Align operation are used to reverse the alignment for each instance and obtain the final segmentation results.
The network's sub-modules are described in detail in the subsections below.
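The "reverse the alignment" step can be sketched with plain matrix algebra. This is a minimal illustration, assuming the alignment is stored as a 2 × 3 affine matrix; the matrix values and helper names are hypothetical, and a real implementation would warp the whole predicted mask rather than a couple of points.

```python
import numpy as np

def apply_affine(H, points):
    """Apply a 2x3 affine matrix H to an (N, 2) array of xy points."""
    return points @ H[:, :2].T + H[:, 2]

def invert_affine(H):
    """Return the 2x3 affine matrix that undoes H."""
    H3 = np.vstack([H, [0.0, 0.0, 1.0]])  # promote to 3x3 homogeneous form
    return np.linalg.inv(H3)[:2, :]

# Hypothetical alignment: scale by 0.5, then shift by (-20, -10).
H = np.array([[0.5, 0.0, -20.0],
              [0.0, 0.5, -10.0]])

image_points = np.array([[40.0, 20.0], [168.0, 148.0]])
aligned = apply_affine(H, image_points)              # coordinates inside the RoI
restored = apply_affine(invert_affine(H), aligned)   # back in image coordinates
```

Here the two image points land on the corners (0, 0) and (64, 64) of the aligned RoI, and the inverse matrix maps them back to where they started.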
2.3 Affine-Align Operation
The Affine-Align operation is mainly inspired by RoI-Pooling, presented in Faster R-CNN, and RoI-Align, presented in Mask R-CNN. However, while those align humans according to their bounding boxes, Affine-Align aligns them based on human pose.
To do that, the most frequent human poses are stored offline as templates and later compared with every input pose at training and inference time (see Figure 5 below). The idea is to choose the best template for each estimated pose. This is accomplished by estimating an affine transformation matrix H between the input pose and each template and choosing the one that yields the best score.
Here P_u represents a pose template and P represents the estimated pose of a single person. The matrix H* is the affine transform estimated for the best-suited pose template. Finally, the transformation H* with the best score is applied to the image or features to transform them to the desired resolution. There are many more details on the Affine-Align operation; please refer to the paper for them.
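As a rough sketch of the template selection, the snippet below fits an affine matrix to each template by least squares and keeps the one with the best score. Several things here are my assumptions rather than the paper's exact method: the unconstrained least-squares fit, the score exp(-residual norm), the helper names, and the toy 4-keypoint templates. The paper may constrain the transform and handle missing keypoints differently.

```python
import numpy as np

def fit_affine(pose, template):
    """Least-squares 2x3 affine H mapping pose (N, 2) onto template (N, 2)."""
    ones = np.ones((pose.shape[0], 1))
    A = np.hstack([pose, ones])                        # (N, 3)
    X, *_ = np.linalg.lstsq(A, template, rcond=None)   # (3, 2)
    return X.T                                         # (2, 3)

def alignment_score(H, pose, template):
    """ASSUMED score: 1.0 for a perfect fit, decaying with the residual."""
    residual = pose @ H[:, :2].T + H[:, 2] - template
    return float(np.exp(-np.linalg.norm(residual)))

def best_template(pose, templates):
    """Return (index, H, score) of the template that fits the pose best."""
    best = None
    for idx, template in enumerate(templates):
        H = fit_affine(pose, template)
        score = alignment_score(H, pose, template)
        if best is None or score > best[2]:
            best = (idx, H, score)
    return best

# Toy templates with 4 keypoints each (made-up, not real pose templates).
templates = [
    np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]]),
    np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [20.0, 20.0]]),
]

# Input pose: template 1 scaled by 2 and shifted by (5, 7).
pose = 2.0 * templates[1] + np.array([5.0, 7.0])

idx, H_star, score = best_template(pose, templates)
```

In this toy case the second template wins, and the recovered H* simply undoes the scale and shift that produced the input pose.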
2.4 Skeleton Features
Figure 6 shows the skeleton features. For this task, part affinity fields (PAFs) are adopted. The output of the PAFs is a 2-channel vector field map for each skeleton. PAFs are used to represent the skeleton structure of a human pose, along with part confidence maps for the body parts, which emphasize the importance of the regions around the body part keypoints.
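To give a feel for what these features look like, here is a simplified sketch that rasterizes one limb into a 2-channel vector field and one keypoint into a Gaussian confidence map. The limb width, the Gaussian sigma, the function names and the example keypoints are all my assumptions; the paper's exact construction may differ.

```python
import numpy as np

def limb_paf(p_start, p_end, height, width, limb_width=2.0):
    """2-channel field: the limb's unit direction on pixels near the limb, 0 elsewhere."""
    paf = np.zeros((2, height, width))
    start = np.asarray(p_start, dtype=float)
    end = np.asarray(p_end, dtype=float)
    limb = end - start
    length = np.linalg.norm(limb)
    if length == 0:
        return paf
    unit = limb / length
    ys, xs = np.mgrid[0:height, 0:width]
    dx, dy = xs - start[0], ys - start[1]
    along = dx * unit[0] + dy * unit[1]            # position along the limb
    across = np.abs(dx * unit[1] - dy * unit[0])   # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf[0][on_limb] = unit[0]   # x component of the direction
    paf[1][on_limb] = unit[1]   # y component of the direction
    return paf

def keypoint_confidence(point, height, width, sigma=2.0):
    """Gaussian bump centred on an (x, y) keypoint."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - point[0]) ** 2 + (ys - point[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

# Made-up forearm in a 64x64 aligned RoI: elbow at (10, 20), wrist at (40, 20).
elbow, wrist = (10, 20), (40, 20)
paf = limb_paf(elbow, wrist, 64, 64)
conf = keypoint_confidence(elbow, 64, 64)
```

Pixels on the forearm carry the direction (1, 0) pointing from elbow to wrist, and the confidence map peaks exactly at the elbow.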
2.5 SegModule
The SegModule is a simple encoder-decoder architecture. One main consideration is its receptive field. Because the skeleton features are introduced after alignment, the SegModule needs a large enough receptive field not only to fully understand these artificial features, but also to learn the connections between them and the image features extracted by the base network. Therefore, it is designed based on the resolution of the aligned RoIs.
The network starts with a 7 × 7, stride-2 convolution layer, followed by several standard residual units to achieve a large enough receptive field for the RoIs. After that, a bilinear up-sampling layer restores the resolution, and another residual unit, along with a 1 × 1 convolution layer, predicts the final result. Such a structure with 10 residual units achieves a receptive field of about 50 pixels, corresponding to the alignment size of 64 × 64. Fewer units make the network less capable of learning, and more units bring little improvement in learning ability.
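The "about 50 pixels" figure can be sanity-checked with the usual receptive-field recurrence. This is back-of-the-envelope arithmetic under an assumption of mine: that each residual unit is bottleneck-style (1 × 1, 3 × 3, 1 × 1 convolutions, all stride 1), so it contributes a single 3 × 3 kernel. The up-sampling layer is ignored, and the real layer configuration may differ.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # strides make later layers count for more
    return rf

stem = [(7, 2)]  # the 7x7, stride-2 convolution from the text
# ASSUMPTION: bottleneck-style residual unit with one 3x3 convolution.
residual_unit = [(1, 1), (3, 1), (1, 1)]

layers = stem + residual_unit * 10
rf = receptive_field(layers)   # 7 + 10 * (3 - 1) * 2 = 47
```

Under that assumption the receptive field comes out at 47 pixels, which is in line with the "about 50" quoted above and covers most of a 64 × 64 aligned RoI.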
3. Experiments & Results
Pose2Seg was evaluated on two datasets: (1) OCHuman, proposed in this paper, which is the largest validation dataset focused on heavily occluded humans; and (2) COCOPersons (the person category of COCO), which contains the most common scenarios in daily life.
The algorithm was mainly compared to Mask R-CNN, the commonly used detection-based instance segmentation framework.
In the test on occluded data, using the OCHuman dataset, the Pose2Seg framework achieved performance nearly 50% higher than that of Mask R-CNN, as shown in Table 1.
In the test on general cases, evaluated on the COCOPersons validation set, Pose2Seg got 0.582 AP (average precision) on the instance segmentation task, while Mask R-CNN got 0.532. See Table 2.
For a better understanding of the advantages of Pose2Seg over bounding-box based frameworks, see Figure 7 below. Notice how body parts that fall outside the bounding box are not segmented by Mask R-CNN.