PoseFlow — real-time pose tracking


Detectron2 is a robust framework for object detection and segmentation (see the model zoo). It also allows us to detect person keypoints (eyes, ears, and the main body joints) and perform human pose estimation.

Person keypoint estimation is done on individual images. To fully understand human behaviour and analyse the whole scene, we also need to track each person from frame to frame. Person tracking opens the door to action recognition, person re-identification, understanding human-object interaction, sports video analysis and much more.

Project setup

We will build on the source code from my previous story:

I encourage you to read it first! If you have already followed it, just run the commands below in the project directory:

$ git pull
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env update -f environment.yml

or, if you prefer to start from the beginning, run:

$ git clone git://github.com/jagin/detectron2-pipeline.git
$ cd detectron2-pipeline
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env create -f environment.yml
$ conda activate detectron2-pipeline

Pose estimation

To check what pose estimation is all about, run the command:

$ python process_video.py -i assets/videos/walk.small.mp4 -p -d --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml

We will get the following results on the screen:

Pose estimation on a video sequence (Video by sferrario1968 from Pixabay)

As you can see, Detectron2 gives us the bounding box of each person together with their keypoint estimates, thanks to the COCO Person Keypoint Detection model with Keypoint R-CNN available in the model zoo.
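
If you would like to poke at the raw predictions yourself, here is a minimal sketch that loads the same keypoint model straight from the Detectron2 model zoo and runs it on a single frame. It bypasses the pipeline code from the repository, and the score threshold and image path are just example values:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # example detection threshold

predictor = DefaultPredictor(cfg)

frame = cv2.imread("frame.jpg")  # any BGR image, e.g. a frame grabbed from the video
outputs = predictor(frame)
instances = outputs["instances"].to("cpu")

print(instances.pred_boxes)      # one bounding box per detected person
print(instances.pred_keypoints)  # tensor of shape (num_persons, 17, 3): x, y, score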

This model is based on Mask R-CNN, which is flexible enough to extend to human pose estimation. The keypoint’s location is modelled as a one-hot mask, and Mask R-CNN is adopted to predict K masks, one for each of the K keypoint types (e.g., left shoulder, right elbow).
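
For reference, K = 17 for the COCO person keypoints used here. Assuming the standard COCO keypoint metadata that Detectron2 registers for its built-in datasets, you can list the keypoint types and their order like this:

from detectron2.data import MetadataCatalog

# Detectron2 registers the COCO keypoint datasets together with their keypoint names,
# so the rows of pred_keypoints can be mapped back to body parts.
metadata = MetadataCatalog.get("keypoints_coco_2017_val")
print(len(metadata.keypoint_names))  # 17
print(metadata.keypoint_names)
# ('nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
#  'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
#  'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
#  'left_knee', 'right_knee', 'left_ankle', 'right_ankle')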

It’s a top-down method: we first detect human proposals and then estimate keypoints within each box independently.
There is also a bottom-up approach, which directly infers the keypoints and the connections between keypoints of all persons in the image without a human detector.

Other very popular alternative estimators are:

There is a good article, Human Pose Estimation with Deep Learning, summarizing the different approaches to human pose estimation.

Pose tracking

If you are perceptive enough, you will notice in the results above that the human poses are already tracked, at least in the sense that each person keeps the colour of their bounding box. This is a very naive heuristic that assigns the same colour to the same instance for visualisation purposes, based on the intersection over union (IoU) of boxes or masks (see video_visualizer.py).
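
To make the idea concrete, here is a simplified sketch of that kind of IoU-based matching between consecutive frames. It is an illustration of the heuristic only, not the actual code from video_visualizer.py, and the 0.5 threshold is just an example value:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_instances(prev_boxes, curr_boxes, iou_threshold=0.5):
    # Greedily match each current box to the previous box with the highest IoU.
    # Entry i of the result is the index of the matched previous box for
    # current box i, or None if nothing overlaps enough (a "new" instance).
    matches = [None] * len(curr_boxes)
    used = set()
    for i, curr in enumerate(curr_boxes):
        best_j, best_iou = None, iou_threshold
        for j, prev in enumerate(prev_boxes):
            if j in used:
                continue
            overlap = iou(curr, prev)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches

Matched instances would keep the colour (or identity) of their predecessor, while unmatched ones get a fresh one.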

Using bounding box IoU to track pose instances will most likely fail when an instance moves fast, so that consecutive boxes no longer overlap, and in crowded scenes, where boxes may not correspond cleanly to pose instances.

Multi-person articulated pose tracking in unconstrained videos is a very challenging problem, and many solutions have been proposed for it.

The solution I would like to present is PoseFlow described in the paper PoseFlow: Efficient Online Pose Tracking by Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, Cewu Lu. The source code for the solution is available on GitHub, and it is also included in the AlphaPose repository.