Human Pose Estimation : Simplified

Source: Deep Learning on Medium

Take a peek into the world of Human Pose Estimation

What is Human Pose Estimation anyway?

Human pose estimation is an important problem in the field of Computer Vision. Imagine being able to track a person’s every small movement and do a bio-mechanical analysis in real time. Such technology would have huge implications. Applications include video surveillance, assisted living, advanced driver assistance systems (ADAS) and sports analysis.

Formally speaking, Pose Estimation is predicting the body part or joint positions of a person from an image or a video.

Image courtesy Microsoft COCO Dataset (Lin et al., 2014)

Why this blog?

I have been working on Human Pose Estimation for over 8 months now. The research in this field is vast, both in terms of width and depth. However, most of the literature (research papers and blogs) in Pose Estimation is fairly advanced, making it difficult for someone new to get acclimated.

The scope of future research in Pose Estimation is immense, and creating a gentler learning slope can get more people interested. The aim of this blog is to provide a rudimentary understanding of Pose Estimation and possibly spark an interest in the field. Anyone with absolutely no previous experience of Computer Vision can follow the blog superficially, and even a basic understanding of Computer Vision concepts is enough to understand it fully.

Problem Definition

As I said earlier, Human Pose Estimation is a field with a vast amount of research, both in terms of depth and width. The problem statement can be classified along the following axes:

Number of People Being Tracked

Depending on the number of people being tracked, pose estimation can be classified into Single-person and Multi-person pose estimation. Single-person pose estimation (SPPE) is the easier of the two, with the guarantee of only one person present in the frame. On the other hand, Multi-person pose estimation (MPPE) needs to handle the additional problem of inter-person occlusion. Initial approaches in pose estimation were mostly focused on SPPE; however, with the availability of huge multi-person datasets, the MPPE problem has lately been getting increased attention.

Single-person vs Multi-person Pose Estimation.

Input Modality

Modality refers to the different types of inputs available. Based on ease of availability, the three most common forms of input are:

  • Red-Green-Blue (RGB) image : The images that we see around us on a daily basis, and the most common type of input for Pose Estimation. Models working on RGB-only input have a huge advantage over others in terms of the mobility of the input source, since common cameras (which capture RGB images) are widely available, allowing these models to be used across a huge number of devices.
  • Depth (Time of Flight) image : In a Depth image, the value of a pixel corresponds to the distance of the scene point from the camera, as measured by time-of-flight. The introduction and popularity of low-cost devices like Microsoft Kinect has made it easier to obtain Depth data. Depth images can complement RGB images to create more complex and accurate Computer Vision models, whereas Depth-only models are widely used where privacy is a concern.
  • Infra-red (IR) image : In an IR image, the value of a pixel is determined by the amount of infrared light reflected back to the camera. Experimentation in Computer Vision based on IR images is minimal compared to RGB and Depth images. Microsoft Kinect also provides an IR image while recording; however, there are currently no datasets that contain IR images.
RGB image vs Depth image

Static Image vs Video

A video is nothing but a collection of images, where every two consecutive frames share a huge portion of the information present in them (which is the basis of most video compression techniques). This temporal (time-based) dependence in videos can be exploited while performing Pose Estimation.

For a video, a series of poses needs to be produced for the input sequence. The estimated poses should ideally be consistent across successive frames, and the algorithm needs to be computationally efficient to handle a large number of frames. The problem of occlusion might be easier to solve for a video due to the availability of past or future frames in which the occluded body part is visible.

If temporal features are not a part of the pipeline, it is possible to apply static pose estimation to each frame of a video. However, the results are generally not as good as desired, due to jitter and inconsistency across frames.
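
To see why temporal information matters, here is a minimal sketch (not from any cited paper) of smoothing per-frame predictions with a moving-average filter. Learned temporal models are far more sophisticated, but even a fixed filter damps frame-to-frame jitter:

```python
import numpy as np

def smooth_poses(poses, window=5):
    """Moving-average filter over per-frame keypoints.
    poses: array of shape (n_frames, n_joints, 2). A crude stand-in for
    learned temporal models, but enough to damp frame-to-frame jitter."""
    kernel = np.ones(window) / window
    # Pad in time with edge values so the output has the same length.
    padded = np.pad(poses, ((window // 2, window // 2), (0, 0), (0, 0)),
                    mode="edge")
    smoothed = np.empty_like(poses, dtype=float)
    for j in range(poses.shape[1]):
        for c in range(poses.shape[2]):
            smoothed[:, j, c] = np.convolve(padded[:, j, c], kernel,
                                            mode="valid")
    return smoothed

# A static joint corrupted with jitter: smoothing pulls it back to the truth.
rng = np.random.default_rng(0)
noisy = 100.0 + rng.normal(0, 3, size=(50, 1, 2))
smoothed = smooth_poses(noisy)
print(np.abs(smoothed - 100).mean() < np.abs(noisy - 100).mean())  # True
```
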

Notice the jitter in Single-frame model and the smoothness in Temporal model. Image courtesy Pavllo et al. (2018)

2D vs 3D Pose Estimation

Depending on the output dimension requirement, the Pose Estimation problem can be classified into 2D Pose Estimation and 3D Pose Estimation. 2D Pose Estimation is predicting the location of body joints in the image (in terms of pixel values). On the other hand, 3D Pose Estimation is predicting a three-dimensional spatial arrangement of all the body joints as its final output.

2D Pose Estimation vs 3D Pose Estimation

Most 3D Pose Estimation models first predict 2D Pose, and then try to lift it to 3D Pose. However, some end-to-end 3D Pose Estimation techniques also exist which directly predict 3D Pose.
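
A quick way to see why lifting 2D poses to 3D is hard: the pinhole projection that produced the 2D pose discards depth, so different 3D joint positions can project to the same pixel. The focal length and coordinates below are illustrative values:

```python
import numpy as np

def project(pose_3d, focal=1000.0, cx=320.0, cy=240.0):
    """Pinhole projection of camera-space 3D joints to 2D pixels.
    pose_3d: (n_joints, 3) coordinates with Z > 0. Returns (n_joints, 2)."""
    x, y, z = pose_3d[:, 0], pose_3d[:, 1], pose_3d[:, 2]
    return np.stack([focal * x / z + cx, focal * y / z + cy], axis=1)

# Two joints at different depths project to the same pixel:
pose = np.array([[0.1, 0.2, 2.0],
                 [0.2, 0.4, 4.0]])  # second joint: same direction, 2x depth
pixels = project(pose)
print(np.allclose(pixels[0], pixels[1]))  # True: depth is lost in projection
```
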

Body Model

Every pose estimation algorithm agrees upon a body model beforehand. This allows the algorithm to formalize the problem of human pose estimation as estimating the body model parameters. Most algorithms use a simple N-joint rigid kinematic skeleton model (N is typically between 13 and 30) as the final output. Formally, kinematic models can be represented as a graph, where each vertex represents a joint and the edges encode constraints or prior beliefs about the structure of the body model.
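
For concreteness, here is a minimal sketch of such a kinematic model in Python; the joint names and edges are illustrative, not taken from any particular dataset:

```python
# A minimal kinematic body model: vertices are joints, edges encode which
# joints are connected. A 15-joint tree-structured skeleton is shown here.
JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist",
          "l_shoulder", "l_elbow", "l_wrist", "pelvis",
          "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle"]

EDGES = [("head", "neck"), ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"),
         ("r_elbow", "r_wrist"), ("neck", "l_shoulder"),
         ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("neck", "pelvis"), ("pelvis", "r_hip"), ("r_hip", "r_knee"),
         ("r_knee", "r_ankle"), ("pelvis", "l_hip"), ("l_hip", "l_knee"),
         ("l_knee", "l_ankle")]

# A pose is then just a mapping from each joint to its predicted position.
pose_2d = {j: (0.0, 0.0) for j in JOINTS}  # placeholder (x, y) pixel values

# Sanity checks: every edge connects two known joints, and a tree-structured
# skeleton over N joints has N - 1 edges.
assert all(a in JOINTS and b in JOINTS for a, b in EDGES)
assert len(EDGES) == len(JOINTS) - 1
```
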

Such a model suffices for most applications. However, for applications such as character animation, a more elaborate model may be needed. Some techniques have considered highly detailed mesh models, representing the whole body with a point cloud.

Another, rather primitive body model that was used in earlier Pose Estimation pipelines is the shape-based body model. In shape-based models, human body parts are approximated using geometric shapes like rectangles, cylinders and conics.

Kinematic Model vs Shape-based Model vs Mesh-based Model

Number of cameras

A major portion of research involves solving the pose estimation problem using input from a single camera. However, certain algorithms use data from multiple viewpoints/cameras, combining them to generate more accurate poses and handle occlusions better. Research on multi-camera pose estimation is currently somewhat limited, primarily due to a lack of good datasets.
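
When calibrated cameras are available, a joint detected in two views can be triangulated into 3D. Below is a minimal linear (DLT) triangulation sketch with toy projection matrices; real pipelines refine this estimate and must cope with detection noise:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one joint from two calibrated views.
    P1, P2: 3x4 projection matrices; uv1, uv2: pixel coordinates."""
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                # null vector of A
    return X[:3] / X[3]       # homogeneous -> Euclidean

# Two toy cameras observing a joint at (0.5, 0.2, 3.0) metres:
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # shifted 1 m in x
X_true = np.array([0.5, 0.2, 3.0])
h1 = P1 @ np.append(X_true, 1)
h2 = P2 @ np.append(X_true, 1)
uv1, uv2 = h1[:2] / h1[2], h2[:2] / h2[2]
print(np.allclose(triangulate(P1, P2, uv1, uv2), X_true))  # True
```
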

Pose Estimation Pipeline

Preprocessing

  • Background removal : May be required to segment the human from the background, or to remove noise.
  • Bounding box creation : Some algorithms, especially in MPPE, create bounding boxes for every human present in the image. Each bounding box is then evaluated separately for human pose.
Bounding Box creation. Image courtesy Fang et al. (2017)
  • Camera calibration and image registration : Image registration is required when inputs from multiple cameras are used. For 3D Human Pose Estimation, camera calibration also helps convert the reported ground truth into standard world coordinates.

Feature Extraction

Feature extraction in Machine Learning refers to the creation of derived values from raw data (such as an image or video, in our case) that can be used as input to a learning algorithm. Features can be either explicit or implicit. Explicit features include conventional Computer Vision features like Histogram of Oriented Gradients (HoG) and Scale Invariant Feature Transform (SIFT). These features are calculated explicitly before the input is fed to the learning algorithm that follows.

Left : Image along with corresponding color gradients, Right : Image with SIFT features
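
To make the explicit-feature idea concrete, here is a toy, HoG-style orientation histogram for a single cell, written from scratch (real HoG implementations add block normalization, overlapping cells and other details):

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """Gradient-orientation histogram for one cell of a grayscale patch:
    each pixel votes its gradient magnitude into an orientation bin."""
    gx = np.zeros_like(patch, dtype=float)
    gy = np.zeros_like(patch, dtype=float)
    # Central finite-difference gradients (borders left at zero).
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as standard HoG uses.
    orientation = (np.degrees(np.arctan2(gy, gx)) + 180.0) % 180.0
    bins = np.minimum((orientation / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m
    return hist

# A horizontal intensity ramp has purely horizontal gradients (0 degrees),
# so all the energy lands in the first orientation bin.
patch = np.tile(np.arange(8, dtype=float), (8, 1))
hist = hog_cell_histogram(patch)
print(hist.argmax())  # 0
```
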

Implicit features refer to deep-learning-based feature maps, such as the outputs of complex Deep Convolutional Neural Networks (CNNs). These feature maps are never created explicitly, but are part of a complete pipeline trained end-to-end.

VGG16 : A CNN based feature extraction and image classification architecture

Inference

Confidence Maps : A common way of predicting joint locations is to produce a confidence map for every joint. A confidence map is a probability distribution over the image, representing the confidence that the joint is located at each pixel.

Confidence map examples
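
A confidence map can be sketched as a 2D Gaussian centred on the joint, with the joint location recovered as the map's argmax; the image size, joint position and sigma below are arbitrary:

```python
import numpy as np

def confidence_map(h, w, joint_xy, sigma=2.0):
    """2D Gaussian centred on the joint: each pixel holds the model's
    confidence that the joint lies there."""
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = joint_xy
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode(cmap):
    """Recover the joint location as the argmax of the confidence map."""
    row, col = np.unravel_index(cmap.argmax(), cmap.shape)
    return (col, row)  # back to (x, y) order

cmap = confidence_map(64, 64, joint_xy=(40, 12))
print(decode(cmap))  # (40, 12)
```
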
  • Bottom Up Approach : Bottom up approaches first detect the parts or joints of one or more humans in the image, and then assemble the parts and associate them with a particular human.
    In simpler terms, the algorithm first predicts all body parts/joints present in the image. This is typically followed by the formulation of a graph, based on the body model, which connects joints belonging to the same human. Integer linear programming (ILP) and bipartite matching are two common methods of creating this graph.
Cao et al. complete pipeline. An example of a bottom up approach. Image courtesy Cao et al. (2017)
  • Top Down Approach : Top down approaches begin with a segmentation step, in which each human is first enclosed in a bounding box, followed by pose estimation performed individually on each bounding box.
    Top down pose estimation can be classified into generative body-model based and deep learning based approaches. Generative body-model based approaches try to fit a body model to the image, which constrains the final prediction to be human-like. Deep learning based approaches directly predict joint locations, so the final predictions have no guarantee of being human-like.
Fang et al. complete pipeline. An example of a top down approach. Image courtesy Fang et al. (2017)
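
The part-association step of a bottom-up pipeline can be illustrated with a toy bipartite matching problem: detected elbows must be paired with detected wrists so that total affinity is maximized. The affinity scores below are made up, and brute-force search over permutations stands in for the Hungarian algorithm or ILP used in practice:

```python
from itertools import permutations

# affinity[i][j]: score of linking detected elbow i to detected wrist j
# (e.g. a Part Affinity Field integral in Cao et al.'s method; these
# numbers are illustrative).
affinity = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
]

def best_assignment(scores):
    """Exhaustively find the one-to-one assignment maximizing total score.
    Fine for tiny examples; real systems use Hungarian matching or ILP."""
    n = len(scores)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best_perm, best

perm, score = best_assignment(affinity)
print(perm)  # (0, 1, 2): each elbow pairs with its own person's wrist
```
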

Postprocessing

A lot of algorithms, including both bottom up and top down approaches, place no relational constraint on the final output. In layman's terms, an algorithm predicting joint positions from an input image has no filter for rejecting or correcting unnatural human poses. This can sometimes lead to weird pose estimates.

Pose Estimation using Kinect containing weird and unnatural pose

To cope with this, there exists a set of postprocessing algorithms which reject unnatural human poses. The output pose from any Pose Estimation pipeline is passed through a learning algorithm which scores every pose based on its plausibility. Poses that score lower than a threshold are discarded during the testing phase.
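
As a toy stand-in for such a scoring step (a hand-rolled heuristic, not a published method), one can reject poses whose paired limbs have wildly different lengths:

```python
import numpy as np

def limb_length(pose, a, b):
    return np.linalg.norm(np.asarray(pose[a]) - np.asarray(pose[b]))

def is_plausible(pose, tolerance=0.3):
    """pose: dict of joint name -> (x, y). Accepts the pose only if paired
    left/right limbs are within `tolerance` relative length of each other,
    a common symptom check for unnatural predictions."""
    pairs = [(("l_hip", "l_knee"), ("r_hip", "r_knee")),
             (("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow"))]
    for (a1, b1), (a2, b2) in pairs:
        l1, l2 = limb_length(pose, a1, b1), limb_length(pose, a2, b2)
        if abs(l1 - l2) / max(l1, l2, 1e-9) > tolerance:
            return False
    return True

pose = {"l_hip": (0, 0), "l_knee": (0, 10), "r_hip": (5, 0), "r_knee": (5, 10),
        "l_shoulder": (0, -20), "l_elbow": (0, -12),
        "r_shoulder": (5, -20), "r_elbow": (5, -13)}
print(is_plausible(pose))  # True

pose["r_knee"] = (5, 3)    # shrink one thigh: now implausible
print(is_plausible(pose))  # False
```
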


Datasets

A brief introduction to a few common datasets in Human Pose Estimation:

  • MPII : The MPII human pose dataset is a multi-person 2D Pose Estimation dataset comprising nearly 500 different human activities, collected from YouTube videos. MPII was the first dataset to contain such a diverse range of poses, and the first dataset to launch a 2D Pose Estimation challenge, in 2014.
  • COCO : The COCO keypoints dataset is a multi-person 2D Pose Estimation dataset with images collected from Flickr. COCO is the largest 2D Pose Estimation dataset to date, and is considered a benchmark for testing 2D Pose Estimation algorithms.
  • HumanEva : HumanEva is a single-person 3D Pose Estimation dataset, containing video sequences recorded using multiple RGB and grayscale cameras. Ground truth 3D poses are captured using marker-based motion capture (mocap) cameras. HumanEva was the first 3D Pose Estimation dataset of substantial size.
  • Human3.6M : Human3.6M is a single-person 2D/3D Pose Estimation dataset containing video sequences in which 11 actors perform 15 different activities, recorded using RGB and time-of-flight (depth) cameras. 3D poses are obtained using 10 mocap cameras. Human3.6M is the biggest real 3D Pose Estimation dataset to date.
  • SURREAL : SURREAL is a single-person 2D/3D Pose Estimation dataset, containing virtual video animations created using mocap data recorded in the lab. SURREAL is the biggest 3D Pose Estimation dataset, but is not yet accepted as a benchmark for comparing 3D Pose Estimation algorithms. This is mainly because it is a synthetic dataset.
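
For readers who want to work with these datasets, COCO stores each person's keypoints as a flat list [x1, y1, v1, x2, y2, v2, ...] over 17 joints, where v is a visibility flag (0: not labeled, 1: labeled but occluded, 2: labeled and visible). A minimal parser (the coordinate values below are made-up):

```python
def parse_coco_keypoints(flat):
    """Turn COCO's flat [x, y, v, x, y, v, ...] list into (x, y, v) triples."""
    assert len(flat) % 3 == 0
    return [(flat[i], flat[i + 1], flat[i + 2])
            for i in range(0, len(flat), 3)]

# Two joints of a toy annotation: nose at pixel (142, 309), visible;
# second joint unlabeled (COCO writes unlabeled joints as 0, 0, 0).
kpts = parse_coco_keypoints([142, 309, 2, 0, 0, 0])
print(kpts[0])  # (142, 309, 2)
print(kpts[1])  # (0, 0, 0)
```
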

Conclusion

Human Pose Estimation is an evolving discipline with opportunity for research on various fronts. Recently, there has been a noticeable trend towards deep learning, specifically CNN based approaches, due to their superior performance across tasks and datasets. One of the main reasons for the success of deep learning is the availability of large amounts of training data, especially with the advent of the COCO and Human3.6M datasets.

Thank you for investing your time in reading this article. If you liked it, please clap and share it with your friends. This is my first blog post, and any feedback is highly appreciated and welcome.

I am working on a Deep Learning focused Human Pose Estimation article. I will add the link here when done.

References

[1] Cao, Zhe, et al. “Realtime multi-person 2d pose estimation using part affinity fields.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[2] Microsoft Corporation. Kinect for Xbox 360. 2009.
[3] Pavllo, Dario, et al. “3D human pose estimation in video with temporal convolutions and semi-supervised training.” arXiv preprint arXiv:1811.11742 (2018).
[4] Fang, Hao-Shu, et al. “Rmpe: Regional multi-person pose estimation.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
[5] Lin, Tsung-Yi, et al. “Microsoft COCO: Common objects in context.” European Conference on Computer Vision. Springer, 2014, pp. 740–755.