Realtime Multiple Person 2D Pose Estimation using TensorFlow2.x

Original article was published by Marcelo Rovai on Deep Learning on Medium

Images source: Left: Bailarine Eugenia Delgrossi — Right: OpenPose — IEEE-2019


As described by Zhe Cao in his 2017 paper, realtime multi-person 2D pose estimation is crucial in enabling machines to understand people in images and videos.

But what is pose estimation?

As the name suggests, it is a technique used to estimate how a person is physically positioned, such as standing, sitting, or lying down. One way to obtain this estimate is to find the 18 “joints of the body” or, as they are called in the Artificial Intelligence field, “Key Points.” The images below show our goal, which is to find these points in an image:

Image source: PhysicsWorld — Einstein in Oxford (1933)

The keypoints are numbered from 0 (top neck), going down through the body joints and returning to the head, ending with point 17 (right ear).

The first significant work using an Artificial Intelligence-based approach was DeepPose, a 2014 paper by Toshev and Szegedy from Google. The paper proposed a human pose estimation method based on Deep Neural Networks (DNNs), where pose estimation was formulated as a DNN-based regression problem towards body joints.

The model consisted of an AlexNet backend (7 layers) with an extra final layer that outputs 2k joint coordinates. The significant problem with this approach is that a single person must first be detected (classic object detection) before the model is applied. So, each human body found in an image must be treated separately, which considerably increases the time needed to process the image. This type of approach is known as “top-down” because it first finds the bodies and, from them, the joints associated with each one.
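
To make the “2k joint coordinates” concrete, here is a minimal sketch in plain Python of how a flat regression output could be regrouped into k (x, y) pairs. The function name and the interleaved x, y layout are assumptions made for illustration; they are not taken from the DeepPose code.

```python
def split_joint_vector(outputs):
    """Regroup a flat [x0, y0, x1, y1, ...] vector into (x, y) pairs."""
    if len(outputs) % 2 != 0:
        raise ValueError("expected an even number of values (2 per joint)")
    return [(outputs[i], outputs[i + 1]) for i in range(0, len(outputs), 2)]


# Hypothetical network output for k = 3 joints (normalized coordinates)
raw = [0.49, 0.09, 0.49, 0.20, 0.53, 0.09]
joints = split_joint_vector(raw)
print(joints)
```

With k joints the final layer therefore has 2k outputs, one x and one y per joint.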

Challenges with Pose Estimation

There are several problems related to pose estimation, such as:

  1. Each image may contain an unknown number of people that can appear at any position or scale.
  2. Interactions between people induce complex spatial interference, due to contact, occlusion, or limb articulations, making association of parts difficult.
  3. Runtime complexity tends to grow with the number of people in the image, making realtime performance a challenge.

To solve those problems, a more exciting approach (the one used in this project) is OpenPose, which was introduced in 2016 by Zhe Cao and his colleagues from the Robotics Institute at Carnegie Mellon University.


The method proposed by OpenPose uses a nonparametric representation, referred to as Part Affinity Fields (PAFs), to “connect” each body joint found in an image, associating the joints with individual people. In other words, OpenPose does the opposite of DeepPose: it first finds all the joints in an image and then goes “up,” looking for the most probable body that contains each joint, without using any person detector (“bottom-up” approach). OpenPose finds the key points in an image regardless of the number of people in it. The image below, retrieved from the OpenPose presentation at the ILSVRC and COCO workshop 2016, gives us an idea of the process.
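
As a rough illustration of how a Part Affinity Field can “connect” two joints, the sketch below scores a candidate limb by sampling the field along the segment between two joint positions and averaging how well the field vectors align with the limb direction. It is a simplified sketch in plain Python under stated assumptions: the function name, the nested-list field layout, and the number of samples are all hypothetical, and this is not the OpenPose or tf-pose-estimation code.

```python
import math


def paf_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Average alignment between the field and the segment joint_a -> joint_b.

    paf_x and paf_y are 2D lists indexed as [row][col]; joints are (x, y) pixels.
    """
    num_samples = max(num_samples, 2)  # at least both end points
    dx, dy = joint_b[0] - joint_a[0], joint_b[1] - joint_a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        col = int(round(joint_a[0] + t * dx))
        row = int(round(joint_a[1] + t * dy))
        # Dot product between the field vector and the limb direction
        total += paf_x[row][col] * ux + paf_y[row][col] * uy
    return total / num_samples


# Toy 5x5 field: vectors pointing right along row 2, zero elsewhere
paf_x = [[0.0] * 5 for _ in range(5)]
paf_y = [[0.0] * 5 for _ in range(5)]
paf_x[2] = [1.0] * 5

aligned = paf_score(paf_x, paf_y, (0, 2), (4, 2))   # limb follows the field
crossing = paf_score(paf_x, paf_y, (2, 0), (2, 4))  # limb perpendicular to it
print(aligned, crossing)
```

A pair of joints lying along the field gets a high score, while a pair that cuts across it scores close to zero, which is what lets the bottom-up approach decide which joints belong to the same person.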

Image source: OpenPose presentation at ILSVRC and COCO workshop 2016

The image below shows the architecture of the two-branch multi-stage CNN model used for training. First, a feed-forward network simultaneously predicts a set of 2D confidence maps (S) of body part locations (keypoint annotations from dataset/COCO/annotations/) and a set of 2D vector fields of part affinities (L), which encode the degree of association between parts. After each stage, the two branches’ predictions, along with the image features, are concatenated for the next stage. Finally, the confidence maps and the affinity fields are parsed by greedy inference to output the 2D keypoints for all people in the image.
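
The “greedy inference” step can be pictured with a small sketch: given scored candidate connections between two kinds of joints, accept them from the highest score down, never reusing a joint. This is a simplified illustration in plain Python; the function name and the sample scores are hypothetical and it does not reproduce the actual OpenPose parsing code.

```python
def greedy_match(candidates):
    """candidates: list of (score, index_a, index_b). Returns accepted pairs."""
    used_a, used_b, accepted = set(), set(), []
    # Visit candidates from the highest score to the lowest
    for score, a, b in sorted(candidates, reverse=True):
        if a not in used_a and b not in used_b:
            accepted.append((a, b, score))
            used_a.add(a)
            used_b.add(b)
    return accepted


# Hypothetical scores between 2 necks (a) and 2 right shoulders (b)
candidates = [(0.9, 0, 0), (0.8, 0, 1), (0.7, 1, 1), (0.2, 1, 0)]
matches = greedy_match(candidates)
print(matches)
```

Here neck 0 is paired with shoulder 0 and neck 1 with shoulder 1, even though neck 0 also had a fairly strong score towards shoulder 1.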

Image source: 2017 OpenPose Paper

During the execution of the project, we will return to some of those concepts for clarification. However, it is highly recommended to follow the OpenPose ILSVRC and COCO workshop 2016 presentation and the video recording at CVPR 2017 for a better understanding.

TensorFlow 2 OpenPose installation (tf-pose-estimation)

The original OpenPose was developed with the Caffe framework, using a pre-trained VGG network as its base model. However, for this installation, we will follow Ildoo Kim’s TensorFlow approach, as detailed on his tf-pose-estimation GitHub.

What is tf-pose-estimation?

tf-pose-estimation is the ‘Openpose’, human pose estimation algorithm that has been implemented using Tensorflow. It also provides several variants that have some changes to the network structure for realtime processing on the CPU or low-power embedded devices.

The tf-pose-estimation GitHub shows several experiments with different models, such as:

  • cmu: the model-based VGG pretrained network described in the original paper with weights in Caffe format converted to be used in TensorFlow.
  • dsconv: same architecture as the cmu version except for the depthwise separable convolution of mobilenet.
  • mobilenet: based on the mobilenet V1 paper, 12 convolutional layers are used as feature-extraction layers.
  • mobilenet v2: similar to mobilenet, but using an improved version of it.

The studies in this article were done with mobilenet V1 (“mobilenet_thin”), which has intermediate performance in terms of computation budget and latency.

Part 1 — Installing tf-pose-estimation

Here we follow the excellent Gunjan Seth article, Pose Estimation with TensorFlow 2.0.

  • Go to the terminal, create a working directory (for example, “Pose_Estimation”), and move into it:
mkdir Pose_Estimation
cd Pose_Estimation
  • Create a Virtual Environment (for example Tf2_Py37)
conda create --name Tf2_Py37 python=3.7.6 -y 
conda activate Tf2_Py37
pip install --upgrade pip
pip install tensorflow
  • Install basic packages to be used during development:
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge opencv
  • Clone tf-pose-estimation repository:
git clone
  • Go to tf-pose-estimation folder and install the requirements
cd tf-pose-estimation/
pip install -r requirements.txt

In the next step, install SWIG, an interface compiler that connects programs written in C and C++ with scripting languages such as Python. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code.

conda install swig
  • Using SWIG, build the C++ library for post-processing.
cd tf_pose/pafprocess
swig -python -c++ pafprocess.i && python3 build_ext --inplace

Now, install tf-slim library, a lightweight library used for defining, training, and evaluating complex models in TensorFlow.

pip install git+

That is it! Now, it is essential to run a quick test. For that, return to the main tf-pose-estimation directory.

If you followed the sequence, you should be inside tf_pose/pafprocess. Otherwise, use the appropriate command to change directory.

cd ../..

Inside the tf-pose-estimation directory there is a Python script. Let’s run it with the following parameters:

  • model=mobilenet_thin
  • resize=432×368 (size of the image at pre-processing)
  • image=./images/ski.jpg (sample image inside images directory)
python --model=mobilenet_thin --resize=432x368 --image=./images/ski.jpg

Note that for a few seconds nothing will happen, but after a minute or so, the terminal should present something similar to the image below:

More importantly, an image will appear in a separate OpenCV window:

Great! The images are proof that everything was properly installed and is working fine. We will go into more detail in the next section. For now, here is a quick explanation of what the four images mean: the top-left (“Result”) is the pose detection skeleton drawn with the original image (in this case, ski.jpg) as background. The top-right image is a “heat map”, where the detected parts (S’s) are shown, and both bottom images show the part associations (L’s). The “Result” connects the S’s and L’s into individual persons.

The next test is a live video:

If the computer has only one camera installed, use: camera=0

python --model=mobilenet_thin --resize=432x368 --camera=1

If everything goes well, a window will appear with a real live video, like this screenshot:

Image source: PrintScreen Author’s WebCam

Part 2 — Going Deeper with Pose Estimation in Images

In this section, we will go more in-depth with our TensorFlow Pose Estimation implementation. It is advised to follow the article while trying to reproduce the Jupyter Notebook 10_Pose_Estimation_Images, which can be downloaded from the Project GitHub.

As a reference, this project was 100% developed on a MacPro (2.9GHz Quad-Core i7, 16GB 2133MHz RAM).

Import Libraries

import sys
import time
import logging
import numpy as np
import matplotlib.pyplot as plt
import cv2
from tf_pose import common
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh

Model definition and TfPose Estimator creation

It is possible to use the models located in the model/graph sub-directory, such as mobilenet_v2_large or cmu (VGG pretrained model).

For cmu, the *.pb files were not downloaded during installation because they are large. To use it, run the bash script located in the /cmu sub-directory.

This project uses mobilenet_thin (MobilenetV1), so all images used should be resized to 432×368.


model = 'mobilenet_thin'
resize = '432x368'
w, h = model_wh(resize)
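
To give an idea of what turning a string such as '432x368' into a width and a height involves, here is a minimal sketch in plain Python. The helper parse_resize is hypothetical and is not the library’s model_wh; the multiple-of-16 check is an assumption added for illustration.

```python
def parse_resize(resolution):
    """Parse a 'WIDTHxHEIGHT' string such as '432x368' into two integers."""
    width, height = (int(value) for value in resolution.lower().split('x'))
    # Assumption: both sides are expected to be multiples of 16
    if width % 16 or height % 16:
        raise ValueError('width and height should be multiples of 16')
    return width, height


size = parse_resize('432x368')
print(size)
```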

Create estimator:

e = TfPoseEstimator(get_graph_path(model), target_size=(w, h))

Let us load a simple human image for easier analysis. OpenCV is used to read images. Image files are usually stored as RGB, but internally OpenCV works with BGR. Showing an image with OpenCV is not a problem, because its display window also handles the BGR order correctly (as seen with ski.jpg in the previous section).

Since the image will be plotted in a Jupyter cell, Matplotlib will be used instead of OpenCV. Because of that, the image should be converted before display, as shown below:

image_path = './images/human.png'
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
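
What the BGR to RGB conversion does can be shown without OpenCV: it simply reverses the order of the three channels of every pixel. The sketch below uses nested Python lists so that it runs anywhere; the helper name and the toy image are hypothetical.

```python
def bgr_to_rgb(image):
    """Reverse the channel order of an image stored as rows of [B, G, R] pixels."""
    return [[pixel[::-1] for pixel in row] for row in image]


# A 1x2 toy image: one pure-blue and one pure-red pixel, in BGR order
bgr = [[[255, 0, 0], [0, 0, 255]]]
rgb = bgr_to_rgb(bgr)
print(rgb)
```

For a 3-channel NumPy array, the same reversal can be written as image[..., ::-1].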

Observe that this image has a shape of 567×567. When OpenCV reads an image, it automatically converts it to an array where each value goes from 0 to 255, with 0 = “black” and 255 = “white”.

Once the image is an array, it is simple to verify its size, using shape:

image.shape

The result will be (567, 567, 3), where the shape is (height, width, color channels).

Although the image can be read using OpenCV, we will use the function read_imgfile(image_path) from the library tf_pose.common to prevent any trouble with color channels.

image = common.read_imgfile(image_path, None, None)

Once we have the image as an array, we can apply the inference method of the estimator (e), with the image array as input (the image will be resized using the parameters w and h defined at the beginning).

humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)

After running the above command, let us inspect the array e.heatMat. This array has a shape of (184, 216, 19), where 184 is h/2, 216 is w/2, and the 19 channels hold the probability of that specific pixel belonging to each of the 18 joints (0 to 17) plus one more (18: none). For example, inspecting the top-left pixel, e.heatMat[0][0], a “none” should be expected.

It is possible to verify the last value of this array, e.heatMat[0][0][18], which is the highest value of all. This can be understood as: with a 99.6% chance, this pixel does not belong to any of the 18 joints.
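
The (184, 216, 19) shape can be checked with a little arithmetic. The sketch below assumes that the network output is 8 times smaller than the input on each side and is then enlarged by the upsample_size of 4.0 passed to inference; the stride of 8 and the helper name are assumptions made for illustration.

```python
def heatmap_shape(width, height, stride=8, upsample=4.0, channels=19):
    """Expected (rows, cols, channels) of the upsampled heat map."""
    rows = int(height / stride * upsample)
    cols = int(width / stride * upsample)
    return rows, cols, channels


shape = heatmap_shape(432, 368)
print(shape)
```

Under these assumptions, 368 / 8 * 4 = 184 and 432 / 8 * 4 = 216, which is where h/2 and w/2 come from.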

Let us try to find the base of the neck (midpoint between shoulders). It is located on the original picture around mid-width and around 20% of the height, counting from the top. On the heat map, which is 216 wide and 184 high, that corresponds to column 0.5 * 216 = 108 and row 0.2 * 184 ≈ 37. So, let us inspect this specific pixel, e.heatMat[37][108].

It is easy to see that position 1 has the maximum value of 0.7059… (which can also be obtained with e.heatMat[37][108].max()), meaning that this specific pixel has a 70% probability of being the “base neck.” The figure below shows all 18 COCO keypoints (or “body joints”), showing that “1” corresponds to the “base neck”.
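
The two steps above, locating a pixel from relative coordinates and then asking which channel wins at that pixel, can be sketched in plain Python. The helper names are hypothetical, and the list of scores is made up except for the 0.7059 value at position 1 quoted above.

```python
def relative_to_heatmap(rel_x, rel_y, cols=216, rows=184):
    """Convert relative image coordinates into a (row, col) heat map index."""
    return round(rel_y * rows), round(rel_x * cols)


def most_likely_part(pixel_scores):
    """Return (channel index, score) of the highest value among the channels."""
    index = max(range(len(pixel_scores)), key=pixel_scores.__getitem__)
    return index, pixel_scores[index]


position = relative_to_heatmap(0.5, 0.2)

# Hypothetical 19 channel values for that pixel; only position 1 is from the text
scores = [0.01] * 19
scores[1] = 0.7059
best = most_likely_part(scores)
print(position, best)
```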

COCO keypoint format for human pose skeletons.

It is possible to plot, for every pixel, a color representing its maximum value. Doing that, a heat map showing the key points will magically appear:

max_prob = np.amax(e.heatMat[:, :, :-1], axis=2)

Let us now plot the key points over the resized original image:

bgimg = cv2.cvtColor(image.astype(np.uint8), cv2.COLOR_BGR2RGB)
bgimg = cv2.resize(bgimg, (e.heatMat.shape[1], e.heatMat.shape[0]), interpolation=cv2.INTER_AREA)
plt.imshow(bgimg, alpha=0.5)
plt.imshow(max_prob, alpha=0.5)

So, it is possible to see the keypoints (S’s) over the image. As the colorbar values show, the more yellow a region is, the higher the probability.
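
Going from such a heat map to actual keypoint candidates amounts to looking for local maxima above some threshold. Here is a minimal sketch of that idea in plain Python, applied to a tiny made-up confidence map; the function name, the threshold, and the 8-neighbour comparison are assumptions for illustration and not the library’s implementation.

```python
def find_peaks(conf_map, threshold=0.5):
    """Return (row, col, score) for values above threshold that beat all neighbours."""
    rows, cols = len(conf_map), len(conf_map[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = conf_map[r][c]
            if value < threshold:
                continue
            # Collect the up to 8 surrounding values, staying inside the map
            neighbours = [conf_map[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, rows))
                          for cc in range(max(c - 1, 0), min(c + 2, cols))
                          if (rr, cc) != (r, c)]
            if all(value > n for n in neighbours):
                peaks.append((r, c, value))
    return peaks


# Toy confidence map with two bright spots
conf_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.3, 0.8],
    [0.0, 0.0, 0.0, 0.2, 0.3],
]
peaks = find_peaks(conf_map)
print(peaks)
```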

To get the L’s, the most probable connections (or “bones”) between the key points (or “joints”), we can use the resulting array e.pafMat. This array has a shape of (184, 216, 38), where the 38 (2 x 19) channels relate to the probability of that pixel being part of a horizontal (x) or vertical (y) connection with one of the 18 specific joints + none.

The functions to plot the above figures are in the Notebook.

Draw the skeleton using method draw_humans

With the list humans, returned by the e.inference() method, it is possible to draw the skeleton using the method draw_humans:

image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)

The result is shown in the image below:

If desired, it is possible to plot only the skeleton, as shown here (let us rerun all code for a recap):

image = common.read_imgfile(image_path, None, None)
humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
black_background = np.zeros(image.shape)
skeleton = TfPoseEstimator.draw_humans(black_background, humans, imgcopy=False)

Getting the Key points (Joints) coordinates

Pose estimation can be used in a series of applications such as robotics, gaming, or medicine. For that, it could be interesting to get the physical keypoint coordinates from the image, to be used by other applications.

Looking at the humans list returned by e.inference(), it can be verified that it is a list with a single element, which prints as a string. In this string, every key point appears with its relative coordinates and associated probability. For example, for the human image used so far, we have:

BodyPart:0-(0.49, 0.09) score=0.79
BodyPart:1-(0.49, 0.20) score=0.75
BodyPart:17-(0.53, 0.09) score=0.73

We can extract an array (size of 18) from this list with the real coordinates related to the original image shape:

import re
# Each key point prints as "BodyPart:<id>-(<x>, <y>) score=<p>"
keypoints_list = [(float(x), float(y)) for x, y in re.findall(r'\(([\d.]+), ([\d.]+)\)', str(humans[0]))]
keypts_array = np.array(keypoints_list)
keypts_array = keypts_array*(image.shape[1],image.shape[0])
keypts_array = keypts_array.astype(int)
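
The same kind of extraction can also be tried without running the model, using only the three sample lines shown above. The sketch below is a self-contained version in plain Python that also keeps the joint number and the score; the function name and the regular expression are hypothetical, written to match the printed format.

```python
import re

# Matches text such as "BodyPart:1-(0.49, 0.20) score=0.75"
BODY_PART = re.compile(r'BodyPart:(\d+)-\(([\d.]+), ([\d.]+)\) score=([\d.]+)')


def parse_body_parts(text, image_width, image_height):
    """Map joint id -> (x in pixels, y in pixels, score)."""
    parts = {}
    for part_id, rel_x, rel_y, score in BODY_PART.findall(text):
        parts[int(part_id)] = (int(float(rel_x) * image_width),
                               int(float(rel_y) * image_height),
                               float(score))
    return parts


sample = ('BodyPart:0-(0.49, 0.09) score=0.79 '
          'BodyPart:1-(0.49, 0.20) score=0.75 '
          'BodyPart:17-(0.53, 0.09) score=0.73')
parts = parse_body_parts(sample, 567, 567)
print(parts)
```

Keeping the joint number as a dictionary key means a joint that was not detected is simply absent, instead of shifting the positions of the others.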

Let us plot this array over the original image (the array’s index is the key point number). Here is the result:

fig, ax = plt.subplots(figsize=(10, 10))
img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
ax.imshow(img)  # imshow places the origin at the top-left, like the image coordinates
ax.scatter(*zip(*keypts_array), s=200, color='orange', alpha=0.6)
for i, txt in enumerate(keypts_array):
    ax.annotate(str(i), (keypts_array[i][0]-5, keypts_array[i][1]+5))

Creating Functions to reproduce the studies on generic images quickly:

The Notebook shows all the code developed so far, “encapsulated” as functions. For example, let us see another image:

image_path = '../images/einstein_oxford.jpg'
img, hum = get_human_pose(image_path)
keypoints = show_keypoints(img, hum, color='orange')
Image source: PhysicsWorld — Einstein in Oxford (1933)
img, hum = get_human_pose(image_path, showBG=False)
keypoints = show_keypoints(img, hum, color='white', showBG=False)

Studying images with multiple persons

So far, we have only explored images that contain a single person. Since the algorithm was developed to capture all joints (S’s) and PAFs (L’s) from the image at the same time, working with a single person was only for simplicity. The code to get the result is the same; the only difference is that the resulting list (“humans”) will have a size compatible with the number of people in the image.

For example, let us use a “busy image” with five people in it:

image_path = './images/ski.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
Image source: OpenPose — IEEE-2019

The algorithm found all the S’s and L’s, associating them with the five people. The result is excellent!

From reading the image path to plotting the result, the whole process took less than 0.5s, independent of the number of people found in the image.

Let us complicate it and look at an image where people are more “mixed”, such as a couple dancing:

image_path = '../images/figure-836178_1920.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
Image source: Pixabay

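The result also seems very good. Let us plot only the keypoints, with a different color for each person (keypoints_1 and keypoints_2 hold the coordinates of each dancer, extracted as in the previous section):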

plt.axis([0, img.shape[1], 0, img.shape[0]])
plt.scatter(*zip(*keypoints_1), s=200, color='green', alpha=0.6)
plt.scatter(*zip(*keypoints_2), s=200, color='yellow', alpha=0.6)
plt.title('Keypoints of all humans detected\n')

Part 3: Pose Estimation in Videos and live camera

The process of getting the pose estimation in videos is the same as we did with images because a video can be treated as a succession of images (frames). It is advised to follow the section, trying to reproduce Jupyter Notebook: 20_Pose_Estimation_Video which can be downloaded from Project GitHub.

OpenCV does a fantastic job of handling videos.

So, let us get an .mp4 video and tell OpenCV that we will capture its frames:

video_path = '../videos/dance.mp4'
cap = cv2.VideoCapture(video_path)

Now let us create a loop that will capture each frame. With the frame in hand, we will apply e.inference() and, from the result, draw the skeleton, the same way as we did with images. Some code was included at the end to stop the video when a key (‘q’, for example) is pressed.

Below is the necessary code:

showBG = True  # set to False to draw the skeleton on a black background
fps_time = 0
while True:
    ret_val, image = cap.read()
    if not ret_val:  # no more frames to read
        break
    humans = e.inference(image,
                         resize_to_default=(w > 0 and h > 0),
                         upsample_size=4.0)
    if not showBG:
        image = np.zeros(image.shape)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)), (10, 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):  # stop when 'q' is pressed
        break
cap.release()
cv2.destroyAllWindows()
Image source: ScreenShot from video sample on tf-pose-estimation GitHub

The result is fantastic, but a little slow. The movie, which originally ran at around 30 FPS (frames per second), will run here in “slow motion”, at around 3 FPS.
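
The FPS number drawn on each frame is computed from a single frame interval, so it tends to jump around. One possible way to get a steadier reading is to average over the last few frames, as in the sketch below. The FpsMeter class is hypothetical and not part of tf-pose-estimation; the timestamps are simulated so that the example runs without a video.

```python
from collections import deque


class FpsMeter:
    """Average frames per second over the last `window` frame timestamps."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self, now):
        """Record a frame timestamp (in seconds) and return the average FPS."""
        self.stamps.append(now)
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0


meter = FpsMeter(window=5)
for frame_index in range(4):
    # Simulated timestamps 0.0, 0.25, 0.5, 0.75: one frame every 0.25 s
    fps = meter.tick(frame_index * 0.25)
print(fps)
```

Inside a capture loop, the simulated timestamp would be replaced by time.time().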

Here is another experiment, where the movie was run twice, recording the estimated pose skeleton with and without the background video. The videos were manually synchronized, and even if the result is not perfect, it is fascinating. I cut the last scene of the 1928 Chaplin movie “The Circus,” where the way the Tramp walks is classic.

Testing with a live camera

It is advised to follow the section, trying to reproduce Jupyter Notebook: 30_Pose_Estimation_Camera which can be downloaded from Project GitHub.

The code needed to run a live camera is almost the same as that used with video, except that the OpenCV VideoCapture() method receives as its input parameter an integer that identifies which physical camera to use. For example, an internal camera uses “0” and an external one “1”. The camera should also be set to capture frames at ‘432×368’, the size used by the model.

Parameters initialization:

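camera = 1
resize = '432x368' # resize images before they are processed
resize_out_ratio = 4.0 # resize heatmaps before they are post-processed
model = 'mobilenet_thin'
show_process = False
tensorrt = False # for tensorrt process
w, h = model_wh(resize)
e = TfPoseEstimator(get_graph_path(model), target_size=(w, h))
cam = cv2.VideoCapture(camera)
cam.set(cv2.CAP_PROP_FRAME_WIDTH, w)   # property id 3
cam.set(cv2.CAP_PROP_FRAME_HEIGHT, h)  # property id 4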

The loop part of the code should be very similar to the one used with video:

fps_time = 0
while True:
    ret_val, image = cam.read()
    if not ret_val:  # the camera did not return a frame
        break
    humans = e.inference(image,
                         resize_to_default=(w > 0 and h > 0),
                         upsample_size=resize_out_ratio)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)), (10, 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):  # stop when 'q' is pressed
        break
cam.release()
cv2.destroyAllWindows()
Image source: PrintScreen Author’s WebCam

Again, the standard video capture rate of 30 FPS is reduced to around 10% of that when the algorithm is used.
Here is a full video where the delay can be better observed. However, the result is excellent!


As always, I hope this article can inspire others to find their way in the fantastic world of AI!

All the code used in this article is available for download on the project GitHub: TF2_Pose_Estimation

Regards from the South of the World!

See you in my next article!

Thank you