Original article can be found here (source): Deep Learning on Medium
Human Pose Estimation with Stacked Hourglass Network and TensorFlow
Originally published at https://www.yanjia.li on Dec 30, 2019.
For full source code, please go to https://github.com/ethanyanjiali/deep-vision/tree/master/Hourglass/tensorflow. I really appreciate your ⭐STAR⭐ that supports my efforts.
Human is good at making different poses. Human is good at understanding these poses too. This makes body language become such an essential part of our daily communication, work, and entertainment. Unfortunately, poses have so much variance, so it’s not an easy task for a computer to recognize a pose from a picture…until we have deep learning!
With a deep neural network, the computer can learn a generalized pattern of human poses, and predict joints location accordingly. The Stacked Hourglass Network is just such kind of network, and I’m going to show you how to use it to make a simple human pose estimation. Although first introduced in 2016, it’s still one of the most important networks in pose estimation area, and widely used in lots of applications. No matter if you want to build a software to track basketball player’s action, or make a body language classifier based on a person’s pose, this would be a handy hands-on tutorial for you.
To simply put, Stacked Hourglass Network (HG) is a stack of hourglass modules. It got this name because the shape of each hourglass module closely resemble an hourglass, as we can see from the picture below:
The idea behind stacking multiple HG (Hourglass) modules instead of forming a giant encoder and decoder network is that each HG module will produce a full heat-map for joint prediction. Thus, the latter HG module can learn from the joint predictions of the previous HG module.
Why would a heat-map help human pose estimation? This is a pretty common technique nowadays. Unlike facial keypoints, human pose data has lots of variances, which makes it hard to converge if we just simply regress the joint coordinates. Smart researchers come up with an idea to use heat-map to represent a joint location in an image. This preserves the location information, and then we just need to find the peak of the heat-map and use that as the joint location (plus some minor adjustment since heat-map is coarse). For a 256×256 input image, our heat-map will be 64×64.
In addition, we would also calculate the loss for each intermediate prediction, which helps us to supervise not only the final output but also all HG modules effectively. This is a brilliant design back then because pose estimation relies on the relationship among each area of the human body. For example, without seeing the location of the body, it’s hard to tell if an arm is left arm or right arm. By using a full prediction as the next modules’ input, we are forcing the network to pay attention to other joints while predicting a new join location.
So how does this HG (Hourglass) module itself look like? Let’s take a look at another diagram from the original paper:
In the diagram, each box is a residual block plus some additional operations like pooling. If you are not familiar with residual block and bottleneck structure, I’d recommend you to read some ResNet article first. In general, an HG module is an encoder and decoder architecture, where we downsample the features first, and then upsample the features to recover the info and form a heat-map. Each encoder layer would have a connection to its decoder counterpart, and we could stack as many as layers we want. In the implementation, we usually make some recursions and let this HG module to repeat itself.
I understand that it still seems too “convoluted” here, so it might be easier just to read the code. Here’s a piece of code copied from my Stacked Hourglass implementation on Github deep-vision repo:
This module looks like an onion, and let’s start from the outmost layer first. up1 went through two bottleneck blocks and added together with up2. This represents two bigs boxes on the left and top, and also the right-most plus sign. The whole flow is up in the air, so we call it up channel. On line 17, there’s also a low channel. This low1 goes through some pooling and bottleneck block, then goes into another smaller Hourglass module! On the diagram, it’s the second layer of the big onion. And this is also why we are using recursion here. We keep repeating this HG module until layer 4, where you just have a single bottleneck instead of an HG module. And this final layer in the three tiny boxes in the middle of the diagram.
If you are familiar with some image classification networks, it’s clear that the author borrows the idea of skip connection very heavily. This repeating pattern connects the corresponding layers in the encoder and decoder together, instead of just having one flow of features. This not only helps the gradient to pass through but also lets the network consider features from different scales when decoding.
Now that we have an Hourglass module, and we know that the whole network consists of multiple modules like this, but how do we stack them together precisely? Here comes the final piece of the network: intermediate supervision.
As you can see from the diagram above, when we produce something from the HG module, we split the output into two paths. The top path includes some more convolutions to further process the features and then go to the next HG module. The interesting thing happens at the bottom path. Here we use the output of that convolution layer as an intermediate heat-map result (blue box) and then calculate loss between this intermediate heat-map and the ground-truth heat-map. In other words, if we have 4 HG modules, we will need to calculate four losses in total: 3 for the intermediate result, and 1 for the final result.
Prepare the Data
Once we finished the code for the Stacked Hourglass network, it’s time for us to think about what kind of data we’d like to use to train this network. If you have your own dataset, that’s great. But here I’d like to mention an open dataset for those beginners who want to have something to train on first. And it’s called MPII Dataset (Max Planck Institute for Informatics). You could find the download link here.
Although this dataset is mostly used for single person pose estimation, it does provide joints annotations for multiple people in the same image. For each person, it gives the coordinates for 16 joints, such as the left ankle or right shoulder.