HAMR — 3D Hand Shape and Pose Estimation from a Single RGB Image


2.1. 2D Pose Estimation

Similar to other recent methods, the authors employ a cascaded encoder-decoder network to predict 2D Gaussian-like heat-maps for 2D pose estimation, denoted as Φ in Equation 1 below.

Equation 1. 2D Gaussian-like heat-maps representing the 2D joint estimates.

Here, K indicates the number of joints and {H, W} is the spatial resolution of the heat-maps. Each keypoint has a corresponding heat-map, and each pixel value on that heat-map indicates the confidence that the keypoint is located at that 2D position.
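To make the heat-map representation concrete, the following is a minimal PyTorch sketch (not the authors' code) that decodes K heat-maps of size H×W into 2D keypoint coordinates by taking each channel's argmax; the joint count and resolution are illustrative assumptions.

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor) -> torch.Tensor:
    """Decode 2D keypoints from per-joint heat-maps.

    heatmaps: tensor of shape (K, H, W), one confidence map per joint.
    Returns:  tensor of shape (K, 2) with (u, v) pixel coordinates taken
              at the location of each channel's maximum confidence.
    """
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.view(K, -1).argmax(dim=1)         # (K,)
    v = torch.div(flat_idx, W, rounding_mode='floor')     # row index
    u = flat_idx % W                                      # column index
    return torch.stack([u, v], dim=1).float()

# Example: 21 hand joints with 64x64 heat-maps
keypoints = decode_heatmaps(torch.rand(21, 64, 64))
print(keypoints.shape)  # torch.Size([21, 2])
```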

In addition, since perspective ambiguity cannot be resolved by regressing directly from the 2D pose heat-maps alone, the authors concatenate intermediate-layer features with the K heat-maps and feed them into the subsequent iterative regression module to provide additional information.
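As a rough illustration of this design choice (shapes and variable names are assumptions, not HAMR's exact architecture), the heat-maps can simply be concatenated with the intermediate feature maps along the channel dimension before entering the regression module:

```python
import torch

# Illustrative shapes: batch of 8, 256 intermediate feature channels,
# 21 joint heat-maps, all at 64x64 spatial resolution.
features = torch.randn(8, 256, 64, 64)   # intermediate-layer features
heatmaps = torch.rand(8, 21, 64, 64)     # predicted 2D heat-maps

# Channel-wise concatenation gives the regression module both appearance
# features and explicit 2D pose evidence.
regression_input = torch.cat([features, heatmaps], dim=1)
print(regression_input.shape)  # torch.Size([8, 277, 64, 64])
```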

2.2. Hand Mesh Recovery

2.2.1 Hand Mesh Representation

MANO (hand Model with Articulated and Non-rigid defOrmations) is utilized as the generic 3D hand model in the HAMR framework. In particular, MANO factors the hand mesh into shape, which mainly models hand properties such as finger slenderness and palm thickness, and pose.

MANO parameterizes a triangulated mesh M ∈ R^{N×3} with a set of parameters θ_mesh = {β, θ}, where β ∈ R¹⁰ denotes the shape parameters and θ ∈ R^{K×3} denotes the pose parameters. Technically speaking, β represents the coefficients of the PCA components that sculpt the subject's identity, and θ denotes the relative 3D rotations of the K joints in a Rodrigues vector (axis-angle) representation.
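For readers who want to experiment with MANO in code, here is a minimal sketch using the third-party manopth PyTorch wrapper (not part of HAMR itself); the path and tensor shapes are illustrative, and the MANO model files must be obtained from the official project page.

```python
import torch
from manopth.manolayer import ManoLayer  # third-party PyTorch MANO wrapper

# The MANO model files must be downloaded from the official project page
# and placed under mano_root; the path below is a placeholder.
mano_layer = ManoLayer(mano_root='mano/models', use_pca=False, side='right')

batch_size = 1
pose = torch.zeros(batch_size, 48)   # 3 global-rotation + 45 per-joint axis-angle params
shape = torch.zeros(batch_size, 10)  # the 10 PCA shape coefficients (beta)

# Returns the mesh vertices and the 3D joint locations derived from them.
verts, joints = mano_layer(pose, shape)
print(verts.shape, joints.shape)     # e.g. (1, 778, 3) and (1, 21, 3)
```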

For the enthusiastic reader:
For more details on MANO, check out the official project page or their video demo.

2.2.2 Derived Hand Pose Representations

Given the recovered mesh, the 3D joint locations Φ_3D can be computed via linear interpolation of the mesh vertices, and the 2D joint locations Φ_2D are then obtained by projecting the 3D joints.

More specifically:

where Φ_3D is the set of 3D coordinates (x, y, z) for each of the K keypoints, Φ_2D is the set of corresponding 2D coordinates (u, v) on the image plane, and θ_cam = {s, t_x, t_y} denotes the weak-perspective camera parameters (scale and 2D translation).
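In equation form, these derived representations can be sketched roughly as follows, where W denotes an assumed linear interpolation matrix over the mesh vertices (the paper's exact notation may differ):

```latex
\Phi_{3D} = W\,\mathcal{M}, \quad W \in \mathbb{R}^{K \times N},\ \mathcal{M} \in \mathbb{R}^{N \times 3},
\qquad
\Phi_{2D} = \Pi\left(\Phi_{3D};\ \theta_{cam}\right)
```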

Lastly, the 3D-to-2D projection function Π can be defined as:

where the authors use a weak-perspective camera model, i.e., a scaled orthographic projection.
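As a concrete illustration, such a weak-perspective projection can be implemented in a few lines; the function below is an assumed sketch rather than HAMR's code, with parameter names following θ_cam = {s, t_x, t_y} from above.

```python
import torch

def weak_perspective_projection(joints_3d: torch.Tensor,
                                s: float, tx: float, ty: float) -> torch.Tensor:
    """Project 3D joints to 2D with a weak-perspective camera.

    joints_3d: (K, 3) tensor of (x, y, z) coordinates.
    Returns:   (K, 2) tensor of (u, v) coordinates: the depth z is dropped
               (orthographic projection), then the result is scaled by s
               and translated by (tx, ty).
    """
    uv = joints_3d[:, :2]                  # orthographic part: drop z
    return s * uv + torch.tensor([tx, ty])

# Example with 21 random 3D joints
joints_2d = weak_perspective_projection(torch.randn(21, 3), s=100.0, tx=112.0, ty=112.0)
print(joints_2d.shape)  # torch.Size([21, 2])
```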

2.3. Iterative Regression Module

A regression module is applied to fit the camera parameters θ_cam and the mesh parameters θ_mesh. However, the complicated mapping from image features to these parameters makes it difficult to produce reasonable estimates in a single pass.

Inspired by several previous works showing that a cascaded, coarse-to-fine approach is more appropriate than a one-pass solution, an iterative regression module was implemented. The iterative regression module is designed to fit the camera and mesh parameters from the semantic features extracted by the preceding 2D pose module.

Figure 3. Architecture of the Iterative Regression Module. This module takes the cross-level features as input and regresses the camera and mesh parameters in an iterative manner.

Intuitively, the current parameters θ at time t are taken as additional inputs alongside the image features, yielding a more accurate estimate of θ at time t+1. As illustrated in Figure 3, the iterative regression module consists of a simple fully convolutional encoder and multiple fully connected layers.
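The refinement loop can be sketched as follows, in the spirit of iterative error feedback; this is an illustrative re-implementation, and the layer sizes, parameter dimensions, and number of iterations are assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

class IterativeRegressor(nn.Module):
    """Refine camera + mesh parameters over several steps."""
    def __init__(self, feat_dim: int, param_dim: int, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim + param_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, param_dim),
        )

    def forward(self, feats: torch.Tensor, theta_init: torch.Tensor) -> torch.Tensor:
        theta = theta_init
        for _ in range(self.n_iters):
            # Condition the update on the image features and the current
            # parameter estimate, and predict a residual correction.
            delta = self.regressor(torch.cat([feats, theta], dim=1))
            theta = theta + delta
        return theta

# Example: 2048-d features; 3 camera + 10 shape + 48 pose = 61 parameters
model = IterativeRegressor(feat_dim=2048, param_dim=61)
theta = model(torch.randn(4, 2048), torch.zeros(4, 61))
print(theta.shape)  # torch.Size([4, 61])
```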

The predictions are strongly supervised using ground-truth camera parameters computed from paired 3D and 2D annotations.

2.4. Loss

Since the target is to recover the hand mesh from a single RGB image, a deep convolutional network is leveraged to fit the mesh parameters θ_mesh.

However, in a real-world scenario, it is almost impossible to obtain ground-truth mesh parameters when annotating single RGB images. Fortunately, the HAMR framework can derive 3D and 2D joint locations from the mesh. By doing so, HAMR can be trained with widely available 3D and 2D annotations, thus enabling mesh reconstruction.

HAMR defines the overall loss function as follows:
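The overall objective is a weighted sum of the individual supervision terms; a rough reconstruction (a sketch based on the weights listed below, not copied verbatim from the paper) is:

```latex
\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D}
            + \lambda_{geo}\mathcal{L}_{geo} + \lambda_{cam}\mathcal{L}_{cam}
            + \lambda_{ht}\mathcal{L}_{ht} + \lambda_{seg}\mathcal{L}_{seg}
```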

{λ_3D, λ_2D, λ_geo, λ_cam, λ_ht, λ_seg} are hyperparameters that trade off the different types of supervision across the whole framework.

Here, an L2 loss is employed between the derived 3D and 2D representations and the ground-truth labels, resulting in L_3D and L_2D, respectively. Additionally, the geometric constraints are reformulated as regularizers, leading to L_geo, which is defined over the predicted 3D poses (more details about the geometric loss can be found in the paper).

Furthermore, an L2 loss is utilized to supervise the estimated camera parameters with the ground-truth camera parameters, leading to L_cam. Finally, the model penalizes the misalignment between the rendered mask and the ground-truth silhouette via an L1 loss, leading to L_seg (more details about the silhouette loss can be found in the paper).
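Putting the pieces together, a simplified training-loss computation might look like the sketch below; the tensor names and weight keys are assumptions, and the geometric regularizer is reduced to a placeholder term.

```python
import torch
import torch.nn.functional as F

def hamr_style_loss(pred: dict, gt: dict, w: dict) -> torch.Tensor:
    """Weighted sum of the supervision terms described above.

    pred / gt: dicts of tensors; w: dict of scalar weights (the lambdas).
    """
    loss_3d  = F.mse_loss(pred['joints_3d'], gt['joints_3d'])  # L2 on derived 3D joints
    loss_2d  = F.mse_loss(pred['joints_2d'], gt['joints_2d'])  # L2 on projected 2D joints
    loss_cam = F.mse_loss(pred['cam'],       gt['cam'])        # L2 on camera parameters
    loss_ht  = F.mse_loss(pred['heatmaps'],  gt['heatmaps'])   # L2 on 2D heat-maps
    loss_seg = F.l1_loss(pred['mask'],       gt['mask'])       # L1 on rendered silhouette
    loss_geo = pred['geo_penalty']                             # geometric regularizer (see paper)

    return (w['3d'] * loss_3d + w['2d'] * loss_2d + w['geo'] * loss_geo
            + w['cam'] * loss_cam + w['ht'] * loss_ht + w['seg'] * loss_seg)
```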

The whole process is fully differentiable with respect to all learnable parameters, making the HAMR framework trainable end-to-end.