Source: Deep Learning on Medium
Orientation Estimation in Monocular 3D Object Detection
Monocular 3D object detection is the task of drawing 3D oriented bounding boxes around objects in a single 2D RGB image. The task has attracted a lot of interest in the autonomous driving industry due to the prospect of reduced cost and increased modular redundancy. Reasoning about 3D from a single 2D input is highly challenging, and estimating vehicle orientation is one important step toward solving it.
In monocular 3D object detection, one important concept that keeps coming up in the literature is the difference between allocentric and egocentric orientation. I have recently explained the difference to several of my colleagues, so it seems a good time to write a short blog post about it.
Note that this post focuses on the concept of 3D orientation in autonomous driving. If you want to know state-of-the-art methods for orientation regression, please read my previous post on Multi-modal Target Regression, in particular the section on orientation regression.
Egocentric vs Allocentric
The concepts of egocentric and allocentric come from the field of human spatial cognition. In the context of perception for autonomous driving, however, they are quite specific: egocentric orientation means orientation relative to the camera, and allocentric orientation means orientation relative to the object (i.e., a vehicle other than the ego vehicle).
Egocentric orientation is sometimes referred to as the global orientation (or rotation around the Y-axis in KITTI, as mentioned below) of a vehicle, as the reference frame is the camera coordinate system of the ego vehicle and does not change as the object of interest moves from one vehicle to another. Allocentric orientation is sometimes referred to as the local orientation or observation angle, as the reference frame changes with the object of interest. For each object there is one allocentric coordinate system, and one axis of that coordinate system aligns with the ray from the camera to the object.
To illustrate this simple idea, the paper FQNet (Deep Fitting Degree Scoring Network for Monocular 3D Object Detection, CVPR 2019) has a great illustration.
In (a), the global orientation of the car always faces right, but the local orientation and appearance change as the car moves from left to right. In (b), the global orientations of the car differ, but both the local orientation in camera coordinates and the appearance remain unchanged.
It is easy to see that the appearance of an object in a monocular image depends only on the local orientation, so we can only regress the local orientation of the car from its appearance. Another great illustration is from the Deep3DBox paper (3D Bounding Box Estimation Using Deep Learning and Geometry, CVPR 2017).
The car in the cropped images rotates, while the car's direction in the 3D world is constant: it follows the straight lane lines. From the image patches on the left alone, it is almost impossible to tell the global orientation of the car; the context of the car within the entire image is critical for inferring it. The local orientation, on the other hand, can be fully recovered from the image patch alone.
Note that under KITTI's convention of assuming zero roll and zero pitch, orientation reduces to yaw alone. The two orientations above are thus also referred to as global yaw and local yaw.
Converting local to global yaw
To compute the global yaw from the local yaw, we need the direction of the ray between the camera and the object, which can be calculated from the object's location in the 2D image. The conversion is then a simple addition, as explained in the diagram below.
The angle of the ray direction can be obtained from a keypoint of the bounding box and the camera intrinsics (principal point and focal length). Note that there are different choices for the keypoint of a 2D bounding box. Some popular choices are:
- center of detector boxes (may be truncated)
- center of amodal boxes (with guessed extension for occluded or truncated object)
- projection of the 3D bounding box center on the image (can be obtained from lidar 3D bounding box ground truth)
- bottom center of 2D bounding box (which is often assumed to be on the ground)
The bottom line is that unless the vehicle is very close by, or severely truncated or occluded, the above methods yield angle estimates within about 1 to 2 degrees of each other.
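The addition described above can be sketched in a few lines of Python. Here the ray angle is computed from the keypoint's horizontal pixel coordinate and the intrinsics; the function name and the exact sign convention are my own assumptions (they follow KITTI's camera coordinates, where x points right and z points forward), so double-check against your dataset's convention.

```python
import numpy as np

def local_to_global_yaw(alpha, u, fx, cx):
    """Convert a local (allocentric) yaw to a global (egocentric) yaw.

    alpha: local yaw / observation angle in radians
    u:     horizontal pixel coordinate of the chosen keypoint
    fx:    focal length in pixels
    cx:    principal point x-coordinate in pixels
    """
    # Angle of the ray from the camera center through the keypoint.
    # (u - cx) / fx approximates x / z of the object in camera coordinates.
    theta_ray = np.arctan2(u - cx, fx)
    # Global yaw is local yaw plus the ray angle; wrap back to [-pi, pi].
    ry = alpha + theta_ray
    return (ry + np.pi) % (2 * np.pi) - np.pi
```

For a keypoint at the principal point the ray angle is zero, so local and global yaw coincide; the further the object sits from the image center, the larger the correction.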
What does KITTI say?
KITTI dataset’s 2D object detection ground truth provides two angles for each bounding box:
- alpha: Observation angle of object, ranging [-pi..pi]
- rotation_y: Rotation ry around Y-axis in camera coordinates [-pi..pi]
The above two angles correspond to the local (allocentric) yaw and the global (egocentric) yaw, respectively. These values appear to be derived from the lidar-based 3D bounding box ground truth, which makes it easy to perform angle estimation on the 2D images.
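Given the 3D location from the same label, the two angles are related by the ray angle toward the object, which offers a quick consistency check on the labels. A minimal sketch, assuming KITTI camera coordinates (x right, z forward); the function name is mine:

```python
import numpy as np

def rotation_y_to_alpha(rotation_y, x, z):
    """Recover the observation angle alpha from the global rotation_y
    and the object's 3D location (x, z) in camera coordinates."""
    # Subtract the ray angle toward the object's location.
    alpha = rotation_y - np.arctan2(x, z)
    # Wrap back into [-pi, pi] to match KITTI's label range.
    return (alpha + np.pi) % (2 * np.pi) - np.pi
```

Applying this to a KITTI label's rotation_y and location fields should reproduce its alpha field up to small rounding differences.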
KITTI has one official metric for orientation estimation: Average Orientation Similarity (AOS), a value between 0 and 1, where 1 represents a perfect prediction. I will not go into the details of the metric here, but it is quite similar in spirit to Average Precision, and the details can be found in the original KITTI paper.
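The per-detection similarity term inside AOS can be sketched as follows; AOS then averages this term over matched detections at each recall level, analogous to AP. The function name is my own:

```python
import numpy as np

def orientation_similarity(gt_yaw, pred_yaw):
    """Cosine-based similarity used inside AOS: 1 for a perfect match,
    0 for a prediction that is exactly opposite (off by pi)."""
    delta = np.asarray(gt_yaw) - np.asarray(pred_yaw)
    return (1.0 + np.cos(delta)) / 2.0
```

Note that the cosine makes the penalty symmetric and smooth: an error of pi/2 scores 0.5, and the metric is insensitive to 2*pi wrap-around, so no explicit angle wrapping is needed.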
There is another metric in the literature, popularized by 3D RCNN: Average Angular Error (AAE).
Takeaways
- It is possible to estimate the local (allocentric) orientation (yaw) from a local image patch.
- It is impossible to estimate the global (egocentric) orientation (yaw) from a local image patch.
- With the camera intrinsics (principal point, focal length) and the global position of the image patch within the full image, it is possible to convert the local orientation to the global orientation.
- The regression of viewpoint orientation is one of the hardest regression problems in deep learning. Refer to my previous post on Multi-modal Target Regression.
I will write a review of monocular 3D object detection soon. Stay tuned!