Comparison of 3D Detection Techniques

With the advancement of self-driving cars, we see a great push in the computer vision domain for applications like object detection (2D/3D), object tracking, 3D reconstruction, traffic behavior understanding, and 3D scene analysis. Here we will try to cover some of the latest advancements in one of these technologies: 3D detection. LiDAR and depth sensors give 3D information about the world around them, which can't be captured directly by a passive 2D camera setup. The techniques we discuss today try to detect objects in raw point clouds without any texture information, using only the position (x, y, z) of each point.

Previously, researchers did object detection with combined depth and RGB information. Then the question became: why not do it on point clouds alone, which would make the detections independent of the image? There are a few reasons for this:

  1. For self-driving cars there is no single compact sensor like the Kinect, which works indoors but lacks the required range.
  2. For that reason, researchers use two separate sensors, a camera and a LiDAR. When the car is moving and the LiDAR is rotating, the point cloud registered during one spin is distorted by the car's motion.
  3. To project the LiDAR depth information onto the camera image, the LiDAR needs to be precisely synchronized with the camera (the timestamps should be almost the same, and there should be a common clock for both sensors).
  4. Only after the motion distortion in the point cloud has been corrected, using additional sensors like GPS/IMU, can it be projected onto the camera image.

So, because of the above reasons, researchers asked: why not detect objects directly in the point cloud itself? (We have deep neural networks to do that.)


Researchers at Stanford [1] proposed PointNet, built on the idea of passing the point coordinates (x, y, z) directly to a deep neural network to segment out the object. It is nice, but it was only used for small objects with relatively few points, since we don't have supercomputers to process millions of points (a computational scalability issue). The next idea was to detect objects in the camera image first, select the points that fall within the resulting bounding box (the frustum), and pass only those points to PointNet to detect objects.

Frustum-based PointNets [1]
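To make the frustum idea concrete, here is a minimal sketch of the point selection step. It is not the paper's code: it assumes the points are already expressed in the camera frame (z pointing forward) and a simple pinhole camera. The function name `points_in_frustum` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical values chosen for illustration.

```python
def points_in_frustum(points, box, fx, fy, cx, cy):
    """Keep the 3D points whose image projection falls inside a 2D box.

    points: iterable of (x, y, z) in the camera frame (z points forward).
    box:    (u_min, v_min, u_max, v_max) in pixels, e.g. from a 2D detector.
    """
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points:
        if z <= 0.0:  # behind the camera, cannot be inside the frustum
            continue
        u = fx * x / z + cx  # pinhole projection to pixel coordinates
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


# Hypothetical intrinsics and detection box, for illustration only.
cloud = [
    (0.0, 0.0, 10.0),
    (5.0, 0.0, 10.0),
    (0.2, -0.1, 20.0),
    (0.0, 0.0, -3.0),
]
frustum_points = points_in_frustum(
    cloud, box=(300, 200, 340, 280), fx=700.0, fy=700.0, cx=320.0, cy=240.0
)
print(frustum_points)
```

Only the points that survive this filter would be handed to the point-based network, which is what keeps the input small.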

Nice, but it has its own issues, as described above, because it relies on the camera again. The novel part of this paper is that it introduced the concept of segmenting/detecting objects directly on unordered point sets. The paper [1] notes that a symmetric function can be used so that the output is the same no matter in which order the input points are given, and it uses max-pooling for that reason.

Max pooling (a symmetric function) for processing unordered point sets [1]
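A small sketch can show why max-pooling makes the result independent of point order. It assumes a single shared per-point layer with toy, untrained weights, which is far simpler than the actual PointNet. The names `pointwise_feature` and `global_feature` are made up for this example.

```python
import random


def pointwise_feature(point, weights):
    """Shared per-point layer: one linear map followed by ReLU."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Max-pool each feature channel over all points (a symmetric function)."""
    features = [pointwise_feature(p, weights) for p in points]
    return [max(channel) for channel in zip(*features)]


# Toy weights (assumed, not learned): 4 output channels for (x, y, z) inputs.
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, -0.5, 0.25],
]
points = [(1.0, 2.0, 0.5), (-1.0, 0.3, 2.0), (0.2, -0.7, 1.1)]

# Same points, different order.
shuffled = points[:]
random.Random(0).shuffle(shuffled)

print(global_feature(points, weights))
print(global_feature(shuffled, weights))
```

Both calls print the same vector, because taking a maximum over a set does not depend on the order in which its elements are visited.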


Apple Inc., which many believe is doing research on self-driving cars, published a paper on a 3D detection algorithm called VoxelNet [2]. Even though there were earlier papers on voxelizing and processing point clouds for detection, this was the first to do it end to end. A voxel is basically a cell of a 3D grid (a different data structure for representing a point cloud). In this paper each point in a voxel is represented by its coordinates together with its offset from the local mean (centroid) of the points in that voxel.

VoxelNet (block diagram) [2]
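The voxel grouping can be sketched in a few lines. This is an illustration only: it assumes the local reference in each voxel is the centroid (mean) of its points, it leaves out the reflectance channel, and it ignores the per-voxel point sampling a real pipeline would need. The helper names `voxelize` and `augment` and the voxel size are hypothetical.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group points by the integer index of the voxel that contains them."""
    voxels = defaultdict(list)
    for p in points:
        index = tuple(math.floor(c / s) for c, s in zip(p, voxel_size))
        voxels[index].append(p)
    return voxels


def augment(voxel_points):
    """Append each point's offset from the voxel centroid: (x, y, z, dx, dy, dz)."""
    n = len(voxel_points)
    centroid = [sum(axis) / n for axis in zip(*voxel_points)]
    return [
        tuple(p) + tuple(c - m for c, m in zip(p, centroid))
        for p in voxel_points
    ]


points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.4, 0.0), (-0.2, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
for index, members in sorted(voxels.items()):
    print(index, augment(members))
```

Only non-empty voxels end up in the dictionary, which matters for LiDAR data where most of the grid is empty.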

It uses the same max-pooling idea presented in PointNet. The novelty of this paper is the voxel feature encoding (VFE) layer, which learns local feature representations using element-wise max-pooling, and global feature representations through skip connections and convolutional middle layers.

Voxel Feature Encoding layer [2]
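As a rough sketch of what one such encoding step does for the points of a single voxel: apply a shared layer to every point, take the element-wise max over the points, and attach that voxel-wide summary back to each point's own feature. This follows my reading of the layer and is simplified (no batch normalization, toy untrained weights). The function name `vfe_layer` is hypothetical.

```python
def vfe_layer(voxel_points, weights):
    """One simplified VFE step for the points of a single voxel.

    voxel_points: list of per-point input vectors.
    weights:      rows of a shared linear layer (batch norm omitted here).
    Returns one output vector per point: [point-wise feature | voxel-wise max].
    """
    pointwise = [
        [max(0.0, sum(w * c for w, c in zip(row, p))) for row in weights]
        for p in voxel_points
    ]
    # Element-wise max over the points gives the locally aggregated feature.
    aggregated = [max(channel) for channel in zip(*pointwise)]
    # Each point keeps its own feature and also receives the voxel summary.
    return [feature + aggregated for feature in pointwise]


weights = [[1.0, 0.0, 0.0], [0.0, -1.0, 1.0]]  # toy, untrained weights
voxel_points = [(0.1, 0.2, 0.3), (0.4, 0.0, 0.1), (0.2, 0.5, 0.1)]
for row in vfe_layer(voxel_points, weights):
    print(row)
```

Every output row ends with the same aggregated values, so each point now carries information about its neighbours in the voxel.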

One thing I observed in this paper is that they also use a Mean Square Error loss function for the orientation of the bounding box, an angle in (-pi, pi). I kind of don't like this: it might work in many cases, but it is not a good loss function for orientation, as discussed below.
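A tiny worked example shows the problem with a squared error on the raw angle. Two headings just on either side of the -pi/pi boundary are almost identical on the circle, yet their raw values differ by nearly 2*pi. The wrapped variant below is only one illustrative alternative and is not taken from any of the papers discussed here. Both function names are made up.

```python
import math


def squared_angle_error(pred, target):
    """Squared error on the raw angle values, ignoring periodicity."""
    return (pred - target) ** 2


def wrapped_angle_error(pred, target):
    """Squared error after wrapping the difference into [-pi, pi]."""
    diff = math.atan2(math.sin(pred - target), math.cos(pred - target))
    return diff ** 2


target = math.pi - 0.05  # just below +pi
pred = -math.pi + 0.05   # just above -pi, only 0.1 rad away on the circle

print(squared_angle_error(pred, target))  # large, about (2*pi - 0.1)**2
print(wrapped_angle_error(pred, target))  # small, about 0.1**2
```

The raw squared error punishes a nearly correct prediction as if it were almost a full turn off, which is exactly the case a detector meets whenever an object's heading sits near the boundary.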

Complex YOLO:

Everyone knows YOLO (You Only Look Once) because of its fast performance. Complex-YOLO is from researchers at Valeo and Ilmenau University of Technology, not the same authors as YOLO for 2D detection. The algorithm follows the MV3D (multi-view 3D) approach, which is older than PointNet. It uses the YOLOv2 architecture, but the input image is built by writing maximum intensity, normalized density, and maximum height into the RGB channels, with the height discretized rather than continuous.

Complex-YOLO (uses the MV3D structure) [3]
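Here is a minimal sketch of how such a three-channel top-view map could be built from points with intensity. The grid extent, the cell size, and the density normalization `min(1, log(N + 1) / log(64))` are assumptions for illustration and may differ from the paper's exact settings. The sketch does not fix which feature goes to which color channel, and the function name `bev_map` is hypothetical.

```python
import math
from collections import defaultdict


def bev_map(points, cell_size, x_range, y_range):
    """Encode (x, y, z, intensity) points into per-cell top-view features."""
    cells = defaultdict(list)
    for x, y, z, intensity in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the region covered by the map
        col = int((x - x_range[0]) // cell_size)
        row = int((y - y_range[0]) // cell_size)
        cells[(row, col)].append((z, intensity))

    encoded = {}
    for cell, members in cells.items():
        encoded[cell] = {
            "max_height": max(z for z, _ in members),
            "max_intensity": max(i for _, i in members),
            # Assumed normalization: saturates at 63 points per cell.
            "density": min(1.0, math.log(len(members) + 1) / math.log(64)),
        }
    return encoded


points = [
    (1.0, 1.0, 0.2, 0.5),
    (1.1, 1.2, 1.4, 0.3),
    (5.0, -2.0, 0.1, 0.9),
    (99.0, 0.0, 0.0, 0.1),  # out of range, dropped
]
bev = bev_map(points, cell_size=0.5, x_range=(0.0, 40.0), y_range=(-20.0, 20.0))
for cell, channels in sorted(bev.items()):
    print(cell, channels)
```

Once the cloud is an image like this, a standard 2D detector backbone can be applied to it, which is where the speed comes from.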

Here the rotation is not represented as a single number from -pi to pi radians. Instead it is represented as a complex number, with the real and imaginary components of a vector, as shown below.

Euler Representation of Rotation

The loss function for the orientation then becomes:

Rotation Loss function

The paper [3] also mentions that using Mean Square Error (MSE) directly on the angle did not give better accuracy.
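The complex-number idea can be sketched as follows. This is a simplified illustration, not the paper's full loss: it compares two angles through their unit complex numbers and leaves out the weighting factor and the sums over grid cells and anchors. In a real network the real and imaginary parts would be regressed directly rather than computed from an angle. All function names are hypothetical.

```python
import math


def encode_angle(phi):
    """Represent an angle as the (real, imaginary) parts of exp(i * phi)."""
    return math.cos(phi), math.sin(phi)


def decode_angle(re, im):
    """Recover the angle in (-pi, pi] from the two regressed components."""
    return math.atan2(im, re)


def complex_angle_loss(pred, target):
    """Squared distance between two angles on the unit circle."""
    (p_re, p_im), (t_re, t_im) = encode_angle(pred), encode_angle(target)
    return (p_re - t_re) ** 2 + (p_im - t_im) ** 2


target = math.pi - 0.05
pred = -math.pi + 0.05  # 0.1 rad away from the target on the circle

print(complex_angle_loss(pred, target))
print(decode_angle(*encode_angle(pred)))
```

Because the comparison happens on the unit circle, the loss stays small for two headings on either side of the -pi/pi boundary, and the angle can still be recovered with `atan2` when it is needed.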


The comparison of 3D detection techniques presented in the Complex-YOLO paper [3], with Intersection over Union (IoU) = 0.5 and using an NVIDIA TX2, is shown below.

Comparison table [3]
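For readers new to the IoU threshold used in such tables, here is a minimal sketch for two axis-aligned boxes. This is a simplification: the detectors discussed here predict rotated boxes, whose overlap needs a polygon intersection. The function name `iou_axis_aligned` and the example boxes are made up.

```python
def iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 4.0, 2.0)
ground_truth = (1.0, 0.0, 5.0, 2.0)
score = iou_axis_aligned(predicted, ground_truth)
print(score, score >= 0.5)
```

With a threshold of 0.5, this prediction would count as a correct detection. A box shifted a little further would not.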

Among the LiDAR-only algorithms, Complex-YOLO runs at around 50 fps (about 12 times faster than VoxelNet) with accuracy close to VoxelNet. It does better than VoxelNet for pedestrians and cyclists. This comparison table is for the KITTI dataset, which was collected with a Velodyne HDL-64E LiDAR and has comparatively dense point clouds. When doing detection on a Velodyne HDL-32E, which is much sparser, pedestrians would be more challenging, as very few points would fall on them.

Disclaimer for the KITTI dataset: the Easy, Moderate, and Hard categories are defined using occlusion levels (along with bounding-box height and truncation).

Notable Recent Works:

Fast and Furious, CVPR 2018 [4]: detection, tracking, and motion forecasting are done end to end with a single network.


[1] Qi, Charles R., et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[2] Zhou, Yin, and Oncel Tuzel. “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection.” arXiv preprint arXiv:1711.06396 (2017).

[3] Simon, Martin, et al. “Complex-YOLO: Real-time 3D Object Detection on Point Clouds.” arXiv preprint arXiv:1803.06199 (2018).

[4] Luo, Wenjie, Bin Yang, and Raquel Urtasun. “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting With a Single Convolutional Net.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.

Source: Deep Learning on Medium