[CV 2018 / Paper Summary] Complex-YOLO: An Euler-Region-Proposal for Real-time 3D Object Detection…

Source: Deep Learning on Medium

3D-Object Detection

Another great paper on 3D object detection that builds on an existing architecture while keeping it simple and efficient.

Please note that this post is for my future self, to look back on and review the material in this paper without reading it all over again.


Many companies in the automotive industry are interested in research on 3D data, which helps them estimate the depth of objects and supports planning. A few algorithms work directly on the 3D data, but they are computationally heavy and time consuming. In this paper the authors extend an existing object detection method, YOLOv2, with an additional Euler-Region-Proposal Network (E-RPN) to detect 3D objects from the point cloud. The key property of the E-RPN is that it regresses both an imaginary and a real part for the orientation; this avoids singularities and works in a closed complex space, which makes the angle estimate accurate. The authors run their experiments on the KITTI dataset, an open-source point cloud dataset with 8 object classes.


From the authors' point of view, 3D point clouds have an advantage over 2D images for automated driving, since the depth of a detected vehicle can be extracted, which helps in taking accurate decisions during navigation. In general, object detection and classification with deep learning is a well-known task, and most architectures use an RPN for efficient and accurate detection of objects. For 3D object detection there are currently 3 main approaches:

  1. Processing the point cloud directly using multi layer perceptron
  2. Translating the point cloud data into voxel regions and applying CNN
  3. A combination/fusion of approaches 1 and 2

1.2 Contribution

The main contribution is that the authors achieve 50 fps on an Nvidia Titan X on KITTI data, faster than the other approaches. They take the multi-view idea as inspiration but skip the multi-view fusion and generate one single birds-eye-view RGB map from the Lidar data. The E-RPN is the additional component that estimates the orientation of the objects from an imaginary and a real part, which helps with exact localisation of 3D objects and accurate class predictions. The whole approach processes the Lidar data in a single forward pass, which is computationally efficient.

2. Complex-YOLO Pipeline

The Complex-YOLO workflow consists of point cloud preprocessing, the Complex-YOLO architecture and a loss function, which together give it its accuracy and performance.

2.1 Point Cloud pre-processing

The 3D point cloud is converted into a birds-eye-view RGB map that covers an area of 80 m × 40 m. The three channels encode the height, intensity and density of the points in each cell; in the paper the G channel holds the maximum height, the B channel the maximum intensity and the R channel the normalised density. The grid map has a resolution of 1024×512, and the height range considered reaches up to about 3 m above the ground.

The authors use the calibration data from KITTI to define a mapping function that places each point into its grid cell of the RGB map; the channel values of each pixel are then computed from the Velodyne points that fall into that cell.
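To make the preprocessing step concrete for my future self, here is a minimal pure-Python sketch of how such a birds-eye-view map could be built. The function name `make_bev_map`, the point format `(x, y, z, intensity)` and the height normalisation are my own assumptions, not the authors' code; the ranges and the density formula follow my reading of the paper.

```python
import math

# Region of interest in metres (as I read the paper; treat as assumptions).
X_RANGE = (0.0, 40.0)    # forward
Y_RANGE = (-40.0, 40.0)  # left / right
Z_RANGE = (-2.0, 1.25)   # height relative to the sensor


def make_bev_map(points, rows=512, cols=1024):
    """Rasterise (x, y, z, intensity) points into three rows x cols grids.

    Returns (density, height, intensity), i.e. the R, G and B channels.
    Intensity is assumed to be in [0, 1] already.
    """
    height = [[0.0] * cols for _ in range(rows)]
    intensity = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]

    x_res = (X_RANGE[1] - X_RANGE[0]) / rows  # metres per cell, about 8 cm
    y_res = (Y_RANGE[1] - Y_RANGE[0]) / cols
    z_span = Z_RANGE[1] - Z_RANGE[0]

    for x, y, z, inten in points:
        inside = (X_RANGE[0] <= x < X_RANGE[1]
                  and Y_RANGE[0] <= y < Y_RANGE[1]
                  and Z_RANGE[0] <= z <= Z_RANGE[1])
        if not inside:
            continue  # drop points outside the region of interest
        r = int((x - X_RANGE[0]) / x_res)
        c = int((y - Y_RANGE[0]) / y_res)
        # Keep the maximum (normalised) height and intensity per cell.
        height[r][c] = max(height[r][c], (z - Z_RANGE[0]) / z_span)
        intensity[r][c] = max(intensity[r][c], inten)
        counts[r][c] += 1

    # Normalised density min(1, log(N + 1) / 64); some implementations
    # divide by log(64) instead.
    density = [[min(1.0, math.log(n + 1) / 64.0) for n in row]
               for row in counts]
    return density, height, intensity
```

Calling `make_bev_map(points)` on one Velodyne scan gives three 512×1024 grids that can be stacked into the image fed to the network.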

2.2 Architecture

The input to the Complex-YOLO architecture is the BEV image, which can be processed like a normal RGB image by a YOLOv2-style network. A complex angle regression, the E-RPN, is added on top to detect multi-class oriented 3D objects.


The authors' E-RPN estimates each 3D object's position, class probabilities, objectness and orientation. To obtain a proper orientation, they modify the commonly used grid-based RPN approach by adding a complex angle. For each grid cell the network predicts 5 boxes, each with an objectness score and class scores, which adds up to 75 features per cell.

ERPN values
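The 75 features per grid cell work out as 5 boxes × (6 regression parameters + 1 objectness score + 8 class scores). A small sketch of how one cell's feature vector could be split up is below; the ordering of the values inside the vector and the helper name `split_cell_features` are assumptions on my side, since the summary above does not fix a memory layout.

```python
NUM_ANCHORS = 5      # boxes predicted per grid cell
NUM_BOX_PARAMS = 6   # t_x, t_y, t_w, t_l, t_im, t_re
NUM_CLASSES = 8      # KITTI object classes


def split_cell_features(features):
    """Split one cell's flat feature vector into per-box predictions.

    Assumed layout (hypothetical): for each box, 6 regression
    parameters, then 1 objectness score, then 8 class scores.
    """
    per_box = NUM_BOX_PARAMS + 1 + NUM_CLASSES  # 15 values per box
    if len(features) != NUM_ANCHORS * per_box:
        raise ValueError("expected %d features" % (NUM_ANCHORS * per_box))
    boxes = []
    for a in range(NUM_ANCHORS):
        chunk = features[a * per_box:(a + 1) * per_box]
        boxes.append({
            "box": chunk[:NUM_BOX_PARAMS],
            "objectness": chunk[NUM_BOX_PARAMS],
            "classes": chunk[NUM_BOX_PARAMS + 1:],
        })
    return boxes
```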

Anchor Box Design

The authors use 3 different sizes and 2 angle directions as priors, based on the distribution of boxes in the KITTI dataset.

Anchor boxes

Complex Angle Regression

The orientation of each 3D object is computed from the regression parameters t_im and t_re, which correspond to the phase of a complex number. The angle is computed as arctan2(t_im, t_re), which avoids singularities and yields a closed mathematical space; this helps the model generalise. The regression parameters are passed directly into the loss function.

3D bbox regressor
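As a reminder of how the regression outputs turn into a box, here is a small sketch of the decoding step. It assumes the YOLOv2-style form b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_l = p_l·e^(t_l) plus the angle b_φ = arctan2(t_im, t_re), as I understand the paper; the function name `decode_box` and the argument layout are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell_xy, prior_wl):
    """Turn raw regression outputs into a box in grid units.

    t        : (t_x, t_y, t_w, t_l, t_im, t_re) from the network
    cell_xy  : (c_x, c_y) offset of the grid cell
    prior_wl : (p_w, p_l) width and length of the anchor box
    """
    t_x, t_y, t_w, t_l, t_im, t_re = t
    c_x, c_y = cell_xy
    p_w, p_l = prior_wl
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_l = p_l * math.exp(t_l)
    # atan2 uses the signs of both parts, so every heading in (-pi, pi]
    # is reachable and t_re == 0 does not cause a division by zero.
    b_phi = math.atan2(t_im, t_re)
    return b_x, b_y, b_w, b_l, b_phi
```

For example, `decode_box((0, 0, 0, 0, 1, 0), (3, 4), (2.0, 4.0))` places the box centre in the middle of cell (3, 4), keeps the prior size and gives a heading of π/2.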

2.3 Loss Function

The authors' loss function builds on YOLOv2, which defines the loss as a multi-part sum of squared errors; the Euler angle regression loss is added to it as a further term. The Euler regression loss is the squared difference between the ground-truth and predicted complex numbers, both of which are assumed to lie on the unit circle. The IoU between ground truth and prediction also takes the angle into account and is computed with 2D polygon geometry.

Loss Function
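The Euler part of the loss can be sketched in a few lines. With both complex numbers on the unit circle, |e^(iφ) − e^(iφ̂)|² equals (t_im − t̂_im)² + (t_re − t̂_re)², summed over the boxes that are responsible for an object. The names `euler_loss` and `angle_to_target` and the default weight `lambda_coord=5.0` are placeholders of mine, not values taken from the paper.

```python
import math


def angle_to_target(phi):
    """Ground-truth angle as (imaginary, real) parts on the unit circle."""
    return math.sin(phi), math.cos(phi)


def euler_loss(pred, target, obj_mask, lambda_coord=5.0):
    """Sum of squared errors on the imaginary and real parts.

    pred, target : lists of (t_im, t_re) pairs, one per box
    obj_mask     : list of 0/1 flags, 1 where a box is responsible
                   for a ground-truth object
    lambda_coord : weighting factor (hypothetical default)
    """
    total = 0.0
    for (p_im, p_re), (g_im, g_re), responsible in zip(pred, target, obj_mask):
        if responsible:
            total += (p_im - g_im) ** 2 + (p_re - g_re) ** 2
    return lambda_coord * total
```

In the full loss this term is simply added to the YOLOv2 loss terms.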

2.4 Efficiency Design

The authors attribute the main advantages of the network to the following:

  1. The E-RPN is part of the YOLO network
  2. End-to-end training, as it is a single network
  3. Prediction of bounding boxes happens in one single inference pass
  4. Lower runtime compared with other models that use an RPN
Design efficiency

3. Training & Experiments

The authors use the KITTI dataset for training and evaluation. Its object detection benchmark is divided into 3 subcategories for cars, pedestrians and cyclists:

  1. 2D object detection
  2. 3D object detection
  3. Birds eye view

The authors trained the model from scratch with the following training details:

  1. Optimizer: SGD
  2. Weight decay: 0.0005
  3. Momentum: 0.9
  4. Training set: 85% of the data
  5. Validation set: 15% of the data
  6. Epochs: up to 1000
  7. Learning rate: small at the beginning, scaled up after some epochs, then gradually decreased
  8. Activation function: Leaky ReLU
  9. Regularization: Batch Norm

Evaluation Details:

  1. IoU threshold: 0.7 for Car, 0.5 for Pedestrian and Cyclist
  2. Metric: Average Precision (AP)
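Since the boxes are rotated, the IoU behind these thresholds cannot be computed with the usual axis-aligned formula. The paper only says that 2D polygon geometry is used; the sketch below swaps in Sutherland-Hodgman clipping, which is one standard way to intersect two convex polygons, so treat it as an illustration and not as the authors' implementation. The box format `(cx, cy, w, l, phi)` is an assumption.

```python
import math


def box_corners(cx, cy, w, l, phi):
    """Corners of a rotated rectangle in counter-clockwise order."""
    c, s = math.cos(phi), math.sin(phi)
    half = [(l / 2, w / 2), (-l / 2, w / 2), (-l / 2, -w / 2), (l / 2, -w / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def polygon_area(poly):
    """Shoelace formula."""
    area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def clip(subject, clipper):
    """Clip convex `subject` against convex counter-clockwise `clipper`."""
    output = subject
    for i in range(len(clipper)):
        ax, ay = clipper[i]
        bx, by = clipper[(i + 1) % len(clipper)]

        def side(p):
            # > 0 if p lies to the left of edge a -> b (inside for CCW).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where edge p -> q crosses the line
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
        if not output:
            return []
    return output


def rotated_iou(box_a, box_b):
    """IoU of two rotated boxes given as (cx, cy, w, l, phi)."""
    pa, pb = box_corners(*box_a), box_corners(*box_b)
    inter_poly = clip(pa, pb)
    inter = polygon_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union
```

A car prediction would then count as a match if `rotated_iou(pred, gt) >= 0.7`, and a pedestrian or cyclist prediction if it is at least 0.5.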

Results :

1. Birds-eye view: Complex-YOLO achieves good results compared with other state-of-the-art algorithms; the positive point is that it reaches a higher FPS on a Titan X GPU than the other architectures.
Performance Results of Birds eye view

2. 3D object detection: The table below compares the 3D object detection results with other methods. The authors use a predefined spatial height for the objects, similar to MV3D.

Performance Comparison of 3D object detection

3. Performance comparison

Performance plot

From the above plot it is evident that Complex-YOLO is on par with other SOTA methods on a Titan X/Xp, but on the TX2 board it still needs a few optimisations to increase performance.


In conclusion, the authors propose an efficient real-time deep learning model for 3D object detection on Lidar point clouds, which reaches 50 fps on an NVIDIA Titan X and 4 fps on a TX2 thanks to the novel E-RPN approach. In the future the authors plan to work on height estimation of the objects and to improve class distinction through better point cloud pre-processing.

Final Words ….

The architecture is very cool for 3D object detection since it can process data at 50 fps, which matters a lot in ADAS-based systems.

If you find any errors, please mail me at abhigoku10@gmail.com…


  1. Redmon, J.: Darknet: Open source neural networks in C. http://pjreddie.com/darknet/ (2013–2016)
  2. Chen, X., Kundu, K., Zhu, Y., Berneshawi, A., Ma, H., Fidler, S., Urtasun, R.: 3D object proposals for accurate object class detection. In: NIPS. (2015)
  3. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016)
  4. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. CoRR abs/1611.07759 (2016)