Original article was published by Sarim Mehdi on Deep Learning on Medium

As part of my master's thesis, I propose a new technique for predicting the trajectories of dynamic obstacles from an ego-centric perspective. Few previous attempts have addressed this setting. Most trajectory prediction work assumes a static background (for example, surveillance camera footage of a busy area), and many methods treat the task purely as a regression problem and end up with biased results. My algorithm takes as input either images from a calibrated stereo camera or data from a laser scanner, and outputs a heatmap that describes all possible future locations of any detected 3D object over the next few frames. This research has many applications, most notably for autonomous cars, which can make better driving decisions if they can anticipate where another moving object is going to be. The code is public and available here: https://github.com/sarimmehdi/master_thesis

# Method

Here I introduce the methodology step by step. First, I show an algorithm that recursively predicts 3D bounding boxes into the future. Next, I simplify the technique and display only a discretized potential field. This potential field is then converted to a heatmap (which is hot close to the object and gets colder farther away from it). Finally, the heatmap is used to predict the object’s future action (turn left, turn right, or go straight).

## 3D Bounding Box Propagation

In this approach, 3D bounding boxes are combined with semantic information to predict the future likely location of an object. The assumption here is that an object like a car, cyclist, or person will always continue in the direction they are facing. The algorithm is summarized in Figure 1.

Given the point cloud (obtained via laser scanner), a 3D object detector is applied to it to get the 3D bounding boxes. PointPillars was used as the detector because it offers a reasonable compromise between speed and accuracy. The point cloud can also be obtained directly from stereo images. Here, the Pseudo-LIDAR [3] approach is used, where the disparity map is converted to a depth map. MADNet [2] was ultimately selected for disparity estimation, again for its reasonable compromise between speed and accuracy.
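The disparity-to-depth step can be illustrated with the standard pinhole stereo relation, depth = focal length × baseline / disparity. The following is a minimal sketch, not the thesis code: the function names, the camera parameters, and the camera-frame axis convention (x right, y down, z forward) are assumptions made for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to a depth map (meters).

    Pixels with disparity <= min_disp get infinite depth (no valid match).
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

def depth_to_points(depth, focal_px, cx, cy):
    """Back-project a depth map into an (N, 3) point cloud in the camera frame.

    Assumed axes: x right, y down, z forward. (cx, cy) is the principal point.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth)
    z = depth[valid]
    x = (u[valid] - cx) * z / focal_px
    y = (v[valid] - cy) * z / focal_px
    return np.stack([x, y, z], axis=1)

# Illustrative values only (not taken from any dataset calibration).
disp = np.array([[10.0, 0.0], [20.0, 5.0]])
depth = disparity_to_depth(disp, focal_px=700.0, baseline_m=0.5)
cloud = depth_to_points(depth, focal_px=700.0, cx=1.0, cy=1.0)
```

The resulting point cloud can then be passed to the same 3D object detector that would otherwise consume laser scanner data.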

In the vehicle coordinate frame, an arc of 60 degrees is generated in front of the 3D object bounding box based on its rotation angle (yaw angle). 60 degrees makes sense because most vehicles can steer up to 30 degrees in both directions. The radius of the arc was kept at 1 meter. 10 points are picked on the boundary of the arc such that they are equidistant from each other. First, the farthest point is chosen as a likely future location for the 3D bounding box. The bounding box is redrawn on this point with the same dimensions as before but with the rotation recalculated according to:
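The arc construction can be sketched as follows. This is an illustrative sketch under stated assumptions: a 2D ground-plane frame, yaw measured counterclockwise from the x axis, and a hypothetical function name. It only generates the candidate positions and does not implement the rotation update or the ordering of candidates.

```python
import numpy as np

def arc_candidates(center_xy, yaw_rad, radius_m=1.0, arc_deg=60.0, n_points=10):
    """Return n_points equally spaced positions on an arc in front of the box.

    The arc spans arc_deg degrees, centered on the box yaw, at radius_m
    from the box center. Also returns the bearing of each candidate.
    """
    half = np.deg2rad(arc_deg) / 2.0
    bearings = np.linspace(yaw_rad - half, yaw_rad + half, n_points)
    offsets = radius_m * np.stack([np.cos(bearings), np.sin(bearings)], axis=1)
    points = np.asarray(center_xy, dtype=float) + offsets
    return points, bearings

# Example: a box at (5, 2) facing along the x axis.
points, bearings = arc_candidates(center_xy=(5.0, 2.0), yaw_rad=0.0)
```

With the defaults above, neighboring candidates are 60/9 ≈ 6.7 degrees apart, and the two end points sit at ±30 degrees from the current heading.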

It is then projected back to the image plane. The base of the bounding box must have an IoU of at least 90% with the road, pavement, and/or grass semantic labels. A neural net was used to get the panoptic segmentation [1]. If the IoU is lower, the next farthest point is chosen and the process is repeated until a likely future location for the bounding box is found.

The algorithm is recursive in the sense that the prediction process of sweeping an arc with respect to the current position of the bounding box is repeated. Only this time, the verified position is used as the new center and the arc is drawn with respect to it. Ideally, the recursion continues until none of the chosen points on the boundary of the arc is a valid prediction (they have a low IoU with the road, pavement, and/or grass semantic labels, i.e. they are not completely on those surfaces when projected back to the image plane). For cars, only the IoU with road pixels is computed. For cyclists, the IoU with both road and pavement pixels is computed. For pedestrians, the IoU is computed with the road, pavement, and grass pixels. In Figure 2, we can see that this algorithm can predict 3D bounding boxes reasonably far into the future.
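The class-specific surface check can be sketched as below. This is a hedged illustration rather than the thesis implementation: the label IDs and function names are hypothetical, and the IoU criterion is interpreted here as the fraction of the projected footprint's pixels that carry an allowed label, which is an assumption about how the overlap is measured.

```python
import numpy as np

# Hypothetical label IDs; a real segmentation network defines its own.
LABEL_IDS = {"road": 1, "pavement": 2, "grass": 3}

# Surfaces each object class may occupy, as described in the text.
ALLOWED_SURFACES = {
    "car": ("road",),
    "cyclist": ("road", "pavement"),
    "pedestrian": ("road", "pavement", "grass"),
}

def footprint_coverage(label_map, footprint_mask, object_class):
    """Fraction of footprint pixels lying on surfaces allowed for the class."""
    n_pixels = int(footprint_mask.sum())
    if n_pixels == 0:
        return 0.0  # footprint projects outside the image
    allowed_ids = [LABEL_IDS[name] for name in ALLOWED_SURFACES[object_class]]
    on_allowed = np.isin(label_map, allowed_ids) & footprint_mask
    return float(on_allowed.sum()) / n_pixels

def first_valid_candidate(label_map, candidate_masks, object_class, threshold=0.9):
    """Index of the first candidate footprint passing the threshold, else None."""
    for i, mask in enumerate(candidate_masks):
        if footprint_coverage(label_map, mask, object_class) >= threshold:
            return i
    return None
```

A return value of `None` from `first_valid_candidate` corresponds to the stopping condition of the recursion: no point on the arc is a valid prediction.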

This approach is good but it has two drawbacks. First, it is computationally expensive, as you recursively try to predict the bounding box as far as possible (or as far as you wish) until the road can no longer be observed in the image plane. One can also use all the points on the boundary of the arc, recursively predict the future bounding box from each of them, and obtain multiple future locations. However, this would not be feasible if real-time operation is the main focus. Second, the predictions are discrete: several 3D bounding boxes represented at predefined distances from the current position of the object. A much better approach would be to use a more continuous representation for the predictions.

## Potential Field

As in the previous approach, the bounding box is pushed forward in space up to a prescribed limit (2 meters in this case). Before, we only computed the IoU of the base of the new bounding box with the road pixels (obtained via segmentation) once that new bounding box was far enough away. Now, we check the IoU every 0.1 meters from the original position of the bounding box. We then rotate the bounding box to obtain a new heading angle, push it in this new direction, and again check the IoU every 0.1 meters. In this way, within an arc of 60 degrees (the limit we set here), we get several points that are viable candidates for the future location of the bounding box. These points make up the potential field. This approach can still be made more continuous and more visually appealing.
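The sampling scheme for the potential field can be sketched as a double loop over headings and distances. This is a minimal sketch under assumptions: the number of headings (`n_headings`) is not given in the text, the validity check is abstracted as a user-supplied callable `is_valid(x, y, heading)`, and every valid sample is kept even if a closer sample on the same ray was rejected.

```python
import numpy as np

def potential_field(center_xy, yaw_rad, is_valid,
                    max_range_m=2.0, step_m=0.1, arc_deg=60.0, n_headings=7):
    """Collect valid future positions as rows of (x, y, distance).

    is_valid(x, y, heading) stands in for the projection + segmentation check.
    """
    cx, cy = center_xy
    half = np.deg2rad(arc_deg) / 2.0
    headings = np.linspace(yaw_rad - half, yaw_rad + half, n_headings)
    n_steps = int(round(max_range_m / step_m))
    field = []
    for heading in headings:
        for k in range(1, n_steps + 1):
            d = k * step_m
            x = cx + d * np.cos(heading)
            y = cy + d * np.sin(heading)
            if is_valid(x, y, heading):
                field.append((x, y, d))
    if not field:
        return np.empty((0, 3))
    return np.asarray(field, dtype=float)

# Example with a stand-in validity check that accepts everything.
field = potential_field((0.0, 0.0), 0.0, is_valid=lambda x, y, h: True)
```

Setting `step_m=0.01` in the same function gives the denser sampling used for the heatmap in the next section.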

## Heatmap

Now, we convert the radial potential field into a heatmap that is ‘hot’ close to the object, gets ‘cooler’ with distance, and is eventually ‘cold’ at a certain limit from the object. To obtain this heatmap, we use the same approach as for the potential field but with a larger number of points: instead of checking the IoU every 0.1 meters, we check every 0.01 meters. As a result, in the image plane, we get several points that are good candidates for the future position of the detected object. These points, bunched together, form a heatmap that is color-coded by proximity to the object. The red portion of the heatmap is very close to the object and the green and blue portions are reasonably far away.
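One simple way to color-code points by distance is a piecewise-linear ramp from red through green to blue. The sketch below is an assumption about the colormap, since the text only states that red is near and green and blue are far; the function name and the 2 meter limit (taken from the potential field section) are illustrative.

```python
import numpy as np

def distance_to_rgb(distance_m, max_range_m=2.0):
    """Map distances to RGB in [0, 1]: red at 0, green at mid-range, blue at the limit."""
    t = np.clip(np.asarray(distance_m, dtype=float) / max_range_m, 0.0, 1.0)
    red = np.clip(1.0 - 2.0 * t, 0.0, 1.0)   # fades out over the first half
    blue = np.clip(2.0 * t - 1.0, 0.0, 1.0)  # fades in over the second half
    green = 1.0 - red - blue                  # peaks at mid-range
    return np.stack([red, green, blue], axis=-1)

colors = distance_to_rgb([0.0, 1.0, 2.0])
# rows: red (1, 0, 0), green (0, 1, 0), blue (0, 0, 1)
```

Each valid point can then be drawn in the image plane with the color for its distance from the object.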

## Short-term Trajectory Prediction

We can go a step further and utilize this heatmap to predict the future motion of the object. Specifically, we can predict whether an object will turn left, turn right, or go straight. To do this, we divide the heatmap cone into three parts (left, right, and forward) and record the size of the heatmap (determined by the number of acceptable points) in each part. The probability of each action is then the size of the heatmap in that part divided by the total size of the entire heatmap (Figure 5).
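The probability computation amounts to counting valid points per sector and normalizing. In this sketch, splitting the 60 degree cone into three equal 20 degree sectors and treating positive relative angles as "left" are assumptions; the function name is hypothetical.

```python
import numpy as np

def direction_probabilities(rel_angles_deg, half_width_deg=30.0):
    """Probability of left / straight / right from the valid heatmap points.

    rel_angles_deg: angle of each valid point relative to the object's
    heading, in degrees (assumed positive = left).
    """
    angles = np.asarray(rel_angles_deg, dtype=float)
    total = angles.size
    if total == 0:
        return {"left": 0.0, "straight": 0.0, "right": 0.0}
    edge = half_width_deg / 3.0  # straight sector spans [-edge, +edge]
    n_left = int(np.count_nonzero(angles > edge))
    n_right = int(np.count_nonzero(angles < -edge))
    n_straight = total - n_left - n_right
    return {
        "left": n_left / total,
        "straight": n_straight / total,
        "right": n_right / total,
    }

probs = direction_probabilities([-25, -15, -5, 0, 5, 20, 25, 28])
# left: 3/8, straight: 3/8, right: 2/8
```

A side of the cone that is blocked (for example by pavement, for a car) contributes fewer valid points, so its probability drops accordingly.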

## Speed and Steering Value Recommendation System Prototype

To show the potential usefulness of the presented approach, a simple recommendation system was also developed. It provides speed and steering recommendations based on the heatmap. Since actual speed and steering values were not provided for the KITTI dataset, some assumptions about the default values were made. Specifically, it is assumed that the car, by default, always goes straight and always moves at the maximum possible speed. To make a speed and steering recommendation, an arc is generated in front of the ego-car and its intersection with the heatmap is calculated. If, for example, the arc has a greater intersection with detected objects and their heatmaps on the left than on the right, then there is heavy traffic in that direction, and so the recommended steering is towards the right. The speed recommendation is based on the number of times the arc intersects with 3D bounding boxes and their heatmaps.
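The recommendation logic can be sketched roughly as follows. Everything numeric here is an assumption made for illustration: the arc radius and width, the ego frame (x forward, y left), the linear speed reduction, and the `points_for_full_stop` parameter are hypothetical and not taken from the thesis.

```python
import numpy as np

def recommend(points_xy, arc_radius_m=10.0, half_width_deg=30.0,
              max_speed=1.0, points_for_full_stop=100):
    """Recommend steering and speed from heatmap/box points in the ego frame.

    points_xy: (N, 2) array, x forward and y left (assumed convention).
    Returns (steering, speed), steering in {"left", "straight", "right"}.
    """
    pts = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    rng = np.hypot(pts[:, 0], pts[:, 1])
    ang = np.degrees(np.arctan2(pts[:, 1], pts[:, 0]))
    inside = (rng <= arc_radius_m) & (np.abs(ang) <= half_width_deg)
    n_left = int(np.count_nonzero(inside & (pts[:, 1] > 0)))
    n_right = int(np.count_nonzero(inside & (pts[:, 1] < 0)))
    if n_left > n_right:
        steering = "right"   # more traffic on the left, so steer away from it
    elif n_right > n_left:
        steering = "left"
    else:
        steering = "straight"  # default behavior assumed in the text
    n_inside = int(np.count_nonzero(inside))
    # Assumed linear slowdown: full speed with no intersections.
    speed = max_speed * max(0.0, 1.0 - n_inside / points_for_full_stop)
    return steering, speed

steering, speed = recommend([(5, 1), (5, 2), (5, -1), (20, 0), (1, 5)])
```

In this example, three points fall inside the arc (two on the left, one on the right), so the sketch recommends steering right with a slightly reduced speed.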

## Other results

Applying the 3D object detector on laser scanner data was less accurate. Despite the decrease in accuracy, because the panoptic segmentation is ultimately used to decide whether a proposal for the predicted future position is valid or not, an inaccurate bounding box doesn’t create that many problems. However, there were some situations in which the slight inaccuracy of the bounding box created problems. This usually happened on two-way narrow roads where both sides were separated by a barrier.

There were also cases where the panoptic segmentation didn’t perform well. This happened in the Waymo Open Dataset, specifically in nighttime scenarios. However, this problem can be alleviated by substituting a better neural network for panoptic segmentation.

The trajectory prediction pipeline is robust because, instead of end-to-end learning, separate problems are solved and their solutions are combined to get the final desired solution. In this case, the separate problems are 3D object detection and panoptic segmentation and also depth generation (if only stereo images are to be used). If a better neural net is available for any of the mentioned tasks, we can simply replace the current component in the pipeline with a new one. So, for example, if a better 3D object detector is available, we can replace our current detector in the pipeline with the new one. All previous solutions tried to do end-to-end learning and that suffers from the same kind of problems that plague supervised learning.

# Conclusions

In this research, a new method for trajectory prediction was presented which can work with or without a LIDAR. A heatmap of possible future positions is generated and used to calculate the probability of an object going left, right, or straight. A small prototype recommendation system was also presented that provides a speed and steering value based on the detected objects and their predicted trajectories. The obvious next step for this research is to experiment with different self-driving car algorithms and see how a participating agent can use such a heatmap to make better driving decisions.

# References

[1] Rui Hou et al. “Real-Time Panoptic Segmentation From Dense Detections”. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.

[2] Alessio Tonioni et al. “Real-time self-adaptive deep stereo”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.

[3] Yan Wang et al. “Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving”. In: CoRR abs/1812.07179 (2018). arXiv: 1812.07179. url: http://arxiv.org/abs/1812.07179.