Behavior Prediction and Decision Making in Self-Driving Cars Using Deep Learning

Original article was published on Deep Learning on Medium

In this post, I want to talk about different approaches for motion prediction and decision making using Machine Learning and Deep Learning (DL) in self-driving cars (SDCs). I tried to select works and papers from interesting companies and startups to cover different approaches in this area. This is a summary of my talk at the Aalto University robotics seminar series.

In general, we can consider the following software stack for self-driving cars:


In this stack, the Perception module takes sensor data and performs tasks such as object detection, traffic light detection, traffic light state detection, and localization. The Behavior Prediction module then takes this information and tries to predict the future trajectories of the other agents in the scene. Next, the Planner module uses these predicted trajectories in its decision-making procedure; it can also use some information directly from the Perception module. Finally, the Controller takes the trajectory generated by the Planner and generates control commands such as throttle and steering. Other information, such as HDMap data, is also available to whichever modules need it, depending on the techniques they use.
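To make the data flow between the modules concrete, here is a minimal Python sketch of the stack described above. Everything in it is a hypothetical placeholder of my own (the `Agent` type, the constant-velocity `predict`, the straight-line `plan`, and the `control` function that simply returns full throttle and a heading angle); real systems use far richer models in each box.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Agent:
    """A tracked object as a (hypothetical) Perception module might report it."""
    position: Point  # metres
    velocity: Point  # metres per second


def predict(agents: List[Agent], horizon: int, dt: float) -> List[List[Point]]:
    """Behavior Prediction placeholder: constant-velocity extrapolation."""
    return [
        [(a.position[0] + a.velocity[0] * dt * k,
          a.position[1] + a.velocity[1] * dt * k) for k in range(1, horizon + 1)]
        for a in agents
    ]


def plan(ego: Point, goal: Point, predictions: List[List[Point]],
         horizon: int, safe_dist: float = 2.0) -> List[Point]:
    """Planner placeholder: straight line to the goal, cut off just before the
    first waypoint that comes within safe_dist of a predicted agent position
    at the same time step."""
    path: List[Point] = []
    for k in range(1, horizon + 1):
        t = k / horizon
        wp = (ego[0] + (goal[0] - ego[0]) * t, ego[1] + (goal[1] - ego[1]) * t)
        for traj in predictions:
            ax, ay = traj[k - 1]
            if math.hypot(wp[0] - ax, wp[1] - ay) < safe_dist:
                return path  # stop before the conflict
        path.append(wp)
    return path


def control(ego: Point, path: List[Point]) -> Tuple[float, float]:
    """Controller placeholder: (throttle, steering angle) toward the first waypoint."""
    if not path:
        return 0.0, 0.0  # nothing safe to follow: no throttle
    dx, dy = path[0][0] - ego[0], path[0][1] - ego[1]
    return 1.0, math.atan2(dy, dx)


# Perception output (made up): one agent crossing the ego car's path.
agents = [Agent(position=(6.0, 3.0), velocity=(0.0, -1.0))]
predictions = predict(agents, horizon=5, dt=1.0)
path = plan((0.0, 0.0), (10.0, 0.0), predictions, horizon=5)
throttle, steering = control((0.0, 0.0), path)
```

The point is only the interface: each module consumes the output of the previous one, which is also why any of them can be swapped for a learned model.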

Each of these modules can use ML separately, or several modules can be combined into a single ML model that performs several tasks together. We will look at examples of both in the rest of this post.

Behavior Prediction

The first approach is to use ML and DL in the Behavior Prediction module.

Uncertainty-aware Short-term Motion Prediction of Traffic Actors for Autonomous Driving

This work is from Uber and predicts the motion of other agents in the scene. The technique applies to the Behavior Prediction module only. They take the information from the Perception module and HDMap and render it into a single RGB image to create a bird's-eye view (BEV) image. They also render the motion history of the agents on the image to include temporal information, so that the network can extract motion cues for prediction. This type of input representation is becoming popular because it makes it easy to combine several sources of information and use CNNs to extract features. Rendering the history on a single image is not the only way to capture temporal information: we can also stack several frames from sequential time-steps, or use CNNs to extract features from each frame and then use LSTMs to extract temporal information. In summary, the inputs and outputs of their model are as follows:

  • Input: BEV image + (velocity, acceleration, heading change rate)
  • Output: (x, y, std) for each point in the trajectory
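To see why predicting a standard deviation next to each (x, y) point is useful, here is a small sketch of a Gaussian negative log-likelihood loss. This is my own illustration and an assumption about how such an output could be trained; the exact loss used in the paper may differ. The function name `gaussian_nll` and the isotropic 2D Gaussian are choices made for this example.

```python
import math
from typing import Sequence, Tuple


def gaussian_nll(pred: Sequence[Tuple[float, float, float]],
                 target: Sequence[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood of the target waypoints under isotropic
    2D Gaussians. Each prediction is (x, y, std)."""
    total = 0.0
    for (px, py, std), (tx, ty) in zip(pred, target):
        sq_err = (px - tx) ** 2 + (py - ty) ** 2
        # -log N(t | p, std^2 * I) in 2D = log(2*pi*std^2) + sq_err / (2*std^2)
        total += math.log(2 * math.pi * std ** 2) + sq_err / (2 * std ** 2)
    return total / len(pred)


target = [(0.0, 0.0)]
# Same 2 m error, but the second prediction admits it is uncertain.
confident_and_wrong = gaussian_nll([(2.0, 0.0, 1.0)], target)
uncertain_and_wrong = gaussian_nll([(2.0, 0.0, 2.0)], target)
# Perfect prediction: here the smaller std wins.
confident_and_right = gaussian_nll([(0.0, 0.0, 1.0)], target)
uncertain_and_right = gaussian_nll([(0.0, 0.0, 2.0)], target)
```

With a loss like this, the network is rewarded for reporting a large std when its point estimate is likely to be off and a small std when it is accurate, which is what makes the output uncertainty-aware.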

For trajectory prediction, it is also possible to use LSTM networks to generate waypoints in the trajectory sequentially.

Multimodal trajectory predictions for autonomous driving using deep convolutional networks

This work extends the previous one to predict multi-modal future trajectories. The network is almost the same, except that the final layer predicts M trajectories along with their probabilities.
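One way to train a head that outputs M trajectories with probabilities is a winner-takes-all loss: find the mode closest to the ground truth, push its probability up, and regress only that mode. The sketch below is an assumption on my side for illustration, not a confirmed reproduction of the paper's loss; the helper names and the weight `alpha` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]


def avg_displacement(a: Trajectory, b: Trajectory) -> float:
    """Average Euclidean distance between matching waypoints."""
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


def multimodal_loss(modes: Sequence[Trajectory], logits: Sequence[float],
                    target: Trajectory, alpha: float = 1.0) -> Tuple[int, float]:
    """Return (index of the best mode, loss).
    loss = cross-entropy on the best mode + alpha * its displacement error."""
    errors = [avg_displacement(m, target) for m in modes]
    best = min(range(len(modes)), key=errors.__getitem__)
    classification = -math.log(softmax(logits)[best])
    return best, classification + alpha * errors[best]


straight = [(1.0, 0.0), (2.0, 0.0)]
left_turn = [(1.0, 1.0), (2.0, 2.0)]
ground_truth = [(1.0, 0.1), (2.0, 0.1)]

best, loss_good = multimodal_loss([straight, left_turn], [2.0, 0.0], ground_truth)
_, loss_bad = multimodal_loss([straight, left_turn], [0.0, 2.0], ground_truth)
```

Here the straight mode is the closest one, so the loss is lower when the network assigns it the higher probability.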



The next approach is to use ML and DL in the Planner module.

Path Planning using Reinforcement Learning and Objective Data

This is a master's thesis from Chalmers University of Technology that uses an Option-Critic architecture for the Planner module. It can be categorized as Hierarchical Reinforcement Learning. The idea is to have two levels in the Planner module. The higher level is responsible for selecting high-level behaviors such as following the lane, stopping, or turning left/right. The lower-level planner then takes these commands and executes them. They used Q-learning to learn the high-level policy and DDPG for the lower-level planner. It is also possible to use PID controllers for the low-level policy and use ML and RL only for the high-level policy. The Option-Critic architecture is as follows:


The architecture used in this thesis is as follows:


Let’s review an example. Consider an intersection with two cars: the ego car, which is autonomous, and another car, which is not. I selected the following figure from Voyage Open Autonomous Safety:


Suppose the high-level policy is responsible for selecting one of two options: yield or drive. The lower-level policy can be ML-based or a PID controller. The black car is autonomous and the white car is not. At state A, the autonomous car sees that the white car is still some distance from the intersection, so it decides to drive. At state B, it sees that the white car is entering the intersection, so it needs to decide whether it is safer to yield or to drive. It decides to yield and lets the white car pass at state C. After the white car has passed, it drives on to its final destination.

This option selection is similar to a state machine in rule-based approaches, but it uses ML and RL to do the task.
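The yield/drive example can be turned into a tiny tabular Q-learning sketch for the high-level policy. This is a toy under my own assumptions, not the thesis setup: the state labels, the reward values in `toy_reward`, and the one-step episodes are all made up, and the low-level policy that would execute each option is left out.

```python
import random
from collections import defaultdict

OPTIONS = ("yield", "drive")


class HighLevelPolicy:
    """Tabular Q-learning over options with epsilon-greedy exploration."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # (state, option) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def select(self, state, greedy=False):
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.choice(OPTIONS)  # explore
        return max(OPTIONS, key=lambda o: self.q[(state, o)])  # exploit

    def update(self, state, option, reward, next_state, done):
        best_next = 0.0 if done else max(self.q[(next_state, o)] for o in OPTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, option)] += self.alpha * (target - self.q[(state, option)])


def toy_reward(state, option):
    """Made-up rewards: driving into an occupied intersection is heavily
    penalised, yielding for no reason costs a little time."""
    if state == "other_car_entering":
        return -10.0 if option == "drive" else 0.0
    return 1.0 if option == "drive" else -1.0


policy = HighLevelPolicy()
for _ in range(200):
    for state in ("other_car_far", "other_car_entering"):
        option = policy.select(state)
        policy.update(state, option, toy_reward(state, option),
                      next_state=None, done=True)

decision_at_a = policy.select("other_car_far", greedy=True)
decision_at_b = policy.select("other_car_entering", greedy=True)
```

After training, the greedy policy drives at state A and yields at state B, which is the same behavior a hand-written state machine would encode, except that it was learned from rewards.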

Behavior Prediction + Planner or Mid-to-Mid Driving

The next approach is to combine Behavior Prediction and Planner modules.



The next work is from Waymo. They tried to do prediction and planning together with a single neural network trained using Imitation Learning (IL).

They decided to use mid-level information from the Perception module and HDMap to create BEV images as the input for their model. The different inputs they used are shown here:


It is easy to augment this type of representation and create fake data for corner cases such as collisions and going off the road. Here is one example of creating a fake trajectory to teach the car to come back to the road when it is going off the road:


Using this augmented data, their model is able to handle these cases. It also learns to nudge around a parked car, as you can see in the following gif:


They collected 30 million real-world expert driving examples, corresponding to about 60 days of continual driving. Even this amount of data is not enough to learn to drive using pure IL, so they used some techniques to improve the performance of their model:

  • Synthesize data, which is easy with this representation: add perturbations to the expert’s driving to create collisions and/or off-road cases
  • Augment the imitation loss with losses that discourage bad behavior and encourage progress
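The perturbation idea in the first bullet can be sketched in a few lines. This is a simplified illustration under my own assumptions (the trajectory runs roughly along the x-axis, "lateral" means the y direction, and the bump shape is a squared sine); the function name `perturb_trajectory` is hypothetical and the original work generates its perturbations differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def perturb_trajectory(traj: List[Point], max_offset: float) -> List[Point]:
    """Push the middle of an expert trajectory sideways by up to max_offset
    metres while keeping the start and end points (almost exactly) fixed."""
    n = len(traj)
    perturbed = []
    for i, (x, y) in enumerate(traj):
        # Smooth bump: 0 at both ends, 1 in the middle of the trajectory.
        weight = math.sin(math.pi * i / (n - 1)) ** 2 if n > 1 else 0.0
        perturbed.append((x, y + max_offset * weight))
    return perturbed


expert = [(float(i), 0.0) for i in range(5)]  # straight driving along x
shifted = perturb_trajectory(expert, max_offset=1.0)
```

A shifted trajectory like this can drift toward the road edge or another object, and a training example built from it shows the network a situation that the expert data alone rarely contains.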

Their model architecture is as follows:


The Road Mask Net is responsible for predicting the road mask, which forces the Feature Net to learn the road mask concept. They also used the Perception RNN to predict the future motion of other agents, which again forces the Feature Net to learn this concept. Their model is a multi-task network that tries to learn a better representation of the scene by training on multiple tasks. You can see the different loss terms they used in the figure.

Here are some other gifs of their model’s performance:

Perception + Behavior Prediction

The next approach is to do some perception tasks and behavior prediction together using one single neural network.

Fast and Furious


This work is from Uber ATG and the University of Toronto. They perform detection, tracking, and short-term motion forecasting of the objects’ trajectories using raw point cloud data. The network does these three tasks simultaneously in as little as 30 ms. The model predicts trajectories only 1 s into the future. Because the three tasks are learned together, in a form of multi-task learning, each task can use the knowledge of the other tasks to perform better.

The input for this model is a BEV created from point cloud data like this:

  • Quantize the 3D world to form a 3D voxel grid
  • Assign a binary indicator for each voxel encoding whether the voxel is occupied
  • Treat height as the third dimension, like the channels of an RGB image
  • Treat time as the 4th dimension
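The four steps above can be sketched in plain Python. The grid extent, the voxel size, and the function name `voxelize` are assumptions chosen for this example; a real implementation would use array libraries and much larger grids.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def voxelize(points: Sequence[Point3D], lo: Point3D, hi: Point3D,
             voxel: float) -> List[List[List[int]]]:
    """Binary occupancy grid indexed as grid[z][y][x] (height comes first,
    like the channel axis of an image). Points outside [lo, hi) are dropped."""
    dims = [int(round((h - l) / voxel)) for l, h in zip(lo, hi)]  # nx, ny, nz
    nx, ny, nz = dims
    grid = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for p in points:
        idx = [int((c - l) // voxel) for c, l in zip(p, lo)]
        if all(0 <= i < d for i, d in zip(idx, dims)):
            grid[idx[2]][idx[1]][idx[0]] = 1  # voxel is occupied
    return grid


lo, hi = (0.0, 0.0, 0.0), (4.0, 4.0, 2.0)
sweep_t0 = [(0.5, 0.5, 0.5), (3.9, 1.2, 1.5), (10.0, 0.0, 0.0)]  # last one is out of range
sweep_t1 = [(1.5, 0.5, 0.5)]

# Time becomes the 4th dimension: tensor[t][z][y][x].
tensor = [voxelize(sweep, lo, hi, voxel=1.0) for sweep in (sweep_t0, sweep_t1)]
occupied_t0 = sum(v for plane in tensor[0] for row in plane for v in row)
```

In this toy case the tensor has shape 2 × 2 × 4 × 4 (time, height, y, x), and the first sweep marks two voxels as occupied.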

The model is a single-stage detector which takes the created 4D input tensor and regresses directly to object bounding boxes at different timestamps without using region proposals.

They propose two fusion versions to exploit the temporal dimension:

1- Early fusion

  • Aggregates temporal information at the very first layer.
  • As fast as a single-frame detector.
  • Lacks the ability to capture complex temporal features as this is equivalent to producing a single point cloud from all frames, but weighting the contribution of the different timestamps differently.
  • Uses 1D convolution with kernel size n on temporal dimension to reduce the temporal dimension from n to 1

2- Late fusion

  • Gradually merges the temporal information. This allows the model to capture high-level motion features.
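Early fusion is easy to picture in code: a 1D convolution with kernel size n over a temporal dimension of length n is just a weighted sum of the n frames at every grid cell. The sketch below uses fixed, made-up weights on 2D frames for readability; in the real network the weights are learned and the frames are voxel grids.

```python
from typing import List, Sequence

Frame = List[List[float]]  # frame[y][x]


def early_fusion(frames: Sequence[Frame], weights: Sequence[float]) -> Frame:
    """Collapse n frames into one: fused[y][x] = sum over t of weights[t] * frames[t][y][x]."""
    if len(frames) != len(weights):
        raise ValueError("need exactly one weight per frame")
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(w * f[y][x] for w, f in zip(weights, frames)) for x in range(width)]
        for y in range(height)
    ]


# Three 1x2 occupancy frames: an object moving from the left cell to the right cell.
frames = [[[1.0, 0.0]], [[1.0, 1.0]], [[0.0, 1.0]]]
fused = early_fusion(frames, weights=[1.0, 2.0, 4.0])  # newer frames weigh more
```

Because the temporal dimension disappears after the first layer, the rest of the network costs the same as a single-frame detector. Late fusion keeps the temporal dimension alive for several layers instead, which is more expensive but lets the network build higher-level motion features.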

Similar to SSD, they use multiple predefined boxes for each feature map location. There are two branches after the feature map computed above:

  • One for binary classification to predict the probability of being a vehicle for each pre-allocated box.
  • One to predict (regress) the bounding box over the current frame as well as n − 1 frames into the future → size and heading

Here are some examples of their results:


IntentNet: Learning to predict intention from raw sensor data

The next work is again from Uber ATG and the University of Toronto. It is an extension of the previous work. Here, in addition to the BEV generated from the point cloud, they use HDMap info and fuse the information extracted from both the point cloud and the HDMap to do detection, intention prediction, and trajectory prediction.

The inputs and outputs of the network are as follows:

Inputs:

1- Voxelized LiDAR in BEV → height and time are stacked in the channel dimension → use 2D conv

2- Rasterized Map (both static and dynamic info) → 17 binary masks used as map features

Outputs:

1- Detected objects

2- Trajectory

3- High-level discrete intention: multi-class classification with 8 classes: keep lane, turn left, turn right, left change lane, right change lane, stopping/stopped, parked, other
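Stacking height and time in the channel dimension, as mentioned for the LiDAR input above, means flattening a (time, height, y, x) tensor into a (time × height, y, x) image so that ordinary 2D convolutions can process it. Here is a tiny sketch; the function name and the channel ordering c = t * Z + z are my own assumptions.

```python
from typing import List

Plane = List[List[int]]  # plane[y][x]


def stack_height_and_time(tensor: List[List[Plane]]) -> List[Plane]:
    """tensor[t][z][y][x] -> image[c][y][x] with c = t * Z + z."""
    return [plane for frame in tensor for plane in frame]


# Toy tensor with T=2 sweeps and Z=3 height slices; each 1x1 plane stores 10*t + z
# so that we can see where it ends up.
T, Z = 2, 3
tensor = [[[[10 * t + z]] for z in range(Z)] for t in range(T)]
image = stack_height_and_time(tensor)
```

The result has T × Z = 6 channels, and the slice for sweep t = 1, height z = 2 lands in channel 1 * 3 + 2 = 5.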

The network architecture is as follows:


As you can see in the image, two branches process the BEV and HDMap data, their features are then fused, and finally three heads handle the three tasks.

Here is an example of their results:



The final approach is to use a single neural network for all tasks, taking raw sensor data as input and generating control commands. This is called End-to-End.

Learning to Drive in a Day

This work is from one of my favorite startups. Their mission is to solve the self-driving problem end-to-end. They have several interesting works showing promising evidence that an end-to-end approach can learn to drive like a human.

In this work, they used RL to train a driving policy to follow a lane from scratch in less than 20 minutes, without any HDMap or hand-written rules! They describe this as the first example where an autonomous car has learned online, getting better with every trial. They used the DDPG algorithm with a single monocular camera image as input to the network, which generates control commands such as steering and speed.


The learning procedure runs entirely on a single on-board GPU. The network is a deep network with 4 convolutional layers and 3 fully connected layers, with a total of just under 10k parameters. The reward signal is the distance traveled by the vehicle without the safety driver taking control.
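The reward described above is simple enough to write down directly. The sketch below assumes that the vehicle speed is logged at fixed time steps and that the step at which the safety driver took over is known; the function name `episode_reward` and both parameters are hypothetical.

```python
from typing import Optional, Sequence


def episode_reward(speeds: Sequence[float], dt: float,
                   intervention_step: Optional[int] = None) -> float:
    """Distance in metres travelled before the safety driver takes control.
    speeds[i] is the speed (m/s) during step i; dt is the step length (s)."""
    driven = speeds if intervention_step is None else speeds[:intervention_step]
    return sum(v * dt for v in driven)


speeds = [2.0] * 10  # constant 2 m/s for 10 steps of 0.5 s
full_episode = episode_reward(speeds, dt=0.5)
cut_short = episode_reward(speeds, dt=0.5, intervention_step=4)
```

An episode that ends early because of an intervention earns less reward, so maximizing this signal pushes the policy toward staying in the lane for longer.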

Here is a video of the training procedure they published on their website:


Learning to Drive Like a Human

The next work is again from the same startup. In this work, they used Imitation Learning in addition to Reinforcement Learning. They first copy the driving skills of the expert driver using IL, then use RL to fine-tune the policy and learn from the safety driver's interventions and correction signals.

Compared to the previous work, they used a sat-nav command in addition to camera images as input, and the network outputs control commands:

They used some auxiliary tasks like segmentation, depth estimation, and optical flow estimation to learn a better representation of the scene and use it to train the policy.

Here are two examples of the performance of their model which is able to handle two complex scenarios, intersection and a narrow road:

And finally, this is a brief explanation for their work:


We reviewed several interesting works that use deep learning to tackle the self-driving car problem. We grouped the approaches by which modules use ML and DL: from a single module, to several modules combined, and finally to ML for the full stack in an end-to-end approach. I hope we see more interesting works in the future. Let’s see.