Recognizing Depth in Autonomous Driving

Source: Deep Learning on Medium

This article will describe some of the state-of-the-art methods in depth predictions in image sequences captured by vehicles that help in the development of new autonomous driving models without the use of extra cameras or sensors.

As mentioned in my previous article “How does Autonomous Driving Work? An Intro into SLAM”, many sensors are used to capture information while a vehicle is driving. The variety of measurements captured includes velocity, position, depth, thermal readings and more. These measurements are fed into a feedback system that trains and utilizes motion models for the vehicle to abide by. This article focuses on the prediction of depth, which is often captured by a LiDAR sensor. A LiDAR sensor measures the distance to an object by emitting a laser and timing the reflected light with a sensor. However, LiDAR is not affordable for the everyday driver, so how else could we measure depth? The state-of-the-art methods I will describe are unsupervised deep learning approaches that use the disparity, or difference, in pixels from one frame to the next to measure depth.

  • Note: as indicated in the image captions, most images are taken from the original papers being referenced and are not a product or creation of my own.


Authors in [1] developed a method that uses a combination of depth and pose networks to predict depth from a single frame. They train this architecture on sequences of frames with several loss functions. The method does not require a ground-truth dataset for training; instead, consecutive temporal frames in an image sequence provide the training signal. To help constrain the learning, they use a pose estimation network. The model is trained on the difference between the target image and a target image reconstructed from the outputs of the pose and depth networks. The reconstruction process will be described in more detail later. The main contributions of [1] are:

  1. An auto-masking technique to remove focus on unimportant pixels
  2. Modification to the photometric reconstruction error with depth maps
  3. Multi-scale depth estimation


The approach of this paper uses a depth network and a pose network. The depth network is a classic U-Net [2] encoder-decoder architecture. The encoder is a pre-trained ResNet model. The depth decoder is similar to previous work: it converts a sigmoid output into depth values.
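The sigmoid-to-depth conversion can be sketched as follows. This is an illustration of the convention used in [1]'s released code, where the sigmoid output is interpreted as a scaled disparity (inverse depth) within an assumed depth range of 0.1 to 100 units:

```python
def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in [0, 1] to a depth value.

    The sigmoid is read as a disparity (inverse depth): an output of
    1.0 corresponds to the nearest allowed depth, 0.0 to the farthest.
    The 0.1-100 range is an assumed choice, not a universal constant.
    """
    min_disp = 1.0 / max_depth   # disparity of the farthest point
    max_disp = 1.0 / min_depth   # disparity of the nearest point
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp
```

Working in disparity rather than depth keeps the network output bounded, which is numerically friendlier than regressing an unbounded depth directly.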

Sample image of U-Net [2].
6-DoF. Image from Wikipedia.

The pose network is a ResNet18 modified to take two color images as input and predict a single 6-DoF relative pose, i.e. a rotation and translation. The pose network uses temporal frames as the pair of images rather than the typical stereo pair. It predicts the appearance of a target image from the viewpoint of another image in the sequence, either the frame before or the frame after.
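A 6-DoF pose vector can be converted into the 4×4 transformation matrix used later for reprojection. As an illustrative sketch (the axis-angle parameterization and Rodrigues' formula used here are one common choice, not necessarily the exact parameterization in [1]):

```python
import numpy as np

def pose_vec_to_mat(pose):
    """Convert a 6-DoF vector [rx, ry, rz, tx, ty, tz] (axis-angle
    rotation + translation) into a 4x4 homogeneous transform."""
    rot = np.asarray(pose[:3], dtype=float)
    trans = np.asarray(pose[3:], dtype=float)
    theta = np.linalg.norm(rot)              # rotation angle in radians
    if theta < 1e-8:
        R = np.eye(3)                        # no rotation
    else:
        k = rot / theta                      # unit rotation axis
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])     # cross-product matrix
        # Rodrigues' rotation formula
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T
```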


The figure below illustrates the training process of the architecture.

Images taken from KITTI and [1].

Photometric Reconstruction Error

The target image is at frame 0, and the images used in the prediction process are the frame before or the frame after, i.e. frame+1 or frame-1. The loss is based on the similarity between the target image and a reconstructed target image. The reconstruction process starts by calculating the transformation matrix from the source frame, either frame+1 or frame-1, using the pose network; that is, we calculate the mapping from the source frame to the target frame using information about rotation and translation. We then use the depth map predicted by the depth network for the target image, together with the transformation matrix from the pose network, to project into a camera with intrinsics matrix K and obtain a reconstructed target image. This requires first transforming the depth map into a 3D point cloud and then using the camera intrinsics to project the 3D positions into 2D points. The resulting points are used as a sampling grid to bilinearly interpolate from the source image.
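The geometry of this step can be sketched with a pinhole camera model. The code below (NumPy, variable names my own) computes the sampling coordinates only; the final bilinear interpolation from the source image is omitted:

```python
import numpy as np

def reproject(depth, K, T):
    """Compute, for each target pixel, where to sample in the source view.

    depth : (H, W) predicted depth map for the target image
    K     : (3, 3) camera intrinsics matrix
    T     : (4, 4) target-to-source transform from the pose network
    Returns an (H, W, 2) grid of source-image (u, v) coordinates that a
    bilinear sampler would use to reconstruct the target image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # 1) back-project pixels into a 3D point cloud in the target frame
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # 2) move the point cloud into the source camera's frame
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_src = (T @ pts_h)[:3]
    # 3) project back to 2D with the intrinsics
    proj = K @ pts_src
    return (proj[:2] / proj[2:]).T.reshape(H, W, 2)
```

With an identity transform the grid maps every pixel back onto itself, which is a handy sanity check when implementing this.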

The goal of this loss is to reduce the difference between the target image and the reconstructed target image, in which both pose and depth are required.

Photometric Loss function from [1].
Benefit of using the minimum photometric error. The circled pixel areas are occluded. Image from [1].

Typically, similar methods average the reprojection error over the source images, e.g. frame+1 and frame-1. However, if a pixel is visible in the target frame but not in one of the source frames, because it is close to the image boundary or occluded, the photometric error penalty will be unfairly high. To address this, the authors instead take the per-pixel minimum photometric error over all source images.
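The per-pixel minimum can be sketched as follows. For simplicity this uses a plain L1 photometric error; [1] actually combines SSIM with L1:

```python
import numpy as np

def min_reprojection_error(target, recons):
    """Per-pixel minimum photometric error over reconstructions of the
    target built from several source frames (e.g. frame-1 and frame+1).

    target : (H, W, 3) target image
    recons : list of (H, W, 3) reconstructed target images
    Returns an (H, W) error map.
    """
    # per-pixel L1 error against each reconstruction
    errors = [np.abs(target - r).mean(axis=-1) for r in recons]
    # Taking the min (not the mean) lets a pixel that is occluded in one
    # source frame be scored against whichever frame actually sees it.
    return np.min(np.stack(errors, axis=0), axis=0)
```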


The final photometric loss is multiplied by a mask that addresses violations of the assumption of a camera moving through a static scene, e.g. an object moving at a similar speed to the camera, or the camera stopped while other objects move. The problem with these situations is that the depth network predicts infinite depth. The authors address this with an auto-masking method that filters out pixels that do not change appearance from one frame to the next. The mask is binary: it is 1 if the minimum photometric error between the target image and the reconstructed target image is less than the minimum photometric error between the target image and the unwarped source image, and 0 otherwise.

Auto-masking generation in [1] where Iverson bracket returns 1 if true and 0 if false.

When the camera is static, all pixels in the image are masked out. When an object moves at the same speed as the camera, its pixels appear unchanged between frames and are likewise masked out.
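The auto-mask can be sketched as follows (again using plain L1 in place of the SSIM + L1 mix of [1]):

```python
import numpy as np

def auto_mask(target, sources, recons):
    """Binary auto-mask from [1]: keep a pixel (mask = 1) only when the
    warped reconstruction explains it better than the raw, unwarped
    source frame does.  Pixels that look static across frames (camera
    stopped, or an object moving with the camera) fail this test."""
    l1 = lambda a, b: np.abs(a - b).mean(axis=-1)
    err_recon = np.min(np.stack([l1(target, r) for r in recons]), axis=0)
    err_ident = np.min(np.stack([l1(target, s) for s in sources]), axis=0)
    return (err_recon < err_ident).astype(np.float32)
```

Note the strict inequality: if the source frame already matches the target perfectly (a static pixel), the reconstruction cannot beat it, so the pixel is masked out.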

Multi-Scale Estimation

The authors combine the individual losses at each scale. They upsample the lower-resolution depth maps to the input image's resolution and then re-project, re-sample and compute the photometric error at this higher resolution. The authors claim this constrains the depth maps at each scale to work toward the same objective: an accurate, high-resolution reconstruction of the target image.
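The idea can be sketched as below. Nearest-neighbour upsampling stands in for the bilinear upsampling used in practice, and the photometric loss is left abstract:

```python
import numpy as np

def upsample_nearest(depth, out_h, out_w):
    """Nearest-neighbour upsampling of a (H, W) depth map."""
    H, W = depth.shape
    rows = np.arange(out_h) * H // out_h   # source row for each output row
    cols = np.arange(out_w) * W // out_w   # source col for each output col
    return depth[np.ix_(rows, cols)]

def multi_scale_loss(target, depth_maps, loss_fn):
    """Upsample each coarse depth map to the target's full resolution,
    evaluate the (photometric) loss there, and average over scales."""
    H, W = target.shape[:2]
    losses = [loss_fn(target, upsample_nearest(d, H, W)) for d in depth_maps]
    return sum(losses) / len(losses)
```

The key point is that the loss is always computed at the input resolution, rather than at each scale's own resolution, so the coarse maps cannot "hide" errors at low resolution.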

Other Losses

The authors also use an edge-aware smoothness loss between the mean-normalized inverse depth values and the input/target image. This encourages the model to learn sharp edges and to smooth away noise.
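A sketch of an edge-aware smoothness term of this kind follows; the exact weighting in [1] may differ in detail:

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness: penalize disparity gradients, but less so
    where the image itself has strong gradients (likely object edges).

    disp : (H, W) disparity (inverse depth) map
    img  : (H, W, 3) target image
    """
    # mean-normalize so the loss cannot be minimized by shrinking depth
    disp = disp / (disp.mean() + 1e-7)
    # first-order gradients of disparity and image
    ddx = np.abs(disp[:, 1:] - disp[:, :-1])
    ddy = np.abs(disp[1:, :] - disp[:-1, :])
    idx = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)
    idy = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)
    # exp(-|image gradient|) down-weights the penalty across image edges
    return (ddx * np.exp(-idx)).mean() + (ddy * np.exp(-idy)).mean()
```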

The final loss function becomes:

The final loss function in [1] which is averaged over each pixel, scale and batch.


The authors compared their model on three datasets containing driving sequences. Their method outperformed almost all other methods across all experiments. An example of their performance is shown in the following image:

Image from the [1] GitHub repository.

For more details on their results, please see the original paper, “Digging into Self-Supervised Monocular Depth Estimation” [1].

Monodepth2 Extension: Struct2Depth

Object motion modeling

Authors from Google Brain published [3], which extends Monodepth2 even further. They improve upon the pose network by predicting the motions of individual objects instead of the entire image as a whole. So instead of the reconstructed image being a single projection, it is now a sequence of projections that are then combined. They do this with two models: an object motion model and an ego-motion network (similar to the pose network described in the previous sections). The steps are as follows:

Sample output from Mask R-CNN [4]. Image from [4].
  1. A pre-trained Mask R-CNN [4] is applied to segment potentially moving objects.
  2. A binary mask is used to remove these potentially moving objects from the static images (frame -1, frame 0, and frame +1).
  3. The masked images are sent to the ego-motion network, which outputs the transformation matrices between frames -1 and 0 and frames 0 and +1.
The masking process to extract the static background followed by the ego-motion transformation matrix without objects that move. Equation from [3].
  4. Apply the resulting ego-motion transformation matrices from step 3 to frame -1 and frame +1 to get a warped frame 0.
  5. Apply the same ego-motion transformation matrices to the segmentation masks of the potentially moving objects in frame -1 and frame +1 to get warped segmentation masks for frame 0, per object.
  6. A binary mask is used to keep only the pixels associated with the warped segmentation mask.
  7. The masked images are combined with the warped images and passed to the object motion model, which outputs the predicted object motion.
The object motion model for one object. Equation from [3].

The result is a representation of how the camera would have to move in order to “explain” the change in appearance of the objects. We then move the objects according to the motion predicted by the object motion model. Finally, we combine the warped object movements with the warped static background to get the final warping:

Equation from [3].
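The combination step can be sketched as follows (a simple paste-by-mask composition; details of [3]'s blending may differ):

```python
import numpy as np

def combine_warps(static_warp, object_warps, object_masks):
    """Compose the final reconstruction: start from the ego-motion-warped
    static background, then paste in each individually warped object
    wherever its warped segmentation mask is active.

    static_warp  : (H, W, 3) background warped by the ego-motion transform
    object_warps : list of (H, W, 3) per-object warped images
    object_masks : list of (H, W) binary masks, one per object
    """
    out = static_warp.copy()
    for warp, mask in zip(object_warps, object_masks):
        m = mask[..., None].astype(bool)   # (H, W, 1), broadcast over RGB
        out = np.where(m, warp, out)
    return out
```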

Learning Object Scale

While Monodepth2 addresses static objects and objects moving at the same speed as the camera through its auto-masking technique, these authors instead propose training the model to recognize object scale, improving the modelling of object motion.

Image from Struct2Depth. The middle column shows the problem of infinite depth being assigned to objects moving at the same speed as the camera. The third column shows their method's improvement.

They define a loss on the scale of each object based on its category, e.g. a house. It aims to constrain the depth using knowledge of the object's scale. The loss is the difference between the predicted depth map of the object and an approximate depth computed from the camera's focal length, a height prior based on the object's category, and the height in pixels of the segmented object in the image, with both terms scaled by the mean depth of the target image:

The formulation for the loss that helps the model learn object scale. Equations from [3].
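A sketch of this scale constraint follows; the exact norm and normalization in [3] may differ:

```python
import numpy as np

def object_scale_loss(depth_obj, mean_depth, focal_y, height_prior, pix_height):
    """Constrain an object's predicted depth using a size prior for its
    category: a pinhole camera implies depth ~ f_y * H_prior / h_pixels.

    depth_obj    : predicted depth values on the object's pixels
    mean_depth   : mean depth of the whole target image
    focal_y      : camera focal length (pixels, vertical)
    height_prior : assumed real-world height for the object's category
    pix_height   : height in pixels of the segmented object
    """
    approx_depth = focal_y * height_prior / pix_height
    # Both terms are scaled by the mean depth so the loss cannot be
    # trivially minimized by shrinking the whole scene.
    return np.abs(depth_obj / mean_depth - approx_depth / mean_depth).mean()
```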


The extensions described in [3] were directly compared to the Monodepth2 model and showed significant improvement.

The middle row shows the results from [3] while the ground truth is shown in the third row. Image from [5].


The common methods of depth estimation in autonomous driving are to use a stereo pair of images, requiring two cameras, or a LiDAR depth sensor. However, these are costly and not always available. The methods described here train deep learning models that predict depth from a single image, using only a sequence of images for training. They show good performance and a promising future for research in autonomous driving.

To try out the models yourself, both papers have repositories linked below:




[1] Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. (2018). Digging into self-supervised monocular depth estimation. arXiv preprint arXiv:1806.01260.

[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[3] Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova: Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. Thirty-Third AAAI Conference on Artificial Intelligence (AAAI’19).

[4] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

[5] Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova: Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics. CVPR Workshop on Visual Odometry & Computer Vision Applications Based on Location Clues (VOCVALC), 2019