Making a pseudo LiDAR with cameras and deep learning

Original article was published on Deep Learning on Medium



LiDAR, or light detection and ranging, is a popular remote sensing method for measuring the exact distance to an object. LiDAR sensors can generate accurate 3D information about the objects around them.

LiDAR-generated point cloud

As you can see, LiDARs generate a very accurate 3D map of the world around them. This map is visualized as a point cloud. A point cloud consists of the coordinates of points in 3D space along the X, Y and Z axes. When plotted in 3D space, these points construct a scene like the one shown in the visualization above. This scene can then be used for path planning for autonomous vehicles, mapping of the environment, AR applications, and any other application where ‘Depth Information’ is required.

The Problem with LiDARs

LiDARs are very accurate at calculating ‘depth’ information. Depth information is one of the most important inputs for autonomous vehicles, for things like path planning and maintaining a safe distance from objects. This makes LiDARs a great candidate for integration into self-driving cars. But the problem is, they are too damn expensive !!

LiDARs are expensive

Earlier, high-range LiDARs cost around $75,000. But extensive research has been going on to reduce the cost of LiDARs. Waymo, a subsidiary of Alphabet Inc., was able to bring the cost down by 90 % through its extensive research !

But nonetheless, the LiDARs used in self-driving cars to date cost more than some low-tier cars themselves. Maintaining LiDARs and processing their output is still an expensive task and a headache. This makes them a less suitable option for commercial production of self-driving cars.

Secondly, LiDARs don’t work well in bad weather conditions: they generate puffy point clouds, which can make the LiDAR output inaccurate.

Still, companies should keep investing in approaches that use LiDARs and point cloud processing for self-driving purposes, because who knows, maybe one day LiDARs will get cheap as hell ?

The Problem with Cameras

Cameras are great at capturing high-resolution detail of the scene. But the problem is, they don’t provide us with ‘Depth Information’ as LiDARs do 🙁 Trade-offs are everywhere in the world. The output of the camera is a high-resolution but **flat 2D image**, and it is almost impossible to obtain ‘depth information’ from a single image. There are, however, methods to obtain depth from images using stereo vision.

Calculating Depth Map via Stereo Image Pairs

Given two images captured by two cameras placed at the same horizontal level some distance apart, we can estimate depth using computer vision algorithms.
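As a rough illustration of the geometry involved: for a rectified stereo pair, depth follows from disparity as Z = f * B / d. The sketch below assumes a disparity map has already been computed by some matching algorithm; the function name and the parameters `focal_px` and `baseline_m` are hypothetical and not taken from the article's code.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters) via Z = f * B / d."""
    disparity = np.asarray(disparity_px, dtype=np.float64)
    # Pixels with no disparity are treated as infinitely far away.
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

The hard part in practice is estimating the disparity map itself, which is where the trade-offs listed next come in.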

There are a lot of stereo depth estimation algorithms in the computer vision literature, but as far as I know none of them simultaneously achieves:

  • Real time processing
  • High accuracy
  • Fully automatic

How Neural Networks Come Into Play

But wait, humans use stereo vision (eyes) and are brilliant at estimating depth, even from a single image. You can even close one eye and still reasonably estimate depth !

Whoa ! What’s happening here ? Did humans actually ‘learn’ how to perceive depth ? We can’t really answer this because we can’t recall how the world looked when we were just ‘born’.

But still, can depth be treated as a learning problem so it becomes ‘good enough’ to solve self driving ?

Elon Musk hates LiDARs. Tesla is currently one of the most successful self-driving companies that commercially produces autonomous cars for consumers, and its tech stack mainly relies on cameras.

Andrej’s talk on depth learning

There are several papers out now that treat depth estimation from vision as a learning problem.

Supervised Depth Estimation

The concept behind ‘supervised’ depth learning is simple: collect RGB images and their corresponding depth maps, then train an ‘autoencoder’-like architecture for depth estimation. (Not as simple to train though, FCNs never really work without some special tricks in the training process :p). There are other supervised learning approaches as well.

Though this method is simpler to grasp, collecting depth maps in real life is an expensive task. LiDAR data can be used for training these kinds of networks. If we train on data collected by LiDAR, the neural network will perform worse than the LiDAR itself, but that’s ok, because we do not need that level of accuracy to drive a car; for example, we don’t need to know the exact distance of leaves on a tree.

We will use this approach in this blog post to train our network.

Unsupervised Depth Estimation

Just recording quality depth data in a range of environments is a challenging problem. Unsupervised methods can learn depth without ground truth depth maps !

“This approach just looks at unlabeled videos, and finds a way to create depth maps by not trying to be right, but trying to be consistent.”

For more information you can refer to this paper.

Since then, many unsupervised learning approaches that produce even better results have been introduced.

In this blog post though, we will train a supervised network.
Unsupervised approaches deserve a blog post of their own, which we will cover in the next part.

Implementation Details

We will train a neural network on data collected from CARLA, a self-driving car simulator. The implementation is based on this paper.

In brief, this paper uses a DenseNet model pre-trained on ImageNet as the encoder, and defines a decoder based on bilinear upsampling (discussed below).

The loss function (discussed in detail below) is a combination of MAE, SSIM, and image gradient difference.

Data Collection

First of all, we need to collect training data from the CARLA simulator. CARLA makes collecting depth maps and the corresponding RGB images very easy, but the data collection can be tricky for beginners to CARLA.

What a depth map looks like

Depth maps look like this, where things closer to the camera are darker than things farther away.

The depth maps are stored as ‘float64’ or ‘float32’ arrays. The problem with saving these depth maps ‘as is’ as images is that images are usually stored as ‘uint8’. uint8 values range from 0–255, which is too coarse to store depth. We need continuous and precise depth measurements like 0.254588, 5.56314, 100.25656, etc.
To solve this, CARLA has its own raw depth map encoding, which is an RGB image and looks like this:

CARLA’s encoded depth map as RGB Image

This can now be stored as an RGB image.

This RGB encoded raw CARLA depth map can be converted to actual depth using the formula:

normalized = (R + G * 256 + B * 256 * 256) / (256 * 256 * 256 - 1) 
in_meters = 1000 * normalized
Conversion from CARLA’s depth map encoding to actual pixel wise depths

The code can be found here.

After this conversion, we finally have the per-pixel depth in meters.
So now we can start storing the RGB images and their corresponding raw RGB-encoded depth map images.
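The conversion formula above can be sketched in NumPy as follows. This is a minimal sketch, assuming the encoded image is an array of shape (H, W, 3) with channels in R, G, B order (check the channel order of your loader); the function name is hypothetical.

```python
import numpy as np

def carla_rgb_to_meters(encoded_rgb):
    """Decode a CARLA RGB-encoded depth image into per-pixel depth in meters."""
    # Cast to float first so the multiplications do not overflow uint8.
    rgb = np.asarray(encoded_rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    normalized = (r + g * 256.0 + b * 256.0 * 256.0) / (256.0 ** 3 - 1.0)
    return 1000.0 * normalized
```

A fully white pixel (255, 255, 255) decodes to the maximum range of 1000 meters, and a black pixel decodes to 0.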

To make the data collection process even easier, we set up an ego vehicle in CARLA and put it into autonomous mode, so it automatically drives around the city and keeps collecting data and saving it to disk on its own !

But there is one more problem to solve before we can finally start collecting depth data: the car in autonomous mode keeps stopping at traffic lights, producing a lot of redundant data. To solve this, whenever the ego vehicle reaches a traffic light, the light is automatically turned green. But make sure you do not spawn more than 50 vehicles in the map, since turning the traffic lights green would then cause accidents and our ego vehicle would get stuck in them !

Making traffic lights green to avoid data redundancy

The data collection script was written by Raghav Prabhakar ! Thanks to him for uploading and open-sourcing the data.

13 GB of Data uploaded by Raghav Prabhakar can be found here

One last thing: save the camera’s FOV, image width, and height. These are used to construct the camera’s intrinsic matrix, which projects pixels in image space into the 3D world using the depth information (more on this later).
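For reference, here is one common way to build a pinhole intrinsic matrix from those three values. It is a sketch under stated assumptions: the FOV is the horizontal field of view in degrees, pixels are square, and the principal point sits at the image center. The function name is hypothetical.

```python
import numpy as np

def build_intrinsics(width, height, fov_deg):
    """Build a 3x3 pinhole intrinsic matrix from image size and horizontal FOV."""
    # Focal length in pixels: half the image width subtends half the FOV.
    focal = width / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    return np.array([
        [focal, 0.0, width / 2.0],
        [0.0, focal, height / 2.0],
        [0.0, 0.0, 1.0],
    ])
```

For example, an 800 x 600 image with a 90 degree FOV gives a focal length of about 400 pixels.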

Neural Network Architecture

Image resizing methods

In brief, we use bilinear upsampling because it results in a ‘smoother’ image overall after upsampling. We could also use more advanced upsampling techniques like bicubic, lanczos3, lanczos5, etc. But I noticed that these methods tend to produce a few more artifacts after training, and they are computationally more expensive and slow down training. So bilinear upsampling is a sweet spot.

The output is a depth map that is half the size of the input image. This helps the network learn faster. We can always upsample the output to the original image size later.
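To make the idea concrete, here is a small NumPy sketch of bilinear upsampling for a single-channel depth map. It assumes the "align corners" sampling convention, which is only one of several conventions used by deep learning frameworks, so the values will not match every library exactly. The function name is hypothetical.

```python
import numpy as np

def bilinear_upsample(depth, scale=2):
    """Upsample a 2D depth map by an integer factor with bilinear interpolation."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Output sample positions expressed in source pixel coordinates.
    ys = np.linspace(0, h - 1, h * scale)
    xs = np.linspace(0, w - 1, w * scale)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = depth[y0][:, x0] * (1 - wx) + depth[y0][:, x1] * wx
    bottom = depth[y1][:, x0] * (1 - wx) + depth[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Calling `bilinear_upsample(prediction, scale=2)` would bring a half-size prediction back to the input resolution.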

Loss function

Initially, a model was trained with a simple MAE or MSE loss function, but the results were not really satisfying.

Initial results (Not even close to the final model)

The loss function in the paper and the one used here consists of 3 parts:

  • MAE: This is used to penalize errors in the predicted depth values. It is a pixel-wise loss that is independent of neighboring pixels. Convolutional neural networks work well because they take neighboring pixels into account, and images are highly correlated. Our neural network has an ‘image-like’ output, so it makes sense to include an image-wise loss rather than using a ‘pixel-wise’ loss alone. A good candidate is SSIM.
  • SSIM (Structural Similarity Index): This loss actually measures the perceptual **_difference_** between two similar images. Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
    — SSIM is widely used as a loss in deep learning for image-reconstruction tasks. SSIM is a structural loss, so it is important to use another loss like MSE/MAE along with it so now we can both penalize the structure as well as pixel wise depths.

    I have earlier worked on image compression with deep learning, and using standalone SSIM as the loss resulted in great output, but the colors were off. SSIM won’t know if the image is losing contrast or color. The image shown below is from another project and isn’t the best example of color loss, but the output was far more structurally similar than with MAE; the problem was the loss of color information. Using a weighted combination of MSE and SSIM solved this problem.

SSIM as a loss function for image reconstruction
  • Image gradient loss: An image gradient is a directional change in the intensity or color in an image. The gradient of the image is one of the fundamental building blocks in image processing. For example, the Canny edge detector uses image gradients for edge detection. This loss penalizes mismatched edges between the predicted and ground-truth depth maps.

Fortunately, TensorFlow has already done the hard work of implementing these loss functions. Code implementation can be found here.
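As a framework-free illustration of how the three terms combine, here is a NumPy sketch. It is not the training code: the SSIM here is a simplified single-window version computed over the whole image (library implementations use local sliding windows), and the weights `w_mae`, `w_ssim` and `w_grad` are placeholder values, not the ones used for the model in this post.

```python
import numpy as np

def mae_term(y_true, y_pred):
    """Pixel-wise mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def gradient_term(y_true, y_pred):
    """Mean absolute difference between vertical and horizontal gradients."""
    dy = np.abs(np.diff(y_true, axis=0) - np.diff(y_pred, axis=0))
    dx = np.abs(np.diff(y_true, axis=1) - np.diff(y_pred, axis=1))
    return np.mean(dy) + np.mean(dx)

def global_ssim(y_true, y_pred, data_range):
    """Simplified SSIM using one window that covers the whole image."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    numerator = (2 * mu_t * mu_p + c1) * (2 * cov + c2)
    denominator = (mu_t ** 2 + mu_p ** 2 + c1) * (var_t + var_p + c2)
    return numerator / denominator

def depth_loss(y_true, y_pred, data_range, w_mae=0.1, w_ssim=1.0, w_grad=1.0):
    """Weighted sum of MAE, SSIM dissimilarity and gradient difference."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Map SSIM (1 means identical) to a dissimilarity in [0, 1].
    ssim_term = np.clip((1.0 - global_ssim(y_true, y_pred, data_range)) / 2.0, 0.0, 1.0)
    return (w_mae * mae_term(y_true, y_pred)
            + w_ssim * ssim_term
            + w_grad * gradient_term(y_true, y_pred))
```

Identical maps give a loss of about zero, and each term grows as the prediction drifts from the target in value, structure, or edges.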

Image augmentation

For image augmentation, we can use the following techniques:

  • Image flipping
  • Color channel shuffling of input image
  • Add noise to the input image
  • Increase contrast, brightness, temperature, etc. of the input image

This makes sure that the model keeps seeing new data throughout training and generalizes better to unseen data.

Code can be found here.
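A minimal sketch of the first three augmentations might look like the following. It assumes the image is a float array in [0, 1] with shape (H, W, 3) and the depth map has shape (H, W); the function name and default parameters are hypothetical. Note that a flip must be applied to both the image and its depth map, while the color changes only touch the input image.

```python
import numpy as np

def augment_pair(image, depth, rng, flip_prob=0.5, noise_std=0.02):
    """Randomly flip, shuffle color channels and add noise to a training pair."""
    image = np.asarray(image, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    # A horizontal flip changes geometry, so the depth map is flipped too.
    if rng.random() < flip_prob:
        image = image[:, ::-1, :]
        depth = depth[:, ::-1]
    # Channel shuffling and noise change appearance only, so depth is untouched.
    image = image[..., rng.permutation(3)]
    image = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(image, 0.0, 1.0), depth
```

A generator such as `np.random.default_rng(seed)` can be passed as `rng` to keep the augmentation reproducible.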

Depth normalization

Depth normalization is the idea of taking the inverse of the depth map, because we need to penalize errors on things that are ‘closer’ more than on things that are far away; for planning, closer objects mostly matter more.

depthNormalized = maxDepth / original_depth_map

where maxDepth is the max depth value in the whole dataset, which is 1000 in our case.

Before performing depth normalization, make sure that you clip your depth maps between min_depth and max_depth, which are 0.1 and 1000 respectively, to prevent a division-by-zero error if a zero is somehow present in your depth map. A zero would make the loss shoot up to NaN, breaking the gradients and making the whole network useless.
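Clipping and normalization together can be sketched as below, using the min_depth and max_depth values mentioned above; the function name is hypothetical.

```python
import numpy as np

def normalize_depth(depth_m, min_depth=0.1, max_depth=1000.0):
    """Clip depth to [min_depth, max_depth], then invert it as max_depth / depth."""
    # Clipping first guarantees the divisor is never zero.
    clipped = np.clip(np.asarray(depth_m, dtype=np.float64), min_depth, max_depth)
    return max_depth / clipped
```

With these defaults, the normalized values fall between 1 (farthest) and 10000 (closest).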

Training the network

You have to be very careful with the hyperparameters: a single wrong parameter and the loss shoots up to NaN.

The model was trained using the Adam optimizer with a learning rate of 0.0001 and no amsgrad for 10 epochs. One epoch took 3.5 hours on Colab’s P4 GPU :]

In total, the final model was trained for 35 hours. Other variants were trained too, so it took a lot of time to get results.

The network was collaboratively trained epoch by epoch by Raghav Prabhakar, Chirag Goel, Mandeep and me.

Results

Results

Whoaa ! That almost looks like a perfect depth map !

I was really excited to see how the 3D reconstruction would look.
To do the 3D reconstruction, you need to know some math.

This blog post explains this really well.
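In short, each pixel (u, v) with depth Z is back-projected through the pinhole model: X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy. Here is a minimal NumPy sketch, assuming the depth is measured along the camera's optical axis and `K` is a 3x3 intrinsic matrix; the function name is hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth_m, K):
    """Back-project a depth map into an (N, 3) array of camera-space points."""
    depth = np.asarray(depth_m, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    h, w = depth.shape
    # u is the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

The resulting points are in row-major pixel order, so the colors of the RGB image can be attached to them with a matching reshape.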

So here are the results of the 3D reconstruction on unseen data:

3D-Reconstruction results

Interesting ! The results look really good and could be used for path planning, which we will cover in another blog post.
The model runs in real time on a GTX 1060 too !

Although the predicted depth map looked sharp and close to the original, the 3D reconstruction still looks a bit wobbly. This shows that even slight imperfections in depth can lead to large errors in the reconstructed 3D point cloud. This may be fixed by the reverse Huber (berHu) loss.

berHu loss

This loss penalizes large errors quadratically and small errors only linearly, so far-away objects, where the errors tend to be larger, are penalized more than closer objects. But remember, the effect is reversed if we use it with depth normalization.
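For reference, a common formulation of the reverse Huber loss is |x| when |x| <= c and (x² + c²) / (2c) otherwise, where x is the residual. The sketch below assumes the threshold c is set to 20% of the largest residual in the batch, which is a frequently used choice but not something this post has tested; the function name and `threshold_ratio` are hypothetical.

```python
import numpy as np

def berhu_loss(y_true, y_pred, threshold_ratio=0.2):
    """Reverse Huber loss: linear for small residuals, quadratic for large ones."""
    residual = np.abs(np.asarray(y_true, dtype=np.float64)
                      - np.asarray(y_pred, dtype=np.float64))
    c = threshold_ratio * residual.max()
    if c == 0.0:
        return 0.0  # prediction matches the target exactly
    quadratic = (residual ** 2 + c ** 2) / (2.0 * c)
    return float(np.mean(np.where(residual <= c, residual, quadratic)))
```

The two branches meet at |x| = c, so the loss stays continuous across the threshold.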

The model was not fully trained, so there is a lot of room for improvement. Secondly, to improve the reconstruction, you could use outlier removal as explained here.

Further improvements

The current model takes a single image as input. It is actually impossible to estimate the correct depth of all objects from a single image.

Depth Illusion from single image

To solve this, we can take as input either a sequence of frames or a pair of stereo images, to get a better estimate of the things that don’t have a one-to-one solution from a single image.