Source: Deep Learning on Medium
Understanding how far things are relative to a camera remains difficult but is absolutely necessary for exciting applications such as robotics, autonomous driving, 3D scene reconstruction and AR. In robotics, depth is a key prerequisite for tasks such as perception, navigation, and planning. If we would like to create a 3D map, computing depth allows us to back-project images captured from multiple views into 3D. Registration and matching of all the points can then reconstruct the scene.
Estimating depth from images has been very challenging because the solution is not unique. But if solved, all the applications mentioned above could be realised at consumer grade, thanks to the relatively cheap production cost of cameras! Right now, the best alternative for retrieving depth is a range sensor such as Lidar or Radar. These are naturally high-fidelity sensors, providing highly precise depth information.
Having worked on depth estimation, in the application of autonomous vehicles in particular, I can say it is indeed challenging for various reasons, such as occlusion, dynamic objects in the scene and imperfect stereo correspondence. At a high level, reflective, transparent and mirror-like surfaces are the biggest enemies of stereo matching algorithms. E.g. a car's windshield often degrades matching and hence estimation.
Therefore, most companies still rely on Lidar to reliably extract distance. However, the current trend in autonomous vehicle perception stacks is steering toward sensor fusion, since each sensor has its own strengths in the features it extracts. You can check this out to understand how the data can be manipulated and fused! Nonetheless, this field has gained much traction and produced outstanding results since the inception of deep learning. Much research has been dedicated to solving these issues.
In computer vision, depth is extracted via two prevalent methodologies: depth from monocular images (static or sequential), or depth from stereo images by exploiting epipolar geometry. This post will mainly give readers a background on depth estimation and the problems associated with it. An adequate understanding of camera projective geometry is required to follow along, which I plan to cover in a future post.
By reading this article, I would like you to gain an intuitive understanding of depth perception in general, as well as the basics and current trends of depth estimation research. We will then discuss some (many) of the associated problems.
Various depth estimation algorithms will be elaborated in subsequent posts. I will need more than this post to describe the technical details 😉
How we view the world
Let’s start with how we humans perceive depth in general. This will give us some valuable insights into depth estimation, since many of these methods were derived from our human vision system. Both machine and human vision share similarities in the way an image is formed (Fig 2). Theoretically, when light rays from a source hit a surface, they reflect off it toward the back of our retina, projecting onto it, and our eyes process the result as 2D, just like how an image is formed on an image plane.
So how do we actually measure distance and understand our environment in 3D when the projected scene is 2D? For example, suppose someone is about to throw you a punch: you instinctively know when you are going to be hit and dodge when the fist comes too close! Or when you are driving a car, you can somehow gauge when to step on the accelerator or press the brakes to keep a safe distance from so many other drivers and pedestrians.
The mechanism at work here is that our brain reasons about the incoming visual signals by recognizing patterns in the scene, such as size, texture and motion, known as depth cues. There is no distance information in the image, yet somehow we interpret and recover depth effortlessly. We perceive which aspects of the scene are close to us and which are farther away. These cues also allow us to view objects and surfaces on supposedly flat images as 3D.
How to destroy depth (Not human/computer vision)
Just to highlight an interesting fact: interpreting these depth cues begins with how scenes are projected to a perspective view in human and camera vision. In contrast, an orthographic projection to a front view or side view destroys all depth information.
Consider figure 3: an observer can discern which parts of the house are nearer as seen in the left image. However, it is totally impossible to distinguish relative distances from the right image. Even the background might be lying on the same plane as the house.
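To see why orthographic projection destroys depth, consider a minimal NumPy sketch (the focal length and points here are made up for illustration): a perspective projection divides by depth, so two points at different depths land at different pixels, while an orthographic projection simply drops the depth coordinate.

```python
import numpy as np

f = 1.0  # illustrative focal length

# Two points with the same (X, Y) but different depths Z
near = np.array([2.0, 1.0, 4.0])
far = np.array([2.0, 1.0, 8.0])

def perspective(p):
    return f * p[:2] / p[2]   # divides by depth Z

def orthographic(p):
    return p[:2]              # drops depth entirely

print(perspective(near), perspective(far))     # different pixels
print(orthographic(near), orthographic(far))   # identical pixels
```

The perspective images of the two points differ, carrying a depth signature; the orthographic ones coincide, so no cue survives.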
Judging Depth Using Cues
There are basically 4 categories of depth cues: static monocular, depth from motion, binocular and physiological cues. We subconsciously take advantage of these signals to perceive depth remarkably well.
Pictorial Depth Cues
Our ability to perceive depth from a single still image depends on the spatial arrangement of things in a scene. Below, I have summarized some of the hints that enable us to reason about the distances of different objects. They may already feel natural to you, without your having to put much thought into figuring out the various cues.
Depth Cues from Motion (Motion Parallax)
This should not be surprising either. When you, as an observer, are in motion, nearby things pass by faster than those farther away. The farther something is, the slower it seems to move past you.
Depth Cues from Stereo Vision (Binocular Parallax)
Retinal disparity: yet another interesting phenomenon that empowers us to recognize depth, and one that can be understood intuitively from a simple experiment.
Place your index finger in front of you, as close to your face as possible, with one eye closed. Now, repeatedly close one eye and open the other. Observe that your finger appears to move! The difference between the views observed by your left and right eye is known as retinal disparity. Now hold your finger out at arm’s length and perform the same action. You should notice that the change in your finger’s position becomes less obvious. This should give you some clues about how stereo vision works.
This phenomenon is known as stereopsis: the ability to perceive depth thanks to two different perspectives of the world. By comparing the images formed on the two retinas, the brain computes distance: the greater the disparity, the closer things are to you.
Depth Estimation in Computer Vision
The goal of depth estimation is to obtain a representation of the spatial structure of a scene, recovering the three-dimensional shape and appearance of objects in imagery. This is also known as an inverse problem, where we seek to recover unknowns given insufficient information to fully specify the solution: the mapping between the 2D view and 3D is not unique (fig 10). I will cover classical stereo methods and deep learning methods in this section.
So how do machines actually perceive depth? Can we somehow transfer some of the ideas discussed above? The earliest algorithms with impressive results, dating back to the 90s, estimated depth using stereo vision. A lot of progress was made on dense stereo correspondence algorithms. Researchers were able to utilize geometry to constrain the problem and replicate the idea of stereopsis mathematically, all while running in real time. All of these ideas were summarised in this paper.
As for monocular depth estimation, it recently started to gain popularity through neural networks that learn a representation distilling depth directly. Besides this, there has been great advancement in self-supervised depth estimation, which is particularly exciting and groundbreaking! In this method, a model is trained to predict depth by optimising a proxy signal; no ground-truth label is needed during training. Most research exploits geometric cues such as multi-view geometry or epipolar geometry to learn depth. We will touch on this later.
Depth Estimation From Stereo Vision
The main idea of solving for depth using a stereo camera involves two concepts: triangulation and stereo matching. The former depends on good calibration and rectification to constrain the problem so that it can be modelled on a 2D plane known as the epipolar plane, which reduces the latter to a line search along the epipolar line (fig 7). More technical details about epipolar geometry will be discussed in a future post :)
Analogous to binocular parallax, once we are able to match pixel correspondences between the two views, the next task is to obtain a representation that encodes the differences. This representation is known as the disparity, d. To obtain depth from disparity, the formula can be worked out from similar triangles (fig 8).
The steps are as follows:
- Identify similar points using feature descriptors.
- Match feature correspondences using a matching cost function.
- Using epipolar geometry, find and match each correspondence from one picture frame in the other; the matching cost function measures the pixel dissimilarity.
- Compute the disparity from the known correspondences: d = x1 - x2, as shown in figure 8.
- Compute depth from the known disparity: z = (f*b)/d
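As a toy illustration of the last two steps, here is how depth falls out of a single matched pixel pair (the focal length and baseline below are made-up numbers for a hypothetical rectified rig):

```python
def depth_from_disparity(x1, x2, focal_px, baseline_m):
    """Recover depth z = f*b/d from matched x-coordinates.

    x1, x2: pixel x-coordinates of the same scene point in the left and
    right rectified images; focal_px: focal length in pixels;
    baseline_m: distance between the two camera centres in metres.
    """
    d = x1 - x2                     # disparity in pixels
    if d <= 0:
        raise ValueError("non-positive disparity: bad match or point at infinity")
    return focal_px * baseline_m / d

# Hypothetical rig: f = 700 px, baseline = 0.54 m, 70 px of disparity
z = depth_from_disparity(x1=400.0, x2=330.0, focal_px=700.0, baseline_m=0.54)
print(z)  # 5.4 (metres)
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why distant points (tiny disparities) are the hardest to estimate precisely.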
Age of Deep Learning
Deep learning excels at high-level perceptual and cognitive tasks such as recognition, detection and scene understanding. Depth perception falls into this category and is likewise a natural way forward. There are currently 3 broad frameworks for learning depth:
Supervised learning: the seminal work on estimating depth directly from a monocular image started with Saxena et al. They learned to regress depth directly from monocular cues in 2D images via supervised learning, by minimising a regression loss. Since then, many approaches have been proposed to improve the representation learning with new architectures or loss functions.
Self-supervised depth estimation using the SfM framework: this method frames the problem as learning to generate a novel view from a video sequence. The task of the neural network is to generate the target view I_t from source views, by taking images at different time steps I_t-1, I_t+1 and applying a transformation learnt by a pose network to perform the image warping. Training is made possible by treating the warped view synthesis as supervision, in a differentiable manner, using a spatial transformer network. At inference time, the depth CNN predicts depth from a single RGB view (fig 10). I would recommend reading this paper to learn more; the idea is worth exploring! Do note that this method has some shortcomings, such as its inability to determine scale and to model moving objects, described in the next section.
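To make the warping step concrete, here is a minimal NumPy sketch of the view-synthesis machinery (nearest-neighbour sampling stands in for the differentiable spatial transformer, and the intrinsics, pose and image below are made up): each target pixel is back-projected with the predicted depth, moved by the relative pose, and re-projected into the source view to sample a colour.

```python
import numpy as np

def warp_source_to_target(src_img, depth_t, K, T_t2s):
    """Inverse-warp src_img into the target frame (nearest-neighbour).

    For every target pixel: back-project with the predicted depth,
    apply the 4x4 relative pose T_t2s (target -> source camera), then
    project into the source image and sample.
    """
    H, W = depth_t.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)     # 3 x HW
    pts_t = (np.linalg.inv(K) @ pix) * depth_t.reshape(1, -1)  # back-project
    pts_s = T_t2s[:3, :3] @ pts_t + T_t2s[:3, 3:4]             # rigid motion
    proj = K @ pts_s                                           # re-project
    us = np.round(proj[0] / proj[2]).astype(int)
    vs = np.round(proj[1] / proj[2]).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    recon = np.zeros((H, W))
    recon.ravel()[valid] = src_img[vs[valid], us[valid]]
    return recon, valid.reshape(H, W)

# Toy case: camera translates 0.5 m along x, scene is a fronto-parallel
# plane at 10 m, so every pixel shifts by f * tx / z = 100 * 0.5 / 10 = 5 px.
K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 16.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.5
src = np.tile(np.arange(32.0), (32, 1))   # source image: intensity = column
depth = np.full((32, 32), 10.0)           # the "predicted" depth map
recon, valid = warp_source_to_target(src, depth, K, T)
print(recon[10, 10])  # 15.0, i.e. src sampled 5 columns to the right
```

In the real pipeline, a photometric loss between `recon` and the target frame supervises both the depth CNN and the pose network; only the depth CNN is kept at inference time.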
Self-supervised monocular depth estimation using stereo: yet another interesting approach. Here (fig 11), instead of taking an image sequence as input, the model predicts the disparities d_l, d_r from only the left RGB image, I_l. Similar to the above method, a spatial transformer network warps the RGB image pair I_l, I_r using the disparities. Recall that x2 = x1 - d, so the paired view can be synthesised, and a reconstruction loss between the reconstructed views I_pred_l, I_pred_r and the target views I_l, I_r is used to supervise the training.
For this method to work, the baseline must be horizontal and known, and the image pair must be rectified, so that the transformation via disparity is accurate and the calculation d = x1 - x2 holds, as in fig 8.
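A minimal sketch of that warping step, under the convention x2 = x1 - d used above (nearest-neighbour sampling for brevity; real models use differentiable bilinear sampling, and the toy images below are made up):

```python
import numpy as np

def reconstruct_left(right_img, disp_left):
    """Synthesise the left view by sampling the right image at x - d.

    disp_left holds the disparity predicted for each left-image pixel;
    with rectified images and x_r = x_l - d, the left pixel (v, x)
    should look like the right pixel (v, x - d).
    """
    H, W = right_img.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    us = np.round(u - disp_left).astype(int)   # matching right column
    valid = (us >= 0) & (us < W)
    recon = np.zeros((H, W))
    recon[valid] = right_img[v[valid], us[valid]]
    return recon, valid

# Toy case: constant 3 px disparity; the right image encodes its column
# index, so the reconstruction should read u - 3 wherever sampling is valid.
right = np.tile(np.arange(16.0), (16, 1))
recon, valid = reconstruct_left(right, np.full((16, 16), 3.0))
print(recon[0, 5])  # 2.0
```

Training then penalises the photometric difference between `recon` and the real left image, so the network is pushed toward disparities that make the warp consistent.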
Why is measuring depth so difficult?
Lastly, let’s try to understand some of the fundamental problems of depth estimation. The main culprit lies in the projection of 3D views to 2D images. Another deep-seated problem arises when there is motion and there are moving objects in the scene. We will go through them in this section.
Depth Estimation is ill-posed
Often, when conducting research on monocular depth estimation, authors mention that estimating depth from a single RGB image is an ill-posed inverse problem. What this means is that many different 3D scenes observed in the world can correspond to the same 2D image (fig 6 & 7).
Ill-posed: Scale ambiguity
Recall that adjusting the focal length will proportionately scale the points on the image plane. Now, suppose we scale the entire scene X by some factor k and, at the same time, scale the camera matrix P by a factor of 1/k. The projections of the scene points in the image remain exactly the same:
x = PX = (1/k)P * (kX) = x
That is to say, we can never recover the exact scale of the actual scene from the image alone!
Note that this issue only exists for monocular techniques; the scale can be recovered for a stereo rig with a known baseline.
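The ambiguity is easy to verify numerically. The sketch below (with a made-up intrinsic matrix and pose) states the same identity in non-homogeneous form: a scene k times larger, seen by a camera whose translation is also k times larger, produces exactly the same image.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.2])

def project(K, R, t, X):
    """Pinhole projection: x = K(RX + t), then divide by depth."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

X = np.array([1.0, 2.0, 10.0])   # a 3D point, in metres
k = 7.0                          # the unobservable global scale

# Scaling the scene AND the translation by k leaves the pixels untouched,
# so no image measurement can ever reveal k.
assert np.allclose(project(K, R, t, X), project(K, R, k * t, k * X))
```

Scaling only the scene (without the matching camera change) does move the pixels, which is exactly why a stereo rig with a known baseline pins the scale down.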
Ill-posed: Projection ambiguity
Suppose we perform a geometric transformation of the scene. It is possible that, after the transformation, the points map to the same locations on the image plane, once again leaving us with the same difficulty. See the figure below.