360° Computer Vision: Problems, solutions, and a suggestion for the path ahead

This is interesting, but how does it change anything?

Well, it can become an even better spherical approximation if we’re willing to budge a little on the perfectly equal-sized faces. Using Loop subdivision (Loop, 1987), we can interpolate the vertices and start to approximate a sphere:

Subdividing the icosahedron using Loop subdivision.

If we do this ad infinitum, our icosahedron will eventually be indistinguishable from the sphere. In effect, we can consider this to be a 3D method of exhaustion.

A quick note on nomenclature. When talking about subdivided icosahedra, I’m going to refer to the 20-face regular icosahedron as a level 0 icosahedron, while after one subdivision, it will be called level 1, etc. This is simply an easy way to refer to different subdivision levels.

Why use the subdivided icosahedron?

If it’s not immediately clear from the subdivision pictures above, the answer to this question is pretty simple. Summarizing from one of the most prominent minds in modern cartography (Kimerling et al., 1999):

The subdivided icosahedron is among the least-distorted spherical representations.

As we talked about before: cartographers have been looking at this problem for millennia. Let’s not try to reinvent the wheel and, instead, leverage some existing insight into the matter.

Take a look at the figure below. Here, I’ve projected our Earth image onto the faces of the original icosahedron via gnomonic projection, and I’ve unfolded the icosahedron to its net. I’ve also superimposed Tissot’s indicatrices on the faces so you can visualize the distortion characteristic.

The net of the level 0 icosahedron with Tissot’s indicatrices superimposed. Look at how little distortion there is!

Look at how nearly-uniform those circles are. Again, our goal is to have perfect circles that are all the same size. I’d say we get pretty close here.

So, clearly the icosahedron is great for reducing distortion. But, there is also another reason it’s become so popular:

Subdivision parallels image up-sampling

Take a look at the equations below. These show how to compute the number of faces and vertices for each successive subdivision level.

Computing the number of faces and vertices at the k-th subdivision level:

F(k) = 20 · 4^k
V(k) = 10 · 4^k + 2

Take note of that factor of 4. A subdivision turns 1 face into 4 faces, just like how image up-sampling turns 1 pixel into 4 pixels.

Showing how subdivision (top row) parallels image up-sampling (bottom row).

For vertices, it’s not quite the same, but it’s fairly close.
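To make the factor of 4 concrete, here is a quick sanity check of those formulas in Python (my own illustration, just tabulating the equations above):

```python
def num_faces(k):
    """Faces of a level-k subdivided icosahedron: each subdivision quadruples them."""
    return 20 * 4 ** k

def num_vertices(k):
    """Vertices of a level-k subdivided icosahedron."""
    return 10 * 4 ** k + 2

for k in range(4):
    print(f"level {k}: {num_faces(k)} faces, {num_vertices(k)} vertices")
# level 0: 20 faces, 12 vertices
# level 1: 80 faces, 42 vertices
# level 2: 320 faces, 162 vertices
# level 3: 1280 faces, 642 vertices
```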

Now remember, many of our favorite fully-convolutional network architectures (like FCNs, U-Nets, and ResNets) incorporate these down-sampling and up-sampling operations to encode and then decode learned features. Because subdivision has this nice parallel with these operations, the subdivided icosahedron can fit nicely into our existing CNN paradigm.

So many ways to convolve on the icosahedron!

As I pointed out before, six different papers proposed novel icosahedron-based analysis methods in 2019. Let’s take a shallow dive into each of these papers to see, empirically, how and why the icosahedron works, and what considerations we have to make to use it.

I will start this section with my own contribution to this approach, because I directly compare different spherical representations.

Mapped Convolutions (Eder et al., arXiv 2019) [Code]
Convolutions on Spherical Images (Eder and Frahm, CVPR Workshops 2019, Oral)

Demonstrating how mapped convolution can project the convolutional grid onto a sphere. (Image source: Eder et al., arXiv 2019)

The key contribution of these papers is to tout the representational benefits of the subdivided icosahedron and to provide a solution to convolve on its surface. In these works, I look at the tasks of depth estimation and semantic segmentation using an equirectangular image, a cube map, and a level 7 subdivided icosahedron. To provide an apples-to-apples comparison, I generalize the location-adaptive methods by developing a “mapped convolution” operation that accepts an adjacency list to determine where a convolutional kernel should sample. This approach allows us to map the kernel to the faces of the subdivided icosahedron without changing the convolution operation in any way. The results show a 12.6% improvement in overall semantic segmentation mean IOU and a nearly 17% improvement in absolute error for depth estimation,

Simply by resampling the image to the subdivided icosahedron!

These outcomes reinforce the importance of the choice of spherical image representation.
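To make the mapped convolution idea concrete, here is a minimal sketch of the operation in PyTorch. This is my illustration, not the paper’s implementation: it assumes the adjacency list for the mesh has already been precomputed, and the names are hypothetical.

```python
import torch

def mapped_conv(features, weights, sample_index):
    """Convolution whose kernel support comes from a precomputed adjacency
    list rather than a fixed pixel grid.

    features:     (C_in, V) features at V mesh locations (e.g. icosahedron faces)
    weights:      (C_out, C_in, K) weights of a K-tap kernel
    sample_index: (V, K) long tensor; row v lists the K locations the
                  kernel samples when producing output v
    """
    gathered = features[:, sample_index]                    # (C_in, V, K)
    return torch.einsum('ock,cvk->ov', weights, gathered)   # (C_out, V)
```

Because the sampling pattern is just data, the same operation works on a plane, a cube map, or a subdivided icosahedron; only the adjacency list changes.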

Cube map filter ambiguity at the corners (Image source: Eder et al., arXiv 2019)

An additional result is that the cube map is not a great choice of spherical image representation, due to the orientation inconsistency at the ±Y faces that we highlighted earlier (remember how content radiates from the center?) and the filter ambiguity at the corners.

The problem with my mapped convolution approach in these works is that it can get quite slow as network depth or image resolution increases.

Spherical CNNs on Unstructured Grids (Jiang et al., ICLR 2019) [Code]

Visualizing how convolution is reparameterized as the linear combination of differential operators on a spherical mesh in the UG-SCNN paper. (Image source: Jiang et al., ICLR 2019)

This paper is where my mutually-exclusive division of the related work falls apart. This approach uses the subdivided icosahedron, but it is also a type of reparameterization method. This work, often abbreviated as UG-SCNN, reparameterizes convolution as a linear combination of differential operators on the surface of an icosahedral mesh. This was the first work to attempt to take advantage of the lower distortion present in the icosahedron representation as well as the efficiency gains that can be achieved by reparameterizing the convolution. This method circumvents the scalability problem of kernel modifications by leveraging fast differential computations to approximate convolution, and it scales better to higher resolution spherical images as a result. This scalability comes at the cost of transferability, however. Because it no longer uses the traditional convolution operation, it does not permit network reuse. That being said, it still provides some of the best performance to date for spherical image tasks.
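In spirit, the reparameterization looks something like the sketch below (a hedged illustration, not the authors’ code): sparse differential operators, assumed precomputed for the mesh, are applied to the features, and the learnable part reduces to per-operator channel mixing.

```python
import torch
import torch.nn as nn

class DiffOpConv(nn.Module):
    """Convolution as a learned linear combination of differential operators
    on a mesh, in the spirit of UG-SCNN. `operators` is a list of (V, V)
    sparse matrices, e.g. identity, two first derivatives, and a Laplacian."""
    def __init__(self, c_in, c_out, operators):
        super().__init__()
        self.operators = operators
        self.mix = nn.ModuleList(
            nn.Linear(c_in, c_out, bias=False) for _ in operators)

    def forward(self, x):                            # x: (V, C_in)
        out = 0
        for op, lin in zip(self.operators, self.mix):
            out = out + lin(torch.sparse.mm(op, x))  # apply operator, mix channels
        return out                                   # (V, C_out)
```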

Gauge Equivariant Convolutional Networks and the Icosahedral CNN (Cohen et al., ICML 2019)

Illustrating how the charts used by Cohen et al. are formed from strips of the icosahedron and aligned to a pixel grid representation. (Image source: Cohen et al., ICML 2019)

I’m not going to go into too much depth on this paper, because someone else has already written a great explanatory overview of it. The high-level summary, though, is that Cohen et al. recognize that the icosahedral net consists of five parallelogram strips that can have one of six orientations, or gauges, around the icosahedron. They use these strips to define an atlas of charts that relate the planar parallelograms to locations on the icosahedron. They are then able to apply standard 2D 3×3 convolutional filters (masking out 2 weights) on the charts, using a gauge transform to ensure consistent feature orientation. Like the other methods in this section, this paper addresses the issue of filter orientability. However, to the credit of the author, Taco Cohen, this work, like his previous Spherical CNNs paper, formalizes the problem in a rigorous way.

There are a couple of things to observe about this method. First, the use of the standard 2D convolution operator. This is great, because many of the other approaches either modify the convolutional kernel (i.e. location-adaptive methods) or approximate it (i.e. reparameterization methods). Those changes either slow things down, inhibiting scalability, or break transferability. By using the 2D convolution, this approach can take advantage of efficient convolution implementations. The drawback of this approach is that the charts mean we can’t use Loop subdivision to more closely approximate the sphere. Unlike Mapped Convolutions and UG-SCNN, which operate on a mesh representation, we are limited here to the distortion-reduction properties of the level 0 icosahedron (which are still pretty good, by the way).
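The masking itself is simple. Here is a rough sketch (the chart layout, padding, and gauge transforms are the real substance of the paper and are omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HexMaskedConv2d(nn.Conv2d):
    """A standard 3x3 convolution with one pair of opposite corner weights
    zeroed out, yielding the 7-tap hexagonal kernel used on the icosahedral
    charts. Which pair is masked depends on the chart layout."""
    def forward(self, x):
        mask = torch.ones_like(self.weight)
        mask[..., 0, 2] = 0   # top-right corner
        mask[..., 2, 0] = 0   # bottom-left corner
        return F.conv2d(x, self.weight * mask, self.bias,
                        self.stride, self.padding)

conv = HexMaskedConv2d(16, 32, kernel_size=3, padding=1)
```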

SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360 Degree Images (Lee et al., CVPR 2019)

Demonstrating the different kernel types and orientations defined on the faces of the icosahedron. (Image source: Lee et al., CVPR 2019)

Like the three laid out above, this paper also proposed the use of the subdivided icosahedron as a spherical image representation. This work analyzes the distortion properties of the representation and defines new, orientation-dependent kernels for convolution and spatial pooling. The authors also propose a weight-sharing design to address the differing orientations of faces across the icosahedron. In some ways, because Lee et al. redefine the convolution operator, this can be considered a reparameterization method as well.

Orientation-Aware Semantic Segmentation on Icosahedron Spheres (Zhang et al., ICCV 2019)

Visualization of how the authors use a hexagonal convolution on the icosahedral net to perform semantic segmentation. (Image source: Zhang et al., ICCV 2019)

This final icosahedral paper of 2019 proposes a method that makes use of the icosahedral net. The authors define a special hexagonal convolution on the vertices of the icosahedron that can interpolate to and from the standard 2D convolution operation. In this way, they provide transferability for existing networks, while still operating on the triangularly tessellated icosahedron.

Focused on the task of semantic segmentation, this approach demonstrates the best scalability and transferability of the 2019 papers (and the ones that came before them). Like Cohen et al. (2019), it is limited to the distortion properties of the level 0 icosahedron because it represents an image on the icosahedral net. Once again, though, let’s remember that even the level 0 icosahedron is a lot better than the cube map and equirectangular image in terms of distortion. This shows up in this approach’s high accuracy and IOU scores for semantic segmentation.

Representational drawbacks

Despite all this recent work touting the benefits of the subdivided icosahedron, it has some drawbacks as well.

First, it’s composed of (“tessellated by”) a bunch of triangles. The 2D convolution we know and love is built for pixel grids. As a result, these methods have to either:

(1) modify the convolutional kernel, which means we can’t take advantage of super-efficient implementations provided by many popular deep learning libraries (e.g. PyTorch, TensorFlow, cuDNN, etc.), or

(2) add some extra operations like special padding, transforms, or interpolation to the pipeline

With these changes, we typically either run into scalability issues for high resolution images and deep architectures or we lose or impede network transferability.

The second drawback actually comes from something we thought was a benefit. It turns out that our beautiful analogy between up-sampling and subdivision is an albatross.

With this analogy, if we want to represent high resolution spherical images, we need to use higher and higher subdivision levels to get there. There’s a reason most of the listed work operated at levels 5 (“UG-SCNN,” “Gauge Equivariant CNNs”), 7 (“SpherePHD,” “Mapped Convolutions”), and, most recently, 8 (“Orientation-Aware Semantic Segmentation”).

Most existing work doesn’t scale well to high resolution spherical images. The equivalent equirectangular resolution is given under each image. Note the angular resolution of these images, highlighted in green.

Our subdivision-to-up-sampling analogy does not scale!

But, this is a problem!

Remember what we said earlier:

Discretizing the world into pixels is a lossy operation

A measure of how much detail is preserved by any pixel representation is the angular resolution of the image. This is simply the field of view in a certain direction divided by the number of pixels along that dimension. A lower angular resolution (fewer degrees per pixel) means a more detailed image.
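For concreteness, here is the arithmetic in a few lines of Python. The ~60° horizontal FOV for the VGA camera is my assumption (typical for consumer cameras), and the level-to-equirectangular mapping uses the rough level 10 ≈ 4k equivalence this article relies on later:

```python
def angular_resolution(fov_deg, pixels):
    """Degrees of field of view covered by each pixel; lower = more detail."""
    return fov_deg / pixels

# Central-perspective VGA image, assuming a ~60 degree horizontal FOV
print(angular_resolution(60, 640))                   # ~0.094 deg/px

# Equirectangular images span 360 degrees horizontally; assume a level-k
# icosahedron is roughly equivalent to a (4096 / 2**(10 - k))-wide equirect
for k in (5, 7, 8, 10):
    print(k, angular_resolution(360, 4096 // 2 ** (10 - k)))
# level 5: ~2.81, level 7: ~0.70, level 8: ~0.35, level 10: ~0.088 deg/px
```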

Take a look at the angular resolution of a central-perspective VGA image below. It has a much lower angular resolution than the level 5, 7, and 8 spherical images. It’s most similar to a level 10, which is 4 times larger than the highest resolution examined by the icosahedral methods we just reviewed.

Compare the angular resolutions highlighted in yellow here to those highlighted in green in the previous figure. We need at least a level 10 icosahedron to start hitting the level of detail available to us in a standard VGA image.

If we want to provide our network with the same high level of detail that is available to our state-of-the-art networks trained on central-perspective images, we need to scale to much higher resolution spherical images than we currently do. But this subdivision/up-sampling analogy is holding us back!

Let’s quickly recall why we wanted to use the icosahedron to begin with: distortion reduction.

A reminder of how low-distortion the icosahedron is.

It turns out that we’re about as good as we’re going to get after ~3 subdivisions.

This graph shows the ratio of the subdivided icosahedron’s surface area to that of a sphere. A ratio of 1.0 is a perfect approximation. After 3 subdivision levels, we’re pretty close. Distortion is largely nullified at this point. (Image source: Eder et al., CVPR 2020)
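You can verify this convergence yourself with a few lines of numpy. This sketch is mine, and it uses midpoint subdivision re-projected to the unit sphere as a simple stand-in for the subdivision scheme in the paper:

```python
import numpy as np

def icosahedron():
    """Regular icosahedron inscribed in the unit sphere."""
    p = (1 + 5 ** 0.5) / 2
    v = np.array([[-1, p, 0], [1, p, 0], [-1, -p, 0], [1, -p, 0],
                  [0, -1, p], [0, 1, p], [0, -1, -p], [0, 1, -p],
                  [p, 0, -1], [p, 0, 1], [-p, 0, -1], [-p, 0, 1]], float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    f = np.array([[0, 11, 5], [0, 5, 1], [0, 1, 7], [0, 7, 10], [0, 10, 11],
                  [1, 5, 9], [5, 11, 4], [11, 10, 2], [10, 7, 6], [7, 1, 8],
                  [3, 9, 4], [3, 4, 2], [3, 2, 6], [3, 6, 8], [3, 8, 9],
                  [4, 9, 5], [2, 4, 11], [6, 2, 10], [8, 6, 7], [9, 8, 1]])
    return v, f

def subdivide(v, f):
    """One midpoint subdivision (1 face -> 4 faces), re-projected to the sphere."""
    verts, cache = list(v), {}
    def mid(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = (verts[i] + verts[j]) / 2
            cache[key] = len(verts)
            verts.append(m / np.linalg.norm(m))
        return cache[key]
    new_f = []
    for a, b, c in f:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        new_f += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.array(verts), np.array(new_f)

def area_ratio(v, f):
    """Ratio of the mesh's surface area to the unit sphere's (4 * pi)."""
    t = v[f]
    areas = 0.5 * np.linalg.norm(np.cross(t[:, 1] - t[:, 0], t[:, 2] - t[:, 0]), axis=1)
    return areas.sum() / (4 * np.pi)

v, f = icosahedron()
for level in range(5):
    print(level, round(area_ratio(v, f), 4))  # climbs from ~0.76 toward 1.0
    v, f = subdivide(v, f)
```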

Any additional subdividing is only necessary to match the spherical image resolution. This is problematic if we think back to those 3 guiding principles I laid out at the beginning of this article. We need scalability for 360° images. But this analogy is not the way to get there.

A potential compromise solution: tangent images

In the last part of this section, I am going to explain my most recent work on this subject. I am a big believer that approaches that focus on new representations, like the one I present here, provide the best bet for an all-around solution for 360° computer vision.

Tangent Images for Mitigating Spherical Distortion (Eder et al., CVPR 2020)[Code]

Visualization of the tangent image representation, which is derived from the icosahedron. (Image source: Eder et al., CVPR 2020)

This work addresses many of the shortcomings of the aforementioned research (representational difficulties, scalability limitations, and transferability concerns), and:

It provides the best scalability and transfer performance of any approach so far, by a wide margin.

One of the reasons for this is that, in this approach, I depart slightly from the true icosahedron. Instead, I propose a new representation derived from the icosahedron, which I call the tangent image representation.

What are tangent images?

Tangent images are the gnomonic projection of a spherical image onto square, oriented pixel grids set tangent to the sphere at the center of each icosahedral face.

Displaying how tangent images are created. From left: a plane set tangent to a face of the icosahedron; a random set of such planes — note how they are all consistently oriented; tessellating each plane with squares turns each one into a pixel grid that we can project an image onto; a clipped visualization of the full tangent image representation.
Representing the Earth with tangent images (rotation added for visual effect)
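For reference, the forward gnomonic projection itself is only a couple of lines; these are the standard map-projection formulas (see, e.g., Snyder, 1987):

```python
import numpy as np

def gnomonic(lat, lon, lat0, lon0):
    """Project (lat, lon) onto the plane tangent to the unit sphere at
    (lat0, lon0). All angles in radians; valid for points on the same
    hemisphere as the tangent point."""
    cos_c = (np.sin(lat0) * np.sin(lat)
             + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0))
    x = np.cos(lat) * np.sin(lon - lon0) / cos_c
    y = (np.cos(lat0) * np.sin(lat)
         - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / cos_c
    return x, y
```

In practice, it’s the inverse mapping that fills each tangent image’s pixels by sampling from the spherical image.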

To create tangent images, we first set a base level, b, of subdivision. This determines our distortion characteristic, the number of tangent images we’re going to generate (it’s equal to the number of faces of that base level), and the field of view of each tangent image.

The dimension of each square tangent image, d, will depend on the resolution of our spherical image, s (in terms of equivalent subdivision level), by the relation:

d = 2^(s − b)

So let’s say we have a 4k equirectangular image, which is roughly equivalent to a level 10 icosahedron. With a base level of 1, we will create a set of 80 512×512 pixel images.

The nice thing about this design is that it actually preserves the factor-of-four scaling we get from image up-sampling and subdivision, but without tethering subdivision to 360° image resolution.
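In code, the whole layout comes down to two lines of arithmetic (a sketch; the function name is mine):

```python
def tangent_image_layout(s, b):
    """Number and side length of the tangent images for a spherical image
    of (equivalent) subdivision level s at base level b."""
    n = 20 * 4 ** b    # one tangent image per face of the base-level icosahedron
    d = 2 ** (s - b)   # square tangent image dimension
    return n, d

print(tangent_image_layout(10, 1))   # (80, 512) -- the 4k example above
```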

With this design, tangent images decouple subdivision level from image resolution, resolving the problem of scalability. As they are derived from the icosahedron, they provide a very low distortion characteristic as well:

A base level 1 tangent image with Tissot’s indicatrices in red. The associated face is projected in yellow.

And finally, because they are a pixel grid representation, they require no changes to the convolutional kernel, nor any special padding, transformation, or interpolation during inference, which makes network transfer extremely easy.

In fact, using them is easy. Simply resample to tangent images, run whatever algorithm you want, and then resample back to the sphere. Two resampling operations are all you need:

Operational flow for using tangent images. (Example image from the Stanford 2D-3D-S dataset)
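As a sketch of that flow (the helper names here are hypothetical placeholders; see the linked code release for the actual API):

```python
# Hypothetical helper names, shown only to illustrate the three-step flow
sphere = load_equirectangular("pano.png")                    # 360-degree input
tangents = resample_to_tangent_images(sphere, base_level=1)  # 80 pixel grids
outputs = [pretrained_2d_network(t) for t in tangents]       # any 2D CNN, unmodified
result = resample_to_sphere(outputs, base_level=1)           # back to the sphere
```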

Here’s an example of a 360° image with a level 6 resolution represented by tangent images at base levels 0 and 1.

I’ll point you to the paper for the nitty-gritty details, but suffice it to say that, with tangent images:

  • We can efficiently operate at resolutions at least 4× higher than any prior work (level 10 spherical images)
  • We see a 10% improvement in network transfer performance, without fine-tuning
  • We even unlock improved performance, thanks to the extra-wide field of view of 360° images, compared to an equivalent network trained only on central-perspective images

The final, really cool thing about tangent images is:

They’re not just for deep learning!

Check out how tangent images improve SIFT keypoint detection (Lowe, 1999), which is a key step in structure from motion or sparse SLAM:

Left: SIFT keypoints detected on the equirectangular image. Right: SIFT keypoints detected using the tangent image representation. Notice there’s a lot less noise on the right. Quantitative results in the paper.

Even low-level tasks, like edge detection, are improved using tangent images:

Running the Canny edge detection algorithm (Canny, 1986) on tangent images preserves the strong, detailed edges without the noise induced by spherical distortion on the equirectangular image.

This is possible because:

We have addressed spherical distortion through the representation, yet we still use a pixel grid.

This facilitates very easy usage and a very broad application scope.

Now, I don’t want to end this on a low note, but there are still some drawbacks to tangent images.

Perhaps the most important one is that we are, in some ways, giving up the biggest benefit of 360°: the wide field of view. You can think of tangent images as modeling a polydioptric rig, with each camera arranged in a specific way, with a specific field of view, sharing the same center of projection.

Although the initial results are exciting, there’s still a lot of interesting work to be done for more advanced ways to share information between views during the learning process.