360° Computer Vision: Problems, solutions, and a suggestion for the path ahead

Original article was published on Deep Learning on Medium

This is interesting, but how does it change anything?

Well, it can become an even better spherical approximation if we’re willing to budge a little on the perfectly equal-sized faces. Using Loop subdivision (Loop, 1987), we can interpolate the vertices and start to approximate a sphere:

If we do this ad infinitum, our icosahedron will eventually be indistinguishable from the sphere. In effect, we can consider this to be a 3D method of exhaustion.
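To make the idea concrete, here is a minimal sketch of repeated subdivision. Note the assumption: it splits each triangle at its edge midpoints and pushes the new vertices out to the unit sphere, which is a simpler stand-in for Loop subdivision rather than Loop's scheme itself. The function names `icosahedron` and `subdivide` are hypothetical.

```python
import numpy as np

def icosahedron():
    """Return (vertices, faces) of a unit-radius, 20-face icosahedron."""
    phi = (1 + np.sqrt(5)) / 2
    verts = np.array([
        [-1, phi, 0], [1, phi, 0], [-1, -phi, 0], [1, -phi, 0],
        [0, -1, phi], [0, 1, phi], [0, -1, -phi], [0, 1, -phi],
        [phi, 0, -1], [phi, 0, 1], [-phi, 0, -1], [-phi, 0, 1],
    ], dtype=float)
    verts /= np.linalg.norm(verts, axis=1, keepdims=True)
    faces = [
        (0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
        (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
        (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
        (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1),
    ]
    return verts, faces

def subdivide(verts, faces):
    """Split every triangle into 4; new vertices are normalized onto the sphere."""
    verts = [tuple(v) for v in verts]
    midpoint_cache = {}  # shared edges must reuse the same new vertex

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint_cache:
            m = (np.array(verts[i]) + np.array(verts[j])) / 2
            verts.append(tuple(m / np.linalg.norm(m)))
            midpoint_cache[key] = len(verts) - 1
        return midpoint_cache[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(verts), new_faces

verts, faces = icosahedron()
for level in range(1, 4):
    verts, faces = subdivide(verts, faces)
    print(level, len(verts), len(faces))
```

Every vertex stays on the unit sphere while the flat faces shrink, so the mesh hugs the sphere more closely with each pass.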

A quick note on nomenclature. When talking about subdivided icosahedra, I’m going to refer to the 20-face regular icosahedron as a level 0 icosahedron, while after one subdivision, it will be called level 1, etc. This is simply an easy way to refer to different subdivision levels.

Why use the subdivided icosahedron?

If it’s not immediately clear from the subdivision pictures above, the answer to this question is pretty simple. Summarizing from one of the most prominent minds in modern cartography (Kimerling et al., 1999):

The subdivided icosahedron is among the least-distorted spherical representations.

As we discussed earlier, cartographers have been looking at this problem for millennia. Let’s not try to reinvent the wheel and, instead, leverage some existing insight into the matter.

Take a look at the figure below. Here, I’ve projected our Earth image onto the faces of the original icosahedron via gnomonic projection, and I’ve unfolded the icosahedron to its net. I’ve also superimposed Tissot’s indicatrices on the faces so you can visualize the distortion characteristics.

Look at how nearly-uniform those circles are. Again, our goal is to have perfect circles that are all the same size. I’d say we get pretty close here.

So, clearly the icosahedron is great for reducing distortion. But, there is also another reason it’s become so popular:

Subdivision parallels image up-sampling

Take a look at the equations below. These show how to compute the number of faces and vertices for each successive subdivision level.
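Assuming the usual split of every triangle into four, with one new vertex added per edge, the face count F and vertex count V at subdivision level l follow directly from the 20 faces and 12 vertices of the level 0 icosahedron:

```latex
F_{l} = 20 \cdot 4^{l}, \qquad V_{l} = 10 \cdot 4^{l} + 2

F_{l+1} = 4\,F_{l}, \qquad V_{l+1} = 4\,V_{l} - 6
```

For example, level 1 has 80 faces and 42 vertices.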

Take note of that factor of 4. A subdivision turns 1 face into 4 faces, just like how image up-sampling turns 1 pixel into 4 pixels.

For vertices, it’s not quite the same, but it’s fairly close.

Now remember, many of our favorite fully-convolutional network architectures (like FCNs, U-Nets, and ResNets) incorporate these down-sampling and up-sampling operations to encode and then decode learned features. Because subdivision has this nice parallel with these operations, the subdivided icosahedron can fit nicely into our existing CNN paradigm.

So many ways to convolve on the icosahedron!

As I pointed out before, 6 different papers proposed novel analysis using the icosahedron in 2019. Let’s take a shallow dive into each of these papers to see, empirically, how and why the icosahedron works, and what considerations we have to make to use it.

I will start this section with my own contribution to this approach, because I directly compare different spherical representations.

Mapped Convolutions (Eder et al., arXiv 2019) [Code]
Convolutions on Spherical Images (Eder and Frahm, CVPR Workshops 2019, Oral)

The key contribution of these papers is to tout the representational benefits of the subdivided icosahedron and to provide a solution to convolve on its surface. In these works, I look at the tasks of depth estimation and semantic segmentation using an equirectangular image, a cube map, and a level 7 subdivided icosahedron. To provide an apples-to-apples comparison, I generalize the location-adaptive methods by developing a “mapped convolution” operation that accepts an adjacency list to determine where a convolutional kernel should sample. This approach allows us to map the kernel to the faces of the subdivided icosahedron without changing the convolution operation in any way. The results show a 12.6% improvement in overall semantic segmentation mean IOU and a nearly 17% improvement in absolute error for depth estimation,

Simply by resampling the image to the subdivided icosahedron!

These outcomes reinforce the importance of the choice of spherical image representation.

An additional result is that the cube map is not a great choice of spherical image representation, due to the orientation inconsistency at the +/-Y faces that we highlighted earlier (remember how content radiates from the center?) and filter ambiguity at the corners.

The problem with my mapped convolution approach in these works is that it can get quite slow as network depth or image resolution increases.
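The adjacency-list idea can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the released implementation: sample locations are integer indices (no interpolation), and the name `mapped_conv` and its argument layout are hypothetical.

```python
import numpy as np

def mapped_conv(x, sample_map, weights, bias=0.0):
    """
    x:          (N, C_in) features, one row per location (pixel, face, or vertex)
    sample_map: (N, K) integer indices; row n lists where each of the K
                kernel taps reads from when the kernel is centered at n
    weights:    (K, C_in, C_out), one weight matrix per kernel tap
    returns:    (N, C_out)
    """
    gathered = x[sample_map]  # (N, K, C_in): look up every tap's input
    return np.einsum("nkc,kco->no", gathered, weights) + bias

# Toy example: 6 locations on a ring, each reading its left neighbor,
# itself, and its right neighbor.
N = 6
idx = np.arange(N)
sample_map = np.stack([(idx - 1) % N, idx, (idx + 1) % N], axis=1)
x = np.arange(N, dtype=float).reshape(N, 1)
weights = np.ones((3, 1, 1))
out = mapped_conv(x, sample_map, weights)
```

The convolution itself never changes. Only `sample_map` encodes the geometry, so the same operation could serve a pixel grid, a cube map, or the faces of a subdivided icosahedron.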

Spherical CNNs on Unstructured Grids (Jiang et al., ICLR 2019) [Code]

This paper is where my mutually-exclusive division of the related work falls apart. This approach uses the subdivided icosahedron, but it is also a type of reparameterization method. This work, often abbreviated as UG-SCNN, reparameterizes convolution as a linear combination of differential operators on the surface of an icosahedral mesh. This was the first work to attempt to take advantage of both the lower distortion of the icosahedron representation and the efficiency gains that can be achieved by reparameterizing the convolution. This method circumvents the scalability problem of kernel modifications by leveraging fast differential computations to approximate convolution, and it scales better to higher resolution spherical images as a result. This scalability comes at the cost of transferability, however. Because it no longer uses the traditional convolution operation, it does not permit network reuse. That being said, it still provides some of the best performance to date for spherical image tasks.
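To see what "a linear combination of differential operators" can look like, here is a toy sketch. It is an assumption-heavy illustration, not the paper's method: the operators are finite differences on a periodic 1D ring instead of operators on an icosahedral mesh, and `ring_operators` and `diffop_conv` are hypothetical names.

```python
import numpy as np

def ring_operators(n):
    """Identity, central first difference, and second difference on a ring of n samples."""
    eye = np.eye(n)
    nxt = np.roll(eye, 1, axis=1)    # (nxt @ x)[i] == x[(i + 1) % n]
    prv = np.roll(eye, -1, axis=1)   # (prv @ x)[i] == x[(i - 1) % n]
    return [eye, (nxt - prv) / 2, nxt - 2 * eye + prv]

def diffop_conv(x, operators, theta):
    """y = sum_i theta[i] * (operators[i] @ x); theta holds the learnable scalars."""
    return sum(t * (op @ x) for t, op in zip(theta, operators))

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
ops = ring_operators(len(x))
y = diffop_conv(x, ops, theta=[1.0, 0.0, 1.0])
```

In this toy setting, the three coefficients span every 3-tap filter on the ring, so the filter is learned through a handful of scalars rather than through per-location kernel weights.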

Gauge Equivariant Convolutional Networks and the Icosahedral CNN (Cohen et al., ICML 2019)

I’m not going to go into too much depth on this paper, because someone else has already written a great explanatory overview of it. The high-level idea, though, is that Cohen et al. recognize that the icosahedral net consists of five parallelogram strips that can have one of six orientations, or gauges, around the icosahedron. They use these strips to define an atlas of charts that relate the planar parallelograms to locations on the icosahedron. They are then able to apply standard 2D 3×3 convolutional filters (masking out 2 weights) on the charts, using a gauge transform to ensure consistent feature orientation. Like the other methods in this section, this paper addresses the issue of filter orientability. However, perhaps speaking to the credit of the author, Taco Cohen, this work, like his previous Spherical CNNs paper, formalizes the problem in a rigorous way.

There are a couple of things to observe from this method. First, it uses the standard 2D convolution operator. This is great, because many of the other approaches either modify the convolutional kernel (i.e. location-adaptive methods) or approximate it (i.e. reparameterization methods). Those changes either slow things down, inhibiting scalability, or break transferability. By using the 2D convolution, this approach can take advantage of efficient convolution implementations. The drawback of this approach is that the charts mean we can’t use Loop subdivision to more closely approximate the sphere. Unlike with Mapped Convolutions and UG-SCNN, which operate on a mesh representation, we are limited here to the distortion-reduction properties of the level 0 icosahedron (which are still pretty good, by the way).
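The "3×3 filter with 2 weights masked out" can be sketched as follows. This shows only the masking step, under stated assumptions: which two corners get masked depends on how the triangular grid is laid out in the chart, so the mask below is one plausible choice, and the gauge transform and padding between charts are omitted entirely. `HEX_MASK` and `hex_conv2d` are hypothetical names.

```python
import numpy as np

# Assumed layout: the two masked corners are opposite each other, leaving
# the center plus 6 neighbors (a hexagonal neighborhood on the sheared grid).
HEX_MASK = np.array([[0, 1, 1],
                     [1, 1, 1],
                     [1, 1, 0]], dtype=float)

def hex_conv2d(image, kernel):
    """Zero-padded 3x3 cross-correlation that uses only the 7 unmasked taps."""
    k = kernel * HEX_MASK
    padded = np.pad(image, 1)
    h, w = image.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
response = hex_conv2d(impulse, np.ones((3, 3)))
```

Because the masked kernel is still an ordinary 3×3 filter, it can be handed to any standard 2D convolution routine, which is where the efficiency comes from.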

SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360 Degree Images (Lee et al., CVPR 2019)

Like the three approaches laid out above, this paper also proposes the use of the subdivided icosahedron as a spherical image representation. This work analyzes the distortion properties of the representation and defines new, orientation-dependent kernels for convolution and spatial pooling. The authors also propose a weight-sharing design to address the differing orientations of faces across the icosahedron. In some ways, because Lee et al. redefine the convolution operator, this can be considered a reparameterization method as well.

Orientation-Aware Semantic Segmentation on Icosahedron Spheres (Zhang et al., ICCV 2019)

This final icosahedral paper of 2019 proposes a method that makes use of the icosahedral net. The authors define a special hexagonal convolution on the vertices of the icosahedron that can interpolate to and from the standard 2D convolution operation. In this way, they provide transferability for existing networks, while still operating on the triangularly tessellated icosahedron.

Focused on the task of semantic segmentation, this approach demonstrates the best scalability and transferability of the 2019 papers (and the ones that came before them). Like Cohen et al. (2019), it is limited to the distortion properties of the level 0 icosahedron because it represents an image on the icosahedral net. Once again, though, let’s remember that even the level 0 icosahedron is a lot better than the cube map and equirectangular image in terms of distortion. This shows up in the high accuracy and IOU scores for semantic segmentation results in this approach.

Representational drawbacks

Despite all this recent work touting the benefits of the subdivided icosahedron, it has some drawbacks as well.

First, it’s composed of (“tessellated by”) a bunch of triangles. The 2D convolution we know and love is built for pixel grids. As a result, these methods have to either:

(1) modify the convolutional kernel, which means we can’t take advantage of super-efficient implementations provided by many popular deep learning libraries (e.g. PyTorch, TensorFlow, cuDNN, etc.), or

(2) add some extra operations like special padding, transforms, or interpolation to the pipeline

With these changes, we typically either run into scalability issues for high resolution images and deep architectures or we lose or impede network transferability.

The second drawback actually comes from something we thought was a benefit. It turns out that our beautiful analogy between up-sampling and subdivision is an albatross.

With this analogy, if we want to represent high resolution spherical images, we need to use higher and higher subdivision levels to get there. There’s a reason most of the listed work operated at levels 5 (“UG-SCNN,” “Gauge Equivariant CNNs”), 7 (“SpherePHD,” “Mapped Convolutions”), and, most recently, 8 (“Orientation-Aware Semantic Segmentation”).

Our subdivision-to-up-sampling analogy does not scale!

But this is a problem!

Remember what we said earlier:

Discretizing the world into pixels is a lossy operation

A measure of how much detail is preserved by any pixel representation is the angular resolution of the image. This is simply the field of view in a certain direction divided by the number of pixels along that dimension. A lower angular resolution (a smaller angle per pixel) means a more detailed image.

Take a look at the angular resolution of a central-perspective VGA image below. It has a much lower angular resolution than the levels 5, 7, and 8 spherical images. It’s most similar to a level 10, which is 4 times larger than the highest resolution examined by the icosahedral methods we just reviewed.
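As a rough sketch of the definition above, the snippet below computes degrees per pixel for a perspective image and a comparable estimate for a subdivided icosahedron. Two assumptions are mine: the 60° horizontal field of view for the VGA camera is a hypothetical value, and the icosahedron estimate simply gives every vertex an equal share of the sphere, which may not match how the original figure was computed.

```python
import math

def angular_resolution(fov_deg, num_pixels):
    """Degrees per pixel along one image dimension."""
    return fov_deg / num_pixels

def icosahedron_angular_resolution(level):
    """
    Rough estimate (an assumption): split the sphere's solid angle evenly
    among the 10 * 4**level + 2 vertices, then take the square root to get
    an angle per sample in degrees.
    """
    num_vertices = 10 * 4 ** level + 2
    sphere_sq_deg = 4 * math.pi * (180 / math.pi) ** 2
    return math.sqrt(sphere_sq_deg / num_vertices)

vga = angular_resolution(60.0, 640)  # hypothetical 60 degree horizontal FOV
print("VGA:", round(vga, 4))
for level in (5, 7, 8, 10):
    print("level", level, round(icosahedron_angular_resolution(level), 4))
```

Under this estimate, each extra subdivision level roughly halves the angle per sample, since the number of vertices grows by about a factor of 4.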

If we want to provide our network with the same high level of detail that is available to our state-of-the-art networks trained on central-perspective images, we need to scale to much higher resolution spherical images than we currently do. But this subdivision/up-sampling analogy is holding us back!

Let’s quickly recall why we wanted to use the icosahedron to begin with: distortion reduction.

It turns out that we’re about as good as we’re going to get after ~3 subdivisions.

Any additional subdividing is only necessary to match to the spherical image resolution. This is problematic if we think back to those 3 guiding principles I laid out at the beginning of this article. We need scalability for 360° images. But this analogy is not the way to get there.

A potential compromise solution: tangent images

In the last part of this section, I am going to explain my most recent work on this subject. I am a big believer that approaches that focus on new representations, like the one I present here, provide the best bet for an all-around solution for 360° computer vision.

Tangent Images for Mitigating Spherical Distortion (Eder et al., CVPR 2020) [Code]

This work addresses many of the shortcomings of the aforementioned research (representational difficulties, scalability limitations, and transferability concerns), and:

It provides the best scalability and transfer performance of any approach so far, by a wide margin.

One of the reasons for this is that, in this approach, I depart slightly from the true icosahedron. Instead, I propose a new representation derived from the icosahedron. I call this the tangent image representation.

What are tangent images?

Tangent images are the gnomonic projection of a spherical image onto square, oriented pixel grids set tangent to the sphere at the center of each icosahedral face.
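For readers who have not met the gnomonic projection before, here is a minimal sketch of the standard forward formula for a unit sphere. It covers only the projection onto one tangent plane; choosing the tangent points from the icosahedral faces, orienting the pixel grid, and resampling the image are left out, and `gnomonic_forward` is a hypothetical name.

```python
import numpy as np

def gnomonic_forward(lon, lat, lon0, lat0):
    """
    Project points (lon, lat in radians) onto the plane tangent to the unit
    sphere at (lon0, lat0). Returns (x, y, visible). Points on the far
    hemisphere cannot be projected and are flagged with visible == False.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Cosine of the angular distance from the tangent point
    cos_c = (np.sin(lat0) * np.sin(lat)
             + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0))
    visible = cos_c > 1e-9
    safe = np.where(visible, cos_c, 1.0)  # avoid dividing by zero or negatives
    x = np.cos(lat) * np.sin(lon - lon0) / safe
    y = (np.cos(lat0) * np.sin(lat)
         - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / safe
    return x, y, visible

# A point 45 degrees east of a tangent point on the equator
x, y, visible = gnomonic_forward(np.pi / 4, 0.0, 0.0, 0.0)
```

The tangent point itself lands at the origin of the plane, and distortion grows with distance from it, which is why keeping each tangent image's field of view small matters.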

To create tangent images, we first set a base level, b, of subdivision. This determines our distortion characteristic, the number of tangent images we’re going to generate (it’s equal to the number of faces of that base level), and the field of view of each tangent image.

The dimension of each square tangent image, d, will depend on the resolution of our spherical image, s (in terms of equivalent subdivision level), by the relation: