Eyes on the Sphere

Source: Deep Learning on Medium

An overview of Spherical CNNs by Taco Cohen and his team, which won a Best Paper Award at the 2018 ICLR conference!

Convolutional Neural Networks (CNNs), a class of deep learning neural networks, have become the go-to method for 2D image detection/classification because they produce accurate results without taking too much computing power or time. You can find out more about how they work here and about the original motivation, which was inspired by human brain activity, here. However, with the increased popularity of self-driving cars, drones, omnidirectional images, and other spherical data (such as wind maps and temperature maps), we arrive at the challenge of conducting image recognition on spherical images.

Example of spherical image data from Formula One. This is really cool! Drag around in the video to see views from all sides 🙂

Note that we are unable to just project the image on the sphere onto a plane because the regions would be immensely distorted.

Figure from Cohen’s Paper

In this figure, we see that a region highlighted in the middle (aligned over the equator) produces what looks like a square when projected onto a plane. If the sphere is oriented differently (rotated by some number of degrees) and that same region is projected onto the plane again, it now looks nothing like the square. The shape is not preserved, so it is very difficult for a traditional 2D CNN to detect regions on a spherical image.
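To make the distortion concrete, here is a minimal sketch that counts how many pixels of an equirectangular (latitude/longitude) grid fall inside a spherical cap of fixed angular size. The function name `cap_footprint`, the 1-degree grid, and the 10-degree cap are assumptions chosen for illustration; they do not come from the paper.

```python
import numpy as np

def cap_footprint(center_lat_deg, radius_deg=10.0, n_lat=180, n_lon=360):
    """Count lat/lon grid pixels inside a spherical cap (hypothetical helper)."""
    # Pixel centers of a 1-degree equirectangular grid.
    lat = np.deg2rad(np.linspace(-89.5, 89.5, n_lat))
    lon = np.deg2rad(np.linspace(-179.5, 179.5, n_lon))
    lon_g, lat_g = np.meshgrid(lon, lat)
    # Unit vectors on the sphere for every pixel center.
    pts = np.stack([np.cos(lat_g) * np.cos(lon_g),
                    np.cos(lat_g) * np.sin(lon_g),
                    np.sin(lat_g)], axis=-1)
    # Cap center at longitude 0 and the requested latitude.
    c = np.deg2rad(center_lat_deg)
    center = np.array([np.cos(c), 0.0, np.sin(c)])
    # A point is inside the cap if its angle to the center is <= radius.
    inside = pts @ center >= np.cos(np.deg2rad(radius_deg))
    return int(inside.sum())

at_equator = cap_footprint(0.0)
near_pole = cap_footprint(75.0)
print("pixels covered at the equator:", at_equator)
print("pixels covered at latitude 75:", near_pole)
```

The same patch of the sphere occupies several times more pixels once it is moved toward the pole, which is the kind of stretching a planar filter has no way to account for.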

Similarly, if I asked you what Antarctica looks like, your answer would probably be heavily influenced by the sort of map you are familiar with. Consider the views below, which I have collected from a variety of images I found on Google.

Which one is Antarctica? None of the images are my own and are used for educational purposes.

Antarctica looks very different in all of these pictures, which are just rotated views of the globe. So how do we construct a CNN to detect images on the sphere? In Cohen’s paper on Spherical CNNs, the authors present a Fast Fourier Transform based algorithm to detect images on a sphere that solves two primary challenges. First, there is no symmetrical grid that we can fit to a sphere the way we can to a plane. This means that we can’t simply move a filter over by a single pixel on the sphere. Second, if we consider the table of all possible rotations of the sphere, then there is something on the order of O(n⁶) entries in this table, which leads to computational inefficiency. In this overview, we will go over more of the theory behind spherical CNNs, explain why it works, and present some results.

Traditional CNN. Image from https://towardsdatascience.com/covolutional-neural-network-cb0883dd6529

In 2D CNNs (as shown above), we detect some sequence of pixels, with the invariant being translations. So it shouldn’t matter where the bird is located on the image. In other words, we can slide the bird around anywhere on the image and we would still be able to detect that the image is an image of a bird.
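The sliding argument can be checked numerically. The sketch below uses a circular (wrap-around) cross-correlation computed with NumPy's FFT, which is an assumption made to keep the borders simple; the random image, the 3x3 filter, and the shift are placeholders rather than anything from the article.

```python
import numpy as np

def circular_correlate(image, filt):
    """Circular 2D cross-correlation: out[s] = sum_x image[x] * filt[x - s]."""
    F = np.fft.fft2(image)
    G = np.fft.fft2(filt, s=image.shape)  # zero-pad the filter to image size
    return np.real(np.fft.ifft2(F * np.conj(G)))

rng = np.random.default_rng(0)
image = rng.random((16, 16))   # stand-in for an input feature map
filt = rng.random((3, 3))      # stand-in for a learned filter
shift = (3, 5)                 # translate by 3 rows and 5 columns

out = circular_correlate(image, filt)
shifted_image = np.roll(image, shift, axis=(0, 1))
out_shifted = circular_correlate(shifted_image, filt)

# Translating the input translates the response map by the same amount...
response_moves_with_input = np.allclose(out_shifted, np.roll(out, shift, axis=(0, 1)))
# ...so the strongest response has the same value, just at a new location.
same_peak_value = np.isclose(out.max(), out_shifted.max())
print(response_moves_with_input, same_peak_value)
```

Taking the maximum (or any pooled statistic) over the response map is what turns this shifting behavior into a prediction that does not depend on where the object sits.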

In spherical images, the transformations we want to be invariant to are rotations. To work toward this, we define the following terms, which will be useful for understanding the theoretical foundations of spherical CNNs.

Unit Sphere
The unit sphere S² is the set of points in R³ at distance 1 from the origin, and we will be considering images defined on it.

Spherical Signals
These are functions defined on the sphere, f : S² -> R^K, where K is the number of channels (in this example K = 3, for the R, G, B channels).

Rotation Group on the Sphere SO(3)
The rotation group on the sphere can be thought of as the group of all possible rotations of the sphere (around the origin in 3D space). Since rotations are linear transforms of R³, we can treat them as matrices. We can verify that SO(3) is a group under composition as follows: composing two rotations results in another rotation, every rotation has a unique inverse rotation, and the identity map satisfies the definition of a rotation. Because rotations do not commute in general, it is a nonabelian group. Furthermore, the rotation group has a natural structure as a manifold on which the group operations are smooth, which makes it a Lie group; it is 3-dimensional and compact. A bit of a tangent: it is not very intuitive how a group can be both a manifold and an algebraic structure, or why that is useful, but we can consider SO(2), which is the group of rotations of the 2D plane. Geometrically, it looks like a circle (so it is also a manifold). Note that every rotation preserves distances (so it is an isometry) and orientation. This will be important when we prove invariance later on.
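These group properties are easy to check numerically. Here is a minimal sketch with NumPy; the helpers `rot_z`, `rot_y`, and `is_rotation` and the chosen angles are illustrative names and values, not part of the paper.

```python
import numpy as np

def rot_z(a):
    """Rotation by angle a (radians) about the Z axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    """Rotation by angle a (radians) about the Y axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def is_rotation(M, tol=1e-9):
    """A 3x3 matrix is in SO(3) if it is orthogonal with determinant +1."""
    orthogonal = np.allclose(M.T @ M, np.eye(3), atol=tol)
    proper = np.isclose(np.linalg.det(M), 1.0, atol=tol)
    return bool(orthogonal and proper)

A, B = rot_z(0.7), rot_y(1.1)
v = np.array([0.3, -1.2, 2.0])

closure = is_rotation(A @ B)                              # composition stays in SO(3)
identity_ok = is_rotation(np.eye(3))                      # identity is a rotation
inverse_is_transpose = np.allclose(A.T @ A, np.eye(3))    # A^-1 = A^T
commutes = bool(np.allclose(A @ B, B @ A))                # False: nonabelian
length_preserved = bool(np.isclose(np.linalg.norm(A @ B @ v), np.linalg.norm(v)))
print(closure, identity_ok, inverse_is_transpose, commutes, length_preserved)
```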

Image from the Cohen Paper

In the image above, note that x denotes the original image and phi(x) denotes some filter (or function) applied to the image. We now want to verify that for every rotation R, R(phi(x)) gives us the same thing as phi(R(x)). We define the following notation to help us do this.

Rotation Operator

The rotation operator L_R defined above takes in a function f and returns the rotated function L_R f. Note that this is the same as evaluating the original function at the inversely rotated point R⁻¹x.
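Written out, and assuming the paper's convention that the operator acts through the inverse rotation, the definition reads:

```latex
[L_R f](x) = f(R^{-1} x), \qquad x \in S^2,\; R \in SO(3)
```

The inverse is what makes the operators compose in the natural order, so that L_{RR'} = L_R L_{R'}.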

Inner Product

Now we will show that the inner product between a spherical signal (image) and a filter (function) is unchanged when both are rotated by the same rotation. This will help us prove that spherical correlations are equivariant to rotations, which is the bulk of the proof for the theory of spherical CNNs. First note that the volume under a spherical height map does not change when it is rotated, which gives us the following identity.

Integrals held invariant under rotations
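In symbols, using the standard rotation-invariant measure dx on the sphere, the identity can be written as:

```latex
\int_{S^2} f(Rx)\, dx = \int_{S^2} f(x)\, dx \qquad \text{for every } R \in SO(3)
```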

This means that the rotation operator is a unitary operator: it preserves inner products and is normal. It also implies that the adjoint of the rotation operator is its inverse, the rotation operator for R⁻¹. Recall that two operators A and B are adjoint if (Ax, y) = (x, By). See proof below.
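A sketch of that argument, assuming the inner product of two K-channel signals is the integral over the sphere of the channel-wise product, and substituting x -> Rx in the middle step (which is allowed because the integral is rotation invariant):

```latex
\langle \psi, f \rangle = \int_{S^2} \sum_{k=1}^{K} \psi_k(x)\, f_k(x)\, dx

\langle L_R \psi, f \rangle
  = \int_{S^2} \sum_{k=1}^{K} \psi_k(R^{-1} x)\, f_k(x)\, dx
  = \int_{S^2} \sum_{k=1}^{K} \psi_k(x)\, f_k(R x)\, dx
  = \langle \psi, L_{R^{-1}} f \rangle
```

Replacing f with L_R f in the last expression gives ⟨L_R ψ, L_R f⟩ = ⟨ψ, f⟩, which is the statement that rotating both the signal and the filter leaves the inner product unchanged.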

Now we will show that the spherical correlation behaves predictably under rotations. To give some intuition for this, on 2D surfaces the correlation, which measures the similarity between two signals, is computed by taking the inner product between the input feature map and a filter shifted by some translation. For rotations on the sphere, we compute correlations as an inner product between the input feature map and a filter rotated by some R in SO(3). Thus the spherical correlation is defined as

Spherical Correlation
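In the notation used so far (rotation operator L_R, K-channel filter ψ and signal f), the definition can be written as:

```latex
[\psi \star f](R) = \langle L_R \psi, f \rangle
  = \int_{S^2} \sum_{k=1}^{K} \psi_k(R^{-1} x)\, f_k(x)\, dx,
  \qquad R \in SO(3)
```

The output is indexed by the rotation R rather than by a point on the sphere, which is why the result lives on SO(3).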

Note that there is a subtle but important distinction between correlation and convolution, which is the measure of the effect one signal has on another. The output of the spherical correlation is a function on SO(3), which you can read more about in Appendix B of the paper, while the spherical convolution gives an output on the sphere. However, restricting the output to the sphere forces the filter to be circularly symmetric about the Z axis, and the authors believed such filters would not be able to capture the subtleties of the data needed to classify images regardless of the rotation applied to them.

Finally, we will show that the correlation between the signal and a rotated filter is equivariant. Equivariance is important because it establishes that we can still detect the same object in a rotated image. Consider a 2D correlation and some signal where we detect a maximum activation somewhere on the image (for image detection). If the image is translated, the location of the maximum activation is translated too (but not transformed in any other way). A function f is equivariant under some group G if, for every transformation g in G, f(g(<input>)) = g’(f(<input>)), where g’ is the corresponding action of g on the output. This means that applying the function to a transformed input gives the same result as transforming the output of the function on the unmodified input. We see the proof for equivariance below.
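A sketch of the argument for a rotation Q applied to the signal, assuming the correlation is defined as the inner product ⟨L_R ψ, f⟩ and using the fact that rotation operators are unitary and compose as L_{Q⁻¹} L_R = L_{Q⁻¹R}:

```latex
[\psi \star [L_Q f]](R)
  = \langle L_R \psi, L_Q f \rangle
  = \langle L_{Q^{-1} R}\, \psi, f \rangle
  = [\psi \star f](Q^{-1} R)
  = [L_Q [\psi \star f]](R)
```

Here L_Q on the right-hand side acts on functions on SO(3) by [L_Q h](R) = h(Q⁻¹R), so rotating the input by Q simply rotates the output by Q.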

The spherical CNN is implemented efficiently using generalized Fast Fourier Transforms, but we will not delve into them in this review. We will now go over some of the results discussed in the paper.

Rotated MNIST Handwritten digits dataset. Image from Cohen paper

In the figure above, we examine handwritten MNIST digits that are projected onto the sphere. We obtain the following results using traditional CNN architecture vs spherical CNNs.

Results from Cohen paper

Note that NR/NR means non-rotated training data/non-rotated test data, R/R means rotated training data/rotated test data, and NR/R means non-rotated training data/rotated test data. The NR/R setting displays the most significant improvement, which indicates that spherical CNNs are able to detect images that have been projected onto the sphere under various rotations.

Feel free to view some demos here!