Unsupervised Feature Learning

Source: Deep Learning on Medium

Deep Convolutional Networks for image tasks take in image matrices of the form (height × width × channels) and process them into low-dimensional features through a series of parametric functions. Supervised and Unsupervised Learning tasks both aim to learn a semantically meaningful representation of features from raw data.

Training Deep Supervised Learning models requires a massive amount of data in the form of labeled (x, y) pairs. Unsupervised Learning does not require the corresponding labels (y); the most common example is the auto-encoder. Auto-encoders take x as input, pass it through a series of layers to compress the dimensionality, and are then criticized on how well they can reconstruct x. Auto-encoders eventually learn a set of features that describe the data x; however, these features are often not very useful for Supervised Learning or Discriminative tasks.
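To make the auto-encoder idea concrete, here is a minimal sketch in numpy: a linear encoder compresses x to a low-dimensional code, a linear decoder reconstructs it, and the model is "criticized" via the mean squared reconstruction error. All shapes, learning rates, and dimensions here are illustrative assumptions, not values from any particular paper.

```python
import numpy as np

# Minimal linear auto-encoder sketch (illustrative hyper-parameters).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))          # 100 samples of 64-dim "raw" data

W_enc = rng.standard_normal((64, 8)) * 0.1  # encoder: compress to an 8-dim code
W_dec = rng.standard_normal((8, 64)) * 0.1  # decoder: reconstruct 64 dims

def forward(X):
    code = X @ W_enc            # low-dimensional feature representation
    recon = code @ W_dec        # attempted reconstruction of the input
    return code, recon

lr = 1e-3
losses = []
for _ in range(200):
    code, recon = forward(X)
    err = recon - X                          # reconstruction error
    losses.append(float((err ** 2).mean()))  # mean squared error criterion
    # Gradient descent on the reconstruction loss
    g_dec = code.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

assert losses[-1] < losses[0]   # reconstruction improves with training
```

The learned code is the "set of features that describe the data x"; real auto-encoders stack nonlinear layers, but the criticize-by-reconstruction loop is the same.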

One extension to Unsupervised Feature Learning with Auto-encoders is the De-noising Auto-encoder. De-noising Auto-encoders take as input a corrupted image (the original image with some form of random noise matrix added) and reconstruct the original image. Again, these features are not very useful for discriminative tasks; however, hopefully these two examples are a sufficient explanation of how unsupervised feature learning tasks can be constructed.
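The corruption step can be sketched in a few lines of numpy: add a random matrix to a clean patch and use the clean patch, not the corrupted one, as the reconstruction target. The noise type and scale here are assumptions for illustration.

```python
import numpy as np

# Sketch of the de-noising auto-encoder's input construction.
rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # a clean 32x32 RGB patch

noise = rng.normal(0.0, 0.1, size=clean.shape)    # "some form of random matrix"
corrupted = np.clip(clean + noise, 0.0, 1.0)      # what the model sees as input

target = clean   # the model is criticized on reconstructing the CLEAN image
assert corrupted.shape == target.shape
assert not np.allclose(corrupted, target)
```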

Dosovitskiy et al. propose a very interesting Unsupervised Feature Learning method that uses extreme data augmentation to create surrogate classes for unsupervised learning. Their method crops 32 x 32 patches from images and transforms them using a set of transformations according to a sampled magnitude parameter. They train a Deep CNN to classify these patches according to their augmented ‘surrogate’ classes.

The top left image is a 32 x 32 patch taken from the STL-10 dataset. Dosovitskiy et al. proceed to form all of these other images by sampling a parameter vector that defines a set of transformations. Each of these resulting images belongs to a surrogate class. A Deep Learning model will classify images according to these classes.

In the paper, there are 6 transformations used: translation, scaling, rotation, contrast 1, contrast 2, and color additions. Each of these transformations comes with a parameter that defines the magnitude of the augmentation. For example, translate → (vertical, 0.1 (of patch size)). The magnitude parameters can be stored in a single vector. These vectors are sampled from the overall distribution of parameters to transform patches.
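A rough sketch of this sampling step, using only two of the six transformations (vertical translation and a contrast change) on a numpy patch; the discretized magnitude values and the specific contrast formula are my own illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

# Sample a magnitude parameter vector and apply the transformations to a patch.
rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # a 32x32 RGB patch

params = {
    "translate_v": rng.choice([-0.2, -0.1, 0.0, 0.1, 0.2]),  # fraction of patch size
    "contrast": rng.choice([0.5, 0.75, 1.0, 1.25, 1.5]),     # multiplicative factor
}

def transform(patch, params):
    # Vertical translation by a fraction of the patch height (wrap-around shift)
    shift = int(round(params["translate_v"] * patch.shape[0]))
    out = np.roll(patch, shift, axis=0)
    # Simple contrast change around the mid-gray value
    out = np.clip((out - 0.5) * params["contrast"] + 0.5, 0.0, 1.0)
    return out

augmented = transform(patch, params)
assert augmented.shape == patch.shape
```

In the full method, every patch transformed with the same parameter vector belongs to the same surrogate class.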

These magnitude parameters are discretized so that there is a finite set of values between the parameter boundaries, e.g. a translation magnitude restricted to [-0.2, -0.1, 0, 0.1, 0.2]. The fineness of this discretization determines the total number of surrogate classes constructed. For example, with 5 values each for translation, scaling, rotation, contrast 1, contrast 2, and color addition, there are 5⁶ = 15,625 resulting surrogate classes.
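The class count is just the product of the grid sizes, which can be verified by enumerating every combination of discretized magnitudes:

```python
import itertools

# Six transformations (translation, scaling, rotation, contrast 1, contrast 2,
# color addition), each discretized to 5 magnitude values in this example.
grids = [range(5)] * 6

# Every combination of one value per transformation is one surrogate class.
n_classes = sum(1 for _ in itertools.product(*grids))
assert n_classes == 5 ** 6 == 15_625
```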

This leads to the first question that Dosovitskiy et al. sought to answer:

How many surrogate classes should be used?

Number of Surrogate Classes Constructed

Note that the left y-axis corresponds to classification performance when the unsupervised features are used for the supervised STL-10 discriminative task, and the right y-axis corresponds to the classifier's error on the surrogate classes themselves (very low error for 50 to 2,000 surrogate classes).

This plot shows that downstream performance levels off at around 2,000 surrogate classes, so a relatively coarse discretization of the augmentation magnitudes is sufficient. Another interesting question arises:

How many augmented samples should be used for each surrogate class?

Augmented Samples used for Surrogate Classes

The plot above shows that, with the 2,000 surrogate classes found to work best in the previous experiment, performance begins to level off at around 32 to 64 samples per class.

It is interesting to think about the size of the augmented dataset used for this approach.

32x32x3 patches → 2,000 surrogate classes → 64 samples per class
(32x32x3) x 2000 x 64 = 393,216,000 pixels

That is only around 393 MB (at one byte per value), which really isn't a huge problem for most modern computers. However, if the approach were changed from 32 x 32 patches to full images, the dataset could become enormous, necessitating the use of on-line data augmentation techniques.
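The arithmetic above, plus the full-image comparison, can be checked directly (the 96 x 96 full-image size is an assumption based on STL-10's image dimensions):

```python
# Augmented dataset size at one byte per pixel-channel value.
patch_pixels = 32 * 32 * 3 * 2000 * 64
assert patch_pixels == 393_216_000            # ~393 MB for 32x32 patches

# Hypothetical: same scheme on full 96x96 STL-10 images instead of patches.
full_pixels = 96 * 96 * 3 * 2000 * 64
assert full_pixels == 9 * patch_pixels        # ~3.5 GB, 9x larger
```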

Diversity of Transformations

They also investigated the use of different transformations for constructing the surrogate classes. As shown in the plot above, there is some variation across the datasets and augmentations used, especially evident in the Caltech-101 spike when using only color and contrast. However, using all augmentations gives consistently high performance across all three datasets. This suggests that the results could perhaps be improved further by adding more transformations to the mix.

Thank you for reading this introduction to Unsupervised Feature Learning! I think it is very interesting to see how Deep Neural Networks can learn features in one task that transfer well to another. Deep features trained on tasks such as the Exemplar-CNN described in this article can be useful for Discriminative tasks as well. Please let me know your thoughts on this in the comments.