Original article was published by Andre Ye on Artificial Intelligence on Medium
What the Human Brain Has That Deep Learning Desperately Needs: A Guide to Zero-Shot Learning
And the Embarrassingly Simple ZSL Algorithm
Deep learning has a big problem: it needs to gobble up massive quantities of data before it can generalize reasonably well and become useful. This is in fact one of the limitations of deep learning that restrict its applications in many fields where data is either not abundant or difficult to obtain.
Contrast this with humans, who, despite being portrayed as the losing side in the battle between human and machine intelligence, can learn complex concepts from only a few training examples. A baby who has no idea what a cat or a dog is can learn to tell them apart after seeing a few images of each. Even as we grow and learn more complex concepts, we can still do most of our learning from a small dataset.
This ability makes teaching humans new concepts laborious but feasible. A fourth grader can master the principles of basic algebra with a few dozen problems and a good teacher. Deep learning, by contrast, needs a carefully designed architecture, tens of thousands (if not more) of painfully scraped problems and answers encoded in a specialized format, and a few days of compute time.
Perhaps something even more intriguing about the human brain is that we can handle hundreds or even thousands of classes with ease. Think about all the objects in the environment right around you: your computer, the apps on your computer, the features in those apps, the dozens of colors, your colleagues’ names, all the words in the English language.
Furthermore, you’re able to recognize concepts even if you’ve never seen them before. Consider the following ideas (hopefully you haven’t seen these before): classify them into classes. The actual names of the classes don’t matter — call them zookildezonk if you want to.
This is very similar to how new species are named. If a scientist spots several bald eagles, he or she can simply put a name on the species, the ‘bald eagle’, whose members are grouped by shared characteristics: a wingspan of 6–7 feet, a dark brown body, a white tail, a white head, and bright yellow eyes. This works even though the scientist did not know beforehand what exactly a ‘bald eagle’ was.
We don’t need to put names on concepts to recognize them; names are arbitrary and are just a quick way to access an idea. Similarly, we can classify these abstract shapes by any name we please, as long as the names are indicative of a broader concept (in this case, two squares for ‘Zonkizonk’ and three squares for ‘Bonkibonk’).
Zero-shot learning is an effort to bring this human capability of recognizing previously unseen concepts to machines. Clearly, this is a crucial step towards real artificial intelligence and building algorithms that think more like humans, but it’s also very practical in problems where there are too many classes, or where data is limited or expensive to obtain.
As the problems deep learning has yet to solve increasingly demand intricate, human-like cognition, zero-shot learning offers one answer.
Similarly, one-shot or few-shot learning refers to learning an entire class from only one or a few training examples of that class. The two-headed Siamese network is one example of such an approach.
One simple but effective zero-shot learning method goes by “Embarrassingly Simple Zero-Shot Learning” (ESZSL), which uses matrix factorization and unsupervised learning in creative ways to produce a model that yields surprisingly good results. Understanding it gives an intuitive look into the dynamics of many other zero-shot learning techniques.
ESZSL yields over 65% accuracy on the SUN dataset, which consists of tens of thousands of objects, for classes it has never seen during training. See the paper for an in-depth summary of the method’s results on synthetic and real datasets.
At root, ESZSL is a linear model. Given an input matrix X with shape (number of rows, number of features) and a weight matrix of shape (number of features, number of classes), the linear combination output would be of the shape (number of rows, number of classes).
The goal of ESZSL is to find the value of the weight matrix W.
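The shapes in this linear model can be checked with a minimal NumPy sketch. The sizes and the random W are placeholders for illustration only, and taking the highest-scoring class as the prediction is a common convention assumed here rather than something stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the article).
n_rows, n_features, n_classes = 5, 4, 3

X = rng.normal(size=(n_rows, n_features))      # input matrix
W = rng.normal(size=(n_features, n_classes))   # weight matrix (random here, learned in practice)

scores = X @ W                      # shape: (n_rows, n_classes)
predicted = scores.argmax(axis=1)   # assumed rule: pick the highest-scoring class per row

print(scores.shape, predicted.shape)
```

Each row of `scores` holds one value per class, so the whole problem reduces to finding a good W.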
Consider the two steps a reasonably sophisticated model must complete:
- Interpret the input by mapping the feature space (input X) to an attribute space of dimensionality a, where attributes can be things like whether an image shows four feet, whether it is brown, etc. What each attribute means needs to be determined by the model.
- Combine knowledge from the attribute space into an output. For instance, if the image has four feet and is brown, the output is dog.
These two steps can each be represented as a matrix.
- V has shape (number of features, a).
When X is multiplied by V, the result has shape (number of rows, a). Each row is now represented by learned attributes. This is very similar to the connection in a neural network layer (without the bias and activations).
- S has shape (a, number of classes).
When XV (the attribute representation of the input) is multiplied by S, the result has shape (number of rows, number of classes). This multiplication combines learnings from the attribute space to produce an output. This is like the output layer of a neural network.
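The shapes in this two-matrix factorization can be verified with a short NumPy sketch. All sizes and the random matrices are placeholders chosen for illustration; the point is only that mapping to attributes and then to classes is the same as applying a single weight matrix equal to V times S.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the article).
n_rows, n_features, a, n_classes = 6, 5, 2, 3

X = rng.normal(size=(n_rows, n_features))
V = rng.normal(size=(n_features, a))    # features -> attributes
S = rng.normal(size=(a, n_classes))     # attributes -> classes

attributes = X @ V         # shape: (n_rows, a)
scores = attributes @ S    # shape: (n_rows, n_classes)

W = V @ S                  # combined weight matrix, shape: (n_features, n_classes)
same = np.allclose(scores, X @ W)   # matrix multiplication is associative

print(attributes.shape, scores.shape, W.shape, same)
```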
We can then write the model as the product XVS, which means the weight matrix factors as W = VS.
It turns out that the latter half of the pipeline, S (the relationship between the learned attributes and the classes), can be found through unsupervised learning methods like PCA, or with more sophisticated manifold learning techniques such as Locally Linear Embedding and t-SNE. The procedure is:
- Fit a dimensionality reduction algorithm (PCA, LLE, etc.) on X, the training input data, reducing it to a dimensions.
- The resulting data should have shape (r, a), where r is the number of rows and a is the number of learned attributes. Call this matrix M.
- Allocate matrix S of size (a, c), where c is the number of classes.
- For each unique class, find the rows in M whose label matches that class. Average the learned attributes over those rows and store the resulting vector of length a as that class’s column in S.
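The steps above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the paper's implementation: PCA is done with a plain SVD, the helper names `pca_reduce` and `class_attribute_matrix` are hypothetical, and the data and labels are random toy values.

```python
import numpy as np

def pca_reduce(X, a):
    """Project X onto its top-a principal components (plain PCA via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:a].T            # M, shape (r, a)

def class_attribute_matrix(M, labels):
    """Average the rows of M per class; returns S with shape (a, c)."""
    classes = np.unique(labels)              # sorted unique class labels
    S = np.zeros((M.shape[1], classes.size))
    for j, cls in enumerate(classes):
        # Mean attribute vector over all rows labelled with this class.
        S[:, j] = M[labels == cls].mean(axis=0)
    return S, classes

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 6))                 # toy data: 12 rows, 6 features
labels = np.array([0, 1, 2] * 4)             # toy labels: 3 classes

M = pca_reduce(X, a=2)
S, classes = class_attribute_matrix(M, labels)
print(M.shape, S.shape)
```

Each column of S is then the average attribute vector for one class, in the order given by `classes`. A manifold method such as LLE could stand in for `pca_reduce` as long as it returns a matrix of shape (r, a).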