Meta-learning is 'learning to learn'. Traditionally applied to hyperparameter tuning, recent applications have started focusing on few-shot learning. Before we explore two novel techniques to achieve this, let's understand some key aspects of the problem. As a guiding example, we will focus on our ability to grasp new words in a language we already know.
I) Behavioral change after few samples
Humans are good at decoding a word’s meaning after seeing it used in just a couple of sentences. Similarly, we would want our ML algorithms to generalize to new tasks without the need for a large dataset every time.
II) Learning how to learn
For our guiding problem, learning a new word entails grasping its meaning. Initially, during your early years, you were given the meanings of words explicitly — in textbooks or by your teachers. Over time, you acquired the ability to use your existing knowledge of the other words in a sentence to get a better picture of what a new word means. Thus, you effectively 'learnt to learn words'.
III) Learning at two time-scales with internal representations
Few-shot learning typically occurs at two time-scales. Over the long term, you fine-tune your intuition of the language and the way sentences are constructed to form meaning (learning to learn). This helps you develop a good (internal) representation of any word in your mind — for example, hearing "Apple" would point you to the fruit or the company, depending on whether the other person is talking about nutrition or phones. In the short term, you use this intuition to form your internal representations of new terms and thus grasp new 'lingo' (learning).
IV) Datasets/tasks as training examples
In typical learning (on a single dataset), each individual sample-target pair functions as a training point. In the case of few-shot learning, however, every 'new' sample-space is essentially a task in itself. So when you join a workplace, understanding its peculiar vocabulary becomes a new task for your language-understanding model. To ensure that an ML framework can exhibit similar behavior, we have to train it on multiple tasks — thus making each individual dataset a single training sample.
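To make this concrete, here is a minimal sketch of how 'datasets as training samples' is usually operationalized: each training step draws a fresh N-way K-shot 'episode' from a labelled pool. The function name `sample_episode` and the dict-of-lists layout are illustrative assumptions, not from any specific library.

```python
import random

def sample_episode(dataset_by_class, n_way=5, k_shot=1, n_query=5):
    """Build one N-way K-shot 'task' from a labelled dataset.

    dataset_by_class: dict mapping class label -> list of samples.
    Returns (support, query) lists of (sample, episode_label) pairs,
    with labels re-indexed 0..n_way-1 so each episode is a fresh task.
    """
    classes = random.sample(list(dataset_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(dataset_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in picks[:k_shot]]
        query += [(x, episode_label) for x in picks[k_shot:]]
    return support, query
```

Because the class labels are re-indexed per episode, the model cannot memorize classes across tasks — it is forced to learn *how* to map a few support examples to predictions, which is exactly the meta-learning objective.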
Now that we understand the problem, let's look at two methods that have been proposed to tackle few-shot learning with Neural Networks:
Memory-Augmented Neural Networks (or MANNs) are characterized by two elements: a neural-network controller and an external memory source — 'external' because it lies outside the storage used for the network's internal parameters. Within a given task, a MANN essentially uses the external memory as a cache of sorts, mapping its internal representations to the expected outputs.
Conceptually, this idea is intuitive — the controller is taught to produce versatile internal representations over a multitude of datasets during training. When presented with a new task, the memory is reset and the output of a MANN is random at first.
However, the memory then gets updated with a history of what the expected output is supposed to be, given a few examples. Memory updates follow a 'least recently used' principle, so that stale slots are freed up for more relevant samples.
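A toy sketch of such an external memory may make the mechanism clearer: reads are content-addressed by similarity, and writes evict the least recently used slot. This is a deliberately simplified assumption-laden illustration (class name, slot count, and the decay factor are all invented here), not the memory module from the MANN paper.

```python
import numpy as np

class ExternalMemory:
    """Toy content-addressable memory with least-recently-used writes."""

    def __init__(self, n_slots=8, width=4):
        self.mem = np.zeros((n_slots, width))
        self.usage = np.zeros(n_slots)  # recency-of-use score per slot

    def read(self, key):
        # Soft read: weight every slot by its similarity to the key.
        norms = np.linalg.norm(self.mem, axis=1) * np.linalg.norm(key) + 1e-8
        weights = np.exp(self.mem @ key / norms)
        weights /= weights.sum()
        self.usage = 0.9 * self.usage + weights  # slots just read count as 'used'
        return weights @ self.mem                # weighted sum of slot contents

    def write(self, vector):
        # Overwrite the least recently used slot, freeing stale content.
        slot = int(np.argmin(self.usage))
        self.mem[slot] = vector
        self.usage[slot] = 1.0
```

In a real MANN the controller learns *what* to write and *which* key to read with; here both are supplied by hand just to show the cache-like read/write cycle.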
Consider the problem of reaching some point on the circumference of a circle, given a certain position within it. Now consider the meta-problem of where to stand inside the circle, so that you reach any point on the circumference with the minimum number of steps.
Given the meta-objective, your goal is no longer a spot on the edge — your objective is now to situate yourself in such a way that you can solve any of the individual tasks (points on the circumference) with the maximum ease. This is pretty much what the Model-Agnostic Meta-learning paper tries to do.
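The circle analogy can even be run as a tiny numeric experiment: treat points on the circumference as the individual tasks, and gradient-descend your standing position to minimize the average squared distance to them. This is a hedged sketch of the *analogy only* (the sample count, learning rate, and squared-distance objective are choices made here for simplicity), not the MAML algorithm itself.

```python
import numpy as np

# The individual 'tasks': points on the unit circle.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Meta-objective: find the standing point that minimizes the average
# squared distance to every target, via gradient descent on position.
pos = np.array([0.9, -0.5])  # arbitrary start inside the circle
for _ in range(200):
    grad = 2 * (pos - targets).mean(axis=0)
    pos -= 0.1 * grad
# pos converges to the mean of the targets, i.e. close to the centre.
```

No single target pulled the position onto the circumference; instead the meta-objective settles near the centre, the spot from which every individual task is cheapest to reach — the same intuition MAML applies in parameter space.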
We all know what Gradient Descent is. During training, given some location in the optimization space, MAML samples a batch of individual tasks and computes what the gradient-descended parameters would look like for each one.
Given this list of task-specific parameters, it then 'goes back' to the original location and moves in such a way that gradient descent on each of the individual tasks requires the minimum shift in location. In essence, it performs gradient descent to improve gradient descent for each individual task.
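The two-loop structure can be sketched in a few lines on a toy family of tasks (fitting lines y = a·x with a one-parameter model). Note one simplification, labelled loudly: this uses the *first-order* approximation, dropping the second-derivative term that full MAML backpropagates through the inner step; everything else (task family, learning rates) is an assumption made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def task_loss_grad(theta, a):
    """One task: fit y = a*x with model y = theta*x.
    Returns (loss, dloss/dtheta) on a fresh batch of inputs."""
    x = rng.uniform(-1, 1, 20)
    err = theta * x - a * x
    return (err ** 2).mean(), (2 * err * x).mean()

theta, inner_lr, outer_lr = 0.0, 0.5, 0.1
for _ in range(300):
    meta_grad = 0.0
    tasks = rng.uniform(-2, 2, 5)           # sample a batch of tasks (slopes a)
    for a in tasks:
        _, g = task_loss_grad(theta, a)
        adapted = theta - inner_lr * g      # inner loop: one step per task
        _, g_post = task_loss_grad(adapted, a)
        meta_grad += g_post                 # first-order MAML: ignore 2nd derivatives
    theta -= outer_lr * meta_grad / len(tasks)
```

Since the task slopes are drawn symmetrically around zero, the meta-learned theta hovers near the 'centre' of the task family — the parameter setting from which any single task is reachable in very few inner steps.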
At test time, MAML does not need to perform full-fledged retraining. Instead, starting from the meta-learned parameters, it can adjust its location with just a few gradient steps on a handful of samples, so as to produce the correct output for the individual task at hand.
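Continuing the toy linear-task setup from above, test-time adaptation is just a handful of gradient steps on the new task's support samples. The function name `adapt`, the step count, and the starting parameter are illustrative assumptions for this sketch.

```python
import numpy as np

def adapt(theta, xs, ys, lr=0.5, steps=5):
    """Few-shot adaptation: a few gradient steps on the new task's
    support samples, starting from the meta-learned parameter."""
    for _ in range(steps):
        err = theta * xs - ys
        theta -= lr * (2 * err * xs).mean()
    return theta

# A new task y = 1.5*x, seen through only 5 support samples.
xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
ys = 1.5 * xs
theta_new = adapt(0.0, xs, ys)  # lands close to the true slope 1.5
```

Five samples and five gradient steps suffice here precisely because the starting point was chosen (by the outer loop) to make every task in the family a short descent away.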
If the above algorithms got you interested, you can take a look at some other papers that attempt few-shot learning:
Source: Deep Learning on Medium