Original article was published on Deep Learning on Medium
This ‘learning-to-learn’ is closely aligned with human and animal learning, in which learning strategies improve incrementally over time. The approach offers advantages in both data and compute efficiency.
A meta-learning algorithm can be understood as comprising two levels of learning: an inner and an outer algorithm. The inner algorithm is a conventional learning algorithm, such as training an image classifier. During meta-learning, an outer (or upper, meta) algorithm updates the inner learning algorithm so that the model learned by the inner algorithm improves an outer objective. This objective could be the generalization performance or the learning speed of the inner algorithm. Learning iterations of the base task can be thought of as providing the stimulus the outer algorithm needs in order to learn the base learning algorithm.
The objective function of the inner level of a meta-learning algorithm can be expressed mathematically as

θ* = arg min_θ L(D; θ, ω)

where
L: a loss function that measures the match between the true labels and those predicted by f(θ)
θ: the parameters of the inner algorithm’s model f
D: the dataset under consideration
ω: meta-knowledge, which makes explicit the dependence of this solution on factors such as the choice of optimizer for θ or the function class for f
Meta-learning first appeared in the literature in 1987, in two separate and independent pieces of work by J. Schmidhuber and G. Hinton. These set the theoretical foundations for a new family of algorithms that can learn how to learn, using self-referential learning.
Proposals for training meta-learning systems using gradient descent and backpropagation were first made in 2001. Finally, in 2012, meta-learning was re-introduced in the modern era of deep neural networks, marking the beginning of the modern meta-learning discussed here.
The meta-learning landscape can be broadly categorized along the following three axes, each identifying how the algorithms are parameterized and how they function:
Meta-Representation
This classification is based on the representation of the meta-knowledge ω. For example, ω may be the model parameters used to initialize the inner optimizer.
Meta-Optimizer
This classification is based on the choice of optimizer used for the outer level during meta-training, ranging from gradient descent to reinforcement learning and evolutionary search. A large family of methods uses gradient descent on the meta-parameters ω. This requires computing the derivative dL_meta/dω of the outer objective, which is typically connected to the model parameters θ via the chain rule: dL_meta/dω = (dL_meta/dθ)(dθ/dω).
Meta-Objective
This classification is based on the goal of meta-learning, which is determined by the choice of meta-objective L_meta, the task distribution p(T), and the flow of data between the two levels.
Let us explore a few of the methods shown in the figure above.
Parameter Initialization
Here the goal is to find ω corresponding to the initial parameters of a neural network, such that a good initialization is just a few gradient steps away from a solution to any task T drawn from p(T). Applications of this approach include few-shot learning, where models can be trained from only a few examples without overfitting.
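The idea can be sketched with a toy task family. In this minimal example (all task definitions, step sizes, and loss shapes are assumed for illustration, not taken from any particular paper), each task pulls a single parameter toward a task-specific target, and the outer loop learns an initialization from which one inner gradient step solves a typical task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy task family: task t has loss 0.5 * (theta - c_t)^2
# with targets c_t ~ N(0, 1).
def inner_grad(theta, c):
    return theta - c

alpha = 0.5    # inner-loop step size (assumed)
beta = 0.1     # outer-loop (meta) step size (assumed)
theta0 = 5.0   # the meta-learned initialization, playing the role of omega

for _ in range(500):
    meta_grad = 0.0
    tasks = rng.normal(0.0, 1.0, size=8)   # sample a batch of tasks
    for c in tasks:
        # Inner loop: one gradient step from the shared initialization.
        theta = theta0 - alpha * inner_grad(theta0, c)
        # Outer gradient via the chain rule dL/dtheta0 = (dL/dtheta)(dtheta/dtheta0),
        # where dtheta/dtheta0 = (1 - alpha) for this quadratic loss.
        meta_grad += inner_grad(theta, c) * (1 - alpha)
    theta0 -= beta * meta_grad / len(tasks)

# theta0 drifts toward the mean of the task targets (0 here), i.e. an
# initialization from which one inner step solves a typical task well.
```

Note how the outer update backpropagates through the inner update, which is exactly the dL_meta/dω = (dL_meta/dθ)(dθ/dω) chain-rule computation described earlier.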
Optimizers
Here the goal is to learn the inner optimizer itself, by training a function that takes optimization states such as θ as input and produces the optimization step to take at each base learning iteration.
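A minimal sketch of this idea, under heavy simplifying assumptions: the "learned optimizer" here is just a meta-parameter w (a log step size) inside the update rule, and the meta-gradient is estimated by finite differences rather than backpropagation. All names and constants are illustrative:

```python
import numpy as np

# Hypothetical learned optimizer: the update rule has a meta-parameter w,
# update = -exp(w) * gradient. Meta-training tunes w so that five inner
# steps make as much progress as possible on the base task.
def learned_step(theta, grad, w):
    return theta - np.exp(w) * grad

def run_inner(w, c=3.0, steps=5):
    """Run an inner learning episode; return the final base loss."""
    theta = 0.0
    for _ in range(steps):
        theta = learned_step(theta, theta - c, w)  # grad of 0.5*(theta-c)^2
    return 0.5 * (theta - c) ** 2

w = np.log(0.01)   # start with a tiny learning rate (assumed)
for _ in range(200):
    # Meta-gradient of the outer objective w.r.t. w, by finite differences.
    eps = 1e-4
    meta_grad = (run_inner(w + eps) - run_inner(w)) / eps
    w -= 0.5 * meta_grad

# The learned step size exp(w) grows until the inner episode
# nearly reaches the target.
```

In real learned optimizers the update function is a neural network taking gradients and optimizer state as input, but the bilevel structure is the same.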
Black-Box Models (Recurrent, Convolutional, HyperNetwork)
These methods train an ω that provides a feed-forward mapping directly from the support set to the parameters required to classify test instances, rather than relying on gradient-based adaptation.
Embedding Functions (Metric Learning)
In this case, an embedding network ω transforms raw inputs into a representation suitable for recognition by simple similarity comparison between query and support instances, using cosine similarity or Euclidean distance.
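A minimal sketch of this nearest-prototype style of classification (the identity "embedding" and the toy 2-D inputs are placeholders; in practice the embedding is the meta-learned network ω):

```python
import numpy as np

def embed(x):
    # Stand-in embedding; in practice this is the meta-learned network omega.
    return np.asarray(x, dtype=float)

def classify(query, support_x, support_y):
    """Label a query by its nearest class prototype (Euclidean distance)."""
    classes = sorted(set(support_y))
    # Prototype = mean embedding of each class's support examples.
    protos = {c: np.mean([embed(x) for x, y in zip(support_x, support_y)
                          if y == c], axis=0) for c in classes}
    dists = {c: np.linalg.norm(embed(query) - p) for c, p in protos.items()}
    return min(dists, key=dists.get)

# A 2-way, 2-shot episode with toy 2-D "inputs".
support_x = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
support_y = ["cat", "cat", "dog", "dog"]
label = classify([0.95, 0.95], support_x, support_y)  # nearest prototype: "dog"
```

Because classification is just a distance comparison, no gradient steps are needed at test time; all the learning is in the embedding.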
Losses and Auxiliary Tasks
Here a small neural network takes as input quantities that typically feed into losses (e.g., predictions, features, or model parameters) and outputs a scalar that the inner task optimizer treats as a loss.
Hyperparameters
In these methods ω includes hyperparameters of the base learning algorithm, such as regularization strength, per-parameter regularization, task relatedness in multi-task learning, or sparsity strength in data cleansing.
Reinforcement learning is used when the base learner or the meta-objective is non-differentiable. It estimates the gradient ∇ω L_meta, typically using the policy gradient theorem.
Evolutionary methods have found interesting applications because they can optimize any type of base model and meta-objective, with no requirement of differentiability. Since they do not rely on backpropagation, they avoid both gradient degradation issues and the cost of the high-order gradient computation required by the gradient-based methods above.
Many vs Few-Shot Episode
Depending upon whether the goal is improving few-shot or many-shot performance, inner-loop learning episodes may be defined with few or many examples per task.
Fast Adaptation vs Asymptotic Performance
Asymptotic Performance: When the validation loss is computed at the end of the inner learning episode, meta-training encourages better final performance on the base task.
Fast Adaptation: When the validation loss is computed as the sum of the validation losses after each inner optimization step, meta-training also encourages faster learning on the base task.
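The difference between the two meta-objectives can be sketched on a toy inner episode (the quadratic loss and all constants here are illustrative):

```python
# Toy inner learning episode: gradient descent on a quadratic base loss,
# with a held-out validation loss recorded after every inner step.
theta, lr = 4.0, 0.3
val_losses = []
for _ in range(5):
    theta -= lr * theta                  # one inner step on loss 0.5*theta^2
    val_losses.append(0.5 * theta ** 2)  # validation loss after this step

# Asymptotic performance: only the final validation loss forms L_meta.
asymptotic = val_losses[-1]
# Fast adaptation: every intermediate validation loss contributes, so
# meta-training also rewards reaching low loss quickly.
fast_adaptation = sum(val_losses)
```

Under the fast-adaptation objective, an inner trajectory that reaches the same final loss in fewer steps scores strictly better, which is exactly the behavior meta-training then encourages.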
Multi vs Single Task
Multi Task: When the goal is to tune the learner to better solve any task drawn from a given family, inner-loop learning episodes correspond to randomly drawn tasks from that family.
Single Task: When the goal is to tune the learner to simply solve one specific task better, the inner-loop learning episodes all draw data from the same underlying task.
Computer Vision and Graphics
Meta-learning-based few-shot learning methods train algorithms that enable powerful deep networks to learn successfully from small datasets. Applications include object detection, landmark prediction, object segmentation, image generation, video synthesis, and density estimation.
Meta Reinforcement Learning and Robotics
Reinforcement learning is typically concerned with learning control policies that enable an agent to obtain high reward on a sequential-action task within an environment. Meta-learning has proved effective in enabling reinforcement learning because much of what is learned transfers across related tasks:
Meta-knowledge of ‘how to stand up’ for a humanoid robot is a transferable skill for all tasks within a family that require locomotion, while meta-knowledge of a maze layout is transferable for all tasks that require navigating within the maze.
Environment Learning and Sim2Real
In Sim2Real we are interested in training a model in simulation that is able to generalize to the real world, which is challenging since the simulation never matches the real world exactly.
Meta-learning algorithms are particularly useful here: the inner-level optimization learns a model in simulation, the outer-level objective L_meta evaluates the model’s performance in the real world, and the meta-representation ω corresponds to the parameters of the simulation environment.
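This bilevel structure can be sketched with a toy system (the linear "dynamics", finite-difference meta-gradient, and all constants are assumed for illustration): ω is a simulator parameter, the inner level fits a model to simulated data, and the outer level scores that model on a small batch of real data.

```python
import numpy as np

rng = np.random.default_rng(0)
x_real = rng.uniform(-1.0, 1.0, 50)
y_real = 2.0 * x_real                    # the unknown real-world dynamics

def inner_fit(omega):
    """Inner optimization: least-squares fit to data from the simulator."""
    x_sim = rng.uniform(-1.0, 1.0, 100)
    y_sim = omega * x_sim                # simulator parameterized by omega
    slope = np.sum(x_sim * y_sim) / np.sum(x_sim * x_sim)
    return slope                         # the learned model

def real_loss(omega):
    """Outer objective L_meta: real-world error of the sim-trained model."""
    model = inner_fit(omega)
    return np.mean((model * x_real - y_real) ** 2)

omega = 0.5                              # initial simulator parameter
for _ in range(100):
    eps = 1e-3
    meta_grad = (real_loss(omega + eps) - real_loss(omega)) / eps
    omega -= 0.5 * meta_grad

# omega converges toward 2.0, the setting that makes models trained
# in simulation perform well on the real data.
```

The outer loop never touches the model directly; it only reshapes the simulated environment so that inner-level training produces real-world-ready models.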
Neural Architecture Search
Architecture search involves finding an ω that specifies the architecture of a neural network. The inner optimization trains networks with the specified architecture, while the outer optimization searches for architectures with good validation performance.
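The bilevel structure can be sketched with a deliberately tiny stand-in (all data, model choices, and the grid of "architectures" are illustrative, not a real NAS method): here the "architecture" is just the width of a random-feature model, inner training is a closed-form ridge fit, and the outer search scores each width by validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
xtr, ytr, xva, yva = x[:150], y[:150], x[150:], y[150:]

def val_error(width):
    """Inner level: train a model of the given width; return validation MSE."""
    feat_rng = np.random.default_rng(width)       # fixed features per width
    W = feat_rng.normal(size=width)
    b = feat_rng.uniform(0.0, 2.0 * np.pi, width)
    feats = lambda xs: np.cos(np.outer(xs, W) + b)
    A = feats(xtr)
    # Ridge regression in closed form plays the role of inner training.
    w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(width), A.T @ ytr)
    return float(np.mean((feats(xva) @ w - yva) ** 2))

# Outer level: search the architecture space for the best validation score.
widths = [1, 2, 4, 8, 16, 32]
best_width = min(widths, key=val_error)
```

Real NAS replaces the grid search with reinforcement learning, evolution, or gradient-based relaxations, but the division of labor between inner training and outer validation-driven search is the same.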
Bayesian Meta Learning
Rather than learning optimization parameters directly, Bayesian hierarchical modeling and Bayesian inference emphasize learning by inference. They also provide uncertainty measures for the θ parameters, and hence measures of prediction uncertainty.
Continual, Online and Adaptive Learning
Continual: Meta-learning has been applied to improve continual learning, which seeks a human-like learning capability: new tasks are learned better given past experience, without forgetting previously learned tasks and without needing to store all past data for rehearsal against forgetting.
Online and Adaptive Learning: These settings also consider tasks arriving in a stream, but are concerned more with effectively adapting to the current task in the stream than with remembering old tasks.
Meta-generalization
Meta-learning encounters a generalization challenge across tasks, analogous to the challenge of generalizing across instances in conventional machine learning.
Multi-modality of task distribution
Many meta-learning algorithms assume a unimodal task distribution, whereas in reality the distribution is often multi-modal, and a single learning strategy may not give good results.
Meta-learning algorithms often lead to a quadratic number of learning steps, since each outer step requires multiple inner steps. Moreover, in many-shot experiments there are a large number of inner steps, each of which must be stored in memory. For this reason, most meta-learning frameworks are extremely expensive in both time and memory, and are often limited to small architectures in the few-shot regime.
Cross-modal transfer and heterogeneous tasks
Most meta-learning methods studied so far consider tasks all drawn from the same modality, such as vision, text, proprioceptive state, or audio. Extracting knowledge from a set of tasks that each have their own modality, and transferring it to another task with a different modality, remains an open challenge.
There has been exponential growth in meta-learning algorithms, and it is easy to confuse meta-learning with its related fields. With the help of a taxonomy and a broad classification focused on the functioning and goals of these algorithms, we can clearly identify them, benchmark their performance, and effectively evaluate their applications.