Source: Deep Learning on Medium
Summaries of papers that address learning from few examples
Last week (5/6/19) marked the start of the International Conference on Learning Representations (ICLR). As such, I thought I would dive into some of the ICLR papers that I found the most interesting. Most of these papers are related to areas of personal interest for me (unsupervised learning, meta-learning, attention, NLP), but some I chose simply because of their high quality and impact in their respective areas. This first part will address breakthroughs in the area of deep learning on small datasets. The second part will talk about papers that address breakthroughs in NLP and other types of sequential data. Finally, the third part will be an assortment of miscellaneous papers that I found interesting.
Transfer, meta, and unsupervised learning
The problem of limited training data affects a wide variety of industries including healthcare, agriculture, automotive, retail, entertainment, etc. In other cases there is plenty of data, but it is not annotated. This problem is a frequent barrier to the integration of deep learning due to the steep time and cost of gathering and annotating data.
This paper (herein referred to as Metz et al.) builds upon ideas in both meta-learning and unsupervised learning. Specifically, the paper proposes leveraging meta-learning to learn effective representations for downstream tasks in an unsupervised fashion. The paper focuses on “semi-supervised” classification, but what makes it particularly interesting is that, at least in theory, the learning rule “could be optimized to generate representations for any subsequent task.” This is useful because in most works on the unsupervised learning of representations, the authors define a specific training algorithm or loss function. These custom rules often require a significant amount of experimentation and domain knowledge and hence cannot easily be adapted to new domains. An example of this would be the use of auto-encoders, which learn representations by encoding an input and then decoding an output identical to the original; auto-encoders often require a specialized loss function. Here, in contrast, the model “learn[s] the algorithm that creates useful representations as determined by a meta-objective.”
To understand exactly how this works recall that typically with meta-learning we have both an inner loop and an outer loop. In the inner loop the model works on a specific task, for instance in image classification this could be identifying dogs and cats. Normally, the inner loop would run on a certain number of examples, n (generally n is between 1 and 10). Then the outer loop would use some parameters from the inner loop (either the weights themselves, cumulative loss or something else) to perform a meta-update. The specifics of this meta-update vary from model to model, but they generally follow this approach.
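To make the inner/outer loop structure concrete, here is a minimal sketch on a toy problem of my own choosing (fitting the slope of a line); it is not taken from any of the papers. The outer update is a simple first-order, Reptile-style rule, and the function names, learning rates, and task distribution are all illustrative assumptions.

```python
import random

def inner_loop(w, examples, lr=0.1, steps=5):
    """Adapt the weight w to a single task with a few SGD steps on squared error."""
    for _ in range(steps):
        for x, y in examples:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def meta_train(task_slopes, meta_lr=0.5, rounds=200, n_shot=5, seed=0):
    """Outer loop: nudge the shared initialisation toward each task-adapted weight."""
    rng = random.Random(seed)
    w_meta = 0.0
    for _ in range(rounds):
        slope = rng.choice(task_slopes)  # sample a task (here: a line y = slope * x)
        xs = [rng.uniform(-1, 1) for _ in range(n_shot)]
        examples = [(x, slope * x) for x in xs]  # the n examples seen by the inner loop
        w_task = inner_loop(w_meta, examples)
        w_meta += meta_lr * (w_task - w_meta)  # first-order meta-update (assumed rule)
    return w_meta

w_meta = meta_train([1.0, 2.0, 3.0])
w_adapted = inner_loop(w_meta, [(1.0, 2.0)])  # adapt to the task y = 2x from one example
```

After meta-training, `w_meta` sits somewhere between the task slopes, so a handful of inner-loop steps is enough to move it toward any one of them.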
With this in mind, the core of their architecture is to meta-learn a rule for updating the inner model after it creates representations. This rule effectively replaces SGD in the inner loop. Furthermore, it is the unsupervised update rule itself that is updated at the end of the cycle, in contrast to MAML, where the weights themselves are meta-updated, or SNAIL, where the weights of the attention model are. This means that the unsupervised learning rule can be applied not just to similar tasks but also to completely new tasks, new base models, and even new modalities of data (for instance from images to text).
The authors first motivate their method by demonstrating problems with prior approaches. For instance, a VAE suffers from an objective function (i.e., loss) mismatch that causes poor performance over time. Prototypical networks, on the other hand, transfer features, so they begin to break down if the dimensionality of the features differs between tasks. In contrast, because Metz et al.’s approach learns an update rule, it can generalize better than the VAE in few-shot classification tasks. They also show that the meta-learned update rule can generalize to improving text classification even when it was only trained on image classification tasks (though they did see a steep decline in performance if the meta-function was trained too long on the image classification task, as it overfit to the imaging task).
Altogether this is a really good paper and a big step forward in unsupervised techniques. Even though it doesn’t set any state-of-the-art results, it could definitely be applied to a lot of areas where data is scarce. The authors’ code for the paper is available at the following link.
Unsupervised Learning via Meta-Learning
Fascinatingly, ICLR this year features two papers that both propose combining meta-learning and unsupervised learning, albeit in two completely different ways. Instead of using meta-learning to learn unsupervised learning rules, this paper uses unsupervised learning to partition datasets for meta-learning.
This paper is one of my favorites as it opens the door for meta-learning without explicit task descriptions. Part of the problem with meta-learning is that it often requires very well-defined sets of tasks. This limits meta-learning to areas where you have very large annotated meta-datasets (that are already partitioned into distinct sub-datasets). This approach proposes automatically partitioning datasets into distinct subsets. The authors find that even when using simple unsupervised clustering algorithms, such as K-means, the meta-learner is still able to learn from these tasks and perform better on subsequent human-labeled tasks than methods that learn directly on these embeddings (as is the case with unsupervised learning followed by supervised classification). The two meta-learning techniques they use are ProtoNets and MAML. This paper demonstrates an interesting form of semi-supervised learning where we have unsupervised pre-training followed by supervised learning. In this case, the “supervised” component is doing few-shot learning.
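The partitioning idea can be sketched in a few lines. The code below is my own toy illustration, not the authors’ pipeline: it clusters 2-D points with a bare-bones K-means and then treats the cluster indices as pseudo-labels to build an N-way, K-shot task. The names `kmeans` and `sample_task`, the farthest-first initialisation, and the synthetic blobs are all assumptions made for the example; the paper clusters learned embeddings rather than raw points.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=10):
    """Plain K-means; returns one cluster index per point."""
    # Farthest-first initialisation keeps this toy example deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centre if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def sample_task(points, assign, n_way, k_shot, rng):
    """Build one few-shot task whose classes are randomly chosen clusters."""
    clusters = {}
    for p, a in zip(points, assign):
        clusters.setdefault(a, []).append(p)
    eligible = [c for c in sorted(clusters) if len(clusters[c]) >= k_shot]
    task = []
    for label, c in enumerate(rng.sample(eligible, n_way)):
        task.extend((p, label) for p in rng.sample(clusters[c], k_shot))
    return task

rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # synthetic, unlabeled data
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(20)]
assign = kmeans(points, k=3)
task = sample_task(points, assign, n_way=2, k_shot=3, rng=rng)  # 6 (point, pseudo-label) pairs
```

Tasks sampled this way can then be fed to any meta-learner (the paper uses ProtoNets and MAML) exactly as if the labels had come from human annotators.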
The authors compare their methods to unsupervised learning methods on four datasets (MNIST, Omniglot, miniImageNet, and CelebA). In the end they found that their approach outperforms pretty much all other unsupervised + supervised learning methods, including cluster matching, MLP, linear classifier, and KNN. Altogether the paper is a good step in the direction of making meta-learning more accessible to a variety of different types of problems rather than just ones with well defined task splits.
This paper aims to combine gradient-based meta-learning with a latent representation network. LEO operates in two steps: first it learns a low-dimensional embedding of the model parameters, then meta-learning is performed in the low-dimensional embedding space of the model. Specifically, the model is first given a task T with inputs that are passed to an encoder. The encoder produces a latent code, which is then decoded into a set of parameters. A relation network is part of this encoder, which helps the code become context dependent. These parameters are then optimized in the inner loop, while the encoder, decoder, and relation net are optimized in the outer loop. The authors note that the main contribution of their work is to show that meta-learning in a low-dimensional embedding space works a lot better than in a high-dimensional space like that used by MAML. LEO achieves strong experimental results on both the tieredImageNet and miniImageNet datasets (including an impressive 61% accuracy on the 1-shot 5-way benchmark and 77% on the 5-shot 5-way). Like many other papers it tests only on image data, so it is not clear how well it would generalize to other types of data.
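To illustrate why adapting in a latent space is cheap, here is a rough sketch of the inner loop only. It is heavily simplified and rests on my own assumptions: the decoder is a fixed linear map `D`, the starting code `z0` stands in for the encoder and relation network output, and the outer loop that would train them is omitted. None of the names come from the paper.

```python
def decode(D, z):
    """Linear decoder (an assumption): map a latent code z to model weights w = D z."""
    return [sum(d * zj for d, zj in zip(row, z)) for row in D]

def loss_and_grad_w(w, data):
    """Mean squared error of a linear model and its gradient with respect to w."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += err * err / n
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / n
    return loss, grad

def adapt_latent(D, z, data, lr=0.1, steps=50):
    """Inner loop: gradient descent on the 2-d code z instead of the 4-d weights."""
    for _ in range(steps):
        _, grad_w = loss_and_grad_w(decode(D, z), data)
        # Chain rule: dL/dz = D^T dL/dw
        grad_z = [sum(D[i][j] * grad_w[i] for i in range(len(D))) for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad_z)]
    return z

D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # hypothetical fixed decoder
w_true = decode(D, [1.0, -0.5])  # the task's weights lie in the decoder's range
data = [([1.0 if i == j else 0.0 for j in range(4)], w_true[i]) for i in range(4)]

z0 = [0.0, 0.0]  # stand-in for the encoder's output
loss_before, _ = loss_and_grad_w(decode(D, z0), data)
z_adapted = adapt_latent(D, z0, data)
loss_after, _ = loss_and_grad_w(decode(D, z_adapted), data)
```

The inner loop only ever touches two numbers here, even though the decoded model has four weights; in LEO the gap between the latent and parameter dimensions is far larger.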
As the author of this paper has already posted a detailed Medium article about how it works, I won’t go into too much detail about the technical aspects. In the broader context of the other meta-learning papers, this paper has several parts that are worth highlighting. First, it evaluates in both few-shot learning scenarios and larger data scenarios. This is important, as meta-learning papers often do not look at how well meta-optimization works when there is a larger number of examples but still too few to train the model from scratch. It also looks at several other areas that are rarely explored. Specifically, it addresses the underexplored area of ‘far-transfer’, that is, enabling positive knowledge transfer between significantly different tasks.
This paper discusses a new type of Variational Autoencoder (VAE) designed to better cluster high-dimensional data. Clustering items into distinct groupings is an important preliminary step in unsupervised learning. The authors note that many types of data can be clustered in several different ways depending on which of their attributes are considered. In their words, “LTVAE produces multiple partitions of data, each being given by one super latent variable.”
The LTVAE not only learns the location of each cluster to best represent the data, but also the number of clusters and the hierarchical structure of the underlying tree. This is achieved by a three-step learning algorithm. Step 1 is a traditional training of the encoder and decoder neural networks to improve their fit to the data. Step 2 is an EM-like optimization to better fit the parameters of the latent prior to the learned posterior. Step 3 adapts the structure of the latent prior to improve its BIC score, which balances a good fit of the latent posterior against the number of parameters (and thus the complexity) of the latent prior.
The main advantage of this approach is that it improves the interpretability of the clustering, even if the overall result, in terms of log-likelihood, is not as good. Additionally, the fact that you cluster based on specific facets makes it attractive for many real-world applications. Although this paper is different from many of the other papers here and does not explicitly address few-shot learning, I think its approach to clustering could prove useful when combined with few-shot methods. For instance, it could possibly be used for the task partitioning in the “Unsupervised Learning via Meta-Learning” setting.
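The trade-off that the BIC score encodes is easy to see numerically. The snippet below uses one common form of BIC, in which higher is better (log-likelihood minus a complexity penalty); whether the paper uses exactly this form is an assumption on my part, and the numbers are made up purely for illustration.

```python
import math

def bic_score(log_likelihood, num_params, num_samples):
    """BIC in its 'higher is better' form: fit minus a penalty that grows with model size."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)

# Two hypothetical latent structures fitted to the same 1000 samples.
simple = bic_score(log_likelihood=-1000.0, num_params=10, num_samples=1000)
complex_ = bic_score(log_likelihood=-990.0, num_params=40, num_samples=1000)
prefer_simple = simple > complex_
```

Here the richer structure fits slightly better, but not by enough to pay for its 30 extra parameters, so a structure search guided by this score would keep the simpler one.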
This article focuses on using meta-learning and a Chinese Restaurant Process to rapidly update reinforcement learning models when they are running online (i.e., in production). This is inspired by the fact that humans are often faced with new situations that we have not (exactly) experienced before; however, we can utilize our past experiences combined with feedback from the new experience to rapidly adapt.
The authors’ approach first utilizes MAML to train the model. Once MAML has produced an effective prior, the online learning algorithm takes over. The online learning algorithm utilizes the Chinese Restaurant Process to either spawn a new model with the appropriate initialization or select an existing model. SGD is then used to update the model parameters based on the results. The authors name this proposed method meta-learning for online learning (MOLe for short).
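For readers unfamiliar with the Chinese Restaurant Process, the sketch below shows the standard CRP prior: an existing model is chosen with probability proportional to how often it has been used, and a new one is spawned with probability proportional to a concentration parameter `alpha`. Weighting that prior by how well each model explains the latest data is my simplified reading of the selection step; the function names and the likelihood values are hypothetical.

```python
def crp_prior(counts, alpha):
    """CRP prior over existing models (by usage count) plus one slot for a new model."""
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]

def task_posterior(counts, alpha, likelihoods):
    """Combine the CRP prior with per-model data likelihoods (last entry = new model)."""
    prior = crp_prior(counts, alpha)
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

counts = [3, 1]  # two existing models, used 3 times and 1 time
prior = crp_prior(counts, alpha=1.0)
# Hypothetical likelihoods: neither existing model explains the new data well.
posterior = task_posterior(counts, 1.0, likelihoods=[0.1, 0.1, 0.8])
spawn_new = max(range(len(posterior)), key=posterior.__getitem__) == len(counts)
```

With these numbers the prior favours the most-used model, but the data pushes most of the posterior mass onto the extra slot, so a new model would be spawned from the meta-learned initialization.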
The authors evaluate their methodology on several RL environments. The first environment is a simulated cheetah traversing slopes of varying difficulty. The second environment is a six-legged crawler robot with crippled legs. MOLe outperforms model-based RL, k-shot adaptation with meta-learning, and continued gradient steps with meta-learning (though interestingly it only slightly outperforms continued gradient steps with meta-learning).
When a neural network learns a sequence of tasks, it often suffers from a problem called catastrophic forgetting. With catastrophic forgetting, the neural network can no longer perform well on the previous tasks it was trained on. Catastrophic forgetting can be considered a special case of transfer learning where there is significant negative backward transfer. Transfer learning (as the majority of people refer to it) and meta-learning usually seek to maximize the positive forward transfer on the final task, but generally do not pay any attention to the impact on the original task(s). This paper seeks to strike more of a balance: the authors still want positive transfer, but not at the expense of catastrophic forgetting (interference).
To address this problem, Riemer et al. propose an approach called Meta Experience Replay (MER). MER utilizes standard experience replay, where past training examples are interleaved with the current training examples to prevent catastrophic forgetting. These past examples are given a lower learning rate. Secondly, MER employs the popular meta-learning algorithm Reptile to train on new data. However, MER interleaves previous examples from the memory buffer with new incoming examples in the inner training loop powered by Reptile in order to prevent catastrophic forgetting.
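The combination of replay and Reptile can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the authors’ algorithm: the model is a single weight, the buffer is filled by reservoir sampling, and each incoming example triggers one Reptile-style update over a mixed batch. All names and hyperparameters are made up for the example.

```python
import random

class ReplayBuffer:
    """Fixed-size memory filled by reservoir sampling (an assumption of this sketch)."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)  # keep each example with equal probability
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def sgd_step(w, example, lr):
    x, y = example
    return w - lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2

def mer_step(w, new_example, buffer, inner_lr=0.05, meta_lr=0.5, replay_k=4):
    """One update: SGD over replayed + new examples, then a Reptile interpolation."""
    batch = buffer.sample(replay_k) + [new_example]  # interleave old and new data
    w_fast = w
    for ex in batch:
        w_fast = sgd_step(w_fast, ex, inner_lr)
    w = w + meta_lr * (w_fast - w)  # Reptile: move part of the way toward w_fast
    buffer.add(new_example)
    return w

buffer = ReplayBuffer(capacity=10)
rng = random.Random(1)
w = 0.0
for slope in (1.0, 3.0):  # two tasks seen one after the other
    for _ in range(50):
        x = rng.uniform(0.5, 1.5)
        w = mer_step(w, (x, slope * x), buffer)
```

Because the buffer keeps a uniform sample of everything seen so far, examples from the first task can keep appearing in the inner loop while the second task is being learned, which is what counteracts forgetting.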
I like that this paper explores both the idea of positive transfer and negative transfer. Its results on Omniglot and in the reinforcement learning setting appear to be quite good. However, particularly in the supervised classification setting, the authors only test on “toy” datasets. They should have also tested on the CIFAR-10 benchmark, Caltech Birds, or CORe50. At this point there is no real reason to test only on permuted MNIST or Omniglot when there are many other more realistic CL datasets. Additionally, I found some of their terminology confusing, as the authors “renamed” several previously named concepts. Also, ideally in the case of continual learning we would not have to retrain on any of the previous data (as retraining has an added computational cost). However, all in all it is a step in the right direction, and I hope more papers look at both forward and backward transfer. For more info on this paper IBM has a blog post and code is located here.
This is a fascinating application of meta-learning to seq2seq modeling. In this instance the authors use meta-learning to enable few-shot adaptation to a speaker’s voice. This is important, as many times you might not have hundreds or thousands of examples of a particular person’s voice. Specifically, the authors extend the WaveNet architecture in order to incorporate meta-learning. Interestingly, according to the authors, MAML did not produce a meaningful prior in their preliminary experiments. Therefore, they had to develop their own architecture.
The architecture functions in three steps: (1) train the model on a large corpus of text-speech pairs from a variety of speakers; (2) adapt the model on a few text-speech pairs from a single speaker; and (3) finally perform inference on pure text and convert it to the appropriate voice. The authors discuss two scenarios for few-shot learning: parametric few-shot adaptation with an embedding encoder (SEA-ENC) and non-parametric few-shot adaptation with fine-tuning (SEA-ALL). In the case of SEA-ENC, the authors train an auxiliary embedding network that predicts a speaker embedding vector given the new data. In contrast, for SEA-ALL the authors train both the network and the embeddings together. In evaluation, SEA-ALL seems to perform the best, although the authors state that the model seems to overfit with SEA-ALL and therefore recommend using early stopping. (Only on LibriSpeech in the 10s range is their model outperformed by previous papers.)
This paper is a good example of few-shot learning being applied to a tricky problem outside of the typical image classification domain, and of the adjustments necessary for it to actually work. Hopefully, we will see more attempts to apply few-shot learning to generative models in the future. The authors have a website where you can demo their TTS model; unfortunately, it does not seem to contain their actual code.
Short summaries of additional relevant papers at ICLR
Mudrakarta et al. introduce a model patch that consists of a small number of learnable parameters that specialize to each task. This method acts in lieu of the common practice of fine-tuning the final layer of the network. The authors find that not only does this method reduce the number of parameters (from over 1 million to 35k), but it also improves the fine-tuned accuracy in both the transfer and multi-task learning contexts. The only downside is that the patch seems fairly architecture specific.
Although the first part of this paper’s title is “Unsupervised Domain Adaptation”, it really addresses transfer learning. Recall that with domain adaptation the target domains generally have the same set of labels. However, in this case the authors assume an unlabeled target domain. As some of the reviewers note, the paper is confusing for this reason; however, it still has several worthwhile takeaways. The authors propose a feature transfer network (FTN) to separate the feature spaces of the source and target domains. The authors achieve state-of-the-art performance on cross-ethnicity face recognition.
This article discusses an application of meta-learning to synthesizing programs. In it, the authors construct a syntax-guided synthesizer that takes a logical specification and a grammar and then produces a program. It is a good example of an application of meta-learning beyond the typical few-shot image datasets.
This paper goes into the theory of learning and transfer learning. The authors state “our theory reveals that knowledge transfer depends sensitively, but computably, on the [signal to noise ratio] and input feature alignments of pairs of tasks.” Altogether this paper is interesting for those who like to delve into theory.
I hope this provides a good overview of the majority of few-shot learning papers at ICLR this year (though I likely missed a few). As you can see, there are a variety of interesting new techniques that are now opening up ways to use deep learning in data-constrained situations. Stay tuned for part two of my three-part ICLR series, where I will discuss advances in NLP (including Goal Oriented Dialogue), new and better attention mechanisms, as well as some interesting new types of recurrent architectures.