
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE Transactions on Pattern Analysis and Machine Intelligence. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Meta-Learning in Neural Networks: A Survey


Timothy Hospedales, Antreas Antoniou, Paul Micaelli, Amos Storkey

Abstract—The field of meta-learning, or learning-to-learn, has seen a dramatic rise in interest in recent years. Contrary to
conventional approaches to AI where tasks are solved from scratch using a fixed learning algorithm, meta-learning aims to improve the
learning algorithm itself, given the experience of multiple learning episodes. This paradigm provides an opportunity to tackle many
conventional challenges of deep learning, including data and computation bottlenecks, as well as generalization. This survey describes
the contemporary meta-learning landscape. We first discuss definitions of meta-learning and position it with respect to related fields,
such as transfer learning and hyperparameter optimization. We then propose a new taxonomy that provides a more comprehensive
breakdown of the space of meta-learning methods today. We survey promising applications and successes of meta-learning such as
few-shot learning and reinforcement learning. Finally, we discuss outstanding challenges and promising areas for future research.

Index Terms—Meta-Learning, Learning-to-Learn, Few-Shot Learning, Transfer Learning, Neural Architecture Search

1 INTRODUCTION

Contemporary machine learning models are typically trained from scratch for a specific task using a fixed learning algorithm designed by hand. Deep learning-based approaches specifically have seen great successes in a variety of fields [1]–[3]. However, there are clear limitations [4]. For example, successes have largely been in areas where vast quantities of data can be collected or simulated, and where huge compute resources are available. This excludes many applications where data is intrinsically rare or expensive [5], or compute resources are unavailable [6].

Meta-learning is the process of distilling the experience of multiple learning episodes – often covering a distribution of related tasks – and using this experience to improve future learning performance. This 'learning-to-learn' [7] can lead to a variety of benefits such as improved data and compute efficiency, and it is better aligned with human and animal learning [8], where learning strategies improve both on a lifetime and evolutionary timescales [8]–[10].

Historically, the success of machine learning was driven by advancing hand-engineered features [11], [12]. Deep learning realised the promise of feature representation learning [13], providing a huge improvement in performance for many tasks [1], [3] compared to prior hand-designed features. Meta-learning in neural networks aims to provide the next step of integrating joint feature, model, and algorithm learning; that is, it targets replacing prior hand-designed learners with learned learning algorithms [7], [14]–[16].

Neural network meta-learning has a long history [7], [17], [18]. However, its potential as a driver to advance the frontier of the contemporary deep learning industry has led to an explosion of recent research. In particular, meta-learning has the potential to alleviate many of the main criticisms of contemporary deep learning [4], for instance by improving data efficiency, knowledge transfer and unsupervised learning. Meta-learning has proven useful both in multi-task scenarios where task-agnostic knowledge is extracted from a family of tasks and used to improve learning of new tasks from that family [7], [19]; and single-task scenarios where a single problem is solved repeatedly and improved over multiple episodes [15], [20], [21]. Successful applications have been demonstrated in areas spanning few-shot image recognition [19], [22], unsupervised learning [16], data-efficient [23], [24] and self-directed [25] reinforcement learning (RL), hyperparameter optimization [20], and neural architecture search (NAS) [21], [26], [27].

Many perspectives on meta-learning can be found in the literature, in part because different communities use the term differently. Thrun [7] operationally defines learning-to-learn as occurring when a learner's performance at solving tasks drawn from a given task family improves with respect to the number of tasks seen (cf. conventional machine learning, where performance improves as more data from a single task is seen). This perspective [28]–[30] views meta-learning as a tool to manage the 'no free lunch' theorem [31] and improve generalization by searching for the algorithm (inductive bias) that is best suited to a given problem, or problem family. However, this definition can include transfer, multi-task, feature-selection, and model-ensemble learning, which are not typically considered as meta-learning today. Another usage of meta-learning [32] deals with algorithm selection based on dataset features, and becomes hard to distinguish from automated machine learning (AutoML) [33].

In this paper, we focus on contemporary neural-network meta-learning. We take this to mean algorithm learning as per [28], [29], but focus specifically on where this is achieved by end-to-end learning of an explicitly defined objective function (such as cross-entropy loss). Additionally we consider single-task meta-learning, and discuss a wider variety of (meta) objectives such as robustness and compute efficiency.

This paper provides a unique, timely, and up-to-date survey of the rapidly growing area of neural network meta-learning. In contrast, previous surveys are rather out of date and/or focus on algorithm selection for data mining [28], [32], [34], [35], AutoML [33], or particular applications of meta-learning such as few-shot learning [36] or NAS [37].

T. Hospedales is with Samsung AI Centre, Cambridge and University of Edinburgh. A. Antoniou, P. Micaelli and A. Storkey are with University of Edinburgh. Email: {t.hospedales,a.antoniou,paul.micaelli,a.storkey}@ed.ac.uk.


We address both meta-learning methods and applications. We first introduce meta-learning through a high-level problem formalization that can be used to understand and position work in this area. We then provide a new taxonomy in terms of meta-representation, meta-objective and meta-optimizer. This framework provides a design-space for developing new meta-learning methods and customizing them for different applications. We survey several popular and emerging application areas including few-shot learning, reinforcement learning, and architecture search; and position meta-learning with respect to related topics such as transfer and multi-task learning. We conclude by discussing outstanding challenges and areas for future research.

2 BACKGROUND

Meta-learning is difficult to define, having been used in various inconsistent ways, even within contemporary neural-network literature. In this section, we introduce our definition and key terminology, and then position meta-learning with respect to related topics.

Meta-learning is most commonly understood as learning to learn: the process of improving a learning algorithm over multiple learning episodes. In contrast, conventional ML improves model predictions over multiple data instances. During base learning, an inner (or lower/base) learning algorithm solves a task such as image classification [13], defined by a dataset and objective. During meta-learning, an outer (or upper/meta) algorithm updates the inner learning algorithm such that the model it learns improves an outer objective. For instance this objective could be generalization performance or learning speed of the inner algorithm.

As defined above, many conventional algorithms such as random search of hyper-parameters by cross-validation could fall within the definition of meta-learning. The salient characteristic of contemporary neural-network meta-learning is an explicitly defined meta-level objective, and end-to-end optimization of the inner algorithm with respect to this objective. Often, meta-learning is conducted on learning episodes sampled from a task family, leading to a base learning algorithm that performs well on new tasks sampled from this family. However, in a limiting case all training episodes can be sampled from a single task. In the following section, we introduce these notions more formally.

2.1 Formalizing Meta-Learning

Conventional Machine Learning. In conventional supervised machine learning, we are given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, such as (input image, output label) pairs. We can train a predictive model ŷ = f_θ(x) parameterized by θ, by solving:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\mathcal{D}; \theta, \omega) \qquad (1)$$

where L is a loss function that measures the error between true labels and those predicted by f_θ(·). Here, ω specifies assumptions about 'how to learn', such as the choice of optimizer for θ or function class for f. Generalization is then measured by evaluating L on test points with known labels. The conventional assumption is that this optimization is performed from scratch for every problem D, and that ω is pre-specified. However, the specification of ω can drastically affect performance measures like accuracy or data efficiency. Meta-learning seeks to improve these measures by learning the learning algorithm itself, rather than assuming it is pre-specified and fixed. This is often achieved by revisiting the first assumption above, and learning from a distribution of tasks rather than from scratch. Note that while the following formalization of meta-learning takes a supervised perspective for simplicity, all the ideas generalize directly to a reinforcement learning setting as discussed in Sec. 5.2.

Meta-Learning: Task-Distribution View. A common view of meta-learning is to learn a general purpose learning algorithm that generalizes across tasks, and ideally enables each new task to be learned better than the last. We can evaluate the performance of ω over a task distribution p(T), where a task specifies a dataset and loss function T = {D, L}. At a high level, learning how to learn thus becomes

$$\min_{\omega}\ \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\ \mathcal{L}(\mathcal{D}; \omega) \qquad (2)$$

where L is a task-specific loss. In typical machine-learning settings, we can split D = (D^train, D^val). We then define the task-specific loss to be L(D; ω) = L(D^val; θ*(D^train, ω), ω), where θ* are the parameters of the model trained using the 'how to learn' meta-knowledge ω on dataset D^train.

More concretely, to solve this problem in practice, we often assume a set of M source tasks is sampled from p(T). Formally, we denote the set of M source tasks used in the meta-training stage as D_source = {(D^train_source, D^val_source)^(i)}_{i=1}^M, where each task has both training and validation data. Often, the source train and validation datasets are respectively called support and query sets. The meta-training step of 'learning how to learn' can be written as:

$$\omega^{*} = \arg\min_{\omega} \sum_{i=1}^{M} \mathcal{L}(\mathcal{D}^{(i)}_{source}; \omega) \qquad (3)$$

Now we denote the set of Q target tasks used in the meta-testing stage as D_target = {(D^train_target, D^test_target)^(i)}_{i=1}^Q, where each task has both training and test data. In the meta-testing stage we use the learned meta-knowledge ω* to train the base model on each previously unseen target task i:

$$\theta^{*\,(i)} = \arg\min_{\theta} \mathcal{L}(\mathcal{D}^{train\,(i)}_{target}; \theta, \omega^{*}) \qquad (4)$$

In contrast to conventional learning in Eq. 1, learning on the training set of a target task i now benefits from meta-knowledge ω* about the algorithm to use. This could be an estimate of the initial parameters [19], or an entire learning model [38] or optimization strategy [14]. We can evaluate the accuracy of our meta-learner by the performance of θ*(i) on the test split of each target task, D^test(i)_target.

This setup leads to analogies of conventional underfitting and overfitting: meta-underfitting and meta-overfitting. In particular, meta-overfitting is an issue whereby the meta-knowledge learned on the source tasks does not generalize to the target tasks. It is relatively common, especially in the case where only a small number of source tasks are available. It can be seen as learning an inductive bias ω that constrains the hypothesis space of θ too tightly around solutions to the source tasks.
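To make the two-stage pipeline of Eqs. 3-4 concrete, the following toy sketch meta-learns a single scalar of meta-knowledge – the inner learning rate – over sampled linear-regression tasks, then reuses it on an unseen target task. The task generator, the choice of ω as a step size, and the grid-search outer loop are illustrative assumptions for this sketch, not a method from the literature.

```python
# Toy episodic meta-training (Eq. 3) and meta-testing (Eq. 4) with NumPy.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task: noisy 1-D linear data y = w*x, split into train/val halves."""
    w = rng.normal()
    x = rng.normal(size=20)
    y = w * x + 0.1 * rng.normal(size=20)
    return (x[:10], y[:10]), (x[10:], y[10:])

def inner_train(train, omega, steps=5):
    """theta*(D_train, omega): a few SGD steps with learning rate omega."""
    (x, y), theta = train, 0.0
    for _ in range(steps):
        grad = 2 * np.mean((theta * x - y) * x)
        theta -= omega * grad
    return theta

def task_loss(split, theta):
    x, y = split
    return np.mean((theta * x - y) ** 2)

# Meta-training (Eq. 3): choose omega minimizing summed validation loss.
source_tasks = [sample_task() for _ in range(20)]
candidates = [0.01, 0.05, 0.1, 0.3, 0.5]
meta_loss = lambda om: sum(task_loss(val, inner_train(tr, om))
                           for tr, val in source_tasks)
omega_star = min(candidates, key=meta_loss)

# Meta-testing (Eq. 4): learn an unseen target task with omega*.
target_train, target_test = sample_task()
theta_star = inner_train(target_train, omega_star)
print(omega_star, task_loss(target_test, theta_star))
```

Replacing the grid search with gradient-based, RL, or evolutionary outer optimization yields the meta-optimizer families surveyed in Sec. 4.2.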


Meta-Learning: Bilevel Optimization View. The previous discussion outlines the common flow of meta-learning in a multiple task scenario, but does not specify how to solve the meta-training step in Eq. 3. This is commonly done by casting the meta-training step as a bilevel optimization problem. While this picture is most accurate for the optimizer-based methods (see Sec. 3.1), it is helpful to visualize the mechanics of meta-learning more generally. Bilevel optimization [39] refers to a hierarchical optimization problem, where one optimization contains another optimization as a constraint [20], [40]. Using this notation, meta-training can be formalised as follows:

$$\omega^{*} = \arg\min_{\omega} \sum_{i=1}^{M} \mathcal{L}^{meta}(\mathcal{D}^{val\,(i)}_{source}; \theta^{*\,(i)}(\omega), \omega) \qquad (5)$$
$$\text{s.t.}\quad \theta^{*\,(i)}(\omega) = \arg\min_{\theta} \mathcal{L}^{task}(\mathcal{D}^{train\,(i)}_{source}; \theta, \omega) \qquad (6)$$

where L^meta and L^task refer to the outer and inner objectives respectively, such as cross entropy in the case of classification; and we have dropped explicit dependence of ω on D for brevity. Note the leader-follower asymmetry between the outer and inner levels: the inner optimization Eq. 6 is conditional on the learning strategy ω defined by the outer level, but it cannot change ω during its training. Here ω could indicate an initial condition in non-convex optimization [19] or a hyper-parameter such as regularization strength [20]. Sec. 4.1 discusses the space of choices for ω in detail. The outer level optimization trains ω to produce models θ*(i)(ω) that perform well on their validation sets after training. Sec. 4.2 discusses how to optimize ω. Note that L^meta and L^task are distinctly denoted, because they can be different functions. For example, ω can define the inner task loss L^task_ω (see Sec. 4.1, [24], [41]); and as explored in Sec. 4.3, L^meta can measure different quantities such as validation performance, learning speed or robustness.

Finally, we note that the above formalization of meta-training uses the notion of a distribution over tasks. While common in the meta-learning literature, it is not a necessary condition for meta-learning. More formally, if we are given a single train and test dataset (M = Q = 1), we can split the training set to get validation data such that D_source = (D^train_source, D^val_source) for meta-training, and for meta-testing we can use D_target = (D^train_source ∪ D^val_source, D^test_target). We still learn ω over several episodes, and different train-val splits are usually used during meta-training.

Meta-Learning: Feed-Forward Model View. As we will see, there are a number of meta-learning approaches that synthesize models in a feed-forward manner, rather than via an explicit iterative optimization as in Eqs. 5-6 above. While they vary in their degree of complexity, it can be instructive to understand this family of approaches by instantiating the abstract objective in Eq. 2 to define a toy example for meta-training linear regression [42]:

$$\min_{\omega}\ \mathbb{E}_{\substack{\mathcal{T} \sim p(\mathcal{T}) \\ (\mathcal{D}^{tr}, \mathcal{D}^{val}) \in \mathcal{T}}}\ \sum_{(x,y) \in \mathcal{D}^{val}} \left[ \left( x^{T} g_{\omega}(\mathcal{D}^{tr}) - y \right)^{2} \right] \qquad (7)$$

Here we meta-train by optimizing over a distribution of tasks. For each task a train and validation set is drawn. The train set D^tr is embedded [43] into a vector g_ω which defines the linear regression weights to predict examples x from the validation set. Optimizing Eq. 7 'learns to learn' by training the function g_ω to map a training set to a weight vector. Thus g_ω should provide a good solution for novel meta-test tasks T^te drawn from p(T). Methods in this family vary in the complexity of the predictive model g used, and how the support set is embedded [43] (e.g., by pooling, CNN or RNN). These models are also known as amortized [44] because the cost of learning a new task is reduced to a feed-forward operation through g_ω(·), with iterative optimization already paid for during meta-training of ω.
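A minimal PyTorch sketch of this feed-forward view, assuming a toy linear-regression task generator and a mean-pooled set encoder for g_ω (both illustrative choices rather than prescriptions of Eq. 7):

```python
# Feed-forward meta-learner: a set encoder g_omega maps a training set
# directly to regression weights, meta-trained on validation loss (Eq. 7).
import torch
import torch.nn as nn

d = 3  # input dimension of each task (illustrative)

class SetToWeights(nn.Module):
    """g_omega: embed (x, y) pairs, mean-pool, decode regression weights."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d + 1, 64), nn.ReLU())
        self.dec = nn.Linear(64, d)

    def forward(self, x_tr, y_tr):
        h = self.enc(torch.cat([x_tr, y_tr[:, None]], dim=1)).mean(0)
        return self.dec(h)  # predicted weight vector for this task

def sample_task(n=20):
    w = torch.randn(d)
    x = torch.randn(n, d)
    y = x @ w + 0.1 * torch.randn(n)
    return (x[:10], y[:10]), (x[10:], y[10:])

g = SetToWeights()
opt = torch.optim.Adam(g.parameters(), lr=1e-3)
for step in range(2000):                         # outer loop over p(T)
    (x_tr, y_tr), (x_val, y_val) = sample_task()
    w_hat = g(x_tr, y_tr)                        # feed-forward "learning"
    loss = ((x_val @ w_hat - y_val) ** 2).mean() # validation loss of Eq. 7
    opt.zero_grad(); loss.backward(); opt.step()
```

At meta-test time a new task is "learned" by a single forward pass through g, with no inner optimization.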


2.2 Historical Context of Meta-Learning

Meta-learning and learning-to-learn first appear in the literature in 1987 [17]. J. Schmidhuber introduced a family of methods that can learn how to learn, using self-referential learning. Self-referential learning involves training neural networks that can receive as inputs their own weights and predict updates for said weights. Schmidhuber proposed to learn the model itself using evolutionary algorithms.

Meta-learning was subsequently extended to multiple areas. Bengio et al. [45], [46] proposed to meta-learn biologically plausible learning rules. Schmidhuber et al. continued to explore self-referential systems and meta-learning [47], [48]. S. Thrun et al. took care to more clearly define the term learning to learn in [7] and introduced initial theoretical justifications and practical implementations. Proposals for training meta-learning systems using gradient descent and backpropagation were first made in 1991 [49], followed by more extensions in 2001 [50], [51], with [28] giving an overview of the literature at that time. Meta-learning was used in the context of reinforcement learning in 1995 [52], followed by various extensions [53], [54].

2.3 Related Fields

Here we position meta-learning against related areas whose relation to meta-learning is often a source of confusion.

Transfer Learning (TL). TL [34], [55] uses past experience from a source task to improve learning (speed, data efficiency, accuracy) on a target task. TL refers both to this problem area and a family of solutions, most commonly parameter transfer plus optional fine tuning [56] (although there are numerous other approaches [34]).

In contrast, meta-learning refers to a paradigm that can be used to improve TL as well as other problems. In TL the prior is extracted by vanilla learning on the source task without the use of a meta-objective. In meta-learning, the corresponding prior would be defined by an outer optimization that evaluates the benefit of the prior when learning a new task, as illustrated by MAML [19]. More generally, meta-learning deals with a much wider range of meta-representations than solely model parameters (Section 4.1).

Domain Adaptation (DA) and Domain Generalization (DG). Domain-shift refers to the situation where source and target problems share the same objective, but the input distribution of the target task is shifted with respect to the source task [34], [57], reducing model performance. DA is a variant of transfer learning that attempts to alleviate this issue by adapting the source-trained model using sparse or unlabeled data from the target. DG refers to methods to train a source model to be robust to such domain-shift without further adaptation. Many knowledge transfer methods have been studied [34], [57] to boost target domain performance. However, as for TL, vanilla DA and DG don't use a meta-objective to optimize 'how to learn' across domains. Meanwhile, meta-learning methods can be used to perform both DA [58] and DG [41] (see Sec. 5.8).

Continual Learning (CL). Continual or lifelong learning [59], [60] refers to the ability to learn on a sequence of tasks drawn from a potentially non-stationary distribution, and in particular to do so while accelerating learning of new tasks and without forgetting old tasks. Similarly to meta-learning, a task distribution is considered, and the goal is partly to accelerate learning of a target task. However, most continual learning methodologies are not meta-learning methodologies, since this meta-objective is not solved for explicitly. Nevertheless, meta-learning provides a potential framework to advance continual learning, and a few recent studies have begun to do so by developing meta-objectives that encode continual learning performance [61]–[63].

Multi-Task Learning (MTL) aims to jointly learn several related tasks, to benefit from regularization due to parameter sharing and the diversity of the resulting shared representation [64], [65], as well as compute/memory savings. Like TL, DA, and CL, conventional MTL is a single-level optimization without a meta-objective. Furthermore, the goal of MTL is to solve a fixed number of known tasks, whereas the point of meta-learning is often to solve unseen future tasks. Nonetheless, meta-learning can be brought in to benefit MTL, e.g. by learning the relatedness between tasks [66], or how to prioritise among multiple tasks [67].

Hyperparameter Optimization (HO) is within the remit of meta-learning, in that hyperparameters like learning rate or regularization strength describe 'how to learn'. Here we include HO tasks that define a meta-objective that is trained end-to-end with neural networks, such as gradient-based hyperparameter learning [66], [68] and neural architecture search [21]. But we exclude other approaches like random search [69] and Bayesian Hyperparameter Optimization [70], which are rarely considered to be meta-learning.

Hierarchical Bayesian Models (HBM) involve Bayesian learning of parameters θ under a prior p(θ|ω). The prior is written as a conditional density on some other variable ω which has its own prior p(ω). Hierarchical Bayesian models feature strongly as models for grouped data D = {D_i | i = 1, 2, ..., M}, where each group i has its own θ_i. The full model is $\left[ \prod_{i=1}^{M} p(\mathcal{D}_i|\theta_i)\, p(\theta_i|\omega) \right] p(\omega)$. The levels of hierarchy can be increased further; in particular ω can itself be parameterized, and hence p(ω) can be learnt. Learning is usually full-pipeline, but uses some form of Bayesian marginalisation to compute the posterior over ω: $P(\omega|\mathcal{D}) \propto p(\omega) \prod_{i=1}^{M} \int p(\mathcal{D}_i|\theta_i)\, p(\theta_i|\omega)\, d\theta_i$. The ease of doing the marginalisation depends on the model: in some (e.g. Latent Dirichlet Allocation [71]) the marginalisation is exact due to the choice of conjugate exponential models; in others (see e.g. [72]), a stochastic variational approach is used to calculate an approximate posterior, from which a lower bound to the marginal likelihood is computed.

Bayesian hierarchical models provide a valuable viewpoint for meta-learning, by providing a modeling rather than an algorithmic framework for understanding the meta-learning process. In practice, prior work in HBMs has typically focused on learning simple tractable models θ, while most meta-learning work considers complex inner-loop learning processes, involving many iterations. Nonetheless, some meta-learning methods like MAML [19] can be understood through the lens of HBMs [73].

AutoML. AutoML [32], [33] is a rather broad umbrella for approaches aiming to automate parts of the machine learning process that are typically manual, such as data preparation, algorithm selection, hyper-parameter tuning, and architecture search. AutoML often makes use of numerous heuristics outside the scope of meta-learning as defined here, and focuses on tasks such as data cleaning that are less central to meta-learning. However, AutoML sometimes makes use of end-to-end optimization of a meta-objective, so meta-learning can be seen as a specialization of AutoML.

3 TAXONOMY

3.1 Previous Taxonomies

Previous [74], [75] categorizations of meta-learning methods have tended to produce a three-way taxonomy across optimization-based methods, model-based (or black box) methods, and metric-based (or non-parametric) methods.

Optimization. Optimization-based methods include those where the inner-level task (Eq. 6) is literally solved as an optimization problem, and focus on extracting the meta-knowledge ω required to improve optimization performance. A famous example is MAML [19], which aims to learn the initialization ω = θ_0, such that a small number of inner steps produces a classifier that performs well on validation data. This is also performed by gradient descent, differentiating through the updates of the base model. More elaborate alternatives also learn step sizes [76], [77] or train recurrent networks to predict steps from gradients [14], [15], [78]. Meta-optimization by gradient over long inner optimizations leads to several compute and memory challenges which are discussed in Section 6. A unified view of gradient-based meta-learning expressing many existing methods as special cases of a generalized inner loop meta-learning framework has been proposed [79].
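As a concrete illustration of differentiating through base-model updates, the following sketch performs a single MAML-style inner step on toy regression tasks and backpropagates the validation loss to the initialization. The task generator, one-step inner loop, and step sizes are illustrative simplifications of [19], not its reference implementation:

```python
# MAML-style learned initialization via higher-order autograd in PyTorch.
import torch

theta0 = torch.zeros(2, requires_grad=True)   # omega: the learned initialization
meta_opt = torch.optim.SGD([theta0], lr=1e-2)

def sample_task():
    w = torch.randn(2)                         # ground-truth weights of this task
    def batch(n=16):
        x = torch.randn(n, 2)
        return x, x @ w + 0.1 * torch.randn(n)
    return batch

for it in range(1000):
    batch = sample_task()
    x_tr, y_tr = batch()                       # support set
    x_val, y_val = batch()                     # query set from the same task
    # Inner step (Eq. 6): one SGD step from theta0 on the support loss;
    # create_graph=True keeps the update differentiable w.r.t. theta0.
    inner_loss = ((x_tr @ theta0 - y_tr) ** 2).mean()
    grad, = torch.autograd.grad(inner_loss, theta0, create_graph=True)
    theta = theta0 - 0.1 * grad
    # Outer step (Eq. 5): validation loss of the adapted model, backprop to theta0.
    outer_loss = ((x_val @ theta - y_val) ** 2).mean()
    meta_opt.zero_grad(); outer_loss.backward(); meta_opt.step()
```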


Black Box / Model-based. In model-based (or black-box) methods the inner learning step (Eq. 6, Eq. 4) is wrapped up in the feed-forward pass of a single model, as illustrated in Eq. 7. The model embeds the current dataset D into activation state, with predictions for test data being made based on this state. Typical architectures include recurrent networks [14], [50], convolutional networks [38] or hypernetworks [80], [81] that embed training instances and labels of a given task to define a predictor for test samples. In this case all the inner-level learning is contained in the activation states of the model and is entirely feed-forward. Outer-level learning is performed with ω containing the CNN, RNN or hypernetwork parameters. The outer and inner-level optimizations are tightly coupled as ω and D directly specify θ. Memory-augmented neural networks [82] use an explicit storage buffer and can be seen as a model-based algorithm [83], [84]. Compared to optimization-based approaches, these enjoy simpler optimization without requiring second-order gradients. However, it has been observed that model-based approaches are usually less able to generalize to out-of-distribution tasks than optimization-based methods [85]. Furthermore, while they are often very good at data efficient few-shot learning, they have been criticised for being asymptotically weaker [85] as they struggle to embed a large training set into a rich base model.

Metric-Learning. Metric-learning or non-parametric algorithms are thus far largely restricted to the popular but specific few-shot application of meta-learning (Section 5.1.1). The idea is to perform non-parametric 'learning' at the inner (task) level by simply comparing validation points with training points and predicting the label of matching training points. In chronological order, this has been achieved with siamese [86], matching [87], prototypical [22], relation [88], and graph [89] neural networks. Here outer-level learning corresponds to metric learning (finding a feature extractor ω that represents the data suitably for comparison). As before, ω is learned on source tasks, and used for target tasks.

Discussion. The common breakdown reviewed above does not expose all facets of interest and is insufficient to understand the connections between the wide variety of meta-learning frameworks available today. For this reason, we propose a new taxonomy in the following section.

3.2 Proposed Taxonomy

We introduce a new breakdown along three independent axes. For each axis we provide a taxonomy that reflects the current meta-learning landscape.

Meta-Representation ("What?"). The first axis is the choice of meta-knowledge ω to meta-learn. This could be anything from initial model parameters [19] to readable code in the case of program induction [90].

Meta-Optimizer ("How?"). The second axis is the choice of optimizer to use for the outer level during meta-training (see Eq. 5). The outer-level optimizer for ω can take a variety of forms, from gradient-descent [19], to reinforcement learning [90] and evolutionary search [24].

Meta-Objective ("Why?"). The third axis is the goal of meta-learning, which is determined by the choice of meta-objective L^meta (Eq. 5), task distribution p(T), and data-flow between the two levels. Together these can customize meta-learning for different purposes such as sample efficient few-shot learning [19], [38], fast many-shot optimization [90], [91], robustness to domain-shift [41], [92], label noise [93], and adversarial attack [94].

Together these axes provide a design-space for meta-learning methods that can orient the development of new algorithms and customization for particular applications. Note that the base model representation θ isn't included in this taxonomy, since it is determined and optimized in a way that is specific to the application at hand.

4 SURVEY: METHODOLOGIES

In this section we break down existing literature according to our proposed new methodological taxonomy.

4.1 Meta-Representation

Meta-learning methods make different choices about what meta-knowledge ω should be, i.e. which aspects of the learning strategy should be learned; and (by exclusion) which aspects should be considered fixed.

Parameter Initialization. Here ω corresponds to the initial parameters of a neural network to be used in the inner optimization, with MAML being the most popular example [19], [95], [96]. A good initialization is just a few gradient steps away from a solution to any task T drawn from p(T), and can help to learn without overfitting in few-shot learning. A key challenge with this approach is that the outer optimization needs to solve for as many parameters as the inner optimization (potentially hundreds of millions in large CNNs). This leads to a line of work on isolating a subset of parameters to meta-learn, for example by subspace [75], [97], by layer [80], [97], [98], or by separating out scale and shift [99]. Another concern is whether a single initial condition is sufficient to provide fast learning for a wide range of potential tasks, or if one is limited to narrow distributions p(T). This has led to variants that model mixtures over multiple initial conditions [97], [100], [101].

Optimizer. The above parameter-centric methods usually rely on existing optimizers such as SGD with momentum or Adam [102] to refine the initialization when given some new task. Instead, optimizer-centric approaches [14], [15], [78], [91] focus on learning the inner optimizer by training a function that takes as input optimization states such as θ and ∇_θ L^task and produces the optimization step for each base learning iteration. The trainable component ω can span simple hyper-parameters such as a fixed step size [76], [77] to more sophisticated pre-conditioning matrices [103], [104]. Ultimately ω can be used to define a full gradient-based optimizer through a complex non-linear transformation of the input gradient and other metadata [14], [15], [90], [91]. The parameters to learn here can be few if the optimizer is applied coordinate-wise across weights [15]. The initialization-centric and optimizer-centric methods can be merged by learning them jointly, namely having the former learn the initial condition for the latter [14], [76]. Optimizer learning methods have been applied both to few-shot learning [14] and to accelerate and improve many-shot learning [15], [90], [91]. Finally, one can also meta-learn zeroth-order optimizers [105] that only require evaluations of L^task rather than optimizer states such as gradients. These have been shown [105] to be competitive with conventional Bayesian Optimization [70] alternatives.
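A minimal sketch of such an optimizer-centric representation: a tiny coordinate-wise network maps each parameter's gradient to an update step and is meta-trained by unrolling a few inner steps on random quadratic tasks. The architecture, unroll length, and task family are illustrative assumptions:

```python
# Learned coordinate-wise optimizer, meta-trained by unrolled inner steps.
import torch
import torch.nn as nn

step_net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
meta_opt = torch.optim.Adam(step_net.parameters(), lr=1e-3)

for it in range(2000):
    A = torch.randn(8, 4); b = torch.randn(8)        # a random quadratic task
    theta = torch.zeros(4, requires_grad=True)
    meta_loss = 0.0
    for t in range(5):                               # unrolled inner loop
        loss = ((A @ theta - b) ** 2).mean()
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        # Each coordinate's gradient is mapped to an update by step_net.
        theta = theta + step_net(grad[:, None]).squeeze(1)
        meta_loss = meta_loss + loss                 # sum over the unroll
    meta_opt.zero_grad(); meta_loss.backward(); meta_opt.step()
```

Summing the inner loss over the unroll, rather than taking only its final value, also rewards fast learning – a design choice revisited in Sec. 4.3.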


[Fig. 1 (tree diagram, not reproduced in this text version): the meta-learning landscape organized by meta-optimizer (gradient, reinforcement learning, evolution); meta-representation (parameter initialization, optimizer, black-box model/embeddings, hyperparameters, instance weights, curriculum, dataset/environment, loss/reward, architecture, modules/attention, augmentation/noise, exploration policy, annotation policy); meta-objective (many/few-shot, multi/single-task, online/offline, fast vs asymptotic performance); and applications (few-shot learning, fast learning, continual learning, compression, exploration, Bayesian meta-learning, unsupervised meta-learning, active learning, label noise, adversarial defense, domain generalization, architecture search).]
Fig. 1. Overview of the meta-learning landscape including algorithm design (meta-optimizer, meta-representation, meta-objective), and applications.

Feed-Forward Models (FFMs; aka Black-Box, Amortized). Another family of models trains learners ω that provide a feed-forward mapping directly from the support set to the parameters required to classify test instances, i.e., θ = g_ω(D^train) – rather than relying on a gradient-based iterative optimization of θ. These correspond to black-box model-based learning in the conventional taxonomy (Sec. 3.1) and span from classic [106] to recent approaches such as CNAPs [107] that provide strong performance on challenging cross-domain few-shot benchmarks [108].

These methods have connections to Hypernetworks [109], which generate the weights of another neural network conditioned on some embedding – and are often used for compression or multi-task learning. Here ω is the hypernetwork and it synthesises θ given the source dataset in a feed-forward pass [97], [110]. Embedding the support set is often achieved by recurrent networks [50], [111], [112], convolution [38], or set embeddings [44], [107]. Research here often studies architectures for parameterizing the classifier by the task-embedding network: (i) Which parameters should be globally shared across all tasks, vs synthesized per task by the hypernetwork (e.g., share the feature extractor and synthesize the classifier [80], [113]), and (ii) How to parameterize the hypernetwork so as to limit the number of parameters required in ω (e.g., via synthesizing only lightweight adapter layers in the feature extractor [107], or class-wise classifier weight synthesis [44]).

Some FFMs can also be understood elegantly in terms of amortized inference in probabilistic models [44], [106], making predictions for test data x as:

$$q_{\omega}(y|x, \mathcal{D}^{tr}) = \int p(y|x, \theta)\, q_{\omega}(\theta|\mathcal{D}^{tr})\, d\theta \qquad (8)$$

where the meta-representation ω is a network q_ω(·) that approximates the intractable Bayesian inference for parameters θ that solve the task with training data D^tr, and the integral may be computed exactly [106], or approximated by sampling [44] or point estimate [107]. The model ω is then trained to minimise validation loss over a distribution of training tasks, cf. Eq. 7.

Finally, memory-augmented neural networks, with the ability to remember old data and assimilate new data quickly, typically fall in the FFM category as well [83], [84].

Embedding Functions (Metric Learning). Here the meta-optimization process learns an embedding network ω that transforms raw inputs into a representation suitable for recognition by simple similarity comparison between query and support instances [22], [80], [87], [113] (e.g., with cosine similarity or euclidean distance). These methods are classified as metric learning in the conventional taxonomy (Section 3.1), but can also be seen as a special case of the feed-forward black-box models above. This can easily be seen for methods that produce logits based on the inner product of the embeddings of support and query images x_s and x_q, namely g_ω^T(x_q) g_ω(x_s) [80], [113]. Here the support image generates 'weights' to interpret the query example, making it a special case of a FFM where the 'hypernetwork' generates a linear classifier for the query set. Vanilla methods in this family have been further enhanced by making the embedding task-conditional [98], [114], learning a more elaborate comparison metric [88], [89], or combining with gradient-based meta-learning to train other hyper-parameters such as stochastic regularizers [115].
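A minimal sketch in the spirit of prototypical networks [22], assuming toy 2-D Gaussian classes; the task generator, episode shape, and embedding size are illustrative:

```python
# Metric-based few-shot classification: meta-train an embedding g_omega so
# that queries are classified by distance to class prototypes.
import torch
import torch.nn as nn
import torch.nn.functional as F

g = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 16))  # g_omega
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

def sample_task(n_way=3, k_shot=5, q=5):
    """Toy episode: n_way Gaussian classes, k_shot support + q query each."""
    centers = 3 * torch.randn(n_way, 2)
    xs = centers[:, None, :] + torch.randn(n_way, k_shot, 2)   # support
    xq = centers[:, None, :] + torch.randn(n_way, q, 2)        # query
    yq = torch.arange(n_way)[:, None].expand(n_way, q).reshape(-1)
    return xs, xq.reshape(-1, 2), yq

for it in range(2000):
    xs, xq, yq = sample_task()
    protos = g(xs).mean(dim=1)                # one prototype per class
    logits = -torch.cdist(g(xq), protos)      # negative distance as logit
    loss = F.cross_entropy(logits, yq)        # outer/meta objective
    opt.zero_grad(); loss.backward(); opt.step()
```

The inner "learning" here is entirely non-parametric (computing prototypes and distances); only the embedding ω is optimized, and only in the outer loop.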


Losses and Auxiliary Tasks. Analogously to the meta-learning approach to optimizer design, these aim to learn the inner task-loss L^task_ω for the base model (in contrast to L^meta, which is fixed). Loss-learning approaches typically define a function that inputs relevant quantities (e.g. predictions, labels, or model parameters) and outputs a scalar to be treated as a loss by the inner (task) optimizer. This can lead to a learned loss that is easier to optimize (e.g. fewer local minima) than commonly used ones [24], [116], [117], provides faster learning with improved generalization [42], [118]–[120], or a loss whose minima correspond to a model more robust to domain shift [41]. Loss learning methods have also been used to learn from unlabeled instances [98], [121], or to learn L^task_ω as a differentiable approximation to a true non-differentiable L^meta of interest such as area under the precision-recall curve [122].

Loss learning also arises in generalizations of self-supervised [123] or auxiliary task [124] learning. In these problems, unsupervised predictive tasks (such as colourising pixels in vision [123], or simply changing pixels in RL [124]) are defined and optimized with the aim of improving the representation for the main task. In this case the best auxiliary task (loss) to use can be hard to predict in advance, so meta-learning can be used to select among several auxiliary losses according to their impact on improving main task learning; i.e., ω is a per-auxiliary-task weight [67]. More generally, one can meta-learn an auxiliary task generator that annotates examples with auxiliary labels [125].

Architectures. Architecture discovery has always been an important area in neural networks [37], [126], and one that is not amenable to simple exhaustive search. Meta-learning can be used to automate this very expensive process by learning architectures. Early attempts used evolutionary algorithms to learn the topology of LSTM cells [127], while later approaches leveraged RL to generate descriptions for good CNN architectures [27]. Evolutionary Algorithms [26] can learn blocks within architectures modelled as graphs, which mutate by editing their graph. Gradient-based architecture representations have also been visited in the form of DARTS [21], where the forward pass during training consists in a softmax across the outputs of all possible layers in a given block, which are weighted by coefficients to be meta-learned (i.e. ω). During meta-test, the architecture is discretized by only keeping the layers corresponding to the highest coefficients. Recent efforts to improve DARTS have focused on more efficient differentiable approximations [128], robustifying the discretization step [129], learning easy to adapt initializations [130], or architecture priors [131]. See Section 5.3 for more details.

Attention Modules have been used as comparators in metric-based meta-learners [132], to prevent catastrophic forgetting in few-shot continual learning [133], and to summarize the distribution of text classification tasks [134].

Modules. Modular meta-learning [135], [136] assumes that the task agnostic knowledge ω defines a set of modules, which are re-composed in a task specific manner defined by θ in order to solve each encountered task. These strategies can be seen as meta-learning generalizations of the typical structural approaches to knowledge sharing that are well studied in multi-task and transfer learning [65], [137], and may ultimately underpin compositional learning [138].

Hyper-parameters. Here ω represents hyperparameters of the base learner such as regularization strength [20], [68], per-parameter regularization [92], task-relatedness in multi-task learning [66], or sparsity strength in data cleansing [66]. Hyperparameters such as step size [68], [76], [77] can be seen as part of the optimizer, leading to an overlap between hyper-parameter and optimizer learning categories.

Data Augmentation and Noise. In supervised learning it is common to improve generalization by synthesizing more training data through label-preserving transformations on the existing data [13]. The data augmentation operation is wrapped up in the inner problem (Eq. 6), and is conventionally hand-designed. However, when ω defines the data augmentation strategy, it can be learned by the outer optimization in Eq. 5 in order to maximize validation performance [139]. Since augmentation operations are often non-differentiable, this requires reinforcement learning [139], discrete gradient-estimators [140], or evolutionary [141] methods. Recent attempts use meta-gradients to learn mixing proportions in mixup-based augmentation [142]. For stochastic neural networks that exploit noise internally [115], ω can define a learnable noise distribution.

Minibatch Selection, Instance Weights, and Curriculum Learning. When the base algorithm is minibatch-based stochastic gradient descent, a design parameter of the learning strategy is the batch selection process. Various hand-designed methods [143] exist to improve on randomly-sampled minibatches. Meta-learning approaches can define ω as an instance selection probability [144] or a neural network that picks instances [145] for inclusion in a minibatch. Related to mini-batch selection policies are methods that learn or infer per-instance loss weights for each sample in the training set [146], [147], for example defining a base loss as L = Σ_i ω_i ℓ(f(x_i), y_i). This can be used to learn under label-noise by discounting noisy samples [146], [147], to discount outliers [66], or to correct for class imbalance [146].

More generally, the curriculum [148] refers to sequences of data or concepts to learn that produce better performance than learning items in a random order, for instance by focusing on instances of the right difficulty while rejecting too hard or too easy (already learned) instances. Instead of defining a curriculum by hand [149], meta-learning can automate the process and select examples of the right difficulty by defining a teaching policy as the meta-knowledge and training it to optimize the student's progress [145], [150].

Datasets, Labels and Environments. Another meta-representation is the support dataset itself. This departs from our initial formalization of meta-learning, which considers the source datasets to be fixed (Section 2.1, Eqs. 2-3). However, it can be easily understood in the bilevel view of Eqs. 5-6. If the validation set in the upper optimization is real and fixed, and the train set in the lower optimization is parameterized by ω, the training dataset can be tuned by meta-learning to optimize validation performance.

In dataset distillation [151]–[153], the support images themselves are learned such that a few steps on them allow good generalization on real query images (see the sketch at the end of this subsection). This can be used to summarize large datasets into a handful of images, which is useful for replay in continual learning where streaming datasets cannot be stored.

Rather than learning input images x for fixed labels y, one can also learn the input labels y for fixed images x. This can be used in distilling core sets [154] as in dataset distillation; or in semi-supervised learning, for example to directly learn the unlabeled set's labels to optimize validation set performance [155], [156].

In the case of sim2real learning [157] in computer vision or reinforcement learning, one uses an environment simulator to generate data for training. In this case, as detailed in Section 5.10, one can also train the graphics engine [158] or simulator [159] so as to optimize the real-data (validation) performance of the downstream model after training on data generated by that environment simulator.

Discussion: Transductive Representations and Methods. Most of the representations ω discussed above are parameter vectors of functions that process or generate data. However, a few representations mentioned are transductive in the sense that the ω literally corresponds to data points [151], labels [155], or per-instance weights [66], [147]. Therefore the number of parameters in ω to meta-learn scales with the size of the dataset. While the success of these methods is a testament to the capabilities of contemporary meta-learning [153], this property may ultimately limit their scalability. Distinct from a transductive representation are methods that are transductive in the sense that they operate on the query instances as well as support instances [98], [125].

Discussion: Interpretable Symbolic Representations. A cross-cutting distinction that can be made across many of the meta-representations discussed above is between uninterpretable (sub-symbolic) and human interpretable (symbolic) representations. Sub-symbolic representations, such as when ω parameterizes a neural network [15], are more common and make up the majority of studies cited above. However, meta-learning with symbolic representations is also possible, where ω represents human readable symbolic functions such as optimization program code [90]. Rather than neural loss functions [41], one can train symbolic losses ω that are defined by an expression analogous to cross-entropy [119]. One can also meta-learn new symbolic activations [160] that outperform standards such as ReLU. As these meta-representations are non-smooth, the meta-objective is non-differentiable and is harder to optimize (see Section 4.2). So the upper optimization for ω typically uses RL [90] or evolutionary algorithms [119]. However, symbolic representations may have an advantage [90], [119], [160] in their ability to generalize across task families, i.e., to span wider distributions p(T) with a single ω during meta-training, or to have the learned ω generalize to an out-of-distribution task during meta-testing (see Section 6).
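To make the data-valued, transductive representations above concrete, here is a minimal dataset-distillation sketch in which the synthetic support set is the meta-representation ω; the toy linear model and single differentiable inner step are illustrative simplifications of [151]–[153]:

```python
# Dataset distillation: learn a tiny synthetic support set such that one
# inner gradient step on it yields low loss on real data.
import torch

w_true = torch.randn(2)
x_real = torch.randn(256, 2)
y_real = x_real @ w_true + 0.1 * torch.randn(256)

x_syn = torch.randn(4, 2, requires_grad=True)   # omega: learned inputs
y_syn = torch.randn(4, requires_grad=True)      # omega: learned labels
meta_opt = torch.optim.Adam([x_syn, y_syn], lr=1e-2)

for it in range(2000):
    theta = torch.zeros(2, requires_grad=True)
    inner = ((x_syn @ theta - y_syn) ** 2).mean()         # train on synthetic set
    g, = torch.autograd.grad(inner, theta, create_graph=True)
    theta_new = theta - 0.5 * g                           # differentiable step
    outer = ((x_real @ theta_new - y_real) ** 2).mean()   # real-data validation
    meta_opt.zero_grad(); outer.backward(); meta_opt.step()
```

Note that the number of meta-parameters here grows with the size of the synthetic set – the scalability concern raised in the transductive discussion above.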


Discussion: Amortization. One way to relate some of the representations discussed is in terms of the degree of learning amortization entailed [44]. That is, how much task-specific optimization is performed during meta-testing vs how much learning is amortized during meta-training. Training from scratch, or conventional fine-tuning [56], performs full task-specific optimization at meta-testing, with no amortization. MAML [19] provides limited amortization by fitting an initial condition, to enable learning a new task by few-step fine-tuning. Pure FFMs [22], [87], [107] are fully amortized, with no task-specific optimization, and thus enable the fastest learning of new tasks. Meanwhile some hybrid approaches [97], [98], [108], [161] implement semi-amortized learning by drawing on both feed-forward and optimization-based meta-learning in a single framework.

4.2 Meta-Optimizer

Given a choice of which facet of the learning strategy to optimize, the next axis of meta-learner design is the actual outer (meta) optimization strategy to use for training ω.

Gradient. A large family of methods use gradient descent on the meta parameters ω [14], [19], [41], [66]. This requires computing derivatives dL^meta/dω of the outer objective, which are typically connected via the chain rule to the model parameter θ: dL^meta/dω = (dL^meta/dθ)(dθ/dω). These methods are potentially the most efficient as they exploit analytical gradients of ω. However, key challenges include: (i) Efficiently differentiating through many steps of inner optimization, for example through careful design of differentiation algorithms [20], [68], [189] and implicit differentiation [153], [162], [190], and dealing tractably with the required second-order gradients [191]. (ii) Reducing the inevitable gradient degradation problems whose severity increases with the number of inner loop optimization steps. (iii) Calculating gradients when the base learner, ω, or L^task include discrete or other non-differentiable operations.

Reinforcement Learning. When the base learner includes non-differentiable steps [139], or the meta-objective L^meta is itself non-differentiable [122], many methods [23] resort to RL to optimize the outer objective Eq. 5. This estimates the gradient ∇_ω L^meta, typically using the policy gradient theorem. However, alleviating the requirement for differentiability in this way is typically extremely costly. High-variance policy-gradient estimates for ∇_ω L^meta mean that many outer-level optimization steps are required to converge, and each of these steps is itself costly due to wrapping task-model optimization within it.

Evolution. Another approach for optimizing the meta-objective is evolutionary algorithms (EAs) [17], [126], [192]. Many evolutionary algorithms have strong connections to reinforcement learning algorithms [193]. However, their performance does not depend on the length and reward sparsity of the inner optimization as for RL.

EAs are attractive for several reasons [192]: (i) They can optimize any base model and meta-objective with no differentiability constraint. (ii) Not relying on backpropagation avoids both gradient degradation issues and the cost of high-order gradient computation of conventional gradient-based methods. (iii) They are highly parallelizable for scalability. (iv) By maintaining a diverse population of solutions, they can avoid local minima that plague gradient-based methods [126]. However, they have a number of disadvantages: (i) The population size required increases rapidly with the number of parameters to learn. (ii) They can be sensitive to the mutation strategy and may require careful hyperparameter optimization. (iii) Their fitting ability is generally inferior to gradient-based methods, especially for large models such as CNNs.

EAs are relatively more commonly applied in RL applications [24], [168] (where models are typically smaller, and inner optimizations are long and non-differentiable). However, they have also been applied to learn learning rules [194], optimizers [195], architectures [26], [126] and data augmentation strategies [141] in supervised learning. They are also particularly important in learning human interpretable symbolic meta-representations [119].

Discussion. These three optimizers are also all used in conventional learning. However, meta-learning comparatively more often resorts to RL and Evolution, e.g., as L^meta is often non-differentiable with respect to the representation ω.
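A minimal sketch of evolution as the meta-optimizer, requiring only meta-objective evaluations and no meta-gradients; the mutate-and-select loop, population size, and toy meta-objective (validation loss as a function of a log step size ω) are illustrative assumptions:

```python
# Simple evolutionary search over a scalar meta-parameter omega.
import numpy as np

rng = np.random.default_rng(0)

def meta_objective(omega, n_tasks=8):
    """Average validation loss after a short inner SGD run with step size
    exp(omega); evaluation only, no derivatives w.r.t. omega needed."""
    total = 0.0
    for _ in range(n_tasks):
        w = rng.normal(size=2)
        x = rng.normal(size=(32, 2)); y = x @ w
        theta = np.zeros(2)
        for _ in range(5):   # inner SGD on the first 16 (train) samples
            theta -= np.exp(omega) * 2 * x[:16].T @ (x[:16] @ theta - y[:16]) / 16
        total += np.mean((x[16:] @ theta - y[16:]) ** 2)
    return total / n_tasks

omega = -4.0                                        # initial log step size
for gen in range(50):
    population = omega + 0.3 * rng.normal(size=16)  # mutate current solution
    fitness = [meta_objective(o) for o in population]
    omega = population[int(np.argmin(fitness))]     # select the fittest
print("learned inner step size:", np.exp(omega))
```

Because the loop only ever evaluates the meta-objective, the same pattern applies unchanged when ω is discrete or the inner learner is non-differentiable.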


Meta-Representation        | Gradient                                                        | RL                                          | Evolution
---------------------------|-----------------------------------------------------------------|---------------------------------------------|----------------------
Initial Condition          | MAML [19], [162]; MetaOptNet [163]; [76], [85], [99], [164]     | MetaQ [165]; [166], [167]; [19], [61], [62] | ES-MAML [168]; [169]
Optimizer                  | GD2 [15]; MetaLSTM [14]; [91]; [16], [76], [103], [104], [170]  | PSD [78]; [90]                              | –
Hyperparameters            | HyperRep [20]; HyperOpt [66]; LHML [68]                         | MetaTrace [171]; [172]                      | [169]; [173]
Feed-Forward Model         | SNAIL [38]; CNAP [107]; [44], [83], [174], [175]; [176]–[178]   | PEARL [110]; [23], [112]                    | –
Metric                     | MatchingNets [87]; ProtoNets [22]; RelationNets [88]; [89]      | –                                           | –
Loss/Reward                | MetaReg [92]; [41]; [120]                                       | MetaCritic [117]; [122]; [179]; [120]       | EPG [24]; [119]; [173]
Architecture               | DARTS [21]; [130]                                               | [27]                                        | [26]
Exploration Policy         | MetaCuriosity [25]; [180]–[184]                                 | –                                           | –
Dataset/Environment        | Data Distillation [151]; [152]; [155]                           | Learn to Sim [158]                          | [159]
Instance Weights           | MetaWeightNet [146]; MentorNet [150]; [147]                     | –                                           | –
Data Augmentation/Noise    | MetaDropout [185]; [140]; [115]                                 | AutoAugment [139]                           | [141]
Modules                    | [135], [136]                                                    | –                                           | –
Annotation Policy          | [186], [187]                                                    | [188]                                       | –

TABLE 1
Research papers according to our taxonomy. We use color to indicate salient meta-objective or application goal. We focus on the main goal of each paper for simplicity. The color code is: sample efficiency (red), learning speed (green), asymptotic performance (purple), cross-domain (blue).

4.3 Meta-Objective and Episode Design

The final component is to define the meta-learning goal through the choice of meta-objective L^meta (Eq. 5), task distribution p(T), and data-flow between the inner loop episodes and outer optimizations. Most methods define a meta-objective using a performance metric computed on a validation set, after updating the task model with ω. This is in line with classic validation set approaches to hyperparameter and model selection. However, within this framework, there are several design options:

Many vs Few-Shot Episode Design. According to whether the goal is improving few- or many-shot performance, inner loop learning episodes may be defined with many [66], [90], [91] or few [14], [19] examples per task.

Fast Adaptation vs Asymptotic Performance. When validation loss is computed at the end of the inner learning episode, meta-training encourages better final performance on the base task. When it is computed as the sum of the validation loss after each inner optimization step, then meta-training also encourages faster learning in the base task [77], [90], [91]. Most RL applications also use this latter setting.

Multi vs Single-Task. When the goal is to tune the learner to better solve any task drawn from a given family, then inner loop learning episodes correspond to a randomly drawn task from p(T) [19], [22], [41]. When the goal is to tune the learner to simply solve one specific task better, then the inner loop learning episodes all draw data from the same underlying task [15], [66], [172], [179], [180], [196].

It is worth noting that these two meta-objectives tend to have different assumptions and value propositions. The multi-task objective obviously requires a task family p(T) to work with, which single-task does not. Meanwhile for multi-task, the data and compute cost of meta-training can be amortized by potentially boosting the performance of multiple target tasks during meta-test; but single-task – without the new tasks for amortization – needs to improve the final solution or asymptotic performance of the current task, or meta-learn fast enough to be online.

Online vs Offline. While the classic meta-learning pipeline defines the meta-optimization as an outer-loop of the inner base learner [15], [19], some studies have attempted to perform meta-optimization online within a single base learning episode [41], [179], [196], [197]. In this case the base model θ and learner ω co-evolve during a single episode. Since there is now no set of source tasks to amortize over, meta-learning needs to be fast compared to base model learning in order to benefit sample or compute efficiency.

Other Episode Design Factors. Other operators can be inserted into the episode generation pipeline to customize meta-learning for particular applications. For example, one can simulate domain-shift between training and validation to meta-optimize for good performance under domain-shift [41], [58], [92]; simulate network compression such as quantization [198] between training and validation to meta-optimize for network compressibility; provide noisy labels during meta-training to optimize for label-noise robustness [93]; or generate an adversarial validation set to meta-optimize for adversarial defense [94]. These opportunities are explored in more detail in the following section.

5 APPLICATIONS

In this section we briefly review the ways in which meta-learning has been exploited in computer vision, reinforcement learning, architecture search, and so on.

5.1 Computer Vision and Graphics

Computer vision is a major consumer domain of meta-learning techniques, notably due to its impact on few-shot learning, which holds promise to deal with the challenge posed by the long tail of concepts to recognise in vision.

5.1.1 Few-Shot Learning Methods

Few-shot learning (FSL) is extremely challenging, especially for large neural networks [1], [13], where data volume is often the dominant factor in performance [199], and training large models with small datasets leads to overfitting or non-convergence. Meta-learning-based approaches are increasingly able to train powerful CNNs on small datasets in many vision problems. We provide a non-exhaustive representative summary as follows.

Classification. The most common application of meta-learning is few-shot multi-class image recognition, where the inner and outer loss functions are typically the cross entropy over training and validation data respectively [14], [22], [74], [76], [77], [87], [89], [97], [98], [101], [104], [200]–[203]. Optimizer-centric [19], black-box [38], [80] and metric learning [87]–[89] models have all been considered. This line of work has led to a steady improvement in performance compared to early methods [19], [86], [87]. However, performance is still far behind that of fully supervised methods, so there is more work to be done. Current research issues include improving cross-domain generalization [115], recognition within the joint label space defined by meta-train and meta-test classes [81], and incremental addition of new few-shot classes [133], [174].

Object Detection. Building on progress in few-shot classification, few-shot object detection [174], [204] has been demonstrated, often using feed-forward hypernetwork-based approaches to embed support set images and synthesize final layer classification weights in the base model.

Landmark Prediction aims to locate a skeleton of key points within an image, such as the joints of a human or robot. This is typically formulated as an image-conditional regression. For example, a MAML-based model was shown to work for human pose estimation [205], modular meta-learning was successfully applied to robotics [135], while a hypernetwork-based model was applied to few-shot clothes fitting for novel fashion items [174].

Few-Shot Object Segmentation is important due to the cost of obtaining pixel-wise labeled images. Hypernetwork-

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
10

based meta-learners have been applied in the one-shot try and address these issues by meta-training for domain-
regime [206], and performance was later improved by shift robustness as well as sample efficiency [115]. Gener-
adapting prototypical networks [207]. Other models tackle alization issues also arise in applying models to data from
cases where segmentation has low density [208]. under-represented countries [222].
Actions Beyond still images, meta-learning has been ap-
plied to few-shot action recognition [209], [210] and pre- 5.2 Meta Reinforcement Learning and Robotics
diction [211] using both FFM [209] and optimization-based
Reinforcement learning is typically concerned with learn-
[210], [211] approaches.
ing control policies that enable an agent to obtain high
Image and Video Generation In [44] an amortized proba- reward after performing a sequential action task within
bilistic meta-learner is used to generate multiple views of an an environment. RL typically suffers from extreme sample
object from just a single image, generative query networks inefficiency due to sparse rewards, the need for exploration,
[212] render scenes from novel views, and talking faces are and the high-variance [223] of optimization algorithms.
generated from little data by learning the initialization of an However, applications often naturally entail task families
adversarial model for quick adaptation [213]. In video do- which meta-learning can exploit – for example locomoting-
main, [214] meta-learns a weight generator that synthesizes to or reaching-to different positions [184], navigating within
videos given few example images as cues. Meta-learning has different environments [38], traversing different terrains
also accelerated style transfer by learning a FFM to solve the [63], driving different cars [183], competing with different
corresponding optimization problem [215]. competitor agents [61], and dealing with different handicaps
Generative Models and Density Estimation Density esti- such as failures in individual robot limbs [63]. Thus RL
mators capable of generating images typically require many provides a fertile application area in which meta-learning on
parameters, and as such overfit in the few-shot regime. task distributions has had significant successes in improving
Gradient-based meta-learning of PixelCNN generators was sample efficiency over standard RL algorithms. One can
shown to enable their few-shot learning [216]. intuitively understand the efficacy of these methods. For
instance meta-knowledge of a maze layout is transferable
5.1.2 Few-Shot Learning Benchmarks for all tasks that require navigating within the maze.
Progress in AI and machine learning is often measured, and Problem Setup Meta-RL uses essentially the same prob-
spurred, by well designed benchmarks [217]. Conventional lem setup outlined in Sec. 2.1, but in the setting of sequential
ML benchmarks define a task and dataset for which a model decision making within Markov decision processes (MDPs).
should generalize from seen to unseen instances. In meta- A conventional RL problem is defined by an MDP M, in
learning, benchmark design is more complex, since we are which policy πθ chooses actions at based on states st and
often dealing with a learner that should generalize from obtains rewards rt . The policy is trained to maximize the
seen to unseen tasks. Benchmark design thus needs to define expected cumulative return Rτ =
PT t
t γ rt of an episode
families of tasks from which meta-training and meta-testing τ ∼ M, τ = (s0 , a0 , . . . , sT , aT , rT ) after T timesteps,
tasks can be drawn. Established FSL benchmarks include θ∗ = arg maxθ Eτ ∼M,πθ [Rτ ]. Meta-RL, usually assumes a
miniImageNet [14], [87], Tiered-ImageNet [218], SlimageNet distribution over MDPs p(M), analogous to the distribution
[219], Omniglot [87] and Meta-Dataset [108]. over tasks in Sec. 2.1. Some aspect of the learning process for
Dataset Diversity, Bias and Generalization The standard policy θ is then paramaterized by meta-knowledge ω , for
benchmarks provide tasks for training and evaluation, but example the initial condition [19], or surrogate reward/loss
suffer from a lack of diversity (narrow p(T )); performance function [24], [117]. Then the goal of Meta-RL is to train the
on these benchmarks are not reflective of performance on meta-representation ω so as to maximize the expected return
real-world few shot tasks. For example, switching between of learning within the MDP distribution p(M):
kinds of animal photos in miniImageNet is not a strong
test of generalization. Ideally bechmarks would span more ω ∗ = arg max E E [Rτ ] (9)
ω M∼p(M) τ ∼M,πθ∗
diverse categories and types of images (satellite, medical,
agricultural, underwater, etc), and even test domain-shifts This is the RL version of the meta-training problem in Eq. 2,
between meta-train and meta-test tasks. and solutions are similarly based on bi-level optimization
There is work still to be done here as, even in the many- (e.g., [19], cf: Eqs 5-6), or feed-forward approaches (e.g., [38],
shot setting, fitting a deep model to a very wide distribution cf: Eq. 7). Finally, once trained, the meta-learner ω ∗ can also
of data is itself non-trivial [220], as is generalizing to out-of- be deployed to rapidly learn in a new MDP M ∼ p(M)
sample data [41], [92]. Similarly, the performance of meta- analogously to the meta-test step in Eq. 4.
learners often drops drastically when introducing a domain
shift between the source and target task distributions [113]. 5.2.1 Methods
This motivates the recent Meta-Dataset [108] and CVPR Several previously discussed meta-representations have
cross-domain few-shot challenge [221]. Meta-Dataset aggre- been explored in RL including learning the initial conditions
gates a number of individual recognition benchmarks to [19], [169], hyperparameters [169], [173], step directions [76]
provide a wider distribution of tasks p(T ) to evaluate the and step sizes [171]. These enable gradient-based learning of
ability to fit a wide task distribution and generalize across a neural policy with fewer environmental interactions and
domain-shift. Meanwhile, [221] challenges methods to gen- training fast convolutional [38] or recurrent [23], [112] black-
eralize from the everyday ImageNet images to medical, box models to synthesize a policy by embedding the envi-
satellite and agricultural images. Recent work has begun to roment experience. Recent work has developed improved

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
11

meta-optimization algorithms [166]–[168] for these tasks, pick/place may be subroutines for a room cleaning robot.
and provided theoretical guarantees for meta-RL [224]. However, developing meta-learners with effective composi-
Exploration A meta-representation rather unique to RL tional knowledge transfer is an open question, with modular
is the exploration policy. RL is complicated by the fact that meta-learning [136] being an option. Unsupervised meta-
the data distribution is not fixed, but varies according to RL variants aim to perform meta-training without manu-
the agent’s actions. Furthermore, sparse rewards may mean ally specified rewards [231], or adapt at meta-testing to a
that an agent must take many actions before achieving a changed environment but without new rewards [232]. Con-
reward that can be used to guide learning. As such, how tinual adaptation provides an agent with the ability to adapt
to explore and acquire data for learning is a crucial factor to a sequence of tasks within one meta-test episode [61]–[63],
in any RL algorithm. Traditionally exploration is based similar to continual learning. Finally, meta-learning has also
on sampling random actions [225], or hand-crafted heuris- been applied to imitation [111] and inverse RL [233].
tics [226]. Several meta-RL studies have instead explicitly
treated exploration strategy or curiosity function as meta- 5.2.2 Benchmarks
knowledge ω ; and modeled their acquisition as a meta- Meta-learning benchmarks for RL typically define a family
learning problem [25], [182], [183], [227] – leading to sample to solve in order to train and evaluate an agent that learns
efficiency improvements by ‘learning how to explore’. how to learn. These can be tasks (reward functions) to
Optimization RL is a difficult optimization problem achieve, or domains (distinct environments or MDPs).
where the learned policy is usually far from optimal, even Discrete Control RL An early meta-RL benchmark for
on ‘training set’ episodes. This means that, in contrast to vision-actuated control is the arcade learning environment
meta-SL, meta-RL methods are more commonly deployed (ALE) [234], which defines a set of classic Atari games split
to increase asymptotic performance [24], [173], [179] as into meta-training and meta-testing. The protocol here is to
well as sample-efficiency, and can lead to significantly bet- evaluate return after a fixed number of timesteps in the
ter solutions overall. The meta-objective of many meta-RL meta-test environment. A challenge is the great diversity
frameworks is the net return of the agent over a full episode, (wide p(T )) across games, which makes successful meta-
and thus both sample efficient and asymptotically perfor- training hard and leads to limited benefit from knowledge
mant learning are rewarded. Optimization difficulty also transfer [234]. Another benchmark [235] is based on splitting
means that there has been relatively more work on learning Sonic-hedgehog levels into meta-train/meta-test. The task
losses (or rewards) [117], [120], [179], [228] which an RL distribution here is narrower and beneficial meta-learning is
agent should optimize instead of – or in addition to – the relatively easier to achieve. Cobbe et al. [236] proposed two
conventional sparse reward objective. Such learned losses purpose designed video games for benchmarking meta-RL.
may be easier to optimize (denser, smoother) compared to CoinRun game [236] provides 232 procedurally generated
the true target [24], [228]. This also links to exploration as levels of varying difficulty and visual appearance. They
reward learning and can be considered to instantiate meta- show that some 10, 000 levels of meta-train experience are
learning of learning intrinsic motivation [180]. required to generalize reliably to new levels. CoinRun is
primarily designed to test direct generalization rather than
Online meta-RL A significant fraction of meta-RL stud-
fast adaptation, and can be seen as providing a distribution
ies addressed the single-task setting, where the meta-
over MDP environments to test generalization rather than
knowledge such as loss [117], [179], reward [173], [180], hy-
over tasks to test adaptation. To better test fast learning in a
perparameters [171], [172], or exploration strategy [181] are
wider task distribution, ProcGen [236] provides a set of 16
trained online together with the base policy while learning a
procedurally generated games including CoinRun.
single task. These methods thus do not require task families
and provide a direct improvement to their respective base Continuous Control RL While common benchmarks such
learners’ performance. as gym [237] have greatly benefited RL research, there is
less consensus on meta-RL benchmarks, making existing
On- vs Off-Policy meta-RL A major dichotomy in con- work hard to compare. Most continuous control meta-RL
ventional RL is between on-policy and off-policy learning studies have proposed home-brewed benchmarks that are
such as PPO [225] vs SAC [229]. Off-policy methods are low dimensional parametric variants of particular tasks such
usually significantly more sample efficient. However, off- as navigating to various locations or velocities [19], [110], or
policy methods have been harder to extend to meta-RL, traversing different terrains [63]. Several multi-MDP bench-
leading to more meta-RL methods being built on on-policy marks [238], [239] have recently been proposed but these
RL methods, thus limiting the absolute performance of primarily test generalization across different environmental
meta-RL. Early work in off-policy meta-RL methods has led perturbations rather than different tasks. The Meta-World
to strong results [110], [117], [165], [228]. Off-policy learning benchmark [240] provides a suite of 50 continuous con-
also improves the efficiency of the meta-train stage [110], trol tasks with state-based actuation, varying from simple
which can be expensive in meta-RL. It also provides new parametric variants such as lever-pulling and door-opening.
opportunities to accelerate meta-testing by replay buffer This benchmark should enable more comparable evaluation,
sample from meta-training [165]. and investigation of generalization within and across task
Other Trends and Challenges [63] is noteworthy in distributions. The meta-world evaluation [240] suggests that
demonstrating successful meta-RL on a real-world physical existing meta-RL methods struggle to generalize over wide
robot. Knowledge transfer in robotics is often best studied task distributions and meta-train/meta-test shifts. This may
compositionally [230]. E.g., walking, navigating and object be due to our meta-RL models being too weak and/or

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
12

benchmarks being too small, in terms of number and cov- over a distribution of tasks, just a single task. The former
erage tasks, for effective learning-to-learn. Another recent case is usually relevant in few-shot applications, especially
benchmark suitable for meta-RL is PHYRE [241] which in optimization based methods. For instance, MAML can
provides a set of 50 vision-based physics task templates be improved by learning a learning rate per layer per step
which can be solved with simple actions but are likely to [77]. The case where we wish to learn hyperparameters
require model-based reasoning to address efficiently. These for a single task is usually more relevant for many-shot
also provide within and cross-template generalization tests. applications [68], [153], where some validation data can
Discussion One complication of vision-actuated meta- be extracted from the training dataset, as discussed in
RL is disentangling visual generalization (as in computer Section 2.1. End-to-end gradient-based meta-learning has
vision) with fast learning of control strategies more gener- already demonstrated promising scalability to millions of
ally. For example CoinRun [236] evaluation showed large parameters (as demonstrated by MAML [19] and Dataset
benefit from standard vision techniques such as batch norm Distillation [151], [153], for example) in contrast to the classic
suggesting that perception is a major bottleneck. approaches (such cross-validation by grid or random [69]
search, or Bayesian Optimization [70]) which are typically
only successful with dozens of hyper-parameters.
5.3 Neural Architecture Search (NAS)
Architecture search [21], [26], [27], [37], [126] can be seen
as a kind of hyperparameter optimization where ω specifies 5.5 Bayesian Meta-learning
the architecture of a neural network. The inner optimiza- Bayesian meta-learning approaches formalize meta-learning
tion trains networks with the specified architecture, and via Bayesian hierarchical modelling, and use Bayesian infer-
the outer optimization searches for architectures with good ence for learning rather than direct optimization of parame-
validation performance. NAS methods have been analysed ters. In the meta-learning context, Bayesian learning is typ-
[37] according to ‘search space’, ‘search strategy’, and ‘per- ically intractable, and so approximations such as stochastic
formance estimation strategy’. These correspond to the hy- variational inference or sampling are used.
pothesis space for ω , the meta-optimization strategy, and the Bayesian meta-learning importantly provides uncer-
meta-objective. NAS is particularly challenging because: (i) tainty measures for the ω parameters, and hence measures
Fully evaluating the inner loop is expensive since it requires of prediction uncertainty which can be important for safety
training a many-shot neural network to completion. This critical applications, exploration in RL, and active learning.
leads to approximations such as sub-sampling the train set, A number of authors have explored Bayesian approaches
early termination of the inner loop, and interleaved descent to meta-learning complex neural network models with
on both ω and θ [21] as in online meta-learning. (ii.) The competitive results. For example, extending variational au-
search space is hard to define, and optimize. This is because toencoders to model task variables explicitly [72]. Neural
most search spaces are broad, and the space of architectures Processes [175] define a feed-forward Bayesian meta-learner
is not trivially differentiable. This leads to reliance on cell- inspired by Gaussian Processes but implemented with neu-
level search [21], [27] constraining the search space, RL [27], ral networks.Deep kernel learning is also an active research
discrete gradient estimators [128] and evolution [26], [126]. area that has been adapted to the meta-learning setting
Topical Issues While NAS itself can be seen as an instance [246], and is often coupled with Gaussian Processes [247]. In
of hyper-parameter or hypothesis-class meta-learning, it can [73] gradient based meta-learning is recast into a hierarchi-
also interact with meta-learning in other forms. Since NAS cal empirical Bayes inference problem (i.e. prior learning),
is costly, a topical issue is whether discovered architectures which models uncertainty in task-specific parameters θ.
can generalize to new problems [242]. Meta-training across Bayesian MAML [248] improves on this model by using
multiple datasets may lead to improved cross-task general- a Bayesian ensemble approach that allows non-Gaussian
ization of architectures [131]. Finally, one can also define posteriors over θ, and later work removes the need for
NAS meta-objectives to train an architecture suitable for costly ensembles [44], [249]. In Probabilistic MAML [95], it
few-shot learning [243]. Similarly to fast-adapting initial is the uncertainty in the metaknowledge ω that is modelled,
condition meta-learning approaches such as MAML [19], while a MAP estimate is used for θ. Increasingly, these
one can train good initial architectures [130] or architecture Bayesian methods are shown to tackle ambiguous tasks,
priors [131] that are easy to adapt towards specific tasks. active learning and RL problems.
Separate from the above, meta-learning has also been
Benchmarks NAS is often evaluated on CIFAR-10, but it
proposed to aid the Bayesian inference process itself, as in
is costly to perform and results are hard to reproduce due
[250] where the authors adapt a Bayesian sampler to provide
to confounding factors such as tuning of hyperparameters
efficient adaptive sampling methods.
[244]. To support reproducible and accessible research, the
NASbenches [245] provide pre-computed performance mea-
sures for a large number of network architectures. 5.6 Unsupervised and Semi-Supervised Meta-Learning
There are several distinct ways in which unsupervised
5.4 Hyper-parameter Optimization learning can interact with meta-learning, depending on
Meta-learning address hyperparameter optimization when whether unsupervised learning in performed in the inner
considering ω to specify hyperparameters, such as regular- loop or outer loop, and during meta-train vs meta-test.
ization strength or learning rate. There are two main set- Unsupervised Learning of a Supervised Learner The
tings: we can learn hyperparameters that improve training aim here is to learn a supervised learning algorithm (e.g.,

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
13

via MAML [19] style initial condition for supervised fine- Online and Adaptive Learning also consider tasks ar-
tuning), but do so without the requirement of a large set riving in a stream, but are concerned with the ability to
of source tasks for meta-training [251]–[253]. To this end, effectively adapt to the current task in the stream, more
synthetic source tasks are constructed without supervision than remembering the old tasks. To this end an online exten-
via clustering or class-preserving data augmentation, and sion of MAML was proposed [96] to perform MAML-style
used to define the meta-objective for meta-training. meta-training online during a task sequence. Meanwhile
Supervised Learning of an Unsupervised Learner This others [61]–[63] consider the setting where meta-training is
family of methods aims to meta-train an unsupervised performed in advance on source tasks, before meta-testing
learner. For example, by training the unsupervised algo- adaptation capabilities on a sequence of target tasks.
rithm such that it works well for downstream supervised Benchmarks A number of benchmarks for continual
learning tasks. One can train unsupervised learning rules learning work quite well with standard deep learning meth-
[16] or losses [98], [121] such that downstream supervised ods. However, most cannot readily work with meta-learning
learning performance is optimized – after re-using the un- approaches as their their sample generation routines do not
supervised representation for a supervised task [16], or provide a large number of explicit learning sets and an
adapting based on unlabeled data [98], [121]. Alternatively, explicit evaluation sets. Some early steps were made to-
when unsupervised tasks such as clustering exist in a wards defining meta-learning ready continual benchmarks
family, rather than in isolation, then learning-to-learn of in [96], [170], [258], mainly composed of Omniglot and
‘how-to-cluster’ on several source tasks can provide better perturbed versions of MNIST. However, most of those were
performance on new clustering tasks in the family [176]– simply tasks built to demonstrate a method. More explicit
[178], [254], [255]. The methods in this group that make benchmark work can be found in [219], which is built for
use of feed-forward models are often known as amortized meta and non meta-learning approaches alike.
clustering [177], [178], because they amortize the typically
iterative computation of clustering algorithms into the cost 5.8 Domain Adaptation and Domain Generalization
of training a single inference model, which subsequently
performs clustering using a single feed-froward pass. Over- Domain-shift refers to the statistics of data encountered in
all, these methods help to deal with the ill-definedness of deployment being different from those used in training. Nu-
the unsupervised learning problem by transforming it into merous domain adaptation and generalization algorithms
a problem with a clear supervised (meta) objective. have been studied to address this issue in supervised, unsu-
pervised, and semi-supervised settings [57].
Semi-Supervised Learning (SSL) This family of methods
aim to train a base-learner capable of performing well on Domain Generalization Domain generalization aims to
validation data after learning from a mixture of labeled train models with increased robustness to train-test domain
and unlabeled training examples. In the few-shot regime, shift [260], often by exploiting a distribution over training
approaches include SSL extensions of metric-learners [218], domains. Using a validation domain that is shifted with
and meta-learning measures of confidence [256] or instance respect to the training domain [261], different kinds of meta-
weights [156] to ensure that reliably labeled instances are knowledge such as regularizers [92], losses [41], and noise
used in self-training. In the many-shot regime, approaches augmentation [115] can be (meta) learned to maximize the
include direct transductive training of labels [155] , or train- robustness of the learned model to train-test domain-shift.
ing a teacher network to generate labels for a student [257]. Domain Adaptation To improve on conventional domain
adaptation [57], meta-learning can be used to define a meta-
objective that optimizes the performance of a base unsuper-
5.7 Continual, Online and Adaptive Learning vised DA algorithm [58].
Continual Learning refers to the human-like capability Benchmarks Popular benchmarks for DA and DG con-
of learning tasks presented in sequence. Ideally this is done sider image recognition across multiple domains such as
while exploiting forward transfer so new tasks are learned photo/sketch/cartoon. PACS [262] provides a good starter
better given past experience, without forgetting previously benchmark, with Visual Decathlon [41], [220] and Meta-
learned tasks, and without needing to store past data [60]. Dataset [108] providing larger scale alternatives.
Deep Neural Networks struggle to meet these criteria, es-
pecially as they tend to forget information seen in earlier 5.9 Language and Speech
tasks – a phenomenon known as catastrophic forgetting.
Meta-learning can include the requirements of continual Language Modelling Few-shot language modelling in-
learning into a meta-objective, for example by defining a creasingly showcases the versatility of meta-learners. Early
sequence of learning episodes in which the support set matching networks showed impressive performances on
contains one new task, but the query set contains examples one-shot tasks such as filling in missing words [87]. Many
drawn from all tasks seen until now [104], [170]. Various more tasks have since been tackled, including text classifi-
meta-representations can be learned to improve continual cation [134], neural program induction [263] and synthesis
learning performance, such as weight priors [133], gradient [264], English to SQL program synthesis [265], text-based
descent preconditioning matrices [104], or RNN learned relationship graph extractor [266], machine translation [267],
optimizers [170], or feature representations [258]. A related and quickly adapting to new personas in dialogue [268].
idea is meta-training representations to support local editing Speech Recognition Deep learning is now the dominant
updates [259] for improvement without interference. paradigm for state of the art automatic speech recognition

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
14

(ASR). Meta-learning is beginning to be applied to address Communications systems are increasingly impacted by
the many few-shot adaptation problems that arise within deep learning. For example by learning coding systems that
ASR including learning how to train for low-resource lan- surpass hand designed codes for realistic channels [282].
guages [269], cross-accent adaptation [270] and optimizing Few-shot meta-learning can be used to provide rapid adap-
models for individual speakers [271]. tation of codes to changing channel characteristics [283].
Active Learning (AL) methods wrap supervised learning,
5.10 Emerging Topics and define a policy for selective data annotation – typically
in the setting where annotation can be obtained sequentially.
Environment Learning and Sim2Real In Sim2Real we The goal of AL is to find the optimal subset of data to
are interested in training a model in simulation that is annotate so as to maximize performance of downstream su-
able to generalize to the real-world. The classic domain pervised learning with the fewest annotations. AL is a well
randomization approach simulates a wide distribution over studied problem with numerous hand designed algorithms
domains/MDPs, with the aim of training a sufficiently ro- [284]. Meta-learning can map active learning algorithm de-
bust model to succeed in the real world – and has succeeded sign into a learning task by: (i) defining the inner-level
in both vision [272] and RL [157]. Nevertheless tuning the optimization as conventional supervised learning on the
simulation distribution remains a challenge. This leads to annotated dataset so far, (ii) defining ω to be a query policy
a meta-learning setup where the inner-level optimization that selects the best unlabeled datapoints to annotate, (iii),
learns a model in simulation, the outer-level optimization defining the meta-objective as validation performance after
Lmeta evaluates the model’s performance in the real-world, iterative learning and annotation according to the query
and the meta-representation ω corresponds to the param- policy, (iv) performing outer-level optimization to train the
eters of the simulation environment. This paradigm has optimal annotation query policy [186]–[188]. However, if la-
been used in RL [159] as well as vision [158], [273]. In bels are used to train AL algorithms, they need to generalize
this case the source tasks used for meta-train tasks are across tasks to amortize their training cost [188].
not a pre-provided data distribution, but paramaterized by Learning with Label Noise commonly arises when large
omega, Dsource (ω). However, challenges remain in terms of datasets are collected by web scraping or crowd-sourcing.
costly back-propagation through a long graph of inner task While there are many algorithms hand-designed for this sit-
learning steps; as well as minimising the number of real- uation, recent meta-learning methods have addressed label
world Lmeta evaluations in the case of Sim2Real. noise. For example by transductively learning sample-wise
Meta-learning for Social Good Meta-learning lends itself weighs to down-weight noisy samples [146], or learning an
to challenges that arise in applications of AI for social good initial condition robust to noisy label training [93].
such as medical image classification and drug discovery, Adversarial Attacks and Defenses Deep Neural Net-
where data is often scarce. Progress in the medical domain is works can be fooled into misclassifying a data point that
especially relevant given the global shortage of pathologists should be easily recognizable, by adding a carefully crafted
[274]. In [5] an LSTM is combined with a graph neural human-invisible perturbation to the data [285]. Numerous
network to predict molecular behaviour (e.g. toxicity) in the attack and defense methods have been published in recent
one-shot data regime. In [275] MAML is adapted to weakly- years, with defense strategies usually consisting in carefully
supervised breast cancer detection tasks, and the order of hand-designed architectures or training algorithms. Analo-
tasks are selected according to a curriculum. MAML is also gous to the case in domain-shift, one can train the learning
combined with denoising autoencoders to do medical visual algorithm for robustness by defining a meta-loss in terms of
question answering [276], while learning to weigh support performance under adversarial attack [94], [286].
samples [218] is adapted to pixel wise weighting for skin
Recommendation Systems are a mature consumer of ma-
lesion segmentation tasks that have noisy labels [277].
chine learning in the commerce space. However, bootstrap-
Non-Backprop and Biologically Plausible Learners ping recommendations for new users with little historical
Most meta-learning work that uses explicit (non feed- interaction data, or new items for recommendation remains
forward/black-box) optimization for the base model is a challenge known as the cold-start problem. Meta-learning
based on gradient descent by backpropagation. Meta- has applied black-box models to item cold-start [287] and
learning can define the function class of ω so as to lead to gradient-based methods to user cold-start [288].
the discovery of novel learning rules that are unsupervised
[16] or biologically plausible [45], [278], [279], making use of
ideas less commonly used in contemporary deep learning 6 C HALLENGES AND O PEN Q UESTIONS
such as Hebbian updates [278] and neuromodulation [279]. Diverse and multi-modal task distributions The diffi-
Network Compression Contemporary CNNs require culty of fitting a meta-learner to a distribution of tasks
large amounts of memory that may be prohibitive on p(T ) can depend on its width. Many big successes of
embedded devices. Thus network compression in various meta-learning have been within narrow task families, while
forms such as quantization and pruning are topical research learning on diverse task distributions can challenge existing
areas [280]. Meta-learning is beginning to be applied to this methods [108], [220], [240]. This may be partly due to
objective as well, such as training gradient generator meta- conflicting gradients between tasks [289].
networks that allow quantized networks to be trained [198], Many meta-learning frameworks [19] implicitly assume
and weight generator meta-networks that allow quantized that the distribution over tasks p(T ) is uni-modal, and a sin-
networks to be trained with gradient [281]. gle learning strategy ω provides a good solution for them all.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
15

However task distributions are often multi-modal; such as alternating inner and outer steps [21], [41], [197], truncation
medical vs satellite vs everyday images in computer vision, [296], shortcuts [297] or inversion [189] of the inner opti-
or putting pegs in holes vs opening doors [240] in robotics. mization. Long horizon meta-learning can also be achieved
Different tasks within the distribution may require different by learning an initialization that minimizes the gradient
learning strategies, which is hard to achieve with today’s descent trajectory length over task manifolds [298]. Finally,
methods. In vanilla multi-task learning, this phenomenon is another family of approaches accelerate meta-training via
relatively well studied with, e.g., methods that group tasks closed-form solvers in the inner loop [163], [164].
into clusters [290] or subspaces [291]. However this is only Each of these strategies provides different trade-offs
just beginning to be explored in meta-learning [292]. on the following axes: the accuracy at meta-test time, the
Meta-generalization Meta-learning poses a new general- compute and memory cost as the size of ω increases, and
ization challenge across tasks analogous to the challenge the computate and memory cost as the horizon gets longer.
of generalizing across instances in conventional machine Implicit gradients scale to high-dimensional ω ; but only
learning. There are two sub-challenges: (i) The first is gen- provide approximate gradients for it, and require the inner
eralizing from meta-train to novel meta-test tasks drawn task loss to be a function of ω . Forward-mode differentiation
from p(T ). This is exacerbated because the number of tasks is exact and doesn’t have such constraints, but scales poorly
available for meta-training is typically low (much less than with the dimension of ω . Online methods are cheap in both
the number of instances available in conventional supervised memory and compute but suffer from a short-horizon bias
learning), making it difficult to generalize. One failure mode [299] and thus have lower meta-test accuracy. The same
for generalization in few-shot learning has been well studied could be said of truncation methods which are cheap but
under the guise of memorisation [200], which occurs when suffer from truncation bias [300]. Gradient degradation is
each meta-training task can be solved directly without per- also a challenge in the many-shot regime, and solutions
forming any task-specific adaptation based on the support include warp layers [104] or gradient averaging [68].
set. In this case models fail to generalize in meta-testing, In terms of the cost of solving new tasks at the meta-test
and specific regularizers [200] have been proposed to pre- stage, FFMs have a significant advantage over optimization-
vent this kind of meta-overfitting. (ii) The second challenge based meta-learners, which makes them appealing for appli-
is generalizing to meta-test tasks drawn from a different cations involving deployment of learning algorithms on mo-
distribution than the training tasks. This is inevitable in bile devices such as smartphones [6], for example to achieve
many potential practical applications of meta-learning, for personalisation. This is especially so because the embedded
example generalizing few-shot visual learning from every- device versions of contemporary deep learning software
day training images of ImageNet to specialist domains such frameworks typically lack support for backpropagation-
as medical images [221]. From the perspective of a learner, based training, which FFMs do not require.
this is a meta-level generalization of the domain-shift prob-
lem, as observed in supervised learning. Addressing these 7 C ONCLUSION
issues through meta-generalizations of regularization, trans- The field of meta-learning has seen a rapid growth in
fer learning, domain adaptation, and domain generalization interest. This has come with some level of confusion, with
are emerging directions [115]. Furthermore, we have yet regards to how it relates to neighbouring fields, what it
to understand which kinds of meta-representations tend to can be applied to, and how it can be benchmarked. In
generalize better under certain types of domain shifts. this survey we have sought to clarify these issues by
Task families Many existing meta-learning frameworks, thoroughly surveying the area both from a methodological
especially for few-shot learning, require task families for point of view – which we broke down into a taxonomy
meta-training. While this indeed reflects lifelong human of meta-representation, meta-optimizer and meta-objective;
learning, in some applications data for such task families and from an application point of view. We hope that this
may not be available. Unsupervised meta-learning [251]– survey will help newcomers and practitioners to orient
[253] and single-task meta-learning methods [41], [172], themselves to develop and exploit in this growing field, as
[179], [180], [196], could help to alleviate this requirement; as well as highlight opportunities for future research.
can improvements in meta-generalization discussed above.
Computation Cost & Many-shot A naive implementation ACKNOWLEDGMENTS
of bilevel optimization as shown in Section 2.1 is expensive T. Hospedales was supported by the Engineering and Phys-
in both time (because each outer step requires several inner ical Sciences Research Council of the UK (EPSRC) Grant
steps) and memory (because reverse-mode differentiation number EP/S000631/1 and the UK MOD University De-
requires storing the intermediate inner states). For this rea- fence Research Collaboration (UDRC) in Signal Processing,
son, much of meta-learning has focused on the few-shot and EPSRC Grant EP/R026173/1.
regime [19], where the number of inner gradient steps (aka
the horizon) is small. However, there is an increasing focus
on methods which seek to extend optimization-based meta- R EFERENCES
learning to the many-shot regime, where long horizons are [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning For
inevitable. Popular solutions include implicit differentiation Image Recognition,” in CVPR, 2016.
of ω [153], [162], [293], forward-mode differentiation of ω [2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershel-
[66], [68], [294], gradient preconditioning [104], evolutionary vam, M. Lanctot et al., “Mastering The Game Of Go With Deep
strategies [295], solving for a greedy version of ω online by Neural Networks And Tree Search,” Nature, 2016.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2021.3079209, IEEE
Transactions on Pattern Analysis and Machine Intelligence
16

[3] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [37] T. Elsken, J. H. Metzen, and F. Hutter, “Neural Architecture
training Of Deep Bidirectional Transformers For Language Un- Search: A Survey,” Journal of Machine Learning Research, 2019.
derstanding,” in ACL, 2019. [38] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A Simple
[4] G. Marcus, “Deep Learning: A Critical Appraisal,” arXiv e-prints, Neural Attentive Meta-learner,” ICLR, 2018.
2018. [39] H. Stackelberg, The Theory Of Market Economy. Oxford University
[5] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. S. Pande, “Low Press, 1952.
Data Drug Discovery With One-shot Learning,” CoRR, 2016. [40] A. Sinha, P. Malo, and K. Deb, “A Review On Bilevel Optimiza-
[6] A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, tion: From Classical To Evolutionary Approaches And Applica-
M. Wu, L. Xu, and L. Van Gool, “AI Benchmark: All About Deep tions,” IEEE Transactions on Evolutionary Computation, 2018.
Learning On Smartphones In 2019,” arXiv e-prints, 2019. [41] Y. Li, Y. Yang, W. Zhou, and T. M. Hospedales, “Feature-Critic
[7] S. Thrun and L. Pratt, “Learning To Learn: Introduction And Networks For Heterogeneous Domain Generalization,” in ICML,
Overview,” in Learning To Learn, 1998. 2019.
[8] H. F. Harlow, “The Formation Of Learning Sets.” Psychological [42] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil, “Learning To
Review, 1949. Learn Around A Common Mean,” in NeurIPS, 2018.
[9] J. B. Biggs, “The Role of Meta-Learning in Study Processes,” [43] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhut-
British Journal of Educational Psychology, 1985. dinov, and A. J. Smola, “Deep sets,” in NIPS, 2017.
[10] A. M. Schrier, “Learning How To Learn: The Significance And [44] J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner,
Current Status Of Learning Set Formation,” Primates, 1984. “Meta-Learning Probabilistic Inference For Prediction,” ICLR,
[11] P. Domingos, “A Few Useful Things To Know About Machine 2019.
Learning,” Commun. ACM, 2012. [45] Y. Bengio, S. Bengio, and J. Cloutier, “Learning A Synaptic
[12] D. G. Lowe, “Distinctive Image Features From Scale-Invariant,” Learning Rule,” in IJCNN, 1990.
International Journal of Computer Vision, 2004. [46] S. Bengio, Y. Bengio, and J. Cloutier, “On The Search For New
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet Classifi- Learning Rules For ANNs,” Neural Processing Letters, 1995.
cation With Deep Convolutional Neural Networks,” in NeurIPS, [47] J. Schmidhuber, J. Zhao, and M. Wiering, “Simple Principles Of
2012. Meta-Learning,” Technical report IDSIA, 1996.
[14] S. Ravi and H. Larochelle, “Optimization As A Model For Few- [48] J. Schmidhuber, “A Neural Network That Embeds Its Own Meta-
Shot Learning,” in ICLR, 2016. levels,” in IEEE International Conference On Neural Networks, 1993.
[15] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, [49] ——, “A possibility for implementing curiosity and boredom in
D. Pfau, T. Schaul, and N. de Freitas, “Learning To Learn By model-building neural controllers,” in SAB, 1991.
Gradient Descent By Gradient Descent,” in NeurIPS, 2016. [50] S. Hochreiter, A. S. Younger, and P. R. Conwell, “Learning To
[16] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein, Learn Using Gradient Descent,” in ICANN, 2001.
“Meta-learning Update Rules For Unsupervised Representation [51] A. S. Younger, S. Hochreiter, and P. R. Conwell, “Meta-learning
Learning,” ICLR, 2019. With Backpropagation,” in IJCNN, 2001.
[17] J. Schmidhuber, “Evolutionary Principles In Self-referential [52] J. Storck, S. Hochreiter, and J. Schmidhuber, “Reinforcement
Learning,” On learning how to learn: The meta-meta-... hook, 1987. driven information acquisition in non-deterministic environ-
[18] J. Schmidhuber, J. Zhao, and M. Wiering, “Shifting Inductive ments,” in ICANN, 1995.
Bias With Success-Story Algorithm, Adaptive Levin Search, And [53] M. Wiering and J. Schmidhuber, “Efficient model-based explo-
Incremental Self-Improvement,” Machine Learning, 1997. ration,” in SAB, 1998.
[19] C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-learning [54] N. Schweighofer and K. Doya, “Meta-learning In Reinforcement
For Fast Adaptation Of Deep Networks,” in ICML, 2017. Learning,” Neural Networks, 2003.
[55] L. Y. Pratt, J. Mostow, C. A. Kamm, and A. A. Kamm, “Direct
[20] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil,
transfer of learned information among neural networks.” in
“Bilevel Programming For Hyperparameter Optimization And
AAAI, vol. 91, 1991.
Meta-learning,” in ICML, 2018.
[56] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How Transferable
[21] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable Archi-
Are Features In Deep Neural Networks?” in NeurIPS, 2014.
tecture Search,” in ICLR, 2019.
[57] G. Csurka, Domain Adaptation In Computer Vision Applications.
[22] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical Networks For
Springer, 2017.
Few Shot Learning,” in NeurIPS, 2017.
[58] D. Li and T. Hospedales, “Online Meta-Learning For Multi-
[23] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and Source And Semi-Supervised Domain Adaptation,” in ECCV,
P. Abbeel, “RL2 : Fast Reinforcement Learning Via Slow Rein- 2020.
forcement Learning,” in ArXiv E-prints, 2016.
[59] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter,
[24] R. Houthooft, R. Y. Chen, P. Isola, B. C. Stadie, F. Wolski, J. Ho, “Continual Lifelong Learning With Neural Networks: A Review,”
and P. Abbeel, “Evolved Policy Gradients,” NeurIPS, 2018. Neural Networks, 2019.
[25] F. Alet, M. F. Schneider, T. Lozano-Perez, and L. Pack Kaelbling, [60] Z. Chen and B. Liu, “Lifelong machine learning,” Synthesis Lec-
“Meta-Learning Curiosity Algorithms,” ICLR, 2020. tures on Artificial Intelligence and Machine Learning, 2018.
[26] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized [61] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch,
Evolution For Image Classifier Architecture Search,” AAAI, 2019. and P. Abbeel, “Continuous Adaptation Via Meta-Learning In
[27] B. Zoph and Q. V. Le, “Neural Architecture Search With Rein- Nonstationary And Competitive Environments,” ICLR, 2018.
forcement Learning,” ICLR, 2017. [62] S. Ritter, J. X. Wang, Z. Kurth-Nelson, S. M. Jayakumar, C. Blun-
[28] R. Vilalta and Y. Drissi, “A Perspective View And Survey Of dell, R. Pascanu, and M. Botvinick, “Been There, Done That:
Meta-learning,” Artificial intelligence review, 2002. Meta-learning With Episodic Recall,” ICML, 2018.
[29] S. Thrun, “Lifelong learning algorithms,” in Learning to learn. [63] I. Clavera, A. Nagabandi, S. Liu, R. S. Fearing, P. Abbeel,
Springer, 1998, pp. 181–209. S. Levine, and C. Finn, “Learning To Adapt In Dynamic, Real-
[30] J. Baxter, “Theoretical models of learning to learn,” in Learning to World Environments Through Meta-Reinforcement Learning,” in
learn. Springer, 1998, pp. 71–94. ICLR, 2019.
[31] D. H. Wolpert, “The Lack Of A Priori Distinctions Between [64] R. Caruana, “Multitask Learning,” Machine Learning, 1997.
Learning Algorithms,” Neural Computation, 1996. [65] Y. Yang and T. M. Hospedales, “Deep Multi-Task Representation
[32] J. Vanschoren, “Meta-Learning: A Survey,” CoRR, 2018. Learning: A Tensor Factorisation Approach,” in ICLR, 2017.
Timothy Hospedales is a Professor at the University of Edinburgh and a Principal Researcher at Samsung AI Research. His research interests are in data-efficient and robust learning-to-learn, with diverse applications in vision, language, reinforcement learning, and beyond.

Antreas Antoniou is a PhD student at the University of Edinburgh, supervised by Amos Storkey. His research contributions in meta-learning and few-shot learning are commonly seen as key benchmarks in the field. His main interests lie in meta-learning better priors, such as losses, initializations, and neural network layers, to improve few-shot and lifelong learning.

Paul Micaelli is a PhD student at the University of Edinburgh, supervised by Amos Storkey and Timothy Hospedales. His research focuses on zero-shot knowledge distillation and on meta-learning over long horizons for many-shot problems.

Amos Storkey is Professor of Machine Learning and AI in the School of Informatics, University of Edinburgh. He leads a research team focused on deep neural networks, Bayesian and probabilistic models, efficient inference, and meta-learning.
