Induction Networks for Few-Shot Text Classification
Ruiying Geng1,2, Binhua Li2, Yongbin Li2∗, Xiaodan Zhu3, Ping Jian1∗, Jian Sun2
1 School of Computer Science and Technology, Beijing Institute of Technology
2 Alibaba Group, Beijing
3 ECE, Queen's University
{ruiying.gry,binhua.lbh,shuide.lyb,jian.sun}@alibaba-inc.com
zhu2048@gmail.com, pjian@bit.edu.cn
Abstract

…is deficient or when it needs to adapt to unseen classes. In such challenging scenarios, recent studies have used meta-learning to simulate the few-shot task, in which new queries are compared to a small support set at the sample-wise level. However, this sample-wise comparison may be severely disturbed by the various expressions in the same class. Therefore, we should be able to learn a general representation of each class in the support set and then compare it to new queries. In this paper, we propose a novel Induction Network to learn such a generalized class-wise representation, by innovatively leveraging the dynamic routing algorithm in meta-learning. In this way, we find the model is able to induce and generalize better. We evaluate the proposed model on a well-studied sentiment classification dataset (English) and a real-world dialogue intent classification dataset (Chinese). Experiment results show that on both datasets, the proposed model significantly outperforms the existing state-of-the-art approaches, proving the effectiveness of class-wise generalization in few-shot text classification.

∗ Corresponding authors: Y.Li and P.Jian.

1 Introduction

Deep learning has achieved great success in many fields such as computer vision, speech recognition and natural language processing (Kuang et al., 2018). However, supervised deep learning is notoriously greedy for large labeled datasets, which limits the generalizability of deep models to new classes due to annotation cost. Humans, on the other hand, are readily capable of rapidly learning new classes of concepts with few examples or stimuli. This notable gap provides a fertile ground for further research.

The limitation of only one or very few examples challenges the standard fine-tuning method in deep learning. Early studies (Salamon and Bello, 2017) applied data augmentation and regularization techniques to alleviate the overfitting problem caused by data sparseness, but only to a limited extent. Instead, researchers have explored meta-learning (Finn et al., 2017) to leverage the distribution over similar tasks, inspired by human learning. Contemporary approaches to few-shot learning often decompose the training procedure into an auxiliary meta-learning phase, which includes many meta-tasks, following the principle that the testing and training conditions must match. They extract some transferable knowledge by switching the meta-task from mini-batch to mini-batch. As such, few-shot models can classify new classes with just a small labeled support set.

However, existing approaches for few-shot learning still confront many important problems, including the imposed strong priors (Fei-Fei et al., 2006), complex gradient transfer between tasks (Munkhdalai and Yu, 2017), and fine-tuning on the target problem (Qi et al., 2018). The approaches proposed by Snell et al. (2017) and Sung et al. (2018), which combine non-parametric methods and metric learning, provide potential solutions to some of those problems. The non-parametric methods allow novel examples to be rapidly assimilated without suffering from catastrophic overfitting. Such non-parametric models only need to learn the representation of the samples and the metric measure. However, instances in the same class are interlinked and have both a uniform fraction and their own specific fractions. In previous studies, the class-level representations are calculated by simply summing or averaging representations of samples in the support set. In doing so, essential information may be lost in the noise brought by the various forms of samples in the same class. Note that few-shot learning algorithms do not fine-tune on the support set. When the size of the support set increases, the improvement brought by more data will also be diminished by more sample-level noise.

Instead, we explore a better approach by performing induction at the class-wise level: ignoring irrelevant details and encapsulating general semantic information from samples with various linguistic forms in the same class. As a result, there is a need for an architecture that can reconstruct hierarchical representations of support sets and dynamically induce sample representations to class representations.

Recently, the capsule network (Sabour et al., 2017) has been proposed, which possesses the exciting potential to address the aforementioned issue. A capsule network uses "capsules" that perform dynamic routing to encode the intrinsic spatial relationship between parts and a whole that constitutes viewpoint-invariant knowledge. Following a similar spirit, we can regard samples as parts and a class as a whole. We propose the Induction Networks, which aim to model the ability of learning a generalized class-level representation from samples in a small support set, based on the dynamic routing process. First, an Encoder Module generates representations for query and support samples. Next, an Induction Module executes a dynamic routing procedure, in which the matrix transformation can be seen as a map from the sample space to the class space; the generation of the class representation then depends entirely on the routing-by-agreement procedure rather than on any parameters, which gives the proposed model a robust induction ability to deal with unseen classes. By regarding the samples' representations as input capsules and the classes' as output capsules, we expect to recognize the semantics of classes that are invariant to sample-level noise. Finally, the interaction between queries and classes is modelled: their representations are compared by a Relation Module to determine whether the query matches the class or not. Defining an episode-based meta-training strategy, the holistic model is meta-trained end-to-end with the generalizability and scalability to recognize unseen classes.

The specific contributions of our work are listed as follows:

• We propose the Induction Networks for few-shot text classification. To deal with sample-wise diversity in the few-shot learning task, our model is the first, to the best of our knowledge, that explicitly models the ability to induce class-level representations from small support sets.

• The proposed Induction Module combines the dynamic routing algorithm with typical meta-learning frameworks. The matrix transformation and routing procedure enable our model to generalize well to recognize unseen classes.

• Our method outperforms the current state-of-the-art models on two few-shot text classification datasets, including a well-studied sentiment classification benchmark and a real-world dialogue intent classification dataset.

2 Related Work

2.1 Few-Shot Learning

The seminal work on few-shot learning dates back to the early 2000s (Fe-Fei et al., 2003; Fei-Fei et al., 2006). The authors combined generative models with complex iterative inference strategies. More recently, many approaches have used a meta-learning (Finn et al., 2017; Mishra et al., 2018) strategy in the sense that they extract some transferable knowledge from a set of auxiliary tasks, which then helps them to learn the target few-shot problem well without suffering from overfitting. In general, these approaches can be divided into two categories.

Optimization-based Methods This type of approach aims to learn to optimize the model parameters given the gradients computed from the few-shot examples. Munkhdalai and Yu (2017) proposed the Meta Network, which learnt meta-level knowledge across tasks and shifted its inductive biases via fast parameterization for rapid generalization. Mishra et al. (2018) introduced a generic meta-learning architecture called SNAIL, which used a novel combination of temporal convolutions and soft attention.

Distance Metric Learning These approaches differ from the above approaches, which entail some complexity when learning the target few-shot problem. The core idea in metric-based few-shot learning is similar to nearest neighbours and kernel density estimation. The predicted probability over a set of known labels is a weighted sum of the labels of the support-set samples. Vinyals et al. (2016) produced a weighted K-nearest-neighbour classifier measured by the cosine distance, which was called Matching Networks. Snell et al. (2017) proposed the Prototypical Networks, which learnt a metric space where classification could be performed by computing squared Euclidean distances to prototype representations of each class. Different from fixed metric measures, the Relation Network learnt a deep distance metric to compare the query with given examples (Sung et al., 2018).

Recently, some studies have focused specifically on few-shot text classification problems. Xu et al. (2018) studied lifelong domain word embeddings via meta-learning. Yu et al. (2018) argued that the optimal meta-model may vary across tasks, and they employed a multi-metric model by clustering the meta-tasks into several defined clusters. Rios and Kavuluru (2018) developed a few-shot text classification model for multi-label text classification where there was a known structure over the label space. Xu et al. (2019) proposed an open-world learning model to deal with unseen classes in the product classification problem. We solve the few-shot learning problem from a different perspective and propose a dynamic routing induction method to encapsulate the abstract class representation from samples, achieving state-of-the-art performance on two datasets.

2.2 Capsule Network

The Capsule Network was first proposed by Sabour et al. (2017); it allows the network to robustly learn the invariants in part-whole relationships. Lately, the Capsule Network has been explored in the natural language processing field. Yang et al. (2018) successfully applied the Capsule Network to the fully supervised text classification problem with large labeled datasets. Unlike their work, we study few-shot text classification. Xia et al. (2018) reused a supervised model similar to that of Yang et al. (2018) for intent classification, in which a capsule-based architecture is extended to compute the similarity between target intents and source intents. Unlike their work, we propose Induction Networks for few-shot learning, in which we use capsules and dynamic routing to learn a generalized class-level representation from samples. The dynamic routing method makes our model generalize better in the few-shot text classification task.

3 Problem Definition

3.1 Few-Shot Classification

Few-shot classification (Vinyals et al., 2016; Snell et al., 2017) is a task in which a classifier must be adapted to accommodate new classes not seen in training, given only a few examples of each of these new classes. We have a large labeled training set with a set of classes Ctrain. However, after training, our ultimate goal is to produce classifiers on the testing set with a disjoint set of new classes Ctest, for which only a small labeled support set will be available. If the support set contains K labeled examples for each of the C unique classes, the target few-shot problem is called a C-way K-shot problem. Usually, K is too small to train a supervised classification model. Therefore, we aim to perform meta-learning on the training set and extract transferable knowledge that will allow us to deliver better few-shot learning on the support set and thus classify the test set more accurately.

3.2 Training Procedure

The training procedure has to be chosen carefully to match inference at test time. An effective way to exploit the training set is to decompose the training procedure into an auxiliary meta-learning phase and mimic the few-shot learning setting via episode-based training, as proposed in Vinyals et al. (2016). We construct a meta-episode to compute gradients and update our model in each training iteration. The meta-episode is formed by first randomly selecting a subset of classes from the training set, and then choosing a subset of examples within each selected class to act as the support set S and a subset of the remaining examples to serve as the query set Q. The meta-training procedure explicitly learns to learn from the given support set S to minimise a loss over the query set Q. We call this strategy episode-based meta training, and the details are shown in Algorithm 1. It is worth noting that there are exponentially many possible meta tasks to train the model on, making it hard to overfit. For example, if a dataset contains 159 training classes, this leads to
$\binom{159}{5} = 794{,}747{,}031$ possible 5-way tasks.

Algorithm 1 Episode-Based Meta Training
1: for each episode iteration do
2:   Randomly select C classes from the class space of the training set;
3:   Randomly select K labeled samples from each of the C classes as support set $S = \{(x_s, y_s)\}_{s=1}^{m}$ ($m = K \times C$), and select a fraction of the remainder of those C classes' samples as query set $Q = \{(x_q, y_q)\}_{q=1}^{n}$;
4:   Feed the support set S to the model and update the parameters by minimizing the loss on the query set Q;
5: end for

Figure 1: Induction Networks architecture for a C-way K-shot (C = 3, K = 2) problem with one query example.

…representation e of the text is the weighted sum of H:

$e = \sum_{t=1}^{T} a_t \cdot h_t \qquad (4)$

4.2 Induction Module

This section introduces the proposed dynamic routing induction algorithm. We regard the vectors e obtained from the support set S by Eq. 4 as sample vectors $e^s$, and the vectors e from the query set Q as query vectors $e^q$. The most important step is to extract the representation of each class in the support set. The main purpose of the induction module is to design a non-linear mapping from the sample vectors $e^s_{ij}$ to a class vector $c_i$:

$\{e^s_{ij} \in \mathbb{R}^{2u}\}_{i=1,\dots,C,\; j=1,\dots,K} \mapsto \{c_i \in \mathbb{R}^{2u}\}_{i=1}^{C}$
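As a rough, unofficial illustration of Eq. 4, the attention-weighted pooling that turns the T hidden states H into a single text vector e can be sketched in NumPy; the hidden states and attention weights below are toy values, not trained quantities:

```python
import numpy as np

def attentive_pool(H: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Eq. 4: e = sum_t a_t * h_t, a weighted sum of the T hidden
    states H (shape T x 2u) with attention weights a (shape T)."""
    assert H.shape[0] == a.shape[0]
    return a @ H  # shape (2u,)

# toy check: T = 4 timesteps, 2u = 6 hidden dimensions
H = np.arange(24, dtype=float).reshape(4, 6)
a = np.array([0.1, 0.2, 0.3, 0.4])  # assumed to come from a softmax
e = attentive_pool(H, a)
print(e.shape)  # (6,)
```

The same pooling is applied to every support text (giving the sample vectors) and to every query text (giving the query vectors).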
where $b_i$ holds the logits of the coupling coefficients, initialized to 0 in the first iteration. Given each sample prediction vector $\hat{e}^s_{ij}$, each class candidate vector $\hat{c}_i$ is a weighted sum of all sample prediction vectors $\hat{e}^s_{ij}$ in class i:

$\hat{c}_i = \sum_j d_{ij} \cdot \hat{e}^s_{ij} \qquad (8)$

Then a non-linear "squashing" function is applied to ensure that the length of the vector output by the routing process does not exceed 1:

$c_i = \mathrm{squash}(\hat{c}_i) \qquad (9)$

The last step in every iteration is to adjust the logits of the coupling coefficients $b_{ij}$ by a "routing-by-agreement" method. If a produced class candidate vector has a large scalar product with one sample prediction vector, there is top-down feedback which increases the coupling coefficient for that sample and decreases it for the other samples. This type of adjustment is very effective and robust for the few-shot learning scenario because it does not need to store any parameters. Each $b_{ij}$ is updated by:

$b_{ij} = b_{ij} + \hat{e}^s_{ij} \cdot c_i \qquad (10)$

Formally, we call our induction method dynamic routing induction and summarize it in Algorithm 2.

Algorithm 2 Dynamic Routing Induction
Require: sample vectors $e^s_{ij}$ in support set S; initialize the logits of coupling coefficients $b_{ij} = 0$
Ensure: class vector $c_i$
1: for all samples j = 1, ..., K in class i:
2:   $\hat{e}^s_{ij} = \mathrm{squash}(W_s e^s_{ij} + b_s)$
3: for iter iterations do
4:   $d_i = \mathrm{softmax}(b_i)$
5:   $\hat{c}_i = \sum_j d_{ij} \cdot \hat{e}^s_{ij}$
6:   $c_i = \mathrm{squash}(\hat{c}_i)$
7:   for all samples j = 1, ..., K in class i:
8:     $b_{ij} = b_{ij} + \hat{e}^s_{ij} \cdot c_i$
9: end for
10: Return $c_i$

4.3 Relation Module

After the class vector $c_i$ is generated by the Induction Module and each query text in the query set is encoded to a query vector $e^q$ by the Encoder Module, the next essential procedure is to measure the correlation between each pair of query and class. The output of the Relation Module is called the relation score, representing the correlation between $c_i$ and $e^q$; it is a scalar between 0 and 1. Specifically, we use the neural tensor layer (Socher et al., 2013) in this module, which has shown great advantages in modeling the relationship between two vectors (Wan et al., 2016; Geng et al., 2017). We choose it as the interaction function in this paper. The tensor layer outputs a relation vector as follows:

$v(c_i, e^q) = f\left(c_i^{\top} M^{[1:h]} e^q\right) \qquad (11)$

where $M^k \in \mathbb{R}^{2u \times 2u}, k \in [1, \dots, h]$ is one slice of the tensor parameters and f is the non-linear activation function ReLU (Glorot et al., 2011).
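Putting Eqs. 8-10 together, Algorithm 2 can be sketched for a single class as follows. This is our own minimal NumPy re-implementation, not the authors' released code; the shared transformation matrix `Ws`, bias `bs`, and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def squash(v: np.ndarray) -> np.ndarray:
    """Non-linear squashing: keeps the direction of v, maps its length into [0, 1)."""
    sq = np.sum(v ** 2)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + 1e-9)

def dynamic_routing_induction(E: np.ndarray, Ws: np.ndarray,
                              bs: np.ndarray, iters: int = 3) -> np.ndarray:
    """Induce one class vector c_i from the K sample vectors of class i.

    E:  (K, 2u) sample vectors e^s_ij
    Ws: (2u, 2u) shared transformation matrix; bs: (2u,) bias
    """
    K = E.shape[0]
    # sample prediction vectors: e_hat = squash(Ws e + bs)   (Alg. 2, line 2)
    E_hat = np.stack([squash(Ws @ E[j] + bs) for j in range(K)])
    b = np.zeros(K)  # logits of coupling coefficients
    for _ in range(iters):
        d = np.exp(b) / np.exp(b).sum()   # d_i = softmax(b_i)
        c = squash(d @ E_hat)             # Eqs. 8-9: weighted sum, then squash
        b = b + E_hat @ c                 # Eq. 10: routing by agreement
    return c

# toy example for one class: K = 2 samples, 2u = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(2, 4))
c = dynamic_routing_induction(E, np.eye(4), np.zeros(4))
print(np.linalg.norm(c) < 1.0)  # True: squash keeps the length below 1
```

Note that the routing loop itself introduces no trainable parameters; only `Ws` and `bs` are learned, which is what gives the induction step its robustness to unseen classes.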
The final relation score $r_{iq}$ between the i-th class and the q-th query is calculated by a fully connected layer activated by a sigmoid function.

                  Training Set    Testing Set
  Class Num       159             57
  Data Num        195,775         2,279
  Data Num/Class  ≥ 77            20 ∼ 77
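A minimal sketch of the relation scoring (Eq. 11 followed by the sigmoid-activated fully connected layer) might look like this; the parameter names `M`, `Wr`, `br` and the toy shapes are our assumptions, not the paper's notation for the final layer:

```python
import numpy as np

def relation_score(c: np.ndarray, e_q: np.ndarray,
                   M: np.ndarray, Wr: np.ndarray, br: float) -> float:
    """Eq. 11 plus the final layer: r = sigmoid(Wr . relu(c^T M^[1:h] e_q) + br).

    c, e_q: (2u,) class and query vectors
    M:      (h, 2u, 2u) tensor parameters, one (2u, 2u) slice per output dim
    Wr:     (h,) weights of the final fully connected layer; br: its bias
    """
    # relation vector: v_k = relu(c^T M^k e_q) for each slice k
    v = np.maximum(0.0, np.einsum('u,kuv,v->k', c, M, e_q))
    return float(1.0 / (1.0 + np.exp(-(Wr @ v + br))))  # sigmoid -> [0, 1]

# toy shapes: 2u = 4, h = 3
rng = np.random.default_rng(1)
r = relation_score(rng.normal(size=4), rng.normal(size=4),
                   rng.normal(size=(3, 4, 4)), rng.normal(size=3), 0.0)
print(0.0 <= r <= 1.0)  # True
```

Each bilinear slice `M^k` scores one dimension of the interaction between the class and query vectors, which is what distinguishes the neural tensor layer from a plain dot product or cosine similarity.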
• …combination of temporal convolutions and soft attention (Mishra et al., 2018).

• ROBUSTTC-FSL: This approach combines several metric-based methods by clustering the tasks (Yu et al., 2018).

The baseline results on ARSC are reported in Yu et al. (2018), and we implemented the baseline models on ODIC with the same text encoder module.

Implementation Details We use 300-dimension GloVe embeddings (Pennington et al., 2014) for the ARSC dataset and 300-dimension Chinese word embeddings trained by Li et al. (2018) for ODIC. We set the hidden state size of the LSTM to u = 128 and the attention dimension to da = 64. The iteration number iter used in the dynamic routing algorithm is 3. The relation module is a neural tensor layer with h = 100 followed by a fully connected layer activated by a sigmoid. We build 2-way 5-shot models on ARSC following Yu et al. (2018), and build episode-based meta training with C = [5, 10] and K = [5, 10] for comparison on ODIC. In addition to the K sample texts in the support set, the query set has 20 query texts for each of the C sampled classes in every training episode. This means, for example, that there are 20 × 5 + 5 × 5 = 125 texts in one training episode for the 5-way 5-shot experiments.

Evaluation Methods We evaluate the performance by few-shot classification accuracy, following previous studies in few-shot learning (Snell et al., 2017; Sung et al., 2018). To compare the proposed model with the baselines objectively, we compute mean few-shot classification accuracies on ODIC over 600 randomly selected episodes from the testing set. We sample 10 test texts per class in each episode for evaluation in both the 5-shot and 10-shot scenarios. Note that for ARSC, the support set for testing is fixed by Yu et al. (2018). Consequently, we just need to run the test episode once for each of the target tasks. The mean accuracy over the 12 target tasks is compared to the baseline models following Yu et al. (2018).

5.3 Experiment Results

Overall Performance Experiment results on ARSC are presented in Table 2. The proposed Induction Networks achieves 85.63% accuracy, outperforming the existing state-of-the-art model, ROBUSTTC-FSL, by a notable 3% margin. We attribute the improvement to the fact that ROBUSTTC-FSL builds a general metric method by integrating several metrics at the sample level, which makes it difficult to get rid of the noise among different expressions in the same class. In addition, the task-clustering-based method used by ROBUSTTC-FSL must be founded on a relevance matrix, which is inefficient when applied to real-world scenarios where the tasks change rapidly. Our Induction Networks, however, is trained in the meta-learning framework with more flexible generalization, and its induction ability can hence be accumulated across different tasks.

We also evaluate our method on a real-world intent classification dataset, ODIC. The experiment results are listed in Table 3. We can see that our proposed Induction Networks achieves the best classification performance in all four experiments. In the distance metric learning models (Matching Networks, Prototypical Networks, Graph Network and Relation Network), all the learning occurs in representing features and measuring distances at the sample-wise level. Our work builds an induction module focusing on the class-wise level of representation, which we claim to be more robust to variation among the samples in the support set. Our model also outperforms the latest optimization-based method, SNAIL. The difference between Induction Networks and SNAIL shown in Table 3 is statistically significant under the paired t-test at the 99% significance level. In…
Method                 Iteration   Accuracy
Routing + Relation     3           85.63
Routing + Relation     2           85.41
Routing + Relation     1           85.06
Routing + Cosine       3           84.67
Sum + Relation         -           83.07
Attention + Relation   -           85.15

[Figure: (a) Before transformation (b) After transformation]
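For intuition, the "Sum + Relation" ablation in the table above replaces the dynamic routing induction with a parameter-free sum of the support sample vectors; a hypothetical sketch of that baseline aggregation (our own illustration, not the paper's code) is:

```python
import numpy as np

def sum_induction(E: np.ndarray) -> np.ndarray:
    """Ablation baseline: class vector as the plain sum of the K sample
    vectors (shape K x 2u), with no transformation and no routing."""
    return E.sum(axis=0)

E = np.array([[1.0, 2.0], [3.0, 4.0]])  # K = 2 toy sample vectors
print(sum_induction(E))  # [4. 6.]
```

Because every sample contributes equally here, a single noisy support example shifts the class vector directly, whereas routing-by-agreement down-weights samples that disagree with the emerging class vector.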