
Noname manuscript No.

(will be inserted by the editor)

Knowledge Distillation: A Survey


Jianping Gou1 · Baosheng Yu1 · Stephen J. Maybank2 · Dacheng Tao1

arXiv:2006.05525v3 [cs.LG] 30 Jun 2020

Received: date / Accepted: date

Abstract In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, distillation algorithms and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.

Keywords Deep neural networks · Model compression · Knowledge distillation · Knowledge transfer · Teacher-student architecture.

Jianping Gou, E-mail: cherish.gjp@gmail.com
Baosheng Yu, E-mail: baosheng.yu@sydney.edu.au
Stephen J. Maybank, E-mail: sjmaybank@dcs.bbk.ac.uk
Dacheng Tao, E-mail: dacheng.tao@sydney.edu.au
1 UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.
2 Department of Computer Science and Information Systems, Birkbeck College, University of London, UK.

1 Introduction

During the last few years, deep learning has been the basis of many successes in artificial intelligence, including a variety of applications in computer vision (Krizhevsky et al., 2012), reinforcement learning (Silver et al., 2016), and natural language processing (Devlin et al., 2019). With the help of many recent techniques, including residual connections (He et al., 2016, 2020) and batch normalization (Ioffe and Szegedy, 2015), it is easy to train very deep models with thousands of layers on powerful GPU or TPU clusters. For example, it takes less than ten minutes to train a ResNet model on a popular image recognition benchmark with millions of images (Deng et al., 2009; Sun et al., 2019); it takes no more than one and a half hours to train a powerful BERT model for language understanding (Devlin et al., 2019; You et al., 2019). Large-scale deep models have achieved overwhelming success, but their huge computational complexity and massive storage requirements make it a great challenge to deploy them in real-time applications, especially on devices with limited resources, such as video surveillance systems and autonomous driving cars.

To develop efficient deep models, recent works usually focus on 1) efficient building blocks for deep models, including depthwise separable convolution, as in MobileNets (Howard et al., 2017; Sandler et al., 2018) and ShuffleNets (Zhang et al., 2018a; Ma et al., 2018); and 2) model compression and acceleration techniques, in the following categories (Cheng et al., 2018).

• Parameter pruning and sharing: These methods focus on removing inessential parameters from deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016), model binarization (Courbariaux et al., 2015), structural matrices (Sindhwani et al., 2015) and parameter sharing (Han et al., 2015; Wang et al., 2019f).
• Low-rank factorization: These methods identify redundant parameters of deep neural networks by employing matrix and tensor decomposition (Yu et al., 2017; Denton et al., 2014).
• Transferred compact convolutional filters: These methods remove inessential parameters by transferring or compressing the convolutional filters (Zhai et al., 2016).
• Knowledge distillation (KD): These methods distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015).

A comprehensive review on model compression and acceleration is outside the scope of this paper. The focus of this paper is knowledge distillation, which has received increasing attention from the research community. Large deep models tend to achieve good performance in practice, because over-parameterization improves the generalization performance when new data is considered (Brutzkus and Globerson, 2019; Allen-Zhu et al., 2019; Arora et al., 2018; Zhang et al., 2018; Tu et al., 2020). In knowledge distillation, a small student model is supervised by a large teacher model (Bucilua et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015; Urban et al., 2017). The key problem is how to transfer the knowledge from the teacher model to the student model. A general teacher-student framework for knowledge distillation is shown in Fig. 1.

Fig. 1 The generic teacher-student framework for knowledge distillation.

Despite its great success in practice, there are not many works on either the theoretical or the empirical understanding of knowledge distillation (Cheng et al., 2020; Phuong and Lampert, 2019; Cho and Hariharan, 2019). Specifically, to understand the working mechanisms of knowledge distillation, Phuong & Lampert obtained a theoretical justification for a generalization bound with fast convergence of learning distilled student networks in the scenario of deep linear classifiers (Phuong and Lampert, 2019). This justification answers what and how fast the student learns, and reveals the factors determining the success of distillation: successful distillation relies on the data geometry, the optimization bias of the distillation objective and the strong monotonicity of the student classifier. Cheng et al. quantified the extraction of visual concepts from the intermediate layers of a deep neural network to explain knowledge distillation (Cheng et al., 2020). Cho & Hariharan empirically analyzed in detail the efficacy of knowledge distillation (Cho and Hariharan, 2019). Empirical results show that a larger model may not be a better teacher because of the model capacity gap (Mirzadeh et al., 2019). Experiments also show that distillation can adversely affect the student's learning. The empirical evaluation of different forms of knowledge distillation, with respect to knowledge, distillation and the mutual interaction between teacher and student, is not covered by Cho and Hariharan (2019). Knowledge distillation has also been explored for label smoothing, for assessing the accuracy of the teacher and for obtaining a prior for the optimal output layer geometry (Tang et al., 2020).

Knowledge distillation for model compression is similar to the way in which human beings learn. Inspired by this, recent knowledge distillation methods have been extended to teacher-student learning (Hinton et al., 2015), mutual learning (Zhang et al., 2018b), assistant teaching (Mirzadeh et al., 2019), lifelong learning (Zhai et al., 2019), and self-learning (Yuan et al., 2019). Most of the extensions of knowledge distillation concentrate on compressing deep neural networks. The resulting lightweight student networks can be easily deployed in applications such as visual recognition, speech recognition, and natural language processing (NLP). Furthermore, the transfer of knowledge from one model to another in knowledge distillation can be extended to other tasks, such as adversarial attacks (Papernot et al., 2016), data augmentation (Lee et al., 2019a; Gordon and Duh, 2019), and data privacy and security (Wang et al., 2019a).

In this paper, we present a comprehensive survey on knowledge distillation. The main objectives of this survey are to 1) provide an overview of knowledge distillation, including the background and motivations, basic notations and formulations, and several typical kinds of knowledge, distillation and algorithms; 2) review the recent progress of knowledge distillation, including algorithms and applications to different real-world scenarios; and 3) address some hurdles and provide insights into knowledge distillation based on different perspectives of knowledge transfer, including different types of knowledge, training schemes, distillation algorithms and structures, and applications. The organization of this paper is shown in Fig. 2. The important concepts and the conventional model of knowledge distillation are provided in Section 2. The different kinds of knowledge and of distillation are summarized in Sections 3 and 4, respectively. The existing studies on teacher-student structures in knowledge distillation are illustrated in Section 5. The latest knowledge distillation approaches are comprehensively summarized in Section 6. The many applications of knowledge distillation are illustrated in Section 7. Challenging problems and future directions in knowledge distillation are discussed, and a conclusion is given, in Section 8.

Fig. 2 The schematic structure of the survey on knowledge distillation. The body of this survey mainly covers the fundamentals of knowledge distillation, knowledge, distillation, teacher-student architecture, the knowledge distillation algorithms and applications, and discussions about challenges and directions. Note that 'Section' is abbreviated as 'Sec.' in this figure.

2 Background

In this section, we first introduce the background to knowledge distillation, and then review the notation for formulating a vanilla knowledge distillation method (Hinton et al., 2015).

Deep neural networks have achieved remarkable success, especially in real-world scenarios with large-scale data. However, the deployment of deep neural networks on mobile devices and embedded systems is a great challenge, due to the limited computational capacity and memory of the devices. To address this issue, Bucilua et al. (2006) first proposed model compression to transfer the information from a large model or an ensemble of models into the training of a small model without a significant drop in accuracy. The main idea is that the student model mimics the teacher model in order to obtain a competitive or even a superior performance. The learning of a small model from a large model was later popularized as knowledge distillation (Hinton et al., 2015).

Fig. 3 An intuitive example of hard and soft targets for knowledge distillation in (Liu et al., 2018c): over the classes cow, dog, cat and car, the hard target is (0, 1, 0, 0), while the soft targets are approximately (10^{-6}, 0.9, 0.1, 10^{-9}).

A vanilla knowledge distillation framework usually contains one or more large pre-trained teacher models and a small student model. The teacher models are usually much larger than the student model. The main idea is to train an efficient student model that obtains a comparable accuracy under the guidance of the teacher models. The supervision signal from a teacher model, which is usually referred to as the "knowledge" learned by the teacher model, helps the student model to mimic the behavior of the teacher model. In a typical image classification task, the logits (e.g., the output of the last layer in a deep neural network) are used as the carriers of the knowledge from the teacher model, which is not explicitly provided by the training data samples. For example, an image of a cat is mistakenly classified as a dog with a very low probability, but the probability of such a mistake is still many times higher than the probability of mistaking a cat for a car (Liu et al., 2018c). Another example is that an image of a handwritten digit 2 is more similar to the digit 3 than to the digit 7. This knowledge learned by a teacher model is known as dark knowledge in (Hinton et al., 2015).

The method of transferring dark knowledge in vanilla knowledge distillation is formulated as follows. Given a vector of logits z as the output of the last fully connected layer of a deep model, where z_i is the logit for the i-th class, the probability p_i that the input belongs to the i-th class can be estimated by a softmax function,

p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.   (1)

Therefore, the predictions of the soft targets obtained by the teacher model contain the dark knowledge and can be used as a supervisor to transfer knowledge from the teacher model to the student model. Similarly, the one-hot label is referred to as a hard target. An intuitive example of soft and hard targets is shown in Fig. 3. Furthermore, a temperature factor T is introduced to control the importance of each soft target as

p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)},   (2)

where a higher temperature produces a softer probability distribution over classes. Specifically, when T \to \infty all classes share the same probability; when T \to 0, the soft targets become one-hot labels, i.e., the hard targets. Both the soft targets from the teacher model and the ground-truth label are of great importance for improving the performance of the student model (Bucilua et al., 2006; Hinton et al., 2015; Romero et al., 2015); they are used for a distilled loss and a student loss, respectively.

The distilled loss is defined to match the logits of the teacher model and the student model as follows:

L_D(p(z_t, T), p(z_s, T)) = -\sum_i p_i(z_{ti}, T) \log(p_i(z_{si}, T)),   (3)

where z_t and z_s are the logits of the teacher and student models, respectively. The logits of the teacher model are matched via the cross-entropy gradient with respect to the logits of the student model. The cross-entropy gradient with respect to the logit z_{si} can be evaluated as

\frac{\partial L_D(p(z_t,T), p(z_s,T))}{\partial z_{si}} = \frac{p_i(z_{si},T) - p_i(z_{ti},T)}{T} = \frac{1}{T}\left(\frac{\exp(z_{si}/T)}{\sum_j \exp(z_{sj}/T)} - \frac{\exp(z_{ti}/T)}{\sum_j \exp(z_{tj}/T)}\right).   (4)

If the temperature T is much higher than the magnitude of the logits, \partial L_D(p(z_t,T), p(z_s,T)) / \partial z_{si} can be approximated according to its Taylor series,

\frac{\partial L_D(p(z_t,T), p(z_s,T))}{\partial z_{si}} \approx \frac{1}{T}\left(\frac{1 + z_{si}/T}{N + \sum_j z_{sj}/T} - \frac{1 + z_{ti}/T}{N + \sum_j z_{tj}/T}\right).   (5)

If it is further assumed that the logits of each transfer training sample are zero-mean, i.e., \sum_j z_{sj} = \sum_j z_{tj} = 0, then Eq. (5) can be simplified as

\frac{\partial L_D(p(z_t,T), p(z_s,T))}{\partial z_{si}} \approx \frac{1}{NT^2}(z_{si} - z_{ti}).   (6)

Therefore, according to Eq. (6), the distilled loss is equivalent to matching the logits of the teacher model and the student model under the conditions of a high temperature and zero-mean logits, i.e., minimizing (z_{si} - z_{ti}). Thus, distillation through matching logits at a high temperature can convey much of the useful knowledge learned by the teacher model to the student model.
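To make Eqs. (2)-(6) concrete, the following minimal NumPy sketch (not from the surveyed papers; the logits and temperature are made-up illustrative values) computes temperature-softened probabilities and numerically checks that, for a large temperature, the gradient of the distilled loss with respect to the student logits is close to (z_s - z_t)/(N T^2).

```python
import numpy as np

def softmax_T(z, T):
    """Temperature-softened softmax, Eq. (2)."""
    e = np.exp((z - z.max()) / T)   # subtract max for numerical stability
    return e / e.sum()

# Toy zero-mean logits for teacher and student (illustrative values only).
z_t = np.array([4.0, 1.0, -2.0, -3.0]); z_t -= z_t.mean()
z_s = np.array([2.0, 2.0, -1.0, -3.0]); z_s -= z_s.mean()
N, T = len(z_t), 20.0               # large temperature

# Distilled loss L_D, Eq. (3): cross-entropy between softened distributions.
p_t, p_s = softmax_T(z_t, T), softmax_T(z_s, T)
L_D = -np.sum(p_t * np.log(p_s))

# Exact gradient, Eq. (4), versus the high-temperature approximation, Eq. (6).
grad_exact = (p_s - p_t) / T
grad_approx = (z_s - z_t) / (N * T**2)
print(L_D, grad_exact, grad_approx)  # the two gradients nearly coincide
```

Running the sketch shows the exact and approximate gradients agreeing to several decimal places once T is much larger than the logit magnitudes, which is the condition assumed in Eq. (6).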

The student loss is defined as the cross-entropy between the ground truth label and the soft logits of the student model:

L_S(y, p(z_s, T)) = -\sum_i y_i \log(p_i(z_{si}, T)),   (7)

where y is a ground truth vector, in which only one element is 1, representing the ground truth label of the transfer training sample, and the others are 0. The distilled and student losses both use the same logits of the student model but with different temperatures: the temperature is T = 1 in the student loss and T = t in the distilled loss. Finally, the benchmark model of vanilla knowledge distillation is the joint of the distilled and student losses:

L(x, W) = \alpha \, L_D(p(z_t, T), p(z_s, T)) + \beta \, L_S(y, p(z_s, T)),   (8)

where x is a training input on the transfer set, W are the parameters of the student model, and \alpha and \beta are the regulating parameters. To make knowledge distillation easy to understand, the specific architecture of vanilla knowledge distillation, with the joint of the teacher and student models, is shown in Fig. 4. In the knowledge distillation shown in Fig. 4, the teacher model is always pre-trained first and the student model is then trained using the knowledge from the trained teacher model; this is, in fact, offline knowledge distillation. Moreover, only the knowledge from the soft targets of the pre-trained teacher model is used for training the student model. Of course, there are other types of knowledge and distillation that will be discussed in the next sections.

Fig. 4 The specific architecture of the benchmark knowledge distillation.

3 Knowledge

In this section, we focus on different categories of knowledge for knowledge distillation. Vanilla knowledge distillation uses the logits of a large deep model as the teacher knowledge (Hinton et al., 2015; Kim et al., 2018; Ba and Caruana, 2014; Mirzadeh et al., 2019). The activations, neurons or features of intermediate layers can also be used as the knowledge to guide the learning of the student model (Romero et al., 2015; Huang and Wang, 2017; Ahn et al., 2019; Heo et al., 2019c; Zagoruyko and Komodakis, 2017). The relationships between different activations, neurons or pairs of samples contain rich information learned by the teacher model (Yim et al., 2017; Lee and Song, 2019; Liu et al., 2019f; Tung and Mori, 2019; Yu et al., 2019). Furthermore, the parameters of the teacher model (or the connections between layers) also contain another kind of knowledge (Liu et al., 2019c). We discuss the different forms of knowledge in the following categories: response-based knowledge, feature-based knowledge, and relation-based knowledge. An intuitive example of the different categories of knowledge within a teacher model is shown in Fig. 5.

3.1 Response-Based Knowledge

Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. Response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications. The most popular response-based knowledge for image classification is known as the soft targets (Ba and Caruana, 2014; Hinton et al., 2015). Response-based knowledge can be used for different types of model prediction. For example, the response in an object detection task may contain the logits together with the offset of a bounding box (Chen et al., 2017). In semantic landmark localization tasks, e.g., human pose estimation, the response of the teacher model may include a heatmap for each landmark (Zhang et al., 2019a). Recently, response-based knowledge has been further explored to address the information of the ground-truth label as the conditional targets (Meng et al., 2019).
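As a concrete illustration of the response-based (vanilla) distillation objective of Eqs. (3), (7) and (8), here is a minimal PyTorch sketch; it is an assumed implementation rather than code from the surveyed papers, and the weights alpha, beta and the temperature are placeholder hyper-parameters.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels,
                    T: float = 4.0, alpha: float = 0.9, beta: float = 0.1):
    """Benchmark KD objective of Eq. (8): alpha * L_D + beta * L_S."""
    # Distilled loss L_D, Eq. (3): cross-entropy between softened distributions.
    # (Implementations often additionally rescale this term by T**2 to balance
    # gradient magnitudes; cf. the 1/T^2 factor in Eq. (6).)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    loss_d = -(p_t * log_p_s).sum(dim=1).mean()

    # Student loss L_S, Eq. (7): standard cross-entropy with the hard labels (T = 1).
    loss_s = F.cross_entropy(student_logits, labels)

    return alpha * loss_d + beta * loss_s

# Usage with random toy tensors (batch of 8, 10 classes):
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = vanilla_kd_loss(s, t, y)
loss.backward()
```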

Fig. 5 Schematic illustrations of the sources of response-based knowledge, feature-based knowledge and relation-based knowledge in a deep teacher network.

The idea of response-based knowledge is straightforward and easy to understand, especially in the context of "dark knowledge". From another perspective, the effectiveness of the soft targets is analogous to label smoothing (Kim and Kim, 2017), i.e., both are effective regularizers (Muller et al., 2019; Ding et al., 2019). Moreover, response-based knowledge can also be enhanced by label smoothing.

3.2 Feature-Based Knowledge

Deep neural networks are good at learning multiple levels of feature representation with increasing abstraction. This is known as representation learning (Bengio et al., 2013). The output of the last layer and the outputs of intermediate layers, i.e., feature maps, can all be used as knowledge to supervise the training of the student model.

Intermediate representations were first introduced in FitNets (Romero et al., 2015), which provide hints to improve the training of the student model. The main idea is to directly match the feature activations of the teacher and the student. Inspired by this, a variety of other methods have been proposed to match the features indirectly (Zagoruyko and Komodakis, 2017; Passalis and Tefas, 2018; Kim et al., 2018; Heo et al., 2019c). Specifically, Zagoruyko and Komodakis (2017) derived an "attention map" from the original feature maps to express knowledge. The attention map was generalized by Huang and Wang (2017) using neuron selectivity transfer. Passalis and Tefas (2018) transferred knowledge by matching the probability distribution in feature space. To make it easier to transfer the teacher knowledge, Kim et al. (2018) introduced so-called "factors" as a more understandable form of intermediate representations. Recently, Heo et al. (2019c) proposed to use the activation boundaries of the hidden neurons for knowledge transfer. Interestingly, the parameter sharing of the hint layers of the teacher model, together with response-based knowledge, has also been used as the teacher knowledge (Zhou et al., 2018). A summary of the categories of feature-based knowledge is shown in Table 1.

3.3 Relation-Based Knowledge

Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model. Relation-based knowledge further explores the relationships between different layers and between data samples.

To explore the relationships between different feature maps, Yim et al. (2017) proposed a flow of solution process (FSP), which is defined by the Gram matrix between two layers. The FSP matrix summarizes the relations between pairs of feature maps. It is calculated using the inner products between features from two layers.
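As a sketch of this idea (an illustrative implementation, not the authors' code; feature shapes are assumed for illustration), the FSP-style matrix below is the Gram matrix of inner products between two feature maps of the same spatial size.

```python
import torch

def fsp_matrix(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Gram matrix between feature maps of shape (B, C1, H, W) and (B, C2, H, W).

    Returns a (B, C1, C2) tensor of inner products over spatial positions,
    normalized by H * W, summarizing the relation between the two layers.
    """
    b, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    fa = feat_a.reshape(b, c1, h * w)
    fb = feat_b.reshape(b, c2, h * w)
    return torch.bmm(fa, fb.transpose(1, 2)) / (h * w)

# Toy usage: a distillation loss would match teacher and student FSP matrices
# computed from corresponding layer pairs.
fa, fb = torch.randn(4, 64, 8, 8), torch.randn(4, 32, 8, 8)
G = fsp_matrix(fa, fb)                 # shape (4, 64, 32)
```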

Table 1 The summary of feature-based knowledge.

• Feature representations of the middle layer (Romero et al., 2015): The middle layer of the student is guided by that of the teacher via hint-based training.
• Parameter distributions of layers (Liu et al., 2019c): The parameter distributions of layers are matched between teacher and student.
• Relative dissimilarities of hint maps (You et al., 2017): The distance-based dissimilarity relationships of the teacher are transferred into the student.
• Attention transfer (Zagoruyko and Komodakis, 2017): The student mimics the attention maps of the teacher.
• Factor transfer (Kim et al., 2018): Convolutional modules, as factors, encode the feature representations of the output layer.
• Probabilistic knowledge transfer (Passalis and Tefas, 2018): A soft probabilistic distribution of the data defined on the feature representations of the output layer.
• Parameter sharing (Zhou et al., 2018): The student shares the same lower layers of the network with the teacher.
• Activation boundaries (Heo et al., 2019c): The activation boundaries formed by the hidden neurons of the student match those of the teacher.
• Neuron selectivity transfer (Huang and Wang, 2017): The student imitates the distribution of the neuron activations from the hint layers of the teacher.
• Feature responses and neuron activations (Heo et al., 2019a): The magnitude of the feature response, carrying enough feature information, and the activation status of each neuron.

Table 2 The summary of relation-based knowledge.

• Multi-head graph (Lee and Song, 2019): The intra-data relations between any two feature maps via a multi-head attention network.
• Logits and representation graphs (Zhang and Peng, 2018): The logits graph of multiple self-supervised teachers distills softened prediction knowledge at the classifier level, and the representation graph distills the feature knowledge from pairwise ensemble representations of the compact bilinear pooling layers of the teachers.
• FSP matrix (Yim et al., 2017; Lee et al., 2018): The relations between any two feature maps from any two layers of the networks.
• Mutual relations of data (Park et al., 2019): Mutual relations between feature representation outputs, structure-wise.
• Instance relationship graph (Liu et al., 2019f): This graph contains the knowledge about instance features, instance relationships and the feature space transformation across layers of the teacher.
• Similarities between feature representations (Chen et al., 2020b; Tung and Mori, 2019): The locality similarities between feature representations of the hint layers of the teacher networks (Chen et al., 2020b); the similar activations between input pairs in the teacher networks (Tung and Mori, 2019).
• Correlation congruence (Peng et al., 2019a): Correlation between instances of data.
• Embedding networks (Yu et al., 2019): The distances between feature embeddings of layers from the teacher and between data pairs.
• Similarity transfer (Chen et al., 2018c): Cross-sample similarities.

Using the correlations between feature maps as the distilled knowledge, knowledge distillation via singular value decomposition was proposed to extract key information in the feature maps (Lee et al., 2018). To use the knowledge from multiple teachers, Zhang and Peng (2018) formed two graphs by respectively using the logits and the features of each teacher model as the nodes. Specifically, the importance and the relationships of the different teachers are modelled by the logits and representation graphs before the knowledge transfer (Zhang and Peng, 2018). Graph-based knowledge distillation from the intra-data relations was proposed by Lee and Song (2019). Park et al. proposed a relational knowledge distillation method (Park et al., 2019).

Traditional knowledge transfer methods often involve individual knowledge distillation, in which the individual soft targets of a teacher are directly distilled into the student. In fact, the distilled knowledge contains not only feature information but also the mutual relations of the data. Liu et al. proposed a robust and effective knowledge distillation method via an instance relationship graph (Liu et al., 2019f). The transferred knowledge in the instance relationship graph contains instance features, instance relationships and the feature space transformation across layers. Based on the idea of manifold learning, the student network can be learned by feature embedding, which preserves the information about the positions of feature representations in the hint layers of the teacher networks (Chen et al., 2020b). Through feature embedding, the similarities of the data in the teacher network are transferred into the student network.
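The basic quantity behind these similarity- and relation-based objectives is a batch-wise pairwise-similarity matrix computed from features. The sketch below is an assumed implementation for illustration, in the spirit of the similarity-preserving method discussed next; the feature dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(features: torch.Tensor) -> torch.Tensor:
    """Row-normalized similarity matrix of a batch of features (B, D) -> (B, B)."""
    f = features.reshape(features.shape[0], -1)
    sim = f @ f.t()                      # inner products between all sample pairs
    return F.normalize(sim, p=2, dim=1)  # L2-normalize each row

def similarity_preserving_loss(student_feat, teacher_feat):
    """Mean squared difference between student and teacher similarity matrices."""
    g_s = pairwise_similarity(student_feat)
    g_t = pairwise_similarity(teacher_feat)
    return ((g_s - g_t) ** 2).mean()

# Toy usage: features from a hint layer of the teacher and student for one batch.
loss = similarity_preserving_loss(torch.randn(16, 128), torch.randn(16, 256))
```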

Tung and Mori proposed a similarity-preserving knowledge distillation method (Tung and Mori, 2019). In particular, similarity-preserving knowledge, which arises from the similar activations of input pairs in the teacher networks, is transferred into the student network, with the pairwise similarities preserved. Peng et al. proposed a knowledge distillation method based on correlation congruence, in which the distilled knowledge contains both the instance-level information and the correlations between instances (Peng et al., 2019a). Using correlation congruence for distillation, the student network can learn the correlations between instances.

Distilled knowledge can also be categorized from other perspectives, such as structured knowledge of the data (Liu et al., 2019f; Chen et al., 2020b; Tung and Mori, 2019; Peng et al., 2019a; Tian et al., 2020), privileged information about the input features (Lopez-Paz et al., 2015; Vapnik and Izmailov, 2015), and so on. A summary of the categories of relation-based knowledge is shown in Table 2.

4 Distillation Schemes

In this section, we discuss the training schemes (i.e., distillation schemes) for the teacher and student models. According to whether the teacher model is updated simultaneously with the student model or not, the learning schemes of knowledge distillation can be divided into three main categories: offline distillation, online distillation and self-distillation, as shown in Fig. 6.

Fig. 6 Different distillations. "Pre-trained" (red in the original figure) means the network is learned before distillation, and "to be trained" (yellow) means the network is learned during distillation.

4.1 Offline Distillation

Most previous knowledge distillation methods work offline. In vanilla knowledge distillation (Hinton et al., 2015), the knowledge is transferred from a pre-trained teacher model into a student model. Therefore, the whole training process has two stages, namely: 1) the large teacher model is first trained on a set of training samples before distillation; and 2) the teacher model is used to extract the knowledge in the form of logits or intermediate features, which are then used to guide the training of the student model during distillation.

The first stage in offline distillation is usually not discussed as part of knowledge distillation, i.e., it is assumed that the teacher model is pre-defined. Little attention is paid to the teacher model structure and its relationship with the student model. Therefore, the offline methods mainly focus on improving different parts of the knowledge transfer, including the design of the knowledge (Hinton et al., 2015; Romero et al., 2015) and the loss functions for matching features or distributions (Huang and Wang, 2017; Passalis and Tefas, 2018; Zagoruyko and Komodakis, 2017; Mirzadeh et al., 2019; Li et al., 2018; Heo et al., 2019b; Asif et al., 2019). The main advantage of offline methods is that they are simple and easy to implement. For example, the teacher model may contain a set of models trained using different software packages, possibly located on different machines; the knowledge can be extracted and stored in a cache.

Obviously, offline distillation methods always employ one-way knowledge transfer and a two-phase training procedure. Nevertheless, offline distillation requires a complex high-capacity teacher network and a large amount of time-consuming training on large-scale training data, and the capacity gap between teacher and student always exists before distillation.

4.2 Online Distillation

Although offline distillation methods are simple and effective, some issues in offline distillation have attracted increasing attention from the research community (Mirzadeh et al., 2019). To overcome the limitations of offline distillation, online distillation is proposed to further improve the performance of the student model, especially when a large-capacity high-performance teacher model is not available (Zhang et al., 2018b; Chen et al., 2020a). In online distillation, both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.

A variety of online knowledge distillation methods have been proposed, especially in the last two years (Zhang et al., 2018b; Chen et al., 2020a; Zhu and Gong, 2018; Xie et al., 2019; Anil et al., 2018; Kim et al., 2019b; Zhou et al., 2018). Specifically, in deep mutual learning (Zhang et al., 2018b), multiple neural networks work in a collaborative way; any one network can be the student model and the other models can be the teachers during the training process. Chen et al. (2020a) further introduced auxiliary peers and a group leader into deep mutual learning to form a diverse set of peer models. To reduce the computational cost, Zhu and Gong (2018) proposed a multi-branch architecture, in which each branch indicates a student model and the different branches share the same backbone network. Rather than using the ensemble of logits, Kim et al. (2019b) introduced a feature fusion module to construct the teacher classifier. Xie et al. (2019) replaced the convolution layers with cheap convolution operations to form the student model. Anil et al. employed online distillation to train large-scale distributed neural networks, and proposed a variant of online distillation called codistillation (Anil et al., 2018). Codistillation trains multiple models with the same architecture in parallel, and any one model is trained by transferring the knowledge from the other models. Recently, an online adversarial knowledge distillation method was proposed to simultaneously train multiple networks via discriminators, using knowledge from both the class probabilities and a feature map (Chung et al., 2020).

4.3 Self-Distillation

In self-distillation, the same network is used for the teacher and the student models (Yuan et al., 2019; Zhang et al., 2019b; Hou et al., 2019; Yang et al., 2019b; Yun et al., 2019; Hahn and Choi, 2019; Lee et al., 2019a). This can be regarded as a special case of online distillation. Specifically, Zhang et al. proposed a new self-distillation method, in which knowledge from the deeper sections of the network is distilled into its shallow sections (Zhang et al., 2019b). Similar to the self-distillation in (Zhang et al., 2019b), a self-attention distillation method was proposed for lane detection (Hou et al., 2019); the network utilizes the attention maps of its own layers as distillation targets for its lower layers. Snapshot distillation is a special variant of self-distillation, in which knowledge in the earlier epochs of the network (teacher) is transferred into its later epochs (student) to support a supervised training process within the same network (Yang et al., 2019b).

Three further interesting self-distillation methods were recently proposed by Yuan et al. (2019), Hahn and Choi (2019) and Yun et al. (2019). Yuan et al. proposed teacher-free knowledge distillation methods based on an analysis of label smoothing regularization (Yuan et al., 2019). Hahn and Choi proposed a novel self-knowledge distillation method, in which the self-knowledge consists of the predicted probabilities instead of traditional soft probabilities (Hahn and Choi, 2019). These predicted probabilities are defined by the feature representations of the training model; they reflect the similarities of the data in the feature embedding space. Yun et al. proposed class-wise self-knowledge distillation to match the output distributions of the training model between intra-class samples and augmented samples within the same source with the same model (Yun et al., 2019). In addition, the self-distillation proposed by Lee et al. (2019a) is adopted for data augmentation, and the self-knowledge of the augmentation is distilled into the model itself. Self-distillation is also adopted to optimize deep models (the teacher or student networks) with the same architecture one by one (Furlanello et al., 2018; Bagherinezhad et al., 2018); each network distills the knowledge of the previous network using teacher-student optimization.

For a further intuitive understanding of distillation, offline, online and self distillation can also be summarized from the perspective of human teacher-student learning. Offline distillation means the knowledgeable teacher teaches the new student the knowledge; online distillation means the teacher and student study together under the main supervision of the teacher; self-distillation means the student learns the knowledge by itself, without a teacher. Moreover, just like human learning, these three kinds of distillation can be combined to complement each other due to their own advantages.
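To make the online scheme concrete, the following PyTorch sketch shows one training step of mutual-learning-style online distillation between two peer networks, in the spirit of deep mutual learning (Zhang et al., 2018b) described above; the temperature, optimizers and the use of detach() for the peer targets are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net1, net2, opt1, opt2, x, y, T: float = 3.0):
    """One online-distillation step: each peer learns from the labels and
    from the other peer's softened predictions (both networks are updated)."""
    logits1, logits2 = net1(x), net2(x)

    # Peer 1: cross-entropy with labels + KL to peer 2's softened outputs.
    kl_1 = F.kl_div(F.log_softmax(logits1 / T, dim=1),
                    F.softmax(logits2.detach() / T, dim=1),
                    reduction='batchmean') * (T * T)
    loss1 = F.cross_entropy(logits1, y) + kl_1

    # Peer 2: symmetric objective.
    kl_2 = F.kl_div(F.log_softmax(logits2 / T, dim=1),
                    F.softmax(logits1.detach() / T, dim=1),
                    reduction='batchmean') * (T * T)
    loss2 = F.cross_entropy(logits2, y) + kl_2

    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```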
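Likewise, a minimal sketch of the self-distillation idea with the same architecture for teacher and student (in the spirit of born-again networks, Furlanello et al. (2018)): each new generation of the same model is supervised by the previous generation's softened outputs. The model constructor, data loader, learning rate and loss weights are placeholders for illustration.

```python
import copy
import torch
import torch.nn.functional as F

def train_born_again(make_model, loader, generations: int = 3,
                     epochs: int = 1, T: float = 4.0, alpha: float = 0.5):
    """Sequential self-distillation: generation k is taught by generation k-1."""
    teacher = None
    for _ in range(generations):
        student = make_model()                      # same architecture every time
        opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)
        for _ in range(epochs):
            for x, y in loader:
                logits = student(x)
                loss = F.cross_entropy(logits, y)   # hard-label loss
                if teacher is not None:             # distilled loss from previous generation
                    with torch.no_grad():
                        t_logits = teacher(x)
                    kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                                  F.softmax(t_logits / T, dim=1),
                                  reduction='batchmean') * (T * T)
                    loss = alpha * loss + (1 - alpha) * kd
                opt.zero_grad(); loss.backward(); opt.step()
        teacher = copy.deepcopy(student).eval()     # becomes the next teacher
    return teacher
```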

5 Teacher-Student Architecture

In knowledge distillation, the teacher-student architecture is the generic carrier of the knowledge transfer. In other words, the quality of the knowledge acquired and distilled from teacher to student is also determined by how the teacher and student networks are designed. In terms of the way human beings learn, we hope that a student can find the right teacher. Thus, to capture and distill knowledge well in knowledge distillation, how to select or design proper structures for the teacher and the student is a very important but difficult problem. Recently, the model setups of the teacher and the student have mostly been fixed, with unvaried sizes and structures, during distillation, which easily causes a model capacity gap. However, how to particularly design the architectures of the teacher and the student, and why their architectures are determined by these model setups, remain largely open questions. In this section, we discuss the relationship between the structures of the teacher model and the student model, as illustrated in Fig. 7.

Fig. 7 The relationship between the teacher and student models: the student is typically a simplified structure, a quantized structure, the same structure, or a small optimized/condensed structure derived from the teacher.

Knowledge distillation was previously designed to compress an ensemble of deep neural networks in (Hinton et al., 2015). The complexity of deep neural networks mainly comes from two dimensions: depth and width. It is usually required to transfer knowledge from deeper and wider neural networks to shallower and thinner neural networks (Romero et al., 2015). The student network is usually chosen to be: 1) a simplified version of the teacher network with fewer layers and fewer channels in each layer (Wang et al., 2018a; Zhu and Gong, 2018); or 2) a quantized version of the teacher network, in which the structure of the network is preserved (Polino et al., 2018; Mishra and Marr, 2017; Wei et al., 2018; Shin et al., 2019); or 3) a small network with efficient basic operations (Howard et al., 2017; Zhang et al., 2018a; Huang et al., 2017); or 4) a small network with an optimized global network structure (Liu et al., 2019h; Xie et al., 2020; Gu and Tresp, 2020); or 5) the same network as the teacher (Zhang et al., 2018b; Furlanello et al., 2018).

The model capacity gap between the large deep neural network and a small student neural network degrades knowledge transfer (Mirzadeh et al., 2019; Gao et al., 2020). To effectively transfer knowledge to student networks, a variety of methods have been proposed for a controlled reduction of the model complexity (Zhang et al., 2018b; Nowak and Corso, 2018; Crowley et al., 2018; Liu et al., 2019a,h; Wang et al., 2018a; Gu and Tresp, 2020). Mirzadeh et al. (2019) introduced a teacher assistant to mitigate the training gap between the teacher model and the student model. The gap is further reduced by residual learning, i.e., an assistant structure is used to learn the residual error (Gao et al., 2020). On the other hand, several recent methods also focus on minimizing the difference in structure between the student model and the teacher model. For example, Polino et al. (2018) combined network quantization with knowledge distillation, i.e., the student model is a small and quantized version of the teacher model. Nowak and Corso (2018) proposed a structure compression method which involves transferring the knowledge learned by multiple layers to a single layer. Wang et al. (2018a) progressively performed block-wise knowledge transfer from teacher networks to student networks while preserving the receptive field. In the online setting, the teacher networks are usually ensembles of student networks, in which the student models share a similar structure (or the same structure) with each other (Zhang et al., 2018b; Zhu and Gong, 2018; Furlanello et al., 2018; Chen et al., 2020a).

Recently, depthwise separable convolution has been widely used to design efficient neural networks for mobile or embedded devices (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018a; Ma et al., 2018). Inspired by the success of neural architecture search (NAS), the performance of small neural networks has been further improved by searching for a global structure based on efficient meta operations or blocks (Wu et al., 2019; Tan et al., 2019; Tan and Le, 2019; Radosavovic et al., 2020). Furthermore, the idea of dynamically searching for a knowledge transfer regime also appears in knowledge distillation, e.g., automatically removing redundant layers in a data-driven way using reinforcement learning (Ashok et al., 2018), and searching for optimal student networks given the teacher networks (Liu et al., 2019h; Xie et al., 2020; Gu and Tresp, 2020). The idea of neural architecture search in knowledge distillation, i.e., a joint search of the student structure and the knowledge transfer under the guidance of the teacher model, will be an interesting subject of future study.
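As a sketch of how a teacher assistant can bridge the capacity gap (in the spirit of Mirzadeh et al. (2019)), distillation is simply applied twice along a chain of decreasing capacity. The helper kd_train below is a hypothetical placeholder for any single-stage distillation routine, e.g., one built on the vanilla loss of Eq. (8).

```python
# Hypothetical sketch: distill teacher -> assistant -> student.
# `kd_train(student, teacher, loader)` is assumed to run standard offline
# distillation of `teacher` into `student` (e.g., with the loss of Eq. (8)).

def distill_with_assistant(teacher, assistant, student, loader, kd_train):
    # Step 1: the medium-sized assistant learns from the large teacher.
    kd_train(assistant, teacher, loader)
    # Step 2: the small student learns from the assistant rather than
    # directly from the much larger teacher, reducing the capacity gap.
    kd_train(student, assistant, loader)
    return student
```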

6 Distillation Algorithms

A simple yet very effective idea for knowledge transfer is to directly match the response-based knowledge, the feature-based knowledge (Romero et al., 2015; Hinton et al., 2015) or the representation distributions in feature space (Passalis and Tefas, 2018) between the teacher model and the student model. Many different algorithms have been proposed to improve the process of transferring knowledge in more complex settings. In this section, we review recently proposed typical types of distillation methods for knowledge transfer within the field of knowledge distillation.

6.1 Adversarial Distillation

In knowledge distillation, it is difficult for the teacher model to perfectly learn the true data distribution. Simultaneously, the student model has only a small capacity and so cannot mimic the teacher model accurately (Mirzadeh et al., 2019). Are there other ways of training the student model in order to mimic the teacher model? Recently, adversarial learning has received a great deal of attention due to its great success in generative networks, i.e., generative adversarial networks or GANs (Goodfellow et al., 2014). Specifically, the discriminator in a GAN estimates the probability that a sample comes from the training data distribution, while the generator tries to fool the discriminator using generated data samples. Inspired by this, many adversarial knowledge distillation methods have been proposed to enable the teacher and student networks to obtain a better understanding of the true data distribution (Wang et al., 2018d; Xu et al., 2018a; Micaelli and Storkey, 2019; Xu et al., 2018b; Liu et al., 2018b; Wang et al., 2018e; Chen et al., 2019a; Shen et al., 2019b; Shu et al., 2019; Liu et al., 2018a; Belagiannis et al., 2018).

Fig. 8 The different categories of the main adversarial distillation methods. (a) A generator in a GAN produces training data to improve KD performance; the teacher may be used as the discriminator. (b) A discriminator in a GAN ensures that the student (also acting as the generator) mimics the teacher. (c) The teacher and student form a generator; online knowledge distillation is enhanced by the discriminator.

As shown in Fig. 8, adversarial learning-based distillation methods, especially those using GANs, can be divided into three main categories as follows. 1) An adversarial generator is trained to generate synthetic data, which is either directly used as the training dataset (Chen et al., 2019a) or used to augment the training dataset (Liu et al., 2018b), as shown in Fig. 8 (a). Furthermore, Micaelli and Storkey (2019) utilized an adversarial generator to generate hard examples for knowledge transfer. 2) A discriminator is introduced to distinguish the samples from the student and the teacher models by using either the logits (Xu et al., 2018a,b) or the features (Wang et al., 2018e), as shown in Fig. 8 (b). Specifically, Belagiannis et al. (2018) used unlabeled data samples to form the knowledge transfer. Multiple discriminators were used by Shen et al. (2019b). Furthermore, an effective intermediate supervision, i.e., squeezed knowledge, was used by Shu et al. (2019) to mitigate the capacity gap between the teacher and the student. 3) Adversarial knowledge distillation is carried out in an online manner, i.e., the teacher and the student are jointly optimized in each iteration (Wang et al., 2018d; Chung et al., 2020), as shown in Fig. 8 (c). Besides, using knowledge distillation to compress GANs, a learned small GAN student network mimics a larger GAN teacher network via knowledge transfer (Aguinaldo et al., 2019).

In summary, three main points can be concluded from the adversarial distillation methods above: a GAN is an effective tool to enhance the power of student learning via teacher knowledge transfer; joint GAN and KD can generate valuable data for improving KD performance and for overcoming the limitations of unusable and inaccessible data; and KD can be used to compress GANs.

6.2 Multi-Teacher Distillation

Different teacher architectures can provide their own useful knowledge for a student network. Multiple teacher networks can be used individually and integrally for distillation during the training of a student network. In a typical teacher-student framework, the teacher usually has a large model or an ensemble of large models. To transfer knowledge from multiple teachers, the simplest way is to use the averaged response from all teachers as the supervision signal (Hinton et al., 2015). Several multi-teacher knowledge distillation methods have recently been proposed (Sau and Balasubramanian, 2016; You et al., 2017; Chen et al., 2019b; Furlanello et al., 2018; Yang et al., 2019a; Zhang et al., 2018b; Lee et al., 2019c; Park and Kwak, 2019; Papernot et al., 2017; Fukuda et al., 2017; Ruder et al., 2017; Wu et al., 2019a; Yang et al., 2020b; Vongkulbhisal et al., 2019). A generic framework for multi-teacher distillation is shown in Fig. 9.

Multiple teacher networks have turned out to be effective for training the student model, usually using logits and feature representations as the knowledge.
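A minimal sketch of the simplest multi-teacher scheme mentioned above, averaging the teachers' softened responses to form the supervision signal; this is an assumed implementation for illustration, and the temperature and weighting are placeholder hyper-parameters.

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_targets(teachers, x, T: float = 4.0) -> torch.Tensor:
    """Average the temperature-softened predictions of several teacher models."""
    with torch.no_grad():
        probs = [F.softmax(t(x) / T, dim=1) for t in teachers]
    return torch.stack(probs, dim=0).mean(dim=0)    # (B, num_classes)

def multi_teacher_kd_loss(student_logits, avg_soft_targets, labels,
                          T: float = 4.0, alpha: float = 0.7):
    """Distill from the averaged teacher response plus the hard-label loss."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  avg_soft_targets, reduction='batchmean') * (T * T)
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, labels)
```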

Fig. 9 The generic framework for multi-teacher distillation.

In addition to the averaged logits from all teachers, You et al. (2017) further incorporated features from the intermediate layers in order to encourage the dissimilarity among different training samples. To utilize both logits and intermediate features, Chen et al. (2019b) used two teacher networks, in which one teacher transfers response-based knowledge to the student and the other teacher transfers feature-based knowledge to the student. Fukuda et al. (2017) randomly selected one teacher from the pool of teacher networks at each iteration. To transfer feature-based knowledge from multiple teachers, additional teacher branches are added to the student networks to mimic the intermediate features of the teachers (Park and Kwak, 2019; Asif et al., 2019). Born-again networks address multiple teachers in a step-by-step manner, i.e., the student at step t is used as the teacher of the student at step t + 1 (Furlanello et al., 2018), and similar ideas can be found in (Yang et al., 2019a). To efficiently perform knowledge transfer and explore the power of multiple teachers, several alternative methods have been proposed to simulate multiple teachers by adding different types of noise to a given teacher (Sau and Balasubramanian, 2016) or by using stochastic blocks and skip connections (Lee et al., 2019c). More interestingly, due to the special characteristics of multi-teacher distillation, its extensions are used for domain adaptation via knowledge adaptation (Ruder et al., 2017), and to protect the privacy and security of data (Vongkulbhisal et al., 2019; Papernot et al., 2017).

6.3 Cross-Modal Distillation

The data or labels for some modalities might not be available during training or testing (Gupta et al., 2016; Garcia et al., 2018; Zhao et al., 2018; Roheda et al., 2018). For this reason, it is important to transfer knowledge between different modalities. Several typical scenarios using cross-modal knowledge transfer are reviewed as follows.

Given a teacher model pretrained on one modality (e.g., RGB images) with a large number of well-annotated data samples, Gupta et al. (2016) transferred the knowledge from the teacher model to a student model with a new unlabeled input modality, such as a depth image or optical flow. Specifically, the proposed method relies on unlabeled paired samples involving both modalities, i.e., both RGB and depth images. The features obtained from the RGB images by the teacher are then used for the supervised training of the student (Gupta et al., 2016). The idea behind the paired samples is to transfer the annotation or label information via pair-wise sample registration, and it has been widely used for cross-modal applications (Albanie et al., 2018; Zhao et al., 2018; Thoker and Gall, 2019). To perform human pose estimation through walls or with occluded images, Zhao et al. (2018) used synchronized radio signals and camera images; knowledge is transferred across modalities for radio-based human pose estimation. Thoker and Gall (2019) obtained paired samples from two modalities, RGB videos and skeleton sequences; the pairs are used to transfer the knowledge learned on RGB videos to a skeleton-based human action recognition model. To improve the action recognition performance using only RGB images, Garcia et al. (2018) performed cross-modality distillation on an additional modality, i.e., depth images, to generate a hallucination stream for the RGB image modality. Tian et al. (2020) introduced a contrastive loss to transfer pair-wise relationships across different modalities. To improve target detection, Roheda et al. (2018) proposed cross-modality distillation between the missing and available modalities using GANs. The generic framework of cross-modal distillation is shown in Fig. 10.

Fig. 10 The generic framework for cross-modal distillation. For simplicity, only two modalities are shown.
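The following sketch illustrates the paired-sample cross-modal setup described above (e.g., an RGB teacher supervising a depth-modality student). It is an assumed, simplified implementation in which both networks output class logits and the data are unlabeled (rgb, depth) pairs; the temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def cross_modal_distill_step(rgb_teacher, depth_student, optimizer,
                             rgb, depth, T: float = 4.0):
    """Supervise the depth-modality student with the RGB teacher's soft predictions
    on paired, unlabeled samples (no ground-truth labels are required)."""
    with torch.no_grad():
        soft_targets = F.softmax(rgb_teacher(rgb) / T, dim=1)
    student_logits = depth_student(depth)
    loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    soft_targets, reduction='batchmean') * (T * T)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```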

Moreover, Do et al. (2019) proposed a knowledge distillation-based visual question answering method, in which knowledge from a trilinear interaction teacher model with image-question-answer inputs is distilled into the learning of a bilinear interaction student model with image-question inputs. The probabilistic knowledge distillation proposed by Passalis and Tefas (2018) is also used for knowledge transfer from the textual modality into the visual modality. Hoffman et al. (2016) proposed a modality hallucination architecture based on cross-modality distillation to improve detection performance. Besides, these cross-modal distillation methods also transfer knowledge among multiple domains (Kundu et al., 2019; Chen et al., 2019c; Su and Maji, 2016).

6.4 Graph-Based Distillation

Most knowledge distillation algorithms focus on transferring knowledge from the teacher to the student in terms of individual training samples, while several recent methods have been proposed to explore the intra-data relationships using graphs (Luo et al., 2018; Chen et al., 2020b; Zhang and Peng, 2018; Lee and Song, 2019; Park et al., 2019; Liu et al., 2019f; Tung and Mori, 2019; Peng et al., 2019a; Minami et al., 2019; Yao et al., 2019; Ma and Mei, 2019). The main ideas of these graph-based distillation methods are 1) to use the graph as the carrier of the teacher knowledge; or 2) to use the graph to control the message passing of the teacher knowledge. A generic framework for graph-based distillation is shown in Fig. 11. As described in Section 3.3, graph-based knowledge falls in line with relation-based knowledge. In this section, we introduce typical definitions of graph-based knowledge and of graph-based message-passing distillation algorithms.

Fig. 11 A generic framework for graph-based distillation.

In (Zhang and Peng, 2018), each vertex represents a self-supervised teacher. Two graphs are then constructed using the logits and the intermediate features, i.e., the logits graph and the representation graph, to transfer knowledge from multiple self-supervised teachers to the student. In (Chen et al., 2020b), the graph is used to maintain the relationships between samples in the high-dimensional space; knowledge transfer is then carried out using a proposed locality-preserving loss function. Lee and Song (2019) analysed intra-data relations using a multi-head graph, in which the vertices are the features from different layers in CNNs. Park et al. (2019) directly transferred the mutual relations of data samples, i.e., they match the edges between a teacher graph and a student graph. Tung and Mori (2019) used a similarity matrix to represent the mutual relations of the activations of input pairs in the teacher and student models; the similarity matrix of the student is then matched to that of the teacher. Furthermore, Peng et al. (2019a) not only matched the response-based and feature-based knowledge, but also used graph-based knowledge. In (Liu et al., 2019f), the instance features and instance relationships are modeled as the vertexes and edges of the graph, respectively.

Rather than using graph-based knowledge, several methods control the knowledge transfer using a graph. Specifically, Luo et al. (2018) considered the modality discrepancy to incorporate privileged information from the source domain. A directed graph, referred to as a distillation graph, is introduced to explore the relationships between different modalities: each vertex represents a modality, and the edges indicate the connection strength between one modality and another. Minami et al. (2019) proposed bidirectional graph-based diverse collaborative learning to explore diverse knowledge transfer patterns. Yao et al. (2019) introduced GNNs to deal with knowledge transfer for graph-based knowledge. Besides, using knowledge distillation, the topological semantics of a graph convolutional teacher network, as topology-aware knowledge, are transferred into a graph convolutional student network (Yang et al., 2020a).

6.5 Attention-Based Distillation

Since attention can well reflect the neuron activations of a convolutional neural network, some attention mechanisms are used in knowledge distillation to improve the performance of the student network (Zagoruyko and Komodakis, 2017; Huang and Wang, 2017; Srinivas and Fleuret, 2018; Crowley et al., 2018; Song et al., 2018). Among these attention-based KD methods (Crowley et al., 2018; Huang and Wang, 2017; Srinivas and Fleuret, 2018; Zagoruyko and Komodakis, 2017), different attention transfer mechanisms are defined for distilling knowledge from the teacher network to the student network. The core of attention transfer is to define attention maps for the feature embeddings in the layers of a neural network. That is to say, knowledge about the feature embedding is transferred using attention map functions.
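A common way to define such an attention map, following the activation-based attention transfer of Zagoruyko and Komodakis (2017) (this is a hedged sketch rather than their exact code), is to aggregate squared activations over the channel dimension, normalize the resulting spatial map, and penalize the difference between the teacher's and student's maps.

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Spatial attention map from a feature map (B, C, H, W): mean of squared
    activations over channels, flattened and L2-normalized per sample."""
    att = feat.pow(2).mean(dim=1)                 # (B, H, W)
    return F.normalize(att.flatten(1), p=2, dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    """Match the student's attention map to the teacher's (same spatial size assumed)."""
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()

# Toy usage with random feature maps from corresponding layers.
loss = attention_transfer_loss(torch.randn(8, 64, 16, 16), torch.randn(8, 256, 16, 16))
```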

Unlike the attention maps above, a different attentive knowledge distillation method was proposed by Song et al. (2018), in which an attention mechanism is used to assign different confidence rules (Song et al., 2018).

6.6 Data-Free Distillation

Some data-free KD methods have been proposed to overcome the problem of unavailable data arising from privacy, legality, security and confidentiality concerns (Chen et al., 2019a; Lopes et al., 2017; Nayak et al., 2019; Micaelli and Storkey, 2019). Just as "data-free" implies, there is no training data; instead, the data is newly or synthetically generated. In (Chen et al., 2019a; Micaelli and Storkey, 2019), the transfer data is generated by a GAN. In the data-free knowledge distillation method proposed in (Lopes et al., 2017), the transfer data used to train the student network is reconstructed by using the layer activations or layer spectral activations of the teacher network. Nayak et al. (2019) proposed a zero-shot knowledge distillation method that does not use existing data; the transfer data is produced by modelling the softmax space using the parameters of the teacher network. In fact, the target data in (Micaelli and Storkey, 2019; Nayak et al., 2019) is generated by using the information from the feature representations of the teacher networks. Similar to zero-shot learning, a knowledge distillation method with few-shot learning was designed by distilling knowledge from a teacher model with Gaussian processes into a student neural network (Kimura et al., 2018); the teacher uses limited labelled data.

6.7 Quantized Distillation

Network quantization reduces the computational complexity of neural networks by converting high-precision networks (e.g., 32-bit floating point) into low-precision networks (e.g., 2-bit and 8-bit). Meanwhile, knowledge distillation aims to train a small model to yield a performance comparable to that of a complex model. Some KD methods have been proposed that use the quantization process in the teacher-student framework (Polino et al., 2018; Mishra and Marr, 2017; Wei et al., 2018; Shin et al., 2019; Kim et al., 2019a). A framework for quantized distillation methods is shown in Fig. 12. Specifically, Polino et al. (2018) proposed a quantized distillation method to transfer the knowledge to a weight-quantized student network. In (Mishra and Marr, 2017), the proposed quantized KD was called the "apprentice": a high-precision teacher network transfers knowledge to a small low-precision student network. To ensure that a small student network accurately mimics a large teacher network, the full-precision teacher network is first quantized on the feature maps, and then the knowledge is transferred from the quantized teacher to a quantized student network (Wei et al., 2018). In (Kim et al., 2019a), quantization-aware knowledge distillation was proposed, based on self-study of a quantized student network and on the co-studying of the teacher and student networks with knowledge transfer. Furthermore, Shin et al. (2019) carried out an empirical analysis of deep neural networks using both distillation and quantization, taking into account the hyper-parameters for knowledge distillation, such as the size of the teacher networks and the distillation temperature.

Fig. 12 A generic framework for quantized distillation: knowledge is transferred from a large full-precision teacher network, via quantization, to a low-precision student network.

6.8 Lifelong Distillation

Lifelong learning, including continual learning, continuous learning and meta-learning, aims to learn in a way similar to humans. It accumulates the previously learned knowledge and also transfers the learned knowledge into future learning (Chen and Liu, 2018). Knowledge distillation provides an effective way to preserve and transfer learned knowledge without catastrophic forgetting. Recently, an increasing number of KD variants based on lifelong learning have been developed (Jang et al., 2019; Flennerhag et al., 2019; Peng et al., 2019b; Liu et al., 2019d; Lee et al., 2019b; Zhai et al., 2019; Zhou et al., 2019c; Shmelkov et al., 2017; Li and Hoiem, 2017). The methods proposed in (Jang et al., 2019; Peng et al., 2019b; Liu et al., 2019d; Flennerhag et al., 2019) adopt meta-learning. Jang et al. (2019) designed meta-transfer networks that can determine what and where to transfer in the teacher-student architecture. Flennerhag et al. (2019) proposed a light-weight framework called Leap for meta-learning over task manifolds by transferring knowledge from one learning process to another. Peng et al.
Knowledge Distillation: A Survey 15

(2019b) designed a new knowledge transfer network recommendation systems. Furthermore, knowledge dis-
architecture for few-shot image recognition. The archi- tillation also can be used for other purposes, such as
tecture simultaneously incorporates visual information the data privacy and as a defense against adversarial
from images and prior knowledge. Liu et al. (2019d) attacks. This section briefly reviews applications of
proposed the semantic-aware knowledge preservation knowledge distillation.
method for image retrieval. The teacher knowledge
obtained from the image modalities and semantic in-
7.1 KD in Visual Recognition
formation are preserved and transferred. Moreover,
to address the problem of catastrophic forgetting in
In last few years, a variety of knowledge distillation
lifelong learning, global distillation (Lee et al., 2019b),
methods have been widely used for model compression
multi-model distillation (Zhou et al., 2019c), knowledge
in different visual recognition applications. Specifically,
distillation-based lifelong GAN (Zhai et al., 2019) and
most of the knowledge distillation methods were previ-
the other KD-based methods (Li and Hoiem, 2017;
ously developed for image classification (Li and Hoiem,
Shmelkov et al., 2017) have been developed to extract
2017; Peng et al., 2019b; Bagherinezhad et al., 2018;
the learned knowledge and teach the student network
Chen et al., 2018a; Wang et al., 2019b; Mukherjee et al.,
on new tasks.
2019; Zhu et al., 2019) and then extended to other vi-
sual recognition applications, including face recognition
6.9 NAS-Based Distillation (Luo et al., 2016; Kong et al., 2019; Yan et al., 2019;
Ge et al., 2018; Wang et al., 2018b, 2019c; Duong et al.,
Neural architecture search (NAS), which is one of 2019; Wu et al., 2020; Wang et al., 2017), action recog-
the most popular auto machine learning (or AutoML) nition (Hao and Zhang, 2019; Thoker and Gall, 2019;
techniques, aims to automatically identify deep neural Luo et al., 2018; Garcia et al., 2018; Wang et al., 2019e;
models and adaptively learn appropriate deep neural Wu et al., 2019b; Zhang et al., 2020), object detection
structures. In knowledge distillation, the success of (Li et al., 2017; Hong and Yu, 2019; Shmelkov et al.,
knowledge transfer depends on not only the knowl- 2017; Wei et al., 2018; Wang et al., 2019d), lane detec-
edge from the teacher but also the architecture of tion (Hou et al., 2019), image or video segmentation
the student. However, there might be a capacity gap (He et al., 2019; Liu et al., 2019g; Mullapudi et al., 2019;
between the large teacher model and the small student Siam et al., 2019; Dou et al., 2020), video classification
model, making it difficult for the student to learn (Bhardwaj et al., 2019; Zhang and Peng, 2018), pedes-
well from the teacher. To address this issue, neural trian detection (Shen et al., 2016), facial landmark de-
architecture search has been adopted to find the appro- tection (Dong and Yang, 2019), person re-identification
priate student architecture in oracle-based (Kang et al., (Wu et al., 2019a) and person search (Munjal et al.,
2020) and architecture-aware knowledge distillation 2019), pose estimation (Nie et al., 2019; Zhang et al.,
(Liu et al., 2019h). Furthermore, knowledge distillation 2019a; Zhao et al., 2018), saliency estimation (Li et al.,
is employed to improve the efficiency of neural archi- 2019), image retrieval (Liu et al., 2019d), depth estima-
tecture search, such as AdaNAS (Macko et al., 2019), tion (Pilzer et al., 2019; Ye et al., 2019), visual odome-
NAS with distilled architecture knowledge (Li et al., try (Saputra et al., 2019) and visual question answering
2020) and teacher guided search for architectures or (Mun et al., 2018; Aditya et al., 2019). Since knowledge
TGSA (Bashivan et al., 2019). In TGSA, each archi- distillation in classification task is fundamental for
tecture search step is guided to mimic the intermediate other tasks, we briefly review knowledge distillation in
feature representations of the teacher network. The pos- challenging image classification settings, such as face
sible structures for the student are efficiently searched recognition and action recognition.
and the feature transfer is effectively supervised by the The existing KD-based face recognition methods
teacher. can not only be easily deployed, but also improve the
classification accuracy (Luo et al., 2016; Kong et al.,
2019; Yan et al., 2019; Ge et al., 2018; Wang et al.,
7 Applications 2018b, 2019c; Duong et al., 2019; Wang et al., 2017).
First of all, these methods focus on the lightweight face
As an effective technique for the compression and recognition with very satisfactory accuracy (Luo et al.,
acceleration of deep neural networks, knowledge distil- 2016; Wang et al., 2018b, 2019c; Duong et al., 2019).
lation has been widely used in different fields of arti- In (Luo et al., 2016), the knowledge from the chosen
ficial intelligence, including visual recognition, speech informative neurons of top hint layer of the teacher
recognition, natural language processing (NLP), and network is transferred into the student network. A
16 Jianping Gou1 et al.

teacher weighting strategy with the loss of feature Chen et al. (2018a) proposed the feature maps-based
representations from hint layers was designed for knowl- knowledge distillation method with GAN. It transfers
edge transfer to avoid the incorrect supervision by the knowledge from feature maps to a student. Using
teacher (Wang et al., 2018b). A recursive knowledge knowledge distillation, a visual interpretation and diag-
distillation method was designed by using a previous nosis framework that unifies the teacher-student models
student network to initialize the next one (Yan et al., for interpretation and a deep generative model for di-
2019). The student model in recursive distillation is agnosis was designed for image classifiers (Wang et al.,
first generated by group and pointwise convolution. Un- 2019b). Similar to knowledge distillation-based low-
like the KD-based face recognition methods in closed- resolution face recognition, Zhu et al. (2019) proposed
set problems, shrinking teacher-student networks for deep feature distillation for the low-resolution image
million-scale lightweight face recognition in open-set classification, in which the output features of a student
problems was proposed by designing different distilla- match that of teacher.
tion losses (Duong et al., 2019; Wu et al., 2020). As argued in Section 6.3, knowledge distillation with
The typical KD-based face recognition is the low- the teacher-student structure can transfer and preserve
resolution face recognition (Ge et al., 2018; Wang et al., the cross-modality knowledge. Efficient and effective
2019c; Kong et al., 2019). To improve low-resolution action recognition under its cross-modal task scenarios
face recognition accuracy, the knowledge distillation can be successfully realized (Thoker and Gall, 2019;
framework is developed by using architectures between Luo et al., 2018; Garcia et al., 2018; Hao and Zhang,
high-resolution face teacher and low-resolution face 2019; Wu et al., 2019b; Zhang et al., 2020). These meth-
student for model acceleration and improved clas- ods are the examples of spatiotemporal modality dis-
sification performance. Ge et al. (2018) proposed a tillation with a different knowledge transfer for ac-
selective knowledge distillation method, in which the tion recognition. Examples include mutual teacher-
teacher network for high-resolution face recognition student networks (Thoker and Gall, 2019), multiple
selectively transfers its informative facial features into stream networks (Garcia et al., 2018), spatiotemporal
the student network for low-resolution face recognition distilled dense-connectivity network (Hao and Zhang,
through sparse graph optimization. In (Kong et al., 2019), graph distillation (Luo et al., 2018) and multi-
2019), cross-resolution face recognition was realized by teacher to multi-student networks (Wu et al., 2019b;
designing a resolution invariant model unifying both Zhang et al., 2020). Among these methods, the light-
face hallucination and heterogeneous recognition sub- weight student can distill and share the knowledge
nets. To get efficient and effective low resolution face information from multiple modalities stored in the
recognition model, the multi kernel maximum mean teacher.
discrepancy between student and teacher networks We summarize two main observations of distillation-
was adopted as the feature loss (Wang et al., 2019c). based visual recognition applications, as follows.
In addition, the KD-based face recognition can be • Knowledge distillation provides efficient and effective
extended to face alignment and verification by changing teacher-student networks for many complex visual
the losses in knowledge distillation (Wang et al., 2017). recognition tasks.
Recently, knowledge distillation has been used suc- • Knowledge distillation can make full use of the
cessfully for solving the complex image classification different types of knowledge in complex visual data,
problems. In addition, there are existing typical meth- such as cross-modality data, multi-domain data and
ods (Li and Hoiem, 2017; Bagherinezhad et al., 2018; multi-task data and low-resolution image data.
Peng et al., 2019b; Chen et al., 2018a; Zhu et al., 2019;
Wang et al., 2019b; Mukherjee et al., 2019). For in-
complete, ambiguous and redundant image labels, the 7.2 KD in NLP
label refinery model through self-distillation and label
progression was proposed to learn soft, informative, Conventional language models such as BERT are very
collective and dynamic labels for complex image classi- time consuming and resource consuming with com-
fication (Bagherinezhad et al., 2018). To address catas- plex cumbersome structures. Knowledge distillation is
trophic forgetting with CNN in a variety of image clas- extensively studied in the field of natural language
sification tasks, a learning without forgetting method processing (NLP), in order to obtain the lightweight,
for CNN, including both knowledge distillation and efficient and effective language models. More and more
lifelong learning was proposed to recognize a new image KD methods are proposed for solving the numer-
task and to preserve the original tasks (Li and Hoiem, ous NLP tasks (Liu et al., 2019b; Gordon and Duh,
2017). For improving image classification accuracy, 2019; Haidar and Rezagholizadeh, 2019; Yang et al.,
Knowledge Distillation: A Survey 17

2020b; Tang et al., 2019; Hu et al., 2018; Sun et al., improve the accuracy of multilingual machine trans-
2019; Nakashole and Flauger, 2017; Jiao et al., 2019; lation. To improve the translation quality, the knowl-
Wang et al., 2018c; Zhou et al., 2019a; Sanh et al., 2019; edge distillation method was proposed by Freitag et al.
Turc et al., 2019; Arora et al., 2019; Clark et al., 2019; (2017). An ensemble teacher model with a data filter-
Kim and Rush, 2016; Mou et al., 2016; Liu et al., 2019e; ing method teaches the student model. The ensemble
Hahn and Choi, 2019; Tan et al., 2019; Kuncoro et al., teacher model is based on a number of NMT models.
2016; Cui et al., 2017; Wei et al., 2019; Freitag et al., In (Wei et al., 2019), to improve the performance of
2017; Shakeri et al., 2019; Aguilar et al., 2020). The machine translation and machine reading tasks, a novel
existing NLP tasks using KD contain neural machine online knowledge distillation method was proposed to
translation (NMT) (Hahn and Choi, 2019; Zhou et al., address the unstableness of the training process and
2019a; Kim and Rush, 2016; Tan et al., 2019; Wei et al., the decreasing performance on each validation set.
2019; Freitag et al., 2017; Gordon and Duh, 2019), ques- In this online KD, the best evaluated model during
tion answering system (Wang et al., 2018c; Arora et al., training is chosen as the teacher model and updated
2019; Yang et al., 2020b; Hu et al., 2018), document re- by any subsequent better model. If the next model had
trieval (Shakeri et al., 2019), event detection (Liu et al., the poor performance, then the current teacher model
2019b), text generation (Haidar and Rezagholizadeh, would guide it.
2019) and so on. Among these KD-based NLP methods, As a multilingual representation model, BERT has
most of them belong to natural language understanding attracted attention in natural language understand-
(NLU), and many of these KD methods for NLU are ing (Devlin et al., 2019), but it is also a cumber-
designed as the task-specific distillation (Tang et al., some deep model that is not easy to be deployed.
2019; Turc et al., 2019; Mou et al., 2016) and multi- To address this problem, several lightweight variations
task distillation (Liu et al., 2019e; Yang et al., 2020b; of BERT (called BERT model compression) using
Sanh et al., 2019; Clark et al., 2019). In what follows, knowledge distillation are proposed (Sun et al., 2019;
we describe KD research works for neural machine Jiao et al., 2019; Tang et al., 2019; Sanh et al., 2019).
translation and then for extending a typical mul- Sun et al. (2019) proposed patient knowledge distil-
tilingual representation model entitled bidirectional lation for BERT model compression (BERT-PKD),
encoder representations from transformers ( or BERT) which is used for sentiment classification, paraphrase
(Devlin et al., 2019) in NLU. similarity matching, natural language inference, and
In natural language processing, neural machine machine reading comprehension. In the patient KD
translation is the hottest application. There are many method, the feature representations of the [CLS] token
extended knowledge distillation methods for neural ma- from the hint layers of teacher are transferred to the
chine translation (Hahn and Choi, 2019; Zhou et al., student. In (Jiao et al., 2019), to accelerate language in-
2019a; Kim and Rush, 2016; Gordon and Duh, 2019; ference of BERT, TinyBERT was proposed by designing
Wei et al., 2019; Freitag et al., 2017; Tan et al., 2019). two-stage transformer knowledge distillation containing
In (Zhou et al., 2019a), an empirical analysis about how the general distillation from general-domain knowledge
knowledge distillation affects the non-autoregressive and the task specific distillation from the task-specific
machine translation (NAT) models was studied. The knowledge in BERT. In (Tang et al., 2019). a task-
experimental fact that the translation quality is largely specific knowledge distillation from the BERT teacher
determined by both the capacity of an NAT model model into a bidirectional long short-term memory
and the complexity of the distilled data via the knowl- network (BiLSTM) was proposed for sentence classifica-
edge transfer was obtained. In the sequence generation tion and matching. In (Sanh et al., 2019), a lightweight
scenario of NMT, the proposed word-level knowledge student model called DistilBERT with the same generic
distillation is extended into sequence-level knowledge structure as BERT was designed and learned on a
distillation for training sequence generation student variety of tasks of NLP for good performance. In
model that mimics the sequence distribution of the (Aguilar et al., 2020), a simplified student BERT was
teacher (Kim and Rush, 2016). The good performance proposed by using the internal representations of a large
of sequence-level knowledge distillation was further teacher BERT via internal distillation.
explained from the perspective of data augmenta- Several KD methods used in NLP with different
tion and regularization in (Gordon and Duh, 2019). In perspectives are represented below. For question an-
(Tan et al., 2019), to overcome the multilingual diver- swering (Hu et al., 2018), to improve the efficiency and
sity, multi-teacher knowledge distillation with multiple robustness of the existing machine reading comprehen-
individual models for handling bilingual pairs and a sion methods, an attention-guided answer distillation
multilingual model for the student was proposed to method was proposed by fusing generic distillation and
18 Jianping Gou1 et al.

answer distillation to avoid confusing answers. For a 2020; Shen et al., 2019a; Oord et al., 2018). In partic-
task-specific distillation (Turc et al., 2019), the perfor- ular, these KD-based speech recognition applications
mance of knowledge distillation with the interactions have spoken language identification (Shen et al., 2018,
among pre-training, distillation and fine-tuning for the 2019a), text-independent speaker recognition (Ng et al.,
compact student model was studied. The proposed pre- 2018), audio classification (Gao et al., 2019; Perez et al.,
trained distillation performs well in sentiment classifi- 2020), speech enhancement (Watanabe et al., 2017),
cation, natural language inference, textual entailment. acoustic event detection (Price et al., 2016; Shi et al.,
For a multi-task distillation in the context of natural 2019a,b), speech synthesis (Oord et al., 2018) and so
language understanding (Clark et al., 2019), the single- on.
multi born-again distillation method based on born- Most existing knowledge distillation methods for
again neural networks (Furlanello et al., 2018) was pro- speech recognition, use teacher-student architectures
posed. Single-task teachers teach a multi-task student. to improve the efficiency and recognition accuracy of
For multilingual representations, knowledge distilla- acoustic models (Chan et al., 2015; Chebotar and Waters,
tion transferred knowledge among the multi-lingual 2016; Lu et al., 2017; Price et al., 2016; Shen et al.,
word embeddings for bilingual dictionary induction 2018; Gao et al., 2019; Shen et al., 2019a; Shi et al.,
(Nakashole and Flauger, 2017). The effectiveness of the 2019c,a; Watanabe et al., 2017; Perez et al., 2020). Us-
knowledge transfer was also studied using ensembles of ing a recurrent neural network (RNN) for holding
multilingual models (Cui et al., 2017). the temporal information from speech sequences, the
Several observations about knowledge distillation knowledge from the teacher RNN acoustic model was
for natural language processing are summarized as transferred into a small student DNN model (Chan et al.,
follows. 2015). Better speech recognition accuracy was obtained
• Knowledge distillation can provide an efficient and by combining multiple acoustic modes. The ensembles
effective lightweight language deep models using of different RNNs with different individual training
teacher-student architecture. Teacher can transfer criteria were designed to train a student model through
the rich knowledge from a large number of language knowledge transfer (Chebotar and Waters, 2016). The
data to the student, and the student can quickly learned student model performed well on 2,000-hour
complete many language tasks. large vocabulary continuous speech recognition (LVCSR)
• The teacher-student knowledge transfer can be used tasks in 5 languages. To strengthen the generalization
in many multilingual tasks. of the spoken language identification (LID) model on
short utterances, the knowledge of feature represen-
tations of the long utterance-based teacher network
7.3 KD in Speech Recognition was transferred into the short utterance-based student
network that can discriminate short utterances and
In the field of speech recognition, deep neural acous- perform well on the short duration utterance-based
tic models have attracted attention and interest due LID tasks (Shen et al., 2018). To further improve the
to their powerful performance. However, more and performance of short utterance-based LID, an inter-
more real-time speech recognition systems are de- active teacher-student online distillation learning was
ployed in embedded platforms with limited compu- proposed to enhance the performance of the feature
tational resources and fast response time. The state- representations of short utterances (Shen et al., 2019a).
of-the-art deep complex models cannot satisfy the Meanwhile, for audio classification, a multi-level
requirement of such speech recognition scenarios. To feature distillation method was developed and an ad-
satisfy these requirements, knowledge distillation is versarial learning strategy was adopted to optimize
widely studied and applied in many speech recog- the knowledge transfer (Gao et al., 2019). To improve
nition tasks. There are many knowledge distillation noise robust speech recognition, knowledge distilla-
systems for designing lightweight deep acoustic models tion was employed as the tool of speech enhance-
for speech recognition (Chebotar and Waters, 2016; ment (Watanabe et al., 2017). In (Perez et al., 2020), a
Wong and Gales, 2016; Chan et al., 2015; Price et al., audio-visual multi-modal knowledge distillation method
2016; Fukuda et al., 2017; Bai et al., 2019b; Ng et al., was proposed. knowledge is transferred from the teacher
2018; Albanie et al., 2018; Lu et al., 2017; Shi et al., models on visual and acoustic data into a student model
2019a; Roheda et al., 2018; Shi et al., 2019b; Gao et al., on audio data. In essence, this distillation shares the
2019; Ghorbani et al., 2018; Takashima et al., 2018; cross-modal knowledge among the teachers and stu-
Watanabe et al., 2017; Shi et al., 2019c; Asami et al., dents (Perez et al., 2020; Albanie et al., 2018; Roheda et al.,
2017; Huang et al., 2018; Shen et al., 2018; Perez et al., 2018). For efficient acoustic event detection, a quantized
Knowledge Distillation: A Survey 19

distillation method was proposed by using both knowl- • The lightweight student model can satisfy the practi-
edge distillation and quantization (Shi et al., 2019a). cal requirements of speech recognition, such as real-
The quantized distillation transfers knowledge from time responses, use of limited resources and high
a large CNN teacher model with better detection recognition accuracy.
accuracy into a quantized RNN student model. • Many teacher-student architectures are built on RNN
Unlike most existing traditional frame-level KD models because of the temporal property of speech
methods, sequence-level KD can perform better in some sequences.
sequence models for speech recognition, such as connec- • Sequence-level knowledge distillation can be applied
tionist temporal classification (CTC) (Wong and Gales, to sequence models with good results.
2016; Takashima et al., 2018; Huang et al., 2018). In • Knowledge distillation using teacher-student knowl-
(Huang et al., 2018), sequence-level KD was introduced edge transfer can solve the cross-domain or cross-
into connectionist temporal classification, in order to modal speech recognition in applications such as
match an output label sequence used in the training multi-accent recognition.
of teacher model and the input speech frames used
in distillation. In (Wong and Gales, 2016), the effect
7.4 KD in Other Applications
of speech recognition performance on frame-level and
sequence-level student-teacher training was studied and The full and correct leverages of external knowledge,
a new sequence-level student-teacher training method such as in a user review or in images, play a very
was proposed. The teacher ensemble was constructed important role in the effectiveness of deep recom-
using sequence-level combination instead of frame- mendation models. Reducing the complexity and im-
level combination. To improve the performance of proving the efficiency of deep recommendation models
unidirectional RNN-based CTC for real-time speech is also very necessary. Recently, knowledge distilla-
recognition, the knowledge of a bidirectional LSTM- tion has been successfully applied in recommender
based CTC teacher model was transferred into a unidi- systems for deep model compression and acceleration
rectional LSTM-based CTC student model via frame- (Chen et al., 2018b; Tang and Wang, 2018; Pan et al.,
level KD and sequence-level KD (Takashima et al., 2019). In (Tang and Wang, 2018), knowledge distilla-
2018). tion was first introduced into the recommender systems
Moreover, knowledge distillation can be used to and called ranking distillation because the recom-
solve some special issues in speech recognition (Bai et al., mendation was expressed as a ranking problem. In
2019b; Asami et al., 2017; Ghorbani et al., 2018). To (Chen et al., 2018b), an adversarial knowledge distil-
overcome overfitting issue of DNN acoustic models lation method was designed for efficient recommenda-
when data are scarce, knowledge distillation was em- tion. A teacher supervised the student as user-item
ployed as a regularization way to train adapted model prediction network (generator). The student learn-
with the supervision of the source model (Asami et al., ing was adjusted by adversarial adaption between
2017). The final adapted model achieved better per- teacher and student networks. Unlike distillation in
formance on three real acoustic domains including (Chen et al., 2018b; Tang and Wang, 2018), Pan et al.
two dialects and children’s speech. To overcome the (2019) designed a enhanced collaborative denoising
degradation of the performance of non-native speech autoencoder (ECAE) model for recommender systems
recognition, an advanced multi-accent student model via knowledge distillation to capture useful knowledge
was trained by distilling the knowledge from the multi- from user feedbacks and to reduce noise. The unified
ple accent-specific RNN-CTC models (Ghorbani et al., ECAE framework contained a generation network, a
2018). In essence, knowledge distillation in (Asami et al., retraining network and a distillation layer that trans-
2017; Ghorbani et al., 2018) realized the cross-domain ferred knowledge and reduced noise from the generation
knowledge transfer. To solve the complexity of fus- network.
ing the external language model (LM) into sequence- Using the natural characteristics of knowledge dis-
to-sequence model (Seq2seq) for speech recognition, tillation with teacher-student architectures, knowledge
knowledge distillation was employed as an effective tool distillation is used as an effective strategy to solve
to integrate a LM (teacher) into Seq2seq model (stu- the adversarial attacks or perturbations of deep mod-
dent) (Bai et al., 2019b). The trained Seq2seq model els (Papernot et al., 2016; Ross and Doshi-Velez, 2018;
can reduced character error rates in sequence-to-sequence Goldblum et al., 2019; Gil et al., 2019) and the issue
speech recognition. of the unavailable data due to the privacy, confi-
Several observations on knowledge distillation-based dentiality and security concerns (Lopes et al., 2017;
speech recognition are summarized as follows. Papernot et al., 2017; Bai et al., 2019a; Wang et al.,
20 Jianping Gou1 et al.

2019a; Vongkulbhisal et al., 2019). As stated in the ref- Most KD methods leverage a combination of dif-
erences (Ross and Doshi-Velez, 2018; Papernot et al., ferent kinds of knowledge, including response-based,
2016), the perturbations of the adversarial samples feature-based, and relation-based knowledge. There-
can be overcome by the robust outputs of the teacher fore, it is important to know the influence of each
networks and distillation. To avoid exposing the private individual type of knowledge and to know how dif-
data, multiple teachers access subsets of the sensi- ferent kinds of knowledge help each other in a com-
tive or unlabelled data and supervise the student plementary manner. For example, the response-based
(Papernot et al., 2017; Vongkulbhisal et al., 2019). To knowledge has a similar motivation to label smoothing
address the issue of privacy and security, the data to and the model regularization (Kim and Kim, 2017;
train the student network is generated by using the Muller et al., 2019; Ding et al., 2019); The featured-
layer activations or layer spectral activations of the based knowledge is often used to mimic the interme-
teacher network via data-free distillation (Lopes et al., diate process of the teacher and the relation-based
2017). To protect data privacy and prevent intellectual knowledge is used to capture the relationships across
piracy, Wang et al. (2019a) proposed a private model different samples. To this end, it is still challenge
compression framework via knowledge distillation. The to model different types of knowledge in a unified
student model was applied to public data while the framework.
teacher model was applied to both sensitive and public How to transfer the rich knowledge from the teacher
data. This private knowledge distillation adopted pri- to a student is a key step in knowledge distillation. The
vacy loss and batch loss to further improve privacy. To knowledge transfer is usually realized by the distillation
consider the compromise between privacy and perfor- that determines the ability of the student learning and
mance, a few shot network compression method via a the teacher teaching. Generally, the existing distillation
novel layer-wise knowledge distillation with few samples methods can be categorized into offline distillation,
per class was developed and performed well (Bai et al., online distillation and self distillation. To improve
2019a). Of course, there are other special interesting the efficacy of the knowledge transfer, developing new
applications of knowledge distillation, such as neural ar- distillation and reasonably coordinating different distil-
chitecture search (Macko et al., 2019; Bashivan et al., lations could be promising strategies.
2019) and interpretability of deep neural networks Currently, most KD methods focus on new knowl-
(Liu et al., 2018c). edge or new distillation loss functions, leaving the
design of the teacher-student architectures poorly inves-
tigated (Nowak and Corso, 2018; Crowley et al., 2018;
8 Conclusion and Discussion Kang et al., 2020; Liu et al., 2019h; Ashok et al., 2018;
Liu et al., 2019a). In fact, apart from the knowledge
Knowledge distillation and its applications have aroused and distillation, the relationship between the struc-
considerable attention in recent few years. In this tures of the teacher and the student also significantly
paper, we present a comprehensive review on knowledge influences the performance of knowledge distillation.
distillation, from the perspectives of knowledge, distilla- For example, as described in (Zhang et al., 2019b),
tion schemes, teacher-student architectures, distillation the student model may learn very little from some
algorithms, applications, as well as challenges and teacher models. This is caused by the model capac-
future directions. Below, we discuss the challenges of ity gap between the teacher model and the student
knowledge distillation and provide some insights on the model (Kang et al., 2020). As a result, the design
future research of knowledge distillation. of an effective student model or construction of a
proper teacher model are still challenging problems in
knowledge distillation.
8.1 Challenges Despite a huge number of the knowledge distil-
lation methods and applications, the understanding
For knowledge distillation, the key is to 1) extract of knowledge distillation including theoretical expla-
rich knowledge from the teacher and 2) to transfer nations and empirical evaluations remains insufficient
the knowledge from the teacher to guide the training (Lopez-Paz et al., 2015; Phuong and Lampert, 2019;
of the student. Therefore, we discuss the challenges Cho and Hariharan, 2019). For example, distillation
in knowledge distillation from the followings aspects: can be viewed as a form of learning with privileged
the quality of knowledge, the types of distillation, the information (Lopez-Paz et al., 2015). The assumption
design of the teacher-student architectures, and the of linear teacher and student models enables the study
theory behind the knowledge distillation. of the theoretical explanations of characteristics of the
Knowledge Distillation: A Survey 21

student learning via distillation (Phuong and Lampert, from unlabelled data by a large model can also super-
2019). Furthermore, some empirical evaluations and vise the target model via distillation (Noroozi et al.,
analysis on the efficacy of knowledge distillation were 2018). To this end, the extensions of knowledge dis-
performed by Cho and Hariharan (2019). However, a tillation for other purposes and applications might be
deep understanding of generalizability of knowledge a meaningful future direction.
distillation, especially how to measure the quality The learning of knowledge distillation is similar
of knowledge or the quality of the teacher-student to the human beings learning. It can be practicable
architecture, is still very difficult to attain. to popularize the knowledge transfer to the classic
and traditional machine learning methods (Zhou et al.,
2019b; Gong et al., 2018; You et al., 2018; Gong et al.,
8.2 Future Directions
2017). For example, traditional two-stage classification
In order to improve the performance of knowledge and was felicitous cast to a single teacher single student
distillation, the most important factors include what problem based on the idea of knowledge distillation
kind of teacher-student network architecture, what kind (Zhou et al., 2019b). Furthermore, knowledge distil-
of knowledge is learned from the teacher network, and lation can be flexibly deployed to various learning
where is distilled into the student network. schemes, such as the adversarial learning (Liu et al.,
The model compression and acceleration methods 2018b), auto machine learning (Macko et al., 2019),
for deep neural networks usually fall into four different lifelong learning (Zhai et al., 2019), and reinforcement
categories, namely parameter pruning and sharing, learning (Ashok et al., 2018). Therefore, it will be in-
low-rank factorization, transferred compact convolu- teresting and useful to integrate knowledge distillation
tional filters and knowledge distillation (Cheng et al., with other learning schemes for practical challenges in
2018). In existing knowledge distillation methods, there the future.
are only a few related works discussing the combi-
nation of knowledge distillation and other kinds of
compressing methods. For example, quantized knowl- References
edge distillation, which can be seen as a parameter
pruning method, integrates network quantization into Aditya, S., Saha, R., Yang, Y. & Baral, C. (2019).
the teacher-student architectures (Polino et al., 2018; Spatial knowledge distillation to aid visual reasoning.
Mishra and Marr, 2017; Wei et al., 2018). Therefore, In: WACV.
to learn efficient and effective lightweight deep models Aguilar, G., Ling, Y., Zhang, Y., Yao, B., Fan, X. &
for the deployment on portable platforms, the hybrid Guo, E. (2020). Knowledge distillation from internal
compression methods via both knowledge distillation representations. In: AAAI.
and other compressing techniques will be an interesting Aguinaldo, A., Chiang, P. Y., Gain, A., Patil,
topic for future study. A., Pearson, K. & Feizi, S. (2019). Compressing
Apart from model compression for acceleration for gans using knowledge distillation. arXiv preprint
deep neural networks, knowledge distillation also can arXiv:1902.00159.
be used in other problems because of the natural Ahn, S., Hu, S., Damianou, A., Lawrence, N. D. &
characteristics of the knowledge transfer on the teacher- Dai, Z. (2019). Variational information distillation for
student architecture. Recently, knowledge distillation knowledge transfer. In: CVPR.
has been applied to the data privacy and security Albanie, S., Nagrani, A., Vedaldi, A. & Zisserman, A.
(Wang et al., 2019a), adversarial attacks of deep mod- (2018). Emotion recognition in speech using cross-
els (Papernot et al., 2016), cross-modalities (Gupta et al., modal transfer in the wild. In: ACM MM.
2016), multiple domains (Asami et al., 2017), catas- Allen-Zhu, Z., Li, Y., & Liang, Y. (2019). Learning and
trophic forgetting (Lee et al., 2019b), accelerating learn- generalization in overparameterized neural networks,
ing of deep models (Chen et al., 2015), efficiency of going beyond two layers. In: NeurIPS.
neural architecture search (Bashivan et al., 2019), self- Anil, R., Pereyra, G., Passos, A., Ormandi, R., Dahl, G.
supervision (Noroozi et al., 2018), and data augmen- E.. & Hinton, G. E. (2018). Large scale distributed
tation (Lee et al., 2019a; Gordon and Duh, 2019). An- neural network training through online distillation.
other interesting example is that the knowledge transfer arXiv preprint arXiv:1804.03235.
from the small teacher networks to a large student net- Arora, S., Cohen, N., & Hazan, E. (2018). On
work can accelerate the student learning (Chen et al., the optimization of deep networks: Implicit ac-
2015). This is very quite different from vanilla knowl- celeration by overparameterization. arXiv preprint
edge distillation. The feature representations learned arXiv:1802.06509.
22 Jianping Gou1 et al.

Arora, S., Khapra, M. M. & Ramaswamy, H. G. (2019). with knowledge distillation. In: NeurIPS.
On knowledge distillation from complex networks for Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B. &
response prediction. In: NAACL-HLT. et al. (2019a). Data-free learning of student networks.
Asami, T., Masumura, R., Yamaguchi, Y., Masataki, In: ICCV.
H. & Aono, Y. (2017). Domain adaptation of dnn Chen, H., Wang, Y., Xu, C., Xu, C. & Tao, D. (2020b).
acoustic models using knowledge distillation. In: Learning student networks via feature embedding.
ICASSP. TNNLS. DOI: 10.1109/TNNLS.2020.2970494.
Ashok, A., Rhinehart, N., Beainy, F. & Kitani, K. M. Chen, T., Goodfellow, I. & Shlens, J. (2015) Net2net:
(2018). N2N learning: Network to network compres- Accelerating learning via knowledge transfer. arXiv
sion via policy gradient reinforcement learning. In: preprint arXiv:1511.05641.
ICLR. Chen, W. C., Chang, C. C. & Lee, C. R. (2018a).
Asif, U., Tang, J. & Harrer, S. (2019). Ensemble Knowledge distillation with feature maps for image
knowledge distillation for learning improved and classification. In: ACCV.
efficient networks. arXiv preprint arXiv:1909.08097. Chen, X., Zhang, Y., Xu, H., Qin, Z. & Zha, H. (2018b).
Ba, J. & Caruana, R. (2014). Do deep nets really need Adversarial distillation for efficient recommendation
to be deep? In: NeurIPS. with external knowledge. ACM TOIS 37(1):1–28.
Bagherinezhad, H., Horton, M., Rastegari, M. & Chen, X., Su, J. & Zhang, J. (2019b). A two-teacher
Farhadi, A. (2018). Label refinery: Improving tramework for knowledge distillation. In: ISNN.
imagenet classification through label progression. Chen, Y., Wang, N. & Zhang, Z. (2018c). Darkrank:
arXiv preprint arXiv:1805.02641. Accelerating deep metric learning via cross sample
Bai, H., Wu, J., King, I. & Lyu, M. (2019a). Few similarities transfer. In: AAAI.
shot network compression via cross distillation. arXiv Chen, Y. C., Lin, Y. Y., Yang, M. H., Huang, J. B.
preprint arXiv:1911.09450. (2019c). Crdoco: Pixel-level domain transfer with
Bai, Y., Yi, J., Tao, J., Tian, Z. & Wen, Z. cross-domain consistency. In: CVPR.
(2019b). Learn spelling from teachers: transferring Chen, Z. & Liu, B. (2018). Lifelong machine learning.
knowledge from language models to sequence-to- Synthesis Lectures on Artificial Intelligence and
sequence speech recognition. In: Interspeech. Machine Learning 12(3):1–207.
Bashivan, P., Tensen, M. & DiCarlo, J. J. (2019). Cheng, Y., Wang, D., Zhou, P. & Zhang, T. (2018).
Teacher guided architecture search. In: ICCV. Model compression and acceleration for deep neural
Belagiannis, V., Farshad, A. & Galasso, F. (2018). networks: The principles, progress, and challenges.
Adversarial network compression. In:ECCV. IEEE Signal Proc Mag 35(1):126–136.
Bengio, Y., Courville, A., & Vincent, P. (2013). Rep- Cheng, X., Rao, Z., Chen, Y., & Zhang, Q. (2020).
resentation learning: A review and new perspectives. Explaining Knowledge Distillation by Quantifying
IEEE TPAMI 35(8):1798–1828. the Knowledge. In: CVPR.
Bhardwaj, S., Srinivasan, M. & Khapra, M. M. (2019). Cho, J. H. & Hariharan, B. (2019). On the efficacy of
Efficient video classification using fewer frames. In: knowledge distillation. In: ICCV.
CVPR. Chollet, F. (2017). Xception: Deep learning with
Brutzkus, A., & Globerson, A. (2019). Why do Larger depthwise separable convolutions. In: CVPR.
Models Generalize Better? A Theoretical Perspective Chung, I., Park, S., Kim, J. & Kwak, N. (2020).
via the XOR Problem. In: ICML. Feature-map-level online adversarial knowledge dis-
Bucilua, C., Caruana, R. & Niculescu-Mizil, A. (2006). tillation. In: ICRL.
Model compression. In: SIGKDD. Clark, K., Luong, M. T., Khandelwal, U., Manning,
Chan, W., Ke, N. R. & Lane, I. (2015). Transferring C. D. & Le, Q. V. (2019). Bam! born-again multi-
knowledge from a rnn to a DNN. arXiv preprint task networks for natural language understanding.
arXiv:1504.01483. In: ACL.
Chebotar, Y. & Waters, A. (2016). Distilling knowledge Courbariaux, M., Bengio, Y. & David, J. P. (2015).
from ensembles of neural networks for speech Binaryconnect: Training deep neural networks with
recognition. In: Interspeech. binary weights during propagations. In: NeurIPS.
Chen, D., Mei, J. P., Wang, C., Feng, Y. & Chen, C. Crowley, E. J., Gray, G. & Storkey, A. J. (2018).
(2020a) Online knowledge distillation with diverse Moonshine: Distilling with cheap convolutions. In:
peers. In: AAAI. NeurIPS.
Chen, G., Choi, W., Yu, X., Han, T., & Chandraker, Cui, J., Kingsbury, B., Ramabhadran, B., Saon, G.,
M. (2017). Learning efficient object detection models Sercu, T., Audhkhasi, K. & et al. (2017). Knowledge
Knowledge Distillation: A Survey 23

distillation across ensembles of multilingual models distillation. IEEE TIP 28(4):2051–2062.


for low-resource languages. In: ICASSP. Ghorbani, S., Bulut, A. E. & Hansen, J. H.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei- (2018). Advancing multi-accented lstm-ctc speech
Fei, L. (2009). Imagenet: A large-scale hierarchical recognition using a domain specific student-teacher
image database. In: CVPR. learning paradigm. In: SLTW.
Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y. & Gil, Y., Chai, Y., Gorodissky, O. & Berant, J. (2019).
Fergus, R. (2014). Exploiting linear structure within White-to-black: Efficient distillation of black-box
convolutional networks for efficient evaluation. In: adversarial attacks. In: NAACL-HLT.
NeurIPS. Goldblum, M., Fowl, L., Feizi, S. & Goldstein,
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. T. (2019). Adversarially robust distillation. arXiv
(2019). Bert: Pre-training of deep bidirectional trans- preprint arXiv:1905.09747.
formers for language understanding. In: NAACL- Gong, C., Chang, X., Fang, M. & Yang, J. (2018).
HLT . Teaching semi-supervised classifier via generalized
Ding, Q., Wu, S., Sun, H., Guo, J. & Xia, ST. (2019). distillation. In: IJCAI.
Adaptive regularization of labels. arXiv preprint Gong, C., Tao, D., Liu, W., Liu, L., & Yang, J. (2017).
arXiv:1908.05474. Label propagation via teaching-to-learn and learning-
Do, T., Do, T. T., Tran, H., Tjiputra, E. & Tran, Q. to-teach. TNNLS 28(6):1452–1465.
D. (2019). Compact trilinear interaction for visual Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
question answering. In: ICCV. Warde-Farley, D., Ozair, S., Courville, A., & Bengio,
Dong, X. & Yang, Y. (2019). Teacher supervises Y. (2014). Generative adversarial nets. In: NIPS.
students how to learn from partially labeled images Gordon, M. A. & Duh, K. (2019). Explaining sequence-
for facial landmark detection. In: ICCV. level knowledge distillation as data-augmentation
Dou, Q., Liu, Q., Heng, P. A., & Glocker, B. (2020). for neural machine translation. arXiv preprint
Unpaired Multi-modal Segmentation via Knowledge arXiv:1912.03334.
Distillation. To appear in IEEE TMI. Gu, J., & Tresp, V. (2020). Search for Better Students
Duong, C. N., Luu, K., Quach, K. G. & Le, N. to Learn Distilled Knowledge. In: ECAI.
(2019.) ShrinkTeaNet: Million-scale lightweight face Gupta, S., Hoffman, J. & Malik, J. (2016). Cross modal
recognition via shrinking teacher-student networks. distillation for supervision transfer. In: CVPR.
arXiv preprint arXiv:1905.10620. Hahn, S. & Choi, H. (2019). Self-knowledge distillation
Flennerhag, S., Moreno, P. G., Lawrence, N. D. & in natural language processing. In: RANLP.
Damianou, A. (2019). Transferring knowledge across Haidar, M. A. & Rezagholizadeh, M. (2019). Textkd-
learning processes. In: ICLR. gan: Text generation using knowledge distillation
Freitag, M., Al-Onaizan, Y. & Sankaran, B. (2017). and generative adversarial networks. In: Canadian
Ensemble distillation for neural machine translation. Conference on Artificial Intelligence.
arXiv preprint arXiv:1702.01802. Han, S., Pool, J., Tran, J. & Dally, W. (2015). Learning
Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, both weights and connections for efficient neural
J. & Ramabhadran, B. (2017). Efficient knowledge network. In: NeurIPS.
distillation from an ensemble of teachers. In: Hao, W. & Zhang, Z. (2019). Spatiotemporal distilled
Interspeech. dense-connectivity network for video action recogni-
Furlanello, T., Lipton, Z., Tschannen, M., Itti, L. & tion. Pattern Recogn 92:13–24.
Anandkumar, A. (2018). Born again neural networks. He, F., Liu, T., & Tao, D. (2020). Why resnet
In: ICML. works? residuals generalize. TNNLS. DOI:
Gao, L., Mi, H., Zhu, B., Feng, D., Li, Y. & Peng, Y. 10.1109/TNNLS.2020.2966319.
(2019). An adversarial feature distillation method for He, K., Zhang, X., Ren, S. & Sun, J. (2016) Deep
audio classification. IEEE Access 7:105319–105330. residual learning for image recognition. In: CVPR.
Gao, M., Shen, Y., Li, Q., & Loy, C. C. (2020). He, T., Shen, C., Tian, Z., Gong, D., Sun, C. & Yan, Y.
Residual Knowledge Distillation. arXiv preprint (2019). Knowledge adaptation for efficient semantic
arXiv:2002.09168. segmentation. In: CVPR.
Garcia, N. C., Morerio, P. & Murino, V. (2018). Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., & Choi,
Modality distillation with multiple stream networks J. Y. (2019a). A comprehensive overhaul of feature
for action recognition. In: ECCV. distillation. In: ICCV.
Ge, S., Zhao, S., Li, C. & Li, J. (2018). Low-resolution Heo, B., Lee, M., Yun, S. & Choi, J. Y. (2019b).
face recognition in the wild via selective knowledge Knowledge distillation with adversarial samples
24 Jianping Gou1 et al.

supporting decision boundary. In: AAAI. In: CVPR.


Heo, B., Lee, M., Yun, S. & Choi, J. Y. (2019c). Kim, S. W. & Kim, H. E. (2017). Transferring
Knowledge transfer via distillation of activation knowledge to smaller network with class-distance
boundaries formed by hidden neurons. In: AAAI. loss. In: IClRW.
Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling Kim, Y., Rush & A. M. (2016). Sequence-level
the knowledge in a neural network. arXiv preprint knowledge distillation. In: EMNLP.
arXiv:1503.02531. Kimura, A., Ghahramani, Z., Takeuchi, K., Iwata,
Hoffman, J., Gupta, S. & Darrell, T. (2016). Learning T. & Ueda, N. (2018). Few-shot learning of
with side information through modality hallucina- neural networks from scratch by pseudo example
tion. In: CVPR. optimization. In: BMVC.
Hong, W. & Yu, J. (2019). Gan-knowledge distilla- Kong, H., Zhao, J., Tu, X., Xing, J., Shen, S. &
tion for one-stage object detection. arXiv preprint Feng, J. (2019). Cross-resolution face recognition via
arXiv:1906.08467. prior-aided face hallucination and residual knowledge
Hou, Y., Ma, Z., Liu, C. & Loy, CC. (2019). Learning distillation. arXiv preprint arXiv:1905.10777.
lightweight lane detection cnns by self attention Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012).
distillation. In: ICCV. Imagenet classification with deep convolutional
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, neural networks. In: NeurIPS.
D., Wang, W., Weyand, T., Andreetto, M., & Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C. &
Adam, H. (2017). Mobilenets: Efficient convolutional Smith, N. A. (2016). Distilling an ensemble of greedy
neural networks for mobile vision applications. arXiv dependency parsers into one mst parser. In: EMNLP.
preprint arXiv:1704.04861. Kundu, J. N., Lakkakula, N. & Babu, R. V. (2019). Um-
Hu, M., Peng, Y., Wei, F., Huang, Z., Li, D., Yang, N. adapt: Unsupervised multi-task adaptation using
& et al. (2018). Attention-guided answer distillation adversarial cross-task distillation. In: CVPR.
for machine reading comprehension. In: EMNLP. Lee, H., Hwang, S. J. & Shin, J. (2019a). Rethink-
Huang, G., Liu, Z., Van, Der Maaten, L. & Weinberger, ing data augmentation: Self-supervision and self-
K. Q. (2017). Densely connected convolutional distillation. arXiv preprint arXiv:1910.05872.
networks. In: CVPR. Lee, K., Lee, K., Shin, J. & Lee, H. (2019b).
Huang, M., You, Y., Chen, Z., Qian, Y. & Yu, K. Overcoming catastrophic forgetting with unlabeled
(2018). Knowledge distillation for sequence model. data in the wild. In: ICCV.
In: Interspeech. Lee, K., Nguyen, L. T. & Shim, B. (2019c). Stochastic-
Huang, Z. & Wang, N. (2017). Like what you like: ity and skip connections improve knowledge transfer.
Knowledge distill via neuron selectivity transfer. In: AAAI.
arXiv preprint arXiv:1707.01219. Lee, S. & Song, B. (2019). Graph-based knowledge
Ioffe, S., & Szegedy, C. (2015). Batch normalization: distillation by multi-head attention network. In:
Accelerating deep network training by reducing BMVC.
internal covariate shift. In: ICML Lee, S. H., Kim, D. H. & Song, B. C. (2018). Self-
Jang, Y., Lee, H., Hwang, S. J. & Shin, J. (2019). supervised knowledge distillation using singular value
Learning what and where to transfer. In: ICML. decomposition. In: ECCV.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, C., Peng, J., Yuan, L., Wang, G., Liang, X., Lin, L.,
Li, L. & et al. (2019). Tinybert: Distilling bert & Chang, X. (2020). Blockwisely Supervised Neural
for natural language understanding. arXiv preprint Architecture Search with Knowledge Distillation. In:
arXiv:1909.10351. CVPR.
Kang, M., Mun, J. & Han, B. (2020). Towards Li, J., Fu, K., Zhao, S. & Ge, S. (2019). Spatiotemporal
oracle knowledge distillation with neural architecture knowledge distillation for efficient estimation of aerial
search. In: AAAI. video saliency. IEEE TIP 29:1902–1914.
Kim, J., Park, S. & Kwak, N. (2018). Paraphrasing Li, Q., Jin, S. & Yan, J. (2017). Mimicking very efficient
complex network: Network compression via factor network for object detection. In: CVPR.
transfer. In: NeurIPS. Li, T., Li, J., Liu, Z. & Zhang, C. (2018). Knowl-
Kim, J., Bhalgat, Y., Lee, J., Patel, C., & Kwak, edge distillation from few samples. arXiv preprint
N. (2019a). QKD: Quantization-aware Knowledge arXiv:1812.01839.
Distillation. arXiv preprint arXiv:1911.12491. Li, Z. & Hoiem, D. (2017). Learning without forgetting.
Kim, J., Hyun, M., Chung, I. & Kwak, N. (2019b). Fea- IEEE TPAMI 40(12):2935–2947.
ture fusion for online mutual knowledge distillation.
Knowledge Distillation: A Survey 25

Liu, I. J., Peng, J. & Schwing, A. G. (2019a). Knowledge flow: Improve upon your teachers. In: ICLR.
Liu, J., Chen, Y. & Liu, K. (2019b). Exploiting the ground-truth: An adversarial imitation based knowledge distillation approach for event detection. In: AAAI.
Liu, J., Wen, D., Gao, H., Tao, W., Chen, T. W., Osa, K. & et al. (2019c). Knowledge representing: efficient, sparse representation of prior knowledge for knowledge distillation. In: CVPRW.
Liu, P., Liu, W., Ma, H., Mei, T. & Seok, M. (2018a). Ktan: knowledge transfer adversarial network. arXiv preprint arXiv:1810.08126.
Liu, Q., Xie, L., Wang, H. & Yuille, A. L. (2019d). Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In: ICCV.
Liu, R., Fusi, N. & Mackey, L. (2018b). Model compression with generative adversarial networks. arXiv preprint arXiv:1812.02271.
Liu, X., Wang, X. & Matwin, S. (2018c). Improving the interpretability of deep neural networks with knowledge distillation. In: ICDMW.
Liu, X., He, P., Chen, W. & Gao, J. (2019e). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
Liu, Y., Cao, J., Li, B., Yuan, C., Hu, W., Li, Y. & Duan, Y. (2019f). Knowledge distillation via instance relationship graph. In: CVPR.
Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z. & Wang, J. (2019g). Structured knowledge distillation for semantic segmentation. In: CVPR.
Liu, Y., Jia, X., Tan, M., Vemulapalli, R., Zhu, Y., Green, B. & et al. (2019h). Search to distill: Pearls are everywhere but not the eyes. In: CVPR.
Lopes, R. G., Fenu, S. & Starner, T. (2017). Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535.
Lopez-Paz, D., Bottou, L., Schölkopf, B. & Vapnik, V. (2015). Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643.
Lu, L., Guo, M. & Renals, S. (2017). Knowledge distillation for small-footprint highway networks. In: ICASSP.
Luo, P., Zhu, Z., Liu, Z., Wang, X. & Tang, X. (2016). Face model compression by distilling knowledge from neurons. In: AAAI.
Luo, Z., Hsieh, J. T., Jiang, L., Carlos Niebles, J. & Fei-Fei, L. (2018). Graph distillation for action detection with privileged modalities. In: ECCV.
Macko, V., Weill, C., Mazzawi, H. & Gonzalvo, J. (2019). Improving neural architecture search image classifiers via ensemble learning. In: NeurIPS Workshop.
Ma, J., & Mei, Q. (2019). Graph representation learning via multi-task knowledge distillation. arXiv preprint arXiv:1911.05700.
Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: ECCV.
Meng, Z., Li, J., Zhao, Y. & Gong, Y. (2019). Conditional teacher-student learning. In: ICASSP.
Micaelli, P. & Storkey, A. J. (2019). Zero-shot knowledge transfer via adversarial belief matching. In: NeurIPS.
Minami, S., Hirakawa, T., Yamashita, T. & Fujiyoshi, H. (2019). Knowledge transfer graph for deep collaborative learning. arXiv preprint arXiv:1909.04286.
Mirzadeh, S. I., Farajtabar, M., Li, A. & Ghasemzadeh, H. (2019). Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393.
Mishra, A. & Marr, D. (2017). Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
Mou, L., Jia, R., Xu, Y., Li, G., Zhang, L. & Jin, Z. (2016). Distilling word embeddings: An encoding approach. In: CIKM.
Mukherjee, P., Das, A., Bhunia, A. K. & Roy, P. P. (2019). Cogni-net: Cognitive feature learning through deep visual perception. In: ICIP.
Mullapudi, R. T., Chen, S., Zhang, K., Ramanan, D. & Fatahalian, K. (2019). Online model distillation for efficient video inference. In: ICCV.
Muller, R., Kornblith, S. & Hinton, G. E. (2019). When does label smoothing help? In: NeurIPS.
Mun, J., Lee, K., Shin, J. & Han, B. (2018). Learning to specialize with knowledge distillation for visual question answering. In: NeurIPS.
Munjal, B., Galasso, F. & Amin, S. (2019). Knowledge distillation for end-to-end person search. In: BMVC.
Nakashole, N. & Flauger, R. (2017). Knowledge distillation for bilingual dictionary induction. In: EMNLP.
Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V. & Chakraborty, A. (2019). Zero-shot knowledge distillation in deep networks. In: ICML.
Ng, R. W., Liu, X. & Swietojanski, P. (2018). Teacher-student training for text-independent speaker recognition. In: SLTW.
Nie, X., Li, Y., Luo, L., Zhang, N. & Feng, J. (2019). Dynamic kernel distillation for efficient pose estimation in videos. In: ICCV.
Noroozi, M., Vinjimoor, A., Favaro, P. & Pirsiavash, H. (2018). Boosting self-supervised learning via knowledge transfer. In: CVPR.
Nowak, T. S. & Corso, J. J. (2018). Deep net triage: Analyzing the importance of network layers via structural compression. arXiv preprint arXiv:1801.04651.
Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K. & et al. (2018). Parallel wavenet: Fast high-fidelity speech synthesis. In: ICML.
Pan, Y., He, F. & Yu, H. (2019). A novel enhanced collaborative autoencoder with knowledge distillation for top-n recommender systems. Neurocomputing 332:137–148.
Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I. & Talwar, K. (2017). Semi-supervised knowledge transfer for deep learning from private training data. In: ICLR.
Papernot, N., McDaniel, P., Wu, X., Jha, S. & Swami, A. (2016). Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE SP.
Park, S. & Kwak, N. (2019). Feed: feature-level ensemble for knowledge distillation. arXiv preprint arXiv:1909.10754.
Park, W., Kim, D., Lu, Y. & Cho, M. (2019). Relational knowledge distillation. In: CVPR.
Passalis, N. & Tefas, A. (2018). Learning deep representations with probabilistic knowledge transfer. In: ECCV.
Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y. & et al. (2019a). Correlation congruence for knowledge distillation. In: ICCV.
Peng, Z., Li, Z., Zhang, J., Li, Y., Qi, G. J. & Tang, J. (2019b). Few-shot image recognition with knowledge transfer. In: ICCV.
Perez, A., Sanguineti, V., Morerio, P. & Murino, V. (2020). Audio-visual model distillation using acoustic images. In: WACV.
Phuong, M. & Lampert, C. (2019). Towards understanding knowledge distillation. In: ICML.
Pilzer, A., Lathuiliere, S., Sebe, N. & Ricci, E. (2019). Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In: CVPR.
Polino, A., Pascanu, R. & Alistarh, D. (2018). Model compression via distillation and quantization. In: ICLR.
Price, R., Iso, K. & Shinoda, K. (2016). Wise teachers train better dnn acoustic models. EURASIP Journal on Audio, Speech, and Music Processing 2016(1):10.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollar, P. (2020). Designing network design spaces. In: CVPR.
Roheda, S., Riggan, B. S., Krim, H. & Dai, L. (2018). Cross-modality distillation: A case for conditional generative adversarial networks. In: ICASSP.
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). Fitnets: Hints for thin deep nets. In: ICLR.
Ross, A. S. & Doshi-Velez, F. (2018). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In: AAAI.
Ruder, S., Ghaffari, P. & Breslin, J. G. (2017). Knowledge adaptation: Teaching to adapt. arXiv preprint arXiv:1702.02052.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In: CVPR.
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Saputra, M. R. U., de Gusmao, P. P., Almalioglu, Y., Markham, A. & Trigoni, N. (2019). Distilling knowledge from a deep pose regressor network. In: ICCV.
Sau, B. B. & Balasubramanian, V. N. (2016). Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650.
Shu, C., Li, P., Xie, Y., Qu, Y., Dai, L., & Ma, L. (2019). Knowledge squeezed adversarial network compression. arXiv preprint arXiv:1904.05100.
Shakeri, S., Sethy, A. & Cheng, C. (2019). Knowledge distillation in document retrieval. arXiv preprint arXiv:1911.11065.
Shen, J., Vesdapunt, N., Boddeti, V. N. & Kitani, K. M. (2016). In teacher we trust: Learning compressed models for pedestrian detection. arXiv preprint arXiv:1612.00478.
Shen, P., Lu, X., Li, S. & Kawai, H. (2018). Feature representation of short utterances based on knowledge distillation for spoken language identification. In: Interspeech.
Shen, P., Lu, X., Li, S. & Kawai, H. (2019a). Interactive learning of teacher-student model for short utterance spoken language identification. In: ICASSP.
Shen, Z., He, Z. & Xue, X. (2019b). Meal: Multi-model ensemble via adversarial learning. In: AAAI.
Shi, B., Sun, M., Kao, C. C., Rozgic, V., Matsoukas, S. & Wang, C. (2019a). Compression of acoustic event detection models with quantized distillation. In: Interspeech.
Shi, B., Sun, M., Kao, C. C., Rozgic, V., Matsoukas, S. & Wang, C. (2019b). Semi-supervised acoustic event detection based on tri-training. In: ICASSP.
Shi, Y., Hwang, M. Y., Lei, X. & Sheng, H. (2019c). Knowledge distillation for recurrent neural network language modeling with trust regularization. In: ICASSP.
Shin, S., Boo, Y. & Sung, W. (2019). Empirical analysis of knowledge distillation technique for optimization of quantized deep neural networks. arXiv preprint arXiv:1909.01688.
Shmelkov, K., Schmid, C. & Alahari, K. (2017). Incremental learning of object detectors without catastrophic forgetting. In: ICCV.
Siam, M., Jiang, C., Lu, S., Petrich, L., Gamal, M., Elhoseiny, M. & et al. (2019). Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In: ICRA.
Sindhwani, V., Sainath, T. & Kumar, S. (2015). Structured transforms for small-footprint deep learning. In: NeurIPS.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Song, X., Feng, F., Han, X., Yang, X., Liu, W. & Nie, L. (2018). Neural compatibility modeling with attentive knowledge distillation. In: SIGIR.
Srinivas, S. & Fleuret, F. (2018). Knowledge transfer with jacobian matching. In: ICML.
Su, J. C. & Maji, S. (2016). Adapting models to signal degradation using distillation. arXiv preprint arXiv:1604.00433.
Sun, S., Cheng, Y., Gan, Z. & Liu, J. (2019). Patient knowledge distillation for bert model compression. In: EMNLP-IJCNLP.
Sun, P., Feng, W., Han, R., Yan, S., & Wen, Y. (2019). Optimizing network performance for distributed dnn training on gpu clusters: Imagenet/alexnet training in 1.5 minutes. arXiv preprint arXiv:1902.06855.
Takashima, R., Li, S. & Kawai, H. (2018). An investigation of a knowledge distillation method for ctc acoustic models. In: ICASSP.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In: CVPR.
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: ICML.
Tan, X., Ren, Y., He, D., Qin, T., Zhao, Z. & Liu, T. Y. (2019). Multilingual neural machine translation with knowledge distillation. In: ICLR.
Tang, J., Shivanna, R., Zhao, Z., Lin, D., Singh, A., Chi, E. H., & Jain, S. (2020). Understanding and Improving Knowledge Distillation. arXiv preprint arXiv:2002.03532.
Tang, J. & Wang, K. (2018). Ranking distillation: Learning compact ranking models with high performance for recommender system. In: SIGKDD.
Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O. & Lin, J. (2019). Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136.
Thoker, F. M. & Gall, J. (2019). Cross-modal knowledge distillation for action recognition. In: ICIP.
Tian, Y., Krishnan, D. & Isola, P. (2020). Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
Tu, Z., He, F., & Tao, D. (2020). Understanding Generalization in Recurrent Neural Networks. In: ICLR.
Tung, F. & Mori, G. (2019). Similarity-preserving knowledge distillation. In: ICCV.
Turc, I., Chang, M. W., Lee, K. & Toutanova, K. (2019). Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962.
Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R. & et al. (2017). Do deep convolutional nets really need to be deep and convolutional? In: ICLR.
Vapnik, V. & Izmailov, R. (2015). Learning using privileged information: similarity control and knowledge transfer. J Mach Learn Res 16:2023–2049.
Vongkulbhisal, J., Vinayavekhin, P. & Visentini-Scarzanella, M. (2019). Unifying heterogeneous classifiers with distillation. In: CVPR.
Wang, C., Lan, X. & Zhang, Y. (2017). Model distillation with knowledge transfer from face classification to alignment and verification. arXiv preprint arXiv:1709.02929.
Wang, H., Zhao, H., Li, X. & Tan, X. (2018a). Progressive blockwise knowledge distillation for neural network acceleration. In: IJCAI.
Wang, J., Bao, W., Sun, L., Zhu, X., Cao, B. & Philip, S. Y. (2019a). Private model compression via knowledge distillation. In: AAAI.
Wang, J., Gou, L., Zhang, W., Yang, H. & Shen, H. W. (2019b). Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation. TVCG 25(6):2168–2180.
Wang, M., Liu, R., Abe, N., Uchida, H., Matsunami, T. & Yamada, S. (2018b). Discover the effective strategy for face recognition model compression by improved knowledge distillation. In: ICIP.
Wang, M., Liu, R., Hajime, N., Narishige, A., Uchida, H. & Matsunami, T. (2019c). Improved knowledge distillation for training fast low resolution face recognition model. In: ICCVW.
Wang, T., Yuan, L., Zhang, X. & Feng, J. (2019d). Distilling object detectors with fine-grained feature imitation. In: CVPR.
Wang, W., Zhang, J., Zhang, H., Hwang, M. Y., Zong, C. & Li, Z. (2018c). A teacher-student framework for maintainable dialog manager. In: EMNLP.
Wang, X., Zhang, R., Sun, Y. & Qi, J. (2018d). Kdgan: Knowledge distillation with generative adversarial networks. In: NeurIPS.
Wang, X., Hu, J. F., Lai, J. H., Zhang, J. & Zheng, W. S. (2019e). Progressive teacher-student learning for early action prediction. In: CVPR.
Wang, Y., Xu, C., Xu, C. & Tao, D. (2019e). Packing convolutional neural networks in the frequency domain. IEEE TPAMI 41(10):2495–2510.
Wang, Y., Xu, C., Xu, C. & Tao, D. (2018e). Adversarial learning of portable student networks. In: AAAI.
Watanabe, S., Hori, T., Le Roux, J. & Hershey, J. R. (2017). Student-teacher network learning with enhanced features. In: ICASSP.
Wei, H. R., Huang, S., Wang, R., Dai, X. & Chen, J. (2019). Online distilling from checkpoints for neural machine translation. In: NAACL-HLT.
Wei, Y., Pan, X., Qin, H., Ouyang, W. & Yan, J. (2018). Quantization mimic: Towards very tiny cnn for object detection. In: ECCV.
Wong, J. H. & Gales, M. (2016). Sequence student-teacher training of deep neural networks. In: Interspeech.
Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., ... & Keutzer, K. (2019). Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In: CVPR.
Wu, A., Zheng, W. S., Guo, X. & Lai, J. H. (2019a). Distilled person re-identification: Towards a more scalable system. In: CVPR.
Wu, J., Leng, C., Wang, Y., Hu, Q. & Cheng, J. (2016). Quantized convolutional neural networks for mobile devices. In: CVPR.
Wu, M. C., Chiu, C. T. & Wu, K. H. (2019b). Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks. In: ICASSP.
Wu, X., He, R., Hu, Y., & Sun, Z. (2020). Learning an evolutionary embedding via massive knowledge distillation. International Journal of Computer Vision, 1–18.
Xie, J., Lin, S., Zhang, Y. & Luo, L. (2019). Training convolutional neural networks with cheap convolutions and online distillation. arXiv preprint arXiv:1909.13063.
Xie, Q., Hovy, E., Luong, M. T., & Le, Q. V. (2020). Self-training with Noisy Student improves ImageNet classification. In: CVPR.
Xu, Z., Hsu, Y. C. & Huang, J. (2018a). Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. In: ICLR Workshop.
Xu, Z., Hsu, Y. C. & Huang, J. (2018b). Training student networks for acceleration with conditional adversarial networks. In: BMVC.
Yan, M., Zhao, M., Xu, Z., Zhang, Q., Wang, G. & Su, Z. (2019). Vargfacenet: An efficient variable group convolutional neural network for lightweight face recognition. In: ICCVW.
Yang, C., Xie, L., Qiao, S. & Yuille, A. (2019a). Knowledge distillation in generations: More tolerant teachers educate better students. In: AAAI.
Yang, C., Xie, L., Su, C. & Yuille, A. L. (2019b). Snapshot distillation: Teacher-student optimization in one generation. In: CVPR.
Yang, Y., Qiu, J., Song, M., Tao, D. & Wang, X. (2020a). Distilling Knowledge From Graph Convolutional Networks. In: CVPR.
Yang, Z., Shou, L., Gong, M., Lin, W. & Jiang, D. (2020b). Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In: WSDM.
Yao, H., Zhang, C., Wei, Y., Jiang, M., Wang, S., Huang, J., Chawla, N. V., & Li, Z. (2019). Graph Few-shot Learning via Knowledge Transfer. arXiv preprint arXiv:1910.03053.
Ye, J., Ji, Y., Wang, X., Ou, K., Tao, D. & Song, M. (2019). Student becoming the master: Knowledge amalgamation for joint scene parsing, depth estimation, and more. In: CVPR.
Yim, J., Joo, D., Bae, J. & Kim, J. (2017). A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: CVPR.
You, S., Xu, C., Xu, C. & Tao, D. (2017). Learning from multiple teacher networks. In: SIGKDD.
You, S., Xu, C., Xu, C. & Tao, D. (2018). Learning with single-teacher multi-student. In: AAAI.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., ... & Hsieh, C. J. (2019). Large batch optimization for deep learning: Training bert in 76 minutes. In: ICLR.
Yu, L., Yazici, V. O., Liu, X., Weijer, J., Cheng, Y. & Ramisa, A. (2019). Learning metrics from teachers: Compact networks for image embedding. In: CVPR.
Yu, X., Liu, T., Wang, X., & Tao, D. (2017). On compressing deep models by low rank and sparse decomposition. In: CVPR.
Yuan, L., Tay, F. E., Li, G., Wang, T. & Feng, J. (2019). Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.
Yun, S., Park, J., Lee, K. & Shin, J. (2019). Regularizing predictions via class wise self knowledge distillation.
Zagoruyko, S. & Komodakis, N. (2017). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: ICLR.
Zhai, M., Chen, L., Tung, F., He, J., Nawhal, M. & Mori, G. (2019). Lifelong gan: Continual learning for conditional image generation. In: ICCV.
Zhai, S., Cheng, Y., Zhang, Z. M. & Lu, W. (2016). Doubly convolutional neural networks. In: NeurIPS.
Zhang, C. & Peng, Y. (2018). Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In: IJCAI.
Zhang, F., Zhu, X. & Ye, M. (2019a). Fast human pose estimation. In: CVPR.
Zhang, J., Liu, T., & Tao, D. (2018). An information-theoretic view for deep learning. arXiv preprint arXiv:1804.09060.
Zhang, L., Song, J., Gao, A., Chen, J., Bao, C. & Ma, K. (2019b). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In: ICCV.
Zhang, S., Guo, S., Wang, L., Huang, W., & Scott, M. R. (2020). Knowledge Integration Networks for Action Recognition. In: AAAI.
Zhang, X., Zhou, X., Lin, M. & Sun, J. (2018a). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR.
Zhang, Y., Xiang, T., Hospedales, T. M. & Lu, H. (2018b). Deep mutual learning. In: CVPR.
Zhao, M., Li, T., Abu Alsheikh, M., Tian, Y., Zhao, H., Torralba, A. & Katabi, D. (2018). Through-wall human pose estimation using radio signals. In: CVPR.
Zhou, C., Neubig, G. & Gu, J. (2019a). Understanding knowledge distillation in non-autoregressive machine translation. In: ICLR.
Zhou, G., Fan, Y., Cui, R., Bian, W., Zhu, X. & Gai, K. (2018). Rocket launching: A universal and efficient framework for training well-performing light net. In: AAAI.
Zhou, J., Zeng, S. & Zhang, B. (2019b). Two-stage image classification supervised by a single teacher single student model. In: BMVC.
Zhou, P., Mai, L., Zhang, J., Xu, N., Wu, Z. & Davis, L. S. (2019c). M2KD: Multi-model and multi-level knowledge distillation for incremental learning. arXiv preprint arXiv:1904.01769.
Zhu, M., Han, K., Zhang, C., Lin, J. & Wang, Y. (2019). Low-resolution visual recognition via deep feature distillation. In: ICASSP.
Zhu, X. & Gong, S. (2018). Knowledge distillation by on-the-fly native ensemble. In: NeurIPS.