KNOWLEDGE DISTILLATION FOR OPTIMIZATION OF QUANTIZED DEEP NEURAL NETWORKS

Sungho Shin, Yoonho Boo, and Wonyong Sung

Department of Electrical and Computer Engineering
Seoul National University
Seoul, 08826 Korea
sungho.develop@gmail.com, wysung@snu.ac.kr

Thanks to XYZ agency for funding.

arXiv:1909.01688v3 [cs.LG] 23 Oct 2019
ABSTRACT

Knowledge distillation (KD), a technique that utilizes a pre-trained teacher model for training a student network, is exploited for the optimization of quantized deep neural networks (QDNNs). We consider the choice of the teacher network and also investigate the effect of the hyperparameters of KD. We have tried several large floating-point models and quantized ones as the teacher. The experiments show that the softmax distribution produced by the teacher network is more important than its performance for effective KD training. Since the softmax distribution of the teacher network can be controlled by the hyperparameters of KD, we analyze the interrelationship of each KD component for quantized DNN training. We show that even a small teacher model can achieve the same distillation performance as a larger one. We also propose the gradual soft loss reducing (GSLR) technique for robust KD-based QDNN optimization, which controls the mixing ratio of the hard and soft losses during training.

Index Terms— Deep neural network, quantization, knowledge distillation, fixed-point optimization

1. INTRODUCTION

Deep neural networks (DNNs) usually require a large number of parameters, so reducing the model size is necessary to operate them in embedded systems. Quantization is a widely used compression technique, and even 1- or 2-bit models can show quite good performance. However, the model must be trained very carefully so that performance is not lost when only low-precision arithmetic is allowed. Many QDNN papers have suggested various types of quantizers or complex training algorithms [1, 2, 3, 4, 5].

Knowledge distillation (KD) trains small networks using larger networks for better performance [6, 7]. KD employs the soft label generated by the teacher network to train the student network. Leveraging the knowledge contained in previously trained networks has attracted attention in many applications for model compression [8, 9, 10, 11] and learning algorithms [12, 13, 14, 15]. Recently, the use of KD for the training of QDNNs has been studied [16, 17]. However, there are many design choices to explore when applying KD to QDNN training. The work in [16] studied the effects of simultaneous training or pre-training in teacher model design. The result is rather expected: employing a pre-trained teacher model is advantageous for the performance of the student network. However, [18] reported that a teacher network that is too large does not help improve the performance of the student model.

In this work, we exploit KD with various types of teacher networks, including a full-precision model [16, 17], a quantized one, and a teacher-assistant based one [18]. The analysis results indicate that, rather than the type of the model, the distribution of the soft label is critical to the performance improvement of the student network. Since the distribution of the soft label can be controlled by the temperature and the size of the teacher network, we show how a well-selected temperature can improve the QDNN performance dramatically even with a small teacher network. Further, we suggest a simple KD training scheme that adjusts the mixing ratio of the hard and soft losses during training to obtain stable performance improvements. We name it the gradual soft loss reducing (GSLR) technique. GSLR employs both soft and hard losses equally at the beginning of training and gradually reduces the ratio of the soft loss as the training progresses.

This paper is organized as follows. Section 2 describes how a QDNN can be trained with KD and explains why the hyperparameters of KD are important. Section 3 shows the experimental results, and we conclude the paper in Section 4.

2. QUANTIZED DEEP NEURAL NETWORK TRAINING USING KNOWLEDGE DISTILLATION

In this section, we first briefly describe the conventional neural network quantization method and show how QDNN training can be combined with KD. We also explain the hyperparameters of KD and their role in QDNN training.

2.1. Quantization of deep neural networks and knowledge distillation

The deep neural network parameter vector, w, can be expressed with 2^b levels when quantized to b bits. Since we usually use a symmetric quantizer, the quantized weight vector Q(w) can be represented using (1) for b = 1 or (2) for b > 1:

Q_1(w) = Binarize(w) = Δ · sign(w)    (1)

Q_b(w) = sign(w) · Δ · min( ⌊ |w| / Δ + 0.5 ⌋, (M − 1) / 2 )    (2)

where M is the number of quantization levels (2^b − 1) and Δ represents the quantization step size. Δ can be computed by L2-error minimization between the floating-point and fixed-point weights or from the standard deviation of the weight vector [2, 19, 20].
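The quantizer in (1) and (2) is straightforward to prototype. The NumPy sketch below is our own illustration rather than code from the paper: quantize implements the symmetric b-bit mapping, and search_delta picks Δ by the L2-error minimization mentioned above using a simple grid of candidate step sizes (the grid itself is an assumption).

import numpy as np

def quantize(w, delta, b):
    # Eq. (1): binarization for b = 1.
    if b == 1:
        return delta * np.sign(w)
    # Eq. (2): symmetric b-bit quantization with M = 2^b - 1 levels.
    m = 2 ** b - 1
    levels = np.minimum(np.floor(np.abs(w) / delta + 0.5), (m - 1) / 2)
    return np.sign(w) * delta * levels

def search_delta(w, b, num_candidates=256):
    # Pick Delta by minimizing the L2 error between the floating-point
    # and quantized weights over a grid of candidate step sizes.
    max_abs = np.abs(w).max()
    candidates = np.linspace(max_abs / num_candidates, max_abs, num_candidates)
    errors = [np.sum((w - quantize(w, d, b)) ** 2) for d in candidates]
    return candidates[int(np.argmin(errors))]

w = np.random.randn(1000) * 0.05          # toy weight vector
delta = search_delta(w, b=2)
wq = quantize(w, delta, b=2)              # 2-bit weights take values in {-delta, 0, +delta}

For b = 2 this yields the ternary set {−Δ, 0, +Δ}, i.e., three of the 2^b representation levels are actually used by the symmetric quantizer.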
Severe quantization, such as to 1 or 2 bits, frequently incurs a large performance degradation, and retraining is widely used to minimize this performance loss [21]. When retraining the student network, the forward, backward, and gradient computations should be conducted using the quantized weights, but the computed gradients must be added to the full-precision weights [1, 2, 3, 4, 5].

The probability computation in deep neural networks usually employs a softmax layer. The logit, z, is fed into the softmax layer, which generates the probability of each class, p, using p_i = exp(z_i / τ) / Σ_j exp(z_j / τ). Here, τ is a hyperparameter of KD known as the temperature; a high value of τ softens the probability distribution. KD employs the probability generated by the teacher network as a soft label to train the student network, and the following loss function is minimized during training:

L(w_S) = (1 − λ) H(y, p_S) + λ H(p_T, p_S)    (3)

where H(·) denotes a loss function, y is the ground-truth hard label, w_S is the weight vector of the student network, p_T and p_S are the probabilities of the teacher and student networks, and λ is a loss weighting factor for adjusting the ratio of the soft and hard losses.
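For concreteness, a PyTorch-style sketch of the loss in (3) is given below. It is our reading rather than the paper's code: it assumes the common choice of cross-entropy for H, with the temperature applied to both the teacher and student logits in the soft term.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, tau=5.0, lam=0.5):
    # Hard-label term H(y, pS): standard cross-entropy (temperature 1 is a common choice).
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # Soft-label term H(pT, pS): cross-entropy between the temperature-softened
    # teacher probabilities pT and student probabilities pS.
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    soft_loss = -(p_teacher * log_p_student).sum(dim=1).mean()
    # Eq. (3): L = (1 - lambda) * H(y, pS) + lambda * H(pT, pS)
    return (1.0 - lam) * hard_loss + lam * soft_loss

With lam = 0 this reduces to conventional hard-label training (the 'HD' rows in the tables), and a higher tau lets the teacher term carry more information about the non-target classes.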
A recent paper [18] suggests using teacher, teacher-assistant, and student models because the effect of KD gradually decreases when the size difference between the teacher and student networks becomes too large. This performance degradation is due to the capacity limitation of the student model. Since a QDNN limits the representation levels of the weight parameters, the capacity of a quantized network is reduced compared with the full-precision model. Therefore, QDNN training with KD is more sensitive to the size of the teacher network. We consider the optimization of the three hyperparameters described above (temperature, loss weighting factor, and size of the teacher network). Algorithm 1 describes how to train a QDNN with KD.

Algorithm 1: QDNN training with KD
Initialization: w_T: pretrained teacher model, w_S: pretrained student model, λ: loss weighting factor, τ: temperature
Output: w_S^q: quantized student model
while not converged do
    w_S^q = Quant(w_S)
    Run the teacher (w_T) and the student (w_S^q) forward
    Compute the distillation loss L(w_S^q, λ)
    Run backward and compute the gradients ∂L(w_S^q)/∂w_S^q
    w_S = w_S − η · ∂L(w_S^q)/∂w_S^q
end
Return w_S^q
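Algorithm 1 can be written out as the following PyTorch-style sketch (again our own illustration under stated assumptions: teacher and student are ordinary nn.Module classifiers, loader yields (input, label) batches, and the step-size choice inside quantize_tensor is a simple std-based rule standing in for the options of Sec. 2.1). The essential detail is that the forward and backward passes use the quantized weights w_S^q, while the update is applied to the full-precision copy w_S.

import torch
import torch.nn.functional as F

def quantize_tensor(w, bits):
    # Symmetric quantizer of Eqs. (1)-(2); Delta here is a simple std-based choice.
    delta = w.std()
    if bits == 1:
        return delta * torch.sign(w)
    m = 2 ** bits - 1
    levels = torch.clamp(torch.floor(w.abs() / delta + 0.5), max=(m - 1) / 2)
    return torch.sign(w) * delta * levels

def train_qdnn_with_kd(teacher, student, loader, bits=2, tau=5.0, lam=0.5, lr=0.1):
    teacher.eval()                                      # pretrained teacher, frozen
    # Full-precision copy w_S that accumulates the updates.
    shadow = {n: p.detach().clone() for n, p in student.named_parameters()}
    for x, y in loader:
        with torch.no_grad():
            for n, p in student.named_parameters():     # w_S^q = Quant(w_S)
                p.copy_(quantize_tensor(shadow[n], bits))
            t_logits = teacher(x)                       # teacher forward
        student.zero_grad()
        s_logits = student(x)                           # student forward with quantized weights
        p_t = F.softmax(t_logits / tau, dim=1)          # soft label from the teacher
        loss = (1 - lam) * F.cross_entropy(s_logits, y) \
               - lam * (p_t * F.log_softmax(s_logits / tau, dim=1)).sum(dim=1).mean()
        loss.backward()                                 # gradients w.r.t. the quantized weights
        with torch.no_grad():
            for n, p in student.named_parameters():
                if p.grad is not None:
                    shadow[n] -= lr * p.grad            # ...but applied to the full-precision copy
    with torch.no_grad():                               # final quantized student for deployment
        for n, p in student.named_parameters():
            p.copy_(quantize_tensor(shadow[n], bits))
    return student

In practice the update would go through an optimizer with momentum and weight decay, but the quantize-forward / full-precision-update structure is the part that matters here.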
2.2. Teacher model selection for KD

In this section, we try to find the best teacher model for QDNN training with KD. We consider three different approaches. The first trains the full-precision teacher and student networks independently and applies KD when fine-tuning the quantized student model, as suggested in [16, 17]. The second trains a medium-sized teacher-assistant network with a very large teacher model and then optimizes the student network using the teacher-assistant model, as suggested by [18]. The last approach uses a quantized teacher model, with the possibility of the student learning something about quantization.

Table 1. Train and test accuracies of the quantized ResNet20 trained with various KD methods on the CIFAR-10 dataset. 'TL', 'T', 'S', '(F)', and '(Q)' denote large teacher, teacher, student, (full-precision), and (quantized), respectively. HD is conventional training using the hard loss. τ represents the temperature. Note that all the student networks are 2-bit QDNNs and the results are the average of five runs.

Method          Status        Train Acc.   Test Acc.
T(F)-S          T (float)     99.95        94.02
                S (τ = 5)     98.02        92.25
                S (τ = 2)     98.76        91.94
TL(F)-T(F)-S    TL (float)    99.99        95.24
                T (float)     99.99        94.8
                S (τ = 10)    97.78        92.18
TL(F)-T(Q)-S    TL (float)    99.99        95.24
                T (4-bit)     99.99        94.46
                S (τ = 10)    97.79        92.14
T(Q)-S          T (8-bit)     99.99        94.34
                S (τ = 10)    97.53        92.02
HD              Conventional  98.91        91.71

[Fig. 1 shows the softmax probability over the 10 CIFAR-10 labels for the teacher models T(F) at τ = 2, 5, 10 and T(Q) at τ = 10, with the corresponding student test accuracies 91.94%, 92.25%, 92.18%, and 92.14% in the legend.]
Fig. 1. Example of the softmax distribution for label 6 from the teacher models in Table 1. The numbers in square brackets are the CIFAR-10 test accuracies of the student networks trained by each teacher model.

Table 1 compares the results of these three approaches. Figure 1 also shows the softmax distributions when the teacher models and temperatures vary. The test accuracy of the quantized ResNet20 trained using the hard loss was 91.71%. The results of the various KD approaches indicate the following. First, whether or not the teacher network is quantized, the performance of the student network is not much different. Second, employing the teacher-assistant network [18] does not help increase the performance; it is similar to that of general KD. Third, KD training with the full-precision teacher network [16, 17] is significantly better than conventional training when τ is 5, but no performance increase is observed when τ is 2. Lastly, the teacher models that achieve student network accuracies of 92.14%, 92.18%, and 92.25% have similar softmax distributions, whereas T(F)-S with τ = 2 shows a quite sharp softmax shape, and the resulting performance is similar to that of hard-target training.

These points indicate that the softmax distribution is the key to effective KD training. Although the different teacher models generate dissimilar softmax distributions, we can control their shapes by using the temperature. A detailed discussion of the hyperparameters is provided in the following subsections.
2.3. Discussion on hyperparameters of KD

As mentioned in Sections 2.1 and 2.2, the hyperparameters temperature (τ), loss weighting factor (λ), and size of the teacher network (N) can significantly affect the QDNN performance. Previous works usually fixed these hyperparameters when training a QDNN with KD. For example, [16] always fixes τ to 1, and [17] holds it at 1 or 5 depending on the dataset. However, these three parameters are closely interrelated. For example, [18] points out that when the teacher model is very large compared to the student model, the softmax information produced by the teacher network becomes sharper, making it difficult to transfer the knowledge of the teacher network to the student model. However, even in this case, controlling the temperature may make the transfer possible. Therefore, when the value of one hyperparameter is changed, the others also need to be adjusted carefully. Thus, we empirically analyze the effect of the KD hyperparameters. In addition, we introduce the gradual soft loss reducing (GSLR) technique, which helps improve the performance of QDNNs dramatically. GSLR is a KD training method that gradually increases the relative weight of the hard loss.

3. EXPERIMENTAL RESULTS

3.1. Experimental setup

Dataset: We employ the CIFAR-10 and CIFAR-100 datasets for the experiments. CIFAR-10 and CIFAR-100 consist of 10 and 100 classes, respectively. Both datasets contain 50K training images and 10K test images. The size of each image is 32x32 with RGB channels.

Model configuration & training hyperparameters: To analyze the impact of the KD hyperparameters on QDNN training, we train WideResNet20xN (WRN20xN) [22] as the teacher networks, where N is set to 1, 1.2, 1.5, 1.7, 2, 3, 4, 5, and 10. When N is 1, the network structure is the same as ResNet20 [23]. All the train and test accuracies of the teacher networks on the CIFAR-10 and CIFAR-100 datasets are reported in Table 2. We employ ResNet20 as the student network for both the CIFAR-10 and CIFAR-100 datasets. If the network size is large enough considering the size of the dataset, which means over-parameterized, most quantization methods work well [21]. Therefore, to evaluate a quantization algorithm, we need to employ a small network that is located in the under-parameterized region [21, 24]. Although the full-precision ResNet20 model is over-parameterized, which means near 100% training accuracy, the 2-bit network becomes under-parameterized on the CIFAR-10 dataset. Likewise, on the CIFAR-100 dataset, both the full-precision and the quantized models are under-parameterized. Thus, it is a good network configuration for evaluating the effect of KD on QDNN training. We report the train and test accuracies of ResNet20 on CIFAR-10 and CIFAR-100 in Table 3.

Table 2. Train and test accuracies (%) of the teacher networks on the CIFAR-10 and CIFAR-100 datasets. 'WRN20xN' denotes WideResNet with a wide factor of N.

CIFAR-10      Train   Test     CIFAR-100     Train   Test
ResNet20      99.62   92.63    ResNet20      90.12   68.43
WRN20x1.2     99.83   92.93    WRN20x1.2     94.92   69.64
WRN20x1.5     99.93   93.48    WRN20x1.5     98.63   71.80
WRN20x1.7     99.95   94.02    WRN20x1.7     99.36   72.17
WRN20x2       99.95   94.36    WRN20x2       99.82   74.03
WRN20x5       100     95.24    WRN20x3       99.95   76.31
WRN20x10      100     95.23    WRN20x4       99.95   77.93
                               WRN20x5       99.98   78.17
                               WRN20x10      99.98   78.68

Table 3. Training results of full-precision and 2-bit quantized ResNet20 on the CIFAR-10 and CIFAR-100 datasets in terms of accuracy (%). The models are trained with the hard loss only.

                              Train acc.   Test acc.
CIFAR-10    Full-precision    99.62        92.63
            2-bit quantized   98.92        91.71
CIFAR-100   Full-precision    90.12        68.43
            2-bit quantized   77.61        65.23

3.2. Results

Table 4. Results of QDNN training with KD on ResNet-20 for the CIFAR-10 and CIFAR-100 datasets. 'WRN', 'RN', 'SM', and 'DS' represent WideResNet, ResNet, student model, and deeper student, respectively.

CIFAR-10
Method        Teacher (full-precision)            Student (2-bit)                   τ    λ
              # params (M), model       Test (%)  # params (M), model     Test (%)
QDistill      5.3 (small network)       89.7      0.3 (SM 2)              74.2      5    0.5
              5.3 (small network)       89.7      5.8 (DS)                89.3      5    0.5
              145 (WRN28x20)            95.7      82.7 (WRN22x16)         94.23     5    0.5
Apprentice    0.66 (RN44)               93.8      0.27 (RN20)             91.6      1    0.5
              0.66 (RN44)               93.8      0.47 (RN32)             92.6      1    0.5
Ours          0.61 (WRN20x1.5)          93.5      0.27 (RN20)             92.52     10   0.5
              0.38 (WRN20x1.2)          92.9      0.27 (RN20, 1-bit)      91.3      3    0.5

CIFAR-100
Method        Teacher (full-precision)            Student (2-bit)                   τ    λ
              # params (M), model       Test (%)  # params (M), model     Test (%)
QDistill      36.5 (WRN28x10)           77.2      17.2 (WRN22x8)          49.3      5    0.5
Guided        22.0 (AlexNet)            65.4      22.0 (AlexNet)          64.6      -    -
Ours          0.39 (WRN20x1.2)          69.64     0.28 (RN20)             66.6      2    0.5
              0.78 (WRN20x1.7)          72.17     0.28 (RN20)             67.0      3    GSLR

We compare our models with previous works in Table 4. The compared QDNN models trained with KD include QDistill [17], Apprentice [16], and Guided [25]. We achieve results that significantly exceed those of previous studies. We compare our model (0.27M) with the 'student model 2' (SM 2) of QDistill that has 0.3M parameters, and achieve an 18.32% gap in test accuracy. Also, our model is about 1% better than the ResNet20 result reported by Apprentice and even achieves the same performance as their ResNet32 result.
When quantized to 1 bit, a test accuracy of 91.3% is obtained, which is almost the same as Apprentice's 2-bit ResNet20. In the case of CIFAR-100, the QDistill and Guided student models use considerably large numbers of parameters, 17.2M and 22.0M. Our student model contains only 0.28M parameters but achieves 17.7% and 2.4% higher accuracy than QDistill and Guided, respectively. These huge performance gaps show the importance of selecting the proper hyperparameters.

3.3. Model size and temperature

[Fig. 2 plots the test accuracy (%) of the 2-bit student against the teacher network (wide factor x1 to x5 on the x-axis), for τ = 1, 5, 10 together with the teacher accuracy in panel (a) CIFAR-10, and for τ = 1, 2, 4, 5 together with the teacher accuracy and the hard-label baseline in panel (b) CIFAR-100.]
Fig. 2. Results of 2-bit ResNet20 trained while varying the temperature (τ) and the size of the teacher network on the CIFAR-10 and CIFAR-100 datasets. The numbers on the x-axis represent the wide factor (N) of WideResNet20xN. (a) CIFAR-10; (b) CIFAR-100.

We report the test accuracies of 2-bit ResNet20 on CIFAR-10 in Figure 2 (a). To demonstrate the effect of the temperature on QDNN training, we train 2-bit ResNet20 while varying the size of the teacher network from WRN20x1 to WRN20x5. Each experiment was conducted for three τ values of 1, 5, and 10, corresponding to a small, medium, and large temperature, respectively. Note that WRN20xN increases the number of channel maps by N times. When the value of τ is small (blue line in the figure), the performance greatly depends on N, that is, the teacher model size. The performance change is much reduced as the value of τ increases to the medium (orange line) or the large value. This is related to the accuracy of the teacher model (red line). When the size of the teacher model increases, the shape of the soft label becomes similar to that of the hard label, and in this case the KD training results are not much different from those obtained with the hard label. Therefore, with τ = 1, the performance decreases to 91.9% when the teacher network becomes larger than WRN20x2. This result is similar to the performance of a 2-bit ResNet20 trained with the hard loss (91.71%). The soft label needs to have a broad shape, which can be achieved either by increasing the temperature or by limiting the size of the teacher network. A similar problem can occur for full-precision model KD training, but it is more important for QDNNs since the model capacity is reduced by quantization. Therefore, when training a QDNN with KD, we need to consider the relationship between the size of the teacher model and the temperature.

Figure 2 (b) shows the test accuracies of the 2-bit ResNet20 trained with KD on the CIFAR-100 dataset. Since CIFAR-100 includes 100 classes, the soft label distribution is not sharp, and the optimum value of τ is usually lower than that for CIFAR-10. More specifically, when τ is larger than 5 (purple line), the test accuracies are lower than 65.49% (green dotted line), the accuracy of the 2-bit ResNet20 trained with the hard label. The soft label can easily become too flat even with a small τ, so the teacher's knowledge does not transfer well to the student network. When τ is not large (e.g., less than 5), the tendency is similar to the CIFAR-10 experiment. When τ is 1 (blue line), the best performance is observed with ResNet20 as the teacher. As τ increases to 2 (yellow line) and 4 (grey line), the size of the best performing teacher model also changes to WRN20x1.5 and WRN20x1.7, respectively. This demonstrates that a proper value of the temperature can improve the performance, but it should not be too high since the knowledge from the teacher network can disappear.
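The interplay between teacher sharpness and τ discussed in this section can be checked with a toy computation (the logits below are made up for illustration): a very confident teacher output collapses to a near-one-hot soft label at τ = 1, while raising τ restores probability mass on the non-target classes.

import numpy as np

def soften(logits, tau):
    # Temperature softmax: p_i = exp(z_i / tau) / sum_j exp(z_j / tau).
    z = np.asarray(logits, dtype=float) / tau
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

confident_teacher_logits = [12.0, 3.0, 1.0, 0.5, 0.2]   # hypothetical sharp output of a large teacher
for tau in (1, 5, 10):
    p = soften(confident_teacher_logits, tau)
    print(f"tau={tau:>2}: target prob={p[0]:.3f}, non-target probs={np.round(p[1:], 3)}")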
the teacher network from ‘WRN20x1’ to ‘WRN20x5’. Each ex- We have developed a KD technique that is much less sensitive to
periment was conducted for three τ values of 1, 5, and 10, which specific parameter setting for KD. At the beginning of the training,
correspond to small, medium, and large one, respectively. Note that where the gradient changes a lot, we use the soft and hard losses
WRN20xN contains the number of channel maps increased by N equally and then, gradually reduce the amount of the soft loss as
times. When the value of τ is small (blue line in the figure), the the training proceeds. We name this simple method as the gradual
performance greatly depends on N , or the teacher model size. The soft loss reducing (GSLR) technique. To evaluate the effectiveness
performance change is much reduced as the value of τ increases to of the GSLR, we train 2-bit ResNet20 while varying the size of the
the medium (orange line) or the large value (blue line). This is re- teacher and the temperature as shown in Figure 3. The results clearly
lated to the accuracy of the teacher model (red line). When the size show that GSLR greatly aids to improve the performance or at least
of the teacher model increases, the shape of the soft label becomes yields the comparable results with the hard loss (black horizontal
similar to that of hard label. In this case, the KD training results are line). When comparing the traditional KD, shown in Figure 3 (a),
not much different from that trained with the hard label. Therefore, and GSLR KD, in Figure 3 (b), we can find that the latter yields
with τ = 1, the performance decreases to 91.9% when the teacher much more predictable result, by which reducing the risk of cherry
network becomes larger than WRN20x2. This result is similar to the picking.
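GSLR only changes how λ in (3) is scheduled during training, so it drops into the loop of Algorithm 1 with a few extra lines. The paper states the qualitative schedule (start with equal hard and soft losses, then reduce the soft loss) but not its exact shape, so the linear ramp and final value in the sketch below are assumptions for illustration.

def gslr_lambda(epoch, total_epochs, lam_start=0.5, lam_end=0.0):
    # Gradual soft loss reducing: begin with equal hard/soft weighting
    # (lambda = 0.5) and reduce the soft-loss weight as training proceeds.
    progress = epoch / max(total_epochs - 1, 1)
    return lam_start + (lam_end - lam_start) * progress

# Usage inside the KD training loop of Algorithm 1:
#   for epoch in range(total_epochs):
#       lam = gslr_lambda(epoch, total_epochs)
#       loss = (1 - lam) * hard_loss + lam * soft_loss   # Eq. (3) with a scheduled lambda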
4. CONCLUDING REMARKS

In this work, we investigated the teacher model choice and the impact of the hyperparameters in quantized deep neural network training with knowledge distillation. We found that the teacher need not be a quantized neural network; instead, the hyperparameters that control the shape of the softmax distribution are more important. The hyperparameters for KD, which are the temperature, the loss weighting factor, and the size of the teacher network, are closely interrelated. When the size of the teacher network grows, increasing the temperature helps boost the performance to some extent. We introduce a simple training technique, gradual soft loss reducing (GSLR), for fail-safe KD training. At the beginning of the training, GSLR equally employs the hard and soft losses, and then it gradually reduces the soft loss as the training proceeds. With careful hyperparameter selection and the GSLR technique, we achieve far better performance than previous studies for designing 2-bit quantized deep neural networks on the CIFAR-10 and CIFAR-100 datasets.
5. REFERENCES

[1] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 187, pp. 1–30, 2017.

[2] Kyuyeon Hwang and Wonyong Sung, "Fixed-point feedforward deep neural network design using weights +1, 0, and -1," in Signal Processing Systems (SiPS), 2014 IEEE Workshop on. IEEE, 2014, pp. 1–6.

[3] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha, "Alternating multi-bit quantization for recurrent neural networks," in International Conference on Learning Representations (ICLR), 2018.

[4] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.

[5] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou, "Balanced quantization: An effective and efficient approach to quantized neural networks," Journal of Computer Science and Technology, vol. 32, no. 4, pp. 667–682, 2017.

[6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.

[7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 535–541.

[8] Zhiyuan Tang, Dong Wang, and Zhiyong Zhang, "Recurrent neural network training with dark knowledge transfer," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5900–5904.

[9] Xuemeng Song, Fuli Feng, Xianjing Han, Xin Yang, Wei Liu, and Liqiang Nie, "Neural compatibility modeling with attentive knowledge distillation," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018, pp. 5–14.

[10] Taichi Asami, Ryo Masumura, Yoshikazu Yamaguchi, Hirokazu Masataki, and Yushi Aono, "Domain adaptation of dnn acoustic models using knowledge distillation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5185–5189.

[11] Junpeng Wang, Liang Gou, Wei Zhang, Hao Yang, and Han-Wei Shen, "Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation," IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 6, pp. 2168–2180, 2019.

[12] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio, "Fitnets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.

[13] Mandar Kulkarni, Kalpesh Patil, and Shirish Karande, "Knowledge distillation using unlabeled mismatched images," arXiv preprint arXiv:1703.07131, 2017.

[14] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho, "Relational knowledge distillation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.

[15] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.

[16] Asit Mishra and Debbie Marr, "Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy," in International Conference on Learning Representations, 2018.

[17] Antonio Polino, Razvan Pascanu, and Dan Alistarh, "Model compression via distillation and quantization," in International Conference on Learning Representations, 2018.

[18] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh, "Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher," arXiv preprint arXiv:1902.03393, 2019.

[19] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 525–542.

[20] Sungho Shin, Yoonho Boo, and Wonyong Sung, "Fixed-point optimization of deep neural networks with adaptive step size retraining," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 1203–1207.

[21] Wonyong Sung, Sungho Shin, and Kyuyeon Hwang, "Resiliency of deep neural networks under quantization," arXiv preprint arXiv:1511.06488, 2015.

[22] Sergey Zagoruyko and Nikos Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778.

[24] Yoonho Boo, Sungho Shin, and Wonyong Sung, "Memorization capacity of deep neural networks under parameter quantization," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 1383–1387.

[25] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid, "Towards effective low-bitwidth convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7920–7928.
