
Transfer Learning of Neural Networks in NLP

D Ravi Shankar
December, 2017

Advisor: Dr V Susheela Devi

Mid-term M.Tech Project Report

Abstract

Transfer learning is a research problem in machine learning where the knowledge gained by solving a problem related to one task or domain is applied to solve a different but related problem. Transfer Learning techniques have been used effectively in fields like Image Processing and were able to achieve good results. However, in NLP, Transfer Learning has been applied loosely and the conclusions are not consistent. In this project we aim to explore different neural network based Transfer Learning schemes on a variety of datasets.

1 Introduction

Transfer Learning plays an important role in many natural language processing (NLP) applications, especially when we do not have large enough datasets for the task of interest. The real world is messy and contains an infinite number of novel scenarios, many of which our model may not have encountered during training and for which it is in turn ill-prepared to make predictions. Transfer Learning helps in such situations to generalize to the new conditions. Transfer Learning has been widely used in the Image Processing field, but in NLP the conclusions are not consistent.

The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 describes the transfer learning methods that we used for the experiments. The datasets used, the implementation, and the experiments performed are discussed in sections 4, 5 and 6. Future work is discussed in section 7 and conclusions are reported in section 8.

2 Related Work

Transfer learning (of neural network features specifically) is being applied successfully to a variety of problems in computer vision, across domains and applications, but researchers [1] have shown that neural network based transfer learning in NLP only benefits when the source and target tasks are semantically related. They used the INIT and MULT methods while transferring the parameters. Researchers [2] have also shown that neural network based transfer learning improves the performance of models for sequence tagging, and that even problems across domains and applications can benefit (though perhaps not significantly) from transfer learning.

3 Transfer Learning

The authors in [4] defined Transfer Learning as follows:

Def: Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.

3.1 Transfer Methods

We have experimented with two methods of neural network based transfer learning:

1. Parametric Initialization (INIT): The INIT method first trains the network on the source dataset and then directly uses the tuned parameters to initialize the network for the target dataset. After transfer, we may fix the tuned parameters when we have no labelled data in the target domain; if labelled target data is available, we can further fine-tune the parameters.
2. Multi-task learning (MULT): In this method we use samples from both domains simultaneously to train the network. The overall cost function is

J = αJ_S + (1 − α)J_T

where J_S and J_T are the individual cost functions for the source and target domains respectively, and α ∈ (0, 1) is a hyperparameter balancing the two domains. During training we switch to the source domain with probability α and to the target domain with probability 1 − α, and train the network on a sample from the chosen domain.
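To make the two schemes concrete, below is a minimal PyTorch-style sketch of both. The model, the synthetic batches, and all hyperparameter values are illustrative assumptions, not the code or settings used in our experiments.

import random
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """A small sentence classifier: embedding -> LSTM -> linear head."""
    def __init__(self, vocab=20000, embed_dim=100, hidden=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1])            # last time step -> class logits

def random_batches(classes, batch=32, seq=20, vocab=20000):
    """Endless random batches standing in for a real labelled dataset."""
    while True:
        yield (torch.randint(0, vocab, (batch, seq)),
               torch.randint(0, classes, (batch,)))

source_batches, target_batches = random_batches(2), random_batches(2)
loss_fn = nn.CrossEntropyLoss()

# --- INIT: train on source, copy the tuned parameters, fine-tune on target ---
source_model, target_model = Classifier(), Classifier()
# ... train source_model on the source dataset here ...
target_model.load_state_dict(source_model.state_dict())   # the transfer step
for p in target_model.lstm.parameters():   # optionally fix transferred layers
    p.requires_grad = False                # when target labels are scarce
# (if the source and target label sets differ, copy only the shared layers,
#  e.g. with load_state_dict(..., strict=False))

# --- MULT: a single shared network, switching domains with probability alpha ---
alpha, model = 0.7, Classifier()
opt = torch.optim.Adam(model.parameters())
for step in range(100):
    # Sampling the domain per step realizes the expected cost
    # J = alpha * J_S + (1 - alpha) * J_T from the formula above.
    x, y = next(source_batches) if random.random() < alpha else next(target_batches)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

(In a real MULT setup the output layer may be domain-specific; the shared head here is only possible because both synthetic tasks are binary.)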
Figure 1: (Ref [1]) Architecture used by Lili Mou et al.; 'a' is used for Experiment 1 and 'b' for Experiment 2.

4 Datasets:

IMDB: A large dataset for binary sentiment classification (positive vs. negative), 25k sentences.
MR: A small dataset for binary sentiment classification, ∼10k sentences.
QC: A small dataset for 6-way question classification (e.g., location, time, and number), ∼5000 questions.
SNLI: A large dataset for sentence entailment recognition. The classification objectives are entailment, contradiction, and neutral, ∼500k pairs.
SICK: A small dataset with exactly the same classification objective as SNLI, ∼10k pairs.
MSRP: A small dataset for paraphrase detection. The objective is binary classification: judging whether two sentences have the same meaning, ∼5000 pairs.
Quora dataset: It contains duplicate question pairs with labels indicating whether the two questions request the same information, ∼400k question pairs.

5 Implementation

We have tried to replicate the results from [1] using the INIT method. For Experiment 1, we trained an LSTM model on IMDB and transferred the parameters to the MR and QC datasets; the results are shown in Table 1. For Experiment 2, we trained a CNN model on the SNLI dataset and transferred the parameters to the SICK and MSRP datasets. We have then experimented with both INIT and MULT on the SNLI (source) and Quora (target) datasets.

6 Experiments

6.1 Experiment 1:

For this experiment we used an LSTM architecture. We trained the model on IMDB and then transferred the weights to the MR and QC datasets; the results are shown in Table 1. When we transferred the parameters from IMDB to MR, the accuracy improved by 1.95%, while from IMDB to QC there was not much change in accuracy. The reason for this is that IMDB and MR are semantically similar datasets, whereas IMDB and QC are semantically different.

Dataset   Paper [1] (without Transfer)   Paper [1] (with Transfer)   Ours (without Transfer)   Ours (with Transfer)
IMDB      87.0                           -                           84.10                     -
MR        75.1                           81.4                        79.21                     81.16
QC        90.8                           93.2                        96.93                     96.90

Table 1: Transfer of parameters from IMDB to the MR and QC datasets (accuracy, %).
6.2 Experiment 2:

For this experiment we used a Siamese CNN architecture, with two additional hidden layers and an output layer. We trained the model on SNLI and then transferred the parameters to SICK and MSRP. Similarly, we also trained the model on Quora and then transferred the parameters to SICK and MSRP. The results are shown in Table 2.

               Paper [1] (without Transfer)   Paper [1] (with Transfer)   Ours (without Transfer)   Ours (with Transfer)
SNLI to SICK   70.9                           77.6                        66.12                     70.10
Quora to SICK  -                              -                           66.12                     68.74
SNLI to MSRP   69.0                           69.9                        68.28                     67.71
Quora to MSRP  -                              -                           68.28                     69.62

Table 2: Transfer of parameters from SNLI to SICK and MSRP, and from Quora to SICK and MSRP (accuracy, %).

Transfer of parameters from SNLI to SICK appears to be successful, with a 3.98% increase in accuracy, while there is no improvement from SNLI to MSRP. SNLI and SICK are semantically similar datasets, whereas SNLI and MSRP are not. From Quora to MSRP there is an improvement of 1.34%, while the performance decreased when transferring from SNLI to MSRP; Quora and MSRP are more semantically related than SNLI and MSRP. This suggests that transfer learning is sensitive to the semantic relatedness of the tasks.

We have also transferred the parameters layer-wise, from SNLI to SICK and MSRP and from Quora to SICK and MSRP; the results are shown in Table 3. They show that the benefit of transfer learning also depends on which layers are transferred.

                                      SNLI to SICK   Quora to SICK   SNLI to MSRP   Quora to MSRP
Without Transfer                      66.12          66.12           68.28          68.28
CNN weights                           69.92          68.74           67.71          68.74
CNN weights and Hidden layer 1        70.10          68.37           66.28          69.62
CNN weights and Hidden layers 1 & 2   68.74          67.91           66.62          69.47

Table 3: Layer-wise transfer of parameters from SNLI to SICK and MSRP and from Quora to SICK and MSRP (accuracy, %).
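A sketch of how such layer-wise transfer can be implemented is shown below; the Siamese architecture and module names are simplified assumptions for illustration, not our exact network.

import torch
import torch.nn as nn

class SiameseEncoderClassifier(nn.Module):
    """Shared CNN encoder over two sentences, then hidden layers + output."""
    def __init__(self, vocab=20000, embed_dim=100, filters=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.conv = nn.Conv1d(embed_dim, filters, kernel_size=3, padding=1)
        self.hidden1 = nn.Linear(2 * filters, 128)
        self.hidden2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, classes)

    def encode(self, x):
        e = self.embed(x).transpose(1, 2)              # (batch, embed, seq)
        return torch.relu(self.conv(e)).max(dim=2).values  # max-pool over time

    def forward(self, a, b):
        pair = torch.cat([self.encode(a), self.encode(b)], dim=1)
        h = torch.relu(self.hidden2(torch.relu(self.hidden1(pair))))
        return self.out(h)

def transfer_layers(source, target, prefixes):
    """Copy only the parameters whose names start with one of the prefixes."""
    src = source.state_dict()
    subset = {k: v for k, v in src.items()
              if any(k.startswith(p) for p in prefixes)}
    target.load_state_dict(subset, strict=False)  # leave other layers untouched

source = SiameseEncoderClassifier()   # stands in for a model trained on SNLI
target = SiameseEncoderClassifier()

# "CNN weights" row of Table 3:
transfer_layers(source, target, ["embed", "conv"])
# "CNN weights and Hidden layer 1" row:
transfer_layers(source, target, ["embed", "conv", "hidden1"])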
6.3 Experiments on Quora Dataset:

INIT and MULT: All these experiments use the Siamese CNN architecture (Figure 1.b) with two additional hidden layers. A plot comparing INIT and MULT is shown in Figure 2. In this experiment, the accuracies of both INIT and MULT were plotted for different percentages of target data (Quora) available for training. We can conclude from this plot that transfer learning is effective when the amount of target data is small, i.e., when that data alone is not enough to train a good model. INIT also appears to be slightly better than MULT.

Figure 2: Comparison of INIT and MULT. The graphs show accuracy on the Quora (target) dataset for different transfer learning schemes.

An experiment was performed to determine the effect of sharing different parts of the network; the results are shown in Table 4. We did not observe a considerable difference in accuracy between the case where only the filters are shared and the case where both the filters and the hidden layers are shared (the latter appears slightly better).

                                      SNLI to Quora
Without Transfer                      82.01
CNN weights                           81.86
CNN weights and Hidden layer 1        82.41
CNN weights and Hidden layers 1 & 2   82.15

Table 4: Layer-wise transfer of parameters from SNLI to Quora (accuracy, %).

We have also performed an experiment to determine how over-fitting on the source dataset in INIT affects the transfer. The model was trained on the SNLI dataset for different numbers of epochs before transferring the parameters and training on Quora. The accuracies on both SNLI (source) and Quora (target) are plotted for comparison in Figure 3. As can be seen from the plot, the accuracy on the Quora dataset peaked along with the accuracy on the SNLI dataset; when the model was over-fit on the SNLI data, the accuracy on Quora showed a dip. This indicates that the parameters of the best model on the source dataset should be used for transfer.

Figure 3: Effect of over-fitting the source dataset. The graphs show how the accuracy on the Quora (target) dataset varies with the accuracy on the SNLI (source) dataset.
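The checkpoint-selection rule this suggests is easy to implement: keep the source parameters from the epoch with the best source validation score and transfer those, not the final (possibly over-fit) ones. Below is a runnable toy illustrating it; the data, model size, and epoch count are all illustrative assumptions.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Random tensors standing in for the source train and validation splits.
x_train, y_train = torch.randn(256, 10), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 10), torch.randint(0, 2, (64,))

best_acc, best_state = -1.0, None
for epoch in range(30):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()
    with torch.no_grad():
        acc = (model(x_val).argmax(1) == y_val).float().mean().item()
    if acc > best_acc:                 # remember the best epoch only
        best_acc, best_state = acc, copy.deepcopy(model.state_dict())

# Initialize the target model from the best source epoch (INIT transfer):
target_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
target_model.load_state_dict(best_state)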
7 Future Work

Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014 [5]; Cho et al., 2014 [6]) have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization. An NMT system first reads the source sentence using an encoder to build a "context" vector, a sequence of numbers that represents the sentence meaning; a decoder then processes the context vector to emit a translation. The context vector thus contains sufficient lexical and semantic information to fully reconstruct a sentence in another language. We want to transfer this knowledge (the context vector) from NMT and use it in a variety of applications, such as sentiment analysis and paraphrase detection, and see how much it helps in improving the performance on the target dataset. To the best of our knowledge, no previous work has used this kind of transfer learning. We would also like to analyze which part of the network to transfer from source to target depending on the type of application. A small sketch of the idea is given below.
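As a concrete but hypothetical illustration of this direction, the sketch below treats a seq2seq encoder's final hidden state as the context vector and feeds it, frozen, to a small classifier. The encoder here is freshly initialized for illustration only; in the actual plan it would be loaded from a trained NMT model.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab=20000, embed_dim=100, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)

    def forward(self, x):
        _, h = self.gru(self.embed(x))
        return h[-1]                    # (batch, hidden) context vector

encoder = Encoder()                     # would come from an NMT checkpoint
for p in encoder.parameters():          # keep the NMT knowledge fixed
    p.requires_grad = False

classifier = nn.Linear(256, 2)          # e.g. positive/negative sentiment
sentences = torch.randint(0, 20000, (8, 15))   # dummy token ids
logits = classifier(encoder(sentences))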

8 Conclusions:
1. Transfer Learning is successful when we are dealing with semantically similar tasks.
2. It is most helpful when the target dataset is small.
3. The INIT method performs slightly better than MULT.
4. The benefit of Transfer Learning also depends on which layers are transferred.
5. Are we losing general information if the model is trained on the source data for best accuracy? The answer appears to be no, as evident from Figure 3: the accuracy on the Quora dataset peaked along with that on the SNLI dataset.

References

[1] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, Zhi Jin. How Transferable are Neural Networks in NLP Applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 478-489, 2016.

[2] Zhilin Yang, Ruslan Salakhutdinov, William W. Cohen. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. ICLR 2017.

[3] Seunghyun Yoon, Hyeongu Yun, Yuna Kim, Gyutae Park, Kyomin Jung. Efficient Transfer Learning Schemes for Personalized Language Modeling using Recurrent Neural Network. CoRR 2017, volume abs/1701.03578.

[4] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.

[5] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

[6] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724-1734, 2014.

[7] Engineering at Quora. https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning

[8] Sebastian Ruder's blog. http://ruder.io/transfer-learning/
