Figure 2. Illustration of the L2T framework: in the training stage, we have N_e transfer learning experiences {E_1, · · · , E_{N_e}} from which we learn a reflection function f encrypting transfer learning skills; in the testing stage, given the (N_e + 1)-th source-target pair and the learned reflection function f (step 1), we optimize the transferred knowledge between them, i.e., W^∗_{N_e+1}, by maximizing the value of f (step 2).
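The two stages in the caption can be sketched schematically. In this toy stand-in (all names and data are hypothetical, not the paper's implementation), an experience is reduced to a feature vector with its improvement ratio, and the reflection function is fit by plain least squares; in the paper, f is instead built from the MMD, variance, and discriminant terms of Section 3.3.

```python
import numpy as np

def l2t_train(feats, ratios):
    """Training stage: learn f such that f(S_e, T_e, W_e) ~ l_e.
    Here each experience is summarized by a feature vector."""
    w, *_ = np.linalg.lstsq(feats, ratios, rcond=None)
    return lambda feat: float(feat @ w)

def l2t_transfer(f, candidate_feats):
    """Testing stage: choose the transferred knowledge (a candidate W,
    represented by its features) that maximizes the learned f."""
    scores = [f(c) for c in candidate_feats]
    return int(np.argmax(scores))

# Toy usage: ratios grow with a single scalar feature.
feats = np.array([[1.0], [2.0], [3.0]])
ratios = np.array([1.1, 1.2, 1.3])
f = l2t_train(feats, ratios)
best = l2t_transfer(f, [np.array([0.5]), np.array([2.5])])
```

The point of the sketch is only the protocol: fit f on past experiences, then rank candidate transfers by f for a new pair.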
setting X^s_e = X^t_e and Y^s_e ≠ Y^t_e for each pair of domains. a_e ∈ A = {a_1, · · · , a_{N_a}} denotes the transfer learning algorithm that has been applied between S_e and T_e. Suppose that the knowledge transferred by the algorithm a_e can be parameterized as W_e. Finally, each transfer learning experience is labeled by the performance improvement ratio l_e = p^{st}_e / p^t_e, where p^t_e is the learning performance (e.g., classification accuracy) on a test dataset in T_e without transfer and p^{st}_e is that on the same test dataset after transferring W_e from S_e.

With N_e transfer learning experiences {E_1, · · · , E_{N_e}} as the input, the L2T agent learns a function f such that f(S_e, T_e, W_e) approximates l_e, as shown in the training stage of Figure 2. We call f a reflection function: it encrypts the meta-cognitive transfer learning skills of what and how to transfer so as to maximize the improvement ratio for a given pair of domains. Whenever a new pair of domains ⟨S_{N_e+1}, T_{N_e+1}⟩ arrives, the L2T agent can optimize the knowledge to be transferred, i.e., W^∗_{N_e+1}, by maximizing the value of f (see step 2 of the testing stage in Figure 2).

3.2. Parameterizing What to Transfer

The transfer learning algorithms applied can vary from experience to experience. Uniformly parameterizing "what to transfer" for any algorithm out of the base algorithm set A is therefore a prerequisite for learning the reflection function. In this work, we consider A to contain algorithms transferring single-level latent feature factors, because existing parameter-based and instance-based algorithms cannot address the transfer learning setting we focus on (i.e., X^s_e = X^t_e and Y^s_e ≠ Y^t_e). Though a few parameter-based algorithms (Yang et al., 2007a; Tommasi et al., 2014) can transfer across domains with heterogeneous label spaces, they can only handle binary classification problems. Deep neural network based algorithms (Yosinski et al., 2014; Long et al., 2015; Tzeng et al., 2015), which transfer latent feature factors at multiple levels, are left for future research. As a result, we parameterize what to transfer with a latent feature factor matrix W, which is elaborated in the following.

Latent feature factor based algorithms aim to learn domain-invariant feature factors across domains. Consider classifying dog pictures as a source domain and cat pictures as a target domain: the domain-invariant feature factors may include eyes, mouths, tails, etc. What to transfer, in this case, is the set of shared feature factors across domains. The way of defining domain-invariant feature factors dictates two groups of latent feature factor based algorithms, i.e., common latent space based and manifold ensemble based algorithms.

Common Latent Space Based  This line of algorithms, including but not limited to TCA (Pan et al., 2011), LSDT (Zhang et al., 2016), and DIP (Baktashmotlagh et al., 2013), assumes that domain-invariant feature factors lie in a single shared latent space. We denote by φ the function mapping the original feature representation into the latent space. If φ is linear, it can be represented as an embedding matrix W ∈ R^{m×u}, where u is the dimensionality of the latent space. Therefore, we can parameterize the "what to transfer" we focus on with W, which describes u latent feature factors. Otherwise, if φ is nonlinear, what to transfer can still be parameterized with W. Though a nonlinear φ is not explicitly specified in most cases, such as LSDT using sparse coding, target examples represented in the latent space, Z^t_e = φ(X^t_e) ∈ R^{n^t_e × u}, are always available. Consequently, we obtain the similarity metric matrix (Cao et al., 2013) in the latent space, i.e., G = (X^t_e)^† Z^t_e (Z^t_e)^T [(X^t_e)^T]^† ∈ R^{m×m}, according to X^t_e G (X^t_e)^T = Z^t_e (Z^t_e)^T, where (X^t_e)^† is the pseudo-inverse of X^t_e. An LDL decomposition G = LDL^T then brings the latent feature factor matrix W = LD^{1/2}.

Manifold Ensemble Based  Initiated by Gopalan et al., manifold ensemble based algorithms consider that a source and a target domain share multiple subspaces (of the same dimension), taken as points on the Grassmann manifold between them. The representation of target examples on u domain-invariant latent factors turns to Z^{t(n_u)}_e = [φ_1(X^t_e), · · ·, φ_{n_u}(X^t_e)] ∈ R^{n^t_e × n_u u}, if n_u subspaces on the manifold are sampled. When all continuous subspaces on the manifold are sampled, i.e., n_u → ∞, Gong et al. proved
Transfer Learning via Learning to Transfer
that Z^{t(∞)}_e (Z^{t(∞)}_e)^T = X^t_e G (X^t_e)^T, where G is the similarity metric matrix. For the computational details of G, please refer to (Gong et al., 2012). W = LD^{1/2}, with L and D obtained by performing an LDL decomposition on G = LDL^T, is therefore also qualified to represent latent feature factors distributed in a series of subspaces on a manifold.

3.3. Learning from Experiences

The goal here is to learn a reflection function f such that f(S_e, T_e, W_e) can approximate l_e for all experiences {E_1, · · · , E_{N_e}}. The improvement ratio l_e is closely related to two aspects: 1) the difference between a source and a target domain in the shared latent space, and 2) the discriminative ability of a target domain in the latent space. A smaller difference guarantees more overlap between domains in the latent space, while a higher discriminative ability promises better learning performance in the target domain.

The Difference between a Source and a Target Domain  We follow (Pan et al., 2011) and adopt the maximum mean discrepancy (MMD) (Gretton et al., 2012b) to measure the difference between domains. By mapping two domains into the reproducing kernel Hilbert space (RKHS), the MMD empirically evaluates the distance between the mean of source examples and that of target examples:

d̂²_e(X^s_e W_e, X^t_e W_e) = ‖ (1/n^s_e) Σ_{i=1}^{n^s_e} φ(x^s_{ei} W_e) − (1/n^t_e) Σ_{j=1}^{n^t_e} φ(x^t_{ej} W_e) ‖²_H
= (1/(n^s_e)²) Σ_{i,i′=1}^{n^s_e} K(x^s_{ei} W_e, x^s_{ei′} W_e) + (1/(n^t_e)²) Σ_{j,j′=1}^{n^t_e} K(x^t_{ej} W_e, x^t_{ej′} W_e) − (2/(n^s_e n^t_e)) Σ_{i,j} K(x^s_{ei} W_e, x^t_{ej} W_e),   (1)

where x^t_{ej} is the j-th example in X^t_e, and φ maps from the u-dimensional latent space to the RKHS H. K(·, ·) = ⟨φ(·), φ(·)⟩ is the kernel function. Different kernels K lead to different MMD distances and thereby different values of f. Thus, learning the reflection function f is equivalent to optimizing K so that the MMD distance can well characterize the improvement ratio l_e for all pairs of domains. Inspired by multi-kernel MMD (Gretton et al., 2012b), we parameterize K as a linear combination of N_k PSD kernels, i.e., K = Σ_{k=1}^{N_k} β_k K_k (β_k ≥ 0, ∀k), and learn the coefficients β = [β_1, · · ·, β_{N_k}] instead. Using β, the MMD can be rewritten as d̂²_e(X^s_e W_e, X^t_e W_e) = Σ_{k=1}^{N_k} β_k d̂²_{e(k)}(X^s_e W_e, X^t_e W_e) = β^T d̂_e, where d̂_e = [d̂²_{e(1)}, · · ·, d̂²_{e(N_k)}] with d̂²_{e(k)} computed by the k-th kernel K_k. In this paper, we consider RBF kernels K_k(a, b) = exp(−‖a − b‖²/δ_k) by varying the bandwidth δ_k.

Unfortunately, the MMD alone is insufficient to measure the difference between domains. The distance variance among all pairs of instances across domains is also required to fully characterize the difference: a pair of domains with a small MMD but an extremely high variance still has little overlap. Equation (1) is actually the empirical estimate of d²_e(X^s_e W_e, X^t_e W_e) = E_{x^s_e x^{s′}_e x^t_e x^{t′}_e} h(x^s_e, x^{s′}_e, x^t_e, x^{t′}_e) (Gretton et al., 2012b), where h(x^s_e, x^{s′}_e, x^t_e, x^{t′}_e) = K(x^s_e W_e, x^{s′}_e W_e) + K(x^t_e W_e, x^{t′}_e W_e) − K(x^s_e W_e, x^{t′}_e W_e) − K(x^{s′}_e W_e, x^t_e W_e). Consequently, the distance variance σ²_e equals

σ²_e(X^s_e W_e, X^t_e W_e) = E_{x^s_e x^{s′}_e x^t_e x^{t′}_e}[(h(x^s_e, x^{s′}_e, x^t_e, x^{t′}_e) − Eh)²] = β^T Q_e β,

where the (k_1, k_2)-th entry of Q_e is cov(h_{k_1}, h_{k_2}) = E[(h_{k_1} − Eh_{k_1})(h_{k_2} − Eh_{k_2})]. Note that Eh_{k_1} is shorthand for E_{x^s_e x^{s′}_e x^t_e x^{t′}_e} h_{k_1}(x^s_e, x^{s′}_e, x^t_e, x^{t′}_e), where h_{k_1} is calculated using the k_1-th kernel. We detail the empirical estimate Q̂_e of Q_e in the supplementary material due to the page limit.

The Discriminative Ability of a Target Domain  In view of the limited labeled examples in a target domain, we resort to unlabeled examples to evaluate the discriminative ability. The principles of the unlabeled discriminant criterion are two-fold: 1) similar examples should still be neighbours after being embedded into the latent space; and 2) dissimilar examples should be far away. We adopt the unlabeled discriminant criterion proposed in (Yang et al., 2007b),

τ_e = tr(W_e^T S^N_e W_e) / tr(W_e^T S^L_e W_e),

where S^L_e = Σ_{j,j′=1}^{n^t_e} (H_{jj′}/(n^t_e)²) (x^t_{ej} − x^t_{ej′})(x^t_{ej} − x^t_{ej′})^T. If x^t_{ej} and x^t_{ej′} are mutual r-nearest neighbours of each other, H_{jj′} equals the kernel value K(x^t_{ej}, x^t_{ej′}). By maximizing the unlabeled discriminant criterion τ_e, the local scatter covariance matrix S^L_e guarantees the first principle, while

S^N_e = Σ_{j,j′=1}^{n^t_e} ((K(x^t_{ej}, x^t_{ej′}) − H_{jj′})/(n^t_e)²) (x^t_{ej} − x^t_{ej′})(x^t_{ej} − x^t_{ej′})^T,

the non-local scatter covariance matrix, enforces the second principle. τ_e also depends on kernels, which in this case indicate different neighbour information and different degrees of similarity between neighboured examples. With τ_{e(k)} obtained from the k-th kernel K_k, the unlabeled discriminant criterion τ_e can be written as τ_e = Σ_{k=1}^{N_k} β_k τ_{e(k)} = β^T τ_e, where τ_e = [τ_{e(1)}, · · · , τ_{e(N_k)}].
The Optimization Problem  Combining the two aspects above to model the reflection function f, we finally formulate the optimization problem as follows:

β^∗, λ^∗, μ^∗, b^∗ = arg min_{β,λ,μ,b} Σ_{e=1}^{N_e} L_h( β^T d̂_e + λ β^T Q̂_e β + μ/(β^T τ_e) + b, 1/l_e ) + γ_1 R(β, λ, μ, b),
s.t. β_k ≥ 0, ∀k ∈ {1, · · · , N_k}, λ ≥ 0, μ ≥ 0,   (2)

where 1/f = β^T d̂_e + λ β^T Q̂_e β + μ/(β^T τ_e) + b, and L_h(·) is the Huber regression loss (Huber et al., 1964) constraining the value of 1/f to be as close to 1/l_e as possible. γ_1 controls the complexity of the parameters via l2-regularization. Minimizing the difference between domains, including the MMD distance β^T d̂_e and the distance variance β^T Q̂_e β, while maximizing the discriminant criterion β^T τ_e in the target domain, contributes to a large performance improvement ratio l_e (i.e., a small 1/l_e). λ and μ balance the importance of the three terms in f, and b is the bias term.

3.4. Inferring What to Transfer

Once the L2T agent has learned the reflection function f(S, T, W; β^∗, λ^∗, μ^∗, b^∗), it takes advantage of the function to optimize what to transfer, i.e., the latent feature factor matrix W, for a newly arrived source domain S_{N_e+1} and target domain T_{N_e+1}. The optimal latent feature factor matrix W^∗_{N_e+1} should maximize the value of f. To this end, we optimize the following objective with regard to W:

W^∗_{N_e+1} = arg max_W f(S_{N_e+1}, T_{N_e+1}, W; β^∗, λ^∗, μ^∗, b^∗) − γ_2 ‖W‖²_F
= arg min_W (β^∗)^T d̂_W + λ^∗ (β^∗)^T Q̂_W β^∗ + μ^∗/((β^∗)^T τ_W) + γ_2 ‖W‖²_F,   (3)

where ‖·‖_F denotes the matrix Frobenius norm and γ_2 controls the complexity of W. The first and second terms in problem (3) can be calculated as

(β^∗)^T d̂_W = Σ_{k=1}^{N_k} β^∗_k [ (1/a²) Σ_{i,i′=1}^{a} K_k(v_i W, v_{i′} W) + (1/b²) Σ_{j,j′=1}^{b} K_k(w_j W, w_{j′} W) − (2/(ab)) Σ_{i,j=1}^{a,b} K_k(v_i W, w_j W) ],

(β^∗)^T Q̂_W β^∗ = (1/(n² − 1)) Σ_{k=1}^{N_k} β^∗_k Σ_{i,i′=1}^{n} [ K_k(v_i W, v_{i′} W) + K_k(w_i W, w_{i′} W) − 2 K_k(v_i W, w_{i′} W) − (1/n²) Σ_{i,i′=1}^{n} ( K_k(v_i W, v_{i′} W) + K_k(w_i W, w_{i′} W) − 2 K_k(v_i W, w_{i′} W) ) ]²,

where the shorthands v_i = x^s_{(N_e+1)i}, v_{i′} = x^s_{(N_e+1)i′}, w_j = x^t_{(N_e+1)j}, w_{j′} = x^t_{(N_e+1)j′}, a = n^s_{N_e+1}, and b = n^t_{N_e+1} are used due to the space limit. Note that n = min(n^s_{N_e+1}, n^t_{N_e+1}). The third term in problem (3) can be computed as (β^∗)^T τ_W = Σ_{k=1}^{N_k} β^∗_k tr(W^T S^N_k W)/tr(W^T S^L_k W). We optimize the non-convex problem (3) w.r.t. W by employing a conjugate gradient method, in which the gradient is listed in the supplementary material.

4. Stability and Generalization Bounds

In this section, we theoretically investigate how previous transfer learning experiences influence a transfer learning task of interest. We also provide and prove the algorithmic stability and generalization bound for latent feature factor based transfer learning algorithms without experiences in the supplementary material.

Consider S = {⟨S_1, T_1⟩, · · ·, ⟨S_{N_e}, T_{N_e}⟩} to be N_e transfer learning experiences, or so-called meta-samples (Maurer, 2005). Let L(S) be our algorithm that learns meta-cognitive knowledge from the N_e transfer learning experiences in S and applies the knowledge to the (N_e+1)-th transfer learning task ⟨S_{N_e+1}, T_{N_e+1}⟩. To analyse the stability and give the generalization bound, we make an assumption on the distribution from which all N_e transfer learning experiences, as meta-samples, are sampled: for every environment E, all N_e pairs of source and target domains in S are drawn according to an algebraic β-mixing stationary distribution (D_E)^{N_e}, which is not i.i.d. Intuitively, the algebraic β-mixing stationary distribution (see Definition 2 in (Mohri & Rostamizadeh, 2010)), with the β-mixing coefficient β(m) ≤ β_0/m^r, models the dependence between future samples and past samples separated by a distance of at least m. The independent block technique (Bernstein, 1927) has been widely adopted to deal with such non-i.i.d. learning problems. Under this assumption, L(S) is uniformly stable.

Theorem 1. Suppose that for any x^t_e and any y^t_e we have ‖x^t_e‖₂ ≤ r_x and |y^t_e| ≤ B. Meanwhile, for any e-th transfer learning experience, we assume that the latent feature factor matrix satisfies ‖W_e‖ ≤ r_W. To meet the assumption above, we reasonably simplify L(S) so that the latent feature factor matrix for the (N_e + 1)-th transfer learning task is a linear combination of all N_e historical latent feature factor matrices plus a noisy latent feature matrix W̃ satisfying ‖W̃‖ ≤ r̃, i.e., W_{N_e+1} = Σ_{e=1}^{N_e} c_e W_e + W̃ with each coefficient 0 ≤ c_e ≤ 1. Our algorithm L(S) is uniformly stable: for any ⟨S, T⟩ as the coming transfer learning task, the following inequality holds:

| l_emp(L(S), ⟨S, T⟩) − l_emp(L(S^{e_0}), ⟨S, T⟩) | ≤ 4(4N_e − 3 + r̃/r_W) B² r_x / (λ N_e²) ∼ O( B² r_x / (λ N_e) ),   (4)

where S = {⟨S_1, T_1⟩, · · · , ⟨S_{e_0−1}, T_{e_0−1}⟩, ⟨S_{e_0}, T_{e_0}⟩, ⟨S_{e_0+1}, T_{e_0+1}⟩, · · · , ⟨S_{N_e}, T_{N_e}⟩} denotes the full set of meta-samples, and S^{e_0} = {⟨S_1, T_1⟩, · · · , ⟨S_{e_0−1}, T_{e_0−1}⟩, ⟨S′_{e_0}, T′_{e_0}⟩, ⟨S_{e_0+1}, T_{e_0+1}⟩, · · · , ⟨S_{N_e}, T_{N_e}⟩} represents the meta-samples with the e_0-th meta-example replaced by ⟨S′_{e_0}, T′_{e_0}⟩.

By generalizing S to be meta-samples and h_S to be the L2T algorithm L(S), we apply Corollary 21 in (Mohri & Rostamizadeh,
2010) to give the generalization bound of our algorithm L(S) in Theorem 2.

Theorem 2. Let δ′ = δ − (N_e)^{1/(2(r+1)) − 1/4} (r > 1 is required). Then for any sample S of size N_e drawn according to an algebraic β-mixing stationary distribution, and any δ > 0 such that δ′ ≥ 0, the following generalization bound holds with probability at least 1 − δ:

R(L(S)) − R̂_{N_e}(L(S)) < O( (N_e)^{1/(2(r+1)) − 1/4} log(1/δ′) ),

where R(L(S)) and R̂_{N_e}(L(S)) denote the expected risk and the empirical risk of L2T over meta-samples, respectively. A larger mixing parameter r, indicating more independence, would lead to a tighter bound.

Theorem 2 tells us that as the number of transfer learning experiences, i.e., N_e, increases, L2T tends to produce a tighter generalization bound. This fact lays the foundation for further conducting L2T in an online manner, which can gradually assimilate transfer learning experiences and continuously improve. The detailed proofs of Theorems 1 and 2 can be found in the supplementary material.

5. Experiments

Datasets  We evaluate the L2T framework on two image datasets, Caltech-256 (Griffin et al., 2007) and Sketches (Eitz et al., 2012). Caltech-256, collected from Google Images, contains a total of 30,607 images in 256 categories. The generated training experiences, lying in the same environment (generated by one dataset), are non-i.i.d., which fits the algebraic β-mixing assumption of Section 4.

Baselines and Evaluation Metrics  We compare L2T with the following nine baseline algorithms in three classes:

• Non-transfer: Original builds a model using labeled data in a target domain only.
• Common latent space based transfer learning algorithms: TCA (Pan et al., 2011), ITL (Shi & Sha, 2012), CMF (Long et al., 2014), LSDT (Zhang et al., 2016), STL (Raina et al., 2007), DIP (Baktashmotlagh et al., 2013) and SIE (Baktashmotlagh et al., 2014).
• Manifold ensemble based algorithms: GFK (Gong et al., 2012).

The eight feature-based transfer learning algorithms also constitute the base set A. Based on the feature representations obtained by the different algorithms, we use the nearest-neighbor classifier to perform three-class classification for the target domain.

One evaluation metric is classification accuracy on the testing examples of a target domain. However, accuracies are incomparable across target domains at different levels of difficulty. The other evaluation metric we adopt is therefore the performance improvement ratio defined in Section 3.1, so as to compare L2T over different pairs of domains.

[Figure 3. Classification accuracies on six pairs of source and target domains: (a) galaxy/harpsichord/saturn → kangaroo/standing-bird/sun; (b) bat/mountain-bike/saddle → bush/person/walkie-talkie; (c) microwave/spider/watch → spoon/trumpet/wheel; (d) bridge/harp/traffic-light → door-handle/hand/present; (e) bridge/helicopter/tripod → key/parrot/traffic-light; (f) calculator/straw/french-horn → doorknob/palm-tree/scissors. Each panel plots classification accuracy against the number of labeled examples in the target domain (0-120) for Original, TCA, ITL, CMF, LSDT, GFK, STL, DIP, SIE, and L2T.]
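For concreteness, the evaluation protocol (nearest-neighbor classification on transferred features, scored by the improvement ratio l_e = p^st_e / p^t_e) can be sketched as follows; the helper names and data layout are our own, not the paper's code.

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y):
    """1-nearest-neighbor classification accuracy (the evaluation classifier)."""
    d = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    pred = train_y[d.argmin(axis=1)]
    return float((pred == test_y).mean())

def improvement_ratio(train_X, train_y, test_X, test_y, W):
    """l_e = p_st / p_t: accuracy on features projected by the transferred W,
    relative to accuracy on the original features (no transfer)."""
    p_t = nn_accuracy(train_X, train_y, test_X, test_y)
    p_st = nn_accuracy(train_X @ W, train_y, test_X @ W, test_y)
    return p_st / p_t
```

A ratio above 1 indicates that transferring W helped; below 1 indicates negative transfer.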
algorithms behave differently. The transferable knowledge learned by LSDT helps a target domain a lot when training examples are scarce, while GFK performs poorly until more training examples become available. STL is almost the worst baseline because it learns a dictionary from the source domain only and ignores the target domain; it runs a high risk of failure, especially when the two domains are distant. DIP and SIE, which minimize the MMD and the Hellinger distance between domains subject to manifold constraints, are competent. Note that we have run the paired t-test between L2T and each baseline, with all the p-values on the order of 10^{−12}, concluding that L2T is significantly superior.

We also randomly select six of the 500 testing pairs and compare the classification accuracies of the different algorithms for each pair in Figure 3. The performance of all baselines varies from pair to pair. Among all the baseline methods, TCA performs the best when transferring between the domains in Figure 3a, and LSDT is the most superior in Figure 3c. However, L2T consistently outperforms the baselines in all the settings. For some pairs, e.g., Figures 3a, 3c and 3f, the three classes in the target domain are comparably easy to tell apart, hence Original without transfer can achieve even better results than some transfer learning algorithms. In this case, L2T still improves by discovering the best transferable knowledge from the source domain, especially when the number of labeled examples is small (see Figures 3c and 3f). If two domains are very related, e.g., the source with "galaxy" and "saturn" and the target with "sun" in Figure 3a, L2T even finds more transferable knowledge and contributes a more significant improvement.

Varying the Experiences  We further investigate how the transfer learning experiences used to learn the reflection function influence the performance of L2T. In this experiment, we evaluate on 50 randomly sampled pairs out of the 500 testing pairs in order to efficiently investigate a wide range of cases. The sampled set is unbiased and sufficient to characterize such influence, evidenced by the asymptotic consistency between the average performance improvement ratio on the 500 pairs in Figure 4 and that on the 50 pairs in the last line of Table 1. First, we fix the number of transfer learning experiences to 1,000 and vary the set of base transfer learning algorithms. The results are shown in Table 1. Even with experiences generated by a single base algorithm, e.g., ITL or DIP, L2T can still learn a reflection function that decides what to transfer significantly better (p-value < 0.05) than using ITL or DIP directly. With more base algorithms involved, the transfer learning experiences become more diverse, covering more situations of source-target pairs and of the knowledge transferred between them. As a result, L2T learns a better reflection function and thereby achieves higher performance improvement ratios, which coincides with Theorem 2, where a larger r, indicating more independence between experiences, gives a tighter bound. Second, we fix the set of base algorithms to include all eight baselines and vary the number of transfer learning experiences used for training. As shown in Figure 5, the average performance improvement ratio achieved by L2T tends to increase as the number of labeled examples in the target domain decreases, given that Original without transfer performs extremely poorly with scarce labeled examples.
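The significance claims above rest on a paired t-test over the same set of source-target pairs. A minimal sketch of the paired t-statistic follows (our own helper; in practice the p-value is then read from the Student-t distribution with n − 1 degrees of freedom, e.g., via scipy.stats):

```python
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic for two methods scored on the same pairs:
    t = mean(d) / sqrt(var(d) / n), with d the per-pair score differences."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Pairing matters here: each testing pair is evaluated by both methods, so differences are taken per pair rather than comparing two independent samples.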
Table 1. The performance improvement ratios by varying the approaches used to generate transfer learning experiences. For example, "ITL + L2T" denotes L2T learning from experiences generated by ITL only, and the second line of results for "ITL + L2T" is the p-value compared to ITL.

# of labeled examples               3       15      30      45      60      75      90      105     120
TCA                                 1.0181  1.0024  0.9965  0.9973  0.9941  0.9933  0.9938  0.9927  0.9928
ITL                                 1.0188  1.0248  1.0250  1.0254  1.0250  1.0224  1.0232  1.0224  1.0224
CMF                                 0.9607  1.0203  1.0224  1.0218  1.0190  1.0158  1.0144  1.0142  1.0125
LSDT                                1.0828  1.0168  0.9988  0.9940  0.9895  0.9867  0.9854  0.9834  0.9837
GFK                                 0.9729  1.0180  1.0232  1.0243  1.0246  1.0219  1.0239  1.0229  1.0225
STL                                 0.9973  0.9771  0.9715  0.9713  0.9715  0.9694  0.9705  0.9693  0.9693
DIP                                 1.0875  1.0633  1.0518  1.0465  1.0425  1.0372  1.0365  1.0343  1.0317
SIE                                 1.0745  1.0579  1.0485  1.0448  1.0412  1.0359  1.0359  1.0334  1.0318
ITL + L2T                           1.1210  1.0737  1.0577  1.0506  1.0456  1.0398  1.0394  1.0361  1.0359
  (p-value vs. ITL)                 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0002  0.0002
DIP + L2T                           1.1605  1.0927  1.0718  1.0620  1.0562  1.0500  1.0483  1.0461  1.0451
  (p-value vs. DIP)                 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
(LSDT/GFK/SIE) + L2T                1.1660  1.0973  1.0746  1.0652  1.0573  1.0506  1.0485  1.0451  1.0429
  (p-value)                         0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
(TCA/ITL/CMF/GFK/LSDT/SIE) + L2T    1.1712  1.0954  1.0707  1.0607  1.0529  1.0469  1.0449  1.0421  1.0416
  (p-value)                         0.0000  0.0000  0.0001  0.0001  0.0106  0.0019  0.0002  0.0047  0.0106
all + L2T                           1.1872  1.1054  1.0795  1.0699  1.0616  1.0551  1.0531  1.0500  1.0502
[Figure 5. Varying the number of transfer learning experiences: performance improvement ratio vs. the number of labeled examples in a target domain, for 50 to 1,000 experiences.]

[Figure 6. Varying the components constituted in f: MMD, Variance, and Discriminant alone, their pairwise combinations, and the full L2T.]

[Figure 7. Varying the number of kernels considered in f: bandwidth ranges from [2^{-4} : 2^{4}] up to [2^{-12} : 2^{12}].]
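The kernel ranges in Figure 7, written in the form [2^{−8}η : 2^{0.5}η : 2^{8}η] for a base bandwidth η, are geometric grids of powers of two in half-steps. A small helper (the function name and defaults are ours) that generates such a grid:

```python
def bandwidth_grid(eta, lo=-8.0, hi=8.0, step=0.5):
    """Bandwidths eta * 2^p for p = lo, lo+step, ..., hi, i.e., the range
    written [2^lo * eta : 2^step * eta : 2^hi * eta] in the text."""
    n = int(round((hi - lo) / step)) + 1
    return [eta * 2.0 ** (lo + i * step) for i in range(n)]
```

The default range yields 33 bandwidths; widening to lo = −12, hi = 12 reproduces the largest grid reported in Figure 7.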
More importantly, it increases as the number of experiences increases, which coincides with Theorem 2.

Varying the Reflection Function  We also study the influence of different configurations of the reflection function on the performance of L2T. First, we vary the components considered in building the reflection function f, as shown in Figure 6. Considering a single type, either the MMD, the variance, or the discriminant criterion, brings inferior performance and even negative transfer. L2T, taking all three factors into consideration, outperforms the others, demonstrating that the three components are all necessary and mutually reinforcing. With all three components included, we plot the values of the learned β^∗ in the supplementary material. Second, we change the kernels used. In Figure 7, we present results obtained by either narrowing down or extending the range [2^{−8}η : 2^{0.5}η : 2^{8}η]. Obviously, more kernels (e.g., [2^{−12}η : 2^{0.5}η : 2^{12}η]), capable of encrypting better transfer learning skills in the reflection function, achieve larger performance improvement ratios.

6. Conclusion

In this paper, we propose a novel L2T framework for transfer learning which automatically optimizes what and how to transfer between a source and a target domain by leveraging previous transfer learning experiences. In particular, L2T learns a reflection function mapping a pair of domains and the knowledge transferred between them to the performance improvement ratio. When a new pair of domains arrives, L2T optimizes what and how to transfer by maximizing the value of the learned reflection function. We believe that L2T opens a new door to improving transfer learning by leveraging transfer learning experiences. Many research issues, e.g., incorporating hierarchical latent feature factors as what to transfer and designing an online L2T, can be further examined.
Acknowledgements

We thank the reviewers for their valuable comments to improve this paper. The research has been supported by National Grant Fundamental Research (973 Program) of China under Project 2014CB340304, Hong Kong CERG projects 16211214/16209715/16244616, Hong Kong ITF ITS/391/15FX and NSFC 61673202.

References

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In ICLR, 2018.

Argyriou, A., Evgeniou, T., and Pontil, M. Multi-task feature learning. In NIPS, pp. 41–48, 2007.

Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In ICCV, pp. 769–776, 2013.

Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Domain adaptation on the statistical manifold. In CVPR, pp. 2481–2488, 2014.

Belmont, J. M., Butterfield, E. C., Ferretti, R. P., et al. To secure transfer of training instruct self-management skills. In Detterman, D. K. and Sternberg, R. J. P. (eds.), How and How Much Can Intelligence be Increased, pp. 147–154. Ablex Norwood, NJ, 1982.

Bernstein, S. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97(1):1–59, 1927.

Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120–128, 2006.

Cao, Q., Ying, Y., and Li, P. Similarity metric learning for face recognition. In ICCV, pp. 2408–2415, 2013.

Caruana, R. Multitask learning. Machine Learning, 28:41–75, 1997.

Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. Boosting for transfer learning. In ICML, pp. 193–200, 2007.

Eitz, M., Hays, J., and Alexa, M. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135, 2017.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073, 2012.

Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pp. 999–1006, 2011.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. JMLR, 13(Mar):723–773, 2012a.

Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In NIPS, pp. 1205–1213, 2012b.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

Huber, P. J. et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Long, M., Wang, J., Ding, G., Shen, D., and Yang, Q. Transfer learning with graph co-regularization. TKDE, 26(7):1805–1818, 2014.

Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In ICML, pp. 97–105, 2015.

Luria, A. R. Cognitive Development: Its Cultural and Social Foundations. Harvard University Press, 1976.

Maurer, A. Algorithmic stability and meta-learning. JMLR, 6(Jun):967–994, 2005.

Mo, K., Li, S., Zhang, Y., Li, J., and Yang, Q. Personalizing a dialogue system with transfer learning. arXiv preprint arXiv:1610.02891, 2016.

Mohri, M. and Rostamizadeh, A. Stability bounds for stationary ϕ-mixing and β-mixing processes. JMLR, 11(Feb):789–814, 2010.

Pan, S. J. and Yang, Q. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.

Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. TNN, 22(2):199–210, 2011.

Pentina, A. and Lampert, C. H. Lifelong learning with non-iid tasks. In NIPS, pp. 1540–1548, 2015.

Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: transfer learning from unlabeled data. In ICML, pp. 759–766, 2007.
Yang, J., Zhang, D., Yang, J.-y., and Niu, B. Globally maxi-
mizing, locally minimizing: unsupervised discriminant
projection with applications to face and palm biometrics.
TPAMI, 29(4), 2007b.