Technical Report
CSD-TR-03-02
May 28, 2003
We present a general method using kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text. The semantic space provides a common representation and enables a comparison between the text and images. In the experiments we look at two approaches to retrieving images based only on their content from a text query. We compare the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
Keywords: Canonical correlation analysis, kernel canonical correlation analysis, partial Gram-Schmidt orthogonalisation, Cholesky decomposition, incomplete Cholesky decomposition, kernel methods.
1 Introduction
During recent years there have been advances in data learning using kernel methods. Kernel representation offers an alternative to learning non-linear functions directly, by projecting the data into a high dimensional feature space to increase the computational power of linear learning machines, though this still leaves open the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.
Principal Component Analysis (PCA) is a multivariate data analysis procedure that involves a transformation of a number of possibly correlated variables into a smaller number of uncorrelated variables known as principal components. PCA only makes use of the training inputs while making no use of the labels.
Independent Component Analysis (ICA), in contrast to correlation-based transformations such as PCA, not only decorrelates the signals but also reduces higher-order statistical dependencies, attempting to make the signals as independent as possible. In other words, ICA is a way of finding a linear, not necessarily orthogonal, co-ordinate system in any multivariate data. The directions of the axes of this co-ordinate system are determined by both the second and higher order statistics of the original data. The goal is to perform a linear transform which makes the resulting variables as statistically independent from each other as possible.
Canonical Correlation Analysis (CCA) is a method of correlating linear relationships between two multidimensional variables. CCA can be seen as using complex labels as a way of guiding feature selection towards the underlying semantics. CCA makes use of two views of the same semantic object to extract the representation of the semantics. The main difference between CCA and the methods above is that CCA is closely related to mutual information (Borga 1998 [3]). Hence CCA can be easily motivated in information-based tasks and is our natural selection.
Proposed by H. Hotelling in 1936 [12], CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximised. In an attempt to increase the flexibility of the feature selection, kernelisation of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in some preliminary work by Fyfe & Lai [8], Akaho [1] and, more recently, Vinokourov et al. [19] with improved results.
In this study we follow the work of Borga [4] where we represent the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and dimensionality of the eigenvectors.
Further to that, we follow the idea of Bach & Jordan [2] to compute a new correlation matrix with reduced dimensionality. Though Bach & Jordan [2] address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that partial Gram-Schmidt orthogonalisation [6] is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen as a dual implementation of partial Gram-Schmidt.
In this study we also present a generalisation of the framework for canonical correlation analysis. Our approach is based on the works of Gifi (1990) and Ketterling (1971). The purpose of the generalisation is to extend canonical correlation as an associativity measure between two sets of variables to more than two sets, whilst preserving most of its properties.
The generalisation starts with the optimisation problem formulation of canonical correlation. By changing the objective function we will arrive at the multiset problem. Applying similar constraint sets in the optimisation problems, we find that the feasible solutions are singular vectors of matrices, which are derived in the same way for the original and the generalised problem.
In Section 2 we present the theoretical background of CCA. In Section 3 we present the CCA and KCCA algorithms, and in Section 4 we address the computational issues that arise. Section 5 presents the experimental results, Section 6 the generalisation of canonical correlation analysis, and Section 7 draws conclusions.
2 Theoretical Foundations
Proposed by H. Hotelling in 1936 [12], Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximised. Correlation analysis is dependent on the co-ordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, depending on the co-ordinate system used, this relationship might not be visible as a correlation. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed the corresponding co-ordinates are maximally correlated.
Using $S_x$ and $S_y$ to denote the two samples, this gives

$$\rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y).$$

We can rewrite the correlation expression as

$$\rho = \max_{w_x, w_y} \frac{\hat{E}[\langle w_x, x \rangle \langle w_y, y \rangle]}{\sqrt{\hat{E}[\langle w_x, x \rangle^2]\,\hat{E}[\langle w_y, y \rangle^2]}} = \max_{w_x, w_y} \frac{w_x' \hat{E}[x y'] w_y}{\sqrt{w_x' \hat{E}[x x'] w_x \cdot w_y' \hat{E}[y y'] w_y}},$$
where we use $A'$ to denote the transpose of a vector or matrix $A$.
Now observe that the covariance matrix of $(x, y)$ is

$$C(x, y) = E\left[\begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}'\right] = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}. \tag{2.1}$$
Hence, we can rewrite the function $\rho$ as

$$\rho = \max_{w_x, w_y} \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \cdot w_y' C_{yy} w_y}}; \tag{2.2}$$
the maximum canonical correlation is the maximum of $\rho$ with respect to $w_x$ and $w_y$.
3 Algorithm
In this section we will give an overview of the Canonical correlation analysis (CCA) and Kernel-CCA (KCCA) algorithms, where we formulate the optimisation problem as a generalised eigenproblem.
3.1 Canonical Correlation Analysis
Observe that the solution of equation (2.2) is not affected by re-scaling $w_x$ or $w_y$, either together or independently, so that for example replacing $w_x$ by $\alpha w_x$ gives the quotient

$$\frac{\alpha w_x' C_{xy} w_y}{\sqrt{\alpha^2 w_x' C_{xx} w_x \cdot w_y' C_{yy} w_y}} = \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \cdot w_y' C_{yy} w_y}}.$$

Since the choice of re-scaling is therefore arbitrary, the CCA optimisation problem is equivalent to maximising the numerator subject to

$$w_x' C_{xx} w_x = 1$$
$$w_y' C_{yy} w_y = 1.$$
The corresponding Lagrangian is

$$L(\lambda, w_x, w_y) = w_x' C_{xy} w_y - \frac{\lambda_x}{2}\left(w_x' C_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y' C_{yy} w_y - 1\right).$$
Taking derivatives in respect to $w_x$ and $w_y$ we obtain

$$\frac{\partial f}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0 \tag{3.1}$$
$$\frac{\partial f}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0. \tag{3.2}$$
Subtracting $w_y'$ times the second equation from $w_x'$ times the first, we have

$$0 = w_x' C_{xy} w_y - \lambda_x w_x' C_{xx} w_x - w_y' C_{yx} w_x + \lambda_y w_y' C_{yy} w_y,$$

which together with the constraints implies that $\lambda_y - \lambda_x = 0$; let $\lambda = \lambda_x = \lambda_y$.
Assuming $C_{yy}$ is invertible, we have

$$w_y = \frac{C_{yy}^{-1} C_{yx} w_x}{\lambda} \tag{3.3}$$

and so substituting in equation (3.1) gives

$$\frac{C_{xy} C_{yy}^{-1} C_{yx} w_x}{\lambda} - \lambda C_{xx} w_x = 0$$
or

$$C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x. \tag{3.4}$$
As the covariance matrices $C_{xx}$ and $C_{yy}$ are symmetric positive definite, we are able to decompose them using a complete Cholesky decomposition (more details on Cholesky decomposition can be found in Section 4.2)
$$C_{xx} = R_{xx} \cdot R_{xx}'$$
where $R_{xx}$ is a lower triangular matrix. If we let $u_x = R_{xx}' \cdot w_x$ we are able to rewrite equation (3.4) as follows

$$C_{xy} C_{yy}^{-1} C_{yx} {R_{xx}'}^{-1} u_x = \lambda^2 R_{xx} u_x$$
$$R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} {R_{xx}'}^{-1} u_x = \lambda^2 u_x.$$
We are therefore left with a symmetric eigenproblem of the form $Ax = \lambda x$.
3.2 Kernel Canonical Correlation Analysis
CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher dimensional feature space

$$\phi : x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x)) \quad (n < N)$$
before performing CCA in the new feature space, essentially moving from the primal to the dual representation approach. Kernels are methods of implicitly mapping data into a higher dimensional feature space, a method known as the "kernel trick". A kernel is a function $K$, such that for all $x, z \in X$
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle \tag{3.5}$$
where $\phi$ is a mapping from $X$ to a feature space $F$. Kernels offer a great deal of flexibility, as they can be generated from other kernels. In the kernel the data only appears through entries in the Gram matrix; therefore this approach gives a further advantage, as the number of tuneable parameters and the updating time do not depend on the number of attributes being used.
The directions $w_x$ and $w_y$ (of length $N$) can be rewritten as the projection of the data onto the directions $\alpha$ and $\beta$ (of length $m$)

$$w_x = X'\alpha$$
$$w_y = Y'\beta.$$
Substituting into equation (2.2) we obtain the following

$$\rho = \max_{\alpha, \beta} \frac{\alpha' X X' Y Y' \beta}{\sqrt{\alpha' X X' X X' \alpha \cdot \beta' Y Y' Y Y' \beta}}. \tag{3.6}$$
Let $K_x = XX'$ and $K_y = YY'$ be the kernel matrices corresponding to the two representations. We substitute into equation (3.6)

$$\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\alpha' K_x^2 \alpha \cdot \beta' K_y^2 \beta}}. \tag{3.7}$$
We find that in equation (3.7) the variables are now represented in the dual form.
Observe that, as with the primal form presented in equation (2.2), equation (3.7) is not affected by re-scaling of $\alpha$ and $\beta$, either together or independently. Hence the KCCA optimisation problem is equivalent to maximising the numerator subject to
$$\alpha' K_x^2 \alpha = 1$$
$$\beta' K_y^2 \beta = 1.$$
The corresponding Lagrangian is

$$L(\lambda, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta - 1\right).$$
Taking derivatives in respect to $\alpha$ and $\beta$ we obtain

$$\frac{\partial f}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0$$
$$\frac{\partial f}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0.$$
Subtracting $\beta'$ times the second equation from $\alpha'$ times the first we have

$$0 = \alpha' K_x K_y \beta - \lambda_\alpha \alpha' K_x^2 \alpha - \beta' K_y K_x \alpha + \lambda_\beta \beta' K_y^2 \beta,$$

which together with the constraints implies that $\lambda_\alpha - \lambda_\beta = 0$; let $\lambda = \lambda_\alpha = \lambda_\beta$.
Considering the case where the kernel matrices $K_x$ and $K_y$ are invertible, we have

$$\beta = \frac{K_y^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{K_y^{-1} K_x \alpha}{\lambda},$$

and substituting in the first equation gives

$$K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0.$$
Hence

$$K_x K_x \alpha - \lambda^2 K_x K_x \alpha = 0$$

or

$$I\alpha = \lambda^2 I\alpha. \tag{3.10}$$
We are left with a generalised eigenproblem of the form $Ax = \lambda x$. We can deduce from equation (3.10) that $\lambda = 1$ for every vector $\alpha$; hence we can choose the projections $w_x$ to be unit vectors $j_i$, $i = 1, \ldots, m$, while $w_y$ are the columns of $\frac{1}{\lambda} K_y^{-1} K_x$, so that perfect correlation can always be formed. Since kernel methods provide high dimensional representations, such independence is not uncommon. It is therefore clear that a naive application of CCA in a kernel defined feature space will not provide useful results, as the small sketch below illustrates. In the next section we investigate how this problem can be avoided.
4 Computational Issues
We observe from equation (3.10) that if $K_x$ is invertible, maximal correlation is obtained, suggesting learning is trivial. To force non-trivial learning we introduce a control on the flexibility of the projections by penalising the norms of the associated weight vectors by a convex combination of constraints based on Partial Least Squares. Another computational issue that can arise is the use of large training sets, as this can lead to computational problems and degeneracy. To overcome this issue we apply partial Gram-Schmidt orthogonalisation (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.
4.1 Regularisation
To force non-trivial learning on the correlation we introduce a control on the flexibility of the projection mappings using Partial Least Squares (PLS) to penalise the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the denominator of equation (3.7), obtaining

$$\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \|w_x\|^2\right) \cdot \left(\beta' K_y^2 \beta + \kappa \|w_y\|^2\right)}} = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha\right) \cdot \left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta\right)}}.$$
We observe that the new regularised equation is not affected by re-scaling of $\alpha$ or $\beta$; hence the optimisation problem is subject to

$$\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha = 1$$
$$\beta' K_y^2 \beta + \kappa \beta' K_y \beta = 1.$$
The corresponding Lagrangian is

$$L(\lambda, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta - 1\right).$$
Taking derivatives in respect to $\alpha$ and $\beta$,

$$\frac{\partial f}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha\left(K_x^2 \alpha + \kappa K_x \alpha\right) \tag{4.1}$$
$$\frac{\partial f}{\partial \beta} = K_y K_x \alpha - \lambda_\beta\left(K_y^2 \beta + \kappa K_y \beta\right). \tag{4.2}$$
Proceeding as in the unregularised case, we let $\lambda = \lambda_\alpha = \lambda_\beta$ and obtain

$$\beta = \frac{(K_y + \kappa I)^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{(K_y + \kappa I)^{-1} K_x \alpha}{\lambda};$$

substituting in equation (4.1) gives

$$K_x K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 K_x (K_x + \kappa I)\alpha$$
$$K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 (K_x + \kappa I)\alpha$$
$$(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha.$$
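As an illustration, the regularised dual problem can be solved directly with a general eigensolver (a sketch of our own, ignoring for now the dimensionality reduction introduced below):

import numpy as np
from scipy.linalg import eig

def kcca_regularised(Kx, Ky, kappa):
    """Solve (Kx + kI)^{-1} Ky (Ky + kI)^{-1} Kx a = lambda^2 a."""
    m = Kx.shape[0]
    I = np.eye(m)
    A = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    lam2, alphas = eig(A)                       # A is non-symmetric
    order = np.argsort(-lam2.real)
    rho = np.sqrt(np.clip(lam2.real[order], 0.0, None))
    alpha = alphas.real[:, order]
    # recover beta as in the derivation: beta = (Ky + kI)^{-1} Kx alpha / lambda
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha) / np.where(rho > 0, rho, 1.0)
    return rho, alpha, beta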
4.2 Cholesky Decomposition

A simple, non-singular, symmetric matrix for which the factorisation is not possible is

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

On the other hand, if the symmetric matrix $A$ is positive definite (i.e., $x'Ax > 0$ if $x'x > 0$), then the factorisation is possible.
The partial Gram-Schmidt orthogonalisation, viewed as an incomplete Cholesky decomposition with symmetric pivoting, proceeds as follows:

1. Initialisation: $i = 1$, $K' = K$, $P = I$; for $j \in [1, N]$, $G_{jj} = K_{jj}$.

2. While $\sum_{j=i}^{N} G_{jj} > \eta$ and $i \neq N + 1$:

• Find the best new element: $j^* = \arg\max_{j \in [i, N]} G_{jj}$.

• Update the permutation $P$: $P_{\mathrm{next}} = I$, $P_{\mathrm{next}\,ii} = 0$, $P_{\mathrm{next}\,j^*j^*} = 0$, $P_{\mathrm{next}\,ij^*} = 1$, $P_{\mathrm{next}\,j^*i} = 1$; $P = P \cdot P_{\mathrm{next}}$.

• Permute the elements $i$ and $j^*$ of $K'$: $K' = P_{\mathrm{next}}\, K'\, P_{\mathrm{next}}$.

• Update the already calculated elements of $G$: $G_{i, 1:i-1} \leftrightarrow G_{j^*, 1:i-1}$.

• Calculate the $i$-th column of $G$: $G_{ii} = \sqrt{K'_{ii}}$ and

$$G_{i+1:n,\,i} = \frac{1}{G_{ii}}\left(K'_{i+1:n,\,i} - \sum_{j=1}^{i-1} G_{i+1:n,\,j}\, G_{ij}\right).$$

• Update only the diagonal elements: for $j \in [i+1, N]$, $G_{jj} = K_{jj} - \sum_{k=1}^{i} G_{jk}^2$; set $i = i + 1$.

3. Output: an $N \times M$ lower triangular matrix $G$ and a permutation matrix $P$ such that $\|P'KP - GG'\| \leq \eta$ (see Appendix 1.2 for a proof).
The algorithm involves picking one column of $K$ at a time, choosing the column to be added by greedily maximising a lower bound on the reduction in the error of the approximation of $K$.
The dual (feature-space) version of the procedure can be stated as follows.

Initialisations:
N = size of K (an N × N matrix)
j = 1
size and index are vectors with the same length as K
feat is a zeros matrix equal to the size of K
for i = 1 to N do
    norm2[i] = K_ii

Algorithm:
while Σ_i norm2[i] > η and j ≠ N + 1 do
    i_j = argmax_i(norm2[i])
    index[j] = i_j
    size[j] = sqrt(norm2[i_j])
    for i = 1 to N do
        feat[i, j] = (K(i, i_j) − Σ_{t=1}^{j−1} feat[i, t] · feat[i_j, t]) / size[j]
        norm2[i] = norm2[i] − feat(i, j) · feat(i, j)
    end
    j = j + 1
end
return feat

Output: ‖K − feat · feat'‖ ≤ η, where feat is an N × M lower triangular matrix.

New examples are projected onto the same basis via

for j = 1 to M do
    newfeat[j] = (K(i, index[j]) − Σ_{t=1}^{j−1} newfeat[t] · feat[index[j], t]) / size[j]
end
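The pseudocode above translates almost line for line into Python; the following is our reading of it as a runnable sketch (the columns of `feat` give the new low-dimensional representation, and `project_new` applies the second loop to unseen points):

import numpy as np

def partial_gram_schmidt(K, eta):
    """Partial Gram-Schmidt orthogonalisation (incomplete Cholesky) of K.

    Greedily pivots on the largest residual norm until the total residual
    drops below eta, so that ||K - feat @ feat.T|| <= eta.
    """
    N = K.shape[0]
    feat = np.zeros((N, N))
    index, size = [], []
    norm2 = np.diag(K).astype(float).copy()
    j = 0
    while norm2.sum() > eta and j < N:
        ij = int(np.argmax(norm2))                       # best pivot
        index.append(ij)
        size.append(np.sqrt(norm2[ij]))
        feat[:, j] = (K[:, ij] - feat[:, :j] @ feat[ij, :j]) / size[j]
        norm2 -= feat[:, j] ** 2                         # update residuals
        j += 1
    return feat[:, :j], np.array(index), np.array(size)

def project_new(k_col, feat, index, size):
    """Project a new point; k_col[i] = K(new point, training point i)."""
    M = len(index)
    newfeat = np.zeros(M)
    for j in range(M):
        newfeat[j] = (k_col[index[j]] - newfeat[:j] @ feat[index[j], :j]) / size[j]
    return newfeat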
size[j];
(4.7)
(4.8)
Multiplying the first equation
with Rx0 and the second equat
ion with Ry0 gives
Rx0 RxRx0 RyRy0 β − λ (4.
9)
Rx0 RxRx0 RxRx0 α =
(4.1
0 0)
Ry0 RyRy0 RxRx0 α − λ
Ry0 RyRy0 RyRy0 β =
0.
Let $Z$ be the new correlation matrix with the reduced dimensionality:

$$R_x' R_x = Z_{xx} \qquad R_y' R_y = Z_{yy} \qquad R_x' R_y = Z_{xy} \qquad R_y' R_x = Z_{yx}.$$
Let $\tilde{\alpha}$ and $\tilde{\beta}$ be the reduced directions, such that

$$\tilde{\alpha} = R_x' \alpha \qquad \tilde{\beta} = R_y' \beta.$$

Substituting in equations (4.9) and (4.10), we find that we return to the primal representation of CCA with a dual representation of the data:

$$Z_{xx} Z_{xy} \tilde{\beta} - \lambda Z_{xx}^2 \tilde{\alpha} = 0$$
$$Z_{yy} Z_{yx} \tilde{\alpha} - \lambda Z_{yy}^2 \tilde{\beta} = 0.$$
Multiplying by $Z_{xx}^{-1}$ and $Z_{yy}^{-1}$ respectively (assuming the $Z$ matrices are invertible) gives

$$Z_{xy} \tilde{\beta} - \lambda Z_{xx} \tilde{\alpha} = 0 \tag{4.11}$$
$$Z_{yx} \tilde{\alpha} - \lambda Z_{yy} \tilde{\beta} = 0. \tag{4.12}$$
We are able to rewrite $\tilde{\beta}$ from equation (4.12) as

$$\tilde{\beta} = \frac{Z_{yy}^{-1} Z_{yx} \tilde{\alpha}}{\lambda},$$

and substituting in equation (4.11) gives

$$Z_{xy} Z_{yy}^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 Z_{xx} \tilde{\alpha}; \tag{4.13}$$
we are left with a generalised eigenproblem of the form $Ax = \lambda Bx$. Let $SS'$ be equal to the complete Cholesky decomposition of $Z_{xx}$, such that $Z_{xx} = SS'$ where $S$ is a lower triangular matrix, and let $\hat{\alpha} = S' \cdot \tilde{\alpha}$. Substituting in equation (4.13) we obtain

$$S^{-1} Z_{xy} Z_{yy}^{-1} Z_{yx} {S'}^{-1} \hat{\alpha} = \lambda^2 \hat{\alpha}.$$

We now have a symmetric eigenproblem of the form $Ax = \lambda x$.
Proceeding in the same way with the regularised equations, we have

$$R_x R_x' R_y R_y' \beta - \lambda\left(R_x R_x' R_x R_x' + \kappa R_x R_x'\right)\alpha = 0$$
$$R_y R_y' R_x R_x' \alpha - \lambda\left(R_y R_y' R_y R_y' + \kappa R_y R_y'\right)\beta = 0.$$

Multiplying the first equation with $R_x'$ and the second equation with $R_y'$, and substituting with $Z$ as before, gives

$$Z_{xy} \tilde{\beta} - \lambda\left(Z_{xx} + \kappa I\right)\tilde{\alpha} = 0 \tag{4.16}$$
$$Z_{yx} \tilde{\alpha} - \lambda\left(Z_{yy} + \kappa I\right)\tilde{\beta} = 0. \tag{4.17}$$
We rewrite $\tilde{\beta}$ from equation (4.17) as

$$\tilde{\beta} = \frac{(Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha}}{\lambda};$$

substituting in equation (4.16) gives

$$Z_{xy}(Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 (Z_{xx} + \kappa I)\tilde{\alpha},$$

which is reduced to a symmetric eigenproblem exactly as before.
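Putting the pieces together, the procedure can be sketched as follows (our own code; `partial_gram_schmidt` is the sketch given above, and all names are ours):

import numpy as np
from scipy.linalg import eigh

def kcca_reduced(Kx, Ky, kappa, eta):
    """Regularised KCCA on the reduced correlation matrices Z.

    Solves Zxy (Zyy + kI)^{-1} Zyx a = lambda^2 (Zxx + kI) a as a
    symmetric-definite generalised eigenproblem.
    """
    Rx, _, _ = partial_gram_schmidt(Kx, eta)    # Kx ~ Rx Rx'
    Ry, _, _ = partial_gram_schmidt(Ky, eta)    # Ky ~ Ry Ry'
    Zxx, Zyy, Zxy = Rx.T @ Rx, Ry.T @ Ry, Rx.T @ Ry
    A = Zxy @ np.linalg.solve(Zyy + kappa * np.eye(len(Zyy)), Zxy.T)
    B = Zxx + kappa * np.eye(len(Zxx))
    lam2, alphas = eigh(A, B)                   # generalised symmetric solver
    order = np.argsort(lam2)[::-1]
    rho = np.sqrt(np.clip(lam2[order], 0.0, None))
    return rho, alphas[:, order], Rx, Ry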
5 Experimental Results
In the following experiments the problem of learning the semantics of multimedia content by combining image and text data is addressed. The synthesis is addressed by the kernel Canonical correlation analysis described in Section 4.3. We test the use of the derived semantic space in an image retrieval task that uses only image content. The aim is to allow retrieval of images from a text query but without reference to any labeling associated with the image. This can be viewed as a cross-modal retrieval task. We used the combined multimedia image-text web database, which was kindly provided by the authors of [15], where we are trying to facilitate mate retrieval on a test set. The data was divided into three classes (Figure 1) - Sport, Aviation and Paintball - of 400 records each, and consisted of jpeg images retrieved from the Internet with attached text. We randomly split each class into two halves which were used as training and test data accordingly. The features extracted from the data were the same as in [15] (a detailed description of the features used can be found in [15]): image HSV colour, image Gabor texture and term frequencies in text.
[Figure 1: example images from the three classes: Sport, Aviation and Paintball.]
The regularisation parameter is chosen a priori as

$$\kappa = \arg\max_\kappa \|\lambda^R(\kappa) - \lambda(\kappa)\|.$$

We find that $\kappa = 7$, and set via a heuristic technique the Gram-Schmidt precision parameter $\eta = 0.5$.
5.1 Content-Based Retrieval
In this experiment we used the first 30 and 5 $\tilde{\alpha}$ and $\tilde{\beta}$ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images for which the semantic feature vector has the closest inner product with the semantic feature vector of the chosen text. Success is considered if the images contained in the set are of the same label as the query text (Figure 3 shows a retrieval example).
[Figure: success rate (%) for the content-based retrieval experiment plotted against image set size, with example retrieved images.]
5.2 Mate-Based Retrieval
In this experiment we used the first 150 and 30 $\tilde{\alpha}$ and $\tilde{\beta}$ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images for which the semantic feature vector has the closest inner product with the semantic feature vector of the chosen text. A successful match is considered if the image that actually matched the chosen text is contained in this set. We compute the success as the average of 10 runs (Figure 5 shows a retrieval example for a set of 5 images).
[Figure: success rate (%) against the number of eigenvectors used in the mate-based retrieval experiment.]
The success rate is computed as

$$\text{success \% for image set } i = \frac{\sum_{j=1}^{600} \text{count}_j}{600} \times 100,$$
where $\text{count}_j = 1$ if the exact matching image to the text query was present in the set, else $\text{count}_j = 0$. The success rate in Table 4 is computed as above and averaged over all image sets.
As visible in Figure 7, we find that, unlike the Content-Based retrieval (Section 5.1), increasing the number of eigenvectors used will assist in locating the matching image to the query text. We speculate that this may be the result of added detail towards exact correlation in the semantic projection, though we do not compute for all eigenvectors, as this process would be computationally expensive.
5.3 Regularisation Parameter
We next verify that the method of selecting the regularisation parameter $\kappa$ a priori gives a value that performs well. We randomly split each class into two halves which were used as training and test data accordingly; we keep this divided set for all runs. We set the value of the incomplete Gram-Schmidt orthogonalisation precision parameter $\eta = 0.5$ and run over possible values of $\kappa$, where for each value we test its content-based and mate-based retrieval performance.
[Figures: content-based and mate-based retrieval performance (%) plotted against the number of eigenvectors and against the regularisation parameter κ.]
6 Generalisation of Canonical Correlation Analysis

We use the following notation: for a square matrix $A$,

$$\mathrm{Tr}(A) = \sum_i a_{ii}, \tag{6.1}$$

and the Frobenius norm of a matrix $A$ with columns $a_i$ satisfies

$$\|A\|_F^2 = \sum_i \|a_i\|_2^2 = \sum_i \langle a_i, a_i \rangle, \tag{6.3}$$

where the notation $\|\cdot\|_2$ means the Euclidean ($l_2$) norm of a vector.
Consider a constrained optimisation problem with objective (6.4), subject to (6.5)

$$g(y) = 0, \tag{6.6}$$
$$x \in \mathbb{R}^m, \quad y \in \mathbb{R}^n. \tag{6.7}$$

Let the set $Y \subseteq \mathbb{R}^n$ be the feasibility domain for $y$ determined by the constraint $g(y) = 0$. The corresponding problem restricted to $y$, with objective (6.8), subject to (6.9)

$$g(y) = 0, \tag{6.10}$$
$$y \in \mathbb{R}^n, \tag{6.11}$$

has the same optimal solution in $y$ as problem (6.4).
The canonical correlation problem (6.13) is stated subject to (6.14)

$$a_1^{(1)'} \Sigma_{11} a_1^{(1)} = 1, \tag{6.15}$$
$$a_1^{(2)'} \Sigma_{22} a_1^{(2)} = 1. \tag{6.16}$$

The meaning of this optimisation problem is to find the maximum correlation between the linear combinations of the columns of the matrices $H^{(1)}$ and $H^{(2)}$, subject to the lengths of the vectors corresponding to these linear combinations being normalised to 1.
The problem (6.13) is then expanded by the orthogonality constraints (6.17)-(6.22), with $k, l = 1, 2$ and $j = 1, \ldots, r - 1$: the components of every new pair in the iteration have to be orthogonal to the components of the previous pairs.
Introducing the variables

$$a_i^{(k)} = \Sigma_{kk}^{-\frac{1}{2}}\, y_i^{(k)}, \tag{6.23}$$
$$D_{kl} = \Sigma_{kk}^{-\frac{1}{2}}\, \Sigma_{kl}\, \Sigma_{ll}^{-\frac{1}{2}}, \tag{6.24}$$
$$k, l = 1, 2, \quad i = 1, \ldots, p, \tag{6.25}$$
the optimisation problem can be restated in the variables $y_1^{(k)}$ with unit-norm constraints (6.26)-(6.29). The corresponding Lagrangian is

$$L_1 = y_1^{(1)'} D_{12}\, y_1^{(2)} - \frac{\lambda_1}{2}\left(y_1^{(1)'} y_1^{(1)} - 1\right) - \frac{\lambda_2}{2}\left(y_1^{(2)'} y_1^{(2)} - 1\right), \tag{6.30}$$

where $\lambda_1$ and $\lambda_2$ are the Lagrangian multipliers. The vectors of the partial derivatives of $L_1$ with respect to the vectors $y_1^{(1)}, y_1^{(2)}$ are equal to 0 by the KKT conditions; thus we get

$$\frac{\partial L_1}{\partial y_1^{(1)}} = 2 D_{12}\, y_1^{(2)} - 2\lambda_1 y_1^{(1)} = 0, \tag{6.31}$$
$$\frac{\partial L_1}{\partial y_1^{(2)}} = 2 D_{21}\, y_1^{(1)} - 2\lambda_2 y_1^{(2)} = 0. \tag{6.32}$$
Multiplying (6.31) by $y_1^{(1)'}$ and (6.32) by $y_1^{(2)'}$, and based on the constraints of the optimisation problem (6.26), we obtain the identity

$$\lambda_1 = \lambda_2 = y_1^{(1)'} D_{12}\, y_1^{(2)}. \tag{6.35}$$

After replacing $\lambda_1$ and $\lambda_2$ with $\lambda$, the following equality system can be formulated:

$$D_{12}\, y_1^{(2)} - \lambda\, y_1^{(1)} = 0,$$
$$D_{21}\, y_1^{(1)} - \lambda\, y_1^{(2)} = 0. \tag{6.37}$$
In the general case of $p$ pairs of directions, the constraints are

$$a_i^{(1)'} \Sigma_{11}\, a_j^{(1)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \tag{6.39}$$
$$a_i^{(2)'} \Sigma_{22}\, a_j^{(2)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \tag{6.40}$$
$$i, j = 1, \ldots, p, \tag{6.41}$$
$$a_i^{(1)'} \Sigma_{12}\, a_j^{(2)} = 0, \tag{6.42}$$
$$i, j = 1, \ldots, p, \quad j \neq i. \tag{6.43}$$
Based on equation (6.37) and the definition of the Frobenius norm, we have a compact formulation of the canonical correlation problem:

$$\max\; \mathrm{Tr}\left(A^{(1)'} \Sigma_{12} A^{(2)}\right) \tag{6.44}$$

subject to (6.45)

$$A^{(k)'} \Sigma_{kk} A^{(k)} = I, \tag{6.46}$$
$$a_i^{(k)'} \Sigma_{kl}\, a_j^{(l)} = 0. \tag{6.47}$$
The generalisation to $K$ sets of variables can then be formulated as the distance minimisation problem

$$\min \sum_{k,l=1}^{K} \left\|H^{(k)} A^{(k)} - H^{(l)} A^{(l)}\right\|_F^2 \tag{6.54}$$

subject to (6.55)

$$A^{(k)'} \Sigma_{kk} A^{(k)} = I, \tag{6.56}$$
$$a_i^{(k)'} \Sigma_{kl}\, a_j^{(l)} = 0, \tag{6.57}$$
$$k, l = 1, \ldots, K, \quad l \neq k, \quad i, j = 1, \ldots, p, \quad j \neq i. \tag{6.58}$$

In the forthcoming sections we will show how to simplify this problem.
First observe that the sum of the squared pairwise distances of points $x_1, \ldots, x_m \in \mathbb{R}^n$ can be expressed through their variance (see the numerical check below):

$$\frac{1}{2}\sum_{k,l=1}^{m} \|x_k - x_l\|_2^2 = \frac{1}{2}\sum_{k,l=1}^{m}\sum_{i=1}^{n}\left(x_{ki} - x_{li}\right)^2 \tag{6.61}$$
$$= \frac{1}{2}\sum_{i=1}^{n}\left(m\sum_{k=1}^{m} x_{ki}^2 + m\sum_{l=1}^{m} x_{li}^2 - 2\sum_{k=1}^{m} x_{ki}\sum_{l=1}^{m} x_{li}\right) \tag{6.64}$$
$$= m\sum_{i=1}^{n}\sum_{k=1}^{m}\left(x_{ki} - M_1(i)\right)^2, \tag{6.67}$$

where $M_1(i) = \frac{1}{m}\sum_{k=1}^{m} x_{ki}$ denotes the mean of the $i$-th component.
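This identity is easy to verify numerically (a sanity-check snippet of our own):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 3))                            # m = 7 points in R^3
lhs = 0.5 * sum(np.sum((xk - xl) ** 2) for xk in X for xl in X)
rhs = X.shape[0] * np.sum((X - X.mean(axis=0)) ** 2)
print(np.allclose(lhs, rhs))                           # True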
Introducing an auxiliary variable $X$, the problem can be written as

$$\min \sum_{k=1}^{K} \left\|X - H^{(k)} A^{(k)}\right\|_F^2 \tag{6.70}$$

subject to (6.71)

$$a_i^{(k)'} \Sigma_{kl}\, a_j^{(l)} = \begin{cases} 1 & \text{if } k = l \text{ and } i = j, \\ 0 & \text{if } (k = l \text{ and } i \neq j) \text{ or } (k \neq l \text{ and } i \neq j), \end{cases} \tag{6.72}$$
$$k, l = 1, \ldots, K, \quad i, j = 1, \ldots, p, \quad \text{except when } k \neq l \text{ and } i = j. \tag{6.73}$$
In the transformed variables the constraints read

$$y_i^{(k)'} y_j^{(k)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j, \end{cases} \tag{6.76}$$
$$k = 1, \ldots, K, \quad i, j = 1, \ldots, p, \tag{6.77}$$
$$y_i^{(k)'} D_{kl}\, y_j^{(l)} = 0, \tag{6.78}$$
$$k, l = 1, \ldots, K, \quad k \neq l, \quad i, j = 1, \ldots, p, \quad i \neq j. \tag{6.79}$$
The objective function can be expanded as

$$\frac{1}{K}\sum_{k=1}^{K}\left\|X - H^{(k)} A^{(k)}\right\|_F^2 \tag{6.84}$$
$$= \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\left\|x_i - H^{(k)} a_i^{(k)}\right\|^2 \tag{6.85}$$
$$= \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\left\|x_i - Q_k y_i^{(k)}\right\|^2 \tag{6.86}$$
$$= \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\left\langle x_i - Q_k y_i^{(k)},\; x_i - Q_k y_i^{(k)}\right\rangle. \tag{6.87}$$
The corresponding Lagrangian is

$$L = \sum_{k=1}^{K}\sum_{i=1}^{p}\left\langle x_i - Q_k y_i^{(k)},\; x_i - Q_k y_i^{(k)}\right\rangle + \ldots \tag{6.88}$$

with multiplier terms for the constraints (6.76)-(6.79). Taking derivatives,

$$\frac{\partial L}{\partial x_i} = \sum_{k=1}^{K} 2\left(x_i - Q_k y_i^{(k)}\right) = 0, \quad i = 1, \ldots, p, \tag{6.92}$$
$$\frac{\partial L}{\partial y_i^{(k)}} = 2 D_{kk}\, y_i^{(k)} - Q_k' x_i - 2\sum_{j} \lambda_{k,ij}\, y_j^{(k)} - \sum_{l \neq k}\sum_{j \neq i} \lambda_{kl,ij}\, D_{kl}\, y_j^{(l)} = 0, \tag{6.93}$$
$$k = 1, \ldots, K, \quad i = 1, \ldots, p. \tag{6.94}$$
We can express $x_i$ for any $i = 1, \ldots, p$ from (6.92):

$$x_i = \frac{1}{K}\sum_{l=1}^{K} Q_l\, y_i^{(l)}, \quad i = 1, \ldots, p. \tag{6.95}$$
7 Conclusions
Through this study we have presented a tutorial on canonical correlation analysis and have established a novel general approach to retrieving images based solely on their content. This was applied to content-based and mate-based retrieval. Experiments show that image retrieval can be more accurate than with the Generalised Vector Space Model. We demonstrate that one can choose the regularisation parameter $\kappa$ a priori such that it performs well in very different regimes. Hence we have come to the conclusion that kernel Canonical Correlation Analysis is a powerful tool for image retrieval via content. In the future we will extend our experiments to other data collections.
In the procedure of generalising canonical correlation analysis we can see that the original problem can be transformed and reinterpreted as a total distance problem or a variance minimisation problem. This special duality between correlation and distance requires more investigation to give a more suitable description of the structure of some special spaces generated by different kernels.
These approaches can give tools to handle some problems in the kernel space, where the inner products and the distances between the points are known but the coordinates are not. For some problems it is sufficient to know only the coordinates of a few special points, which can be expressed from the known inner products, e.g. to do cluster analysis in the kernel space and to compute the coordinates of the cluster centres only.
Acknowledgments
We would like to acknowledge the financial support of EU Projects KerMIT, No. IST-2000-25341 and LAVA, No. IST-2001-34405.
Appendix

1 Proof of $\|K - G_i G_i'\| \leq \eta$

1.1 Some notation

Lemma 4. Let $A$ and $B$ be square matrices with $\mathrm{Trace}(A) = \sum_i a_{ii}$; then we have $\mathrm{Trace}(AB) = \mathrm{Trace}(BA)$.
Proof.

$$\mathrm{Trace}(AB) = \sum_i (AB)_{ii} = \sum_{i,j}^{n} a_{ij} b_{ji} = \sum_{j,i}^{n} b_{ji} a_{ij} = \sum_j (BA)_{jj} = \mathrm{Trace}(BA).$$
Lemma 5. The trace of a square matrix $A$ equals the sum of its eigenvalues $\lambda_i$.

Proof.

$$\sum_{i=1}^{n} a_{ii} = \sum_{i=1}^{n} \lambda_i.$$
Lemma 6. The matrix norm of $A$ satisfies $\|A\| = \max_i \sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $A'A$.

Proof. The quotient is invariant to re-scaling of its argument, since

$$\frac{\|A(cx)\|}{\|cx\|} = \frac{c\|Ax\|}{c\|x\|} = \frac{\|Ax\|}{\|x\|}.$$

Hence we obtain

$$\|A\| = \max_{\|x\|=1} \|Ax\|, \qquad \|Ax\| = \sqrt{x' A' A x}.$$

Writing the eigendecomposition $A'A = UDU'$, we have $\|Ax\|^2 = x' U D U' x$. Substituting $w = U'x$, which maps $\|x\| = 1$ to $\|w\| = 1$, gives $\|A\|^2 = \max_{\|w\|=1} w' D w = \max_i \lambda_i$. Hence we obtain

$$\|A\| = \max_i \sqrt{\lambda_i}.$$
1.2 Proof

Theorem 7. If $K$ is a positive definite matrix and $G_i G_i'$ is its incomplete Cholesky decomposition, then the Euclidean norm of $G_i G_i'$ subtracted from $K$ is less than or equal to the trace of the uncalculated part of $K$. Let $\Delta K_i$ be the uncalculated part of $K$ and let $\eta = \mathrm{Trace}(\Delta K_i)$; then

$$\|K - G_i G_i'\| \leq \eta.$$

Proof. Let $GG'$ be the complete Cholesky decomposition $K = GG'$, where $G$ is a lower triangular matrix whose upper triangular part is zeros,
$$G = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix}.$$

Let $G_i$ be the incomplete decomposition of $K$ after $i$ iterations of the Cholesky factorisation procedure,

$$G_i = \begin{pmatrix} A \\ B \end{pmatrix},$$

so that

$$K = GG' = \begin{pmatrix} AA' & AB' \\ BA' & BB' + CC' \end{pmatrix}, \qquad \Delta K_i = \begin{pmatrix} 0 & 0 \\ 0 & CC' \end{pmatrix}.$$

We show that $CC'$ is positive semi-definite:

$$CC' = K_{i+1:n,\,i+1:n} - BB' = K_{i+1:n,\,i+1:n} - (BA')(AA')^{-1}(AB') = K_{i+1:n,\,i+1:n} - K_{i+1:n,\,1:i}\,(AA')^{-1}\, K_{1:i,\,i+1:n};$$

therefore, for any $x$,

$$x' CC' x = \langle C'x,\, C'x \rangle \geq 0,$$

so every eigenvalue $\lambda_C$ of $CC'$ satisfies $\lambda_C \geq 0$.
$CC'$ is a positive semi-definite matrix; hence $\Delta K_i$ is a positive semi-definite matrix. Using Lemma 6 we are now able to show that

$$\|K - G_i G_i'\| = \|\Delta K_i\| = \max_i \lambda_i \leq \sum_{i=1}^{n} \lambda_i = \mathrm{Trace}(\Delta K_i).$$

Therefore,

$$\|K - G_i G_i'\| \leq \eta.$$
Bibliography
[10] G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1983.

[11] David R. Hardoon and John Shawe-Taylor. KCCA for different level precision in content-based image retrieval. In Submitted to Third International Workshop on Content-Based Multimedia Indexing, IRISA, Rennes, France, 2003.

[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28:312–377, 1936.

[13] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, Inc, 1966.

[14] J. R. Ketterling. Canonical analysis of several sets of variables. Biometrika, 58:433–451, 1971.

[15] T. Kolenda, L. K. Hansen, J. Larsen, and O. Winther. Independent component analysis for understanding multimedia content. In H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, pages 757–766, Piscataway, New Jersey, 2002. IEEE Press. Martigny, Valais, Switzerland.

[16] Malte Kuss and Thore Graepel. The geometry of kernel canonical correlation analysis. 2002.

[17] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communications and Image Representation, 10:39–62, 1999.

[18] Alexei Vinokourov, David R. Hardoon, and John Shawe-Taylor. Learning the semantics of multimedia content with application to web image retrieval and classification. In Proceedings of Fourth International Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan, 2003.

[19] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances of Neural Information Processing Systems 15 (to appear), 2002.