Canonical Correlation Analysis: An Overview with Application to Learning Methods

David R. Hardoon, Sandor Szedmak and John Shawe-Taylor

Department of Computer Science
Royal Holloway, University of London
{davidh, sandor, john}@cs.rhul.ac.uk

Technical Report
CSD-TR-03-02
May 28, 2003

Department of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, England

Abstract

We present a general method using kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text. The semantic space provides a common representation and enables a comparison between the text and images. In the experiments we look at two approaches to retrieving images based only on their content from a text query. We compare the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

Keywords: Canonical correlation analysis, kernel canonical correlation analysis, partial Gram-Schmidt orthogonalisation, Cholesky decomposition, incomplete Cholesky decomposition, kernel methods.
1 Introduction

During recent years there have been advances in data learning using kernel methods. Kernel representation offers an alternative to learning non-linear functions by projecting the data into a high dimensional feature space to increase the computational power of linear learning machines, though this still leaves open the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.

Principal Component Analysis (PCA) is a multivariate data analysis procedure that involves a transformation of a number of possibly correlated variables into a smaller number of uncorrelated variables known as principal components. PCA only makes use of the training inputs while making no use of the labels.

Independent Component Analysis (ICA), in contrast to correlation-based transformations such as PCA, not only decorrelates the signals but also reduces higher-order statistical dependencies, attempting to make the signals as independent as possible. In other words, ICA is a way of finding a linear, not only orthogonal, co-ordinate system in any multivariate data. The directions of the axes of this co-ordinate system are determined by both the second and higher order statistics of the original data. The goal is to perform a linear transform which makes the resulting variables as statistically independent from each other as possible.

Partial Least Squares (PLS) is a method similar to canonical correlation analysis. It selects feature directions that are useful for the task at hand, though PLS only uses one view of an object and the label as the corresponding pair. PLS could be thought of as a method which looks for directions that are good at distinguishing the different labels.

Canonical Correlation Analysis (CCA) is a method of correlating linear relationships between two multidimensional variables. CCA can be seen as using complex labels as a way of guiding feature selection towards the underlying semantics. CCA makes use of two views of the same semantic object to extract the representation of the semantics. The main difference between CCA and the other three methods is that CCA is closely related to mutual information (Borga 1998 [3]). Hence CCA can be easily motivated in information-based tasks and is our natural selection.

Proposed by H. Hotelling in 1936 [12], CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. In an attempt to increase the flexibility of the feature selection, kernelisation of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in some preliminary work by Fyfe & Lai [8], Akaho [1] and, more recently, by Vinokourov et al. [19] with improved results.

During recent years there has been a vast increase in the amount of multimedia content available both off-line and online, though we are unable to access or make use of this data unless it is organised in such a way as to allow efficient browsing. To enable content-based retrieval with no reference to labeling we attempt to learn the semantic representation of images and their associated text. We present a general approach using KCCA that can be used for content-based [11] as well as mate-based retrieval [18, 11]. In both cases we compare the KCCA approach to the Generalised Vector Space Model (GVSM), which aims at capturing some term-term correlations by looking at co-occurrence information.

This study aims to serve as a tutorial and give additional novel contributions in the following ways:

In this study we follow the work of Borga [4] where we represent the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and dimensionality of the eigenvectors.

Further to that, we follow the idea of Bach & Jordan [2] to compute a new correlation matrix with reduced dimensionality. Though Bach & Jordan [2] address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that using partial Gram-Schmidt orthogonalisation [6] is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen as a dual implementation of partial Gram-Schmidt.

We show that the general approach can be adapted to two different types of problems, content and mate retrieval, by only changing the selection of eigenvectors used in the semantic projection.

To simplify the learning of the KCCA we explore a method of selecting the regularization parameter a priori such that it gives a value that performs well in several different tasks.

In this study we also present a generalisation of the framework for canonical correlation analysis. Our approach is based on the works of Gifi (1990) and Ketterling (1971). The purpose of the generalisation is to extend canonical correlation as an associativity measure between two sets of variables to more than two sets, whilst preserving most of its properties. The generalisation starts with the optimisation problem formulation of canonical correlation. By changing the objective function we arrive at the multiset problem. Applying similar constraint sets in the optimisation problems we find that the feasible solutions are singular vectors of matrices, which are derived in the same way for the original and generalised problems.

In Section 2 we present the theoretical background of CCA. In Section 3 we present the CCA and KCCA algorithms. Approaches to deal with the computational problems that arise in Section 3 are presented in Section 4. Our experimental results are presented in Section 5. In Section 6 we present the generalisation framework for CCA, while Section 7 draws final conclusions.

2 Theoretical Foundations

Proposed by H. Hotelling in 1936 [12], Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. Correlation analysis is dependent on the co-ordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, depending on the co-ordinate system used, this relationship might not be visible as a correlation. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed the corresponding co-ordinates are maximally correlated.

Consider a multivariate random vector of the form (x, y). Suppose we are given a sample of instances S = ((x_1, y_1), . . . , (x_m, y_m)) of (x, y); we use S_x to denote (x_1, . . . , x_m) and similarly S_y to denote (y_1, . . . , y_m). We can consider defining a new co-ordinate for x by choosing a direction w_x and projecting x onto that direction,

x \mapsto \langle w_x, x \rangle.

If we do the same for y by choosing a direction w_y we obtain a sample of the new x co-ordinate as

S_{x, w_x} = (\langle w_x, x_1 \rangle, . . . , \langle w_x, x_m \rangle)

with the corresponding values of the new y co-ordinate being

S_{y, w_y} = (\langle w_y, y_1 \rangle, . . . , \langle w_y, y_m \rangle).


The first stage of canonical correlation is to choose w_x and w_y to maximise the correlation between the two vectors. In other words, the function to be maximised is

\rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y \rangle}{\|S_x w_x\| \, \|S_y w_y\|}.

If we use \hat{E}[f(x, y)] to denote the empirical expectation of the function f(x, y), where

\hat{E}[f(x, y)] = \frac{1}{m} \sum_{i=1}^{m} f(x_i, y_i),

we can rewrite the correlation expression as

\rho = \max_{w_x, w_y} \frac{\hat{E}[\langle w_x, x \rangle \langle w_y, y \rangle]}{\sqrt{\hat{E}[\langle w_x, x \rangle^2] \, \hat{E}[\langle w_y, y \rangle^2]}}
     = \max_{w_x, w_y} \frac{w_x' \hat{E}[x y'] w_y}{\sqrt{w_x' \hat{E}[x x'] w_x \; w_y' \hat{E}[y y'] w_y}},

where we use A' to denote the transpose of a vector or matrix A.

Now observe that the covariance matrix of (x, y) is

C(x, y) = \hat{E}\left[ \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}' \right] = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}.    (2.1)

The total covariance matrix C is a block matrix where the within-sets covariance matrices are C_{xx} and C_{yy} and the between-sets covariance matrices are C_{xy} = C_{yx}'.

Hence, we can rewrite the function ρ as

\rho = \max_{w_x, w_y} \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}};    (2.2)

the maximum canonical correlation is the maximum of ρ with respect to w_x and w_y.

3 Algorithm

In this section we will give an overview of the Canonical Correlation Analysis (CCA) and Kernel-CCA (KCCA) algorithms, where we formulate the optimisation problem as a generalised eigenproblem.

3.1 Canonical Correlation Analysis

Observe that the solution of equation (2.2) is not affected by re-scaling w_x or w_y either together or independently, so that for example replacing w_x by αw_x gives the quotient

\frac{\alpha w_x' C_{xy} w_y}{\sqrt{\alpha^2 w_x' C_{xx} w_x \; w_y' C_{yy} w_y}} = \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}.

Since the choice of re-scaling is therefore arbitrary, the CCA optimisation problem formulated in equation (2.2) is equivalent to maximising the numerator
subject to

w_x' C_{xx} w_x = 1
w_y' C_{yy} w_y = 1.

The corresponding Lagrangian is

L(\lambda, w_x, w_y) = w_x' C_{xy} w_y - \frac{\lambda_x}{2}\left(w_x' C_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y' C_{yy} w_y - 1\right).

Taking derivatives in respect to w_x and w_y we obtain

\frac{\partial L}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0    (3.1)
\frac{\partial L}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0.    (3.2)
Subtracting w_y' times the second equation from w_x' times the first we have

0 = w_x' C_{xy} w_y - w_x' \lambda_x C_{xx} w_x - w_y' C_{yx} w_x + w_y' \lambda_y C_{yy} w_y,

which together with the constraints implies that \lambda_y - \lambda_x = 0; let \lambda = \lambda_x = \lambda_y. Assuming C_{yy} is invertible we have

w_y = \frac{C_{yy}^{-1} C_{yx} w_x}{\lambda}    (3.3)

and so substituting in equation (3.1) gives

C_{xy} C_{yy}^{-1} C_{yx} w_x - \lambda^2 C_{xx} w_x = 0

or

C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x.    (3.4)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. We can therefore find the co-ordinate system that optimises the correlation between corresponding co-ordinates by first solving for the generalised eigenvectors of equation (3.4) to obtain the sequence of w_x's and then using equation (3.3) to find the corresponding w_y's.

As the covariance matrices C_{xx} and C_{yy} are symmetric positive definite, we are able to decompose them using a complete Cholesky decomposition (more details on Cholesky decomposition can be found in Section 4.2)

C_{xx} = R_{xx} \cdot R_{xx}',

where R_{xx} is a lower triangular matrix. If we let u_x = R_{xx}' \cdot w_x we are able to rewrite equation (3.4) as follows

C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 R_{xx} u_x
R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 u_x.

We are therefore left with a symmetric eigenproblem of the form Ax = \lambda x.
3.2 Kernel Canonical Correlation Analysis

CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher dimensional feature space

\phi : x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x)) \quad (n < N)

before performing CCA in the new feature space, essentially moving from the primal to the dual representation approach. Kernels are methods of implicitly mapping data into a higher dimensional feature space, a method known as the "kernel trick". A kernel is a function K such that for all x, z ∈ X

K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle,    (3.5)

where φ is a mapping from X to a feature space F. Kernels offer a great deal of flexibility, as they can be generated from other kernels. In the kernel the data only appears through entries in the Gram matrix, therefore this approach gives a further advantage as the number of tuneable parameters and the updating time do not depend on the number of attributes being used.

Using the definition of the covariance matrix in equation (2.1) we can rewrite the covariance matrix C using the data matrices (of vectors) X and Y, which have the sample vectors as rows and are therefore of size m × N; we obtain

C_{xx} = X'X
C_{xy} = X'Y.

The directions w_x and w_y (of length N) can be rewritten as the projection of the data onto the directions α and β (of length m)

w_x = X'\alpha
w_y = Y'\beta.

Substituting into equation (2.2) we obtain the following

\rho = \max_{\alpha, \beta} \frac{\alpha' X X' Y Y' \beta}{\sqrt{\alpha' X X' X X' \alpha \; \cdot \; \beta' Y Y' Y Y' \beta}}.    (3.6)

Let K_x = XX' and K_y = YY' be the kernel matrices corresponding to the two representations. We substitute into equation (3.6)

\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\alpha' K_x^2 \alpha \; \cdot \; \beta' K_y^2 \beta}}.    (3.7)

We find that in equation (3.7) the variables are now represented in the dual form.

Observe that as with the primal form presented in equation (2.2), equation (3.7) is not affected by re-scaling of α and β either together or independently. Hence
the KCCA optimisation problem formulated in equation (3.7) is equivalent to maximising the numerator subject to

\alpha' K_x^2 \alpha = 1
\beta' K_y^2 \beta = 1.

The corresponding Lagrangian is

L(\lambda_\alpha, \lambda_\beta, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta - 1\right).

Taking derivatives in respect to α and β we obtain

\frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0    (3.8)
\frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0.    (3.9)

Subtracting β' times the second equation from α' times the first we have

0 = \alpha' K_x K_y \beta - \alpha' \lambda_\alpha K_x^2 \alpha - \beta' K_y K_x \alpha + \beta' \lambda_\beta K_y^2 \beta,

which together with the constraints implies that \lambda_\alpha - \lambda_\beta = 0; let \lambda = \lambda_\alpha = \lambda_\beta. Considering the case where the kernel matrices K_x and K_y are invertible, we have

\beta = \frac{K_y^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{K_y^{-1} K_x \alpha}{\lambda},

and substituting in equation (3.8) gives

K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0.

Hence

K_x K_x \alpha - \lambda^2 K_x K_x \alpha = 0
or

I \alpha = \lambda^2 \alpha.    (3.10)

We are left with a standard eigenproblem of the form Ax = \lambda x. We can deduce from equation (3.10) that \lambda = 1 for every vector α; hence we can choose the projections w_x to be the unit vectors j_i, i = 1, . . . , m, while the w_y are the columns of \frac{1}{\lambda} K_y^{-1} K_x, so that perfect correlation can always be formed. Since kernel methods provide high dimensional representations, such independence is not uncommon. It is therefore clear that a naive application of CCA in kernel-defined feature space will not provide useful results. In the next section we investigate how this problem can be avoided.
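The degeneracy is easy to verify numerically: whenever the Gram matrices are invertible, the matrix of the (unregularised) dual problem reduces to the identity and every "canonical correlation" equals one, regardless of whether the two views are related. The snippet below is our own check on random, completely unrelated data (the Gaussian kernel helper is redefined here so the example is self-contained).

import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
Kx = gaussian_gram(rng.standard_normal((50, 10)))
Ky = gaussian_gram(rng.standard_normal((50, 7)))   # an unrelated second view

# With invertible Kx, Ky the dual problem collapses to I a = lambda^2 a,
# so every correlation is 1 even though the two views share nothing.
M = np.linalg.inv(Kx) @ Ky @ np.linalg.inv(Ky) @ Kx
lam2 = np.linalg.eigvals(M).real
print(lam2.min(), lam2.max())   # both approximately 1: spurious perfect correlation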
4 Computational Issues

We observe from equation (3.10) that if K_x is invertible maximal correlation is obtained, suggesting learning is trivial. To force non-trivial learning we introduce a control on the flexibility of the projections by penalising the norms of the associated weight vectors by a convex combination of constraints based on Partial Least Squares. Another computational issue that can arise is the use of large training sets, as this can lead to computational problems and degeneracy. To overcome this issue we apply partial Gram-Schmidt orthogonalisation (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.

4.1 Regularisation

To force non-trivial learning on the correlation we introduce a control on the flexibility of the projection mappings using Partial Least Squares (PLS) to penalise the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the denominator of equation (3.7), obtaining

\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \|w_x\|^2\right) \cdot \left(\beta' K_y^2 \beta + \kappa \|w_y\|^2\right)}}
     = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha\right) \cdot \left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta\right)}}.
We observe that the new regularised equation is not affected by re-scaling of α or β, hence the optimisation problem is subject to

\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha = 1
\beta' K_y^2 \beta + \kappa \beta' K_y \beta = 1.

The corresponding Lagrangian is

L(\lambda_\alpha, \lambda_\beta, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta - 1\right).

Taking derivatives in respect to α and β gives

\frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha \left(K_x^2 \alpha + \kappa K_x \alpha\right)    (4.1)
\frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta \left(K_y^2 \beta + \kappa K_y \beta\right).    (4.2)

Subtracting β' times the second equation from α' times the first we have

0 = \lambda_\beta \beta'\left(K_y^2 \beta + \kappa K_y \beta\right) - \lambda_\alpha \alpha'\left(K_x^2 \alpha + \kappa K_x \alpha\right),

which together with the constraints implies that \lambda_\alpha - \lambda_\beta = 0; let \lambda = \lambda_\alpha = \lambda_\beta. Consider the case where K_x and K_y are invertible. We have

\beta = \frac{(K_y + \kappa I)^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{(K_y + \kappa I)^{-1} K_x \alpha}{\lambda},

and substituting in equation (4.1) gives

K_x K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 K_x (K_x + \kappa I) \alpha
K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 (K_x + \kappa I) \alpha
(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha.

We obtain an eigenproblem of the form Ax = \lambda x.

4.2 Cholesky Decomposition

We describe some background information on direct factorisation methods based on triangular decomposition [13],

LU = A,    (4.3)

in which the diagonal elements of L are not necessarily unity. If we consider L ≡ (l_{ij}), then equation (4.3) implies

l_{kk} u_{kk} = a_{kk} - \sum_{p=1}^{k-1} l_{kp} u_{pk},    (4.4)
u_{kj} = \left( a_{kj} - \sum_{p=1}^{k-1} l_{kp} u_{pj} \right) / l_{kk}, \quad j > k,    (4.5)
l_{ik} = \left( a_{ik} - \sum_{p=1}^{k-1} l_{ip} u_{pk} \right) / u_{kk}, \quad i > k.    (4.6)

Theorem 1. Let A be symmetric. If the factorisation LU = A is possible, then the choice l_{kk} = u_{kk} implies l_{ik} = u_{ki}, that is, LL^T = A.

Proof. Use equation (4.4) and induction on k.

A simple, non-singular, symmetric matrix for which the factorisation is not possible is

\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

On the other hand, if the symmetric matrix A is positive definite (i.e., x'Ax > 0 whenever x'x > 0), then the factorisation is possible. We have

Theorem 2. Let A be symmetric, positive definite. Then A can be factored in the form LL' = A.

Proof. Choosing l_{kk} = u_{kk}, we obtain from the previous equations LU = A with l_{ik} = u_{ki}.

Incomplete Cholesky Decomposition

Complete decomposition of a kernel matrix is an expensive step and should be avoided with real world data. Incomplete Cholesky decomposition as described in [2] differs from Cholesky decomposition in that all pivots which are below a certain threshold are skipped. If M is the number of non-skipped pivots, then we obtain a lower triangular matrix G_i with only M nonzero columns. Symmetric permutations of rows and columns are necessary during the factorisation if we require the rank to be as small as possible (Golub and Van Loan, 1983 [10]).

We describe the algorithm from [2] (with slight modification):

Input: N × N matrix K, precision parameter η.

1. Initialisation: i = 1, K' = K, P = I; for j ∈ [1, N] set G_{jj} = K_{jj}.
2. While \sum_{j=i}^{N} G_{jj} > η and i ≠ N + 1:
   • Find the best new pivot: j* = argmax_{j ∈ [i, N]} G_{jj}.
   • Update the permutation: P_next = I, P_next(i, i) = 0, P_next(j*, j*) = 0, P_next(i, j*) = 1, P_next(j*, i) = 1, and set P = P · P_next.
   • Permute the elements i and j* in K': K' = P_next · K' · P_next.
   • Update G to reflect the permutation: G(i, 1:i−1) ↔ G(j*, 1:i−1).
   • Set G(i, i) = \sqrt{K'(i, i)}.
   • Calculate the ith column of G:
     G(i+1:N, i) = \frac{1}{G(i, i)} \left( K'(i+1:N, i) - \sum_{j=1}^{i-1} G(i+1:N, j) \, G(i, j) \right).
   • Update only the diagonal elements: for j ∈ [i+1, N] set G(j, j) = K'(j, j) - \sum_{k=1}^{i} G(j, k)^2.
   • Set i = i + 1.
3. Output: an N × M lower triangular matrix G and a permutation matrix P such that \|P'KP - GG'\| \le \eta (see Appendix 1.2 for a proof).
The algorithm involves picking one column of K at a time, choosing the column to be added by greedily maximising a lower bound on the reduction in the error of the approximation. After l steps we have an approximation of the form \tilde{K}_l = G_{il} G_{il}', where G_{il} is N × l. The ranking of the N − l remaining vectors can be computed by comparing the diagonal elements of the remainder matrix K − G_{il} G_{il}'.

Partial Gram-Schmidt Orthogonalisation

We explore the Partial Gram-Schmidt Orthogonalisation (PGSO) algorithm, described in [6], as our matrix decomposition approach. ICD can be seen as equivalent to PGSO, as ICD is the dual implementation of PGSO. PGSO works as follows: the projection is built up as the span of a subset of the projections of a set of m training examples. These are selected by performing a Gram-Schmidt orthogonalisation of the training vectors in the feature space. We slightly modify the Gram-Schmidt algorithm so that it uses a precision parameter as a stopping criterion, as shown in [2].

Given a kernel K from a training set and a precision parameter η:

Initialisations:
  m = size of K (an N × N matrix)
  j = 1
  size and index are vectors with the same length as K
  feat is a zero matrix equal to the size of K
  for i = 1 to m do
    norm2[i] = K_{ii};

Algorithm:
  while \sum_i norm2[i] > η and j ≠ N + 1 do
    i_j = argmax_i(norm2[i]);
    index[j] = i_j;
    size[j] = \sqrt{norm2[i_j]};
    for i = 1 to m do
      feat[i, j] = \left( K(i, i_j) - \sum_{t=1}^{j-1} feat[i, t] \cdot feat[i_j, t] \right) / size[j];
      norm2[i] = norm2[i] - feat(i, j) \cdot feat(i, j);
    end;
    j = j + 1;
  end;
  return feat

Output:
  \|K - feat \cdot feat'\| \le \eta, where feat is an N × M lower triangular matrix.

We observe that the output is equivalent to the output of ICD.

To classify a new example at location i:

Given a kernel K from a testing set,

  for j = 1 to M do
    newfeat[j] = \left( K(i, index[j]) - \sum_{t=1}^{j-1} newfeat[t] \cdot feat[index[j], t] \right) / size[j];
  end;

The advantage of using partial Gram-Schmidt orthogonalisation (PGSO) in comparison to the incomplete Cholesky decomposition (as described in Section 4.2) is that there is no need for a permutation matrix P.

4.3 Kernel-CCA with PGSO

So far we have considered the kernel matrices as invertible, although in practice this may not be the case. In this Section we address the issue of using large training sets, which may lead to computational problems and degeneracy. We use PGSO to approximate the kernel matrices such that we are able to re-represent the correlation with reduced dimensionality.

Decomposing the kernel matrices K_x and K_y via PGSO, where R_x and R_y are lower triangular matrices, gives

K_x \approx R_x R_x'
K_y \approx R_y R_y'.

Substituting the new representation into equations (3.8) and (3.9) gives

R_x R_x' R_y R_y' \beta - \lambda R_x R_x' R_x R_x' \alpha = 0    (4.7)
R_y R_y' R_x R_x' \alpha - \lambda R_y R_y' R_y R_y' \beta = 0.    (4.8)

Multiplying the first equation with R_x' and the second equation with R_y' gives

R_x' R_x R_x' R_y R_y' \beta - \lambda R_x' R_x R_x' R_x R_x' \alpha = 0    (4.9)
R_y' R_y R_y' R_x R_x' \alpha - \lambda R_y' R_y R_y' R_y R_y' \beta = 0.    (4.10)

Let Z denote the new correlation matrices with the reduced dimensionality

R_x' R_x = Z_{xx}
R_y' R_y = Z_{yy}
R_x' R_y = Z_{xy}
R_y' R_x = Z_{yx},

and let \tilde{\alpha} and \tilde{\beta} be the reduced directions, such that

\tilde{\alpha} = R_x' \alpha
\tilde{\beta} = R_y' \beta.

Substituting in equations (4.9) and (4.10) we find that we return to the primal representation of CCA with a dual representation of the data

Z_{xx} Z_{xy} \tilde{\beta} - \lambda Z_{xx}^2 \tilde{\alpha} = 0
Z_{yy} Z_{yx} \tilde{\alpha} - \lambda Z_{yy}^2 \tilde{\beta} = 0.
Assuming that Z_{xx} and Z_{yy} are invertible, we multiply the first equation with Z_{xx}^{-1} and the second with Z_{yy}^{-1} to obtain

Z_{xy} \tilde{\beta} - \lambda Z_{xx} \tilde{\alpha} = 0    (4.11)
Z_{yx} \tilde{\alpha} - \lambda Z_{yy} \tilde{\beta} = 0.    (4.12)

We are able to rewrite \tilde{\beta} from equation (4.12) as

\tilde{\beta} = \frac{Z_{yy}^{-1} Z_{yx} \tilde{\alpha}}{\lambda},

and substituting in equation (4.11) gives

Z_{xy} Z_{yy}^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 Z_{xx} \tilde{\alpha}.    (4.13)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. Let SS' be the complete Cholesky decomposition of Z_{xx}, such that Z_{xx} = SS' where S is a lower triangular matrix, and let \hat{\alpha} = S' \tilde{\alpha}. Substituting in equation (4.13) we obtain

S^{-1} Z_{xy} Z_{yy}^{-1} Z_{yx} S'^{-1} \hat{\alpha} = \lambda^2 \hat{\alpha}.

We now have a symmetric eigenproblem of the form Ax = \lambda x.

KCCA Regularisation with PGSO

We combine the dimensionality reduction introduced in the previous Section 4.3 with the regularisation parameter (Section 4.1) to maximise the learning. Following the same approach as in the previous section we can rewrite equations (4.1) and (4.2), with the approximation of K_x and K_y as formulated in equations (4.7) and (4.8) respectively, in the following manner

R_x R_x' R_y R_y' \beta - \lambda \left( R_x R_x' R_x R_x' + \kappa R_x R_x' \right) \alpha = 0
R_y R_y' R_x R_x' \alpha - \lambda \left( R_y R_y' R_y R_y' + \kappa R_y R_y' \right) \beta = 0.

Multiplying the first equation with R_x' and the second equation with R_y' gives

R_x' R_x R_x' R_y R_y' \beta - \lambda R_x' \left( R_x R_x' R_x R_x' + \kappa R_x R_x' \right) \alpha = 0    (4.14)
R_y' R_y R_y' R_x R_x' \alpha - \lambda R_y' \left( R_y R_y' R_y R_y' + \kappa R_y R_y' \right) \beta = 0.    (4.15)
Rewriting equations (4.14) and (4.15) with the new reduced correlation matrices Z as defined in the previous Section 4.3, we obtain

Z_{xx} Z_{xy} \tilde{\beta} - \lambda Z_{xx} (Z_{xx} + \kappa I) \tilde{\alpha} = 0
Z_{yy} Z_{yx} \tilde{\alpha} - \lambda Z_{yy} (Z_{yy} + \kappa I) \tilde{\beta} = 0.

Assuming that Z_{xx} and Z_{yy} are invertible, we multiply the first equation with Z_{xx}^{-1} and the second with Z_{yy}^{-1}:

Z_{xy} \tilde{\beta} - \lambda (Z_{xx} + \kappa I) \tilde{\alpha} = 0    (4.16)
Z_{yx} \tilde{\alpha} - \lambda (Z_{yy} + \kappa I) \tilde{\beta} = 0.    (4.17)

We are able to rewrite \tilde{\beta} from equation (4.17) as

\tilde{\beta} = \frac{(Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha}}{\lambda},

and substituting in equation (4.16) gives

Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 (Z_{xx} + \kappa I) \tilde{\alpha}.    (4.18)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. Performing a complete Cholesky decomposition Z_{xx} + \kappa I = SS', where S is a lower triangular matrix, and letting \hat{\alpha} = S' \tilde{\alpha}, substituting in equation (4.18) gives

S^{-1} Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} S'^{-1} \hat{\alpha} = \lambda^2 \hat{\alpha}.

We obtain a symmetric eigenproblem of the form Ax = \lambda x.

5 Experimental Results

In the following experiments the problem of learning the semantics of multimedia content by combining image and text data is addressed. The synthesis is addressed by the kernel Canonical Correlation Analysis described in Section 4.3. We test the use of the derived semantic space in an image retrieval task that uses only image content. The aim is to allow retrieval of images from a text query but without reference to any labeling associated with the image. This can be viewed as a cross-modal retrieval task. We used the combined multimedia image-text web database, which was kindly provided by the authors of [15], where we are trying to facilitate mate retrieval on a test set. The data was divided into three classes (Figure 1) - Sport, Aviation and Paintball - of 400 records each, and consisted of jpeg images retrieved from the Internet with attached text. We randomly split each class into two halves which were used as training and test data accordingly. The extracted features of the data were the same as used in [15] (a detailed description of the features can be found in [15]): image HSV colour, image Gabor texture and term frequencies in text.

Figure 1: Example of images in the database (Aviation, Sports and Paintball classes).

We compute the value of κ for the regularisation by running the KCCA with the association between image and text randomised. Let λ(κ) be the spectrum without randomisation (the database with itself) and λ_R(κ) be the spectrum with randomisation (the database with a randomised version of itself), where by spectrum we mean the vector whose entries are the eigenvalues. We would like the non-random spectrum to be as distant as possible from the randomised spectrum, since if the same correlation occurs for λ(κ) and λ_R(κ) then clearly overfitting is taking place. Therefore we expect for κ = 0 (no regularisation), with j = (1, . . . , 1) the all-ones vector, that we may have λ(κ) = λ_R(κ) = j, since it is very possible that the examples are linearly independent. Though we find that only 50% of the examples are linearly independent, this does not affect the selection of κ through this method. We choose κ to be the value for which the difference between the spectrum of the randomised set and the true spectrum is maximal (in the two norm):

\kappa = \arg\max_{\kappa} \| \lambda_R(\kappa) - \lambda(\kappa) \|.

We find that κ = 7, and we set, via a heuristic technique, the Gram-Schmidt precision parameter η = 0.5.
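The selection heuristic can be expressed directly in code. The sketch below is our own (not the authors' script); it takes the KCCA spectrum computation as a callable argument, which is an assumption about how the solver is packaged, and it breaks the image-text association by randomly permuting one of the Gram matrices.

import numpy as np

def select_kappa(Kx, Ky, kappa_grid, spectrum, seed=0):
    """Pick kappa = argmax || lambda_R(kappa) - lambda(kappa) ||_2.

    `spectrum(Kx, Ky, kappa)` must return the vector of KCCA correlations
    (e.g. the lam output of a KCCA solver); lambda_R is obtained from a
    randomly permuted pairing of the two views."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(Kx.shape[0])
    Ky_rand = Ky[np.ix_(perm, perm)]          # same view, randomised pairing
    gaps = []
    for kappa in kappa_grid:
        lam = spectrum(Kx, Ky, kappa)
        lam_r = spectrum(Kx, Ky_rand, kappa)
        gaps.append(np.linalg.norm(lam - lam_r))
    return kappa_grid[int(np.argmax(gaps))]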

To perform the test image retrieval we compute the features of the images and text query using the Gram-Schmidt algorithm. Once we have obtained the features for the test query (text) and test images, we project them into the semantic feature space using β̃ and α̃ (which are computed through training) respectively. Now we can compare them using an inner product of the semantic feature vectors. The higher the value of the inner product, the more similar the two objects are. Hence, we retrieve the images whose inner products with the test query are highest.

We compared the performance of our methods with a retrieval technique based on the Generalised Vector Space Model (GVSM). This uses as a semantic feature vector the vector of inner products between either a text query and each training label, or a test image and each training image. For both methods we have used a Gaussian kernel, with σ = max. distance/20, for the image colour component, and all experiments were an average of 10 runs. For convenience we separate the content-based and mate-based approaches into the following Subsections 5.1 and 5.2 respectively.

5.1 Content-Based Retrieval

In this experiment we used the first 30 and 5 α̃ and β̃ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. Success is considered if the images contained in the set are of the same label as the query text (Figure 3 shows a retrieval example for a set of 5 images).

Image set    GVSM success    KCCA success (30)    KCCA success (5)
10           78.93%          85%                  90.97%
30           76.82%          83.02%               90.69%

Table 1: Success cross-results between kernel-CCA and the generalised vector space model.

Figure 2: Success plot for content-based KCCA against GVSM (success (%) against image set size).

In Tables 1 and 2 we compare the performance of the kernel-CCA algorithm and the generalised vector space model. In Table 1 we present the performance of the methods over 10 and 30 image sets, while in Table 2, as plotted in Figure 2, we see the overall performance of the KCCA method against the GVSM for image sets of size 1 − 200; at the 200th image set location the maximum of 200 × 600 same-labelled images over all text queries can be retrieved (we only have 200 images per label). The success rate in Table 1 and Figure 2 is computed as follows

\text{success \% for image set } i = \frac{\sum_{j=1}^{600} \sum_{k=1}^{i} count_{jk}}{600 \cdot i} \times 100,

where count_{jk} = 1 if image k in the set is of the same label as the text query present in the set, else count_{jk} = 0. The success rate in Table 2 is computed as above and averaged over all image sets.
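For completeness, the success measure can be computed as below. This is our own sketch; retrieved_labels[j] is assumed to hold the labels of the images retrieved for query j, and query_labels[j] the label of query j.

import numpy as np

def content_success(retrieved_labels, query_labels, set_size):
    """Average percentage of retrieved images sharing the query's label,
    over all test queries, for one image set size."""
    counts = [np.sum(np.asarray(retrieved_labels[j][:set_size]) == query_labels[j])
              for j in range(len(query_labels))]
    return 100.0 * np.sum(counts) / (len(query_labels) * set_size)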

As visible in Figure 4, we observe that when we add eigenvectors to the semantic projection we reduce the success of the content-based retrieval. We speculate that this may be the result of unnecessary detail in the semantic projection, as the semantic information needed is contained in the first few eigenvectors. Hence a minimal selection of 5 eigenvectors is sufficient to obtain a high success rate.

Figure 3: Images retrieved for the text query: "height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none".

Method       Overall success
GVSM         72.3%
KCCA (30)    79.12%
KCCA (5)     88.25%

Table 2: Success rate over all image sets (1 − 200).

5.2 Mate-Based Retrieval

In this experiment we used the first 150 and 30 α̃ and β̃ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. A successful match is considered if the image that actually matched the chosen text is contained in this set. We compute the success as the average of 10 runs (Figure 5 shows a retrieval example for a set of 5 images).

Image set    GVSM success    KCCA success (30)    KCCA success (150)
10           8%              17.19%               59.5%
30           19%             32.32%               69%

Table 3: Success cross-results between kernel-CCA and the generalised vector space model.

Figure 4: Content-based plot of eigenvector selection against overall success (%).

In Table 3 we compare the performance of the KCCA algorithm with the GVSM over 10 and 30 image sets, whereas in Table 4 we present the overall success over all image sets. In Figure 6 we see the overall performance of the KCCA method against the GVSM for all possible image sets. The success rate in Table 3 and Figure 6 is computed as follows

\text{success \% for image set } i = \frac{\sum_{j=1}^{600} count_j}{600} \times 100,

where count_j = 1 if the exact matching image to the text query was present in the set, else count_j = 0. The success rate in Table 4 is computed as above and averaged over all image sets.

Method       Overall success
GVSM         70.6511%
KCCA (30)    83.4671%
KCCA (150)   92.9781%

Table 4: Success rate over all image sets.

As visible in Figure 7, we find that unlike the content-based retrieval (Section 5.1), increasing the number of eigenvectors used assists in locating the matching image to the query text. We speculate that this may be the result of added detail towards exact correlation in the semantic projection. We do not compute for all eigenvectors, though, as this process would be expensive and the remaining eigenvectors would not necessarily add meaningful semantic information.

Figure 5: Images retrieved for the text query: "at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america west america west 757-2s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997." The actual match is the middle picture in the first row.

It is visible that the kernel-CCA significantly outperforms the GVSM method, both in content retrieval and in mate retrieval.

5.3 Regularisation Parameter

We next verify that the method of selecting the regularisation parameter κ a priori gives a value that performs well. We randomly split each class into two halves, which were used as training and test data accordingly; we keep this divided set for all runs. We set the value of the incomplete Gram-Schmidt orthogonalisation precision parameter η = 0.5 and run over possible values of κ, where for each value we test its content-based and mate-based retrieval performance.

Let κ̂ be the previous optimal choice of the regularisation parameter, κ̂ = κ = 7. As we define the new optimal value of κ by its performance on the testing set, we can say that this method is biased (loosely speaking, it is cheating). We will show, though, that despite this, the difference between the performance of the biased κ and our a priori κ̂ is slight.

In Table 5 we compare the overall performance of the Content-Based (CB) retrieval with respect to the different values of κ, and in Figures 8 and 9 we plot the comparison. We observe that the difference in performance between the a priori value κ̂ and the newly found optimal value κ is 1.0423% for 5 eigenvectors and 5.031% for 30 eigenvectors. The more substantial increase in performance on the latter is due to the increase in the selection of the regularisation parameter, which compensates for the substantial decrease in performance (Figure 4) of the content-based retrieval when a high dimensional semantic feature space is used.

Figure 6: Success plot for KCCA mate-based against GVSM (success (%) against image set size).

κ      CB-KCCA (30)    CB-KCCA (5)
0      46.278%         43.8374%
κ̂      83.5238%        91.7513%
90     88.4592%        92.7936%
230    88.5548%        92.5281%

Table 5: Overall success of Content-Based (CB) KCCA with respect to κ.

κ      MB-KCCA (30)    MB-KCCA (150)
0      73.4756%        83.46%
κ̂      84.75%          92.4%
170    85.5086%        92.9975%
240    85.5086%        93.0083%
430    85.4914%        93.027%

Table 6: Overall success of Mate-Based (MB) KCCA with respect to κ.

In Table 6 we compare the overall performance of the Mate-Based (MB) retrieval with respect to the different values of κ, and in Figures 10 and 11 we view a plot of the comparison. We observe that in this case the difference in performance between the a priori value κ̂ and the newly found optimal value κ is 0.627% for 150 eigenvectors and 0.7586% for 30 eigenvectors.

Figure 7: Mate-based plot of eigenvector selection against overall success (%).
Figure 8: Content-based: κ selection over overall success for 30 eigenvectors.
Figure 9: Content-based: κ selection over overall success for 5 eigenvectors.
Figure 10: Mate-based: κ selection over overall success for 30 eigenvectors.
Figure 11: Mate-based: κ selection over overall success for 150 eigenvectors.

Our observed results support our proposed method for selecting the regularisation parameter κ in an a priori fashion, since the difference between the actual optimal κ and the a priori κ̂ is very slight.

6 Generalisation of Canonical Correlation Analysis

In this section we follow A. Gifi's book "Nonlinear Multivariate Analysis" (1990) [9] and partially J. R. Ketterling's "Canonical analysis of several sets of variables" (1971) [14].

6.1 Some notations

For an n × n square matrix A having elements {a_{ij}}, i, j = 1, . . . , n, we can define the trace by the formula

\mathrm{Tr}(A) = \sum_i a_{ii}    (6.1)

and the norm \| \cdot \|_F, the so-called Frobenius norm, by

\|A\|_F = \sqrt{\mathrm{Tr}(A'A)} = \sqrt{\sum_{ij} a_{ij}^2}.    (6.2)

If a_i denotes the ith column (row) of A, then we have

\|A\|_F^2 = \sum_i \|a_i\|_2^2 = \sum_i \langle a_i, a_i \rangle,    (6.3)

where the notation \| \cdot \|_2 means the Euclidean (l_2) norm of a vector.
6.2 Some propositions

Proposition 3. Let an optimisation problem be given in the form

\min_{x, y} f(x, y)    (6.4)
subject to    (6.5)
g(y) = 0,    (6.6)
x \in \mathbb{R}^m, \; y \in \mathbb{R}^n.    (6.7)

Let the set Y \subseteq \mathbb{R}^n be the feasibility domain for y determined by the constraint g(y) = 0.

Assume the function f is convex in both variables x and y, the optimal solution of x can be expressed by a function h(y) of the optimal solution of y, where h is defined on the whole set Y, and the functions f, g, h are twice continuously differentiable on \mathbb{R}^m \times Y. Then the optimisation problem with the same constraints

\min_{y} f(h(y), y)    (6.8)
subject to    (6.9)
g(y) = 0,    (6.10)
y \in \mathbb{R}^n,    (6.11)

has the same optimal solution in y as equation (6.4) has.

Proof. Let the optimal solution of equation (6.4) be denoted by (x_1, y_1) and that of equation (6.8) by y_2.

Based on the condition of the proposition we have x_1 = h(y_1). Because y_1 is a feasible solution for equation (6.8), we have f(x_1, y_1) = f(h(y_1), y_1) \ge f(h(y_2), y_2); but the objective function of equation (6.4) is not restricted in the first variable, thus the inequality f(x_1, y_1) \le f(h(y_2), y_2) holds, hence f(x_1, y_1) = f(h(y_2), y_2).

From the convexity of f and the identical feasibility domains, the optimum solutions have to be the same.

6.3 Formulation of the Canonical Correlation

Let H^{(1)}, H^{(2)} be matrices of size m × n_1 and m × n_2 respectively, and assume the sums of the elements in the columns of these matrices are equal to 0 (they are centralised) and that the columns within each matrix are linearly independent vectors. We consider arbitrary linear combinations of the columns of these matrices in the form H^{(1)} a_i^{(1)}, H^{(2)} a_i^{(2)}, i = 1, . . . , p. Let A^{(1)} = (a_1^{(1)}, . . . , a_{n_1}^{(1)}) and A^{(2)} = (a_1^{(2)}, . . . , a_{n_2}^{(2)}) be the matrices with these vectors as columns. We introduce a notation for the products of the matrices to simplify the formulas:

\Sigma_{ij} = H^{(i)'} H^{(j)}, \quad i, j = 1, 2.    (6.12)

We are looking for linear combinations of the columns of these matrices such that the first pair of vectors (a_1^{(1)}, a_1^{(2)}) is an optimal solution of the optimisation problem

\max_{a_1^{(1)}, a_1^{(2)}} \; a_1^{(1)'} \Sigma_{12} a_1^{(2)}    (6.13)
subject to    (6.14)
a_1^{(1)'} \Sigma_{11} a_1^{(1)} = 1,    (6.15)
a_1^{(2)'} \Sigma_{22} a_1^{(2)} = 1.    (6.16)

The meaning of this optimisation problem is to find the maximum correlation between linear combinations of the columns of the matrices H^{(1)} and H^{(2)}, subject to the lengths of the vectors corresponding to these linear combinations being normalised to 1.

To determine the remaining pairs of vectors (columns of A^{(1)} and A^{(2)}), a series of optimisation problems is solved successively. For the pair of vectors (a_r^{(1)}, a_r^{(2)}), r = 2, . . . , p, we have

\max_{a_r^{(1)}, a_r^{(2)}} \; a_r^{(1)'} \Sigma_{12} a_r^{(2)}    (6.17)
subject to    (6.18)
a_r^{(k)'} \Sigma_{kk} a_r^{(k)} = 1,    (6.19)
a_r^{(k)'} \Sigma_{kk} a_j^{(k)} = 0,    (6.20)
a_r^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.21)
k, l = 1, 2, \; k \ne l, \; j = 1, . . . , r - 1.    (6.22)

Problem (6.17) is the problem (6.13) expanded by the orthogonality constraints: the components of every new pair in the iteration have to be orthogonal to the components of the previous pairs.

The upper limit p of the iteration has to be \le \min(\mathrm{rank}(H^{(1)}), \mathrm{rank}(H^{(2)})). Applying the Karush-Kuhn-Tucker conditions we can express the optimal solutions of the problem (6.13) and of the problems (6.17) for r = 2, . . . , p. Let us begin with the problem (6.13).

First we apply a substitution such that

a_i^{(k)} = \Sigma_{kk}^{-1/2} y_i^{(k)},    (6.23)
D_{kl} = \Sigma_{kk}^{-1/2} \Sigma_{kl} \Sigma_{ll}^{-1/2},    (6.24)
k, l = 1, 2, \; i = 1, . . . , p.    (6.25)
Thus we have the problem

\max_{y_1^{(1)}, y_1^{(2)}} \; y_1^{(1)'} D_{12} y_1^{(2)}    (6.26)
subject to    (6.27)
y_1^{(1)'} y_1^{(1)} = 1,    (6.28)
y_1^{(2)'} y_1^{(2)} = 1.    (6.29)

The corresponding Lagrangian is

L_1 = y_1^{(1)'} D_{12} y_1^{(2)} - \frac{\lambda_1}{2}\left( y_1^{(1)'} y_1^{(1)} - 1 \right) - \frac{\lambda_2}{2}\left( y_1^{(2)'} y_1^{(2)} - 1 \right),    (6.30)

where \lambda_1 and \lambda_2 are the Lagrangian multipliers. The vectors of the partial derivatives of L_1 with respect to the vectors y_1^{(1)}, y_1^{(2)} are equal to 0 by the KKT conditions, thus we get

\frac{\partial L_1}{\partial y_1^{(1)}} = D_{12} y_1^{(2)} - \lambda_1 y_1^{(1)} = 0,    (6.31)
\frac{\partial L_1}{\partial y_1^{(2)}} = D_{21} y_1^{(1)} - \lambda_2 y_1^{(2)} = 0.    (6.32)

Multiplying equation (6.31) by y_1^{(1)'} and equation (6.32) by y_1^{(2)'} provides

y_1^{(1)'} D_{12} y_1^{(2)} - \lambda_1 y_1^{(1)'} y_1^{(1)} = 0,    (6.33)
y_1^{(2)'} D_{21} y_1^{(1)} - \lambda_2 y_1^{(2)'} y_1^{(2)} = 0.    (6.34)

Based on the constraints of the optimisation problem (6.26) and the identity D_{21} = D_{12}', we obtain

\lambda_1 = \lambda_2 = y_1^{(1)'} D_{12} y_1^{(2)}.    (6.35)
After replacing \lambda_1 and \lambda_2 with \lambda, the following equality system can be formulated:

\begin{pmatrix} -\lambda I & D_{12} \\ D_{21} & -\lambda I \end{pmatrix} \begin{pmatrix} y_1^{(1)} \\ y_1^{(2)} \end{pmatrix} = 0.    (6.36)

It is not too hard to realise that this equality system is a singular vector and value problem of the matrix D_{12}, with y_1^{(1)} and y_1^{(2)} being a left and a right singular vector and the value of the Lagrangian \lambda equal to the corresponding singular value. Based on these statements we can claim that the optimal solutions are the singular vectors belonging to the greatest singular value of the matrix D_{12}.

Considering the successive optimisation problems and applying a similar substitution for all the variables a_i^{(k)} as introduced in equation (6.23), a problem involving the greatest r singular values and the corresponding left and right singular vectors arises.
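The singular-value characterisation gives an immediate recipe for computing the canonical vectors: form D_12 = Σ_11^{-1/2} Σ_12 Σ_22^{-1/2}, take its SVD, and map the singular vectors back through the substitution (6.23). The sketch below is our own illustration of this recipe (the helper inv_sqrt and the small eps floor are our additions), assuming centred data matrices H1 and H2 with the same number of rows.

import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

def cca_svd(H1, H2, p):
    """First p canonical vector pairs via the SVD of D12 (Section 6.3)."""
    S11, S22, S12 = H1.T @ H1, H2.T @ H2, H1.T @ H2
    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, sing, Vt = np.linalg.svd(W1 @ S12 @ W2)
    A1 = W1 @ U[:, :p]          # a_i^(1) = Sigma_11^{-1/2} y_i^(1)
    A2 = W2 @ Vt.T[:, :p]       # a_i^(2) = Sigma_22^{-1/2} y_i^(2)
    return sing[:p], A1, A2     # canonical correlations and vector pairs
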
6.4 The simultaneous formulation of the canonical correlation

Instead of using the successive formulation of the canonical correlation we can join the subproblems into one. The simultaneous formulation is the optimisation problem

\max_{(a_1^{(1)}, a_1^{(2)}), \ldots, (a_p^{(1)}, a_p^{(2)})} \; \sum_{i=1}^{p} a_i^{(1)'} \Sigma_{12} a_i^{(2)}    (6.37)
subject to    (6.38)
a_i^{(1)'} \Sigma_{11} a_j^{(1)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}    (6.39)
a_i^{(2)'} \Sigma_{22} a_j^{(2)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}    (6.40)
i, j = 1, . . . , p,    (6.41)
a_i^{(1)'} \Sigma_{12} a_j^{(2)} = 0,    (6.42)
i, j = 1, . . . , p, \; j \ne i.    (6.43)
Based on equation (6.37) and the definition of the Frobenius norm we have a compact formulation of the canonical correlation problem:

\max_{A^{(1)}, A^{(2)}} \; \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right)    (6.44)
subject to    (6.45)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.46)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.47)
k, l \in \{1, 2\}, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i,    (6.48)

where I is the identity matrix of size p × p.

Repeating the substitution of equation (6.23), the set of feasible vectors for the simultaneous problem is equal to the left and right singular vectors of the matrix D_{12}, hence the optimal solution is compatible with that of the successive problems.

6.5 Correlation versus Distance

The canonical correlation problem can be transformed into a distance problem where the distance between two matrices is measured by the Frobenius norm:

\min_{A^{(1)}, A^{(2)}} \; \left\| H^{(1)} A^{(1)} - H^{(2)} A^{(2)} \right\|_F    (6.49)
subject to    (6.50)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.51)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.52)
k, l = 1, 2, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i.    (6.53)
Unfolding the objective function of the minimisation problem (6.49) shows the optimisation problem is the same as the maximisation problem (6.44): under the constraints,

\left\| H^{(1)} A^{(1)} - H^{(2)} A^{(2)} \right\|_F^2 = \mathrm{Tr}\left( A^{(1)'} \Sigma_{11} A^{(1)} \right) + \mathrm{Tr}\left( A^{(2)'} \Sigma_{22} A^{(2)} \right) - 2 \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right) = 2p - 2 \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right),

so minimising the distance is equivalent to maximising the trace term.

6.6 The generalisation of canonical correlation

Exploiting the distance problem we can give a generalisation of canonical correlation for more than two known matrices. Given a set of matrices \{H^{(1)}, . . . , H^{(K)}\} with dimensions m × n_1, . . . , m × n_K, we are looking for linear combinations of the columns of these matrices, in the matrix form A^{(1)}, . . . , A^{(K)}, such that they give the optimum solution of the problem

\min_{A^{(1)}, \ldots, A^{(K)}} \; \sum_{k, l = 1}^{K} \left\| H^{(k)} A^{(k)} - H^{(l)} A^{(l)} \right\|_F    (6.54)
subject to    (6.55)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.56)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.57)
k, l = 1, . . . , K, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i.    (6.58)

In the forthcoming sections we will show how to simplify this problem.

6.7 Total Distance versus Variance

Given a set of vectors X = \{x_1, . . . , x_m\} \subseteq \mathbb{R}^n, the notation x_{ki} means the ith component of the vector x_k.

The total squared distance, the sum of the squared Euclidean distances of all possible pairs of vectors in X, is equal to

\frac{1}{2} \sum_{\substack{k, l = 1 \\ l \ne k}}^{m} \| x_k - x_l \|_2^2;    (6.59)

as \| x_k - x_k \| = 0 for any k, we can drop the constraint l \ne k, thus we have

= \frac{1}{2} \sum_{k, l = 1}^{m} \| x_k - x_l \|_2^2    (6.60)
= \frac{1}{2} \sum_{k, l = 1}^{m} \sum_{i=1}^{n} (x_{ki} - x_{li})^2    (6.61)
= \frac{1}{2} \sum_{i=1}^{n} \sum_{k, l = 1}^{m} (x_{ki} - x_{li})^2    (6.62)
= \frac{1}{2} \sum_{i=1}^{n} \left( \sum_{k, l = 1}^{m} x_{ki}^2 + \sum_{k, l = 1}^{m} x_{li}^2 - \sum_{k, l = 1}^{m} 2 x_{ki} x_{li} \right)    (6.63)
= \frac{1}{2} \sum_{i=1}^{n} \left( m \sum_{k=1}^{m} x_{ki}^2 + m \sum_{l=1}^{m} x_{li}^2 - 2 \sum_{k=1}^{m} x_{ki} \sum_{l=1}^{m} x_{li} \right).    (6.64)
To simplify the formula we introduce

M_1(i) = \frac{1}{m} \sum_{k=1}^{m} x_{ki}, \qquad M_2(i) = \frac{1}{m} \sum_{k=1}^{m} x_{ki}^2,    (6.65)

so that we can reformulate equation (6.64) as

= \sum_{i=1}^{n} \left( m^2 M_2(i) - m^2 (M_1(i))^2 \right).    (6.66)

Applying the well-known identity of the variance for the vectors (x_{11}, . . . , x_{m1}), . . . , (x_{1n}, . . . , x_{mn}) gives

= m^2 \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} (x_{ki} - M_1(i))^2.    (6.67)

Hence the total squared distance turns out to be equal to the sum of the component-wise variances of the vectors in X multiplied by the square of the number of vectors.

Another statement about the variance is introduced. If we have the optimisation problem

\min_{z} \sum_{k=1}^{m} \| z - x_k \|_2^2, \quad x_k \in X, \; z \in \mathbb{R}^n,    (6.68)

then the optimal solution can be expressed by

z_i = \frac{1}{m} \sum_{k=1}^{m} x_{ki}.    (6.69)

The components of the optimal solution are equal to the mean values of the corresponding components of the known vectors.

6.8 General form

Let H^{(1)}, . . . , H^{(K)} be a set of known matrices of size m × n_1, . . . , m × n_K and let X be an unknown matrix of size m × p. The columns of the matrices H^{(1)}, . . . , H^{(K)} are centralised, i.e. the mean of every column in every matrix is equal to 0. We assume the columns of every matrix H^{(k)}, k = 1, . . . , K, are linearly independent. To simplify the formulas the notation \Sigma_{kl} = H^{(k)'} H^{(l)} is introduced. We are looking for linear combinations of the columns of the known matrices and a corresponding X such that they are the optimal solution of the optimisation problem given by

\min_{X, A^{(1)}, \ldots, A^{(K)}} \; \frac{1}{K} \sum_{k=1}^{K} \left\| X - H^{(k)} A^{(k)} \right\|_F^2    (6.70)
subject to    (6.71)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = \begin{cases} 1 & \text{if } k = l \text{ and } i = j, \\ 0 & \text{if } (k = l \text{ and } i \ne j) \text{ or } (k \ne l \text{ and } i \ne j), \end{cases}    (6.72)
k, l = 1, . . . , K, \; i, j = 1, . . . , p, \text{ except when } k \ne l \text{ and } i = j,    (6.73)

where a_i^{(k)} denotes the ith column of the matrix A^{(k)} containing the possible linear combinations.

Applying the substitutions, for all k = 1, . . . , K, i = 1, . . . , p,

a_i^{(k)} = \Sigma_{kk}^{-1/2} y_i^{(k)},    (6.74)

where we can compute the inverse because the columns of the matrix H^{(k)} are independent, meaning \Sigma_{kk} has full rank, we can transform this optimisation problem into a simpler form. First, we modify the set of constraints. To make this modification readable the notation

\Sigma_{kk}^{-1/2} \Sigma_{kl} \Sigma_{ll}^{-1/2} = D_{kl}, \quad k, l = 1, . . . , K,    (6.75)

is introduced, where we exploit the symmetricity of the matrices \Sigma_{kk}^{-1/2}.

Thus the constraints take the form

y_i^{(k)'} y_j^{(k)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j, \end{cases}    (6.76)
k = 1, . . . , K, \; i, j = 1, . . . , p,    (6.77)
y_i^{(k)'} D_{kl} y_j^{(l)} = 0,    (6.78)
k, l = 1, . . . , K, \; k \ne l, \; i, j = 1, . . . , p, \; i \ne j,    (6.79)

for which we can recognise the singular decomposition problems of the matrices \{D_{kl}\}. If we consider the matrix D_{kl} for a fixed pair of indices k, l and apply the singular decomposition we have

D_{kl} = Y^{(k)} \Lambda_{kl} Y^{(l)'},    (6.80)

where the matrices Y^{(k)} and Y^{(l)} have columns equal to the vectors y_i^{(k)} and y_i^{(l)} respectively, i = 1, . . . , p, and \Lambda_{kl} is a diagonal matrix. The products

Y^{(k)'} D_{kl} Y^{(l)} = \Lambda_{kl}    (6.81)

correspond to the items having indices with the properties k \ne l and i = j; they give the singular values of the matrix D_{kl}. The consequence of the singular decomposition form is that the set of feasible solutions of the optimisation problem with constraints (6.76) is equal to the set of singular vectors of the matrices \{D_{kl}, \; k, l = 1, . . . , K\}.

To express the objective function of the optimisation problem (6.70) we use the notations

Q_k = H^{(k)} \Sigma_{kk}^{-1/2},    (6.82)
D_{kl} = Q_k' Q_l.    (6.83)
We can derive another statement about the optimal solution of the problem. Exploiting the definition of the Frobenius norm, the objective function (6.70) can be rewritten as a sum over the column vectors, where x_i denotes the ith column of the matrix X:

\frac{1}{K} \sum_{k=1}^{K} \left\| X - H^{(k)} A^{(k)} \right\|_F^2    (6.84)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\| x_i - H^{(k)} a_i^{(k)} \right\|_2^2    (6.85)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\| x_i - Q_k y_i^{(k)} \right\|_2^2    (6.86)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\langle x_i - Q_k y_i^{(k)}, \; x_i - Q_k y_i^{(k)} \right\rangle.    (6.87)

The constraints are formulated in equation (6.76).

For the Lagrangian function of the optimisation problem we have

L = \sum_{k=1}^{K} \sum_{i=1}^{p} \left\langle x_i - Q_k y_i^{(k)}, \; x_i - Q_k y_i^{(k)} \right\rangle    (6.88)
\;+\; \sum_{k=1}^{K} \sum_{i=1}^{p} \lambda_{k,ii} \left( 1 - y_i^{(k)'} y_i^{(k)} \right)    (6.89)
\;+\; \sum_{k=1}^{K} \sum_{\substack{i, j = 1 \\ i \ne j}}^{p} \lambda_{k,ij} \left( - y_i^{(k)'} y_j^{(k)} \right)    (6.90)
\;+\; \sum_{\substack{k, l = 1 \\ k \ne l}}^{K} \sum_{\substack{i, j = 1 \\ i \ne j}}^{p} \lambda_{kl,ij} \left( - y_i^{(k)'} D_{kl} y_j^{(l)} \right),    (6.91)

where we have dropped the constant multiplier 1/K from the objective function (6.70).

After computing the partial derivatives, where x_i denotes the ith column of the matrix X, we get

\frac{\partial L}{\partial x_i} = \sum_{k=1}^{K} \left( 2 x_i - 2 Q_k y_i^{(k)} \right) = 0, \quad i = 1, . . . , p,    (6.92)

\frac{\partial L}{\partial y_i^{(k)}} = 2 D_{kk} y_i^{(k)} - 2 Q_k' x_i - 2 \sum_{j=1}^{p} \lambda_{k,ij} y_j^{(k)} - \sum_{\substack{l = 1 \\ l \ne k}}^{K} \sum_{\substack{j = 1 \\ j \ne i}}^{p} \lambda_{kl,ij} D_{kl} y_j^{(l)} = 0,    (6.93)
k = 1, . . . , K, \; i = 1, . . . , p.    (6.94)
We can express x_i for any i = 1, . . . , p from (6.92):

x_i = \frac{1}{K} \sum_{l=1}^{K} Q_l y_i^{(l)}, \quad i = 1, . . . , p.    (6.95)

Based on Proposition 3 we can replace the variable X in equation (6.70) by this expression of the other variables without changing the optimum value or the optimal solution. Thus we arrive at the variance problem.

7 Conclusions

Through this study we have presented a tutorial on canonical correlation analysis and have established a novel general approach to retrieving images based solely on their content. This is then applied to content-based and mate-based retrieval. Experiments show that image retrieval can be more accurate than with the Generalised Vector Space Model. We demonstrate that one can choose the regularisation parameter κ a priori such that it performs well in very different regimes. Hence we have come to the conclusion that kernel Canonical Correlation Analysis is a powerful tool for image retrieval via content. In the future we will extend our experiments to other data collections.

In the procedure of generalising canonical correlation analysis we can see that the original problem can be transformed and reinterpreted as a total distance problem or a variance minimisation problem. This special duality between the correlation and the distance requires more investigation to give a more suitable description of the structure of some special spaces generated by different kernels.

These approaches can give tools to handle some problems in the kernel space where the inner products and the distances between the points are known but the coordinates are not. For some problems it is sufficient to know only the coordinates of a few special points, which can be expressed from the known inner products, e.g. to do cluster analysis in the kernel space and to compute the coordinates of the cluster centres only.
Acknowledgments

We would like to acknowledge the financial support of EU Projects KerMIT, No. IST-2000-25341 and LAVA, No. IST-2001-34405.

1 Proof that \|K - G_i G_i'\| \le \eta

1.1 Some notation

Lemma 4. Let A and B be square matrices, with \mathrm{Trace}(A) = \sum_i a_{ii}. Then \mathrm{Trace}(AB) = \mathrm{Trace}(BA).

Proof.

\mathrm{Trace}(AB) = \sum_i (AB)_{ii} = \sum_{i,j} a_{ij} b_{ji} = \sum_{j,i} b_{ji} a_{ij} = \sum_j (BA)_{jj} = \mathrm{Trace}(BA).

Lemma 5. Let A be a symmetric matrix with eigenvalue decomposition A = V \Lambda V' (so that we are able to write \Lambda = V' A V). Then, using Lemma 4, \mathrm{Trace}(\Lambda) = \mathrm{Trace}(A).

Proof.

\mathrm{Trace}(\Lambda) = \mathrm{Trace}(V' A V) = \mathrm{Trace}((V' A) V) = \mathrm{Trace}(V (V' A)) = \mathrm{Trace}(V V' A) = \mathrm{Trace}(A).

Hence we have shown that \sum_i a_{ii} = \sum_i \lambda_i.

Lemma 6. If A is a symmetric matrix, the Euclidean norm of A is equal to its maximal eigenvalue.

Proof.

For any c \in \mathbb{R}, scaling x by c does not change the ratio

\frac{\|A(cx)\|}{\|cx\|} = \frac{c \|Ax\|}{c \|x\|} = \frac{\|Ax\|}{\|x\|},

hence we may take

\|A\| = \max_{\|x\| = 1} \|Ax\|, \qquad \|Ax\| = \sqrt{x' A' A x}.

Let U D U' be the eigenvalue decomposition of A'A, so that D is a diagonal matrix containing the squares of the eigenvalues of A:

A'A = U D U'
\|Ax\|^2 = x' U D U' x.

Setting w = U'x, and as U is orthogonal we can rewrite \|x\| = 1 as \|w\| = 1, we get

\|Ax\|^2 = w' D w = \sum_i \lambda_i^2 w_i^2.

We can see that the following holds:

\max_{\|w\| = 1} \sum_i \lambda_i^2 w_i^2 = \max_i \lambda_i^2.

Hence we obtain

\|A\| = \max_i \lambda_i.

1.2 Proof

Theorem 7. If K is a positive definite matrix and G_i G_i' is its incomplete Cholesky decomposition, then the Euclidean norm of K - G_i G_i' is less than or equal to the trace of the uncalculated part of K. Let \Delta K_i be the uncalculated part of K and let \eta = \mathrm{Trace}(\Delta K_i); then

\| K - G_i G_i' \| \le \eta.

Proof. Let GG' be the complete Cholesky decomposition K = GG', where G is a lower triangular matrix whose upper triangular part is zero,

G = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix}.

Let G_i G_i' be the incomplete decomposition of K, where i is the number of iterations of the Cholesky factorisation procedure performed,

G_i = \begin{pmatrix} A \\ B \end{pmatrix},

such that G_i G_i' = \tilde{K}_i, where \tilde{K}_i is the approximation of K, subject to a symmetric permutation of rows and columns. We assume the rows and columns of K are ordered and no permutation is necessary (this is only for convenience of the proof). Let \Delta K_i = K - \tilde{K}_i.

K = GG' = \begin{pmatrix} AA' & AB' \\ BA' & BB' + CC' \end{pmatrix}

\tilde{K}_i = G_i G_i' = \begin{pmatrix} AA' & AB' \\ BA' & BB' \end{pmatrix}

\Delta K_i = \begin{pmatrix} 0 & 0 \\ 0 & CC' \end{pmatrix}.

We show that CC' is positive semi-definite:

CC' = K_{i+1:n, i+1:n} - BB'
    = K_{i+1:n, i+1:n} - (BA')(AA')^{-1}(AB')
    = K_{i+1:n, i+1:n} - K_{i+1:n, 1:i} \, K_{1:i, 1:i}^{-1} \, K_{1:i, i+1:n},

and therefore

x' CC' x = \langle C'x, C'x \rangle \ge 0,

so every eigenvalue \lambda_C of CC' satisfies \lambda_C \ge 0. Thus CC' is a positive semi-definite matrix, hence \Delta K_i is a positive semi-definite matrix. Using Lemma 6 we are now able to show that

\| K - \tilde{K}_i \| = \| \Delta K_i \| = \max_{\|w\| = 1} \sum_i \lambda_i w_i^2 = \max_i \lambda_i.

As the maximum eigenvalue is less than or equal to the sum of all the eigenvalues (which are non-negative), using Lemma 5 we are able to rewrite the expression as

\| K - G_i G_i' \| \le \sum_i \lambda_i = \mathrm{Trace}(\Lambda) = \mathrm{Trace}(\Delta K_i).

Therefore,

\| K - G_i G_i' \| \le \eta.
Bibliography

[1] Shotaro Akaho. A kernel method for canonical correlation analysis. In International Meeting of the Psychometric Society, Osaka, 2001.

[2] Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.

[3] Magnus Borga. Learning Multidimensional Signal Processing. PhD thesis, Linköping Studies in Science and Technology, 1998.

[4] Magnus Borga. Canonical correlation: a tutorial, 1999.

[5] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.

[6] Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. Latent semantic kernels. In Carla Brodley and Andrea Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 66-73. Morgan Kaufmann Publishers, San Francisco, US, 2001.

[7] Colin Fyfe and Pei Ling Lai. ICA using kernel canonical correlation analysis.

[8] Colin Fyfe and Pei Ling Lai. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 2001.

[9] A. Gifi. Nonlinear Multivariate Analysis. Wiley, 1990.

[10] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1983.

[11] David R. Hardoon and John Shawe-Taylor. KCCA for different level precision in content-based image retrieval. Submitted to Third International Workshop on Content-Based Multimedia Indexing, IRISA, Rennes, France, 2003.

[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28:312-377, 1936.

[13] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, Inc, 1966.

[14] J. R. Ketterling. Canonical analysis of several sets of variables. Biometrika, 58:433-451, 1971.

[15] T. Kolenda, L. K. Hansen, J. Larsen, and O. Winther. Independent component analysis for understanding multimedia content. In H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, pages 757-766, Piscataway, New Jersey, 2002. IEEE Press. Martigny, Valais, Switzerland, Sept. 4-6, 2002.

[16] Malte Kuss and Thore Graepel. The geometry of kernel canonical correlation analysis. 2002.

[17] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communications and Image Representation, 10:39-62, 1999.

[18] Alexei Vinokourov, David R. Hardoon, and John Shawe-Taylor. Learning the semantics of multimedia content with application to web image retrieval and classification. In Proceedings of the Fourth International Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan, 2003.

[19] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems 15 (to appear), 2002.
