Canonical Correlation Analysis: An Overview with Application to Learning Methods

David R. Hardoon, Sandor Szedmak and John Shawe-Taylor

Department of Computer Science
Royal Holloway, University of London
{davidh, sandor, john}@cs.rhul.ac.uk

Technical Report
CSD-TR-03-02
May 28, 2003

Department of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, England

Abstract

We present a general method using kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text. The semantic space provides a common representation and enables a comparison between the text and images. In the experiments we look at two approaches to retrieving images based only on their content from a text query. We compare the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

Keywords: Canonical correlation analysis, kernel canonical correlation analysis, partial Gram-Schmidt orthogonalisation, Cholesky decomposition, incomplete Cholesky decomposition, kernel methods.
1 Introduction

During recent years there have been advances in data learning using kernel methods. Kernel representation offers an alternative to learning non-linear functions by projecting the data into a high dimensional feature space to increase the computational power of linear learning machines, though this still leaves open the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.

Principal Component Analysis (PCA) is a multivariate data analysis procedure that involves a transformation of a number of possibly correlated variables into a smaller number of uncorrelated variables known as principal components. PCA only makes use of the training inputs while making no use of the labels.

Independent Component Analysis (ICA), in contrast to correlation-based transformations such as PCA, not only decorrelates the signals but also reduces higher-order statistical dependencies, attempting to make the signals as independent as possible. In other words, ICA is a way of finding a linear, not only orthogonal, co-ordinate system in any multivariate data. The directions of the axes of this co-ordinate system are determined by both the second and higher order statistics of the original data. The goal is to perform a linear transform which makes the resulting variables as statistically independent from each other as possible.

Partial Least Squares (PLS) is a method similar to canonical correlation analysis. It selects feature directions that are useful for the task at hand, though PLS only uses one view of an object and the label as the corresponding pair. PLS could be thought of as a method which looks for directions that are good at distinguishing the different labels.

Canonical Correlation Analysis (CCA) is a method of correlating linear relationships between two multidimensional variables. CCA can be seen as using complex labels as a way of guiding feature selection towards the underlying semantics. CCA makes use of two views of the same semantic object to extract the representation of the semantics. The main difference between CCA and the other three methods is that CCA is closely related to mutual information (Borga 1998 [3]). Hence CCA can be easily motivated in information-based tasks and is our natural selection.

Proposed by H. Hotelling in 1936 [12], CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. In an attempt to increase the flexibility of the feature selection, kernelisation of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in some preliminary work by Fyfe & Lai [8], Akaho [1] and, more recently, by Vinokourov et al. [19] with improved results.

During recent years there has been a vast increase in the amount of multimedia content available both off-line and online, though we are unable to access or make use of this data unless it is organised in such a way as to allow efficient browsing. To enable content-based retrieval with no reference to labeling we attempt to learn the semantic representation of images and their associated text. We present a general approach using KCCA that can be used for content-based [11] as well as mate-based retrieval [18, 11]. In both cases we compare the KCCA approach to the Generalised Vector Space Model (GVSM), which aims at capturing some term-term correlations by looking at co-occurrence information.

This study aims to serve as a tutorial and give additional novel contributions in the following ways:

In this study we follow the work of Borga [4] where we represent the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and dimensionality of the eigenvectors.

Further to that, we follow the idea of Bach & Jordan [2] to compute a new correlation matrix with reduced dimensionality. Though Bach & Jordan [2] address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that using partial Gram-Schmidt orthogonalisation [6] is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen as a dual implementation of partial Gram-Schmidt.

We show that the general approach can be adapted to two different types of problems, content and mate retrieval, by only changing the selection of eigenvectors used in the semantic projection.

To simplify the learning of the KCCA we explore a method of selecting the regularization parameter a priori such that it gives a value that performs well in several different tasks.

In this study we also present a generalisation of the framework for canonical correlation analysis. Our approach is based on the works of Gifi (1990) and Ketterling (1971). The purpose of the generalisation is to extend canonical correlation as an associativity measure between two sets of variables to more than two sets, whilst preserving most of its properties. The generalisation starts with the optimisation problem formulation of canonical correlation. By changing the objective function we arrive at the multiset problem. Applying similar constraint sets in the optimisation problems we find that the feasible solutions are singular vectors of matrices, which are derived in the same way for the original and generalised problems.

In Section 2 we present the theoretical background of CCA. In Section 3 we present the CCA and KCCA algorithms. Approaches to deal with the computational problems that arise in Section 3 are presented in Section 4. Our experimental results are presented in Section 5. In Section 6 we present the generalisation framework for CCA, while Section 7 draws final conclusions.

2 Theoretical Foundations

Proposed by H. Hotelling in 1936 [12], Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. Correlation analysis is dependent on the co-ordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, depending on the co-ordinate system used, this relationship might not be visible as a correlation. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed the corresponding co-ordinates are maximally correlated.

Consider a multivariate random vector of the form (x, y). Suppose we are given a sample of instances S = ((x_1, y_1), . . . , (x_m, y_m)) of (x, y); we use S_x to denote (x_1, . . . , x_m) and similarly S_y to denote (y_1, . . . , y_m). We can consider defining a new co-ordinate for x by choosing a direction w_x and projecting x onto that direction,

x \mapsto \langle w_x, x \rangle.

If we do the same for y by choosing a direction w_y we obtain a sample of the new x co-ordinate as

S_{x, w_x} = (\langle w_x, x_1 \rangle, . . . , \langle w_x, x_m \rangle)

with the corresponding values of the new y co-ordinate being

S_{y, w_y} = (\langle w_y, y_1 \rangle, . . . , \langle w_y, y_m \rangle).


The first stage of canonical correlation is to choose w_x and w_y to maximise the correlation between the two vectors. In other words, the function to be maximised is

\rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y \rangle}{\|S_x w_x\| \, \|S_y w_y\|}.

If we use \hat{E}[f(x, y)] to denote the empirical expectation of the function f(x, y), where

\hat{E}[f(x, y)] = \frac{1}{m} \sum_{i=1}^{m} f(x_i, y_i),

we can rewrite the correlation expression as

\rho = \max_{w_x, w_y} \frac{\hat{E}[\langle w_x, x \rangle \langle w_y, y \rangle]}{\sqrt{\hat{E}[\langle w_x, x \rangle^2] \, \hat{E}[\langle w_y, y \rangle^2]}}
     = \max_{w_x, w_y} \frac{w_x' \hat{E}[x y'] w_y}{\sqrt{w_x' \hat{E}[x x'] w_x \; w_y' \hat{E}[y y'] w_y}},

where we use A' to denote the transpose of a vector or matrix A.

Now observe that the covariance matrix of (x, y) is

C(x, y) = \hat{E}\left[ \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}' \right] = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}.    (2.1)

The total covariance matrix C is a block matrix where the within-sets covariance matrices are C_{xx} and C_{yy} and the between-sets covariance matrices are C_{xy} = C_{yx}'.

Hence, we can rewrite the function ρ as

\rho = \max_{w_x, w_y} \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}};    (2.2)

the maximum canonical correlation is the maximum of ρ with respect to w_x and w_y.

3 Algorithm

In this section we will give an overview of the Canonical Correlation Analysis (CCA) and Kernel-CCA (KCCA) algorithms, where we formulate the optimisation problem as a generalised eigenproblem.

3.1 Canonical Correlation Analysis

Observe that the solution of equation (2.2) is not affected by re-scaling w_x or w_y either together or independently, so that for example replacing w_x by αw_x gives the quotient

\frac{\alpha w_x' C_{xy} w_y}{\sqrt{\alpha^2 w_x' C_{xx} w_x \; w_y' C_{yy} w_y}} = \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}.

Since the choice of re-scaling is therefore arbitrary, the CCA optimisation problem formulated in equation (2.2) is equivalent to maximising the numerator
subject to

w_x' C_{xx} w_x = 1
w_y' C_{yy} w_y = 1.

The corresponding Lagrangian is

L(\lambda, w_x, w_y) = w_x' C_{xy} w_y - \frac{\lambda_x}{2}\left(w_x' C_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y' C_{yy} w_y - 1\right).

Taking derivatives in respect to w_x and w_y we obtain

\frac{\partial L}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0    (3.1)
\frac{\partial L}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0.    (3.2)
Subtracting w_y' times the second equation from w_x' times the first we have

0 = w_x' C_{xy} w_y - w_x' \lambda_x C_{xx} w_x - w_y' C_{yx} w_x + w_y' \lambda_y C_{yy} w_y,

which together with the constraints implies that \lambda_y - \lambda_x = 0; let \lambda = \lambda_x = \lambda_y. Assuming C_{yy} is invertible we have

w_y = \frac{C_{yy}^{-1} C_{yx} w_x}{\lambda}    (3.3)

and so substituting in equation (3.1) gives

C_{xy} C_{yy}^{-1} C_{yx} w_x - \lambda^2 C_{xx} w_x = 0

or

C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x.    (3.4)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. We can therefore find the co-ordinate system that optimises the correlation between corresponding co-ordinates by first solving for the generalised eigenvectors of equation (3.4) to obtain the sequence of w_x's and then using equation (3.3) to find the corresponding w_y's.

As the covariance matrices C_{xx} and C_{yy} are symmetric positive definite, we are able to decompose them using a complete Cholesky decomposition (more details on Cholesky decomposition can be found in Section 4.2)

C_{xx} = R_{xx} \cdot R_{xx}',

where R_{xx} is a lower triangular matrix. If we let u_x = R_{xx}' \cdot w_x we are able to rewrite equation (3.4) as follows

C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 R_{xx} u_x
R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 u_x.

We are therefore left with a symmetric eigenproblem of the form Ax = \lambda x.
3.2 Kernel Canonical Correlation Analysis

CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher dimensional feature space

\phi : x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x)) \quad (n < N)

before performing CCA in the new feature space, essentially moving from the primal to the dual representation approach. Kernels are methods of implicitly mapping data into a higher dimensional feature space, a method known as the "kernel trick". A kernel is a function K such that for all x, z ∈ X

K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle,    (3.5)

where φ is a mapping from X to a feature space F. Kernels offer a great deal of flexibility, as they can be generated from other kernels. In the kernel the data only appears through entries in the Gram matrix, therefore this approach gives a further advantage as the number of tuneable parameters and the updating time do not depend on the number of attributes being used.

Using the definition of the covariance matrix in equation (2.1) we can rewrite the covariance matrix C using the data matrices (of vectors) X and Y, which have the sample vectors as rows and are therefore of size m × N; we obtain

C_{xx} = X'X
C_{xy} = X'Y.

The directions w_x and w_y (of length N) can be rewritten as the projection of the data onto the directions α and β (of length m)

w_x = X'\alpha
w_y = Y'\beta.

Substituting into equation (2.2) we obtain the following

\rho = \max_{\alpha, \beta} \frac{\alpha' X X' Y Y' \beta}{\sqrt{\alpha' X X' X X' \alpha \; \cdot \; \beta' Y Y' Y Y' \beta}}.    (3.6)

Let K_x = XX' and K_y = YY' be the kernel matrices corresponding to the two representations. We substitute into equation (3.6)

\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\alpha' K_x^2 \alpha \; \cdot \; \beta' K_y^2 \beta}}.    (3.7)

We find that in equation (3.7) the variables are now represented in the dual form.

Observe that as with the primal form presented in equation (2.2), equation (3.7) is not affected by re-scaling of α and β either together or independently. Hence
the KCCA optimisation problem formulated in equation (3.7) is equivalent to maximising the numerator subject to

\alpha' K_x^2 \alpha = 1
\beta' K_y^2 \beta = 1.

The corresponding Lagrangian is

L(\lambda_\alpha, \lambda_\beta, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta - 1\right).

Taking derivatives in respect to α and β we obtain

\frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0    (3.8)
\frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0.    (3.9)

Subtracting β' times the second equation from α' times the first we have

0 = \alpha' K_x K_y \beta - \alpha' \lambda_\alpha K_x^2 \alpha - \beta' K_y K_x \alpha + \beta' \lambda_\beta K_y^2 \beta,

which together with the constraints implies that \lambda_\alpha - \lambda_\beta = 0; let \lambda = \lambda_\alpha = \lambda_\beta. Considering the case where the kernel matrices K_x and K_y are invertible, we have

\beta = \frac{K_y^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{K_y^{-1} K_x \alpha}{\lambda},

and substituting in equation (3.8) gives

K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0.

Hence

K_x K_x \alpha - \lambda^2 K_x K_x \alpha = 0
or

I \alpha = \lambda^2 \alpha.    (3.10)

We are left with a standard eigenproblem of the form Ax = \lambda x. We can deduce from equation (3.10) that \lambda = 1 for every vector α; hence we can choose the projections w_x to be the unit vectors j_i, i = 1, . . . , m, while the w_y are the columns of \frac{1}{\lambda} K_y^{-1} K_x, so that perfect correlation can always be formed. Since kernel methods provide high dimensional representations, such independence is not uncommon. It is therefore clear that a naive application of CCA in kernel-defined feature space will not provide useful results. In the next section we investigate how this problem can be avoided.
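The degeneracy is easy to verify numerically: whenever the Gram matrices are invertible, the matrix of the (unregularised) dual problem reduces to the identity and every "canonical correlation" equals one, regardless of whether the two views are related. The snippet below is our own check on random, completely unrelated data (the Gaussian kernel helper is redefined here so the example is self-contained).

import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
Kx = gaussian_gram(rng.standard_normal((50, 10)))
Ky = gaussian_gram(rng.standard_normal((50, 7)))   # an unrelated second view

# With invertible Kx, Ky the dual problem collapses to I a = lambda^2 a,
# so every correlation is 1 even though the two views share nothing.
M = np.linalg.inv(Kx) @ Ky @ np.linalg.inv(Ky) @ Kx
lam2 = np.linalg.eigvals(M).real
print(lam2.min(), lam2.max())   # both approximately 1: spurious perfect correlation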
4 Computational Issues

We observe from equation (3.10) that if K_x is invertible maximal correlation is obtained, suggesting learning is trivial. To force non-trivial learning we introduce a control on the flexibility of the projections by penalising the norms of the associated weight vectors by a convex combination of constraints based on Partial Least Squares. Another computational issue that can arise is the use of large training sets, as this can lead to computational problems and degeneracy. To overcome this issue we apply partial Gram-Schmidt orthogonalisation (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.

4.1 Regularisation

To force non-trivial learning on the correlation we introduce a control on the flexibility of the projection mappings using Partial Least Squares (PLS) to penalise the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the denominator of equation (3.7), obtaining

\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \|w_x\|^2\right) \cdot \left(\beta' K_y^2 \beta + \kappa \|w_y\|^2\right)}}
     = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha\right) \cdot \left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta\right)}}.
We observe that the new regularised equation is not affected by re-scaling of α or β, hence the optimisation problem is subject to

\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha = 1
\beta' K_y^2 \beta + \kappa \beta' K_y \beta = 1.

The corresponding Lagrangian is

L(\lambda_\alpha, \lambda_\beta, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta' K_y^2 \beta + \kappa \beta' K_y \beta - 1\right).

Taking derivatives in respect to α and β gives

\frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha \left(K_x^2 \alpha + \kappa K_x \alpha\right)    (4.1)
\frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta \left(K_y^2 \beta + \kappa K_y \beta\right).    (4.2)

Subtracting β' times the second equation from α' times the first we have

0 = \lambda_\beta \beta'\left(K_y^2 \beta + \kappa K_y \beta\right) - \lambda_\alpha \alpha'\left(K_x^2 \alpha + \kappa K_x \alpha\right),

which together with the constraints implies that \lambda_\alpha - \lambda_\beta = 0; let \lambda = \lambda_\alpha = \lambda_\beta. Consider the case where K_x and K_y are invertible. We have

\beta = \frac{(K_y + \kappa I)^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{(K_y + \kappa I)^{-1} K_x \alpha}{\lambda},

and substituting in equation (4.1) gives

K_x K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 K_x (K_x + \kappa I) \alpha
K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 (K_x + \kappa I) \alpha
(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha.

We obtain an eigenproblem of the form Ax = \lambda x.

4.2 Cholesky Decomposition

We describe some background information on direct factorisation methods based on triangular decomposition [13],

LU = A,    (4.3)

in which the diagonal elements of L are not necessarily unity. If we consider L ≡ (l_{ij}), then equation (4.3) implies

l_{kk} u_{kk} = a_{kk} - \sum_{p=1}^{k-1} l_{kp} u_{pk},    (4.4)
u_{kj} = \left( a_{kj} - \sum_{p=1}^{k-1} l_{kp} u_{pj} \right) / l_{kk}, \quad j > k,    (4.5)
l_{ik} = \left( a_{ik} - \sum_{p=1}^{k-1} l_{ip} u_{pk} \right) / u_{kk}, \quad i > k.    (4.6)

Theorem 1. Let A be symmetric. If the factorisation LU = A is possible, then the choice l_{kk} = u_{kk} implies l_{ik} = u_{ki}, that is, LL^T = A.

Proof. Use equation (4.4) and induction on k.

A simple, non-singular, symmetric matrix for which the factorisation is not possible is

\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

On the other hand, if the symmetric matrix A is positive definite (i.e., x'Ax > 0 whenever x'x > 0), then the factorisation is possible. We have

Theorem 2. Let A be symmetric, positive definite. Then A can be factored in the form LL' = A.

Proof. Choosing l_{kk} = u_{kk}, we obtain from the previous equations LU = A with l_{ik} = u_{ki}.

Incomplete Cholesky Decomposition

Complete decomposition of a kernel matrix is an expensive step and should be avoided with real world data. Incomplete Cholesky decomposition as described in [2] differs from Cholesky decomposition in that all pivots which are below a certain threshold are skipped. If M is the number of non-skipped pivots, then we obtain a lower triangular matrix G_i with only M nonzero columns. Symmetric permutations of rows and columns are necessary during the factorisation if we require the rank to be as small as possible (Golub and Van Loan, 1983 [10]).

We describe the algorithm from [2] (with slight modification):

Input: N × N matrix K, precision parameter η.

1. Initialisation: i = 1, K' = K, P = I; for j ∈ [1, N] set G_{jj} = K_{jj}.
2. While \sum_{j=i}^{N} G_{jj} > η and i ≠ N + 1:
   • Find the best new pivot: j* = argmax_{j ∈ [i, N]} G_{jj}.
   • Update the permutation: P_next = I, P_next(i, i) = 0, P_next(j*, j*) = 0, P_next(i, j*) = 1, P_next(j*, i) = 1, and set P = P · P_next.
   • Permute the elements i and j* in K': K' = P_next · K' · P_next.
   • Update G to reflect the permutation: G(i, 1:i−1) ↔ G(j*, 1:i−1).
   • Set G(i, i) = \sqrt{K'(i, i)}.
   • Calculate the ith column of G:
     G(i+1:N, i) = \frac{1}{G(i, i)} \left( K'(i+1:N, i) - \sum_{j=1}^{i-1} G(i+1:N, j) \, G(i, j) \right).
   • Update only the diagonal elements: for j ∈ [i+1, N] set G(j, j) = K'(j, j) - \sum_{k=1}^{i} G(j, k)^2.
   • Set i = i + 1.
3. Output: an N × M lower triangular matrix G and a permutation matrix P such that \|P'KP - GG'\| \le \eta (see Appendix 1.2 for a proof).
The algorithm involves picking one column of K at a time, choosing the column to be added by greedily maximising a lower bound on the reduction in the error of the approximation. After l steps we have an approximation of the form \tilde{K}_l = G_{il} G_{il}', where G_{il} is N × l. The ranking of the N − l remaining vectors can be computed by comparing the diagonal elements of the remainder matrix K − G_{il} G_{il}'.

Partial Gram-Schmidt Orthogonalisation

We explore the Partial Gram-Schmidt Orthogonalisation (PGSO) algorithm, described in [6], as our matrix decomposition approach. ICD can be seen as equivalent to PGSO, as ICD is the dual implementation of PGSO. PGSO works as follows: the projection is built up as the span of a subset of the projections of a set of m training examples. These are selected by performing a Gram-Schmidt orthogonalisation of the training vectors in the feature space. We slightly modify the Gram-Schmidt algorithm so that it uses a precision parameter as a stopping criterion, as shown in [2].

Given a kernel K from a training set and a precision parameter η:

Initialisations:
  m = size of K (an N × N matrix)
  j = 1
  size and index are vectors with the same length as K
  feat is a zero matrix equal to the size of K
  for i = 1 to m do
    norm2[i] = K_{ii};

Algorithm:
  while \sum_i norm2[i] > η and j ≠ N + 1 do
    i_j = argmax_i(norm2[i]);
    index[j] = i_j;
    size[j] = \sqrt{norm2[i_j]};
    for i = 1 to m do
      feat[i, j] = \left( K(i, i_j) - \sum_{t=1}^{j-1} feat[i, t] \cdot feat[i_j, t] \right) / size[j];
      norm2[i] = norm2[i] - feat(i, j) \cdot feat(i, j);
    end;
    j = j + 1;
  end;
  return feat

Output:
  \|K - feat \cdot feat'\| \le \eta, where feat is an N × M lower triangular matrix.

We observe that the output is equivalent to the output of ICD.

To classify a new example at location i:

Given a kernel K from a testing set,

  for j = 1 to M do
    newfeat[j] = \left( K(i, index[j]) - \sum_{t=1}^{j-1} newfeat[t] \cdot feat[index[j], t] \right) / size[j];
  end;

The advantage of using partial Gram-Schmidt orthogonalisation (PGSO) in comparison to the incomplete Cholesky decomposition (as described in Section 4.2) is that there is no need for a permutation matrix P.

4.3 Kernel-CCA with PGSO

So far we have considered the kernel matrices as invertible, although in practice this may not be the case. In this Section we address the issue of using large training sets, which may lead to computational problems and degeneracy. We use PGSO to approximate the kernel matrices such that we are able to re-represent the correlation with reduced dimensionality.

Decomposing the kernel matrices K_x and K_y via PGSO, where R_x and R_y are lower triangular matrices, gives

K_x \approx R_x R_x'
K_y \approx R_y R_y'.

Substituting the new representation into equations (3.8) and (3.9) gives

R_x R_x' R_y R_y' \beta - \lambda R_x R_x' R_x R_x' \alpha = 0    (4.7)
R_y R_y' R_x R_x' \alpha - \lambda R_y R_y' R_y R_y' \beta = 0.    (4.8)

Multiplying the first equation with R_x' and the second equation with R_y' gives

R_x' R_x R_x' R_y R_y' \beta - \lambda R_x' R_x R_x' R_x R_x' \alpha = 0    (4.9)
R_y' R_y R_y' R_x R_x' \alpha - \lambda R_y' R_y R_y' R_y R_y' \beta = 0.    (4.10)

Let Z denote the new correlation matrices with the reduced dimensionality

R_x' R_x = Z_{xx}
R_y' R_y = Z_{yy}
R_x' R_y = Z_{xy}
R_y' R_x = Z_{yx},

and let \tilde{\alpha} and \tilde{\beta} be the reduced directions, such that

\tilde{\alpha} = R_x' \alpha
\tilde{\beta} = R_y' \beta.

Substituting in equations (4.9) and (4.10) we find that we return to the primal representation of CCA with a dual representation of the data

Z_{xx} Z_{xy} \tilde{\beta} - \lambda Z_{xx}^2 \tilde{\alpha} = 0
Z_{yy} Z_{yx} \tilde{\alpha} - \lambda Z_{yy}^2 \tilde{\beta} = 0.
Assuming that Z_{xx} and Z_{yy} are invertible, we multiply the first equation with Z_{xx}^{-1} and the second with Z_{yy}^{-1} to obtain

Z_{xy} \tilde{\beta} - \lambda Z_{xx} \tilde{\alpha} = 0    (4.11)
Z_{yx} \tilde{\alpha} - \lambda Z_{yy} \tilde{\beta} = 0.    (4.12)

We are able to rewrite \tilde{\beta} from equation (4.12) as

\tilde{\beta} = \frac{Z_{yy}^{-1} Z_{yx} \tilde{\alpha}}{\lambda},

and substituting in equation (4.11) gives

Z_{xy} Z_{yy}^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 Z_{xx} \tilde{\alpha}.    (4.13)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. Let SS' be the complete Cholesky decomposition of Z_{xx}, such that Z_{xx} = SS' where S is a lower triangular matrix, and let \hat{\alpha} = S' \tilde{\alpha}. Substituting in equation (4.13) we obtain

S^{-1} Z_{xy} Z_{yy}^{-1} Z_{yx} S'^{-1} \hat{\alpha} = \lambda^2 \hat{\alpha}.

We now have a symmetric eigenproblem of the form Ax = \lambda x.

KCCA Regularisation with PGSO

We combine the dimensionality reduction introduced in the previous Section 4.3 with the regularisation parameter (Section 4.1) to maximise the learning. Following the same approach as in the previous section we can rewrite equations (4.1) and (4.2), with the approximation of K_x and K_y as formulated in equations (4.7) and (4.8) respectively, in the following manner

R_x R_x' R_y R_y' \beta - \lambda \left( R_x R_x' R_x R_x' + \kappa R_x R_x' \right) \alpha = 0
R_y R_y' R_x R_x' \alpha - \lambda \left( R_y R_y' R_y R_y' + \kappa R_y R_y' \right) \beta = 0.

Multiplying the first equation with R_x' and the second equation with R_y' gives

R_x' R_x R_x' R_y R_y' \beta - \lambda R_x' \left( R_x R_x' R_x R_x' + \kappa R_x R_x' \right) \alpha = 0    (4.14)
R_y' R_y R_y' R_x R_x' \alpha - \lambda R_y' \left( R_y R_y' R_y R_y' + \kappa R_y R_y' \right) \beta = 0.    (4.15)
Rewriting equations (4.14) and (4.15) with the new reduced correlation matrices Z as defined in the previous Section 4.3, we obtain

Z_{xx} Z_{xy} \tilde{\beta} - \lambda Z_{xx} (Z_{xx} + \kappa I) \tilde{\alpha} = 0
Z_{yy} Z_{yx} \tilde{\alpha} - \lambda Z_{yy} (Z_{yy} + \kappa I) \tilde{\beta} = 0.

Assuming that Z_{xx} and Z_{yy} are invertible, we multiply the first equation with Z_{xx}^{-1} and the second with Z_{yy}^{-1}:

Z_{xy} \tilde{\beta} - \lambda (Z_{xx} + \kappa I) \tilde{\alpha} = 0    (4.16)
Z_{yx} \tilde{\alpha} - \lambda (Z_{yy} + \kappa I) \tilde{\beta} = 0.    (4.17)

We are able to rewrite \tilde{\beta} from equation (4.17) as

\tilde{\beta} = \frac{(Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha}}{\lambda},

and substituting in equation (4.16) gives

Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde{\alpha} = \lambda^2 (Z_{xx} + \kappa I) \tilde{\alpha}.    (4.18)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. Performing a complete Cholesky decomposition Z_{xx} + \kappa I = SS', where S is a lower triangular matrix, and letting \hat{\alpha} = S' \tilde{\alpha}, substituting in equation (4.18) gives

S^{-1} Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} S'^{-1} \hat{\alpha} = \lambda^2 \hat{\alpha}.

We obtain a symmetric eigenproblem of the form Ax = \lambda x.

5 Experimental Results

In the following experiments the problem of learning the semantics of multimedia content by combining image and text data is addressed. The synthesis is addressed by the kernel Canonical Correlation Analysis described in Section 4.3. We test the use of the derived semantic space in an image retrieval task that uses only image content. The aim is to allow retrieval of images from a text query but without reference to any labeling associated with the image. This can be viewed as a cross-modal retrieval task. We used the combined multimedia image-text web database, which was kindly provided by the authors of [15], where we are trying to facilitate mate retrieval on a test set. The data was divided into three classes (Figure 1) - Sport, Aviation and Paintball - of 400 records each, and consisted of jpeg images retrieved from the Internet with attached text. We randomly split each class into two halves which were used as training and test data accordingly. The extracted features of the data were the same as used in [15] (a detailed description of the features can be found in [15]): image HSV colour, image Gabor texture and term frequencies in text.

Figure 1: Example of images in the database (Aviation, Sports and Paintball classes).

We compute the value of κ for the regularisation by running the KCCA with the association between image and text randomised. Let λ(κ) be the spectrum without randomisation (the database with itself) and λ_R(κ) be the spectrum with randomisation (the database with a randomised version of itself), where by spectrum we mean the vector whose entries are the eigenvalues. We would like the non-random spectrum to be as distant as possible from the randomised spectrum, since if the same correlation occurs for λ(κ) and λ_R(κ) then clearly overfitting is taking place. Therefore we expect for κ = 0 (no regularisation), with j = (1, . . . , 1) the all-ones vector, that we may have λ(κ) = λ_R(κ) = j, since it is very possible that the examples are linearly independent. Though we find that only 50% of the examples are linearly independent, this does not affect the selection of κ through this method. We choose κ to be the value for which the difference between the spectrum of the randomised set and the true spectrum is maximal (in the two norm):

\kappa = \arg\max_{\kappa} \| \lambda_R(\kappa) - \lambda(\kappa) \|.

We find that κ = 7, and we set, via a heuristic technique, the Gram-Schmidt precision parameter η = 0.5.
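The selection heuristic can be expressed directly in code. The sketch below is our own (not the authors' script); it takes the KCCA spectrum computation as a callable argument, which is an assumption about how the solver is packaged, and it breaks the image-text association by randomly permuting one of the Gram matrices.

import numpy as np

def select_kappa(Kx, Ky, kappa_grid, spectrum, seed=0):
    """Pick kappa = argmax || lambda_R(kappa) - lambda(kappa) ||_2.

    `spectrum(Kx, Ky, kappa)` must return the vector of KCCA correlations
    (e.g. the lam output of a KCCA solver); lambda_R is obtained from a
    randomly permuted pairing of the two views."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(Kx.shape[0])
    Ky_rand = Ky[np.ix_(perm, perm)]          # same view, randomised pairing
    gaps = []
    for kappa in kappa_grid:
        lam = spectrum(Kx, Ky, kappa)
        lam_r = spectrum(Kx, Ky_rand, kappa)
        gaps.append(np.linalg.norm(lam - lam_r))
    return kappa_grid[int(np.argmax(gaps))]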

To perform the test image retrieval we compute the features of the images and text query using the Gram-Schmidt algorithm. Once we have obtained the features for the test query (text) and test images, we project them into the semantic feature space using β̃ and α̃ (which are computed through training) respectively. Now we can compare them using an inner product of the semantic feature vectors. The higher the value of the inner product, the more similar the two objects are. Hence, we retrieve the images whose inner products with the test query are highest.

We compared the performance of our methods with a retrieval technique based on the Generalised Vector Space Model (GVSM). This uses as a semantic feature vector the vector of inner products between either a text query and each training label, or a test image and each training image. For both methods we have used a Gaussian kernel, with σ = max. distance/20, for the image colour component, and all experiments were an average of 10 runs. For convenience we separate the content-based and mate-based approaches into the following Subsections 5.1 and 5.2 respectively.

5.1 Content-Based Retrieval

In this experiment we used the first 30 and 5 α̃ and β̃ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. Success is considered if the images contained in the set are of the same label as the query text (Figure 3 shows a retrieval example for a set of 5 images).

Image set    GVSM success    KCCA success (30)    KCCA success (5)
10           78.93%          85%                  90.97%
30           76.82%          83.02%               90.69%

Table 1: Success cross-results between kernel-CCA and the generalised vector space model.

Figure 2: Success plot for content-based KCCA against GVSM (success (%) against image set size).

In Tables 1 and 2 we compare the performance of the kernel-CCA algorithm and the generalised vector space model. In Table 1 we present the performance of the methods over 10 and 30 image sets, while in Table 2, as plotted in Figure 2, we see the overall performance of the KCCA method against the GVSM for image sets of size 1 − 200; at the 200th image set location the maximum of 200 × 600 same-labelled images over all text queries can be retrieved (we only have 200 images per label). The success rate in Table 1 and Figure 2 is computed as follows

\text{success \% for image set } i = \frac{\sum_{j=1}^{600} \sum_{k=1}^{i} count_{jk}}{600 \cdot i} \times 100,

where count_{jk} = 1 if image k in the set is of the same label as the text query present in the set, else count_{jk} = 0. The success rate in Table 2 is computed as above and averaged over all image sets.
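For completeness, the success measure can be computed as below. This is our own sketch; retrieved_labels[j] is assumed to hold the labels of the images retrieved for query j, and query_labels[j] the label of query j.

import numpy as np

def content_success(retrieved_labels, query_labels, set_size):
    """Average percentage of retrieved images sharing the query's label,
    over all test queries, for one image set size."""
    counts = [np.sum(np.asarray(retrieved_labels[j][:set_size]) == query_labels[j])
              for j in range(len(query_labels))]
    return 100.0 * np.sum(counts) / (len(query_labels) * set_size)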

As visible in Figure 4, we observe that when we add eigenvectors to the semantic projection we reduce the success of the content-based retrieval. We speculate that this may be the result of unnecessary detail in the semantic projection, as the semantic information needed is contained in the first few eigenvectors. Hence a minimal selection of 5 eigenvectors is sufficient to obtain a high success rate.

Figure 3: Images retrieved for the text query: "height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none".

Method       Overall success
GVSM         72.3%
KCCA (30)    79.12%
KCCA (5)     88.25%

Table 2: Success rate over all image sets (1 − 200).

5.2 Mate-Based Retrieval

In this experiment we used the first 150 and 30 α̃ and β̃ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. A successful match is considered if the image that actually matched the chosen text is contained in this set. We compute the success as the average of 10 runs (Figure 5 shows a retrieval example for a set of 5 images).

Image set    GVSM success    KCCA success (30)    KCCA success (150)
10           8%              17.19%               59.5%
30           19%             32.32%               69%

Table 3: Success cross-results between kernel-CCA and the generalised vector space model.

Figure 4: Content-based plot of eigenvector selection against overall success (%).

In Table 3 we compare the performance of the KCCA algorithm with the GVSM over 10 and 30 image sets, whereas in Table 4 we present the overall success over all image sets. In Figure 6 we see the overall performance of the KCCA method against the GVSM for all possible image sets. The success rate in Table 3 and Figure 6 is computed as follows

\text{success \% for image set } i = \frac{\sum_{j=1}^{600} count_j}{600} \times 100,

where count_j = 1 if the exact matching image to the text query was present in the set, else count_j = 0. The success rate in Table 4 is computed as above and averaged over all image sets.

Method       Overall success
GVSM         70.6511%
KCCA (30)    83.4671%
KCCA (150)   92.9781%

Table 4: Success rate over all image sets.

As visible in Figure 7, we find that unlike the content-based retrieval (Section 5.1), increasing the number of eigenvectors used assists in locating the matching image to the query text. We speculate that this may be the result of added detail towards exact correlation in the semantic projection. We do not compute for all eigenvectors, though, as this process would be expensive and the remaining eigenvectors would not necessarily add meaningful semantic information.

Figure 5: Images retrieved for the text query: "at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america west america west 757-2s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997." The actual match is the middle picture in the first row.

It is visible that the kernel-CCA significantly outperforms the GVSM method, both in content retrieval and in mate retrieval.

5.3 Regularisation Parameter

We next verify that the method of selecting the regularisation parameter κ a priori gives a value that performs well. We randomly split each class into two halves, which were used as training and test data accordingly; we keep this divided set for all runs. We set the value of the incomplete Gram-Schmidt orthogonalisation precision parameter η = 0.5 and run over possible values of κ, where for each value we test its content-based and mate-based retrieval performance.

Let κ̂ be the previous optimal choice of the regularisation parameter, κ̂ = κ = 7. As we define the new optimal value of κ by its performance on the testing set, we can say that this method is biased (loosely speaking, it is cheating). We will show, though, that despite this, the difference between the performance of the biased κ and our a priori κ̂ is slight.

In Table 5 we compare the overall performance of the Content-Based (CB) retrieval with respect to the different values of κ, and in Figures 8 and 9 we plot the comparison. We observe that the difference in performance between the a priori value κ̂ and the newly found optimal value κ is 1.0423% for 5 eigenvectors and 5.031% for 30 eigenvectors. The more substantial increase in performance on the latter is due to the increase in the selection of the regularisation parameter, which compensates for the substantial decrease in performance (Figure 4) of the content-based retrieval when a high dimensional semantic feature space is used.

Figure 6: Success plot for KCCA mate-based against GVSM (success (%) against image set size).

κ      CB-KCCA (30)    CB-KCCA (5)
0      46.278%         43.8374%
κ̂      83.5238%        91.7513%
90     88.4592%        92.7936%
230    88.5548%        92.5281%

Table 5: Overall success of Content-Based (CB) KCCA with respect to κ.

κ      MB-KCCA (30)    MB-KCCA (150)
0      73.4756%        83.46%
κ̂      84.75%          92.4%
170    85.5086%        92.9975%
240    85.5086%        93.0083%
430    85.4914%        93.027%

Table 6: Overall success of Mate-Based (MB) KCCA with respect to κ.

In Table 6 we compare the overall performance of the Mate-Based (MB) retrieval with respect to the different values of κ, and in Figures 10 and 11 we view a plot of the comparison. We observe that in this case the difference in performance between the a priori value κ̂ and the newly found optimal value κ is 0.627% for 150 eigenvectors and 0.7586% for 30 eigenvectors.

Figure 7: Mate-based plot of eigenvector selection against overall success (%).
Figure 8: Content-based: κ selection over overall success for 30 eigenvectors.
Figure 9: Content-based: κ selection over overall success for 5 eigenvectors.
Figure 10: Mate-based: κ selection over overall success for 30 eigenvectors.
Figure 11: Mate-based: κ selection over overall success for 150 eigenvectors.

Our observed results support our proposed method for selecting the regularisation parameter κ in an a priori fashion, since the difference between the actual optimal κ and the a priori κ̂ is very slight.

6 Generalisation of Canonical Correlation Analysis

In this section we follow A. Gifi's book "Nonlinear Multivariate Analysis" (1990) [9] and partially J. R. Ketterling's "Canonical analysis of several sets of variables" (1971) [14].

6.1 Some notations

For an n × n square matrix A having elements {a_{ij}}, i, j = 1, . . . , n, we can define the trace by the formula

\mathrm{Tr}(A) = \sum_i a_{ii}    (6.1)

and the norm \| \cdot \|_F, the so-called Frobenius norm, by

\|A\|_F = \sqrt{\mathrm{Tr}(A'A)} = \sqrt{\sum_{ij} a_{ij}^2}.    (6.2)

If a_i denotes the ith column (row) of A, then we have

\|A\|_F^2 = \sum_i \|a_i\|_2^2 = \sum_i \langle a_i, a_i \rangle,    (6.3)

where the notation \| \cdot \|_2 means the Euclidean (l_2) norm of a vector.
6.2 Some propositions

Proposition 3. Let an optimisation problem be given in the form

\min_{x, y} f(x, y)    (6.4)
subject to    (6.5)
g(y) = 0,    (6.6)
x \in \mathbb{R}^m, \; y \in \mathbb{R}^n.    (6.7)

Let the set Y \subseteq \mathbb{R}^n be the feasibility domain for y determined by the constraint g(y) = 0.

Assume the function f is convex in both variables x and y, the optimal solution of x can be expressed by a function h(y) of the optimal solution of y, where h is defined on the whole set Y, and the functions f, g, h are twice continuously differentiable on \mathbb{R}^m \times Y. Then the optimisation problem with the same constraints

\min_{y} f(h(y), y)    (6.8)
subject to    (6.9)
g(y) = 0,    (6.10)
y \in \mathbb{R}^n,    (6.11)

has the same optimal solution in y as equation (6.4) has.

Proof. Let the optimal solution of equation (6.4) be denoted by (x_1, y_1) and that of equation (6.8) by y_2.

Based on the condition of the proposition we have x_1 = h(y_1). Because y_1 is a feasible solution for equation (6.8), we have f(x_1, y_1) = f(h(y_1), y_1) \ge f(h(y_2), y_2); but the objective function of equation (6.4) is not restricted in the first variable, thus the inequality f(x_1, y_1) \le f(h(y_2), y_2) holds, hence f(x_1, y_1) = f(h(y_2), y_2).

From the convexity of f and the identical feasibility domains, the optimum solutions have to be the same.

6.3 Formulation of the Canonical Correlation

Let H^{(1)}, H^{(2)} be matrices of size m × n_1 and m × n_2 respectively, and assume the sums of the elements in the columns of these matrices are equal to 0 (they are centralised) and that the columns within each matrix are linearly independent vectors. We consider arbitrary linear combinations of the columns of these matrices in the form H^{(1)} a_i^{(1)}, H^{(2)} a_i^{(2)}, i = 1, . . . , p. Let A^{(1)} = (a_1^{(1)}, . . . , a_{n_1}^{(1)}) and A^{(2)} = (a_1^{(2)}, . . . , a_{n_2}^{(2)}) be the matrices with these vectors as columns. We introduce a notation for the products of the matrices to simplify the formulas:

\Sigma_{ij} = H^{(i)'} H^{(j)}, \quad i, j = 1, 2.    (6.12)

We are looking for linear combinations of the columns of these matrices such that the first pair of vectors (a_1^{(1)}, a_1^{(2)}) is an optimal solution of the optimisation problem

\max_{a_1^{(1)}, a_1^{(2)}} \; a_1^{(1)'} \Sigma_{12} a_1^{(2)}    (6.13)
subject to    (6.14)
a_1^{(1)'} \Sigma_{11} a_1^{(1)} = 1,    (6.15)
a_1^{(2)'} \Sigma_{22} a_1^{(2)} = 1.    (6.16)

The meaning of this optimisation problem is to find the maximum correlation between linear combinations of the columns of the matrices H^{(1)} and H^{(2)}, subject to the lengths of the vectors corresponding to these linear combinations being normalised to 1.

To determine the remaining pairs of vectors (columns of A^{(1)} and A^{(2)}), a series of optimisation problems is solved successively. For the pair of vectors (a_r^{(1)}, a_r^{(2)}), r = 2, . . . , p, we have

\max_{a_r^{(1)}, a_r^{(2)}} \; a_r^{(1)'} \Sigma_{12} a_r^{(2)}    (6.17)
subject to    (6.18)
a_r^{(k)'} \Sigma_{kk} a_r^{(k)} = 1,    (6.19)
a_r^{(k)'} \Sigma_{kk} a_j^{(k)} = 0,    (6.20)
a_r^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.21)
k, l = 1, 2, \; k \ne l, \; j = 1, . . . , r - 1.    (6.22)

Problem (6.17) is the problem (6.13) expanded by the orthogonality constraints: the components of every new pair in the iteration have to be orthogonal to the components of the previous pairs.

The upper limit p of the iteration has to be \le \min(\mathrm{rank}(H^{(1)}), \mathrm{rank}(H^{(2)})). Applying the Karush-Kuhn-Tucker conditions we can express the optimal solutions of the problem (6.13) and of the problems (6.17) for r = 2, . . . , p. Let us begin with the problem (6.13).

First we apply a substitution such that

a_i^{(k)} = \Sigma_{kk}^{-1/2} y_i^{(k)},    (6.23)
D_{kl} = \Sigma_{kk}^{-1/2} \Sigma_{kl} \Sigma_{ll}^{-1/2},    (6.24)
k, l = 1, 2, \; i = 1, . . . , p.    (6.25)
Thus we have the problem

\max_{y_1^{(1)}, y_1^{(2)}} \; y_1^{(1)'} D_{12} y_1^{(2)}    (6.26)
subject to    (6.27)
y_1^{(1)'} y_1^{(1)} = 1,    (6.28)
y_1^{(2)'} y_1^{(2)} = 1.    (6.29)

The corresponding Lagrangian is

L_1 = y_1^{(1)'} D_{12} y_1^{(2)} - \frac{\lambda_1}{2}\left( y_1^{(1)'} y_1^{(1)} - 1 \right) - \frac{\lambda_2}{2}\left( y_1^{(2)'} y_1^{(2)} - 1 \right),    (6.30)

where \lambda_1 and \lambda_2 are the Lagrangian multipliers. The vectors of the partial derivatives of L_1 with respect to the vectors y_1^{(1)}, y_1^{(2)} are equal to 0 by the KKT conditions, thus we get

\frac{\partial L_1}{\partial y_1^{(1)}} = D_{12} y_1^{(2)} - \lambda_1 y_1^{(1)} = 0,    (6.31)
\frac{\partial L_1}{\partial y_1^{(2)}} = D_{21} y_1^{(1)} - \lambda_2 y_1^{(2)} = 0.    (6.32)

Multiplying equation (6.31) by y_1^{(1)'} and equation (6.32) by y_1^{(2)'} provides

y_1^{(1)'} D_{12} y_1^{(2)} - \lambda_1 y_1^{(1)'} y_1^{(1)} = 0,    (6.33)
y_1^{(2)'} D_{21} y_1^{(1)} - \lambda_2 y_1^{(2)'} y_1^{(2)} = 0.    (6.34)

Based on the constraints of the optimisation problem (6.26) and the identity D_{21} = D_{12}', we obtain

\lambda_1 = \lambda_2 = y_1^{(1)'} D_{12} y_1^{(2)}.    (6.35)
After replacing \lambda_1 and \lambda_2 with \lambda, the following equality system can be formulated:

\begin{pmatrix} -\lambda I & D_{12} \\ D_{21} & -\lambda I \end{pmatrix} \begin{pmatrix} y_1^{(1)} \\ y_1^{(2)} \end{pmatrix} = 0.    (6.36)

It is not too hard to realise that this equality system is a singular vector and value problem of the matrix D_{12}, with y_1^{(1)} and y_1^{(2)} being a left and a right singular vector and the value of the Lagrangian \lambda equal to the corresponding singular value. Based on these statements we can claim that the optimal solutions are the singular vectors belonging to the greatest singular value of the matrix D_{12}.

Considering the successive optimisation problems and applying a similar substitution for all the variables a_i^{(k)} as introduced in equation (6.23), a problem involving the greatest r singular values and the corresponding left and right singular vectors arises.
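The singular-value characterisation gives an immediate recipe for computing the canonical vectors: form D_12 = Σ_11^{-1/2} Σ_12 Σ_22^{-1/2}, take its SVD, and map the singular vectors back through the substitution (6.23). The sketch below is our own illustration of this recipe (the helper inv_sqrt and the small eps floor are our additions), assuming centred data matrices H1 and H2 with the same number of rows.

import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

def cca_svd(H1, H2, p):
    """First p canonical vector pairs via the SVD of D12 (Section 6.3)."""
    S11, S22, S12 = H1.T @ H1, H2.T @ H2, H1.T @ H2
    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, sing, Vt = np.linalg.svd(W1 @ S12 @ W2)
    A1 = W1 @ U[:, :p]          # a_i^(1) = Sigma_11^{-1/2} y_i^(1)
    A2 = W2 @ Vt.T[:, :p]       # a_i^(2) = Sigma_22^{-1/2} y_i^(2)
    return sing[:p], A1, A2     # canonical correlations and vector pairs
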
6.4 The simultaneous formulation of the canonical correlation

Instead of using the successive formulation of the canonical correlation we can join the subproblems into one. The simultaneous formulation is the optimisation problem

\max_{(a_1^{(1)}, a_1^{(2)}), \ldots, (a_p^{(1)}, a_p^{(2)})} \; \sum_{i=1}^{p} a_i^{(1)'} \Sigma_{12} a_i^{(2)}    (6.37)
subject to    (6.38)
a_i^{(1)'} \Sigma_{11} a_j^{(1)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}    (6.39)
a_i^{(2)'} \Sigma_{22} a_j^{(2)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}    (6.40)
i, j = 1, . . . , p,    (6.41)
a_i^{(1)'} \Sigma_{12} a_j^{(2)} = 0,    (6.42)
i, j = 1, . . . , p, \; j \ne i.    (6.43)
Based on equation (6.37) and the definition of the Frobenius norm we have a compact formulation of the canonical correlation problem:

\max_{A^{(1)}, A^{(2)}} \; \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right)    (6.44)
subject to    (6.45)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.46)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.47)
k, l \in \{1, 2\}, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i,    (6.48)

where I is the identity matrix of size p × p.

Repeating the substitution of equation (6.23), the set of feasible vectors for the simultaneous problem is equal to the left and right singular vectors of the matrix D_{12}, hence the optimal solution is compatible with that of the successive problems.

6.5 Correlation versus Distance

The canonical correlation problem can be transformed into a distance problem where the distance between two matrices is measured by the Frobenius norm:

\min_{A^{(1)}, A^{(2)}} \; \left\| H^{(1)} A^{(1)} - H^{(2)} A^{(2)} \right\|_F    (6.49)
subject to    (6.50)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.51)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.52)
k, l = 1, 2, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i.    (6.53)
Unfolding the objective function of the minimisation problem (6.49) shows the optimisation problem is the same as the maximisation problem (6.44): under the constraints,

\left\| H^{(1)} A^{(1)} - H^{(2)} A^{(2)} \right\|_F^2 = \mathrm{Tr}\left( A^{(1)'} \Sigma_{11} A^{(1)} \right) + \mathrm{Tr}\left( A^{(2)'} \Sigma_{22} A^{(2)} \right) - 2 \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right) = 2p - 2 \mathrm{Tr}\left( A^{(1)'} \Sigma_{12} A^{(2)} \right),

so minimising the distance is equivalent to maximising the trace term.

6.6 The generalisation of canonical correlation

Exploiting the distance problem we can give a generalisation of canonical correlation for more than two known matrices. Given a set of matrices \{H^{(1)}, . . . , H^{(K)}\} with dimensions m × n_1, . . . , m × n_K, we are looking for linear combinations of the columns of these matrices, in the matrix form A^{(1)}, . . . , A^{(K)}, such that they give the optimum solution of the problem

\min_{A^{(1)}, \ldots, A^{(K)}} \; \sum_{k, l = 1}^{K} \left\| H^{(k)} A^{(k)} - H^{(l)} A^{(l)} \right\|_F    (6.54)
subject to    (6.55)
A^{(k)'} \Sigma_{kk} A^{(k)} = I,    (6.56)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = 0,    (6.57)
k, l = 1, . . . , K, \; l \ne k, \; i, j = 1, . . . , p, \; j \ne i.    (6.58)

In the forthcoming sections we will show how to simplify this problem.

6.7 Total Distance versus Variance

Given a set of vectors X = \{x_1, . . . , x_m\} \subseteq \mathbb{R}^n, the notation x_{ki} means the ith component of the vector x_k.

The total squared distance, the sum of the squared Euclidean distances of all possible pairs of vectors in X, is equal to

\frac{1}{2} \sum_{\substack{k, l = 1 \\ l \ne k}}^{m} \| x_k - x_l \|_2^2;    (6.59)

as \| x_k - x_k \| = 0 for any k, we can drop the constraint l \ne k, thus we have

= \frac{1}{2} \sum_{k, l = 1}^{m} \| x_k - x_l \|_2^2    (6.60)
= \frac{1}{2} \sum_{k, l = 1}^{m} \sum_{i=1}^{n} (x_{ki} - x_{li})^2    (6.61)
= \frac{1}{2} \sum_{i=1}^{n} \sum_{k, l = 1}^{m} (x_{ki} - x_{li})^2    (6.62)
= \frac{1}{2} \sum_{i=1}^{n} \left( \sum_{k, l = 1}^{m} x_{ki}^2 + \sum_{k, l = 1}^{m} x_{li}^2 - \sum_{k, l = 1}^{m} 2 x_{ki} x_{li} \right)    (6.63)
= \frac{1}{2} \sum_{i=1}^{n} \left( m \sum_{k=1}^{m} x_{ki}^2 + m \sum_{l=1}^{m} x_{li}^2 - 2 \sum_{k=1}^{m} x_{ki} \sum_{l=1}^{m} x_{li} \right).    (6.64)
To simplify the formula we introduce

M_1(i) = \frac{1}{m} \sum_{k=1}^{m} x_{ki}, \qquad M_2(i) = \frac{1}{m} \sum_{k=1}^{m} x_{ki}^2,    (6.65)

so that we can reformulate equation (6.64) as

= \sum_{i=1}^{n} \left( m^2 M_2(i) - m^2 (M_1(i))^2 \right).    (6.66)

Applying the well-known identity of the variance for the vectors (x_{11}, . . . , x_{m1}), . . . , (x_{1n}, . . . , x_{mn}) gives

= m^2 \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} (x_{ki} - M_1(i))^2.    (6.67)

Hence the total squared distance turns out to be equal to the sum of the component-wise variances of the vectors in X multiplied by the square of the number of vectors.

Another statement about the variance is introduced. If we have the optimisation problem

\min_{z} \sum_{k=1}^{m} \| z - x_k \|_2^2, \quad x_k \in X, \; z \in \mathbb{R}^n,    (6.68)

then the optimal solution can be expressed by

z_i = \frac{1}{m} \sum_{k=1}^{m} x_{ki}.    (6.69)

The components of the optimal solution are equal to the mean values of the corresponding components of the known vectors.

6.8 General form

Let H^{(1)}, . . . , H^{(K)} be a set of known matrices of size m × n_1, . . . , m × n_K and let X be an unknown matrix of size m × p. The columns of the matrices H^{(1)}, . . . , H^{(K)} are centralised, i.e. the mean of every column in every matrix is equal to 0. We assume the columns of every matrix H^{(k)}, k = 1, . . . , K, are linearly independent. To simplify the formulas the notation \Sigma_{kl} = H^{(k)'} H^{(l)} is introduced. We are looking for linear combinations of the columns of the known matrices and a corresponding X such that they are the optimal solution of the optimisation problem given by

\min_{X, A^{(1)}, \ldots, A^{(K)}} \; \frac{1}{K} \sum_{k=1}^{K} \left\| X - H^{(k)} A^{(k)} \right\|_F^2    (6.70)
subject to    (6.71)
a_i^{(k)'} \Sigma_{kl} a_j^{(l)} = \begin{cases} 1 & \text{if } k = l \text{ and } i = j, \\ 0 & \text{if } (k = l \text{ and } i \ne j) \text{ or } (k \ne l \text{ and } i \ne j), \end{cases}    (6.72)
k, l = 1, . . . , K, \; i, j = 1, . . . , p, \text{ except when } k \ne l \text{ and } i = j,    (6.73)

where a_i^{(k)} denotes the ith column of the matrix A^{(k)} containing the possible linear combinations.

Applying the substitutions, for all k = 1, . . . , K, i = 1, . . . , p,

a_i^{(k)} = \Sigma_{kk}^{-1/2} y_i^{(k)},    (6.74)

where we can compute the inverse because the columns of the matrix H^{(k)} are independent, meaning \Sigma_{kk} has full rank, we can transform this optimisation problem into a simpler form. First, we modify the set of constraints. To make this modification readable the notation

\Sigma_{kk}^{-1/2} \Sigma_{kl} \Sigma_{ll}^{-1/2} = D_{kl}, \quad k, l = 1, . . . , K,    (6.75)

is introduced, where we exploit the symmetricity of the matrices \Sigma_{kk}^{-1/2}.

Thus the constraints take the form

y_i^{(k)'} y_j^{(k)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j, \end{cases}    (6.76)
k = 1, . . . , K, \; i, j = 1, . . . , p,    (6.77)
y_i^{(k)'} D_{kl} y_j^{(l)} = 0,    (6.78)
k, l = 1, . . . , K, \; k \ne l, \; i, j = 1, . . . , p, \; i \ne j,    (6.79)

for which we can recognise the singular decomposition problems of the matrices \{D_{kl}\}. If we consider the matrix D_{kl} for a fixed pair of indices k, l and apply the singular decomposition we have

D_{kl} = Y^{(k)} \Lambda_{kl} Y^{(l)'},    (6.80)

where the matrices Y^{(k)} and Y^{(l)} have columns equal to the vectors y_i^{(k)} and y_i^{(l)} respectively, i = 1, . . . , p, and \Lambda_{kl} is a diagonal matrix. The products

Y^{(k)'} D_{kl} Y^{(l)} = \Lambda_{kl}    (6.81)

correspond to the items having indices with the properties k \ne l and i = j; they give the singular values of the matrix D_{kl}. The consequence of the singular decomposition form is that the set of feasible solutions of the optimisation problem with constraints (6.76) is equal to the set of singular vectors of the matrices \{D_{kl}, \; k, l = 1, . . . , K\}.

To express the objective function of the optimisation problem (6.70) we use the notations

Q_k = H^{(k)} \Sigma_{kk}^{-1/2},    (6.82)
D_{kl} = Q_k' Q_l.    (6.83)
We can derive another statement about the optimal solution of the problem. Exploiting the definition of the Frobenius norm, the objective function (6.70) can be rewritten as a sum over the column vectors, where x_i denotes the ith column of the matrix X:

\frac{1}{K} \sum_{k=1}^{K} \left\| X - H^{(k)} A^{(k)} \right\|_F^2    (6.84)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\| x_i - H^{(k)} a_i^{(k)} \right\|_2^2    (6.85)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\| x_i - Q_k y_i^{(k)} \right\|_2^2    (6.86)
= \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} \left\langle x_i - Q_k y_i^{(k)}, \; x_i - Q_k y_i^{(k)} \right\rangle.    (6.87)

The constraints are formulated in equation (6.76).

For the Lagrangian function of the optimisation problem we have

L = \sum_{k=1}^{K} \sum_{i=1}^{p} \left\langle x_i - Q_k y_i^{(k)}, \; x_i - Q_k y_i^{(k)} \right\rangle    (6.88)
\;+\; \sum_{k=1}^{K} \sum_{i=1}^{p} \lambda_{k,ii} \left( 1 - y_i^{(k)'} y_i^{(k)} \right)    (6.89)
\;+\; \sum_{k=1}^{K} \sum_{\substack{i, j = 1 \\ i \ne j}}^{p} \lambda_{k,ij} \left( - y_i^{(k)'} y_j^{(k)} \right)    (6.90)
\;+\; \sum_{\substack{k, l = 1 \\ k \ne l}}^{K} \sum_{\substack{i, j = 1 \\ i \ne j}}^{p} \lambda_{kl,ij} \left( - y_i^{(k)'} D_{kl} y_j^{(l)} \right),    (6.91)

where we have dropped the constant multiplier 1/K from the objective function (6.70).

After computing the partial derivatives, where x_i denotes the ith column of the matrix X, we get

\frac{\partial L}{\partial x_i} = \sum_{k=1}^{K} \left( 2 x_i - 2 Q_k y_i^{(k)} \right) = 0, \quad i = 1, . . . , p,    (6.92)

\frac{\partial L}{\partial y_i^{(k)}} = 2 D_{kk} y_i^{(k)} - 2 Q_k' x_i - 2 \sum_{j=1}^{p} \lambda_{k,ij} y_j^{(k)} - \sum_{\substack{l = 1 \\ l \ne k}}^{K} \sum_{\substack{j = 1 \\ j \ne i}}^{p} \lambda_{kl,ij} D_{kl} y_j^{(l)} = 0,    (6.93)
k = 1, . . . , K, \; i = 1, . . . , p.    (6.94)
We can express x_i for any i = 1, . . . , p from (6.92):

x_i = \frac{1}{K} \sum_{l=1}^{K} Q_l y_i^{(l)}, \quad i = 1, . . . , p.    (6.95)

Based on Proposition 3 we can replace the variable X in equation (6.70) by this expression of the other variables without changing the optimum value or the optimal solution. Thus we arrive at the variance problem.

7 Conclusions

Through this study we have presented a tutorial on canonical correlation analysis and have established a novel general approach to retrieving images based solely on their content. This is then applied to content-based and mate-based retrieval. Experiments show that image retrieval can be more accurate than with the Generalised Vector Space Model. We demonstrate that one can choose the regularisation parameter κ a priori such that it performs well in very different regimes. Hence we have come to the conclusion that kernel Canonical Correlation Analysis is a powerful tool for image retrieval via content. In the future we will extend our experiments to other data collections.

In the procedure of generalising canonical correlation analysis we can see that the original problem can be transformed and reinterpreted as a total distance problem or a variance minimisation problem. This special duality between the correlation and the distance requires more investigation to give a more suitable description of the structure of some special spaces generated by different kernels.

These approaches can give tools to handle some problems in the kernel space where the inner products and the distances between the points are known but the coordinates are not. For some problems it is sufficient to know only the coordinates of a few special points, which can be expressed from the known inner products, e.g. to do cluster analysis in the kernel space and to compute the coordinates of the cluster centres only.
Acknowledgments

We would like to acknowledge the financial support of EU Projects KerMIT, No. IST-2000-25341 and LAVA, No. IST-2001-34405.

1 Proof that \|K - G_i G_i'\| \le \eta

1.1 Some notation

Lemma 4. Let A and B be square matrices, with \mathrm{Trace}(A) = \sum_i a_{ii}. Then \mathrm{Trace}(AB) = \mathrm{Trace}(BA).

Proof.

\mathrm{Trace}(AB) = \sum_i (AB)_{ii} = \sum_{i,j} a_{ij} b_{ji} = \sum_{j,i} b_{ji} a_{ij} = \sum_j (BA)_{jj} = \mathrm{Trace}(BA).

Lemma 5. Let A be a symmetric matrix with eigenvalue decomposition A = V \Lambda V' (so that we are able to write \Lambda = V' A V). Then, using Lemma 4, \mathrm{Trace}(\Lambda) = \mathrm{Trace}(A).

Proof.

\mathrm{Trace}(\Lambda) = \mathrm{Trace}(V' A V) = \mathrm{Trace}((V' A) V) = \mathrm{Trace}(V (V' A)) = \mathrm{Trace}(V V' A) = \mathrm{Trace}(A).

Hence we have shown that \sum_i a_{ii} = \sum_i \lambda_i.

Lemma 6. If A is a symmetric matrix, the Euclidean norm of A is equal to its maximal eigenvalue.

Proof.

For any c \in \mathbb{R}, scaling x by c does not change the ratio

\frac{\|A(cx)\|}{\|cx\|} = \frac{c \|Ax\|}{c \|x\|} = \frac{\|Ax\|}{\|x\|},

hence we may take

\|A\| = \max_{\|x\| = 1} \|Ax\|, \qquad \|Ax\| = \sqrt{x' A' A x}.

Let U D U' be the eigenvalue decomposition of A'A, so that D is a diagonal matrix containing the squares of the eigenvalues of A:

A'A = U D U'
\|Ax\|^2 = x' U D U' x.

Setting w = U'x, and as U is orthogonal we can rewrite \|x\| = 1 as \|w\| = 1, we get

\|Ax\|^2 = w' D w = \sum_i \lambda_i^2 w_i^2.

We can see that the following holds:

\max_{\|w\| = 1} \sum_i \lambda_i^2 w_i^2 = \max_i \lambda_i^2.

Hence we obtain

\|A\| = \max_i \lambda_i.

1.2 Proof

Theorem 7. If K is a positive definite matrix and G_i G_i' is its incomplete Cholesky decomposition, then the Euclidean norm of K - G_i G_i' is less than or equal to the trace of the uncalculated part of K. Let \Delta K_i be the uncalculated part of K and let \eta = \mathrm{Trace}(\Delta K_i); then

\| K - G_i G_i' \| \le \eta.

Proof. Let GG' be the complete Cholesky decomposition K = GG', where G is a lower triangular matrix whose upper triangular part is zero,

G = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix}.

Let G_i G_i' be the incomplete decomposition of K, where i is the number of iterations of the Cholesky factorisation procedure performed,

G_i = \begin{pmatrix} A \\ B \end{pmatrix},

such that G_i G_i' = \tilde{K}_i, where \tilde{K}_i is the approximation of K, subject to a symmetric permutation of rows and columns. We assume the rows and columns of K are ordered and no permutation is necessary (this is only for convenience of the proof). Let \Delta K_i = K - \tilde{K}_i.

K = GG' = \begin{pmatrix} AA' & AB' \\ BA' & BB' + CC' \end{pmatrix}

\tilde{K}_i = G_i G_i' = \begin{pmatrix} AA' & AB' \\ BA' & BB' \end{pmatrix}

\Delta K_i = \begin{pmatrix} 0 & 0 \\ 0 & CC' \end{pmatrix}.

We show that CC' is positive semi-definite:

CC' = K_{i+1:n, i+1:n} - BB'
    = K_{i+1:n, i+1:n} - (BA')(AA')^{-1}(AB')
    = K_{i+1:n, i+1:n} - K_{i+1:n, 1:i} \, K_{1:i, 1:i}^{-1} \, K_{1:i, i+1:n},

and therefore

x' CC' x = \langle C'x, C'x \rangle \ge 0,

so every eigenvalue \lambda_C of CC' satisfies \lambda_C \ge 0. Thus CC' is a positive semi-definite matrix, hence \Delta K_i is a positive semi-definite matrix. Using Lemma 6 we are now able to show that

\| K - \tilde{K}_i \| = \| \Delta K_i \| = \max_{\|w\| = 1} \sum_i \lambda_i w_i^2 = \max_i \lambda_i.

As the maximum eigenvalue is less than or equal to the sum of all the eigenvalues (which are non-negative), using Lemma 5 we are able to rewrite the expression as

\| K - G_i G_i' \| \le \sum_i \lambda_i = \mathrm{Trace}(\Lambda) = \mathrm{Trace}(\Delta K_i).

Therefore,

\| K - G_i G_i' \| \le \eta.
Bibliography

[1] Shotaro Akaho. A kernel method for canonical correlation analysis. In International Meeting of the Psychometric Society, Osaka, 2001.

[2] Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.

[3] Magnus Borga. Learning Multidimensional Signal Processing. PhD thesis, Linköping Studies in Science and Technology, 1998.

[4] Magnus Borga. Canonical correlation: a tutorial, 1999.

[5] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.

[6] Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. Latent semantic kernels. In Carla Brodley and Andrea Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 66-73. Morgan Kaufmann Publishers, San Francisco, US, 2001.

[7] Colin Fyfe and Pei Ling Lai. ICA using kernel canonical correlation analysis.

[8] Colin Fyfe and Pei Ling Lai. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 2001.

[9] A. Gifi. Nonlinear Multivariate Analysis. Wiley, 1990.

[10] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1983.

[11] David R. Hardoon and John Shawe-Taylor. KCCA for different level precision in content-based image retrieval. Submitted to Third International Workshop on Content-Based Multimedia Indexing, IRISA, Rennes, France, 2003.

[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28:312-377, 1936.

[13] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, Inc, 1966.

[14] J. R. Ketterling. Canonical analysis of several sets of variables. Biometrika, 58:433-451, 1971.

[15] T. Kolenda, L. K. Hansen, J. Larsen, and O. Winther. Independent component analysis for understanding multimedia content. In H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, pages 757-766, Piscataway, New Jersey, 2002. IEEE Press. Martigny, Valais, Switzerland, Sept. 4-6, 2002.

[16] Malte Kuss and Thore Graepel. The geometry of kernel canonical correlation analysis. 2002.

[17] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communications and Image Representation, 10:39-62, 1999.

[18] Alexei Vinokourov, David R. Hardoon, and John Shawe-Taylor. Learning the semantics of multimedia content with application to web image retrieval and classification. In Proceedings of the Fourth International Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan, 2003.

[19] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems 15 (to appear), 2002.
