
Hyperkernel Construction for Support Vector Machines

Lei Jia, Shizhong Liao


School of Computer Science and Technology
Tianjin University, Tianjin 300072, P. R. China
{ljia,szliao}@tju.edu.cn
Abstract

The construction of kernel functions is crucial for the research and application of Support Vector Machines (SVM). In this paper, we propose a combinatorial construction of hyperkernel functions for SVM. We first analyze the under- and over-learning phenomena of common kernel functions. Then, we construct a hyperkernel function as a polynomial combination of common kernels and prove that the hyperkernel satisfies the Mercer condition. Finally, we run experiments on both simulated and benchmark data to demonstrate the performance of the hyperkernel for SVM. The theoretical proofs and experimental results demonstrate the validity and feasibility of the hyperkernel.
1. Introduction
Support Vector Machines (SVM) [10] provide a powerful and unified model for machine learning, pattern recognition and data mining. One of the challenges in SVM research and application is kernel model selection. Kernels are problem-specific functions that act as an interface between the learning machine and the data. A poor kernel selection can lead to significantly substandard performance.
However, selecting appropriate positive definite kernels is not a trivial task. In practice, the researcher has to select the kernel before learning starts, with common choices being translation or rotation invariant kernels. The associated kernel parameters can then be determined by optimizing a quality function of the kernel. Examples of such functions include the cross validation error, kernel target alignment [4], the maximum of the posterior probability [9], and learning-theoretic bounds [2].
Instead of tuning only the kernel parameters, many developments address this problem at the kernel matrix level. As all the information in the feature space is encoded in the kernel matrix, one can learn the kernel matrix directly using techniques such as semidefinite programming [5] or nonlinear dimensionality reduction [11]. Earlier work based on information geometry was proposed in [1].
Recently, there have been many developments on learning the kernel function itself. The main idea is to learn the kernel by performing the kernel trick on the space of kernels, i.e., the notion of a hyperkernel [8, 3, 7]. However, there exist no general principles to guide the selection of kernels and the tuning of kernel parameters.
In this paper, we explore a combinatorial kernel construction approach and derive a flexible hyperkernel function for SVM. Specifically, we first analyze the under- and over-learning phenomena of single kernels in Section 2. Then, in Section 3, we present the definition of our hyperkernel and prove that the hyperkernel is well defined and satisfies the Mercer condition. Finally, in Section 4, we experiment with several examples of hyperkernels and show their performance in SVM on both simulated and benchmark data.
2. Limitations of Single Kernel
Two typical kernel functions often used in SVM are the translation invariant kernel (also known as the radial basis kernel) and the rotation invariant kernel (also known as the dot-product kernel). Translation invariant kernels can be formulated as
K(u, v) = r(u − v),
where r(·) is an arbitrary function. Examples of translation invariant kernels include the Gaussian kernel, the thin plate spline, the multiquadratic kernel, etc.
Rotation invariant kernels can be formulated as
K(u, v) = d(u · v),
where d(·) is an arbitrary function. Examples of rotation invariant kernels include the polynomial kernel, the two-layer neural network kernel, etc. Table 1 lists some commonly used kernels in kernel methods.
Table 1. Commonly used kernels for machine learning and pattern recognition.

Kernel function                   Expression                                Type
Gaussian kernel                   K(u, v) = exp(−γ‖u − v‖²)                 Translation invariant
Multiquadratic kernel             K(u, v) = (‖u − v‖² + c²)^(1/2)           Translation invariant
Spline (thin plate) kernel        K(u, v) = ‖u − v‖² ln ‖u − v‖             Translation invariant
Polynomial kernel                 K(u, v) = (u · v + c)^d                   Rotation invariant
Two-layer neural network kernel   K(u, v) = tanh(κ u · v + θ)               Rotation invariant
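For concreteness, the following minimal Python sketch implements the kernels of Table 1 as plain NumPy functions. The parameter names (gamma, c, d, kappa, theta) are our own placeholders and not values prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    # Translation invariant: depends only on u - v.
    return np.exp(-gamma * np.sum((u - v) ** 2))

def multiquadratic_kernel(u, v, c=1.0):
    return np.sqrt(np.sum((u - v) ** 2) + c ** 2)

def thin_plate_spline_kernel(u, v, eps=1e-12):
    r2 = np.sum((u - v) ** 2)
    # Guard the logarithm at r = 0, where the kernel value tends to 0.
    return 0.0 if r2 < eps else r2 * np.log(np.sqrt(r2))

def polynomial_kernel(u, v, c=1.0, d=3):
    # Rotation invariant: depends only on the dot product u . v.
    return (np.dot(u, v) + c) ** d

def tanh_kernel(u, v, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(u, v) + theta)
```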
The motivation behind our approach is to build kernels that can overcome two difficulties that single kernels suffer from: under-learning and over-learning.
First, consider Figures 1 and 2, which show the separating hyperplane and the margin of a support vector machine using a polynomial kernel and a Gaussian kernel, respectively (the two classes of samples are plotted with different markers, and the support vectors are circled). Both figures reveal the first common difficulty in applying kernel methods with a single kernel: the kernel is not powerful enough in mapping the samples, and the data set is under-learned. The following theorem shows that this phenomenon does not happen only occasionally.
Theorem 2.1. A Gaussian kernel SVM becomes under-learned as γ → 0.

Proof. When γ → 0, lim_{γ→0} K(x_i, x_j) = 1, so the decision function becomes
f(x) = Σ_{i=1}^{l} α_i y_i K(x_i, x) + b = Σ_{i=1}^{l} α_i y_i + b = b,
where the last equality follows from the dual constraint Σ_{i=1}^{l} α_i y_i = 0. The Gaussian SVM therefore assigns all test samples to the same class and becomes under-learned.
Second, Figure 3 shows the training data and the classification function of a support vector machine using a single Gaussian kernel. Instead of under-learning the data set, the hyperplane perfectly classifies every training point. As all samples become support vectors, the learning machine becomes over-learned.
Theorem 2.2. A Gaussian kernel SVM becomes over-learned as γ grows (γ → ∞).

Proof. When γ → ∞,
lim_{γ→∞} K(x_i, x_j) = 1 if x_i = x_j, and 0 if x_i ≠ x_j.
Suppose the training set has l samples, of which l₊ have label y_i = +1 and l₋ have label y_i = −1. Assume the Lagrange multipliers α_i take the form
α_i = α₊ if y_i = +1, and α_i = α₋ if y_i = −1, with 0 < α₊, α₋ < C.
Because α_i > 0 for all i = 1, ..., l, all the sample points become support vectors. Below we find a solution for the α_i. From the dual problem of SVM we have Σ_{i=1}^{l} α_i y_i = 0, hence
α₊ l₊ − α₋ l₋ = 0.  (1)
From the KKT conditions we get
α_i [y_i (wᵀ φ(x_i) + b) − 1] = 0 and wᵀ φ(x_i) + b = y_i,
which can be transformed into
Σ_{j=1}^{l} α_j y_j K(x_j, x_i) + b = y_i, i = 1, 2, ..., l.
Letting γ → ∞, so that K(x_j, x_i) reduces to the indicator of x_j = x_i, we get the simultaneous equations
α₊ + b = 1, −α₋ + b = −1.  (2)
Combining these with Equation (1), we finally obtain the values of the α_i and b:
α₊ = 2l₋/l, α₋ = 2l₊/l, b = (l₊ − l₋)/l.
For C > max{2l₋/l, 2l₊/l} this solution is feasible, so the claim holds: the Gaussian kernel SVM becomes over-learned.
Although Theorems 2.1 and 2.2 investigate single-kernel SVM in the limiting cases, the under- or over-learning phenomena already occur once γ reaches a certain value. Figures 2 and 3 show an under-learned and an over-learned Gaussian SVM, respectively, where the training samples are under-learned with γ = 2⁻⁹ and over-learned with γ = 2⁶. A much better classifier for the same data set is shown in Figure 4.
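The limiting behaviour described by Theorems 2.1 and 2.2 is easy to reproduce numerically. The sketch below (our own illustration, not code from the paper) trains scikit-learn's SVC with an RBF kernel at an extremely small and an extremely large gamma on random two-class data, then reports how many training points become support vectors and whether the predictions collapse to a single class.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two Gaussian blobs as a toy two-class problem.
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + 2.0])
y = np.hstack([-np.ones(40), np.ones(40)])

for gamma in (1e-9, 1e6):
    clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)
    preds = clf.predict(X)
    print(f"gamma={gamma:g}: "
          f"{clf.support_.size} of {len(X)} points are SVs, "
          f"{np.unique(preds).size} distinct predicted class(es)")

# A tiny gamma drives K(x_i, x_j) toward 1 and the decision function toward
# the constant b (under-learning); a huge gamma drives K toward the identity
# and every training point becomes a support vector (over-learning).
```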
To avoid these problems, we derive a combinatorial theory of kernel construction and propose the concept of a hyperkernel function.
Figure 1. The polynomial kernel function does not map the sample points nonlinearly in an effective way; the data set is under-learned.
Figure 2. Gaussian kernel SVM with γ = 2⁻⁹: all sample points are assigned to the same class; the data set is under-learned.
3. Hyperkernel
Before presenting the hyperkernel in detail, we first review the Mercer theorem, which provides a necessary and sufficient condition for examining a kernel.
Let X be a compact subset of Rⁿ, and let K be a continuous symmetric function. Suppose the integral operator
T_K : L₂(X) → L₂(X), (T_K f)(·) = ∫_X K(·, u) f(u) du
is positive, i.e.,
∫_{X×X} K(u, v) f(u) f(v) du dv ≥ 0 for all f ∈ L₂(X).  (3)
Then K(u, v) can be expanded into a uniformly convergent series (on X × X)
K(u, v) = Σ_{i=1}^{∞} λ_i φ_i(u) φ_i(v), λ_i ≥ 0,
where the φ_i ∈ L₂(X) are the eigenfunctions of T_K, normalized to unit norm. Condition (3) is the Mercer condition, and a kernel is called a Mercer kernel if it satisfies the Mercer condition.
Definition 3.1. Suppose K_i (i = 1, ..., m) are Mercer kernels. Then K = p(K_1(u, v), ..., K_m(u, v)) is a hyperkernel function, where p is a polynomial combination with nonnegative coefficients, p(x_1, ..., x_m) = Σ_{i=1}^{m} λ_i x_i, λ_i ≥ 0.
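As a minimal sketch of Definition 3.1 (with our own function and parameter names, under the assumption that the combination is the weighted sum shown above), a hyperkernel can be assembled from base kernels as follows.

```python
import numpy as np

def make_hyperkernel(base_kernels, lambdas):
    """Combine Mercer kernels k_i(u, v) with nonnegative weights lambda_i."""
    if any(l < 0 for l in lambdas):
        raise ValueError("combination coefficients must be nonnegative")
    def hyper(u, v):
        return sum(l * k(u, v) for l, k in zip(lambdas, base_kernels))
    return hyper

# Example: combine a translation invariant and a rotation invariant kernel.
gauss = lambda u, v: np.exp(-0.5 * np.sum((u - v) ** 2))
poly = lambda u, v: (np.dot(u, v) + 1.0) ** 3
K = make_hyperkernel([gauss, poly], lambdas=[0.7, 0.3])
print(K(np.array([1.0, 2.0]), np.array([2.0, 1.0])))
```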
We need to verify that the hyperkernel is well defined, i.e., that the hyperkernel satisfies the Mercer condition.

Lemma 3.1. A nonnegative linear combination of Mercer kernels is again a Mercer kernel.
Proof. Let K_i (i = 1, ..., m) be Mercer kernels and
K(u, v) = Σ_{i=1}^{m} λ_i K_i(u, v),
where each λ_i is a nonnegative constant (λ_i ≥ 0). Since each K_i is a Mercer kernel, we have
∫ K_i(u, v) f(u) f(v) du dv ≥ 0,
and therefore
∫ K(u, v) f(u) f(v) du dv = ∫ Σ_{i=1}^{m} λ_i K_i(u, v) f(u) f(v) du dv = Σ_{i=1}^{m} λ_i ∫ K_i(u, v) f(u) f(v) du dv ≥ 0.
Hence K(u, v) satisfies the Mercer condition. In particular, K(u, v) is a convex combination of Mercer kernels when Σ_{i=1}^{m} λ_i = 1.
Lemma 3.2. The product of Mercer kernels is again a Mercer kernel.
Figure 3. Gaussian kernel SVM with γ = 2⁶: all sample points become support vectors; the data set is over-learned.
Figure 4. A better-learned classifier on the same data set. The two classes of samples are plotted with different markers; the support vectors are circled.
Proof. Let K_i (i = 1, ..., m) be Mercer kernels and
K(u, v) = Π_{i=1}^{m} K_i(u, v).
It suffices to prove the case m = 2; the general case follows by induction. By the Mercer expansion, K_1(u, v) = Σ_i λ_i φ_i(u) φ_i(v) and K_2(u, v) = Σ_j μ_j ψ_j(u) ψ_j(v) with λ_i, μ_j ≥ 0, so
K_1(u, v) K_2(u, v) = Σ_{i,j} λ_i μ_j [φ_i(u) ψ_j(u)] [φ_i(v) ψ_j(v)].
Consequently,
∫ K_1(u, v) K_2(u, v) f(u) f(v) du dv = Σ_{i,j} λ_i μ_j ( ∫ φ_i(u) ψ_j(u) f(u) du )² ≥ 0,
and K(u, v) satisfies the Mercer condition.
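Lemmas 3.1 and 3.2 can also be checked numerically on any finite sample: on a finite set, the Mercer condition corresponds to the Gram matrix being positive semidefinite, and both the weighted sum and the element-wise product of Gram matrices preserve this property. The following sketch (our own illustration) verifies the eigenvalues for a random sample.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)

# Gram matrices of a Gaussian and a polynomial kernel on the sample.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-0.5 * sq_dists)
K_poly = (X @ X.T + 1.0) ** 3

K_sum = 0.7 * K_gauss + 0.3 * K_poly   # Lemma 3.1: nonnegative combination
K_prod = K_gauss * K_poly              # Lemma 3.2: element-wise product

for name, K in [("sum", K_sum), ("product", K_prod)]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"smallest eigenvalue of {name} Gram matrix: {min_eig:.2e}")
# Both minima are nonnegative (up to floating-point round-off).
```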
From Lemmas 3.1 and 3.2 we immediately obtain the following theorem.

Theorem 3.1. If K_i (i = 1, ..., m) are Mercer kernels, then the hyperkernel K = p(K_1, ..., K_m) is a Mercer kernel.

Theorem 3.1 implies that the proposed hyperkernel remains a Mercer kernel. Therefore we can construct a class of valid hyperkernels from common kernel functions (such as those in Table 1). The resulting hyperkernels can combine translation invariant and rotation invariant components, which makes them more flexible for learning a wider range of data sets.
Besides the kernel parameters, the combination coefficients λ_i also need to be tuned. The resulting kernel would therefore be difficult to tune by cross validation, as there are more free variables (one for each component kernel). However, the simultaneous tuning approach proposed in [6] can handle this problem.
4. Experiments
Two classification experiments are designed to examine the efficiency of the hyperkernel in SVM.
Experiment 1. In the first experiment, we choose two classes of simulated data, plotted with two different markers; 60 of the samples are used for training and 20 for testing. Figures 5, 6 and 7 show the classifiers learned by a cubic polynomial kernel SVM, a Gaussian kernel SVM and a hyperkernel SVM, respectively. The hyperkernel used in this experiment is
K = λ₁ exp(−γ‖u − v‖²) + λ₂ (u · v)^d,
and it has 4 parameters to be tuned. The results are listed in Table 2.
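For readers who want to reproduce this kind of setup, the sketch below passes an Experiment-1-style hyperkernel (a weighted sum of a Gaussian and a polynomial kernel) to scikit-learn's SVC as a callable kernel. The weights and kernel parameters shown are arbitrary placeholders, not the tuned values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder hyperparameters; the paper tunes these values.
lam1, lam2, gamma, d = 0.6, 0.4, 0.5, 3

def hyperkernel(A, B):
    """Gram matrix of lam1 * Gaussian + lam2 * polynomial between rows of A and B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2 * A @ B.T)
    return lam1 * np.exp(-gamma * sq_dists) + lam2 * (A @ B.T) ** d

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + 2.0])
y = np.hstack([-np.ones(40), np.ones(40)])

clf = SVC(kernel=hyperkernel, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y), "| #SVs:", clf.support_.size)
```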
Experiment 2. To verify the performance of the hyperkernel on larger-scale data, we run the hyperkernel SVM on the Titanic data set, a UCI benchmark database (available at http://archive.ics.uci.edu/ml/). The Titanic data set gives the values of four categorical attributes for each of the 2201 people on board the Titanic when it struck an iceberg and sank. The attributes are social class (first class, second class, third class, or crew member), age (adult or child), sex, and whether or not the person survived. We choose 2000 samples for training and 201 for testing. The hyperkernel used in this experiment is
K = λ₁ exp(−γ‖u − v‖²) + λ₂ (u · v + a)^d + λ₃ tanh(u · v + b),
and its parameters are tuned simultaneously using the approach of [6]. The results are listed in Table 2.
Figure 5. Optimal classifier of SVM with the cubic polynomial kernel function.
Figure 6. Optimal classifier of SVM with the Gaussian kernel function.
Figure 7. Optimal classifier of SVM with the hyperkernel function.
5. Conclusions
We have proposed a combinatorial approach to the construction of hyperkernel functions. The approach can preserve multiple properties simultaneously, such as translation invariance and rotation invariance, and constructs hyperkernels that are not subject to under- or over-learning.
The combinatorial approach to hyperkernel construction also suggests a combinatorial approach to kernel model selection; efficient hyperkernel selection and hyperparameter tuning methods remain to be investigated.
Acknowledgment

This work is supported in part by the Natural Science Foundation of China under Grant No. 60678049 and the Natural Science Foundation of Tianjin under Grant No. 07JCYBJC14600.
Table 2. Comparison between the hyperkernel and common kernels in Experiments 1 and 2. "# Par" denotes the number of parameters, "# SVs" the number of support vectors after learning, and "Training"/"Test" the training and test accuracy, respectively.

Experiment   Kernel     # Par   # SVs   Training   Test
Expt 1       Cubic      1       22      0.85       0.80
             Gaussian   1       16      0.90       0.90
             Hyper      4       12      0.98       0.95
Expt 2       Cubic      1       1378    0.78       0.77
             Gaussian   1       947     0.79       0.79
             Hyper      7       900     0.81       0.80
References
[1] S. Amari and S. Wu. Improving support vector machine classifier by modifying kernel function. Neural Networks, 12(6):783–789, 1999.
[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.
[3] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
[4] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel-target alignment. Journal of Machine Learning Research, 1:131, 2002.
[5] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[6] S. Liao and L. Jia. Simultaneous tuning of hyperparameter and parameter for support vector machines. In Proceedings of the Eleventh Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 162–172, 2007.
[7] C. Ong and A. Smola. Machine learning using hyperkernels. In Proceedings of the International Conference on Machine Learning, pages 568–575, 2003.
[8] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Advances in Neural Information Processing Systems 14, pages 478–485, 2002.
[9] P. Sollich. Probabilistic methods for support vector machines. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 349–355, 2000.
[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[11] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ACM International Conference Proceeding Series, pages 839–846, 2004.