the decision function degenerates to the constant

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b = b.$$
The Gaussian kernel SVM therefore assigns every test sample to the same class and becomes under-learned.
Second, Figure 3 shows the training data and the classification function for a support vector machine using a single Gaussian kernel. Instead of under-learning the data set, the hyperplane now classifies every training point perfectly. As all samples become support vectors, the learning machine becomes over-learned.
Theorem 2.2. The Gaussian kernel SVM with $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ becomes over-learned as $\gamma$ grows ($\gamma \to \infty$).
Proof. When $\gamma \to \infty$,

$$\lim_{\gamma \to \infty} K(x_i, x_j) = \begin{cases} 1, & x_i = x_j, \\ 0, & x_i \neq x_j. \end{cases}$$
Suppose the training set has $l$ samples, of which $l_+$ have label $y_i = +1$ and $l_-$ have label $y_i = -1$. By symmetry, the dual solution takes the form

$$\alpha_i = \begin{cases} \alpha_+, & y_i = +1, \\ \alpha_-, & y_i = -1, \end{cases} \qquad 0 < \alpha_+, \alpha_- < C.$$
Because $\alpha_i > 0$ for every $i$, all the sample points become support vectors. Below we solve for the $\alpha_i$. From the dual problem of the SVM we have
$$\sum_{i=1}^{l} \alpha_i y_i = 0,$$
hence

$$\alpha_+ l_+ - \alpha_- l_- = 0. \tag{1}$$
From the KKT conditions we get

$$\alpha_i \left[ y_i \left( w^T \phi(x_i) + b \right) - 1 \right] = 0 \quad \text{and} \quad w^T \phi(x_i) + b = y_i,$$
which can be transformed to

$$\sum_{j=1}^{l} \alpha_j y_j K(x_j, x_i) + b = y_i, \qquad i = 1, 2, \dots, l.$$
Letting $\gamma \to \infty$, only the $j = i$ term survives and we get the simultaneous equations

$$\alpha_+ + b = 1, \qquad -\alpha_- + b = -1. \tag{2}$$
Combining with Equation (1), we finally obtain the values of $\alpha_i$ and $b$:

$$\alpha_+ = 2 l_- / l, \qquad \alpha_- = 2 l_+ / l, \qquad b = (l_+ - l_-)/l.$$
Let $C > \max\{2 l_-/l,\; 2 l_+/l\}$; then $0 < \alpha_+, \alpha_- < C$ as required and the claim holds: the Gaussian kernel SVM becomes over-learned. For example, with balanced classes ($l_+ = l_- = l/2$), every point is a support vector with $\alpha_+ = \alpha_- = 1$ and $b = 0$.
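The closed-form values above can be checked numerically. A minimal sketch (ours, assuming scikit-learn and synthetic labels; not part of the original experiments): train an RBF SVM with a very large $\gamma$ and compare the dual coefficients and intercept with $2l_-/l$, $2l_+/l$, and $(l_+ - l_-)/l$.

```python
# Numerical check of Theorem 2.2 (sketch): with a huge gamma the Gram matrix
# is nearly the identity, every point becomes a support vector, and the dual
# coefficients approach 2*l_minus/l and 2*l_plus/l with b = (l_plus-l_minus)/l.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 5, 1, -1)   # synthetic labels (assumption)
l = len(y)
l_plus = int(np.sum(y == 1))
l_minus = l - l_plus

clf = SVC(kernel="rbf", gamma=2.0**20, C=10.0).fit(X, y)

print(len(clf.support_), "support vectors out of", l)   # expect all 30
# dual_coef_ stores alpha_i * y_i; its magnitudes should match the theorem
print(np.unique(np.round(np.abs(clf.dual_coef_), 4)))
print("theorem:", 2 * l_minus / l, "and", 2 * l_plus / l)
print("intercept:", clf.intercept_[0], "theorem:", (l_plus - l_minus) / l)
```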
Although Theorems 2.1 and 2.2 investigate single-kernel SVMs in the limit, the under- or over-learning phenomena already appear once $\gamma$ passes a certain value. Figures 2 and 3 show an under-learned and an over-learned Gaussian SVM respectively, where the training samples are under-learned with $\gamma = 2^{-9}$ and over-learned with $\gamma = 2^{6}$. A much better classifier for the same data set is shown in Figure 4.
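The two failure regimes are easy to reproduce with an off-the-shelf SVM. The sketch below (our illustration; it uses synthetic two-moons data rather than the paper's samples) sweeps $\gamma$ from $2^{-9}$ to $2^{6}$ and reports training accuracy and support-vector counts.

```python
# Sketch of the under-/over-learning regimes of Figures 2-4 (synthetic data).
# Tiny gamma: the RBF kernel is nearly constant and one class absorbs all
# samples. Huge gamma: every point becomes a support vector.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=80, noise=0.2, random_state=1)
for gamma in (2.0**-9, 1.0, 2.0**6):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma=2^{int(np.log2(gamma)):+d}: "
          f"train acc={clf.score(X, y):.2f}, #SVs={len(clf.support_)}/{len(y)}")
```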
To avoid these problems, we derive a combinatorial theory of kernel construction and propose the concept of a hyperkernel function.
Figure 1. The polynomial kernel function does not map the sample points nonlinearly in an efficient way; the data set is under-learned. (Axes: $u$, $v$.)
Figure 2. Gaussian kernel SVM with $\gamma = 2^{-9}$: all sample points are assigned to the same class; the data set is under-learned. (Axes: $u$, $v$.)
3. Hyperkernel
Before presenting the hyperkernel in detail, we first review the Mercer theorem, which provides a necessary and sufficient condition for examining a kernel.
Let $X$ be a compact subset of $\mathbb{R}^n$, and assume $K$ is a continuous and symmetric function. Suppose the integral operator

$$T_K : L_2(X) \to L_2(X), \qquad (T_K f)(\cdot) = \int_X K(\cdot, u)\, f(u)\, du$$

is positive, i.e.

$$\int_{X \times X} K(u, v)\, f(u)\, f(v)\, du\, dv \ge 0, \qquad \forall f \in L_2(X). \tag{3}$$
Then $K(u, v)$ can be expanded into a uniformly convergent series (on $X \times X$)

$$K(u, v) = \sum_{i=1}^{\infty} \lambda_i \phi_i(u) \phi_i(v), \qquad \lambda_i \ge 0,$$

where the $\phi_i \in L_2(X)$ are the eigenfunctions of $T_K$, normalized to unit norm. Condition (3) is the Mercer condition, and a kernel is called a Mercer kernel if it satisfies the Mercer condition.
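The Mercer condition (3) has a convenient finite-sample counterpart: every Gram matrix built from a Mercer kernel must be positive semidefinite. A minimal sketch of this check, assuming NumPy and a Gaussian kernel (our example, not from the paper):

```python
# Finite-sample check of the Mercer condition: the Gram matrix of a Mercer
# kernel on any finite sample must be positive semidefinite.
import numpy as np

def gaussian_kernel(U, V, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

X = np.random.default_rng(0).normal(size=(50, 3))
G = gaussian_kernel(X, X)
print("min eigenvalue:", np.linalg.eigvalsh(G).min())  # >= 0 up to round-off
```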
Definition 3.1. Suppose $K_i$ ($i = 1, \dots, m$) are Mercer kernels; then $K = p\left(K_i(u, v), \lambda, \mu\right)$ is a hyperkernel function, where

$$p(x, \lambda, \mu) = \sum_{i=1}^{m} \lambda_i x_i^{\mu_i}, \qquad \lambda_i, \mu_i \ge 0,$$

with the exponents $\mu_i$ taken as nonnegative integers.
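Definition 3.1 translates directly into code. The sketch below (the helper names are ours; the exponents are small integers so that Lemma 3.2 below covers the powers) builds $p(K_i(u, v), \lambda, \mu)$ from arbitrary base kernels:

```python
# Sketch of Definition 3.1: K = sum_i lambda_i * K_i(u, v) ** mu_i.
import numpy as np

def hyperkernel(kernels, lambdas, mus):
    """Return the hyperkernel p(K_i(u, v), lambda, mu) as a callable."""
    def K(U, V):
        return sum(lam * k(U, V) ** mu
                   for k, lam, mu in zip(kernels, lambdas, mus))
    return K

# Example: a translation-invariant Gaussian plus a rotation-invariant cubic
# term (u . v)^3, mirroring the Experiment 1 kernel later in the paper.
gauss = lambda U, V: np.exp(-((U[:, None, :] - V[None, :, :]) ** 2).sum(-1))
linear = lambda U, V: U @ V.T
K = hyperkernel([gauss, linear], lambdas=[0.7, 0.3], mus=[1, 3])

X = np.random.default_rng(0).normal(size=(5, 2))
print(K(X, X).shape)  # (5, 5) Gram matrix
```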
We need to verify that the hyperkernel is well defined, i.e. that a hyperkernel satisfies the Mercer condition.

Lemma 3.1. A nonnegative linear combination of Mercer kernels is again a Mercer kernel.
Proof. Let $K_i$ ($i = 1, \dots, m$) be Mercer kernels and let

$$K(u, v) = \sum_{i=1}^{m} \lambda_i K_i(u, v),$$

where each $\lambda_i$ is a nonnegative constant ($\lambda_i \ge 0$).
Since each $K_i$ is a Mercer kernel, we have

$$\int K_i(u, v)\, f(u)\, f(v)\, du\, dv \ge 0,$$

and therefore

$$\int K(u, v)\, f(u)\, f(v)\, du\, dv = \sum_{i=1}^{m} \lambda_i \int K_i(u, v)\, f(u)\, f(v)\, du\, dv \ge 0,$$

so $K(u, v)$ satisfies the Mercer condition. In particular, $K(u, v)$ is a convex combination of Mercer kernels when $\sum_{i=1}^{m} \lambda_i = 1$.
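A quick finite-sample illustration of Lemma 3.1 (our sketch): a nonnegative combination of positive semidefinite Gram matrices remains positive semidefinite.

```python
# Sketch: nonnegative combinations of PSD Gram matrices stay PSD (Lemma 3.1).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
G1 = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Gaussian Gram
G2 = X @ X.T                                                  # linear Gram
print(np.linalg.eigvalsh(0.4 * G1 + 0.6 * G2).min())  # >= 0 up to round-off
```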
Lemma 3.2. A product of Mercer kernels is again a Mercer kernel.
Figure 3. Gaussian kernel SVM with $\gamma = 2^{6}$: all sample points become support vectors; the data set is over-learned. (Axes: $u$, $v$.)

Figure 4. A better-learned data set. The two classes of samples are shown with different markers; support vectors are circled. (Axes: $u$, $v$.)

Proof. Let $K_i$ ($i = 1, \dots, m$) be Mercer kernels.
Let

$$K(u, v) = \prod_{i=1}^{m} K_i(u, v).$$

It suffices to treat $m = 2$; the general case follows by induction. By the Mercer expansion above, $K_1(u, v) = \sum_j \lambda_j \phi_j(u) \phi_j(v)$ and $K_2(u, v) = \sum_k \nu_k \psi_k(u) \psi_k(v)$ with $\lambda_j, \nu_k \ge 0$, so for any $f \in L_2(X)$

$$\int K_1(u, v) K_2(u, v)\, f(u)\, f(v)\, du\, dv = \sum_{j,k} \lambda_j \nu_k \left( \int \phi_j(u)\, \psi_k(u)\, f(u)\, du \right)^2 \ge 0.$$

Hence $K(u, v)$ satisfies the Mercer condition.
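The finite-sample counterpart of Lemma 3.2 is the Schur product theorem: the elementwise product of positive semidefinite Gram matrices is positive semidefinite. A quick numeric check (our sketch):

```python
# Sketch: the elementwise (Schur) product of PSD Gram matrices stays PSD.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
G1 = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
G2 = (X @ X.T) ** 2                       # Gram of the quadratic kernel
print(np.linalg.eigvalsh(G1 * G2).min())  # elementwise product; >= 0
```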
From Lemmas 3.1 and 3.2 we can easily deduce the following theorem.

Theorem 3.1. If $K_i$ ($i = 1, \dots, m$) are Mercer kernels, then the hyperkernel $K = p(K_i, \lambda, \mu)$ is a Mercer kernel.

Theorem 3.1 implies that the proposed hyperkernel remains a Mercer kernel. Therefore we can construct a class of valid hyperkernels from common kernel functions (such as those in Table 1). The resulting hyperkernels can be translation invariant and rotation invariant simultaneously, which makes them more flexible in learning a wider range of data sets.
Besides the kernel parameters, the hyperparameters $\lambda$ and $\mu$ also need to be tuned. The resulting kernel would therefore be difficult to tune by cross validation, as there are more free variables (one $\lambda_i$ and one $\mu_i$ for each kernel). However, the simultaneous tuning approach proposed in [6] can handle this problem.
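The simultaneous tuning method of [6] is not reproduced here. As a rough stand-in (our sketch; the parameter grid and the toy labels are assumptions), one can grid-search the hyperkernel weights and kernel parameters jointly, cross-validating an SVC over a precomputed Gram matrix:

```python
# Rough stand-in for joint tuning (not the method of [6]): grid-search the
# hyperkernel parameters with cross-validation over a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def hyper_gram(U, V, lam1, lam2, gamma, d):
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return lam1 * np.exp(-gamma * d2) + lam2 * (U @ V.T) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.sign(X[:, 0] * X[:, 1])            # toy labels (assumption)

best, best_score = None, -np.inf
for lam1 in (0.25, 0.5, 0.75):
    for gamma in (2.0**-3, 1.0, 2.0**3):
        for d in (1, 2, 3):
            G = hyper_gram(X, X, lam1, 1.0 - lam1, gamma, d)
            score = cross_val_score(SVC(kernel="precomputed"), G, y, cv=3).mean()
            if score > best_score:
                best, best_score = (lam1, 1.0 - lam1, gamma, d), score
print("selected (lam1, lam2, gamma, d):", best, "cv acc: %.2f" % best_score)
```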
4. Experiments
Two classification experiments are designed to examine the efficiency of the hyperkernel in SVMs.
Experiment 1. In the first experiment, we choose two classes of simulated data, plotted with different markers; 60 samples are used for training and 20 for testing. Figures 5, 6, and 7 show the classifiers learned by the cubic polynomial kernel SVM, the Gaussian kernel SVM, and the hyperkernel SVM, respectively. The hyperkernel used in this experiment is

$$K = \lambda_1 \exp\left( -\gamma \|u - v\|^2 \right) + \lambda_2 (u \cdot v)^d,$$

and its four parameters ($\lambda_1, \lambda_2, \gamma, d$) need to be tuned. Results are listed in Table 2.
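For concreteness, the Experiment 1 hyperkernel can be written as a scikit-learn kernel callable. This is a sketch with placeholder parameter values, not the tuned values behind Table 2:

```python
# The Experiment 1 hyperkernel as an SVC kernel callable (placeholder
# parameters, not the tuned ones from Table 2).
import numpy as np
from sklearn.svm import SVC

def exp1_kernel(U, V, lam1=0.5, lam2=0.5, gamma=1.0, d=3):
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return lam1 * np.exp(-gamma * d2) + lam2 * (U @ V.T) ** d

clf = SVC(kernel=exp1_kernel, C=1.0)   # SVC accepts a Gram-matrix callable
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(clf.fit(X, y).predict(X))
```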
Experiment 2. To verify the performance of the hyperkernel on large-scale data, we run the hyperkernel SVM on the Titanic data set, a UCI benchmark database¹. The Titanic data set gives the values of four categorical attributes for each of the 2201 people on board the Titanic when it struck an iceberg and sank. The attributes are social class (first class, second class, third class, or crew member), age (adult or child), sex, and whether or not the person survived. We choose 2000 samples for training and 201 for testing. The hyperkernel used in this experiment is

$$K = \lambda_1 \exp\left( -\gamma \|u - v\|^2 \right) + \lambda_2 (u \cdot v + a)^d + \lambda_3 \tanh(u \cdot v + b),$$
and its seven parameters ($\lambda_1, \lambda_2, \lambda_3, \gamma, a, d, b$) are tuned simultaneously using the approach of [6]. Results are listed in Table 2.

¹Available at http://archive.ics.uci.edu/ml/.

Figure 5. Optimal classifier of the SVM with the cubic polynomial kernel function. (Axes: $u$, $v$.)

Figure 6. Optimal classifier of the SVM with the Gaussian kernel function. (Axes: $u$, $v$.)

Figure 7. Optimal classifier of the SVM with the hyperkernel function. (Axes: $u$, $v$.)
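The paper does not describe how the four categorical Titanic attributes are encoded before the dot products in the hyperkernel are computed. One plausible preprocessing (our assumption, including the stand-in rows below) is one-hot encoding:

```python
# One possible encoding of the categorical Titanic attributes (assumed; the
# paper does not specify its preprocessing). The real file has 2201 rows.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "class": ["1st", "crew", "3rd"],      # stand-in rows
    "age":   ["adult", "adult", "child"],
    "sex":   ["female", "male", "male"],
})
X = OneHotEncoder(sparse_output=False).fit_transform(df)
print(X.shape)                            # one column per category value
```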
5. Conclusions
We have proposed a combinatorial approach to hyperkernel function construction. The approach can preserve multiple properties simultaneously, such as translation invariance and rotation invariance, and it constructs hyperkernels that are not subject to under- or over-learning.

The combinatorial approach to hyperkernel construction also suggests a combinatorial approach to kernel model selection; efficient hyperkernel selection and hyperparameter tuning methods remain to be investigated.
Acknowledgment

This work is supported in part by the Natural Science Foundation of China under Grant No. 60678049 and the Natural Science Foundation of Tianjin under Grant No. 07JCYBJC14600.

Table 2. Comparison between the hyperkernel and common kernels in Experiments 1 and 2. "# Par" denotes the number of parameters, "# SVs" the number of support vectors after learning, and "Training"/"Test" the training and test accuracy, respectively.

         Kernel     # Par   # SVs   Training   Test
Expt 1   Cubic        1       22      0.85     0.80
         Gaussian     1       16      0.90     0.90
         Hyper        4       12      0.98     0.95
Expt 2   Cubic        1     1378      0.78     0.77
         Gaussian     1      947      0.79     0.79
         Hyper        7      900      0.81     0.80
References

[1] S. Amari and S. Wu. Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12(6):783–789, 1999.
[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
[3] C. S. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
[4] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel-target alignment. Journal of Machine Learning Research, 1:1–31, 2002.
[5] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[6] S. Liao and L. Jia. Simultaneous tuning of hyperparameter and parameter for support vector machines. In Proceedings of the Eleventh Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 162–172, 2007.
[7] C. Ong and A. Smola. Machine learning using hyperkernels. In Proceedings of the International Conference on Machine Learning, pages 568–575, 2003.
[8] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Advances in Neural Information Processing Systems 14, pages 478–485, 2002.
[9] P. Sollich. Probabilistic methods for support vector machines. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 349–355, 2000.
[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[11] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ACM International Conference Proceeding Series, pages 839–846, 2004.