trons have a dual kernel representation implementing the same decision function. The optimal margin algorithm exploits this duality both for improved efficiency and flexibility. In the dual space the decision function is expressed as a linear combination of basis functions parametrized by the supporting patterns. The supporting patterns correspond to the class centers of RBF classifiers and are chosen automatically by the maximum margin training procedure. In the case of polynomial classifiers, the Perceptron representation involves an intractable number of parameters. This problem is overcome in the dual space representation, where the classification rule is a weighted sum of a kernel function [Pog75] for each supporting pattern. High order polynomial classifiers with very large training sets can therefore be handled efficiently with the proposed algorithm.

The training algorithm is described in Section 2. Section 3 summarizes important properties of optimal margin classifiers. Experimental results are reported in Section 4.

2 MAXIMUM MARGIN TRAINING ALGORITHM

The maximum margin training algorithm finds a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_i with labels y_i:

    (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_p, y_p)   (1)

where

    y_k = 1   if x_k ∈ class A,
    y_k = -1  if x_k ∈ class B.

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

    x ∈ A if D(x) > 0,
    x ∈ B otherwise.   (2)

The decision functions must be linear in their parameters but are not restricted to linear dependence on x. These functions can be expressed either in direct, or in dual space. The direct space notation is identical to the Perceptron decision function [Ros62]:

    D(x) = Σ_{i=1}^{N} w_i φ_i(x) + b.   (3)

In the dual space, the decision function is of the form

    D(x) = Σ_{k=1}^{p} α_k K(x_k, x) + b.   (4)

The coefficients α_k are the parameters to be adjusted and the x_k are the training patterns. The function K is a predefined kernel, for example a potential function [ABR64] or any Radial Basis Function [BL88, MD89]. Under certain conditions [CH53], symmetric kernels possess finite or infinite series expansions of the form

    K(x, x') = Σ_i φ_i(x) φ_i(x').   (5)

In particular, the kernel K(x, x') = (x · x' + 1)^q corresponds to a polynomial expansion φ(x) of order q [Pog75].

Provided that the expansion stated in equation 5 exists, equations 3 and 4 are dual representations of the same decision function, and

    w_i = Σ_{k=1}^{p} α_k φ_i(x_k).   (6)

The parameters w_i are called direct parameters, and the α_k are referred to as dual parameters.

The proposed training algorithm is based on the "generalized portrait" method described in [Vap82] that constructs separating hyperplanes with maximum margin. Here this algorithm is extended to train classifiers linear in their parameters. First, the margin between the class boundary and the training patterns is formulated in the direct space. This problem description is then transformed into the dual space by means of the Lagrangian. The resulting problem is that of maximizing a quadratic form with constraints and is amenable to efficient numeric optimization algorithms [Lue84].

2.1 MAXIMIZING THE MARGIN IN THE DIRECT SPACE

In the direct space the decision function is

    D(x) = w · φ(x) + b,   (7)

where w and φ(x) are N dimensional vectors and b is a bias. It defines a separating hyperplane in φ-space. The distance between this hyperplane and pattern x is D(x)/||w|| (Figure 1). Assuming that a separation of the training set with margin M between the class boundary and the training patterns exists, all training patterns fulfill the following inequality:

    y_k D(x_k) / ||w|| ≥ M.   (8)
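The duality of equations 3 and 4 can be checked numerically. The sketch below is an illustration only (the input points are made up): it builds the explicit degree-2 feature map φ for two-dimensional inputs and verifies that φ(x) · φ(x') reproduces the kernel K(x, x') = (x · x' + 1)².

```python
import numpy as np

def phi2(x):
    # Explicit feature map whose inner product reproduces the kernel
    # K(x, x') = (x . x' + 1)^2 for 2-dimensional inputs (equation 5).
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

x  = np.array([0.5, -1.0])
xp = np.array([2.0, 0.3])

direct = phi2(x) @ phi2(xp)        # inner product in phi-space (equation 3 view)
dual   = (x @ xp + 1.0) ** 2       # kernel evaluation in input space (equation 4 view)
print(abs(direct - dual) < 1e-12)  # prints True: the two representations agree
```

The direct representation needs all six components of φ; the dual representation needs only an inner product in input space, which is what makes high order polynomial classifiers tractable.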
Figure 1: Maximum margin linear decision function D(x) = w · x + b (φ(x) = x). The gray levels encode the absolute value of the decision function (solid black corresponds to D(x) = 0). The numbers indicate the supporting patterns.
These patterns are called the supporting patterns of the decision boundary.

A decision function with maximum margin is illustrated in Figure 1. The problem of finding a hyperplane in φ-space with maximum margin is therefore a minimax problem:

    max_{w, ||w||=1} min_k y_k D(x_k).   (11)

The norm of the parameter vector in equations 9 and 11 is fixed to pick one of an infinite number of possible solutions that differ only in scaling. Instead of fixing the norm of w to take care of the scaling problem, the product of the margin M and the norm of the weight vector w can be fixed:

    M ||w|| = 1.   (12)

Thus, maximizing the margin M is equivalent to minimizing the norm ||w||. Then the problem of finding a maximum margin separating hyperplane w* stated in 9 reduces to solving the following quadratic problem:

    min_w ||w||^2   (13)
    subject to y_k D(x_k) ≥ 1, k = 1, 2, ..., p.

2.2 MAXIMIZING THE MARGIN IN THE DUAL SPACE

Problem 13 can be transformed into the dual space by means of the Lagrangian [Lue84]

    L(w, b, α) = (1/2) ||w||^2 − Σ_{k=1}^{p} α_k [y_k D(x_k) − 1]   (14)
    subject to α_k ≥ 0, k = 1, 2, ..., p.

The factors α_k are called Lagrange multipliers or Kuhn-Tucker coefficients and satisfy the conditions

    α_k (y_k D(x_k) − 1) = 0, k = 1, 2, ..., p.   (15)

The factor one half has been included for cosmetic reasons; it does not change the solution.

The optimization problem 13 is equivalent to searching a saddle point of the function L(w, b, α). This saddle point is the minimum of L(w, b, α) with respect to w, and a maximum with respect to α (α_k ≥ 0). At the solution, the following necessary condition is met:

    w* = Σ_{k=1}^{p} α_k* y_k φ(x_k).   (16)
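Problem 13 can be handed directly to a general-purpose constrained optimizer. The following sketch is a toy illustration, not the algorithm of [Lue84]: the two-dimensional training set is made up, and scipy's SLSQP routine stands in for a dedicated quadratic programming solver.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy training set (phi(x) = x, so the direct space is x-space).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):              # v = (w1, w2, b); minimize ||w||^2
    return v[0] ** 2 + v[1] ** 2

cons = [{'type': 'ineq',       # y_k (w . x_k + b) - 1 >= 0 for every pattern
         'fun': (lambda v, k=k: y[k] * (v[:2] @ X[k] + v[2]) - 1.0)}
        for k in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
w, b = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(w)   # M* = 1 / ||w*||
print(np.round(w, 3), round(b, 3), round(margin, 3))
```

On this set the supporting patterns are (2, 2), (-1, -1) and (-2, 0); the optimizer returns w* ≈ (1/3, 1/3), b* ≈ -1/3, and a margin M* = 1/||w*|| ≈ 2.12.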
The dependence of the Lagrangian L(w, b, α) on the weight vector w is removed by substituting the expansion of w* given by equation 16 for w. Further transformations using 3 and 5 result in a Lagrangian which is a function of the parameters α and the bias b only:

    J(α, b) = Σ_{k=1}^{p} α_k (1 − b y_k) − (1/2) α · H · α   (17)
    subject to α_k ≥ 0, k = 1, 2, ..., p.

Here H is a square matrix of size p × p with elements H_kl = y_k y_l K(x_k, x_l).

In order for a unique solution to exist, H must be positive definite. For fixed bias b, the solution α* is obtained by maximizing J(α, b) under the conditions α_k ≥ 0. Based on equations 7 and 16, the resulting decision function is of the form

    D(x) = w* · φ(x) + b   (18)
         = Σ_k α_k* y_k K(x_k, x) + b,    α_k* > 0,

where only the supporting patterns appear in the sum with nonzero weight.

The choice of the bias b gives rise to several variants of the algorithm. The two considered here are:

1. The bias can be fixed a priori and not subjected to training. This corresponds to the "Generalized Portrait Technique" described in [Vap82].

2. The cost function 17 can be optimized with respect to w and b. This approach gives the largest possible margin M* in φ-space [VC74].

In both cases the solution is found with standard nonlinear optimization algorithms for quadratic forms with linear constraints [Lue84, Loo72]. The second approach gives the largest possible margin. There is no guarantee, however, that this solution exhibits also the best generalization performance.

A strategy to optimize the margin with respect to both w and b is described in [Vap82]. It solves problem 17 for differences of pattern vectors to obtain α* independent of the bias, which is computed subsequently. The margin in φ-space is maximized when the decision boundary is halfway between the two classes. Hence the bias b* is obtained by applying 18 to two arbitrary supporting patterns x_A ∈ class A and x_B ∈ class B, taking into account that D(x_A) = 1 and D(x_B) = −1:

    b* = −(1/2) (w* · φ(x_A) + w* · φ(x_B)).   (19)

When the dimensionality of problem 17 is exceedingly large, the training data is divided into chunks that are processed iteratively [Vap82]. The maximum margin hypersurface is constructed for the first chunk and a new training set is formed consisting of the supporting patterns from the solution and those patterns x_k in the second chunk of the training set for which y_k D(x_k) < 1 − ε. A new classifier is trained and used to construct a training set consisting of supporting patterns and examples from the first three chunks which satisfy y_k D(x_k) < 1 − ε. This process is repeated until the entire training set is separated.

3 PROPERTIES OF THE ALGORITHM

In this Section, we highlight some important aspects of the optimal margin training algorithm. The description is split into a discussion of the qualities of the resulting classifier, and computational considerations. Classification performance advantages over other techniques will be illustrated in the Section on experimental results.

3.1 PROPERTIES OF THE SOLUTION

Since maximizing the margin between the decision boundary and the training patterns is equivalent to maximizing a quadratic form in the positive quadrant, there are no local minima and the solution is always unique if H has full rank. At the optimum

    J(α*) = (1/2) ||w*||^2 = 1 / (2 (M*)^2) = (1/2) Σ_{k=1}^{p} α_k*.   (20)

The uniqueness of the solution is a consequence of the maximum margin cost function and represents an important advantage over other algorithms for which the solution depends on the initial conditions or other parameters that are difficult to control.

Another benefit of the maximum margin objective is its insensitivity to small changes of the parameters w or α. Since the decision function D(x) is a linear function of w in the direct, and of α in the dual space, the probability of misclassifications due to parameter variations of the components of these vectors is minimized for maximum margin. The robustness of the solution, and potentially its generalization performance, can be increased further by omitting some supporting patterns from the solution. Equation 20 indicates that the largest increase in the maximum margin M* occurs when the supporting patterns with largest α_k are eliminated. The elimination can be performed automatically or with assistance from a supervisor. This feature gives rise to other important uses of the optimum margin algorithm in database cleaning applications [MGB+92].
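When the bias is trained as well (variant 2 above), eliminating b from equation 17 yields the constraint Σ_k α_k y_k = 0, and the remaining quadratic form can again be maximized with a generic solver. The sketch below is illustrative only (made-up data, with scipy standing in for the optimizers of [Lue84, Loo72]); w* is then recovered via equation 16 and the bias via equation 19.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy set; the linear kernel K(x_k, x_l) = x_k . x_l is used.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
p = len(X)

H = (y[:, None] * y[None, :]) * (X @ X.T)   # H_kl = y_k y_l K(x_k, x_l)

def neg_J(a):    # maximize sum_k a_k - 1/2 a.H.a  <=>  minimize its negative
    return 0.5 * a @ H @ a - a.sum()

res = minimize(neg_J, x0=np.full(p, 0.1),
               bounds=[(0.0, None)] * p,                      # alpha_k >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                 # equation 16: w* = sum_k alpha_k* y_k x_k
sv = alpha > 1e-4                   # supporting patterns: nonzero alpha_k
xA = X[sv & (y > 0)][0]             # one supporting pattern from class A
xB = X[sv & (y < 0)][0]             # ... and one from class B
b = -0.5 * (w @ xA + w @ xB)        # equation 19
print(np.round(w, 3), round(b, 3))
```

The result agrees with solving problem 13 in the direct space: only the supporting patterns carry nonzero α_k, and they alone determine w* and b*.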
Figure 2: Linear decision boundary for MSE (left) and maximum margin cost functions (middle, right) in the presence of an outlier. In the rightmost picture the outlier has been removed. The numbers reflect the ranking of supporting patterns according to the magnitude of their Lagrange coefficient α_k for each class individually.
Outliers, such as the one in Figure 2 (left and middle), strongly influence the position of the decision boundary. These examples are readily identified as those with the largest α_k and can be eliminated either automatically or with supervision. Hence, optimal margin classifiers give complete control over the handling of outliers, as opposed to quietly ignoring them.

The optimum margin algorithm performs automatic capacity tuning of the decision function to achieve good generalization. An estimate for an upper bound of the generalization error is obtained with the "leave-one-out" method: A pattern x_k is removed from the training set. A classifier is then trained on the remaining patterns and tested on x_k. This process is repeated for all p training patterns. The generalization error is estimated by the ratio of misclassified patterns over p. For a maximum margin classifier, two cases arise: If x_k is not a supporting pattern, the decision boundary is unchanged and x_k will be classified correctly. If x_k is a supporting pattern, two cases are possible:

1. The pattern x_k is linearly dependent on the other supporting patterns. In this case it will be classified correctly.

2. x_k is linearly independent from the other supporting patterns. In this case the outcome is uncertain. In the worst case m' linearly independent supporting patterns are misclassified when they are omitted from the training data.

Hence the frequency of errors obtained by this method is at most m'/p, and has no direct relationship with the number of adjustable parameters. The number of linearly independent supporting patterns m' itself is bounded by min(N, p). This suggests that the number of supporting patterns is related to an effective capacity of the classifier that is usually much smaller than the VC-dimension, N + 1 [Vap82, HLW88].

In polynomial classifiers, for example, N ≈ n^q, where n is the dimension of x-space and q is the order of the polynomial. In practice, m < p << N, i.e. the number of supporting patterns is much smaller than the dimension of the φ-space. The capacity tuning realized by the maximum margin algorithm is essential to obtain generalization with high-order polynomial classifiers.

3.2 COMPUTATIONAL CONSIDERATIONS

Speed and convergence are important practical considerations of classification algorithms. The benefit of the dual space representation in reducing the number of computations required, for example for polynomial classifiers, has been pointed out already. In the dual space, each evaluation of the decision function D(x) requires m evaluations of the kernel function K(x_k, x) and forming the weighted sum of the results. This number can be further reduced through the use of appropriate search techniques which omit evaluations of K that yield negligible contributions to D(x) [Omo91].

Typically, the training time for a separating surface from a database with several thousand examples is a few minutes on a workstation, when an efficient optimization algorithm is used. All experiments reported in the next section on a database with 7300 training examples took less than five minutes of CPU time per separating surface. The optimization was performed with an algorithm due to Powell that is described in [Lue84] and available from public numerical libraries.

Quadratic optimization problems of the form stated in 17 can be solved in polynomial time with the Ellipsoid method [NY83]. This technique first finds a hyperspace that is guaranteed to contain the optimum; the volume of this space is then reduced iteratively by a constant fraction. The algorithm is polynomial in the number of free parameters p and the encoding size (i.e. the accuracy of the problem and solution). In practice, however, algorithms without guaranteed polynomial convergence are more efficient.
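The leave-one-out bound of Section 3.1 can be exercised numerically. In the sketch below (a made-up linearly separable set, with a generic SLSQP solver instead of the optimizers discussed above), the classifier is retrained p times and the resulting error estimate is compared with m/p, where m counts the patterns lying exactly on the margin (an upper bound on m').

```python
import numpy as np
from scipy.optimize import minimize

def fit_max_margin(X, y):
    # Direct-space maximum margin (problem 13) with phi(x) = x; returns (w, b).
    d = X.shape[1]
    cons = [{'type': 'ineq',
             'fun': (lambda v, k=k: y[k] * (v[:d] @ X[k] + v[d]) - 1.0)}
            for k in range(len(X))]
    res = minimize(lambda v: v[:d] @ v[:d], np.zeros(d + 1), constraints=cons)
    return res.x[:d], res.x[d]

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 4.0],
              [-1.0, -1.0], [-2.0, 0.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
p = len(X)

# Leave-one-out estimate of the generalization error.
errors = 0
for k in range(p):
    keep = np.arange(p) != k
    w, b = fit_max_margin(X[keep], y[keep])
    errors += int(y[k] * (w @ X[k] + b) <= 0)

# Supporting patterns of the full solution lie exactly on the margin.
w, b = fit_max_margin(X, y)
m = int(np.sum(np.abs(y * (X @ w + b) - 1.0) < 1e-4))
print(errors / p, "<=", m / p)
```

Here none of the leave-one-out trials misclassifies the held-out pattern, consistent with the bound m'/p.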
Figure 3: Supporting patterns from database DB2 for class 2 before cleaning. The patterns are ranked according to α_k.
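The ranking step behind Figure 3 and the cleaning experiments reduces to sorting the dual coefficients. In this sketch the α values are illustrative only, and the retraining step after elimination is omitted.

```python
import numpy as np

# Hypothetical dual coefficients for the supporting patterns of one class,
# as they might come out of the training algorithm (cf. Figure 3).
alpha = np.array([1.37, 1.05, 0.747, 0.641, 0.651, 0.556, 0.549, 0.544])
idx = np.argsort(-alpha)           # rank supporting patterns by alpha_k

# Database cleaning: eliminate the highest-ranked (most atypical) patterns
# and keep the remaining patterns for retraining.
n_drop = 2
keep = np.sort(idx[n_drop:])
print(idx[:n_drop], float(alpha[keep].max()))
```

Per equation 20, dropping the largest-α_k patterns yields the biggest increase of the maximum margin, which is why they are inspected (or removed) first.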
and has been recorded from actual mail pieces. Results for this data have been reported in several publications, see e.g. [CBD+90]. The resolution of the images in both databases is 16 by 16 pixels.

In all experiments, the margin is maximized with respect to w and b. Ten hypersurfaces, one per class, are used to separate the digits. Regardless of the difficulty of the problem, measured for example by the number of supporting patterns found by the algorithm, the same similarity function K(x, x') and preprocessing is used for all hypersurfaces of one experiment. The results obtained with different choices of K corresponding to linear hyperplanes, polynomial classifiers, and basis functions are summarized below. The effect of smoothing is investigated as a simple form of preprocessing.

For linear hyperplane classifiers, corresponding to the similarity function K(x, x') = x · x', the algorithm finds an errorless separation for database DB1. The percentage of errors on the test set is 3.2%. This result compares favorably to hyperplane classifiers which minimize the mean squared error (backpropagation or pseudo-inverse), for which the error on the test set is 12.7%.

Database DB2 is also linearly separable but contains several meaningless patterns. Figure 3 shows the supporting patterns with large Lagrange multipliers α_k for the hyperplane for class 2. The percentage of misclassifications on the test set of DB2 drops from 15.2% without cleaning to 10.5% after removing meaningless and ambiguous patterns.

Better performance has been achieved with both databases using multilayer neural networks or other classification functions of higher capacity. For polynomial classifiers of order q, the dimension N of φ-space grows as follows:

    q = 1 (linear):  N = 256
    q = 2:           N ≈ 3 · 10^4
    q = 3:           N ≈ 8 · 10^7
    q = 4:           N ≈ 4 · 10^9
    q = 5:           N ≈ 1 · 10^12

The results obtained for DB2 show a strong decrease of the number of supporting patterns from a linear to a third order polynomial classification function and an equivalently significant decrease of the error rate. Further increase of the order of the polynomial has little effect on either the number of supporting patterns or the performance, unlike the dimension of φ-space, N, which increases exponentially. The lowest error rate, 4.9%, is obtained with a fourth order polynomial and is slightly better than the 5.1% reported for a five layer neural network with a sophisticated architecture [CBD+90], which has been trained and tested on the same data.

In the above experiment, the performance changes drastically between first and second order polynomials. This may be a consequence of the fact that the maximum VC-dimension of a q-th order polynomial classifier is equal to the dimension n of the patterns to the q-th power and thus much larger than n. A more gradual change of the VC-dimension is possible when the function K is chosen to be a power series, for example

    K(x, x') = exp(γ x · x') − 1.   (21)

In this equation the parameter γ is used to vary the VC-dimension gradually. For small values of γ, equation 21 approaches a linear classifier with VC-dimension at
Figure 4: Decision boundaries for maximum margin classifiers with second order polynomial decision rule K(x, x') = (x · x' + 1)^2 (left) and an exponential RBF K(x, x') = exp(−||x − x'||/2) (middle). The rightmost picture shows the decision boundary of a two layer neural network with two hidden units trained with backpropagation.
most equal to the dimension n of the patterns plus one. Experiments with database DB1 lead to a slightly better performance than the 1.5% obtained with a second order polynomial classifier:

    γ       error
    0.25    2.3%
    0.50    2.2%
    0.75    1.3%
    1.00    1.5%

When K(x, x') is chosen to be the hyperbolic tangent, the resulting classifier can be interpreted as a neural network with one hidden layer with m hidden units. The supporting patterns are the weights in the first layer, and the coefficients α_k the weights of the second, linear layer. The number of hidden units is chosen by the training algorithm to maximize the margin between the classes A and B. Substituting the hyperbolic tangent for the exponential function did not lead to better results in our experiments.

The importance of a suitable preprocessing to incorporate knowledge about the task at hand has been pointed out by many researchers. In optical character recognition, preprocessings that introduce some invariance to scaling, rotation, and other distortions are particularly important [SLD92]. As in [GVB+92], smoothing is used to achieve insensitivity to small distortions. The table below lists the error on the test set for different amounts of smoothing. A second order polynomial classifier was used for database DB1, and a fourth order polynomial for DB2. The smoothing kernel is Gaussian with standard deviation σ.

    σ               DB1 error   <m>     DB2 error   <m>
    no smoothing    1.5%        44      4.9%        72
    0.5             1.3%        41      4.6%        73
    0.8             0.8%        36      5.0%        79
    1.0             0.3%        31      6.0%        83
    1.2             0.8%        31

The performance improved considerably for DB1. For DB2 the improvement is less significant and the optimum was obtained for less smoothing than for DB1. This is expected since the number of training patterns in DB2 is much larger than in DB1 (7000 versus 600). A higher performance gain can be expected for more selective hints than smoothing, such as invariance to small rotations or scaling of the digits [SLD92].

Better performance might be achieved with other similarity functions K(x, x'). Figure 4 shows the decision boundary obtained with a second order polynomial and a radial basis function (RBF) maximum margin classifier with K(x, x') = exp(−||x − x'||/2). The decision boundary of the polynomial classifier is much closer to one of the two classes. This is a consequence of the nonlinear transform from φ-space to x-space of polynomials which realizes a position dependent scaling of distance. Radial Basis Functions do not exhibit this problem. The decision boundary of a two layer neural network trained with backpropagation is shown for comparison.

5 CONCLUSIONS

Maximizing the margin between the class boundary and training patterns is an alternative to other training methods optimizing cost functions such as the mean squared error. This principle is equivalent to minimizing the maximum loss and has a number of important features. These include automatic capacity tuning of the classification function, extraction of a small number of supporting patterns from the training data that are relevant for the classification, and uniqueness of the solution. They are exploited in an efficient learning algorithm for classifiers linear in their parameters with very large capacity, such as high order polynomial or RBF classifiers. Key is the representation of the decision function in a dual space which is of much lower dimensionality than the feature space.

The efficiency and performance of the algorithm have been demonstrated on handwritten digit recognition problems.
The achieved performance matches that of sophisticated classifiers, even though no task specific knowledge has been used. The training algorithm is polynomial in the number of training patterns, even in cases when the dimension of the solution space (φ-space) is exponential or infinite. The training time in all experiments was less than an hour on a workstation.

Acknowledgements

We wish to thank our colleagues at UC Berkeley and AT&T Bell Laboratories for many suggestions and stimulating discussions. Comments by L. Bottou, C. Cortes, S. Sanders, S. Solla, A. Zakhor, and the reviewers are gratefully acknowledged. We are especially indebted to R. Baldick and D. Hochbaum for investigating the polynomial convergence property, S. Hein for providing the code for constrained nonlinear optimization, and D. Haussler and M. Warmuth for help and advice regarding performance bounds.

References

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Son, 1973.

[GBD92] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[GPP+89] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. LeCun. Comparing different neural network architectures for classifying handwritten digits. In Proc. Int. Joint Conf. Neural Networks, 1989.

[GVB+92] Isabelle Guyon, Vladimir Vapnik, Bernhard Boser, Leon Bottou, and Sara Solla. Structural risk minimization for character recognition. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[HLW88] David Haussler, Nick Littlestone, and Manfred Warmuth. Predicting {0,1}-functions on randomly drawn points. In Proceedings of the 29th Annual Symposium on the Foundations of Computer Science, pages 100-109. IEEE, 1988.

[KM87] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 20:L745, 1987.

[NY83] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

[Omo91] S. M. Omohundro. Bumptrees for efficient function, constraint and classification learning. In R. P. Lippmann et al., editors, NIPS-90, San Mateo, CA, 1991. Morgan Kaufmann.

[PG90] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, February 1990.

[Pog75] T. Poggio. On optimal nonlinear associative recall. Biol. Cybernetics, 19:201-209, 1975.

[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[SLD92] P. Simard, Y. LeCun, and J. Denker. Tangent prop - a formalism for specifying selected invariance in an adaptive network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[TLS89] N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC, 1989.

[Vap82] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.

[VC74] V. N. Vapnik and A. Ya. Chervonenkis. The Theory of Pattern Recognition. Nauka, Moscow, 1974.