
A Training Algorithm for Optimal Margin Classifiers

Bernhard E. Boser*               Isabelle M. Guyon                Vladimir N. Vapnik
EECS Department                  AT&T Bell Laboratories           AT&T Bell Laboratories
University of California         50 Fremont Street, 6th Floor     Crawford Corner Road
Berkeley, CA 94720               San Francisco, CA 94105          Holmdel, NJ 07733
boser@eecs.berkeley.edu          isabelle@neural.att.com          vlad@neural.att.com

Abstract

A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

* Part of this work was performed while B. Boser was with AT&T Bell Laboratories. He is now at the University of California, Berkeley.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
COLT'92, 7/92, PA, USA. © 1992 ACM 0-89791-498-8/92/0007/0144...$1.50

1 INTRODUCTION

Good generalization performance of pattern classifiers is achieved when the capacity of the classification function is matched to the size of the training set. Classifiers with a large number of adjustable parameters and therefore large capacity likely learn the training set without error, but exhibit poor generalization. Conversely, a classifier with insufficient capacity might not be able to learn the task at all. In between, there is an optimal capacity of the classifier which minimizes the expected generalization error for a given amount of training data. Both experimental evidence and theoretical studies [GBD92, Moo92, GVB+92, Vap82, BH89, TLS89, Mac92] link the generalization of a classifier to the error on the training examples and the complexity of the classifier. Methods such as structural risk minimization [Vap82] vary the complexity of the classification function in order to optimize the generalization.

In this paper we describe a training algorithm that automatically tunes the capacity of the classification function by maximizing the margin between training examples and class boundary [KM87], optionally after removing some atypical or meaningless examples from the training data. The resulting classification function depends only on so-called supporting patterns [Vap82]. These are those training examples that are closest to the decision boundary and are usually a small subset of the training data.

It will be demonstrated that maximizing the margin amounts to minimizing the maximum loss, as opposed to some average quantity such as the mean squared error. This has several desirable consequences. The resulting classification rule achieves an errorless separation of the training data if possible. Outliers or meaningless patterns are identified by the algorithm and can therefore be eliminated easily with or without supervision. This contrasts with classifiers based on minimizing the mean squared error, which quietly ignore atypical patterns. Another advantage of maximum margin classifiers is that the sensitivity of the classifier to limited computational accuracy is minimal compared to other separations with smaller margin. In analogy to [Vap82, HLW88] a bound on the generalization performance is obtained with the "leave-one-out" method. For the maximum margin classifier it is the ratio of the number of linearly independent supporting patterns to the number of training examples. This bound is tighter than a bound based on the capacity of the classifier family.

The proposed algorithm operates with a large class of decision functions that are linear in their parameters but not restricted to linear dependence in the input components. Perceptrons [Ros62], polynomial classifiers, neural networks with one hidden layer, and Radial Basis Function (RBF) or potential function classifiers [ABR64, BL88, MD89] fall into this class. As pointed out by several authors [ABR64, DH73, PG90], Perceptrons have a dual kernel representation implementing the same decision function.

The optimal margin algorithm exploits this duality both for improved efficiency and flexibility. In the dual space the decision function is expressed as a linear combination of basis functions parametrized by the supporting patterns. The supporting patterns correspond to the class centers of RBF classifiers and are chosen automatically by the maximum margin training procedure. In the case of polynomial classifiers, the Perceptron representation involves an untractable number of parameters. This problem is overcome in the dual space representation, where the classification rule is a weighted sum of a kernel function [Pog75] for each supporting pattern. High order polynomial classifiers with very large training sets can therefore be handled efficiently with the proposed algorithm.

The training algorithm is described in Section 2. Section 3 summarizes important properties of optimal margin classifiers. Experimental results are reported in Section 4.

2 MAXIMUM MARGIN TRAINING ALGORITHM

The maximum margin training algorithm finds a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_k with labels y_k:

    (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_p, y_p)                          (1)

    where  y_k = 1   if x_k ∈ class A,
           y_k = -1  if x_k ∈ class B.

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

    x ∈ A if D(x) > 0,
    x ∈ B otherwise.                                                             (2)

The decision functions must be linear in their parameters but are not restricted to linear dependence on x. These functions can be expressed either in direct, or in dual space. The direct space notation is identical to the Perceptron decision function [Ros62]:

    D(x) = Σ_{i=1}^{N} w_i φ_i(x) + b.                                           (3)

In this equation the φ_i are predefined functions of x, and the w_i and b are the adjustable parameters of the decision function. Polynomial classifiers are a special case of Perceptrons for which the φ_i(x) are products of components of x.

In the dual space, the decision functions are of the form

    D(x) = Σ_{k=1}^{p} α_k K(x_k, x) + b.                                        (4)

The coefficients α_k are the parameters to be adjusted and the x_k are the training patterns. The function K is a predefined kernel, for example a potential function [ABR64] or any Radial Basis Function [BL88, MD89]. Under certain conditions [CH53], symmetric kernels possess finite or infinite series expansions of the form

    K(x, x') = Σ_i φ_i(x) φ_i(x').                                               (5)

In particular, the kernel K(x, x') = (x · x' + 1)^q corresponds to a polynomial expansion φ(x) of order q [Pog75].

Provided that the expansion stated in equation 5 exists, equations 3 and 4 are dual representations of the same decision function and

    w_i = Σ_{k=1}^{p} α_k φ_i(x_k).                                              (6)

The parameters w_i are called direct parameters, and the α_k are referred to as dual parameters.

The proposed training algorithm is based on the "generalized portrait" method described in [Vap82] that constructs separating hyperplanes with maximum margin. Here this algorithm is extended to train classifiers linear in their parameters. First, the margin between the class boundary and the training patterns is formulated in the direct space. This problem description is then transformed into the dual space by means of the Lagrangian. The resulting problem is that of maximizing a quadratic form with constraints and is amenable to efficient numeric optimization algorithms [Lue84].

2.1 MAXIMIZING THE MARGIN IN THE DIRECT SPACE

In the direct space the decision function is

    D(x) = w · φ(x) + b,                                                         (7)

where w and φ(x) are N dimensional vectors and b is a bias. It defines a separating hyperplane in φ-space. The distance between this hyperplane and pattern x is D(x)/||w|| (Figure 1). Assuming that a separation of the training set with margin M between the class boundary and the training patterns exists, all training patterns fulfill the following inequality:

    y_k D(x_k) / ||w|| ≥ M.                                                      (8)

The objective of the training algorithm is to find the parameter vector w that maximizes M:

    M* = max_{w, ||w||=1} M                                                      (9)
    subject to  y_k D(x_k) ≥ M,  k = 1, 2, ..., p.

The bound M* is attained for those patterns satisfying

    min_k y_k D(x_k) = M*.                                                       (10)
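To make the duality of equations 3 through 6 concrete, the following minimal sketch (in Python with NumPy; the toy patterns, coefficients, and helper names are ours, not part of the paper) verifies the expansion of equation 5 for the second order polynomial kernel in two dimensions and checks that the direct and dual forms of the decision function coincide.

    import numpy as np

    # Second order polynomial kernel of equation 5: K(x, x') = (x . x' + 1)^2.
    def poly_kernel(x, xp, q=2):
        return (np.dot(x, xp) + 1.0) ** q

    # Explicit expansion phi for q = 2 and two-dimensional x, chosen so that
    # phi(x) . phi(x') = (x . x' + 1)^2.
    def phi(x):
        x1, x2 = x
        s = np.sqrt(2.0)
        return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))              # five training patterns x_k
    alpha = rng.normal(size=5)               # some dual parameters alpha_k (not trained here)
    b = 0.3

    # Equation 5: the kernel equals the inner product of the expansions.
    for k in range(5):
        assert np.isclose(poly_kernel(X[0], X[k]), phi(X[0]) @ phi(X[k]))

    # Equation 6: direct parameters w_i = sum_k alpha_k phi_i(x_k).
    w = sum(alpha[k] * phi(X[k]) for k in range(5))

    x_new = rng.normal(size=2)
    D_direct = w @ phi(x_new) + b                                              # equation 3
    D_dual = sum(alpha[k] * poly_kernel(X[k], x_new) for k in range(5)) + b    # equation 4
    print(np.isclose(D_direct, D_dual))      # True: both representations agree

The same check works for any kernel with a known expansion; for high order polynomials only the dual form is practical, since the number of direct parameters grows roughly as n^q.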

Figure 1: Maximum margin linear decision function D(x) = w · x + b (φ(x) = x). The gray levels encode the absolute value of the decision function (solid black corresponds to D(x) = 0). The numbers indicate the supporting patterns.

These patterns are called the supporting patterns of the decision boundary.

A decision function with maximum margin is illustrated in figure 1. The problem of finding a hyperplane in φ-space with maximum margin is therefore a minimax problem:

    max_{w, ||w||=1}  min_k y_k D(x_k).                                          (11)

The norm of the parameter vector in equations 9 and 11 is fixed to pick one of an infinite number of possible solutions that differ only in scaling. Instead of fixing the norm of w to take care of the scaling problem, the product of the margin M and the norm of a weight vector w can be fixed,

    M ||w|| = 1.                                                                 (12)

Thus, maximizing the margin M is equivalent to minimizing the norm ||w||.¹ Then the problem of finding a maximum margin separating hyperplane w* stated in 9 reduces to solving the following quadratic problem:

    min_w ||w||^2                                                                (13)
    under conditions  y_k D(x_k) ≥ 1,  k = 1, 2, ..., p.

The maximum margin is M* = 1/||w*||.

In principle the problem stated in 13 can be solved directly with numerical techniques. However, this approach is impractical when the dimensionality of the φ-space is large or infinite. Moreover, no information is gained about the supporting patterns.

¹ If the training data is not linearly separable the maximum margin may be negative. In this case, M||w|| = -1 is imposed. Maximizing the margin is then equivalent to maximizing ||w||.

2.2 MAXIMIZING THE MARGIN IN THE DUAL SPACE

Problem 13 can be transformed into the dual space by means of the Lagrangian [Lue84]

    L(w, b, α) = (1/2)||w||^2 - Σ_{k=1}^{p} α_k [y_k D(x_k) - 1],                (14)
    subject to  α_k ≥ 0,  k = 1, 2, ..., p.

The factors α_k are called Lagrange multipliers or Kühn-Tucker coefficients and satisfy the conditions

    α_k (y_k D(x_k) - 1) = 0,  k = 1, 2, ..., p.                                 (15)

The factor one half has been included for cosmetic reasons; it does not change the solution.

The optimization problem 13 is equivalent to searching for a saddle point of the function L(w, b, α). This saddle point is the minimum of L(w, b, α) with respect to w, and a maximum with respect to α (α_k ≥ 0). At the solution, the following necessary condition is met:

    ∂L/∂w = w - Σ_{k=1}^{p} α_k y_k φ(x_k) = 0,

hence

    w* = Σ_{k=1}^{p} α_k* y_k φ(x_k).                                            (16)

The patterns which satisfy y_k D(x_k) = 1 are the supporting patterns. According to equation 16, the vector w* that specifies the hyperplane with maximum margin is a linear combination of only the supporting patterns, which are those patterns for which α_k* ≠ 0. Usually the number of supporting patterns is much smaller than the number p of patterns in the training set.
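As an illustration of problem 13, the sketch below (our own, assuming NumPy and SciPy are available; it is not the optimization code used in the paper) solves the quadratic problem for a small linearly separable set with φ(x) = x, recovers M* = 1/||w*||, and reads off the supporting patterns as the training examples for which y_k D(x_k) = 1.

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data: class A (y = +1) and class B (y = -1).
    X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
                  [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.0]])
    y = np.array([1, 1, 1, -1, -1, -1])
    p, n = X.shape

    # Problem 13: minimize ||w||^2 subject to y_k (w . x_k + b) >= 1.
    def objective(z):                      # z = (w_1, ..., w_n, b)
        w = z[:n]
        return w @ w

    constraints = [{'type': 'ineq',
                    'fun': (lambda z, k=k: y[k] * (z[:n] @ X[k] + z[n]) - 1.0)}
                   for k in range(p)]

    # Start from a feasible point (w = (1, 1), b = 0 separates this toy set).
    res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
                   method='SLSQP', constraints=constraints)
    w_star, b_star = res.x[:n], res.x[n]

    margins = y * (X @ w_star + b_star)
    support = np.isclose(margins, 1.0, atol=1e-3)   # supporting patterns attain the bound
    print("M* =", 1.0 / np.linalg.norm(w_star))     # maximum margin, M* = 1/||w*||
    print("supporting patterns:", np.where(support)[0])

As the text notes, this direct approach only makes sense when the dimension of the φ-space is small; the dual formulation of the next subsection removes that restriction.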

The dependence of the Lagrangian L(w, b, α) on the weight vector w is removed by substituting the expansion of w* given by equation 16 for w. Further transformations using 3 and 5 result in a Lagrangian which is a function of the parameters α and the bias b only:

    J(α, b) = Σ_{k=1}^{p} α_k (1 - b y_k) - (1/2) α · H · α,                     (17)
    subject to  α_k ≥ 0,  k = 1, 2, ..., p.

Here H is a square matrix of size p × p with elements H_kl = y_k y_l K(x_k, x_l).

In order for a unique solution to exist, H must be positive definite. For fixed bias b, the solution α* is obtained by maximizing J(α, b) under the conditions α_k ≥ 0. Based on equations 7 and 16, the resulting decision function is of the form

    D(x) = w* · φ(x) + b                                                         (18)
         = Σ_k y_k α_k* K(x_k, x) + b,   α_k* > 0,

where only the supporting patterns appear in the sum with nonzero weight.

The choice of the bias b gives rise to several variants of the algorithm. The two considered here are:

1. The bias can be fixed a priori and not subjected to training. This corresponds to the "Generalized Portrait Technique" described in [Vap82].

2. The cost function 17 can be optimized with respect to w and b. This approach gives the largest possible margin M* in φ-space [VC74].

In both cases the solution is found with standard nonlinear optimization algorithms for quadratic forms with linear constraints [Lue84, Loo72]. The second approach gives the largest possible margin. There is no guarantee, however, that this solution also exhibits the best generalization performance.

A strategy to optimize the margin with respect to both w and b is described in [Vap82]. It solves problem 17 for differences of pattern vectors to obtain α* independent of the bias, which is computed subsequently. The margin in φ-space is maximized when the decision boundary is halfway between the two classes. Hence the bias b* is obtained by applying 18 to two arbitrary supporting patterns x_A ∈ class A and x_B ∈ class B, taking into account that D(x_A) = 1 and D(x_B) = -1:

    b* = -(1/2) (w* · φ(x_A) + w* · φ(x_B))                                      (19)
       = -(1/2) Σ_{k=1}^{p} y_k α_k* [K(x_A, x_k) + K(x_B, x_k)].

The dimension of problem 17 equals the size of the training set, p. To avoid the need to solve a dual problem of exceedingly large dimensionality, the training data is divided into chunks that are processed iteratively [Vap82]. The maximum margin hypersurface is constructed for the first chunk and a new training set is formed consisting of the supporting patterns from the solution and those patterns x_k in the second chunk of the training set for which y_k D(x_k) < 1 - ε. A new classifier is trained and used to construct a training set consisting of supporting patterns and examples from the first three chunks which satisfy y_k D(x_k) < 1 - ε. This process is repeated until the entire training set is separated.

3 PROPERTIES OF THE ALGORITHM

In this Section, we highlight some important aspects of the optimal margin training algorithm. The description is split into a discussion of the qualities of the resulting classifier, and computational considerations. Classification performance advantages over other techniques will be illustrated in the Section on experimental results.

3.1 PROPERTIES OF THE SOLUTION

Since maximizing the margin between the decision boundary and the training patterns is equivalent to maximizing a quadratic form in the positive quadrant, there are no local minima and the solution is always unique if H has full rank. At the optimum

    J(α*) = (1/2)||w*||^2 = 1/(2 (M*)^2) = (1/2) Σ_{k=1}^{p} α_k*.               (20)

The uniqueness of the solution is a consequence of the maximum margin cost function and represents an important advantage over other algorithms for which the solution depends on the initial conditions or other parameters that are difficult to control.

Another benefit of the maximum margin objective is its insensitivity to small changes of the parameters w or α. Since the decision function D(x) is a linear function of w in the direct space, and of α in the dual space, the probability of misclassifications due to parameter variations of the components of these vectors is minimized for maximum margin. The robustness of the solution, and potentially its generalization performance, can be increased further by omitting some supporting patterns from the solution. Equation 20 indicates that the largest increase in the maximum margin M* occurs when the supporting patterns with largest α_k are eliminated. The elimination can be performed automatically or with assistance from a supervisor. This feature gives rise to other important uses of the optimum margin algorithm in database cleaning applications [MGB+92].

Figure 2 compares the decision boundary for maximum margin and mean squared error (MSE) cost functions. Unlike the MSE based decision function, which simply ignores the outlier, optimal margin classifiers are very sensitive to atypical patterns that are close to the decision boundary.

Figure 2: Linear decision boundary for MSE (left) and maximum margin cost functions (middle, right) in the presence of an outlier. In the rightmost picture the outlier has been removed. The numbers reflect the ranking of supporting patterns according to the magnitude of their Lagrange coefficient α_k for each class individually.

These examples are readily identified as those with the largest α_k and can be eliminated either automatically or with supervision. Hence, optimal margin classifiers give complete control over the handling of outliers, as opposed to quietly ignoring them.

The optimum margin algorithm performs automatic capacity tuning of the decision function to achieve good generalization. An estimate for an upper bound of the generalization error is obtained with the "leave-one-out" method: a pattern x_k is removed from the training set. A classifier is then trained on the remaining patterns and tested on x_k. This process is repeated for all p training patterns. The generalization error is estimated by the ratio of misclassified patterns over p. For a maximum margin classifier, two cases arise: if x_k is not a supporting pattern, the decision boundary is unchanged and x_k will be classified correctly. If x_k is a supporting pattern, two cases are possible:

1. The pattern x_k is linearly dependent on the other supporting patterns. In this case it will be classified correctly.

2. x_k is linearly independent from the other supporting patterns. In this case the outcome is uncertain. In the worst case, m' linearly independent supporting patterns are misclassified when they are omitted from the training data.

Hence the frequency of errors obtained by this method is at most m'/p, and has no direct relationship with the number of adjustable parameters. The number of linearly independent supporting patterns m' itself is bounded by min(N, p). This suggests that the number of supporting patterns is related to an effective capacity of the classifier that is usually much smaller than the VC-dimension, N + 1 [Vap82, HLW88].

In polynomial classifiers, for example, N ≈ n^q, where n is the dimension of x-space and q is the order of the polynomial. In practice, m ≤ p << N, i.e. the number of supporting patterns is much smaller than the dimension of the φ-space. The capacity tuning realized by the maximum margin algorithm is essential to obtain generalization with high-order polynomial classifiers.

3.2 COMPUTATIONAL CONSIDERATIONS

Speed and convergence are important practical considerations of classification algorithms. The benefit of the dual space representation in reducing the number of computations required, for example for polynomial classifiers, has been pointed out already. In the dual space, each evaluation of the decision function D(x) requires m evaluations of the kernel function K(x_k, x) and forming the weighted sum of the results. This number can be further reduced through the use of appropriate search techniques which omit evaluations of K that yield negligible contributions to D(x) [Omo91].

Typically, the training time for a separating surface from a database with several thousand examples is a few minutes on a workstation, when an efficient optimization algorithm is used. All experiments reported in the next section on a database with 7300 training examples took less than five minutes of CPU time per separating surface. The optimization was performed with an algorithm due to Powell that is described in [Lue84] and available from public numerical libraries.

Quadratic optimization problems of the form stated in 17 can be solved in polynomial time with the Ellipsoid method [NY83]. This technique first finds a region that is guaranteed to contain the optimum; the volume of this region is then reduced iteratively by a constant fraction. The algorithm is polynomial in the number of free parameters p and in the encoding size (i.e. the accuracy of the problem and solution). In practice, however, algorithms without guaranteed polynomial convergence are more efficient.
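To connect the pieces, here is a minimal sketch (ours; a plain projected gradient ascent stands in for the quadratic programming routines of [Lue84, Loo72] used in the paper, and the bias is fixed to b = 0 as in the first variant of Section 2.2). It maximizes J(α, 0) of equation 17 over the positive quadrant for a toy set with the linear kernel K(x, x') = x · x', then checks the identities of equation 20 and the leave-one-out bound m'/p of Section 3.1.

    import numpy as np

    rng = np.random.default_rng(1)
    # Toy data separable by a hyperplane through the origin (the bias is fixed to b = 0).
    X = np.vstack([rng.normal(loc=2.0, scale=0.7, size=(10, 2)),
                   rng.normal(loc=-2.0, scale=0.7, size=(10, 2))])
    y = np.concatenate([np.ones(10), -np.ones(10)])
    p = len(y)

    K = X @ X.T                              # linear kernel K(x_k, x_l) = x_k . x_l
    H = (y[:, None] * y[None, :]) * K        # H_kl = y_k y_l K(x_k, x_l)

    # Maximize J(alpha, 0) = sum_k alpha_k - (1/2) alpha.H.alpha over alpha_k >= 0
    # by projected gradient ascent; the step size is set from the largest eigenvalue of H.
    alpha = np.zeros(p)
    eta = 1.0 / np.linalg.eigvalsh(H).max()
    for _ in range(20000):
        alpha = np.maximum(0.0, alpha + eta * (1.0 - H @ alpha))

    support = alpha > 1e-6 * alpha.max()     # supporting patterns: alpha_k > 0
    w = (alpha * y) @ X                      # w* = sum_k alpha_k y_k x_k  (equation 16)
    J = alpha.sum() - 0.5 * alpha @ H @ alpha
    M = (y * (X @ w)).min() / np.linalg.norm(w)   # attained margin, equations 8 and 10

    # Equation 20: the four quantities below should (approximately) agree at the optimum.
    print(J, 0.5 * w @ w, 0.5 / M ** 2, 0.5 * alpha.sum())

    # Leave-one-out bound of Section 3.1: at most m'/p errors, where m' is the number
    # of linearly independent supporting patterns.
    m_prime = np.linalg.matrix_rank(X[support])
    print("leave-one-out error bound:", m_prime / p)

For a training set that does not fit in memory, the chunking procedure described at the end of Section 2.2 would wrap such a solver: solve on one chunk, keep its supporting patterns, and add the patterns of the next chunk that violate y_k D(x_k) ≥ 1 - ε.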


Figure 3: Supporting patterns from database DB2 for class 2 before cleaning. The patterns are ranked according to α_k. (The figure shows sixteen digit images, labeled with their coefficients, ranging from α = 1.37 down to α = 0.429.)
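The cleaning pass of Section 3.1 and Figure 3 amounts to sorting the supporting patterns by their coefficients α_k and reviewing the top of the list; a minimal sketch (the function name and threshold are ours):

    import numpy as np

    def rank_supporting_patterns(alpha, top=16):
        """Indices of the supporting patterns with the largest coefficients alpha_k,
        i.e. the candidates to present for review in a database cleaning pass."""
        support = np.flatnonzero(alpha > 1e-6 * alpha.max())
        return support[np.argsort(alpha[support])[::-1]][:top]

    # Example: with alpha as produced by a training run (e.g. the earlier sketch),
    # candidates = rank_supporting_patterns(alpha)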

4 EXPERIMENTAL RESULTS

The maximum margin training algorithm has been tested on two databases with images of handwritten digits. The first database (DB1) consists of 1200 clean images recorded from ten subjects. Half of this data is used for training, and the other half is used to evaluate the generalization performance. A comparative analysis of the performance of various classification methods on DB1 can be found in [GVB+92, GPP+89, GBD92]. The other database (DB2) used in the experiment consists of 7300 images for training and 2000 for testing and has been recorded from actual mail pieces. Results for this data have been reported in several publications, see e.g. [CBD+90]. The resolution of the images in both databases is 16 by 16 pixels.

In all experiments, the margin is maximized with respect to w and b. Ten hypersurfaces, one per class, are used to separate the digits. Regardless of the difficulty of the problem (measured for example by the number of supporting patterns found by the algorithm), the same similarity function K(x, x') and preprocessing is used for all hypersurfaces of one experiment. The results obtained with different choices of K corresponding to linear hyperplanes, polynomial classifiers, and basis functions are summarized below. The effect of smoothing is investigated as a simple form of preprocessing.

For linear hyperplane classifiers, corresponding to the similarity function K(x, x') = x · x', the algorithm finds an errorless separation for database DB1. The percentage of errors on the test set is 3.2%. This result compares favorably to hyperplane classifiers which minimize the mean squared error (backpropagation or pseudo-inverse), for which the error on the test set is 12.7%.

Database DB2 is also linearly separable but contains several meaningless patterns. Figure 3 shows the supporting patterns with large Lagrange multipliers α_k for the hyperplane for class 2. The percentage of misclassifications on the test set of DB2 drops from 15.2% without cleaning to 10.5% after removing meaningless and ambiguous patterns.

Better performance has been achieved with both databases using multilayer neural networks or other classification functions with higher capacity than linear subdividing planes. Tests with polynomial classifiers of order q, for which K(x, x') = (x · x' + 1)^q, give the following error rates and average number of supporting patterns per hypersurface, <m>. This average is computed as the total number of supporting patterns divided by the number of decision functions. Patterns that support more than one hypersurface are counted only once in the total. For comparison, the dimension N of φ-space is also listed.

                    DB1              DB2
    q               error   <m>      error   <m>      N
    1 (linear)        -      -         -      -       256
    2                 -      -         -      -       ~3·10^4
    3                 -      -         -      -       8·10^7
    4                 -      -         -      -       4·10^9
    5                 -      -         -      -       1·10^12

The results obtained for DB2 show a strong decrease of the number of supporting patterns from a linear to a third order polynomial classification function and an equivalently significant decrease of the error rate. Further increase of the order of the polynomial has little effect on either the number of supporting patterns or the performance, unlike the dimension of φ-space, N, which increases exponentially. The lowest error rate, 4.9%, is obtained with a fourth order polynomial and is slightly better than the 5.1% reported for a five layer neural network with a sophisticated architecture [CBD+90], which has been trained and tested on the same data.

In the above experiment, the performance changes drastically between first and second order polynomials. This may be a consequence of the fact that the maximum VC-dimension of a q-th order polynomial classifier is equal to the dimension n of the patterns to the q-th power and thus much larger than n. A more gradual change of the VC-dimension is possible when the function K is chosen to be a power series, for example

    K(x, x') = exp(γ x · x') - 1.                                                (21)

In this equation the parameter γ is used to vary the VC-dimension gradually. For small values of γ, equation 21 approaches a linear classifier with VC-dimension at most equal to the dimension n of the patterns plus one.
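For reference, the similarity functions used in this section can be written down directly (a sketch in Python; the dictionary and the parameter defaults are ours, not the values used in the experiments):

    import numpy as np

    # The similarity functions K(x, x') compared in the experiments.
    kernels = {
        'linear':       lambda x, xp: np.dot(x, xp),
        'polynomial':   lambda x, xp, q=2: (np.dot(x, xp) + 1.0) ** q,
        'power series': lambda x, xp, gamma=0.5: np.exp(gamma * np.dot(x, xp)) - 1.0,  # equation 21
        'rbf':          lambda x, xp: np.exp(-np.linalg.norm(x - xp) / 2.0),           # cf. Figure 4
    }

    x, xp = np.random.default_rng(2).normal(size=(2, 256))   # e.g. two 16x16 images as vectors
    print({name: float(K(x, xp)) for name, K in kernels.items()})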

Figure 4: Decision boundaries for maximum margin classifiers with second order polynomial decision rule K(x, x') = (x · x' + 1)^2 (left) and an exponential RBF K(x, x') = exp(-||x - x'||/2) (middle). The rightmost picture shows the decision boundary of a two layer neural network with two hidden units trained with backpropagation.

Experiments with database DB1 lead to a slightly better performance than the 1.5% obtained with a second order polynomial classifier:

    γ        error (DB1)
    0.25     2.3%
    0.50     2.2%
    0.75     1.3%
    1.00     1.5%

When K(x, x') is chosen to be the hyperbolic tangent, the resulting classifier can be interpreted as a neural network with one hidden layer with m hidden units. The supporting patterns are the weights in the first layer, and the coefficients α_k the weights of the second, linear layer. The number of hidden units is chosen by the training algorithm to maximize the margin between the classes A and B. Substituting the hyperbolic tangent for the exponential function did not lead to better results in our experiments.

The importance of a suitable preprocessing to incorporate knowledge about the task at hand has been pointed out by many researchers. In optical character recognition, preprocessing steps that introduce some invariance to scaling, rotation, and other distortions are particularly important [SLD92]. As in [GVB+92], smoothing is used to achieve insensitivity to small distortions. The table below lists the error on the test set for different amounts of smoothing. A second order polynomial classifier was used for database DB1, and a fourth order polynomial for DB2. The smoothing kernel is Gaussian with standard deviation σ.

                        DB1              DB2
    σ                   error   <m>      error   <m>
    no smoothing        1.5%    44       4.9%    72
    0.5                 1.3%    41       4.6%    73
    0.8                 0.8%    36       5.0%    79
    1.0                 0.3%    31       6.0%    83
    1.2                 0.8%    31        -       -

The performance improved considerably for DB1. For DB2 the improvement is less significant and the optimum was obtained for less smoothing than for DB1. This is expected since the number of training patterns in DB2 is much larger than in DB1 (7000 versus 600). A higher performance gain can be expected for more selective hints than smoothing, such as invariance to small rotations or scaling of the digits [SLD92].

Better performance might be achieved with other similarity functions K(x, x'). Figure 4 shows the decision boundary obtained with a second order polynomial and a radial basis function (RBF) maximum margin classifier with K(x, x') = exp(-||x - x'||/2). The decision boundary of the polynomial classifier is much closer to one of the two classes. This is a consequence of the nonlinear transform from φ-space to x-space of polynomials, which realizes a position dependent scaling of distance. Radial Basis Functions do not exhibit this problem. The decision boundary of a two layer neural network trained with backpropagation is shown for comparison.

5 CONCLUSIONS

Maximizing the margin between the class boundary and training patterns is an alternative to other training methods that optimize cost functions such as the mean squared error. This principle is equivalent to minimizing the maximum loss and has a number of important features. These include automatic capacity tuning of the classification function, extraction of a small number of supporting patterns from the training data that are relevant for the classification, and uniqueness of the solution. They are exploited in an efficient learning algorithm for classifiers linear in their parameters with very large capacity, such as high order polynomial or RBF classifiers. Key is the representation of the decision function in a dual space which is of much lower dimensionality than the feature space.

The efficiency and performance of the algorithm have been demonstrated on handwritten digit recognition problems.

The achieved performance matches that of sophisticated classifiers, even though no task specific knowledge has been used. The training algorithm is polynomial in the number of training patterns, even in cases when the dimension of the solution space (φ-space) is exponential or infinite. The training time in all experiments was less than an hour on a workstation.

Acknowledgements

We wish to thank our colleagues at UC Berkeley and AT&T Bell Laboratories for many suggestions and stimulating discussions. Comments by L. Bottou, C. Cortes, S. Sanders, S. Solla, A. Zakhor, and the reviewers are gratefully acknowledged. We are especially indebted to R. Baldick and D. Hochbaum for investigating the polynomial convergence property, S. Hein for providing the code for constrained nonlinear optimization, and D. Haussler and M. Warmuth for help and advice regarding performance bounds.

References

[ABR64]   M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

[BH89]    E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[BL88]    D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.

[CBD+90]  Yann Le Cun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Larry D. Jackel. Handwritten digit recognition with a back-propagation network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[CH53]    R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, New York, 1953.

[DH73]    R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Son, 1973.

[GBD92]   S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[GPP+89]  I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. LeCun. Comparing different neural network architectures for classifying handwritten digits. In Proc. Int. Joint Conf. on Neural Networks, 1989.

[GVB+92]  Isabelle Guyon, Vladimir Vapnik, Bernhard Boser, Leon Bottou, and Sara Solla. Structural risk minimization for character recognition. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[HLW88]   David Haussler, Nick Littlestone, and Manfred Warmuth. Predicting {0,1}-functions on randomly drawn points. In Proceedings of the 29th Annual Symposium on the Foundations of Computer Science, pages 100-109. IEEE, 1988.

[KM87]    W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 20:L745, 1987.

[Loo72]   F. A. Lootsma, editor. Numerical Methods for Non-linear Optimization. Academic Press, London, 1972.

[Lue84]   David Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.

[Mac92]   D. MacKay. A practical Bayesian framework for backprop networks. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[MD89]    J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1(2):281-294, 1989.

[MGB+92]  N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik. Computer-aided cleaning of large databases for character recognition. In Digest ICPR. ICPR, Amsterdam, August 1992.

[Moo92]   J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[NY83]    A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

[Omo91]   S. M. Omohundro. Bumptrees for efficient function, constraint and classification learning. In R. P. Lippmann et al., editors, NIPS-90, San Mateo, CA, 1991. Morgan Kaufmann.

[PG90]    T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, February 1990.

[Pog75]   T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.

[Ros62]   F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[SLD92]   P. Simard, Y. LeCun, and J. Denker. Tangent prop: a formalism for specifying selected invariances in an adaptive network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[TLS89]   N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC, 1989.

[Vap82]   Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.

[VC74]    V. N. Vapnik and A. Ya. Chervonenkis. The theory of pattern recognition. Nauka, Moscow, 1974.

