
k-Armed Bandit

Shie Mannor
Israel Institute of Technology, Haifa, Israel

Synonyms
Multi-armed bandit; Multi-armed bandit problem

Definition
In the classical k-armed bandit problem, there are k alternative arms, each with a stochastic reward whose probability distribution is initially unknown. A decision maker can try these arms in some order, which may depend on the rewards that have been observed so far. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, that is, the expected reward that would be obtained if the decision maker were to try the "best" arm at all times. There are many variants of the k-armed bandit problem that are distinguished by the objective of the decision maker, the process governing the reward of each arm, and the information available to the decision maker at the end of every trial.

Motivation and Background
k-Armed bandit problems are a family of sequential decision problems that are among the most studied problems in statistics, control, decision theory, and machine learning. In spite of their simplicity, they encompass many of the basic problems of sequential decision making in uncertain environments, such as the tradeoff between exploration and exploitation. There are many variants of bandit problems, including Bayesian, Markovian, adversarial, budgeted, and exploratory variants. Bandit formulations arise naturally in multiple fields and disciplines, including communication networks, clinical trials, search theory, scheduling, supply chain automation, finance, control, information technology, etc. (Berry & Fristedt, 1985; Cesa-Bianchi & Lugosi, 2006; Gittins, 1989). The term "multi-armed bandit" is borrowed from the slang term for a slot machine (the one-armed bandit), where a decision maker has to decide whether to insert a coin into the gambling machine and pull a lever, possibly getting a significant reward, or to quit without spending any money.

Theory
We briefly review some of the most popular bandit variants.

The Stochastic k-Armed Bandit Problem
The classical stochastic bandit problem is described as follows. There are k arms (or machines or actions) and a single decision maker (or controller or agent). Each arm corresponds to a discrete time Markov process. At each timestep, the decision maker observes the current state of each arm's process and selects one of the arms. As a result, the decision maker obtains a reward from the process of the selected arm and the state of the corresponding process changes. Arms that are not selected are "frozen" and their processes remain in the same state. The objective of the decision maker is to maximize her (discounted) reward.

More formally, let the state of arm n's process at stage t be x_n(t). Then, if the decision maker selects arm m(t) at time t we have that

    x_n(t + 1) = x_n(t)              if n ≠ m(t),
    x_n(t + 1) = f_n(x_n(t), ω)      if n = m(t),

where f_n(x, ω) is a function that describes the (possibly stochastic) transition of the n-th process and accepts the state of the n-th process and a random disturbance ω.
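As an illustration of these dynamics, the following minimal Python sketch simulates the "frozen arms" update rule for a toy problem; the per-arm Markov chains, the reward (taken to be the state value of the selected arm), and the uniformly random selection rule are all made up for the example.

import numpy as np

rng = np.random.default_rng(0)

k = 3                      # number of arms
n_states = 4               # states per arm (arbitrary toy choice)
# One random row-stochastic transition matrix per arm (made-up dynamics f_n).
P = rng.random((k, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

x = np.zeros(k, dtype=int)         # x_n(t): current state of each arm
gamma, total = 0.9, 0.0

for t in range(100):
    m = rng.integers(k)            # a (here: uniformly random) selection rule
    total += (gamma ** t) * x[m]   # reward depends on the selected arm's state
    # Only the selected arm's process moves; all other arms stay frozen.
    x[m] = rng.choice(n_states, p=P[m, x[m]])

print("discounted return of the random policy:", total)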


The reward the decision maker receives at time t is a function of the current state and a random element: r(x_{m(t)}(t), ω). The objective of the decision maker is to maximize her cumulative discounted reward. That is, she wishes to maximize

    V = E^π [ ∑_{t=0}^∞ γ^t r(x_{m(t)}(t), ω_t) ],

where E^π is the expectation obtained when following policy π and γ is a discount factor (0 < γ < 1). A policy is a decision rule for selecting arms as a function of the state of the processes.

This problem can be solved using dynamic programming, but the state space of the joint Markov decision process is exponential in the number of arms. Moreover, the dynamic programming solution does not reveal the important structural properties of the solution. Gittins and Jones (1974) showed that there exists an optimal index policy. That is, there is a function that maps the state of each arm to a real number (the "index") such that the optimal policy is to choose the arm with the highest index at any given time. Therefore, the stochastic bandit problem reduces to the problem of computing the index, which can be easily done in many important cases.

Regret Minimization for the Stochastic k-Armed Bandit Problem
A different flavor of the bandit problem focuses on the notion of regret, or learning loss. In this formulation, there are k arms as before, and when selecting arm m a reward that is independent and identically distributed is given (the reward depends only on the identity of the arm and not on some internal state or the results of previous trials). The decision maker's objective is to obtain high expected reward. Of course, if the decision maker had known the statistical properties of each arm, she would have always chosen the arm with the highest expected reward. However, the decision maker does not know the statistical properties of the arms in advance in this setting.

More formally, if the reward when choosing arm m has expectation r_m, the regret is defined as

    R(t) = t · max_{1≤m≤k} r_m − E^π [ ∑_{τ=1}^t r(τ) ],

where r(τ) is sampled from the arm m(τ) selected at time τ. This quantity represents the expected loss for not choosing the arm with the highest expected reward on every timestep.

This variant of the bandit problem highlights the tension between acquiring information (exploration) and using the available information (exploitation). The decision maker should carefully balance between the two, since if she chooses to only try the arm with the highest estimated reward she might regret not exploring other arms whose reward is underestimated but is actually higher than the reward of the arm with the highest estimated reward.

A basic question in this context is whether R(t) can be made to grow sub-linearly. Robbins (1952) answered this question in the affirmative. It was later proved (Lai & Robbins, 1985) that it is in fact possible to obtain logarithmic regret (the growth of the regret is logarithmic in the number of timesteps). Matching lower bounds (and constants) were also derived.
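One well-known family of strategies that attains logarithmic regret in this stochastic setting is based on upper confidence bounds, for example the UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002), which is not part of this entry's reading list. The sketch below, with made-up Bernoulli arms, is only meant to illustrate the optimistic "mean plus exploration bonus" index idea.

import numpy as np

rng = np.random.default_rng(1)

means = np.array([0.3, 0.5, 0.7])     # unknown Bernoulli arm means (made up)
k, horizon = len(means), 10_000
counts = np.zeros(k)                  # times each arm was pulled
sums = np.zeros(k)                    # cumulative reward per arm
total = 0.0

for t in range(1, horizon + 1):
    if t <= k:
        m = t - 1                     # pull each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        m = int(np.argmax(ucb))       # optimistic index: mean + exploration bonus
    reward = float(rng.random() < means[m])
    counts[m] += 1
    sums[m] += reward
    total += reward

regret = horizon * means.max() - total
print("empirical regret after", horizon, "rounds:", round(regret, 1))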

The Non-stochastic k-Armed Bandit Problem
A third popular variant of the bandit problem is the non-stochastic one. In this problem, it is assumed that the sequence of rewards each arm produces is deterministic (possibly adversarial). The decision maker, as in the stochastic bandit problem, wants to minimize her regret, where the regret is measured with respect to the best fixed arm (this best arm might change with time, however). Letting the reward of arm m at time t be r_m(t), we redefine the regret as

    R(t) = max_{1≤m≤k} ∑_{τ=1}^t r_m(τ) − E^π [ ∑_{τ=1}^t r(τ) ],

where the expectation is now taken with respect to randomness in the arm selection. The basic question here is if the regret can be made to grow sub-linearly. The case where the reward of each arm is observed was addressed in the 1950s (see Cesa-Bianchi & Lugosi, 2006, for a discussion), where it was shown that there are algorithms that guarantee that the regret grows like √t. For the more difficult case, where only the reward of the selected arm is observed and the rewards of the other arms may not be observed, it was shown (Auer, Cesa-Bianchi, Freund, & Schapire, 2002) that the same conclusion still holds.

It should be noticed that the optimal policy of the decision maker in this adversarial setting is generally randomized. That is, the decision maker has to select an action at random by following some distribution. The reason is that if the action the decision maker takes is deterministic and can be predicted by Nature, then Nature can consistently "give" the decision maker a low reward for the selected arm while "giving" a high reward to all other arms, leading to a linear regret.

There are some interesting relationships between the non-stochastic bandit problem and prediction with expert advice, universal prediction, and learning in games (Cesa-Bianchi & Lugosi, 2006).

The Exploratory k-Armed Bandit Problem
This bandit variant emphasizes efficient exploration rather than the exploration–exploitation tradeoff. As in the stochastic bandit problem, the decision maker is given access to k arms, where each arm is associated with an independent and identically distributed random variable with unknown statistics. The decision maker's goal is to identify the "best" arm. That is, the decision maker wishes to find the arm with the highest expected reward as quickly as possible.

The exploratory bandit problem is a sequential hypothesis testing problem, but with the added complication that the decision maker can choose where to sample next, making it among the simplest active learning problems. In the context of the probably approximately correct (PAC) setup, it was shown (Mannor & Tsitsiklis, 2004) that finding an ε-optimal arm (that is, an arm whose expected reward is lower than that of the best arm by at most ε) with probability of at least 1 − δ requires

    O( (k/ε²) log(1/δ) )

samples on expectation. Moreover, this bound can be obtained (up to multiplicative constants) via an algorithm known as median elimination.
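To make the exploratory setting concrete, here is a small elimination-style sketch in the spirit of such algorithms (a simpler successive-elimination scheme rather than median elimination itself; the Bernoulli arms and the confidence radius are made up for illustration).

import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.40, 0.50, 0.80, 0.78])     # unknown Bernoulli means (made up)

def best_arm(eps=0.05, delta=0.05):
    k = len(means)
    active = list(range(k))
    counts = np.zeros(k)
    sums = np.zeros(k)
    t = 0
    while len(active) > 1:
        t += 1
        for m in active:                        # sample every surviving arm once per round
            sums[m] += float(rng.random() < means[m])
            counts[m] += 1
        est = sums[active] / counts[active]
        # Confidence radius shrinks as the surviving arms get sampled more often.
        rad = np.sqrt(np.log(4.0 * k * t * t / delta) / (2.0 * t))
        best = est.max()
        # Eliminate arms whose optimistic estimate falls clearly below the current leader.
        active = [m for m, e in zip(active, est) if e + rad >= best - rad]
        if 2.0 * rad <= eps:                    # all survivors are eps-close to the best arm
            break
    est = sums[active] / counts[active]
    return active[int(np.argmax(est))]

print("arm identified as (approximately) best:", best_arm())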

Bandit analyses such as these have played a key role in understanding the efficiency of reinforcement-learning algorithms as well.

Cross References
→ Active Learning
→ Associative Bandit Problems
→ Dynamic Programming
→ Machine Learning in Games
→ Markov Processes
→ PAC Learning
→ Reinforcement Learning

Recommended Reading
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The non-stochastic multi-armed bandit problem. SIAM Journal on Computing.
Berry, D., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. London/New York: Chapman and Hall.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. New York: Cambridge University Press.
Gittins, J. C. (1989). Multi-armed bandit allocation indices. New York: Wiley.
Gittins, J., & Jones, D. (1974). A dynamic allocation index for sequential design of experiments. In Progress in statistics, European Meeting of Statisticians.
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
Mannor, S., & Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society.


K-Means Clustering

Xin Jin, Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL, USA

K-means (Lloyd, 1957; MacQueen, 1967) is one of the most popular clustering methods. Algorithm 1 shows the procedure of K-means clustering. The basic idea is: Given an initial but not optimal clustering, relocate each point to its new nearest center, update the clustering centers by calculating the mean of the member points, and repeat the relocating-and-updating process until convergence criteria (such as a predefined number of iterations, or the difference in the value of the distortion function) are satisfied.

The task of initialization is to form the initial K clusters. Many initializing techniques have been proposed, from simple methods, such as choosing the first K data points, Forgy initialization (randomly choosing K data points in the dataset), and Random partitions (dividing the data points randomly into K subsets), to more sophisticated methods, such as density-based initialization, Intelligent initialization, Furthest First initialization (FF for short; it works by picking the first center point randomly, then adding more center points which are furthest from existing ones), and subset furthest-first (SFF) initialization. For more details, refer to Steinley and Brusco (2007), which provides a survey and comparison of a number of initialization methods.

Figure 1 shows an example of K-means clustering on a set of points, with K = 2. The clusters are initialized by randomly selecting two points as centers.

[Figure 1, K-Means Clustering: K-means clustering example (K = 2); panels (a) Initialization and (b) Re-assignment. The center of each cluster is marked by "x".]

Complexity analysis. Let N be the number of points, D the number of dimensions, and K the number of centers. Suppose the algorithm runs I iterations to converge. The space complexity of the K-means clustering algorithm is O(N(D + K)). Based on the number of distance calculations, the time complexity of K-means is O(NKI).

Algorithm 1 K-means clustering algorithm
Require: K, the number of clusters; D, a data set of N points
Ensure: A set of K clusters
1. Initialization.
2. repeat
3.   for each point p in D do
4.     find the nearest center and assign p to the corresponding cluster.
5.   end for
6.   update clusters by calculating new centers using the mean of the members.
7. until stop-iteration criteria satisfied
8. return clustering result.
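The following short NumPy sketch implements the relocate-and-update loop of Algorithm 1; the Forgy-style random initialization, the fixed iteration cap, and the toy data are arbitrary choices made for the example.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Forgy-style initialization
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its member points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # convergence criterion
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = kmeans(X, k=2)
print(centers)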

Recommended Reading
Lloyd, S. P. (1957). Least squares quantization in PCM. Technical report, Bell Laboratories.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1). California: University of California Press.
Steinley, D., & Brusco, M. J. (2007). Initializing K-means batch clustering: A critical evaluation of several techniques. Journal of Classification.


K-Medoids Clustering

Xin Jin, Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL, USA

The K-means clustering algorithm is sensitive to outliers, because a mean is easily influenced by extreme values. K-medoids clustering is a variant of K-means that is more robust to noise and outliers. Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, with minimum sum of distances to other points. Figure 1 shows the difference between mean and medoid: the group of points on the right form a cluster, while the rightmost point is an outlier. The mean is greatly influenced by the outlier and thus cannot represent the correct cluster center, while the medoid is robust to the outlier and correctly represents the cluster center.

[Figure 1, K-Medoids Clustering: Mean vs. medoid; panels (a) Mean and (b) Medoid. In both figures (a) and (b), the group of points on the right form a cluster and the rightmost point is an outlier. The red point represents the center found by the mean or the medoid.]

Partitioning around medoids (PAM) (Kaufman & Rousseeuw, 1990) is a representative K-medoids clustering method. The basic idea is as follows: Select K representative points to form initial clusters, and then repeatedly move to better cluster representatives. All possible combinations of representative and nonrepresentative points are analyzed, and the quality of the resulting clustering is calculated for each pair. An original representative point is replaced with the new point which causes the greatest reduction in the distortion function. At each iteration, the set of best points for each cluster forms the new respective medoids.

The time complexity of the PAM algorithm is O(K(N − K)²I). PAM is not scalable for large datasets, and some algorithms have been proposed to improve the efficiency, such as Clustering LARge Applications (CLARA) (Kaufman & Rousseeuw, 1990) and Clustering Large Applications based upon RANdomized Search (CLARANS) (Ng & Han, 2002).

Recommended Reading
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis (Wiley series in probability and statistics). New York: Wiley-Interscience.
Ng, R. T., & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering.


K-Way Spectral Clustering

Xin Jin, Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL, USA

In spectral clustering (Luxburg, 2007), the dataset is represented as a similarity graph G = (V, E). The vertices represent the data points. Two vertices are connected if the similarity between the corresponding data points is larger than a certain threshold, and the edge is weighted by the similarity value. Clustering is achieved by choosing a suitable partition of the graph such that each group corresponds to one cluster.

A good partition (i.e., a good clustering) is one in which the edges between different groups have overall low weights and the edges within a group have high weights, which indicates that the points in different clusters are dissimilar from each other and the points within the same cluster are similar to each other. One basic spectral clustering algorithm finds a good partition in the following way:

Given a set of data points P and the similarity matrix S, where S_ij measures the similarity between points i, j ∈ P, form a graph. Build a Laplacian matrix L of the graph,

    L = I − D^{−1/2} S D^{−1/2},    (1)

where D is the diagonal matrix with

    D_ii = ∑_j S_ij.    (2)

Find the eigenvalues and eigenvectors of the matrix L, map the vertices to the corresponding components, and form clusters based on the embedding space.

The methods to find K clusters include recursive bipartitioning and clustering multiple eigenvectors. The former technique is inefficient and unstable. The latter approach is preferable because it is able to prevent instability due to information loss.
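A compact sketch of this procedure follows; the Gaussian similarities (used in place of explicit thresholding) and the plain K-means step on the eigenvector embedding are common choices assumed here for illustration.

import numpy as np

def spectral_clustering(X, k, gamma=1.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Similarity matrix S (Gaussian similarities; sparsification/thresholding is optional).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-gamma * d2)
    # Normalized Laplacian L = I - D^{-1/2} S D^{-1/2} with D_ii = sum_j S_ij.
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # Embed each vertex using the eigenvectors of the k smallest eigenvalues of L.
    _, eigvecs = np.linalg.eigh(L)            # eigh returns eigenvalues in ascending order
    emb = eigvecs[:, :k]
    # Cluster the embedded vertices with plain K-means on the embedding space.
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([emb[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 6.0])
print(np.bincount(spectral_clustering(X, k=2)))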

Recommended Reading
Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing.


Kernel Density Estimation

→ Density Estimation


Kernel Matrix

Synonyms
Gram matrix

Definition
Given a kernel function k : X × X → ℝ and patterns x_1, ..., x_m ∈ X, the m × m matrix K with elements K_ij := k(x_i, x_j) is called the kernel matrix of k with respect to x_1, ..., x_m.


Kernel Methods

Xinhua Zhang
Australian National University, Canberra, Australia
NICTA London Circuit, Canberra, Australia

Definition
Kernel methods refer to a class of techniques that employ positive definite kernels. At an algorithmic level, the basic idea is quite intuitive: implicitly map objects to high-dimensional feature spaces, and then directly specify the inner product there. As a more principled interpretation, kernel methods formulate learning and estimation problems in a reproducing kernel Hilbert space, which is advantageous in a number of ways:

● It induces a rich feature space and admits a large class of (nonlinear) functions.
● It can be flexibly applied to a wide range of domains, including both Euclidean and non-Euclidean spaces.
● Searching in this infinite-dimensional space of functions can be performed efficiently, and one only needs to consider the finite subspace spanned by the data.
● Working in linear spaces of functions lends significant convenience to the construction and analysis of learning algorithms.

Motivation and Background
Over the past decade, kernel methods have gained much popularity in machine learning. Linear estimators have been popular due to their convenience in analysis and computation. However, nonlinear dependencies exist intrinsically in many real applications, and are indispensable for effective modeling. Kernel methods can sometimes offer the best of both aspects. The reproducing kernel Hilbert space provides a convenient way to model nonlinearity, while the estimation is kept linear. Kernels also offer significant flexibility in analyzing generic non-Euclidean objects such as graphs, sets, and dynamic systems. Moreover, kernels induce a rich function space where functional optimization can be performed efficiently. Furthermore, kernels have also been used to define statistical models via exponential families or Gaussian processes, and can be factorized by graphical models. Indeed, kernel methods have been widely used in almost all tasks in machine learning.

The reproducing kernel was first studied by Aronszajn (1950). Poggio and Girosi (1990) and Wahba (1990) used kernels for data analysis, and Boser, Guyon, and Vapnik (1992) incorporated kernel functions into maximum margin models. Schölkopf, Smola, and Müller (1998) first used kernels for principal component analysis.

Theory
Positive semi-definite kernels are the most commonly used type of kernels, and their motivation is as follows. Given two objects x_1, x_2 from a space X, which is not necessarily Euclidean, we map them to a high-dimensional feature space via ϕ(x_1) and ϕ(x_2), and then compute the inner product there by k(x_1, x_2) = ⟨ϕ(x_1), ϕ(x_2)⟩. In many algorithms, the set {x_i} influences learning only via inner products between x_i and x_j; hence it is sufficient to specify k(x_1, x_2) directly without explicitly defining ϕ. This leads to considerable savings in computation when ϕ ranges in high-dimensional or even infinite-dimensional spaces. Clearly, the function k must satisfy some conditions. For example, as a necessary condition, for any finite number of examples x_1, ..., x_n from X, the matrix

    K := (k(x_i, x_j))_{i,j} = (ϕ(x_1), ..., ϕ(x_n))^⊤ (ϕ(x_1), ..., ϕ(x_n))

must be positive semi-definite. Surprisingly, this turns out to be a sufficient condition as well, and hence we define the positive semi-definite kernels.
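For instance, a kernel (Gram) matrix and its positive semi-definiteness can be checked numerically as follows; the Gaussian RBF kernel and the random patterns are arbitrary choices made for the example.

import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian RBF kernel k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))            # patterns x_1, ..., x_m

# Kernel (Gram) matrix K_ij = k(x_i, x_j).
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# A valid kernel matrix is symmetric with nonnegative eigenvalues.
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off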

Definition 1 (Positive semi-definite kernels) Let X be a nonempty set. A function k : X × X ↦ ℝ is called a positive semi-definite kernel if for any n ∈ ℕ and x_1, ..., x_n ∈ X, the Gram matrix K := (k(x_i, x_j))_{i,j} is symmetric and positive semi-definite (psd).

Reproducing Kernel Hilbert Space
Given a psd kernel k, we are able to construct a map ϕ from X to an inner product space H, such that ⟨ϕ(x_1), ϕ(x_2)⟩ = k(x_1, x_2). The image of x under ϕ is just a function ϕ(x) := k(x, ⋅), where k(x, ⋅) is a function of ⋅, assigning the value k(x, x′) for any x′ ∈ X. To define inner products between functions, we need to construct an inner product space H that contains {k(x, ⋅) : x ∈ X}. First, H must contain the linear combinations {∑_{i=1}^n α_i k(x_i, ⋅) : n ∈ ℕ, x_i ∈ X, α_i ∈ ℝ}. Then, we endow it with an inner product as follows. For any f, g ∈ H with f = ∑_{i=1}^n α_i k(x_i, ⋅) and g = ∑_{j=1}^m β_j k(x′_j, ⋅), define

    ⟨f, g⟩ := ∑_{i=1}^n ∑_{j=1}^m α_i β_j k(x_i, x′_j),

and it is easy to show that this is well defined (independent of the expansion of f and g). Using the induced norm, we can complete the space and thus get a Hilbert space H, which is called the reproducing kernel Hilbert space (RKHS). The term "reproducing" is because for any function f ∈ H, ⟨f, k(x, ⋅)⟩ = f(x).

Properties of psd Kernels
Let X be a nonempty set and k_1, k_2, ... be arbitrary psd kernels on X × X. Then
● The set of psd kernels is a closed convex cone; that is, (a) if α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is psd; (b) if k(x, x′) := lim_{n→∞} k_n(x, x′) exists for all x, x′, then k is psd.
● The pointwise product k_1 k_2 is psd.
● Assume that for i = 1, 2, k_i is a psd kernel on X_i × X_i, where X_i is a nonempty set. Then the tensor product k_1 ⊗ k_2 and the direct sum k_1 ⊕ k_2 are psd kernels on (X_1 × X_2) × (X_1 × X_2).

Example Kernels
One of the key advantages of kernels lies in their applicability to a wide range of objects.

Euclidean spaces: In ℝ^n, popular kernels include the linear kernel k(x_1, x_2) = ⟨x_1, x_2⟩, polynomial kernels k(x_1, x_2) = (⟨x_1, x_2⟩ + c)^d where d ∈ ℕ and c ≥ 0, Gaussian RBF kernels k(x_1, x_2) = exp(−γ∥x_1 − x_2∥²) where γ > 0, and Laplacian RBF kernels k(x_1, x_2) = exp(−γ∥x_1 − x_2∥). Another useful family of kernels on Euclidean spaces is the spline kernels.

Convolution kernels: Haussler (1999) investigated how to define a kernel between composite objects by building on the similarity measures that assess their respective parts. It needs to enumerate all possible ways to decompose the objects, hence efficient algorithms like dynamic programming are needed.

Graph kernels: Graph kernels are available in two categories: between graphs and on a graph. The first type is similar to convolution kernels, which measures the similarity between two graphs. The second type defines a metric between the vertices, and is generally based on the graph Laplacian. By applying various transform functions to the eigenvalues of the graph Laplacian, various smoothing and regularization effects can be achieved.

Fisher kernels: Kernels can also be defined between probability densities p(x∣θ). Let U_θ(x) = −∂_θ log p(x∣θ) and I = E_x[U_θ(x) U_θ(x)^⊤] be the Fisher score and the Fisher information matrix, respectively. Then the normalized and unnormalized Fisher kernels are defined by

    k(x, x′) = U_θ(x)^⊤ I^{−1} U_θ(x′)   and   k(x, x′) = U_θ(x)^⊤ U_θ(x′),

respectively. In theory, estimation using normalized Fisher kernels corresponds to regularization on the L2(p(⋅∣θ)) norm. And in the context of exponential families, the unnormalized Fisher kernels are identical to the inner product of sufficient statistics.

Kernel Function Classes
Many machine learning algorithms can be posed as functional minimization problems, and the RKHS is chosen as the candidate function set. The main advantage of optimizing over an RKHS originates from the representer theorem.

Theorem 1 (Representer theorem) Denote by Ω : [0, ∞) ↦ ℝ a strictly monotonically increasing function, by X a set, and by c : (X × ℝ²)^n ↦ ℝ ∪ {∞} an arbitrary loss function. Then each minimizer f ∈ H of the regularized risk functional

    c((x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))) + Ω(∥f∥_H)    (1)

admits a representation of the form

    f(x) = ∑_{i=1}^n α_i k(x_i, x).

The representer theorem is important in that although the optimization problem is in an infinite-dimensional space H, the solution is guaranteed to lie in the span of n particular kernels centered on the training points.

The objective (1) is composed of two parts: the first part measures the loss on the training set {x_i, y_i}_{i=1}^n, which depends on f only via its value at x_i. The second part is the regularizer, which encourages a small RKHS norm of f. Intuitively, this regularizer penalizes the complexity of f and prefers smooth f. When the kernel k is translation invariant, that is, k(x_1, x_2) = h(x_1 − x_2), Smola, Schölkopf, and Müller (1998) showed that ∥f∥² is related to the Fourier transform of h, with more penalty imposed on the high-frequency components of f.

Applications
Kernels have been applied to almost all branches of machine learning.

Supervised Learning
One of the most well-known applications of kernel methods is the SVM for binary classification. Its primal form can be written as

    minimize_{w,b,ξ}  (λ/2)∥w∥² + (1/n) ∑_{i=1}^n ξ_i,
    s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, and ξ_i ≥ 0, ∀i.

Its dual form can be written as

    minimize_{α}  (1/(2λ)) ∑_{i,j} y_i y_j ⟨x_i, x_j⟩ α_i α_j − ∑_i α_i,
    s.t.  ∑_i y_i α_i = 0,  α_i ∈ [0, n^{−1}], ∀i.

Clearly, this can be extended to feature maps and kernels by setting k(x_i, x_j) = ⟨x_i, x_j⟩. The same trick can be applied to other algorithms like ν-SVM, regression, density estimation, etc. For multi-class classification and structured output classification, where the possible label set Y can be large, kernel maximum margin machines can be formulated by introducing a joint kernel on pairs (x_i, y) (y ∈ Y); that is, the feature map takes the tuple (x_i, y). Letting ∆(y_i, y) be the discrepancy between the true label y_i and the candidate label y, the primal form is

    minimize_{w,ξ}  (λ/2)∥w∥² + (1/n) ∑_{i=1}^n ξ_i,
    s.t.  ⟨w, ϕ(x_i, y_i) − ϕ(x_i, y)⟩ ≥ ∆(y_i, y) − ξ_i, ∀ i, y,

and the dual form is

    minimize_{α}  (1/(2λ)) ∑_{(i,y),(i′,y′)} α_{i,y} α_{i′,y′} ⟨ϕ(x_i, y_i) − ϕ(x_i, y), ϕ(x_{i′}, y_{i′}) − ϕ(x_{i′}, y′)⟩ − ∑_{i,y} ∆(y_i, y) α_{i,y},
    s.t.  α_{i,y} ≥ 0, ∀ i, y;   ∑_y α_{i,y} = 1/n, ∀i.

Again, all the inner products ⟨ϕ(x_i, y), ϕ(x_{i′}, y′)⟩ can be replaced by the joint kernel k((x_i, y), (x_{i′}, y′)). Further factorization using graphical models is possible (see Taskar, Guestrin, & Koller, 2004). Notice that when Y = {1, −1}, setting ϕ(x_i, y) = yϕ(x_i) recovers the binary SVM formulation. Effective methods to optimize the dual objective include sequential minimal optimization, exponentiated gradient (Collins, Globerson, Koo, Carreras, & Bartlett, 2008), mirror descent, cutting plane, or bundle methods (Smola, Vishwanathan, & Le, 2008).
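To illustrate how inner products are replaced by kernel evaluations in a supervised learner, here is a dual (kernel) perceptron, a simpler algorithm than the SVM used purely for illustration; the RBF kernel and the two-blob data are arbitrary choices for the example.

import numpy as np

def rbf_gram(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_perceptron(X, y, gamma=1.0, epochs=20):
    """Dual perceptron: the decision function is f(x) = sum_i alpha_i y_i k(x_i, x)."""
    n = len(y)
    K = rbf_gram(X, X, gamma)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Only kernel evaluations (row K[i]) are needed, never phi(x) itself.
            if y[i] * (K[i] @ (alpha * y)) <= 0:
                alpha[i] += 1.0
    return alpha

rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((30, 2)), rng.standard_normal((30, 2)) + 3.0])
y = np.array([-1.0] * 30 + [1.0] * 30)
alpha = kernel_perceptron(X, y)
pred = np.sign(rbf_gram(X, X) @ (alpha * y))
print("training accuracy:", float((pred == y).mean()))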

Unsupervised Learning
Data analysis can benefit from modeling the distribution of data in feature space. There we can still use the rather simple linear methods, which gives rise to nonlinear methods on the original data space. For example, principal components analysis (PCA) can be extended to Hilbert spaces (Schölkopf et al., 1998), which allows for image denoising, clustering, and nonlinear dimensionality reduction.

Given a set of data points {x_i}_{i=1}^n, PCA tries to find a direction d such that the projection of {x_i} onto d has maximal variance. Mathematically, one solves

    max_{d:∥d∥=1} Var{⟨x_i, d⟩}  ⟺  max_{d:∥d∥=1} d^⊤ ( (1/n) ∑_i x_i x_i^⊤ − (1/n²) ∑_{i,j} x_i x_j^⊤ ) d,

which can be solved by finding the maximum eigenvalue of the variance of {x_i}. Along the same lines, we can map the examples to the RKHS and find the maximum variance projection direction again. Here we first center the data, that is, let the feature map be ϕ̃(x_i) = ϕ(x_i) − (1/n) ∑_j ϕ(x_j), and define a kernel k̃ based on the centered feature. So we have ∑_{j=1}^n K̃_ij = 0 for all i. Now the objective can be written as

    max_{f:∥f∥_H̃=1} Var{⟨ϕ̃(x_i), f⟩_H̃}  ⟺  max_{f:∥f∥=1} Var{f(x_i)}  ⟺  max_{f:∥f∥≤1} Var{f(x_i)}.    (2)

Treat the constraint ∥f∥ ≤ 1 as an indicator function Ω(∥f∥²), where Ω(x) = 0 if x ≤ 1 and ∞ otherwise. Then the representer theorem can be invoked to guarantee that the optimal solution is f = ∑_i α_i k̃(x_i, ⋅) for some α_i ∈ ℝ. Plugging it into (2), the problem becomes max_{α:α^⊤ K̃ α = 1} α^⊤ K̃² α. To get necessary conditions for optimality, we write out the Lagrangian L = α^⊤ K̃² α − λ(α^⊤ K̃ α − 1). Setting the derivative over α to 0, we get

    K̃² α = λ K̃ α.    (3)

Therefore α^⊤ K̃² α = λ. Although (3) does not guarantee that α is an eigenvector of K̃, one can show that for each λ satisfying (3) there exists an eigenvector α of K̃ such that K̃α = λα. Hence, it is sufficient to study the eigensystem of K̃ just as in vanilla PCA. Once the optimal α_i* is obtained, any data point x can be projected to ∑_i α_i* k̃(x_i, x).

More applications of kernels in unsupervised learning can be found in canonical correlation analysis, independent component analysis (Bach & Jordan, 2002), kernelized independence criteria via Hilbert space embeddings of distributions (Smola, Gretton, Song, & Schölkopf, 2007), etc.
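A small numerical sketch of the kernel PCA recipe above; the RBF kernel and the choice of two components are arbitrary, and centering is applied directly to the Gram matrix, which is equivalent to using the centered feature map ϕ̃.

import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                      # Gram matrix K_ij = k(x_i, x_j)
    # Center in feature space: K_tilde = (I - 1/n) K (I - 1/n), so every row sums to 0.
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    K_tilde = J @ K @ J
    # Study the eigensystem of K_tilde, as in vanilla PCA.
    eigvals, eigvecs = np.linalg.eigh(K_tilde)   # ascending order
    alpha = eigvecs[:, ::-1][:, :n_components]   # top eigenvectors
    # Projection of the training points: sum_i alpha_i * k_tilde(x_i, x).
    # (Normalization of alpha by the eigenvalues is omitted in this sketch.)
    return K_tilde @ alpha

X = np.vstack([np.random.randn(30, 3), np.random.randn(30, 3) + 4.0])
Z = kernel_pca(X)
print(Z.shape)        # (60, 2): each point embedded with two kernel components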

Cross References
→ Principal Component Analysis
→ Support Vector Machine

Further Reading
A survey paper on kernel methods is Hofmann, Schölkopf, and Smola (2008). For an introduction to SVMs and kernel methods, read Cristianini and Shawe-Taylor (2000). More comprehensive treatments can be found in Schölkopf and Smola (2002), Shawe-Taylor and Cristianini (2004), and Steinwart and Christmann (2008). As far as applications are concerned, see Lampert (2009) for computer vision and Schölkopf, Tsuda, and Vert (2004) for bioinformatics. Finally, Vapnik (1998) provides the details on statistical learning theory.

Recommended Reading
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society.
Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research.
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the annual conference on computational learning theory. Pittsburgh: ACM Press.
Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, University of California, Santa Cruz.
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics.
Lampert, C. H. (2009). Kernel methods in computer vision. Foundations and Trends in Computer Graphics and Vision.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge: MIT Press.
Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation.
Schölkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. Cambridge: MIT Press.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Smola, A. J., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In International conference on algorithmic learning theory. Berlin: Springer.
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks.
Smola, A., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. In D. Koller & Y. Singer (Eds.), Advances in neural information processing systems. Cambridge: MIT Press.
Steinwart, I., & Christmann, A. (2008). Support vector machines. Information Science and Statistics. New York: Springer.
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems. Cambridge: MIT Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF regional conference series in applied mathematics. Philadelphia: SIAM.


Kernel Shaping

→ Local Distance Metric Adaptation
→ Locally Weighted Regression for Control


Kernel-Based Reinforcement Learning

→ Instance-Based Reinforcement Learning


Kernels

→ Gaussian Process


Kind

→ Class


Knowledge Discovery

→ Text Mining for Semantic Web


Kohonen Maps

→ Self-Organizing Maps
