Jin-Han2010 ReferenceWorkEntry K-MeansClustering
Definition

In the classical k-armed bandit problem, there are k alternative arms, each with a stochastic reward whose probability distribution is initially unknown. A decision maker can try these arms in some order, which may depend on the rewards that have been observed so far. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, that is, the expected reward that would be obtained by trying the "best" arm at all times. The many variants of the k-armed bandit problem are distinguished by the objective of the decision maker, the process governing the reward of each arm, and the information available to the decision maker at the end of every trial.

Motivation and Background

k-Armed bandit problems are a family of sequential decision problems that are among the most studied problems in statistics, control, decision theory, and machine learning. In spite of their simplicity, they encompass many of the basic problems of sequential decision making in uncertain environments, such as the tradeoff between exploration and exploitation. There are many variants of bandit problems, including Bayesian, Markovian, adversarial, budgeted, and exploratory variants. Bandit formulations arise naturally in multiple fields and disciplines, including communication networks, clinical trials, and search theory.

Theory

We briefly review some of the most popular bandit variants.

The Stochastic k-Armed Bandit Problem

The classical stochastic bandit problem is described as follows. There are k arms (or machines, or actions) and a single decision maker (or controller, or agent). Each arm corresponds to a discrete-time Markov process. At each timestep, the decision maker observes the current state of each arm's process and selects one of the arms. As a result, the decision maker obtains a reward from the process of the selected arm, and the state of the corresponding process changes. Arms that are not selected are "frozen": their processes remain in the same state. The objective of the decision maker is to maximize her (discounted) reward.

More formally, let the state of arm n's process at stage t be x_n(t). Then, if the decision maker selects arm m(t) at time t, we have

    x_n(t + 1) = x_n(t)            if n ≠ m(t),
    x_n(t + 1) = f_n(x_n(t), ω)    if n = m(t),

where f_n(x, ω) is a function that describes the (possibly stochastic) transition of the n-th process and takes as input the state of the n-th process and a random disturbance ω.

Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, © Springer Science+Business Media LLC
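The dynamics just described can be sketched in a few lines of code. This is a minimal illustration, not from the entry itself: the two-state chains, the reward values, and the round-robin selection policy are all made-up assumptions.

```python
import random

# Illustrative sketch of the stochastic k-armed bandit dynamics described
# above: each arm is a Markov process that evolves only when selected and
# is "frozen" otherwise. The 2-state chain and rewards are made-up examples.
random.seed(0)

K = 3
states = [0] * K                      # x_n(t): current state of each arm
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}    # P[s] = [prob next state 0, prob next state 1]
reward_of_state = [0.0, 1.0]          # reward depends on the selected arm's state

def step(m):
    """Select arm m: collect its reward, then transition only that arm."""
    r = reward_of_state[states[m]]
    u = random.random()               # random disturbance omega
    states[m] = 0 if u < P[states[m]][0] else 1
    return r

# A round-robin policy for 30 timesteps; unselected arms keep their state.
total = sum(step(t % K) for t in range(30))
```

Note that with this policy each arm's process advances only on the timesteps at which that arm is selected, exactly as in the "frozen arms" description above.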
The reward the decision maker receives at time t is a function of the current state and a random element: r(x_m(t)(t), ω). The objective of the decision maker is to maximize her cumulative discounted reward. That is, she wishes to maximize

    V = E^π [ Σ_{t=0}^∞ γ^t r(x_m(t)(t), ω_t) ],

where E^π is the expectation obtained when following policy π and γ is a discount factor (0 < γ < 1). A policy is a decision rule for selecting arms as a function of the state of the processes.

This problem can be solved using 7dynamic programming, but the state space of the joint Markov decision process is exponential in the number of arms. Moreover, the dynamic programming solution does not reveal the important structural properties of the solution.

Gittins and Jones (1974) showed that there exists an optimal index policy. That is, there is a function that maps the state of each arm to a real number (the "index") such that the optimal policy is to choose the arm with the highest index at any given time. Therefore, the stochastic bandit problem reduces to the problem of computing the index, which can be done easily in many important cases.

Regret Minimization for the Stochastic k-Armed Bandit Problem

A different flavor of the bandit problem focuses on the notion of regret, or learning loss. In this formulation, there are k arms as before, and selecting arm m yields a reward that is independent and identically distributed (the reward depends only on the identity of the arm, and not on some internal state or on the results of previous trials). The decision maker's objective is to obtain a high expected reward. Of course, if the decision maker knew the statistical properties of each arm, she would always choose the arm with the highest expected reward. In this setting, however, the decision maker does not know the statistical properties of the arms in advance.

More formally, if the reward when choosing arm m has expectation r_m, the regret is defined as

    R(t) = t · max_{1≤m≤k} r_m − E^π [ Σ_{τ=1}^t r(τ) ],

where r(τ) is the reward sampled from the arm m(τ) selected at time τ. This quantity represents the expected loss for not choosing the arm with the highest expected reward on every timestep.

This variant of the bandit problem highlights the tension between acquiring information (exploration) and using the available information (exploitation). The decision maker should balance the two carefully: if she only ever tries the arm with the highest estimated reward, she might regret not exploring other arms whose reward is underestimated but is actually higher than that of the arm with the highest estimated reward.

A basic question in this context is whether R(t) can be made to grow sub-linearly. Robbins (1952) answered this question in the affirmative. It was later proved (Lai & Robbins, 1985) that it is in fact possible to obtain logarithmic regret (the growth of the regret is logarithmic in the number of timesteps). Matching lower bounds (including the constants) were also derived.

The Non-stochastic k-Armed Bandit Problem

A third popular variant of the bandit problem is the non-stochastic one. In this problem, it is assumed that the sequence of rewards each arm produces is deterministic (possibly adversarial). As in the stochastic bandit problem, the decision maker wants to minimize her regret, where the regret is measured with respect to the best fixed arm (this best arm might change with time, however). Letting the reward of arm m at time t be r_m(t), we redefine the regret as

    R(t) = max_{1≤m≤k} Σ_{τ=1}^t r_m(τ) − E^π [ Σ_{τ=1}^t r(τ) ],

where the expectation is now taken with respect to randomness in the arm selection. The basic question here is again whether the regret can be made to grow sub-linearly. The case where the reward of every arm is observed was addressed in the 1950s (see Cesa-Bianchi & Lugosi, 2006, for a discussion), where it was shown that there are algorithms guaranteeing that the regret grows like √t. For the more difficult case, where only the reward of the selected arm is observed and the rewards of the other arms may not be observed, it was shown (Auer, Cesa-Bianchi, Freund, & Schapire, 2002) that the same conclusion still holds.
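A concrete algorithm in this spirit is UCB1 (Auer, Cesa-Bianchi, & Fischer), which attains logarithmic regret in the stochastic i.i.d. setting. The entry does not present it, so the following sketch, with made-up Bernoulli arms, is illustrative only.

```python
import math
import random

# Sketch of the UCB1 index policy (Auer, Cesa-Bianchi, & Fischer), one
# well-known way to achieve logarithmic regret in the stochastic setting.
# The Bernoulli reward probabilities below are illustrative assumptions.
random.seed(1)

means = [0.2, 0.5, 0.8]               # unknown to the decision maker
k = len(means)
counts = [0] * k                      # number of times each arm was tried
sums = [0.0] * k                      # total reward collected per arm

T = 2000
for t in range(T):
    if t < k:
        m = t                         # try every arm once first
    else:
        # choose the arm maximizing empirical mean plus exploration bonus
        m = max(range(k), key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]))
    r = 1.0 if random.random() < means[m] else 0.0
    counts[m] += 1
    sums[m] += r

# Realized (not expected) regret against always playing the best arm.
regret = T * max(means) - sum(sums)
```

The exploration bonus shrinks as an arm is tried more often, so suboptimal arms are sampled only logarithmically often, which is the mechanism behind the logarithmic regret bound mentioned above.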
K-Means Clustering
repeatedly moves to better cluster representatives. All possible combinations of representative and nonrepresentative points are analyzed, and the quality of the resulting clustering is calculated for each pair. An original representative point is replaced with the new point that causes the greatest reduction in the distortion function. At each iteration, the set of best points for each cluster forms the new respective medoids.

The time complexity of the PAM algorithm is O(K(N − K)² I). PAM is not scalable for large datasets, and some algorithms have been proposed to improve its efficiency, such as Clustering LARge Applications (CLARA) (Kaufman & Rousseeuw, 1990) and Clustering Large Applications based upon RANdomized Search (CLARANS) (Ng & Han, 2002).

In the similarity graph, an edge connects two points whose similarity is larger than a certain threshold, and the edge is weighted by the similarity value. Clustering is achieved by choosing a suitable partition of the graph such that each group corresponds to one cluster. A good partition (i.e., a good clustering) is one in which the edges between different groups have overall low weights and the edges within a group have high weights, which indicates that the points in different clusters are dissimilar from each other and the points within the same cluster are similar to each other. One basic spectral clustering algorithm finds a good partition in the following way. Given a set of data points P and the similarity matrix S, where S_ij measures the similarity between points i, j ∈ P, form a graph. Build a Laplacian matrix L of the graph,

    L = I − D^(−1/2) S D^(−1/2),

where D is the diagonal matrix with

    D_ii = Σ_j S_ij.

Recommended Reading

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis (Wiley series in probability and statistics). New York: Wiley-Interscience.
Ng, R. T., & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1003–1016.
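The Laplacian construction described above can be sketched in a few lines. The 4-point similarity matrix below is a made-up example with two obvious clusters, {0, 1} and {2, 3}; everything else follows the formulas L = I − D^(−1/2) S D^(−1/2) and D_ii = Σ_j S_ij.

```python
import math

# Sketch of building the normalized graph Laplacian L = I - D^{-1/2} S D^{-1/2}
# from a similarity matrix S, as in the spectral clustering step above.
# S is a made-up 4-point example with two obvious clusters {0, 1} and {2, 3}.
S = [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.0, 0.1],
     [0.1, 0.0, 1.0, 0.8],
     [0.0, 0.1, 0.8, 1.0]]

n = len(S)
d = [sum(row) for row in S]                      # D_ii = sum_j S_ij

# L[i][j] = delta_ij - S_ij / sqrt(d_i * d_j)
L = [[(1.0 if i == j else 0.0) - S[i][j] / math.sqrt(d[i] * d[j])
      for j in range(n)] for i in range(n)]

# Sanity check: the vector v_i = sqrt(d_i) lies in the null space of L,
# i.e., L has eigenvalue 0; eigenvectors for the smallest eigenvalues are
# what spectral clustering feeds into a k-means step to extract the groups.
v = [math.sqrt(x) for x in d]
residual = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
```

In practice one would compute the few smallest eigenvectors of L (e.g., with a linear-algebra library) and cluster their rows; the sketch stops at the matrix construction that the entry describes.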
Theorem (Representer theorem) Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, by X a set, and by c : (X × R²)^n → R ∪ {∞} an arbitrary loss function. Then each minimizer f ∈ H of the regularized risk functional

    c((x_1, y_1, f(x_1)), . . . , (x_n, y_n, f(x_n))) + Ω(∥f∥_H)    (1)

admits a representation of the form

    f(x) = Σ_{i=1}^n α_i k(x_i, x).

The representer theorem is important in that although the optimization problem is in an infinite-dimensional space H, the solution is guaranteed to lie in the span of n particular kernels centered on the training points.

The objective (1) is composed of two parts: the first part measures the loss on the training set {x_i, y_i}_{i=1}^n, which depends on f only via its values at the x_i. The second part is the regularizer, which encourages a small RKHS norm of f. Intuitively, this regularizer penalizes the complexity of f and prefers smooth f. When the kernel k is translation invariant, that is, k(x_1, x_2) = h(x_1 − x_2), Smola, Schölkopf, and Müller (1998) showed that ∥f∥ is related to the Fourier transform of h, with more penalty imposed on the high-frequency components of f.

Applications

Kernels have been applied to almost all branches of machine learning.

Supervised Learning

One of the most well-known applications of kernel methods is the SVM for binary classification. Its primal form can be written as

    minimize_{w,b,ξ}  (λ/2) ∥w∥² + (1/n) Σ_{i=1}^n ξ_i,
    s.t.  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i, and ξ_i ≥ 0, ∀i.

Its dual form can be written as

    minimize_{α}  (1/(2λ)) Σ_{i,j} y_i y_j ⟨x_i, x_j⟩ α_i α_j − Σ_i α_i,
    s.t.  Σ_i y_i α_i = 0, α_i ∈ [0, n^(−1)], ∀i.

Clearly, this can be extended to feature maps and kernels by setting k(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩. The same trick can be applied to other algorithms like ν-SVM, regression, density estimation, etc. For multi-class classification and structured output classification, where the possible label set Y can be large, kernel maximum margin machines can be formulated by introducing a joint kernel on pairs (x_i, y) (y ∈ Y), that is, the feature map takes the tuple (x_i, y). Letting ∆(y_i, y) be the discrepancy between the true label y_i and the candidate label y, the primal form is

    minimize_{w,ξ}  (λ/2) ∥w∥² + (1/n) Σ_{i=1}^n ξ_i,
    s.t.  ⟨w, ϕ(x_i, y_i) − ϕ(x_i, y)⟩ ≥ ∆(y_i, y) − ξ_i, ∀i, y,

and the dual form is

    minimize_{α}  (1/(2λ)) Σ_{(i,y),(i′,y′)} α_{i,y} α_{i′,y′} ⟨ϕ(x_i, y_i) − ϕ(x_i, y), ϕ(x_{i′}, y_{i′}) − ϕ(x_{i′}, y′)⟩ − Σ_{i,y} ∆(y_i, y) α_{i,y},
    s.t.  α_{i,y} ≥ 0, ∀i, y;  Σ_y α_{i,y} = 1/n, ∀i.

Again, all the inner products ⟨ϕ(x_i, y), ϕ(x_{i′}, y′)⟩ can be replaced by the joint kernel k((x_i, y), (x_{i′}, y′)). Further factorization using graphical models is possible (see Taskar, Guestrin, & Koller, 2004). Notice that when Y = {1, −1}, setting ϕ(x_i, y) = yϕ(x_i) recovers the binary SVM formulation. Effective methods to optimize the dual objective include sequential minimal optimization, exponentiated gradient (Collins, Globerson, Koo, Carreras, & Bartlett, 2008), mirror descent, cutting plane, and bundle methods (Smola, Vishwanathan, & Le, 2008).

Unsupervised Learning

Data analysis can benefit from modeling the distribution of the data in feature space. There we can still use rather simple linear methods, which give rise to nonlinear methods on the original data space. For example, principal components analysis (PCA) can be extended to Hilbert spaces (Schölkopf et al., 1998), which allows for image denoising, clustering, and nonlinear dimensionality reduction.

Given a set of data points {x_i}_{i=1}^n, PCA tries to find a direction d such that the projection of {x_i} onto d has the maximal variance. Mathematically, one solves

    maximize_{d : ∥d∥ = 1}  (1/n) Σ_{i=1}^n ⟨d, x_i − μ⟩²,

where μ is the sample mean of the {x_i}.

Cross References

7Principal Component Analysis
7Support Vector Machine
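As a numerical complement to the PCA objective discussed under Unsupervised Learning, here is a sketch that finds the maximal-variance direction of a toy data set. The data and the use of power iteration (rather than a full eigendecomposition) are illustrative choices, not part of the entry.

```python
import math

# Sketch of the PCA objective above: the direction d of maximal projected
# variance is the top eigenvector of the sample covariance matrix, found
# here via simple power iteration on a made-up 2-D data set.
X = [(2.0, 1.9), (1.0, 1.1), (-1.0, -0.9), (-2.0, -2.1)]  # toy data
n = len(X)
mu = tuple(sum(p[a] for p in X) / n for a in range(2))     # sample mean

# Covariance matrix C_ab = (1/n) * sum_i (x_i - mu)_a (x_i - mu)_b
C = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in X) / n
      for b in range(2)] for a in range(2)]

d = [1.0, 0.0]                        # initial direction
for _ in range(100):                  # power iteration: d <- C d / |C d|
    Cd = [C[0][0] * d[0] + C[0][1] * d[1],
          C[1][0] * d[0] + C[1][1] * d[1]]
    norm = math.hypot(Cd[0], Cd[1])
    d = [c / norm for c in Cd]
```

Since the toy points lie roughly along the diagonal, the recovered unit direction d is close to (1/√2, 1/√2); kernel PCA replaces the covariance computation with operations on the kernel matrix.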
Recommended Reading

Smola, A. J., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In International conference on algorithmic learning theory, LNAI (Vol. 4754, pp. 13–31). Berlin: Springer.
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649.
Smola, A., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. In D. Koller & Y. Singer (Eds.), Advances in neural information processing systems (Vol. 20). Cambridge: MIT Press.
Steinwart, I., & Christmann, A. (2008). Support vector machines. Information science and statistics. New York: Springer.
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16). Cambridge: MIT Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF regional conference series in applied mathematics (Vol. 59). Philadelphia: SIAM.

Kernel Shaping

7Local Distance Metric Adaptation
7Locally Weighted Regression for Control

Kernel-Based Reinforcement Learning

7Instance-Based Reinforcement Learning

Kernels

7Gaussian Process

Kind

7Class

Knowledge Discovery

Kohonen Maps

7Self-Organizing Maps