
Bayesian k-nearest-neighbour classification

Christian P. Robert
Université Paris Dauphine & CREST, INSEE
http://www.ceremade.dauphine.fr/xian

Joint work with G. Celeux, J.M. Marin, & D.M. Titterington

Bayesian k-nearest-neighbour classification

Outline

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


MRFs

Markov random fields: natural spatial generalisation of Markov chains.
They can be derived from graph structures, when ignoring time
directionality/causality.
E.g., a Markov chain is also a chain graph of random variables,
where each variable in the graph has the property that it is
independent of all (both past and future) others given its two
nearest neighbours.

Bayesian k-nearest-neighbour classification


MRFs

MRFs (contd)

Definition (MRF)
A general Markov random field is the extension of the above to
any graph structure on the random variables, i.e., a collection of
rvs such that each one is independent of all the others given its
immediate neighbours in the corresponding graph.
[Cressie, 1993]

Bayesian k-nearest-neighbour classification


MRFs

A formal definition
Take y_1, ..., y_n, rvs with values in a finite set S, and let
G = (N, E) be a finite graph with N = {1, ..., n} the collection of
nodes and E the collection of edges, made of pairs from N.
For A ⊂ N, ∂A denotes the set of neighbours of A, i.e. the
collection of all points in N\A that have a neighbour in A.

Definition (MRF)
y = (y_1, ..., y_n) is a Markov random field associated with the
graph G if its full conditionals satisfy
\[ f(y_i \mid y_{-i}) = f(y_i \mid y_{\partial i}) . \]
Cliques are sets of points that are all neighbours of one another.

Bayesian k-nearest-neighbour classification


MRFs

Gibbs distributions

Special case of MRF:
y = (y_1, ..., y_n) is a Gibbs random field associated with the
graph G if
\[ f(y) = \frac{1}{Z} \exp\Big\{ \sum_{c \in C} V_c(y_c) \Big\} , \]
where Z is the normalising constant, C is the set of cliques and V_c
is any function, also called potential (and U(y) = \sum_{c \in C} V_c(y_c) is
the energy function).

Bayesian k-nearest-neighbour classification


MRFs

Statistical perspective
Introduce a parameter β in the Gibbs distribution:
\[ f(y \mid \beta) = \frac{\exp\{Q_\beta(y)\}}{Z(\beta)} \]
and estimate β from observed data y.

Bayesian approach:
put a prior distribution π(β) on β and use the posterior distribution
\[ \pi(\beta \mid y) \propto f(y \mid \beta)\,\pi(\beta) = \frac{\exp\{Q_\beta(y)\}}{Z(\beta)}\,\pi(\beta) \]

Bayesian k-nearest-neighbour classification


MRFs

Potts model

Example (Boltzmann dependence)
Case when Q_β(y) is of the form
\[ Q_\beta(y) = \beta S(y) = \beta \sum_{l \sim i} \mathbb{I}_{y_l = y_i} \]
with the sum taken over neighbouring pairs l ∼ i.
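As a purely illustrative complement (not part of the original slides), the Gibbs/Potts distribution can be evaluated by brute force on a toy grid, which also makes clear why Z(β), a sum over G^n configurations, is intractable in realistic settings; the grid size and β value below are arbitrary choices.

# Potts probabilities on a 2 x 2 grid with G = 2 colours, by enumeration of
# all G^4 = 16 configurations (exactly what becomes infeasible for large grids)
G <- 2; beta <- 0.8
configs <- as.matrix(expand.grid(rep(list(1:G), 4)))   # each row: (y11, y21, y12, y22)
S <- apply(configs, 1, function(y) {
  m <- matrix(y, 2, 2)
  sum(m[1, ] == m[2, ]) + sum(m[, 1] == m[, 2])        # agreeing neighbour pairs
})
Z <- sum(exp(beta * S))                                # normalising constant Z(beta)
f <- exp(beta * S) / Z                                 # f(y | beta) for every configuration
f[which.max(S)]                                        # a monochrome grid is the most likely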

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields

1. MRFs
2. Bayesian inference in Gibbs random fields
   Pseudo-posterior inference
   Path sampling
   Auxiliary variables
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields

Use the posterior π(β|y) to draw inference.
Problem:
\[ Z(\beta) = \sum_{y} \exp\{Q_\beta(y)\} \]
is not available analytically & exact computation is not feasible.

Solutions:
  Pseudo-posterior inference
  Path sampling approximations
  Auxiliary variable method

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Pseudo-posterior inference

Pseudo-posterior inference
Oldest solution: replace the likelihood with the pseudo-likelihood
\[ \text{pseudo-like}(y \mid \beta) = \prod_{i=1}^{n} f(y_i \mid y_{-i}, \beta) . \]
Then define the pseudo-posterior
\[ \text{pseudo-post}(\beta \mid y) \propto \prod_{i=1}^{n} f(y_i \mid y_{-i}, \beta)\,\pi(\beta) \]
and resort to MCMC methods to derive a sample from the pseudo-posterior.
[Besag, 1974-75]
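For concreteness, here is a minimal R sketch (written for this note, not taken from the talk) of the log pseudo-likelihood of a Potts-type field, assuming the neighbourhood structure is given as a list nb of neighbour index vectors and that there are G colours; its maximiser is the pseudo-maximum-likelihood estimate of β used later on.

# log pseudo-likelihood: sum over sites of log f(y_i | y_neighbours, beta),
# with f(y_i = g | ...) proportional to exp(beta * #{neighbours of i in class g})
log_pseudo_lik <- function(beta, y, nb, G = max(y)) {
  sum(sapply(seq_along(y), function(i) {
    counts <- tabulate(y[nb[[i]]], nbins = G)          # neighbours of i in each class
    beta * counts[y[i]] - log(sum(exp(beta * counts)))
  }))
}

# toy usage: 6 sites on a line, neighbours = adjacent sites
y  <- c(1, 1, 2, 2, 2, 1)
nb <- list(2, c(1, 3), c(2, 4), c(3, 5), c(4, 6), 5)
optimize(log_pseudo_lik, c(0, 10), y = y, nb = nb, maximum = TRUE)$maximum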

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling

Generate a sample from [the true] π(β|y) by a Metropolis-Hastings
algorithm, with acceptance probability
\[ \rho_{MH1}(\beta' \mid \beta) = 1 \wedge
   \frac{Z(\beta)}{Z(\beta')}\,
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{q_1(\beta \mid \beta')}{q_1(\beta' \mid \beta)} \]
where q_1(β'|β) is an [arbitrary] proposal density.
[Robert & Casella, 1999/2004]

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling (contd)

When Q_β(y) = βS(y) [cf. Gibbs/Potts distribution],
\[ Z(\beta) = \sum_{y} \exp\{\beta S(y)\} \]
and
\[ \frac{dZ(\beta)}{d\beta} = \sum_{y} S(y) \exp\{\beta S(y)\}
   = Z(\beta) \sum_{y} S(y) \exp\{\beta S(y)\} \big/ Z(\beta)
   = Z(\beta)\, \mathbb{E}_{\beta}[S(y)] . \]

Derivative expressed as an expectation under f(y|β)


Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling identity

Therefore, the ratio Z(β)/Z(β') can be derived from an integral, since
\[ \log \frac{Z(\beta)}{Z(\beta')} = \int_{\beta'}^{\beta} \mathbb{E}_u[S(y)]\, du . \]
[Gelman & Meng, 1998]
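A small numerical check of this identity (an illustration added here, not taken from the talk): on a 2-colour chain of 4 sites, both sides can be computed exactly by enumeration, with the integral approximated by the trapezoidal rule; in realistic models the grid values E_u[S(y)] would instead be Monte Carlo estimates from samplers such as the one on the next slide.

n <- 4                                                   # toy chain of 4 binary sites
states <- as.matrix(expand.grid(rep(list(0:1), n)))
S <- apply(states, 1, function(y) sum(y[-1] == y[-n]))   # agreeing neighbour pairs
Z  <- function(beta) sum(exp(beta * S))                  # exact normalising constant
ES <- function(beta) sum(S * exp(beta * S)) / Z(beta)    # exact E_beta[S(y)]

beta0 <- 0.2; beta1 <- 1.0
grid <- seq(beta0, beta1, length.out = 101)
Eu   <- sapply(grid, ES)                                 # in practice: MCMC estimates
path <- sum(diff(grid) * (head(Eu, -1) + tail(Eu, -1)) / 2)   # trapezoidal quadrature
c(path_sampling = path, exact = log(Z(beta1) / Z(beta0)))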

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Implementation for Potts

Step x: Monte Carlo approximation of E_β[S(x)] derived from an
MCMC sequence of x's for a fixed β

Potts Metropolis-Hastings Sampler
Iteration t (t ≥ 1):
  1. Generate u = (u_i)_{i∈I}, a random permutation of I;
  2. For 1 ≤ l ≤ |I|, generate
     \[ \tilde{x}_{u_l}^{(t)} \sim \mathcal{U}\big(\{1, \ldots, x_{u_l}^{(t-1)} - 1,\, x_{u_l}^{(t-1)} + 1, \ldots, G\}\big) , \]
     compute the n_{u_l,g}^{(t)}'s and
     \[ \rho_l = \exp\big\{\beta\,[\,n^{(t)}_{u_l, \tilde{x}} - n^{(t)}_{u_l, x_{u_l}}\,]\big\} \wedge 1 , \]
     and set x_{u_l}^{(t)} equal to \tilde{x}_{u_l} with probability ρ_l.
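A rough R transcription of one such single-site Metropolis sweep on a small lattice (an illustrative sketch following the same update rule, not the talk's implementation):

potts_sweep <- function(x, beta, G) {
  n <- nrow(x)
  neigh <- function(i, j)                       # values of the 4 lattice neighbours
    c(if (i > 1) x[i - 1, j], if (i < n) x[i + 1, j],
      if (j > 1) x[i, j - 1], if (j < n) x[i, j + 1])
  for (idx in sample(n * n)) {                  # random permutation of the sites
    i <- (idx - 1) %% n + 1
    j <- (idx - 1) %/% n + 1
    cur    <- x[i, j]
    others <- setdiff(1:G, cur)
    prop   <- others[sample.int(length(others), 1)]   # uniform over the other colours
    nb  <- neigh(i, j)
    rho <- min(1, exp(beta * (sum(nb == prop) - sum(nb == cur))))
    if (runif(1) < rho) x[i, j] <- prop
  }
  x
}

# usage: crude Monte Carlo approximation of E_beta[S(x)], S = like-coloured pairs
S_stat <- function(x) sum(x[-1, ] == x[-nrow(x), ]) + sum(x[, -1] == x[, -ncol(x)])
set.seed(1)
x <- matrix(sample(1:3, 100, replace = TRUE), 10, 10)
Svals <- replicate(200, { x <<- potts_sweep(x, beta = 0.8, G = 3); S_stat(x) })
mean(Svals[-(1:50)])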

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Implementation for Potts (2)

Step β: Use (a) importance sampling recycling when changing the
value of β and (b) numerical quadrature for the integral
approximation.
Illustration: Approximation of E_{β,k}[S(y)] for Ripley's
benchmark, for k = 1, 125

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables

Introduce z, an auxiliary/extraneous variable on the same state space
as y, with conditional density g(z|β, y), and consider the [artificial]
joint posterior
\[ \pi(\beta, z \mid y) \propto \pi(\beta, z, y) = g(z \mid \beta, y)\, f(y \mid \beta)\, \pi(\beta) \]

Explanation: Integrating out z gets us back to π(β|y)
[Møller et al., 2006]

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables (contd)

For q_1 an [arbitrary] proposal density on β and
\[ q_2((\beta', z') \mid (\beta, z)) = q_1(\beta' \mid \beta)\, f(z' \mid \beta') , \]
(i.e., simulating z' from the likelihood), the Metropolis-Hastings
ratio associated with q_2 is
\[ \rho_{MH2}((\beta', z') \mid (\beta, z)) =
   \frac{Z(\beta)}{Z(\beta')}\,
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{g(z' \mid \beta', y)}{g(z \mid \beta, y)}\,
   \frac{q_1(\beta \mid \beta')\,\exp\{Q_{\beta}(z)\}}{q_1(\beta' \mid \beta)\,\exp\{Q_{\beta'}(z')\}}\,
   \frac{Z(\beta')}{Z(\beta)} \]
and....

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables (contd)

...Z(β) vanishes:
\[ \rho_{MH2}((\beta', z') \mid (\beta, z)) =
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{q_1(\beta \mid \beta')\,\exp\{Q_{\beta}(z)\}}{q_1(\beta' \mid \beta)\,\exp\{Q_{\beta'}(z')\}}\,
   \frac{g(z' \mid \beta', y)}{g(z \mid \beta, y)} \]

Choice of
\[ g(z \mid \beta, y) = \exp\{Q_{\hat\beta}(z)\} \big/ Z(\hat\beta) \]
where β̂ is the maximum pseudo-likelihood estimate of β.

New problem: Need to simulate from f(y|β)
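The cancellation can be checked numerically. The sketch below (an illustration added here, not from the talk) runs auxiliary-variable updates of β on the same tiny enumerable 2-colour chain used above, so that z' can be drawn exactly from f(·|β'); beta_hat is a stand-in for the pseudo-likelihood estimate and a flat prior on (0, 2) is assumed, so the prior and proposal ratios drop out.

n <- 4
states <- as.matrix(expand.grid(rep(list(0:1), n)))
S <- apply(states, 1, function(y) sum(y[-1] == y[-n]))   # sufficient statistic S(.)
rS <- function(beta) S[sample(nrow(states), 1, prob = exp(beta * S))]  # S(z), z ~ f(.|beta)
beta_hat <- 0.5                                          # stand-in pseudo-likelihood estimate

aux_step <- function(beta, S_z, S_y, sd_prop = 0.3) {
  beta_p <- rnorm(1, beta, sd_prop)                      # symmetric random-walk proposal q1
  if (beta_p <= 0 || beta_p >= 2) return(c(beta, S_z))   # flat prior support (0, 2)
  S_zp <- rS(beta_p)                                     # z' ~ f(.|beta'), exact here
  log_rho <- (beta_p - beta) * S_y +                     # likelihood ratio, Z(beta)'s cancelled
             beta * S_z - beta_p * S_zp +                # f(z|beta)/f(z'|beta')
             beta_hat * (S_zp - S_z)                     # g(z'|.)/g(z|.), with g = f(.|beta_hat)
  if (log(runif(1)) < log_rho) c(beta_p, S_zp) else c(beta, S_z)
}

set.seed(2)
state <- c(0.5, rS(0.5))
draws <- replicate(2000, { state <<- aux_step(state[1], state[2], S_y = 2); state[1] })
mean(draws)                                              # posterior mean of beta for S(y) = 2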

Bayesian k-nearest-neighbour classification


Perfect sampling

Perfect sampling

Coupling From The Past:


algorithm that allows for exact and iid sampling from a given
distribution while using basic steps from an MCMC algorithm
Underlying concept: run coupled Markov chains that start from all
possible states in the state space. Once all chains have
met/coalesced, they stick to the same path; the effect of the initial
state has vanished.
[Propp & Wilson, 1996]
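As a concrete (and deliberately simplified) illustration of coupling from the past, the sketch below implements monotone CFTP for a 2-colour chain of length n with heat-bath updates: the all-0 and all-1 states sandwich every other state, so coalescence of these two bounding chains at time 0 yields an exact draw. This is illustrative R code written for this note, not the samplers used in the talk, and it assumes β ≥ 0 so that monotonicity holds.

cftp_ising <- function(n = 10, beta = 0.6) {
  update <- function(x, site, u) {             # heat-bath update driven by (site, u)
    nb <- c(if (site > 1) x[site - 1], if (site < n) x[site + 1])
    p1 <- exp(beta * sum(nb == 1)); p0 <- exp(beta * sum(nb == 0))
    x[site] <- as.integer(u < p1 / (p0 + p1))
    x
  }
  Tmax <- 1
  sites <- integer(0); us <- numeric(0)        # randomness indexed from time -Tmax to -1
  repeat {
    new   <- Tmax - length(sites)              # extend the randomness further into the past
    sites <- c(sample(n, new, replace = TRUE), sites)
    us    <- c(runif(new), us)
    lo <- rep(0L, n); hi <- rep(1L, n)         # the two saturated starting states at time -Tmax
    for (t in seq_len(Tmax)) {
      lo <- update(lo, sites[t], us[t])
      hi <- update(hi, sites[t], us[t])
    }
    if (all(lo == hi)) return(lo)              # coalescence at time 0: exact sample
    Tmax <- 2 * Tmax                           # otherwise go further back, reusing the randomness
  }
}

set.seed(3)
cftp_ising()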

Bayesian k-nearest-neighbour classification


Perfect sampling

Implementation for Potts (3)

In the case of a two-colour Ising model, existence of a perfect
sampler by virtue of monotonicity properties:

Ising Metropolis-Hastings Perfect Sampler
For T large enough,
  1. Start two chains x^{0,t} and x^{1,t} from the saturated states
  2. For t = -T, ..., -1, couple both chains:
     if missing, generate the basic uniforms u^{(t)};
     use u^{(t)} to update both x^{0,t} into x^{0,t+1} and x^{1,t} into x^{1,t+1}
  3. Check coalescence at time 0: if x^{0,0}_i = x^{1,0}_i for all i, stop;
     else increase T and recycle the younger u^{(t)}'s

Limitation: Slows down more and more as β increases

Bayesian k-nearest-neighbour classification


k-nearest-neighbours

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
   KNNs as a clustering rule
   KNNs as a probabilistic model
   Bayesian inference on KNNs
   MCMC implementation
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

KNNs as a clustering rule

The k-nearest-neighbour procedure is a supervised clustering


method that allocates [new] subjects to one of G categories based
on the most frequent class [within a learning sample] in their
neighbourhood.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Supervised classification
Infer from a partitioned dataset the classes of a new dataset.
Data: training dataset
    (y_i^tr, x_i^tr), i = 1, ..., n,
with class label 1 ≤ y_i^tr ≤ Q and predictor covariates x_i^tr,
and testing dataset
    (y_i^te, x_i^te), i = 1, ..., m,
with unknown y_i^te's.
[Figure: training sample plotted in the (x_1, x_2) plane]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Classification

Principle
Neighbourhood based on the Euclidean metric.
Prediction for a new point (y_j^te, x_j^te) (j = 1, ..., m): the
most common class amongst the k-nearest-neighbours of x_j^te in
the training set.
[Figure: animation of the k-nearest-neighbour allocation of test
points in the (x_1, x_2) plane]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Standard procedure

Example : help(knn)
data(iris3)
train=rbind(iris3[1:25,,1],iris3[1:25,,2],iris3[1:25,,3])
test=rbind(iris3[26:50,,1],iris3[26:50,,2],iris3[26:50,,3])
cl=factor(c(rep("s",25),rep("c",25),rep("v",25)))
library(class)
knn(train,test,cl,k=3,prob=TRUE)
attributes(.Last.value)

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Model choice perspective

Choice of k?
Usually chosen by minimising the cross-validated misclassification rate
(non-parametric or even non-probabilistic!)
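For reference, one way to carry out this cross-validation in R (a sketch reusing the iris3 training data of the earlier slide; knn.cv from the class package performs leave-one-out prediction):

library(class)
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
# leave-one-out misclassification rate for each candidate k
cv_error <- sapply(1:25, function(k)
  mean(as.character(knn.cv(train, cl, k = k)) != as.character(cl)))
which.min(cv_error)           # value of k minimising the cross-validated error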

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Influence of k

Dataset of Ripley (1994), with two classes where each population of
x_i's is from a mixture of two bivariate normal distributions.
Training set of n = 250 points and testing set of m = 1,000 points.
[Figure: k-nearest-neighbour classification boundaries for
k = 1, 11, 57, 137]
Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Influence of k (contd)
k-nearest-neighbour leave-one-out cross-validation:
Solutions 17 18 35 36 45 46 51 52 53 54 (29)

  Procedure   Misclassification error rate
  1-nn        0.150 (150)
  3-nn        0.134 (134)
  15-nn       0.095 (095)
  17-nn       0.087 (087)
  54-nn       0.081 (081)

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

KNNs as a probabilistic model

k-nearest-neighbour model
Based on full conditional distributions (1 ≤ C ≤ Q)
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \sim_k i} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} , \qquad \beta > 0 , \]
where l ∼_k i is the k-nearest-neighbour relation.
[Holmes & Adams, 2002]
This can also be seen as a Potts model.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Motivations

β does not exist in the original k-nn procedure.
It is only relevant from a statistical point of view, as a measure of
uncertainty about the model:
  β = 0 corresponds to a uniform distribution on all classes;
  β = +∞ leads to a point mass distribution on the prevalent class.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

MRF-like expression

Closed-form expression for the full conditionals
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) =
   \exp\{\beta\, n_C(i)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(i)/k\} \]
where n_C(i) is the number of neighbours of i with class label C.
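In code, this full conditional is just a softmax of the scaled neighbour counts. A possible R sketch (illustrative helper; nb is assumed to be a list holding the neighbour indices of each training point):

knn_full_conditional <- function(i, y, nb, beta, k, Q = max(y)) {
  counts <- tabulate(y[nb[[i]]], nbins = Q)      # n_C(i) for C = 1, ..., Q
  w <- exp(beta * counts / k)
  w / sum(w)                                     # P(y_i = C | ...) for each class C
}

# toy usage with 3 classes and hand-made neighbour sets
y  <- c(1, 1, 2, 3, 2, 1)
nb <- list(c(2, 3), c(1, 3), c(1, 2, 4), c(3, 5), c(4, 6), c(5))
knn_full_conditional(3, y, nb, beta = 2, k = 3)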

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Drawback

Because the neighbourhood structure is not symmetric (x_i may be
one of the k nearest neighbours of x_j while x_j is not one of the k
nearest neighbours of x_i), there usually is no joint probability
distribution corresponding to these full conditionals!

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Drawback (2)
Note: Holmes & Adams (2002) solve this problem by directly
defining the joint as the pseudo-likelihood
\[ f(y^{tr} \mid x^{tr}, \beta, k) \propto \prod_{i=1}^{n}
   \exp\{\beta\, n_{y_i}(i)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(i)/k\} \]
[with a missing constant Z(β)]

... but they are still using the same [wrong] predictive
\[ P(y_j^{te} = C \mid y^{tr}, x^{tr}, x_j^{te}, \beta, k) =
   \exp\{\beta\, n_C(j)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(j)/k\} \]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Resolution
Symmetrise the neighbourhood relation:
Principle: if x_i^tr belongs to the k-nearest-neighbour set for x_j^tr
and x_j^tr does not belong to the k-nearest-neighbour set for x_i^tr,
then x_j^tr is added to the set of neighbours of x_i^tr.
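A possible implementation of this symmetrisation (an illustrative helper written for this note, not the authors' code): build the usual Euclidean k-nearest-neighbour sets, then add the missing reciprocal links.

symmetric_knn <- function(X, k) {
  D <- as.matrix(dist(X))                        # pairwise Euclidean distances
  diag(D) <- Inf
  nbmat <- apply(D, 1, function(d) order(d)[1:k])     # k x n: column i = knn of point i
  nb <- lapply(seq_len(ncol(nbmat)), function(i) nbmat[, i])
  for (i in seq_along(nb))                       # enforce symmetry: if i lists j but
    for (j in nb[[i]])                           # j does not list i, add i to j's set
      if (!(i %in% nb[[j]])) nb[[j]] <- c(nb[[j]], i)
  nb                                             # (possibly enlarged) neighbour sets
}

set.seed(4)
X <- matrix(rnorm(40), ncol = 2)                 # 20 points in the plane
nb <- symmetric_knn(X, k = 3)
range(lengths(nb))                               # some sets now contain more than k points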

Bayesian k-nearest-neighbour classification




k-nearest-neighbours
KNNs as a probabilistic model

Consequence
Given the full conditionals
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \,\#\, i} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} \]
where l # i is the symmetrised k-nearest-neighbour relation,
there exists a corresponding joint distribution.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Extension to unclassified points

Predictive distribution of y_j^te (j = 1, ..., m) defined as
\[ P(y_j^{te} = C \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \,\#\, j} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} \]
where l # j is the symmetrised k-nearest-neighbour relation wrt the
training set {x_1^tr, ..., x_n^tr}.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
Bayesian inference on KNNs

Bayesian modelling
Within the Bayesian paradigm, assign a prior π(β, k) like
\[ \pi(\beta, k) \propto \mathbb{I}(1 \le k \le k_{max})\, \mathbb{I}(0 \le \beta \le \beta_{max}) \]
because there is a maximum value (e.g., β_max = 15) after which
the distribution is Dirac [as in the Potts model] and because it can be
argued that k_max = n/2.

Note
β is dimension-less because of the use of the frequencies n_C(i)/k as
covariates.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
Bayesian inference on KNNs

Bayesian global inference

Use the marginal predictive distribution of y_j^te given x_j^te (j = 1, ..., m),
\[ \int P(y_j^{te} = C \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k)\,
   \pi(\beta, k \mid y^{tr}, x^{tr})\, d\beta\, dk , \]
where
\[ \pi(\beta, k \mid y^{tr}, x^{tr}) \propto f(y^{tr} \mid x^{tr}, \beta, k)\, \pi(\beta, k) \]
is the posterior distribution of (β, k) given the training dataset y^tr
[ŷ_j^te = MAP estimate]

Note
Model choice with no varying dimension because β is the same for
all models.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

MCMC implementation
A Markov chain Monte Carlo (MCMC) approximation of
f(y_{n+1} | x_{n+1}, y, X)
is provided by
\[ M^{-1} \sum_{i=1}^{M} f\big(y_{n+1} \mid x_{n+1}, y, X, (\beta, k)^{(i)}\big) , \]
where {(β, k)^{(1)}, ..., (β, k)^{(M)}} is an MCMC output associated with the
stationary distribution π(β, k | y, X).
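Schematically, the Monte Carlo average looks as follows in R (a sketch: the (β, k) draws below are placeholders standing in for genuine MCMC output, and the predictive uses plain, unsymmetrised neighbour counts for brevity):

pred_prob <- function(xnew, X, y, beta, k, Q = max(y)) {
  d   <- sqrt(colSums((t(X) - xnew)^2))          # distances to the training points
  nbh <- order(d)[1:k]                           # k nearest training neighbours
  w   <- exp(beta * tabulate(y[nbh], nbins = Q) / k)
  w / sum(w)                                     # class probabilities under (beta, k)
}

set.seed(5)
X <- matrix(rnorm(100), ncol = 2); y <- sample(1:2, 50, replace = TRUE)
beta_draws <- runif(200, 1, 3)                   # placeholders for MCMC output
k_draws    <- sample(3:15, 200, replace = TRUE)

probs <- rowMeans(mapply(function(b, k) pred_prob(c(0, 0), X, y, b, k),
                         beta_draws, k_draws))   # averaged predictive at x = (0, 0)
which.max(probs)                                 # MAP class prediction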

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Auxiliary variable version

Random walk Metropolis-Hastings algorithm on both β and k.
Since β ∈ (0, β_max), use a logistic reparameterisation of β,
\[ \beta = \beta_{max}\, e^{\psi} \big/ (1 + e^{\psi}) , \]
and the random walk N(ψ^{(t)}, σ²) is on ψ.
For k, uniform proposal on the 2r neighbours of k^{(t)},
\[ \{k^{(t)} - r, \ldots, k^{(t)} - 1,\; k^{(t)} + 1, \ldots, k^{(t)} + r\} \cap \{1, \ldots, K\} . \]

Simulation of f(z^tr | x^tr, β, k) by perfect sampling, taking advantage
of monotonicity properties [but this may get stuck for too large values
of β].
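The two proposals can be sketched as below (illustrative code; the reparameterisation Jacobian and the asymmetry of the k-move near the boundaries must of course be accounted for in the acceptance ratio):

propose <- function(beta, k, beta_max = 15, sigma = 0.5, r = 3, K = 125) {
  psi      <- log(beta / (beta_max - beta))      # current psi: logit of beta/beta_max
  psi_new  <- rnorm(1, psi, sigma)               # Gaussian random walk on psi
  beta_new <- beta_max * exp(psi_new) / (1 + exp(psi_new))
  cand  <- setdiff(intersect((k - r):(k + r), 1:K), k)
  k_new <- cand[sample.int(length(cand), 1)]     # uniform over the allowed neighbours of k
  c(beta = beta_new, k = k_new)
}

set.seed(6)
propose(beta = 2, k = 40)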

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Choice of (β̂, k̂) paramount:
Illustration on Ripley's dataset: (k̂, β̂) = (53, 2.28) versus
(k̂, β̂) = (13, 1.45)

Bayesian k-nearest-neighbour classification




k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women


Example (R benchmark)
A population of women who were at least 21 years old, of Pima Indian
heritage and living near Phoenix (AZ), was tested for diabetes according
to WHO criteria. The data were collected by the US National Institute of
Diabetes and Digestive and Kidney Diseases. We used the 532 complete
records after dropping the (mainly missing) data on serum insulin.
number of pregnancies
plasma glucose concentration in an oral glucose tolerance test
diastolic blood pressure
triceps skin fold thickness
body mass index
diabetes pedigree function
age

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women

MCMC output for β_max = 1.5, β̂ = 1.15, k̂ = 40, and 20,000
simulations.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women

Example (Error rate & k selection)

  k     Misclassification error rate
  1     0.316
  3     0.229
  15    0.226
  31    0.211
  57    0.205
  66    0.208

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Predictive output

The approximate Bayesian prediction of y_{n+1} is
\[ \hat{y}_{n+1} = \arg\max_{g}\; M^{-1} \sum_{i=1}^{M}
   f\big(g \mid x_{n+1}, y, X, \beta^{(i)}, k^{(i)}\big) . \]
E.g., Ripley's dataset misclassification error rate: 0.082.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

A reassessment of pseudo-likelihood

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Pseudo-likelihood
Pseudo-likelihood leads to (almost) straightforward MCMC
implementation

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Magnitude of the approximation

Since perfect and path sampling approaches are also available for
small datasets, the pseudo-likelihood approximation can be
evaluated against them.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Ripley's benchmark (1)

Approximations to the posterior of β based on the pseudo (green),
the path (red) and the perfect (yellow) schemes with
k = 1, 10, 70, 125, for 20,000 iterations.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Ripley's benchmark (2)

Approximations of the posteriors of β (top) and k (bottom).

Bayesian k-nearest-neighbour classification


Variable selection

Variable selection

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Variable selection

Goal: Selection of the components of the predictor vector that best
contribute to the classification.
  Parsimony (dimension of the predictor may be larger than the
  training sample size n)
  Efficiency (more components blur class differences)
[Figure: classification of Ripley's dataset for different subsets of
covariates: gamma=(1,1), err=78; gamma=(1,0), err=284;
gamma=(0,1), err=116; gamma=(1,1,1), err=159]

Bayesian k-nearest-neighbour classification


Variable selection

Component indicators
Completion of (β, k) with indicator variables γ_j ∈ {0, 1}
(1 ≤ j ≤ p) that determine which components of x are active in
the model:
\[ P(y_i = C_j \mid y_{-i}, X, \beta, k, \gamma) \propto
   \exp\Big\{ \beta \sum_{l \in v_k^{\gamma}(i)} \mathbb{I}_{C_j}(y_l) \big/ k \Big\} \]
with v_k^γ(i) the (symmetrised) k nearest neighbourhood of x_i for the
distance
\[ d_\gamma(x_i, x_{i'})^2 = \sum_{j=1}^{p} \gamma_j\, (x_{ij} - x_{i'j})^2 \]
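In R, the γ-weighted distance and the resulting neighbourhood could be sketched as follows (illustrative helper and simulated data, keeping only the components with γ_j = 1):

knn_gamma <- function(i, X, gamma, k) {
  Xg <- X[, gamma == 1, drop = FALSE]            # active components only
  d  <- sqrt(colSums((t(Xg) - Xg[i, ])^2))       # gamma-weighted Euclidean distances
  d[i] <- Inf
  order(d)[1:k]                                  # indices of the k nearest neighbours
}

set.seed(7)
X <- cbind(matrix(rnorm(60), ncol = 2),               # 2 informative covariates
           matrix(rnorm(150, sd = 0.05), ncol = 5))   # 5 noise covariates
knn_gamma(1, X, gamma = c(1, 1, 0, 0, 0, 0, 0), k = 3)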

Bayesian k-nearest-neighbour classification


Variable selection

Variable selection

Formal similarity with the usual variable selection in regression models.
Use of a uniform prior on the γ_j's on {0, 1}, independently for all j's.
Exploration of a range of 2^p models, which may be too large
(see, e.g., the vision dataset with p = 200).

Bayesian k-nearest-neighbour classification


Variable selection

Implementation
Use of a naive reversible jump MCMC, where
  1. (β, k) are changed conditional on γ, and
  2. γ is changed one component at a time conditional on (β, k)
     [and the data]

Note
Validation of simple jumps due to (a) saturation of the dimension
by associating a γ_j to each variable and (b) the hierarchical structure
of the (β, k) part.
This is not a varying dimension problem.

Bayesian k-nearest-neighbour classification


Variable selection

MCMC algorithm
Variable selection k-nearest-neighbours

At time 0, generate γ_j^{(0)} ∼ B(1/2), log β^{(0)} ∼ N(0, σ²) and
k^{(0)} ∼ U{1, ..., K}.
At time 1 ≤ t ≤ T,
  1. Generate log β̃ ∼ N(log β^{(t-1)}, σ²) and
     k̃ ∼ U({k - r, k - r + 1, ..., k + r - 1, k + r})
  2. Calculate the Metropolis-Hastings acceptance probability
     ρ((β̃, k̃), (β^{(t-1)}, k^{(t-1)}))
  3. Move to (β^{(t)}, k^{(t)}) by a Metropolis-Hastings step
  4. For j = 1, ..., p, generate γ_j^{(t)} ∼ π(γ_j | y, X, γ_{-j}^{(t)}, β^{(t)}, k^{(t)})

Bayesian k-nearest-neighbour classification


Variable selection

Benchmark 1
Ripley's dataset with 8 additional potential [useless] covariates
simulated from N(0, 0.05²).
Using the 250 datapoints for variable selection, comparison of the
2^10 = 1024 models by pseudo-maximum likelihood estimation of
(k, β) and by comparison of pseudo-likelihoods leads to selecting the
proper submodel
    γ_1 = γ_2 = 1 and γ_3 = ... = γ_10 = 0
with k = 3.1 and β = 3.8. Forward and backward selection
procedures lead to the same conclusion.
MCMC algorithm produces γ_1 = γ_2 = 1 and γ_3 = ... = γ_10 = 0 as
the MMAP, with very similar values for k and β [hardly any move
away from γ = (1, 1, 0, ..., 0) is accepted].

Bayesian k-nearest-neighbour classification


Variable selection

Benchmark 2
Ripley's dataset with now 28 additional covariates simulated from
N(0, 0.05²).
Using the 250 datapoints for variable selection, direct comparison
of the 2^30 models by pseudo-maximum likelihood estimation is
impossible!
Forward and backward selection procedures both lead to the proper
submodel γ = (1, 1, 0, ..., 0).
MCMC algorithm again produces γ_1 = γ_2 = 1 and
γ_3 = ... = γ_30 = 0 as the MMAP, with more moves around
γ = (1, 1, 0, ..., 0).
