
Bayesian k-nearest-neighbour classification

Christian P. Robert
Université Paris Dauphine & CREST, INSEE
http://www.ceremade.dauphine.fr/xian

Joint work with G. Celeux, J.M. Marin, & D.M. Titterington

Bayesian k-nearest-neighbour classification

Outline

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


MRFs

Markov random fields: natural spatial generalisation of Markov chains.
They can be derived from graph structures, when ignoring time
directionality/causality.
E.g., a Markov chain is also a chain graph of random variables,
where each variable in the graph has the property that it is
independent of all (both past and future) others given its two
nearest neighbours.

Bayesian k-nearest-neighbour classification


MRFs

MRFs (contd)

Definition (MRF)
A general Markov random field is the extension of the above to
any graph structure on the random variables, i.e., a collection of
rvs such that each one is independent of all the others given its
immediate neighbours in the corresponding graph.
[Cressie, 1993]

Bayesian k-nearest-neighbour classification


MRFs

A formal definition
Take y_1, ..., y_n, rvs with values in a finite set S, and let
G = (N, E) be a finite graph with N = {1, ..., n} the collection of
nodes and E the collection of edges, made of pairs from N.
For A ⊂ N, ∂A denotes the set of neighbours of A, i.e. the
collection of all points in N\A that have a neighbour in A.

Definition (MRF)
y = (y_1, ..., y_n) is a Markov random field associated with the
graph G if its full conditionals satisfy
\[ f(y_i \mid y_{-i}) = f(y_i \mid y_{\partial i}) . \]
Cliques are sets of points that are all neighbours of one another.

Bayesian k-nearest-neighbour classification


MRFs

Gibbs distributions

Special case of MRF:
y = (y_1, ..., y_n) is a Gibbs random field associated with the
graph G if
\[ f(y) = \frac{1}{Z} \exp\Big\{ \sum_{c \in C} V_c(y_c) \Big\} , \]
where Z is the normalising constant, C is the set of cliques and V_c
is any function, also called potential (and U(y) = \sum_{c \in C} V_c(y_c) is
the energy function).

Bayesian k-nearest-neighbour classification


MRFs

Statistical perspective
Introduce a parameter β in the Gibbs distribution:
\[ f(y \mid \beta) = \frac{\exp\{Q_\beta(y)\}}{Z(\beta)} \]
and estimate β from observed data y.

Bayesian approach:
put a prior distribution π(β) on β and use the posterior distribution
\[ \pi(\beta \mid y) \propto f(y \mid \beta)\,\pi(\beta) = \frac{\exp\{Q_\beta(y)\}}{Z(\beta)}\,\pi(\beta) \]

Bayesian k-nearest-neighbour classification


MRFs

Potts model

Example (Boltzmann dependence)
Case when Q_β(y) is of the form
\[ Q_\beta(y) = \beta S(y) = \beta \sum_{l \sim i} \mathbb{I}_{y_l = y_i} \]
with the sum taken over neighbouring pairs l ∼ i.
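As a purely illustrative complement (not part of the original slides), the Gibbs/Potts distribution can be evaluated by brute force on a toy grid, which also makes clear why Z(β), a sum over G^n configurations, is intractable in realistic settings; the grid size and β value below are arbitrary choices.

# Potts probabilities on a 2 x 2 grid with G = 2 colours, by enumeration of
# all G^4 = 16 configurations (exactly what becomes infeasible for large grids)
G <- 2; beta <- 0.8
configs <- as.matrix(expand.grid(rep(list(1:G), 4)))   # each row: (y11, y21, y12, y22)
S <- apply(configs, 1, function(y) {
  m <- matrix(y, 2, 2)
  sum(m[1, ] == m[2, ]) + sum(m[, 1] == m[, 2])        # agreeing neighbour pairs
})
Z <- sum(exp(beta * S))                                # normalising constant Z(beta)
f <- exp(beta * S) / Z                                 # f(y | beta) for every configuration
f[which.max(S)]                                        # a monochrome grid is the most likely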

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields

1. MRFs
2. Bayesian inference in Gibbs random fields
   Pseudo-posterior inference
   Path sampling
   Auxiliary variables
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields

Use the posterior π(β|y) to draw inference.
Problem:
\[ Z(\beta) = \sum_{y} \exp\{Q_\beta(y)\} \]
is not available analytically & exact computation is not feasible.

Solutions:
  Pseudo-posterior inference
  Path sampling approximations
  Auxiliary variable method

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Pseudo-posterior inference

Pseudo-posterior inference
Oldest solution: replace the likelihood with the pseudo-likelihood
\[ \text{pseudo-like}(y \mid \beta) = \prod_{i=1}^{n} f(y_i \mid y_{-i}, \beta) . \]
Then define the pseudo-posterior
\[ \text{pseudo-post}(\beta \mid y) \propto \prod_{i=1}^{n} f(y_i \mid y_{-i}, \beta)\,\pi(\beta) \]
and resort to MCMC methods to derive a sample from the pseudo-posterior.
[Besag, 1974-75]
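For concreteness, here is a minimal R sketch (written for this note, not taken from the talk) of the log pseudo-likelihood of a Potts-type field, assuming the neighbourhood structure is given as a list nb of neighbour index vectors and that there are G colours; its maximiser is the pseudo-maximum-likelihood estimate of β used later on.

# log pseudo-likelihood: sum over sites of log f(y_i | y_neighbours, beta),
# with f(y_i = g | ...) proportional to exp(beta * #{neighbours of i in class g})
log_pseudo_lik <- function(beta, y, nb, G = max(y)) {
  sum(sapply(seq_along(y), function(i) {
    counts <- tabulate(y[nb[[i]]], nbins = G)          # neighbours of i in each class
    beta * counts[y[i]] - log(sum(exp(beta * counts)))
  }))
}

# toy usage: 6 sites on a line, neighbours = adjacent sites
y  <- c(1, 1, 2, 2, 2, 1)
nb <- list(2, c(1, 3), c(2, 4), c(3, 5), c(4, 6), 5)
optimize(log_pseudo_lik, c(0, 10), y = y, nb = nb, maximum = TRUE)$maximum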

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling

Generate a sample from [the true] π(β|y) by a Metropolis-Hastings
algorithm, with acceptance probability
\[ \rho_{MH1}(\beta' \mid \beta) = 1 \wedge
   \frac{Z(\beta)}{Z(\beta')}\,
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{q_1(\beta \mid \beta')}{q_1(\beta' \mid \beta)} \]
where q_1(β'|β) is an [arbitrary] proposal density.
[Robert & Casella, 1999/2004]

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling (contd)

When Q_β(y) = βS(y) [cf. Gibbs/Potts distribution],
\[ Z(\beta) = \sum_{y} \exp\{\beta S(y)\} \]
and
\[ \frac{dZ(\beta)}{d\beta} = \sum_{y} S(y) \exp\{\beta S(y)\}
   = Z(\beta) \sum_{y} S(y) \exp\{\beta S(y)\} \big/ Z(\beta)
   = Z(\beta)\, \mathbb{E}_{\beta}[S(y)] . \]

Derivative expressed as an expectation under f(y|β)


Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Path sampling identity

Therefore, the ratio Z(β)/Z(β') can be derived from an integral, since
\[ \log \frac{Z(\beta)}{Z(\beta')} = \int_{\beta'}^{\beta} \mathbb{E}_u[S(y)]\, du . \]
[Gelman & Meng, 1998]
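A small numerical check of this identity (an illustration added here, not taken from the talk): on a 2-colour chain of 4 sites, both sides can be computed exactly by enumeration, with the integral approximated by the trapezoidal rule; in realistic models the grid values E_u[S(y)] would instead be Monte Carlo estimates from samplers such as the one on the next slide.

n <- 4                                                   # toy chain of 4 binary sites
states <- as.matrix(expand.grid(rep(list(0:1), n)))
S <- apply(states, 1, function(y) sum(y[-1] == y[-n]))   # agreeing neighbour pairs
Z  <- function(beta) sum(exp(beta * S))                  # exact normalising constant
ES <- function(beta) sum(S * exp(beta * S)) / Z(beta)    # exact E_beta[S(y)]

beta0 <- 0.2; beta1 <- 1.0
grid <- seq(beta0, beta1, length.out = 101)
Eu   <- sapply(grid, ES)                                 # in practice: MCMC estimates
path <- sum(diff(grid) * (head(Eu, -1) + tail(Eu, -1)) / 2)   # trapezoidal quadrature
c(path_sampling = path, exact = log(Z(beta1) / Z(beta0)))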

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Implementation for Potts

Step x: Monte Carlo approximation of E_β[S(x)] derived from an
MCMC sequence of x's for a fixed β

Potts Metropolis-Hastings Sampler
Iteration t (t ≥ 1):
  1. Generate u = (u_i)_{i∈I}, a random permutation of I;
  2. For 1 ≤ l ≤ |I|, generate
     \[ \tilde{x}_{u_l}^{(t)} \sim \mathcal{U}\big(\{1, \ldots, x_{u_l}^{(t-1)} - 1,\, x_{u_l}^{(t-1)} + 1, \ldots, G\}\big) , \]
     compute the n_{u_l,g}^{(t)}'s and
     \[ \rho_l = \exp\big\{\beta\,[\,n^{(t)}_{u_l, \tilde{x}} - n^{(t)}_{u_l, x_{u_l}}\,]\big\} \wedge 1 , \]
     and set x_{u_l}^{(t)} equal to \tilde{x}_{u_l} with probability ρ_l.
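A rough R transcription of one such single-site Metropolis sweep on a small lattice (an illustrative sketch following the same update rule, not the talk's implementation):

potts_sweep <- function(x, beta, G) {
  n <- nrow(x)
  neigh <- function(i, j)                       # values of the 4 lattice neighbours
    c(if (i > 1) x[i - 1, j], if (i < n) x[i + 1, j],
      if (j > 1) x[i, j - 1], if (j < n) x[i, j + 1])
  for (idx in sample(n * n)) {                  # random permutation of the sites
    i <- (idx - 1) %% n + 1
    j <- (idx - 1) %/% n + 1
    cur    <- x[i, j]
    others <- setdiff(1:G, cur)
    prop   <- others[sample.int(length(others), 1)]   # uniform over the other colours
    nb  <- neigh(i, j)
    rho <- min(1, exp(beta * (sum(nb == prop) - sum(nb == cur))))
    if (runif(1) < rho) x[i, j] <- prop
  }
  x
}

# usage: crude Monte Carlo approximation of E_beta[S(x)], S = like-coloured pairs
S_stat <- function(x) sum(x[-1, ] == x[-nrow(x), ]) + sum(x[, -1] == x[, -ncol(x)])
set.seed(1)
x <- matrix(sample(1:3, 100, replace = TRUE), 10, 10)
Svals <- replicate(200, { x <<- potts_sweep(x, beta = 0.8, G = 3); S_stat(x) })
mean(Svals[-(1:50)])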

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Path sampling

Implementation for Potts (2)

Step β: Use (a) importance sampling recycling when changing the
value of β and (b) numerical quadrature for the integral
approximation.
Illustration: Approximation of E_{β,k}[S(y)] for Ripley's
benchmark, for k = 1, 125

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables

Introduce z, an auxiliary/extraneous variable on the same state space
as y, with conditional density g(z|β, y), and consider the [artificial]
joint posterior
\[ \pi(\beta, z \mid y) \propto \pi(\beta, z, y) = g(z \mid \beta, y)\, f(y \mid \beta)\, \pi(\beta) \]

Explanation: Integrating out z gets us back to π(β|y)
[Møller et al., 2006]

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables (contd)

For q_1 an [arbitrary] proposal density on β and
\[ q_2((\beta', z') \mid (\beta, z)) = q_1(\beta' \mid \beta)\, f(z' \mid \beta') , \]
(i.e., simulating z' from the likelihood), the Metropolis-Hastings
ratio associated with q_2 is
\[ \rho_{MH2}((\beta', z') \mid (\beta, z)) =
   \frac{Z(\beta)}{Z(\beta')}\,
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{g(z' \mid \beta', y)}{g(z \mid \beta, y)}\,
   \frac{q_1(\beta \mid \beta')\,\exp\{Q_{\beta}(z)\}}{q_1(\beta' \mid \beta)\,\exp\{Q_{\beta'}(z')\}}\,
   \frac{Z(\beta')}{Z(\beta)} \]
and....

Bayesian k-nearest-neighbour classification


Bayesian inference in Gibbs random fields
Auxiliary variables

Auxiliary variables (contd)

...Z(β) vanishes:
\[ \rho_{MH2}((\beta', z') \mid (\beta, z)) =
   \frac{\exp\{Q_{\beta'}(y)\}\,\pi(\beta')}{\exp\{Q_{\beta}(y)\}\,\pi(\beta)}\,
   \frac{q_1(\beta \mid \beta')\,\exp\{Q_{\beta}(z)\}}{q_1(\beta' \mid \beta)\,\exp\{Q_{\beta'}(z')\}}\,
   \frac{g(z' \mid \beta', y)}{g(z \mid \beta, y)} \]

Choice of
\[ g(z \mid \beta, y) = \exp\{Q_{\hat\beta}(z)\} \big/ Z(\hat\beta) \]
where β̂ is the maximum pseudo-likelihood estimate of β.

New problem: Need to simulate from f(y|β)
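The cancellation can be checked numerically. The sketch below (an illustration added here, not from the talk) runs auxiliary-variable updates of β on the same tiny enumerable 2-colour chain used above, so that z' can be drawn exactly from f(·|β'); beta_hat is a stand-in for the pseudo-likelihood estimate and a flat prior on (0, 2) is assumed, so the prior and proposal ratios drop out.

n <- 4
states <- as.matrix(expand.grid(rep(list(0:1), n)))
S <- apply(states, 1, function(y) sum(y[-1] == y[-n]))   # sufficient statistic S(.)
rS <- function(beta) S[sample(nrow(states), 1, prob = exp(beta * S))]  # S(z), z ~ f(.|beta)
beta_hat <- 0.5                                          # stand-in pseudo-likelihood estimate

aux_step <- function(beta, S_z, S_y, sd_prop = 0.3) {
  beta_p <- rnorm(1, beta, sd_prop)                      # symmetric random-walk proposal q1
  if (beta_p <= 0 || beta_p >= 2) return(c(beta, S_z))   # flat prior support (0, 2)
  S_zp <- rS(beta_p)                                     # z' ~ f(.|beta'), exact here
  log_rho <- (beta_p - beta) * S_y +                     # likelihood ratio, Z(beta)'s cancelled
             beta * S_z - beta_p * S_zp +                # f(z|beta)/f(z'|beta')
             beta_hat * (S_zp - S_z)                     # g(z'|.)/g(z|.), with g = f(.|beta_hat)
  if (log(runif(1)) < log_rho) c(beta_p, S_zp) else c(beta, S_z)
}

set.seed(2)
state <- c(0.5, rS(0.5))
draws <- replicate(2000, { state <<- aux_step(state[1], state[2], S_y = 2); state[1] })
mean(draws)                                              # posterior mean of beta for S(y) = 2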

Bayesian k-nearest-neighbour classification


Perfect sampling

Perfect sampling

Coupling From The Past:


algorithm that allows for exact and iid sampling from a given
distribution while using basic steps from an MCMC algorithm
Underlying concept: run coupled Markov chains that start from all
possible states in the state space. Once all chains have
met/coalesced, they stick to the same path; the effect of the initial
state has vanished.
[Propp & Wilson, 1996]
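As a concrete (and deliberately simplified) illustration of coupling from the past, the sketch below implements monotone CFTP for a 2-colour chain of length n with heat-bath updates: the all-0 and all-1 states sandwich every other state, so coalescence of these two bounding chains at time 0 yields an exact draw. This is illustrative R code written for this note, not the samplers used in the talk, and it assumes β ≥ 0 so that monotonicity holds.

cftp_ising <- function(n = 10, beta = 0.6) {
  update <- function(x, site, u) {             # heat-bath update driven by (site, u)
    nb <- c(if (site > 1) x[site - 1], if (site < n) x[site + 1])
    p1 <- exp(beta * sum(nb == 1)); p0 <- exp(beta * sum(nb == 0))
    x[site] <- as.integer(u < p1 / (p0 + p1))
    x
  }
  Tmax <- 1
  sites <- integer(0); us <- numeric(0)        # randomness indexed from time -Tmax to -1
  repeat {
    new   <- Tmax - length(sites)              # extend the randomness further into the past
    sites <- c(sample(n, new, replace = TRUE), sites)
    us    <- c(runif(new), us)
    lo <- rep(0L, n); hi <- rep(1L, n)         # the two saturated starting states at time -Tmax
    for (t in seq_len(Tmax)) {
      lo <- update(lo, sites[t], us[t])
      hi <- update(hi, sites[t], us[t])
    }
    if (all(lo == hi)) return(lo)              # coalescence at time 0: exact sample
    Tmax <- 2 * Tmax                           # otherwise go further back, reusing the randomness
  }
}

set.seed(3)
cftp_ising()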

Bayesian k-nearest-neighbour classification


Perfect sampling

Implementation for Potts (3)

In the case of a two-colour Ising model, existence of a perfect
sampler by virtue of monotonicity properties:

Ising Metropolis-Hastings Perfect Sampler
For T large enough,
  1. Start two chains x^{0,t} and x^{1,t} from the saturated states
  2. For t = -T, ..., -1, couple both chains:
     if missing, generate the basic uniforms u^{(t)};
     use u^{(t)} to update both x^{0,t} into x^{0,t+1} and x^{1,t} into x^{1,t+1}
  3. Check coalescence at time 0: if x^{0,0}_i = x^{1,0}_i for all i, stop;
     else increase T and recycle the younger u^{(t)}'s

Limitation: Slows down more and more as β increases

Bayesian k-nearest-neighbour classification


k-nearest-neighbours

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
   KNNs as a clustering rule
   KNNs as a probabilistic model
   Bayesian inference on KNNs
   MCMC implementation
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

KNNs as a clustering rule

The k-nearest-neighbour procedure is a supervised clustering


method that allocates [new] subjects to one of G categories based
on the most frequent class [within a learning sample] in their
neighbourhood.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Supervised classification
Infer from a partitioned dataset the classes of a new dataset.
Data: training dataset
    (y_i^tr, x_i^tr), i = 1, ..., n,
with class label 1 ≤ y_i^tr ≤ Q and predictor covariates x_i^tr,
and testing dataset
    (y_i^te, x_i^te), i = 1, ..., m,
with unknown y_i^te's.
[Figure: training sample plotted in the (x_1, x_2) plane]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Classification

Principle
Neighbourhood based on the Euclidean metric.
Prediction for a new point (y_j^te, x_j^te) (j = 1, ..., m): the
most common class amongst the k-nearest-neighbours of x_j^te in
the training set.
[Figure: animation of the k-nearest-neighbour allocation of test
points in the (x_1, x_2) plane]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Standard procedure

Example : help(knn)
data(iris3)
train=rbind(iris3[1:25,,1],iris3[1:25,,2],iris3[1:25,,3])
test=rbind(iris3[26:50,,1],iris3[26:50,,2],iris3[26:50,,3])
cl=factor(c(rep("s",25),rep("c",25),rep("v",25)))
library(class)
knn(train,test,cl,k=3,prob=TRUE)
attributes(.Last.value)

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Model choice perspective

Choice of k?
Usually chosen by minimising the cross-validated misclassification rate
(non-parametric or even non-probabilistic!)
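For reference, one way to carry out this cross-validation in R (a sketch reusing the iris3 training data of the earlier slide; knn.cv from the class package performs leave-one-out prediction):

library(class)
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
# leave-one-out misclassification rate for each candidate k
cv_error <- sapply(1:25, function(k)
  mean(as.character(knn.cv(train, cl, k = k)) != as.character(cl)))
which.min(cv_error)           # value of k minimising the cross-validated error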

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Influence of k

Dataset of Ripley (1994), with two classes where each population of
x_i's is from a mixture of two bivariate normal distributions.
Training set of n = 250 points and testing set of m = 1,000 points.
[Figure: k-nearest-neighbour classification boundaries for
k = 1, 11, 57, 137]
Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a clustering rule

Influence of k (contd)
k-nearest-neighbour leave-one-out cross-validation:
Solutions 17 18 35 36 45 46 51 52 53 54 (29)

  Procedure   Misclassification error rate
  1-nn        0.150 (150)
  3-nn        0.134 (134)
  15-nn       0.095 (095)
  17-nn       0.087 (087)
  54-nn       0.081 (081)

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

KNNs as a probabilistic model

k-nearest-neighbour model
Based on full conditional distributions (1 ≤ C ≤ Q)
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \sim_k i} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} , \qquad \beta > 0 , \]
where l ∼_k i is the k-nearest-neighbour relation.
[Holmes & Adams, 2002]
This can also be seen as a Potts model.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Motivations

β does not exist in the original k-nn procedure.
It is only relevant from a statistical point of view, as a measure of
uncertainty about the model:
  β = 0 corresponds to a uniform distribution on all classes;
  β = +∞ leads to a point mass distribution on the prevalent class.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

MRF-like expression

Closed-form expression for the full conditionals
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) =
   \exp\{\beta\, n_C(i)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(i)/k\} \]
where n_C(i) is the number of neighbours of i with class label C.
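In code, this full conditional is just a softmax of the scaled neighbour counts. A possible R sketch (illustrative helper; nb is assumed to be a list holding the neighbour indices of each training point):

knn_full_conditional <- function(i, y, nb, beta, k, Q = max(y)) {
  counts <- tabulate(y[nb[[i]]], nbins = Q)      # n_C(i) for C = 1, ..., Q
  w <- exp(beta * counts / k)
  w / sum(w)                                     # P(y_i = C | ...) for each class C
}

# toy usage with 3 classes and hand-made neighbour sets
y  <- c(1, 1, 2, 3, 2, 1)
nb <- list(c(2, 3), c(1, 3), c(1, 2, 4), c(3, 5), c(4, 6), c(5))
knn_full_conditional(3, y, nb, beta = 2, k = 3)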

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Drawback

Because the neighbourhood structure is not symmetric (x_i may be
one of the k nearest neighbours of x_j while x_j is not one of the k
nearest neighbours of x_i), there usually is no joint probability
distribution corresponding to these full conditionals!

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Drawback (2)
Note: Holmes & Adams (2002) solve this problem by directly
defining the joint as the pseudo-likelihood
\[ f(y^{tr} \mid x^{tr}, \beta, k) \propto \prod_{i=1}^{n}
   \exp\{\beta\, n_{y_i}(i)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(i)/k\} \]
[with a missing constant Z(β)]

... but they are still using the same [wrong] predictive
\[ P(y_j^{te} = C \mid y^{tr}, x^{tr}, x_j^{te}, \beta, k) =
   \exp\{\beta\, n_C(j)/k\} \Big/ \sum_{q} \exp\{\beta\, n_q(j)/k\} \]

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Resolution
Symmetrise the neighbourhood relation:
Principle: if x_i^tr belongs to the k-nearest-neighbour set for x_j^tr
and x_j^tr does not belong to the k-nearest-neighbour set for x_i^tr,
then x_j^tr is added to the set of neighbours of x_i^tr.
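A possible implementation of this symmetrisation (an illustrative helper written for this note, not the authors' code): build the usual Euclidean k-nearest-neighbour sets, then add the missing reciprocal links.

symmetric_knn <- function(X, k) {
  D <- as.matrix(dist(X))                        # pairwise Euclidean distances
  diag(D) <- Inf
  nbmat <- apply(D, 1, function(d) order(d)[1:k])     # k x n: column i = knn of point i
  nb <- lapply(seq_len(ncol(nbmat)), function(i) nbmat[, i])
  for (i in seq_along(nb))                       # enforce symmetry: if i lists j but
    for (j in nb[[i]])                           # j does not list i, add i to j's set
      if (!(i %in% nb[[j]])) nb[[j]] <- c(nb[[j]], i)
  nb                                             # (possibly enlarged) neighbour sets
}

set.seed(4)
X <- matrix(rnorm(40), ncol = 2)                 # 20 points in the plane
nb <- symmetric_knn(X, k = 3)
range(lengths(nb))                               # some sets now contain more than k points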

Bayesian k-nearest-neighbour classification




k-nearest-neighbours
KNNs as a probabilistic model

Consequence
Given the full conditionals
\[ P(y_i^{tr} = C \mid y_{-i}^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \,\#\, i} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} \]
where l # i is the symmetrised k-nearest-neighbour relation,
there exists a corresponding joint distribution.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
KNNs as a probabilistic model

Extension to unclassified points

Predictive distribution of y_j^te (j = 1, ..., m) defined as
\[ P(y_j^{te} = C \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k) \propto
   \exp\Big\{ \beta \sum_{l \,\#\, j} \mathbb{I}_{C}(y_l^{tr}) \big/ k \Big\} \]
where l # j is the symmetrised k-nearest-neighbour relation wrt the
training set {x_1^tr, ..., x_n^tr}.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
Bayesian inference on KNNs

Bayesian modelling
Within the Bayesian paradigm, assign a prior π(β, k) like
\[ \pi(\beta, k) \propto \mathbb{I}(1 \le k \le k_{max})\, \mathbb{I}(0 \le \beta \le \beta_{max}) \]
because there is a maximum value (e.g., β_max = 15) after which
the distribution is Dirac [as in the Potts model] and because it can be
argued that k_max = n/2.

Note
β is dimension-less because of the use of the frequencies n_C(i)/k as
covariates.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
Bayesian inference on KNNs

Bayesian global inference

Use the marginal predictive distribution of y_j^te given x_j^te (j = 1, ..., m),
\[ \int P(y_j^{te} = C \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k)\,
   \pi(\beta, k \mid y^{tr}, x^{tr})\, d\beta\, dk , \]
where
\[ \pi(\beta, k \mid y^{tr}, x^{tr}) \propto f(y^{tr} \mid x^{tr}, \beta, k)\, \pi(\beta, k) \]
is the posterior distribution of (β, k) given the training dataset y^tr
[ŷ_j^te = MAP estimate]

Note
Model choice with no varying dimension because β is the same for
all models.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

MCMC implementation
A Markov chain Monte Carlo (MCMC) approximation of
f(y_{n+1} | x_{n+1}, y, X)
is provided by
\[ M^{-1} \sum_{i=1}^{M} f\big(y_{n+1} \mid x_{n+1}, y, X, (\beta, k)^{(i)}\big) , \]
where {(β, k)^{(1)}, ..., (β, k)^{(M)}} is an MCMC output associated with the
stationary distribution π(β, k | y, X).
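Schematically, the Monte Carlo average looks as follows in R (a sketch: the (β, k) draws below are placeholders standing in for genuine MCMC output, and the predictive uses plain, unsymmetrised neighbour counts for brevity):

pred_prob <- function(xnew, X, y, beta, k, Q = max(y)) {
  d   <- sqrt(colSums((t(X) - xnew)^2))          # distances to the training points
  nbh <- order(d)[1:k]                           # k nearest training neighbours
  w   <- exp(beta * tabulate(y[nbh], nbins = Q) / k)
  w / sum(w)                                     # class probabilities under (beta, k)
}

set.seed(5)
X <- matrix(rnorm(100), ncol = 2); y <- sample(1:2, 50, replace = TRUE)
beta_draws <- runif(200, 1, 3)                   # placeholders for MCMC output
k_draws    <- sample(3:15, 200, replace = TRUE)

probs <- rowMeans(mapply(function(b, k) pred_prob(c(0, 0), X, y, b, k),
                         beta_draws, k_draws))   # averaged predictive at x = (0, 0)
which.max(probs)                                 # MAP class prediction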

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Auxiliary variable version

Random walk Metropolis-Hastings algorithm on both β and k.
Since β ∈ (0, β_max), use a logistic reparameterisation of β,
\[ \beta = \beta_{max}\, e^{\psi} \big/ (1 + e^{\psi}) , \]
and the random walk N(ψ^{(t)}, σ²) is on ψ.
For k, uniform proposal on the 2r neighbours of k^{(t)},
\[ \{k^{(t)} - r, \ldots, k^{(t)} - 1,\; k^{(t)} + 1, \ldots, k^{(t)} + r\} \cap \{1, \ldots, K\} . \]

Simulation of f(z^tr | x^tr, β, k) by perfect sampling, taking advantage
of monotonicity properties [but this may get stuck for too large values
of β].
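The two proposals can be sketched as below (illustrative code; the reparameterisation Jacobian and the asymmetry of the k-move near the boundaries must of course be accounted for in the acceptance ratio):

propose <- function(beta, k, beta_max = 15, sigma = 0.5, r = 3, K = 125) {
  psi      <- log(beta / (beta_max - beta))      # current psi: logit of beta/beta_max
  psi_new  <- rnorm(1, psi, sigma)               # Gaussian random walk on psi
  beta_new <- beta_max * exp(psi_new) / (1 + exp(psi_new))
  cand  <- setdiff(intersect((k - r):(k + r), 1:K), k)
  k_new <- cand[sample.int(length(cand), 1)]     # uniform over the allowed neighbours of k
  c(beta = beta_new, k = k_new)
}

set.seed(6)
propose(beta = 2, k = 40)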

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Choice of (β̂, k̂) paramount:
Illustration on Ripley's dataset: (k̂, β̂) = (53, 2.28) versus
(k̂, β̂) = (13, 1.45)

Bayesian k-nearest-neighbour classification




k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women


Example (R benchmark)
A population of women who were at least 21 years old, of Pima Indian
heritage and living near Phoenix (AZ), was tested for diabetes according
to WHO criteria. The data were collected by the US National Institute of
Diabetes and Digestive and Kidney Diseases. We used the 532 complete
records after dropping the (mainly missing) data on serum insulin.
number of pregnancies
plasma glucose concentration in an oral glucose tolerance test
diastolic blood pressure
triceps skin fold thickness
body mass index
diabetes pedigree function
age

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women

MCMC output for β_max = 1.5, β̂ = 1.15, k̂ = 40, and 20,000
simulations.

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Diabetes in Pima Indian women

Example (Error rate & k selection)

  k     Misclassification error rate
  1     0.316
  3     0.229
  15    0.226
  31    0.211
  57    0.205
  66    0.208

Bayesian k-nearest-neighbour classification


k-nearest-neighbours
MCMC implementation

Predictive output

The approximate Bayesian prediction of y_{n+1} is
\[ \hat{y}_{n+1} = \arg\max_{g}\; M^{-1} \sum_{i=1}^{M}
   f\big(g \mid x_{n+1}, y, X, \beta^{(i)}, k^{(i)}\big) . \]
E.g., Ripley's dataset misclassification error rate: 0.082.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

A reassessment of pseudo-likelihood

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Pseudo-likelihood
Pseudo-likelihood leads to (almost) straightforward MCMC
implementation

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Magnitude of the approximation

Since perfect and path sampling approaches are also available for
small datasets, the pseudo-likelihood approximation can be
evaluated against them.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Ripley's benchmark (1)

Approximations to the posterior of β based on the pseudo (green),
the path (red) and the perfect (yellow) schemes with
k = 1, 10, 70, 125, for 20,000 iterations.

Bayesian k-nearest-neighbour classification


Pseudo-likelihood reassessed

Ripley's benchmark (2)

Approximations of the posteriors of β (top) and k (bottom).

Bayesian k-nearest-neighbour classification


Variable selection

Variable selection

1. MRFs
2. Bayesian inference in Gibbs random fields
3. Perfect sampling
4. k-nearest-neighbours
5. Pseudo-likelihood reassessed
6. Variable selection

Bayesian k-nearest-neighbour classification


Variable selection

Goal: Selection of the components of the predictor vector that best
contribute to the classification.
  Parsimony (dimension of the predictor may be larger than the
  training sample size n)
  Efficiency (more components blur class differences)
[Figure: classification of Ripley's dataset for different subsets of
covariates: gamma=(1,1), err=78; gamma=(1,0), err=284;
gamma=(0,1), err=116; gamma=(1,1,1), err=159]

Bayesian k-nearest-neighbour classification


Variable selection

Component indicators
Completion of (β, k) with indicator variables γ_j ∈ {0, 1}
(1 ≤ j ≤ p) that determine which components of x are active in
the model:
\[ P(y_i = C_j \mid y_{-i}, X, \beta, k, \gamma) \propto
   \exp\Big\{ \beta \sum_{l \in v_k^{\gamma}(i)} \mathbb{I}_{C_j}(y_l) \big/ k \Big\} \]
with v_k^γ(i) the (symmetrised) k nearest neighbourhood of x_i for the
distance
\[ d_\gamma(x_i, x_{i'})^2 = \sum_{j=1}^{p} \gamma_j\, (x_{ij} - x_{i'j})^2 \]
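In R, the γ-weighted distance and the resulting neighbourhood could be sketched as follows (illustrative helper and simulated data, keeping only the components with γ_j = 1):

knn_gamma <- function(i, X, gamma, k) {
  Xg <- X[, gamma == 1, drop = FALSE]            # active components only
  d  <- sqrt(colSums((t(Xg) - Xg[i, ])^2))       # gamma-weighted Euclidean distances
  d[i] <- Inf
  order(d)[1:k]                                  # indices of the k nearest neighbours
}

set.seed(7)
X <- cbind(matrix(rnorm(60), ncol = 2),               # 2 informative covariates
           matrix(rnorm(150, sd = 0.05), ncol = 5))   # 5 noise covariates
knn_gamma(1, X, gamma = c(1, 1, 0, 0, 0, 0, 0), k = 3)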

Bayesian k-nearest-neighbour classification


Variable selection

Variable selection

Formal similarity with the usual variable selection in regression models.
Use of a uniform prior on the γ_j's on {0, 1}, independently for all j's.
Exploration of a range of 2^p models, which may be too large
(see, e.g., the vision dataset with p = 200).

Bayesian k-nearest-neighbour classification


Variable selection

Implementation
Use of a naive reversible jump MCMC, where
  1. (β, k) are changed conditional on γ, and
  2. γ is changed one component at a time conditional on (β, k)
     [and the data]

Note
Validation of simple jumps due to (a) saturation of the dimension
by associating a γ_j to each variable and (b) the hierarchical structure
of the (β, k) part.
This is not a varying dimension problem.

Bayesian k-nearest-neighbour classification


Variable selection

MCMC algorithm
Variable selection k-nearest-neighbours

At time 0, generate γ_j^{(0)} ∼ B(1/2), log β^{(0)} ∼ N(0, σ²) and
k^{(0)} ∼ U{1, ..., K}.
At time 1 ≤ t ≤ T,
  1. Generate log β̃ ∼ N(log β^{(t-1)}, σ²) and
     k̃ ∼ U({k - r, k - r + 1, ..., k + r - 1, k + r})
  2. Calculate the Metropolis-Hastings acceptance probability
     ρ((β̃, k̃), (β^{(t-1)}, k^{(t-1)}))
  3. Move to (β^{(t)}, k^{(t)}) by a Metropolis-Hastings step
  4. For j = 1, ..., p, generate γ_j^{(t)} ∼ π(γ_j | y, X, γ_{-j}^{(t)}, β^{(t)}, k^{(t)})

Bayesian k-nearest-neighbour classification


Variable selection

Benchmark 1
Ripley's dataset with 8 additional potential [useless] covariates
simulated from N(0, 0.05²).
Using the 250 datapoints for variable selection, comparison of the
2^10 = 1024 models by pseudo-maximum likelihood estimation of
(k, β) and by comparison of pseudo-likelihoods leads to selecting the
proper submodel
    γ_1 = γ_2 = 1 and γ_3 = ... = γ_10 = 0
with k = 3.1 and β = 3.8. Forward and backward selection
procedures lead to the same conclusion.
MCMC algorithm produces γ_1 = γ_2 = 1 and γ_3 = ... = γ_10 = 0 as
the MMAP, with very similar values for k and β [hardly any move
away from γ = (1, 1, 0, ..., 0) is accepted].

Bayesian k-nearest-neighbour classification


Variable selection

Benchmark 2
Ripley's dataset with now 28 additional covariates simulated from
N(0, 0.05²).
Using the 250 datapoints for variable selection, direct comparison
of the 2^30 models by pseudo-maximum likelihood estimation is
impossible!
Forward and backward selection procedures both lead to the proper
submodel γ = (1, 1, 0, ..., 0).
MCMC algorithm again produces γ_1 = γ_2 = 1 and
γ_3 = ... = γ_30 = 0 as the MMAP, with more moves around
γ = (1, 1, 0, ..., 0).
