Christian Robert: KNN
Outline
1. MRFs
2. Perfect sampling
3. k-nearest-neighbours
4. Pseudo-likelihood reassessed
5. Variable selection
MRFs
Definition (MRF)
A general Markov random field is the extension of the above to
any graph structure on the random variables, i.e., a collection of
rvs such that each one is independent of all the others given its
immediate neighbours in the corresponding graph.
[Cressie, 1993]
A formal definition
Take $y_1, \dots, y_n$, rvs with values in a finite set $S$, and let $G = (N, E)$ be a finite graph with $N = \{1, \dots, n\}$ the collection of nodes and $E$ the collection of edges, made of pairs from $N$.
For $A \subset N$, $\partial A$ denotes the set of neighbours of $A$, i.e. the collection of all points in $N \setminus A$ that have a neighbour in $A$.
Definition (MRF)
$y = (y_1, \dots, y_n)$ is a Markov random field associated with the graph $G$ if its full conditionals satisfy
$$f(y_i \mid y_{-i}) = f(y_i \mid y_{\partial i})\,.$$
Cliques are sets of points that are all neighbours of one another.
Gibbs distributions
Statistical perspective
Introduce a parameter $\theta$ in the Gibbs distribution:
$$f(y \mid \theta) = \frac{\exp\{Q_\theta(y)\}}{Z(\theta)}$$
The posterior then involves the intractable normalising constant $Z(\theta)$:
$$\pi(\theta \mid y) \propto \frac{\exp\{Q_\theta(y)\}\,\pi(\theta)}{Z(\theta)}$$
Potts model
Pseudo-posterior inference
Oldest solution: replace the likelihood with the pseudo-likelihood
$$\text{pseudo-like}(y \mid \theta) = \prod_{i=1}^{n} f(y_i \mid y_{-i}, \theta)\,.$$
Path sampling
$$\rho_{MH_1}(\theta' \mid \theta) = 1 \wedge \frac{Z(\theta)}{Z(\theta')}\, \frac{\exp\{Q_{\theta'}(y)\}\, \pi(\theta')\, q_1(\theta \mid \theta')}{\exp\{Q_{\theta}(y)\}\, \pi(\theta)\, q_1(\theta' \mid \theta)}$$
where $q_1(\theta' \mid \theta)$ is an [arbitrary] proposal density: the acceptance ratio involves the intractable ratio $Z(\theta)/Z(\theta')$.
[Robert & Casella, 1999/2004]
and
$$\frac{\mathrm{d}Z(\theta)}{\mathrm{d}\theta} = \sum_{y} S(y)\, \exp\{\theta S(y)\} = Z(\theta) \sum_{y} S(y)\, \frac{\exp\{\theta S(y)\}}{Z(\theta)} = Z(\theta)\, \mathbb{E}_{\theta}[S(y)]\,,$$
so that $\mathrm{d}\log Z(\theta)/\mathrm{d}\theta = \mathbb{E}_{\theta}[S(y)]$.
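This log-derivative is the identity exploited by path sampling: integrating between $\theta_0$ and $\theta_1$,
$$\log \frac{Z(\theta_1)}{Z(\theta_0)} = \int_{\theta_0}^{\theta_1} \mathbb{E}_{\theta}[S(y)]\, \mathrm{d}\theta\,,$$
so the intractable ratio of normalising constants can be approximated by averaging $S(y)$ over simulations at a grid of $\theta$ values.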
Auxiliary variables
Complete the move from $\theta$ to $\theta'$ with an auxiliary variable $z$, proposed from the Gibbs distribution at the proposed value,
$$q_1(\theta' \mid \theta)\, \frac{\exp\{Q_{\theta'}(z')\}}{Z(\theta')}\,,$$
and....
$$\rho_{MH_2}\big((\theta', z') \mid (\theta, z)\big) = 1 \wedge \frac{\exp\{Q_{\theta'}(y)\}\, \pi(\theta')}{\exp\{Q_{\theta}(y)\}\, \pi(\theta)}\, \frac{q_1(\theta \mid \theta')\, \exp\{Q_{\theta}(z)\}}{q_1(\theta' \mid \theta)\, \exp\{Q_{\theta'}(z')\}}\, \frac{g(z' \mid \theta', y)}{g(z \mid \theta, y)}$$
in which the normalising constants $Z(\theta)$ and $Z(\theta')$ cancel.
Choice of $g(z \mid \theta, y)$?
Perfect sampling
[Figures: perfect simulations of the Potts model]
k-nearest-neighbours
KNNs as a clustering rule
KNNs as a probabilistic model
Bayesian inference on KNNs
MCMC implementation
Supervised classification
Infer from a partitioned dataset the classes of a new dataset.
Data: training dataset $(y_i^{tr}, x_i^{tr})_{i=1,\dots,n}$
Classification
Principle: neighbourhood based on the Euclidean metric
[Figure: animation over the training sample]
Standard procedure
Example: help(knn)

library(class)
data(iris3)
# training set: first 25 flowers of each of the three iris species
train <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
# test set: remaining 25 flowers of each species
test <- rbind(iris3[26:50, , 1], iris3[26:50, , 2], iris3[26:50, , 3])
# class labels of the training set
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))
# 3-nearest-neighbour classification of the test set, returning vote proportions
knn(train, test, cl, k = 3, prob = TRUE)
attributes(.Last.value)
Choice of k?
Usually chosen by minimising the cross-validated misclassification rate
(non-parametric, or even non-probabilistic!)
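A minimal R sketch of this cross-validation step, using knn.cv from the class package (leave-one-out CV) on the iris3 example above; the candidate grid for k is an arbitrary choice:

library(class)
data(iris3)
train <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))
# leave-one-out cross-validated misclassification rate for each candidate k
ks <- 1:20
cv.err <- sapply(ks, function(k) mean(knn.cv(train, cl, k = k) != cl))
ks[which.min(cv.err)]   # k minimising the CV error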
Influence of k
[Figure: fitted classifications for k = 1, 11, 57, 137]
Influence of k (contd)
k-nearest-neighbour leave-one-out cross-validation:
Solutions: k = 17, 18, 35, 36, 45, 46, 51, 52, 53, 54 (29)
Procedure
[Figure: 1-nn, 3-nn, 15-nn, 17-nn and 54-nn fits]
Probabilistic model with full conditionals
$$\mathbb{P}\big(y_i^{tr} = \ell \mid y_{-i}^{tr}, x^{tr}, \beta, k\big) \propto \exp\Big\{\beta \sum_{l \sim_k i} \mathbb{I}_{\ell}(y_l^{tr}) \big/ k\Big\}\,, \qquad \beta > 0\,,$$
where $l \sim_k i$ means that $x_l^{tr}$ belongs to the $k$ nearest neighbours of $x_i^{tr}$.
Motivations
MRF-like expression
Drawback
Drawback (2)
Note: Holmes & Adams (2002) solve this problem by directly defining the joint as the pseudo-likelihood
$$f(y^{tr} \mid x^{tr}, \beta, k) \propto \prod_{i=1}^{n} \frac{\exp\big\{\beta \sum_{l \sim_k i} \mathbb{I}_{y_i^{tr}}(y_l^{tr}) / k\big\}}{\sum_{g} \exp\big\{\beta \sum_{l \sim_k i} \mathbb{I}_{g}(y_l^{tr}) / k\big\}}$$
Resolution
Symmetrise the neighbourhood relation.
Principle: if $x_i^{tr}$ belongs to the $k$-nearest-neighbour set for $x_j^{tr}$ and $x_j^{tr}$ does not belong to the $k$-nearest-neighbour set for $x_i^{tr}$, then $x_j^{tr}$ is added to the set of neighbours of $x_i^{tr}$.
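A minimal R sketch of this symmetrisation, assuming a Euclidean metric; the helper names are illustrative rather than taken from the paper:

# indices of the k nearest neighbours (Euclidean metric) of each row of x
knn.index <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                  # a point is not its own neighbour
  t(apply(d, 1, function(row) order(row)[1:k]))
}

# symmetrised relation: x_j is a neighbour of x_i whenever j is among the
# k nearest neighbours of i or i is among the k nearest neighbours of j
sym.neighbours <- function(x, k) {
  idx <- knn.index(x, k)
  A <- matrix(FALSE, nrow(x), nrow(x))
  for (i in 1:nrow(x)) A[i, idx[i, ]] <- TRUE
  A | t(A)                        # add the missing reciprocal links
}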
Consequence
Given the full conditionals
$$\mathbb{P}\big(y_i^{tr} = \ell \mid y_{-i}^{tr}, x^{tr}, \beta, k\big) \propto \exp\Big\{\beta \sum_{l \,\#\, i} \mathbb{I}_{\ell}(y_l^{tr}) \big/ k\Big\}$$
and, for a test point,
$$\mathbb{P}\big(y_j^{te} = \ell \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k\big) \propto \exp\Big\{\beta \sum_{l \,\#\, j} \mathbb{I}_{\ell}(y_l^{tr}) \big/ k\Big\}\,,$$
where $l \,\#\, i$ denotes the symmetrised neighbourhood relation.
Bayesian modelling
Within the Bayesian paradigm, assign a prior $\pi(\beta, k)$ like
$$\pi(\beta, k) \propto \mathbb{I}(1 \le k \le k_{\max})\, \mathbb{I}(0 \le \beta \le \beta_{\max})$$
because there is a maximum value (e.g., $\beta_{\max} = 15$) after which the distribution is Dirac [as in the Potts model], and because it can be argued that $k_{\max} = n/2$.
Note
$\beta$ is dimensionless because of the use of the frequencies $n_\ell(i)/k$ as covariates.
The predictive probability of a test class integrates out $(\beta, k)$:
$$\mathbb{P}\big(y_j^{te} = \ell \mid x_j^{te}, y^{tr}, x^{tr}\big) = \int \mathbb{P}\big(y_j^{te} = \ell \mid x_j^{te}, y^{tr}, x^{tr}, \beta, k\big)\, \pi(\beta, k \mid y^{tr}, x^{tr})\, \mathrm{d}\beta\, \mathrm{d}k\,,$$
where
$$\pi(\beta, k \mid y^{tr}, x^{tr}) \propto f(y^{tr} \mid x^{tr}, \beta, k)\, \pi(\beta, k)$$
is the posterior distribution of $(\beta, k)$ given the training dataset $y^{tr}$.
[$\widehat{y}_j^{te}$ = MAP estimate]
Note
Model choice with no varying dimension, because $\beta$ is the same for all models.
MCMC implementation
A Markov Chain Monte Carlo (MCMC) approximation of $f(y_{n+1} \mid x_{n+1}, y, X)$ is provided by
$$\frac{1}{M} \sum_{i=1}^{M} f\big(y_{n+1} \mid x_{n+1}, y, X, (\beta, k)^{(i)}\big)\,.$$
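A minimal R sketch of this Monte Carlo average, assuming posterior draws (beta, k)^(i) are already available; the full conditional below uses the plain (unsymmetrised) neighbourhood for brevity, and all function names are illustrative:

# full conditional P(y = g | x, y.tr, x.tr, beta, k) over the classes g
cond.prob <- function(x, y.tr, x.tr, beta, k) {
  d <- sqrt(colSums((t(x.tr) - x)^2))  # Euclidean distances to the training points
  nb <- order(d)[1:k]                  # k nearest training points
  counts <- table(factor(y.tr[nb], levels = levels(y.tr)))
  w <- exp(beta * counts / k)
  w / sum(w)
}

# average the full conditionals over the M posterior draws (beta, k)^(i)
predictive <- function(x, y.tr, x.tr, beta.draws, k.draws) {
  rowMeans(mapply(function(b, k) cond.prob(x, y.tr, x.tr, b, k),
                  beta.draws, k.draws))
}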
Choice of $(\widehat{\beta}, \widehat{k})$ paramount:
Illustration on Ripley's dataset: $(\widehat{k}, \widehat{\beta}) = (53, 2.28)$ versus
misclassification error rates 0.316, 0.229, 0.226, 0.211, 0.205, 0.208
Predictive output
$$\frac{1}{M} \sum_{i=1}^{M} f\big(g \mid x_{n+1}, y, X, \beta^{(i)}, k^{(i)}\big)\,.$$
A reassessment of pseudo-likelihood
Pseudo-likelihood
Pseudo-likelihood leads to (almost) straightforward MCMC implementation.
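A minimal R sketch of the corresponding log pseudo-likelihood for the k-nearest-neighbour model (unsymmetrised neighbourhood for brevity, illustrative names), which can be plugged directly into a Metropolis-Hastings acceptance ratio:

# log pseudo-likelihood: sum_i log f(y_i | y_{-i}, x, beta, k)
log.pseudo.lik <- function(y, x, beta, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf
  sum(sapply(seq_along(y), function(i) {
    nb <- order(d[i, ])[1:k]                   # k nearest neighbours of i
    counts <- table(factor(y[nb], levels = levels(y)))
    lp <- beta * counts / k                    # beta * n_g(i) / k for each class g
    lp[as.integer(y[i])] - log(sum(exp(lp)))   # log full conditional of y_i
  }))
}

A random-walk move on (log beta, k) then accepts with the ratio of exp{log.pseudo.lik} values, times the prior and proposal ratios.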
Variable selection
[Figure: fitted classifications for γ = (1, 1), err = 78; γ = (1, 0), err = 284; γ = (0, 1), err = 116; γ = (1, 1, 1), err = 159]
Parsimony (dimension of the predictor may be larger than the training sample size n)
Component indicators
Completion of $(\beta, k)$ with indicator variables $\gamma_j \in \{0, 1\}$ ($1 \le j \le p$) that determine which components of $x$ are active in the model:
$$\mathbb{P}\big(y_i = C_j \mid y_{-i}, X, \beta, k, \gamma\big) \propto \exp\Big\{\beta \sum_{l \in v_k^{\gamma}(i)} \mathbb{I}_{C_j}(y_l) \big/ k\Big\}$$
where $v_k^{\gamma}(i)$ is the $k$-nearest-neighbour set of $i$ computed from the active covariates only.
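A minimal R sketch of this restriction of the neighbourhood relation, where the Euclidean metric is computed from the active covariates only (self-contained, illustrative names):

# k nearest neighbours of each point, using only the columns with gamma_j = 1
active.knn.index <- function(x, gamma, k) {
  d <- as.matrix(dist(x[, gamma == 1, drop = FALSE]))
  diag(d) <- Inf
  t(apply(d, 1, function(row) order(row)[1:k]))
}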
Variable selection
Implementation
Use of a naive reversible jump MCMC (detailed below).
Note
Validation of simple jumps due to (a) saturation of the dimension by associating a $\gamma_j$ to each variable and (b) the hierarchical structure of the $(\beta, k)$ part.
Hence this is not a varying dimension problem.
MCMC algorithm
Variable selection k-nearest-neighbours
At time 0, generate $\gamma_j^{(0)} \sim \mathcal{B}(1/2)$, $\log \beta^{(0)} \sim \mathcal{N}(0, \sigma^2)$ and $k^{(0)} \sim \mathcal{U}\{1, \dots, K\}$.
At time $1 \le t \le T$,
1. Generate $\log \widetilde{\beta} \sim \mathcal{N}(\log \beta^{(t-1)}, \sigma^2)$ and $\widetilde{k} \sim \mathcal{U}\{k - r, k - r + 1, \dots, k + r - 1, k + r\}$
2. ...
3. ...
4. ...
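A minimal R sketch of the initialisation and of proposal step 1 above; the acceptance steps 2-4 are left as a stub, and all tuning values (p, K, sigma, r, number of iterations) are assumptions:

p <- 10; K <- 125; sigma <- 1; r <- 3; n.iter <- 10000   # assumed tuning values

gamma <- rbinom(p, 1, 1/2)        # gamma_j^(0) ~ B(1/2)
beta  <- exp(rnorm(1, 0, sigma))  # log beta^(0) ~ N(0, sigma^2)
k     <- sample(1:K, 1)           # k^(0) ~ U{1,...,K}

for (t in 1:n.iter) {
  beta.new <- exp(rnorm(1, log(beta), sigma))      # log random walk on beta
  k.new <- sample(max(1, k - r):min(K, k + r), 1)  # uniform move on {k-r,...,k+r}
  # steps 2-4: accept/reject (beta.new, k.new) and update the gamma_j's
}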
Benchmark 1
Ripley's dataset with 8 additional potential [useless] covariates simulated from $\mathcal{N}(0, .05^2)$.
Using the 250 datapoints for variable selection, comparison of the $2^{10} = 1024$ models by pseudo-maximum-likelihood estimation of $(k, \beta)$ and by comparison of pseudo-likelihoods leads to selecting the proper submodel
$$\gamma_1 = \gamma_2 = 1 \quad\text{and}\quad \gamma_3 = \dots = \gamma_{10} = 0$$
Benchmark 2
Ripley's dataset with now 28 additional covariates simulated from $\mathcal{N}(0, .05^2)$.
Using the 250 datapoints for variable selection, direct comparison of the $2^{30}$ models by pseudo-maximum-likelihood estimation is impossible!
Forward and backward selection procedures both lead to the proper submodel $\gamma = (1, 1, 0, \dots, 0)$.
The MCMC algorithm again produces $\gamma_1 = \gamma_2 = 1$ and $\gamma_3 = \dots = \gamma_{30} = 0$ as the MMAP, with more moves around $\gamma = (1, 1, 0, \dots, 0)$.