
MIMA Group


Chapter 4
Non-Parametric Estimation

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

- Introduction
- Parzen Windows
- k-Nearest-Neighbor Estimation
- Classification Techniques
  - The Nearest-Neighbor rule (1-NN)
  - The Nearest-Neighbor rule (k-NN)
- Distance Metrics


Bayes Rule for Classification

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$

- To compute the posterior probability, we need to know the prior probability and the likelihood.
- Case I: $p(x \mid \omega_i)$ has a certain parametric form
  - Maximum-Likelihood Estimation
  - Bayesian Parameter Estimation
- Problems:
  - The assumed parametric form may not fit the ground-truth density encountered in practice, e.g., assumed parametric form: unimodal; ground truth: multimodal.

Non-Parametric Estimation

- Case II: $p(x \mid \omega_i)$ doesn't have a parametric form
- How? Let the data speak for themselves!
  - Parzen Windows
  - k_n-Nearest-Neighbor


Goals

- Estimate class-conditional densities: $p(x \mid \omega_i)$
- Estimate posterior probabilities: $P(\omega_i \mid x)$


Density Estimation

- Assume p(x) is continuous and R is a small region
- Fundamental fact
  - The probability that a vector x falls into a region R:

$$P(X \in R) = \int_R p(x')\,dx' \approx p(x)\int_R dx' = p(x)\,V_R = P_R$$

- Given n i.i.d. examples {x_1, x_2, ..., x_n}, let K denote the random variable representing the number of samples falling into R. K follows a binomial distribution:

$$K \sim B(n, P_R), \qquad P(K = k) = \binom{n}{k} P_R^{\,k} (1 - P_R)^{n-k}$$

(Figure: a small cell R around x among the n samples.)

Density Estimation

- With $P(X \in R) \approx p(x)\,V_R = P_R$ and $E[K] = n P_R$:

$$p(x) \approx \frac{E[K]/n}{V_R}$$

- Let $k_R$ denote the actual number of samples falling in R; then

$$p(x) \approx \frac{k_R / n}{V_R}$$

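This relation can be checked numerically. Below is a minimal sketch (not from the slides), assuming Python with NumPy: it draws samples from a standard 1-d Gaussian and compares (k_R/n)/V_R with the true density at x = 0; the variable names and sample size are illustrative.

```python
# A minimal sketch: numerically checking p(x) ~ (k_R/n)/V_R for a standard
# 1-d Gaussian at x = 0, using a small interval R = [x - h/2, x + h/2].
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # number of i.i.d. samples
x, h = 0.0, 0.1                  # point of interest and edge length of R
samples = rng.standard_normal(n)

k_R = np.sum(np.abs(samples - x) <= h / 2)   # samples falling into R
V_R = h                                      # volume (length) of R
estimate = (k_R / n) / V_R

true_density = 1.0 / np.sqrt(2 * np.pi)      # N(0,1) density at x = 0
print(f"estimate: {estimate:.4f}, true p(0): {true_density:.4f}")
```
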
Density Estimation

$$p(x) \approx \frac{k_R / n}{V_R}$$

- Use subscript n to take the sample size into account:

$$p_n(x) = \frac{k_n / n}{V_n}$$

- We hope that $\lim_{n \to \infty} p_n(x) = p(x)$
- To do this, we should have

$$\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} k_n / n = 0$$

Density Estimation

$$p_n(x) = \frac{k_n / n}{V_n}$$

What items can be controlled? How?

- Fix V_n and determine k_n
  - Parzen Windows
- Fix k_n and determine V_n
  - k_n-Nearest-Neighbor


Parzen Windows

$$p_n(x) = \frac{k_n / n}{V_n} \qquad \text{(fix } V_n \text{ and determine } k_n\text{)}$$

- Assume R_n is a d-dimensional hypercube
  - The length of each edge is h_n, so $V_n = h_n^d$
- Determine k_n with a window function, a.k.a. kernel function or potential function.

(Photo: Emanuel Parzen, 1929-2016)

Window function

- φ(u) defines a unit hypercube centered at the origin:

$$\varphi\!\left(\frac{x}{h_n}\right) = \begin{cases} 1 & |x_j| \le h_n/2,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$$

(Figure: a hypercube with edge length h_n.)

Window function

$$\varphi\!\left(\frac{x - x'}{h_n}\right) = \begin{cases} 1 & |x_j - x'_j| \le h_n/2,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$$

- φ = 1 means that x' falls within the hypercube of volume V_n centered at x.
- k_n: the number of samples inside the hypercube centered at x:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$


Parzen Window Estimation

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right), \quad p_n(x) = \frac{k_n / n}{V_n} \;\Rightarrow\; p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right) \quad \text{(Parzen pdf)}$$

- φ(u) is not limited to the hypercube window function defined previously. It can be any pdf, i.e., any nonnegative function with $\int \varphi(u)\,du = 1$.
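As an illustration of the Parzen pdf with the hypercube window, here is a minimal sketch assuming Python with NumPy; the function name parzen_hypercube, the data, and the edge length are illustrative choices, not from the slides.

```python
# A minimal sketch of the Parzen estimate
#   p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n)
# with the hypercube window phi(u) = 1 if all |u_j| <= 1/2, else 0.
import numpy as np

def parzen_hypercube(x, samples, h):
    """Estimate p(x) from samples of shape (n, d) using a hypercube window of edge h."""
    x = np.atleast_1d(x)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    u = (x - samples) / h                      # (n, d) scaled differences
    inside = np.all(np.abs(u) <= 0.5, axis=1)  # phi evaluated at each sample
    V = h ** d                                 # hypercube volume V_n = h^d
    return inside.sum() / (n * V)

# Usage: estimate a standard normal density at a few points.
rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 1))
for pt in (-1.0, 0.0, 1.0):
    print(pt, parzen_hypercube([pt], data, h=0.3))
```
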


Parzen Window Estimation

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

- Is p_n(x) a pdf? Set $u = (x - x_i)/h_n$, so that $dx = h_n^d\,du = V_n\,du$:

$$\int p_n(x)\,dx = \frac{1}{n}\sum_{i=1}^{n} \int \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \frac{1}{n}\sum_{i=1}^{n} \int \varphi(u)\,du = 1$$

Window function being a pdf + window width + training data → Parzen pdf


Parzen Window Estimation

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

Define $\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right)$, so that $p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$.

- p_n(x): a superposition of n interpolation functions
- x_i: contributes to p_n(x) based on its "distance" from x

What is the effect of h_n (the window width) on the Parzen pdf?


Parzen Window Estimation

- The effect of h_n:

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right) = \frac{1}{h_n^d}\,\varphi\!\left(\frac{x}{h_n}\right)$$

The factor $1/h_n^d$ affects the amplitude (vertical scale); the argument $x/h_n$ affects the width (horizontal scale).


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right)$$

Suppose φ(·) is a 2-d Gaussian pdf. (Figure: the shape of δ_n(x) for decreasing values of h_n.)


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right), \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$$

- When h_n is very large, δ_n(x) will be broad with small amplitude.
  - p_n(x) will be the superposition of n broad, slowly changing functions, i.e., smooth with low resolution.
- When h_n is very small, δ_n(x) will be sharp with large amplitude.
  - p_n(x) will be the superposition of n sharp pulses, i.e., variable/unstable with high resolution.


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right), \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$$

(Figure: Parzen window estimates for five samples, where φ(·) is a 2-d Gaussian pdf.)


Parzen Window Estimation

- Convergence conditions
  - To ensure convergence, i.e.,

$$\lim_{n \to \infty} E[p_n(x)] = p(x), \qquad \lim_{n \to \infty} \mathrm{Var}[p_n(x)] = 0$$

  - we have the following additional constraints:

$$\sup_{u} \varphi(u) < \infty, \qquad \lim_{\|u\| \to \infty} \varphi(u)\prod_{i=1}^{d} u_i = 0, \qquad \lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} n V_n = \infty$$


Illustrations

- One-dimensional case: $X \sim N(0, 1)$

$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}, \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad h_n = h_1/\sqrt{n}$$
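A minimal sketch of this one-dimensional illustration, assuming Python with NumPy: a Gaussian window with h_n = h_1/√n applied to samples from N(0, 1). The printed values should approach the true standard normal density on the grid as n grows; the function name and sample sizes are illustrative.

```python
# A minimal sketch of the 1-d illustration: Gaussian window
# phi(u) = exp(-u^2/2)/sqrt(2*pi) and shrinking width h_n = h_1/sqrt(n).
import numpy as np

def parzen_gaussian_1d(x, samples, h1):
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / hn            # (n, m) for m query points
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=0) / hn               # p_n(x) = (1/n) sum (1/h_n) phi(.)

rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 7)
for n in (16, 256, 4096):                      # the estimate sharpens as n grows
    data = rng.standard_normal(n)
    print(n, np.round(parzen_gaussian_1d(xs, data, h1=1.0), 3))
```
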


Illustrations

- One-dimensional case (further examples with the same Gaussian window and $h_n = h_1/\sqrt{n}$).


Illustrations

- Two-dimensional case:

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n^2}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad h_n = h_1/\sqrt{n}$$


Classification Example

(Figure: Parzen-window classification with a smaller window vs. a larger window.)


Choosing Window Function

- V_n must approach zero as n → ∞, but at a rate slower than 1/n, e.g.,

$$V_n = V_1/\sqrt{n}$$

- The value of the initial volume V_1 is important.
- In some cases, a cell volume that is proper for one region may be unsuitable in a different region.


k_n-Nearest Neighbor

$$p_n(x) = \frac{k_n / n}{V_n}$$

- Fix k_n and then determine V_n
- To estimate p(x), we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g.,

$$k_n = \sqrt{n} \quad \text{(a principled rule to choose } k_n\text{)} \qquad \Rightarrow \qquad \lim_{n \to \infty} k_n = \infty, \quad \lim_{n \to \infty} V_n = 0$$
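A minimal one-dimensional sketch of the k_n-nearest-neighbor estimate, assuming Python with NumPy; the helper knn_density_1d is an illustrative name, the rule k_n = √n follows the slide, and the sample size is arbitrary.

```python
# A minimal sketch of the k_n-nearest-neighbor estimate: fix k_n (e.g. k_n = sqrt(n)),
# grow an interval around x until it holds k_n samples, and use p_n(x) = (k_n/n)/V_n.
import numpy as np

def knn_density_1d(x, samples, kn=None):
    n = len(samples)
    if kn is None:
        kn = max(1, int(round(np.sqrt(n))))    # principled rule k_n = sqrt(n)
    dists = np.sort(np.abs(samples - x))
    r = dists[kn - 1]                          # radius of the cell containing kn samples
    Vn = 2 * r                                 # 1-d "volume" (interval length)
    return (kn / n) / Vn

rng = np.random.default_rng(3)
data = rng.standard_normal(1000)
print(knn_density_1d(0.0, data))               # compare with 1/sqrt(2*pi) ~ 0.399
```
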


k_n-Nearest Neighbor

(Figure: eight points in one dimension (n = 8, d = 1); red curve: k_n = 3, black curve: k_n = 5.)

(Figure: thirty-one points in two dimensions (n = 31, d = 2); black surface: k_n = 5.)


Estimation of A Posteriori Probability

P_n(ω_i | x) = ?

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$

$$\text{where} \quad p_n(x, \omega_i) = \frac{k_i / n}{V_n}, \qquad \sum_{j=1}^{c} p_n(x, \omega_j) = \frac{k_n / n}{V_n}$$

Estimation of A Posteriori Probability

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$

- The value of V_n or k_n can be determined based on the Parzen-window or k_n-nearest-neighbor technique.
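The estimate P_n(ω_i | x) = k_i / k_n can be computed directly as the fraction of labels among the k_n nearest samples. A minimal sketch, assuming Python with NumPy and synthetic two-class Gaussian data (the function name and data are illustrative, not from the slides):

```python
# A minimal sketch: estimate P_n(w_i | x) as k_i / k_n, the fraction of the
# k_n nearest samples to x that carry label i.
import numpy as np

def knn_posterior(x, X, y, kn, num_classes):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:kn]           # indices of the kn nearest samples
    counts = np.bincount(y[nearest], minlength=num_classes)
    return counts / kn                         # k_i / k_n for each class i

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([1.5, 1.5]), X, y, kn=9, num_classes=2))
```
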


Nearest Neighbor Classifier

- Store all training examples
- Given a new example x to be classified, search for the training example (x_i, y_i) whose x_i is most similar (or closest) to x, and predict y_i. (Lazy learning)

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$
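A minimal sketch of this lazy-learning scheme, assuming Python with NumPy: "fitting" just memorizes the training pairs, and prediction returns the label of the closest stored example. The class name and toy data are illustrative.

```python
# A minimal sketch of the nearest-neighbor rule: store all training pairs and,
# for a query x, return the label of the single closest stored example.
import numpy as np

class NearestNeighbor:
    def fit(self, X, y):                      # "training" = memorizing the data
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X - x, axis=1)
        return self.y[np.argmin(dists)]       # label of the closest example

X = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.3]]
y = [0, 0, 1, 1]
clf = NearestNeighbor().fit(X, y)
print(clf.predict(np.array([0.2, 0.1])), clf.predict(np.array([3.2, 3.0])))
```
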


Decision Boundaries

- Decision boundaries
  - The Voronoi diagram
  - Given a set of points, a Voronoi diagram describes the areas that are nearest to each given point.
  - These areas can be viewed as zones of control.


Decision Boundaries

- The decision boundary is formed by retaining only those line segments that separate different classes.
- The more training examples we have stored, the more complex the decision boundaries can become.


Decision Boundaries

- With a large number of examples and noise in the labels, the decision boundary can become nasty!
  - Note the islands in the figure; they are formed by noisy examples.
- If the nearest neighbor happens to be a noisy point, the prediction will be incorrect. How to deal with this?


Effect of k

- Different k values give different results:
  - Large k produces smoother boundaries
    - The impact of class-label noise is canceled out.
  - When k is too large, what will happen?
    - Oversimplified boundaries; e.g., with k = N we always predict the majority class.
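The effect of k can be seen in a small experiment. The sketch below (illustrative, assuming Python with NumPy) uses majority voting among the k nearest neighbors, injects one noisy label, and queries a point next to it: with k = 1 the noisy label is likely echoed, a moderate k tends to vote it away, and k = n always returns the overall majority class.

```python
# A minimal sketch of the k-NN rule with majority vote; larger k averages out
# label noise, while k = n collapses to predicting the majority class.
import numpy as np

def knn_predict(x, X, y, k):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y[nearest]).argmax()    # majority vote among k neighbors

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2.5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
y[0] = 1                                       # inject one noisy label into class 0
query = X[0] + 0.01                            # a query right next to the noisy point
for k in (1, 5, 60):                           # k = n = 60 votes with everyone
    print(k, knn_predict(query, X, y, k))
```
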


How to Choose k?

- Can we choose k to minimize the mistakes that we make on training examples? (training error)
  - What is the training error of the nearest-neighbor rule?
- Can we choose k to minimize the mistakes that we make on test examples? (test error)


How to Choose k?

- How do training error and test error change as we change the value of k?


Model Selection

- Choosing k for k-NN is just one of the many model selection problems we face in machine learning.
- Model selection is about choosing among different models
  - Linear regression vs. quadratic regression
  - k-NN vs. decision tree
- Heavily studied in machine learning; of crucial importance in practice.
- If we use training error to select models, we will always choose more complex ones.

Model Selection

- We can keep part of the labeled data apart as validation data.
  - Evaluate different k values based on the prediction accuracy on the validation data
  - Choose the k that minimizes the validation error

Validation can be viewed as another name for testing, but the name "testing" is typically reserved for final evaluation, whereas validation is mostly used for model selection.
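A minimal sketch of this selection procedure, assuming Python with NumPy: split the labeled data into a training part and a validation part, evaluate several candidate k values on the validation part, and keep the k with the lowest validation error. The split sizes and candidate values are arbitrary choices for illustration.

```python
# A minimal sketch of selecting k with a held-out validation set.
import numpy as np

def knn_predict(x, X, y, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[nearest]).argmax()

def validation_error(k, X_tr, y_tr, X_val, y_val):
    preds = np.array([knn_predict(x, X_tr, y_tr, k) for x in X_val])
    return np.mean(preds != y_val)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(X))
train, val = idx[:150], idx[150:]              # hold out 50 points for validation
errors = {k: validation_error(k, X[train], y[train], X[val], y[val])
          for k in (1, 3, 5, 9, 15)}
print(errors, "best k:", min(errors, key=errors.get))
```
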


Model Selection

- The impact of validation set size
  - If we reserve only one point for our validation set, should we trust the validation error as a reliable estimate of our classifier's performance?
  - The larger the validation set, the more reliable our model selection choices are.
  - When the total labeled set is small, we might not be able to get a big enough validation set, leading to unreliable model selection decisions.


Model Selection

- K-fold Cross-Validation
  - Perform learning/testing K times
  - Each time, reserve one subset as the validation set and train on the rest

Special case: leave-one-out cross-validation
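A minimal sketch of K-fold cross-validation for choosing k, assuming Python with NumPy; the fold count K = 5, the synthetic data, and the candidate k values are illustrative.

```python
# A minimal sketch of K-fold cross-validation: split the data into K folds,
# hold each fold out once for validation, and average the validation errors.
import numpy as np

def knn_predict(x, X, y, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[nearest]).argmax()

def cv_error(k, X, y, K=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    errs = []
    for i in range(K):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(K) if j != i])
        preds = np.array([knn_predict(x, X[tr], y[tr], k) for x in X[val]])
        errs.append(np.mean(preds != y[val]))
    return np.mean(errs)                       # average over the K held-out folds

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
for k in (1, 3, 7):
    print(k, round(cv_error(k, X, y), 3))
```
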


Other issues of kNN

- It can be computationally expensive to find the nearest neighbors!
  - Speed up the computation by using smart data structures to quickly search for approximate solutions
- For large data sets, it requires a lot of memory
  - Remove unimportant examples


Final words on kNN

- kNN is what we call lazy learning (vs. eager learning)
  - Lazy: learning only occurs when you see the test example
  - Eager: learn a model before you see the test example; training examples can be thrown away after learning
- Advantages:
  - Conceptually simple, easy to understand and explain
  - Very flexible decision boundaries
  - Not much learning at all!
- Disadvantages:
  - It can be hard to find a good distance measure
  - Irrelevant features and noise can be very detrimental
  - Typically cannot handle more than 30 attributes
  - Computational cost: requires a lot of computation and memory


Distance Metrics

- The distance measure is an important factor for the nearest-neighbor classifier, e.g.,
  - To achieve invariant pattern recognition and data mining results.

(Figure: the effect of changing units on nearest-neighbor relationships.)




Properties of a Distance Metric

- Nonnegativity: $D(a, b) \ge 0$
- Reflexivity: $D(a, b) = 0$ iff $a = b$
- Symmetry: $D(a, b) = D(b, a)$
- Triangle Inequality: $D(a, b) + D(b, c) \ge D(a, c)$
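These four properties can be spot-checked numerically for a given distance function. The sketch below (illustrative, assuming Python with NumPy) checks them for the Euclidean distance on random vectors; it is a sanity check on sampled points, not a proof.

```python
# A minimal sketch: numerically spot-checking the four metric properties
# for the Euclidean distance on random vectors.
import numpy as np

def D(a, b):
    return np.linalg.norm(a - b)

rng = np.random.default_rng(8)
for _ in range(1000):
    a, b, c = rng.normal(size=(3, 4))
    assert D(a, b) >= 0                           # nonnegativity
    assert np.isclose(D(a, a), 0)                 # reflexivity: D(a, a) = 0
    assert np.isclose(D(a, b), D(b, a))           # symmetry
    assert D(a, b) + D(b, c) >= D(a, c) - 1e-12   # triangle inequality (with slack)
print("all checks passed")
```
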


Minkowski Metric (Lp Norm)

$$\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$$

1. L1 norm (Manhattan or city block distance):
$$L_1(a, b) = \|a - b\|_1 = \sum_{i=1}^{d} |a_i - b_i|$$

2. L2 norm (Euclidean distance):
$$L_2(a, b) = \|a - b\|_2 = \left(\sum_{i=1}^{d} |a_i - b_i|^2\right)^{1/2}$$

3. L∞ norm (chessboard distance):
$$L_\infty(a, b) = \|a - b\|_\infty = \lim_{p \to \infty}\left(\sum_{i=1}^{d} |a_i - b_i|^p\right)^{1/p} = \max_i |a_i - b_i|$$
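A minimal sketch of the Minkowski family, assuming Python with NumPy; the helper minkowski is an illustrative name, with the p = ∞ case handled as the coordinate-wise maximum.

```python
# A minimal sketch of the Minkowski family: L1 (Manhattan), L2 (Euclidean)
# and the L-infinity (chessboard) limit max_i |a_i - b_i|.
import numpy as np

def minkowski(a, b, p):
    if np.isinf(p):
        return np.max(np.abs(a - b))           # L_inf: chessboard distance
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1),       # L1 = 7
      minkowski(a, b, 2),       # L2 = 5
      minkowski(a, b, np.inf))  # L_inf = 4
```
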
Summary

- Basic setting for non-parametric techniques
  - Let the data speak for themselves
  - No parametric form is assumed for the class-conditional pdf
  - Estimate the class-conditional pdf from training examples
  - Make predictions based on Bayes' theorem
- Fundamental result in density estimation

$$p_n(x) = \frac{k_n / n}{V_n}$$


Summary

$$p_n(x) = \frac{k_n / n}{V_n}$$

- Parzen Windows
  - Fix V_n and then determine k_n

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

Window function being a pdf + window width + training data → Parzen pdf


Summary

- k_n-Nearest-Neighbor
  - Fix k_n and then determine V_n
  - To estimate p(x), we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g.,

$$k_n = \sqrt{n} \quad \text{(a principled rule to choose } k_n\text{)} \qquad \Rightarrow \qquad \lim_{n \to \infty} k_n = \infty, \quad \lim_{n \to \infty} V_n = 0$$


MIMA Group

Any Questions?
