
MIMA Group


Chapter 4
Non-Parametric Estimation

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

- Introduction
- Parzen Windows
- k-Nearest-Neighbor Estimation
- Classification Techniques
  - The Nearest-Neighbor rule (1-NN)
  - The Nearest-Neighbor rule (k-NN)
- Distance Metrics


Bayes Rule for Classification

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$

- To compute the posterior probability, we need to know the prior probability and the likelihood.
- Case I: $p(x \mid \omega_i)$ has a certain parametric form
  - Maximum-Likelihood Estimation
  - Bayesian Parameter Estimation
- Problems:
  - The assumed parametric form may not fit the ground-truth density encountered in practice, e.g., assumed parametric form: unimodal; ground truth: multimodal.

Non-Parametric Estimation

- Case II: $p(x \mid \omega_i)$ doesn't have a parametric form
- How? Let the data speak for themselves!
  - Parzen Windows
  - k_n-Nearest-Neighbor


Goals

- Estimate class-conditional densities: $p(x \mid \omega_i)$
- Estimate posterior probabilities: $P(\omega_i \mid x)$


Density Estimation

- Assume p(x) is continuous and R is a small region
- Fundamental fact
  - The probability that a vector x falls into a region R:

$$P(X \in R) = \int_R p(x')\,dx' \approx p(x)\int_R dx' = p(x)\,V_R = P_R$$

- Given n i.i.d. examples {x_1, x_2, ..., x_n}, let K denote the random variable representing the number of samples falling into R. K follows a binomial distribution:

$$K \sim B(n, P_R), \qquad P(K = k) = \binom{n}{k} P_R^{\,k} (1 - P_R)^{n-k}$$

(Figure: a small cell R around x among the n samples.)

Density Estimation

- With $P(X \in R) \approx p(x)\,V_R = P_R$ and $E[K] = n P_R$:

$$p(x) \approx \frac{E[K]/n}{V_R}$$

- Let $k_R$ denote the actual number of samples falling in R; then

$$p(x) \approx \frac{k_R / n}{V_R}$$

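This relation can be checked numerically. Below is a minimal sketch (not from the slides), assuming Python with NumPy: it draws samples from a standard 1-d Gaussian and compares (k_R/n)/V_R with the true density at x = 0; the variable names and sample size are illustrative.

```python
# A minimal sketch: numerically checking p(x) ~ (k_R/n)/V_R for a standard
# 1-d Gaussian at x = 0, using a small interval R = [x - h/2, x + h/2].
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # number of i.i.d. samples
x, h = 0.0, 0.1                  # point of interest and edge length of R
samples = rng.standard_normal(n)

k_R = np.sum(np.abs(samples - x) <= h / 2)   # samples falling into R
V_R = h                                      # volume (length) of R
estimate = (k_R / n) / V_R

true_density = 1.0 / np.sqrt(2 * np.pi)      # N(0,1) density at x = 0
print(f"estimate: {estimate:.4f}, true p(0): {true_density:.4f}")
```
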
Density Estimation

$$p(x) \approx \frac{k_R / n}{V_R}$$

- Use subscript n to take the sample size into account:

$$p_n(x) = \frac{k_n / n}{V_n}$$

- We hope that $\lim_{n \to \infty} p_n(x) = p(x)$
- To do this, we should have

$$\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} k_n / n = 0$$

Density Estimation

$$p_n(x) = \frac{k_n / n}{V_n}$$

What items can be controlled? How?

- Fix V_n and determine k_n
  - Parzen Windows
- Fix k_n and determine V_n
  - k_n-Nearest-Neighbor


Parzen Windows

$$p_n(x) = \frac{k_n / n}{V_n} \qquad \text{(fix } V_n \text{ and determine } k_n\text{)}$$

- Assume R_n is a d-dimensional hypercube
  - The length of each edge is h_n, so $V_n = h_n^d$
- Determine k_n with a window function, a.k.a. kernel function or potential function.

(Photo: Emanuel Parzen, 1929-2016)

Window function

- φ(u) defines a unit hypercube centered at the origin:

$$\varphi\!\left(\frac{x}{h_n}\right) = \begin{cases} 1 & |x_j| \le h_n/2,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$$

(Figure: a hypercube with edge length h_n.)

Window function

$$\varphi\!\left(\frac{x - x'}{h_n}\right) = \begin{cases} 1 & |x_j - x'_j| \le h_n/2,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$$

- φ = 1 means that x' falls within the hypercube of volume V_n centered at x.
- k_n: the number of samples inside the hypercube centered at x:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$


Parzen Window Estimation

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right), \quad p_n(x) = \frac{k_n / n}{V_n} \;\Rightarrow\; p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right) \quad \text{(Parzen pdf)}$$

- φ(u) is not limited to the hypercube window function defined previously. It can be any pdf, i.e., any nonnegative function with $\int \varphi(u)\,du = 1$.
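As an illustration of the Parzen pdf with the hypercube window, here is a minimal sketch assuming Python with NumPy; the function name parzen_hypercube, the data, and the edge length are illustrative choices, not from the slides.

```python
# A minimal sketch of the Parzen estimate
#   p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n)
# with the hypercube window phi(u) = 1 if all |u_j| <= 1/2, else 0.
import numpy as np

def parzen_hypercube(x, samples, h):
    """Estimate p(x) from samples of shape (n, d) using a hypercube window of edge h."""
    x = np.atleast_1d(x)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    u = (x - samples) / h                      # (n, d) scaled differences
    inside = np.all(np.abs(u) <= 0.5, axis=1)  # phi evaluated at each sample
    V = h ** d                                 # hypercube volume V_n = h^d
    return inside.sum() / (n * V)

# Usage: estimate a standard normal density at a few points.
rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 1))
for pt in (-1.0, 0.0, 1.0):
    print(pt, parzen_hypercube([pt], data, h=0.3))
```
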


Parzen Window Estimation

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

- Is p_n(x) a pdf? Set $u = (x - x_i)/h_n$, so that $dx = h_n^d\,du = V_n\,du$:

$$\int p_n(x)\,dx = \frac{1}{n}\sum_{i=1}^{n} \int \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \frac{1}{n}\sum_{i=1}^{n} \int \varphi(u)\,du = 1$$

Window function being a pdf + window width + training data → Parzen pdf


Parzen Window Estimation

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

Define $\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right)$, so that $p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$.

- p_n(x): a superposition of n interpolation functions
- x_i: contributes to p_n(x) based on its "distance" from x

What is the effect of h_n (the window width) on the Parzen pdf?


Parzen Window Estimation

- The effect of h_n:

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right) = \frac{1}{h_n^d}\,\varphi\!\left(\frac{x}{h_n}\right)$$

The factor $1/h_n^d$ affects the amplitude (vertical scale); the argument $x/h_n$ affects the width (horizontal scale).


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right)$$

Suppose φ(·) is a 2-d Gaussian pdf. (Figure: the shape of δ_n(x) for decreasing values of h_n.)


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right), \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$$

- When h_n is very large, δ_n(x) will be broad with small amplitude.
  - p_n(x) will be the superposition of n broad, slowly changing functions, i.e., smooth with low resolution.
- When h_n is very small, δ_n(x) will be sharp with large amplitude.
  - p_n(x) will be the superposition of n sharp pulses, i.e., variable/unstable with high resolution.


Parzen Window Estimation

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right), \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)$$

(Figure: Parzen window estimates for five samples, where φ(·) is a 2-d Gaussian pdf.)


Parzen Window Estimation

- Convergence conditions
  - To ensure convergence, i.e.,

$$\lim_{n \to \infty} E[p_n(x)] = p(x), \qquad \lim_{n \to \infty} \mathrm{Var}[p_n(x)] = 0$$

  - we have the following additional constraints:

$$\sup_{u} \varphi(u) < \infty, \qquad \lim_{\|u\| \to \infty} \varphi(u)\prod_{i=1}^{d} u_i = 0, \qquad \lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} n V_n = \infty$$


Illustrations

- One-dimensional case: $X \sim N(0, 1)$

$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}, \qquad p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad h_n = h_1/\sqrt{n}$$
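A minimal sketch of this one-dimensional illustration, assuming Python with NumPy: a Gaussian window with h_n = h_1/√n applied to samples from N(0, 1). The printed values should approach the true standard normal density on the grid as n grows; the function name and sample sizes are illustrative.

```python
# A minimal sketch of the 1-d illustration: Gaussian window
# phi(u) = exp(-u^2/2)/sqrt(2*pi) and shrinking width h_n = h_1/sqrt(n).
import numpy as np

def parzen_gaussian_1d(x, samples, h1):
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / hn            # (n, m) for m query points
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=0) / hn               # p_n(x) = (1/n) sum (1/h_n) phi(.)

rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 7)
for n in (16, 256, 4096):                      # the estimate sharpens as n grows
    data = rng.standard_normal(n)
    print(n, np.round(parzen_gaussian_1d(xs, data, h1=1.0), 3))
```
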


Illustrations

- One-dimensional case (further examples with the same Gaussian window and $h_n = h_1/\sqrt{n}$).


Illustrations

- Two-dimensional case:

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n^2}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad h_n = h_1/\sqrt{n}$$


Classification Example

(Figure: Parzen-window classification with a smaller window vs. a larger window.)


Choosing Window Function

- V_n must approach zero as n → ∞, but at a rate slower than 1/n, e.g.,

$$V_n = V_1/\sqrt{n}$$

- The value of the initial volume V_1 is important.
- In some cases, a cell volume that is proper for one region may be unsuitable in a different region.


k_n-Nearest Neighbor

$$p_n(x) = \frac{k_n / n}{V_n}$$

- Fix k_n and then determine V_n
- To estimate p(x), we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g.,

$$k_n = \sqrt{n} \quad \text{(a principled rule to choose } k_n\text{)} \qquad \Rightarrow \qquad \lim_{n \to \infty} k_n = \infty, \quad \lim_{n \to \infty} V_n = 0$$
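A minimal one-dimensional sketch of the k_n-nearest-neighbor estimate, assuming Python with NumPy; the helper knn_density_1d is an illustrative name, the rule k_n = √n follows the slide, and the sample size is arbitrary.

```python
# A minimal sketch of the k_n-nearest-neighbor estimate: fix k_n (e.g. k_n = sqrt(n)),
# grow an interval around x until it holds k_n samples, and use p_n(x) = (k_n/n)/V_n.
import numpy as np

def knn_density_1d(x, samples, kn=None):
    n = len(samples)
    if kn is None:
        kn = max(1, int(round(np.sqrt(n))))    # principled rule k_n = sqrt(n)
    dists = np.sort(np.abs(samples - x))
    r = dists[kn - 1]                          # radius of the cell containing kn samples
    Vn = 2 * r                                 # 1-d "volume" (interval length)
    return (kn / n) / Vn

rng = np.random.default_rng(3)
data = rng.standard_normal(1000)
print(knn_density_1d(0.0, data))               # compare with 1/sqrt(2*pi) ~ 0.399
```
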


k_n-Nearest Neighbor

(Figure: eight points in one dimension (n = 8, d = 1); red curve: k_n = 3, black curve: k_n = 5.)

(Figure: thirty-one points in two dimensions (n = 31, d = 2); black surface: k_n = 5.)


Estimation of A Posteriori Probability

P_n(ω_i | x) = ?

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$

$$\text{where} \quad p_n(x, \omega_i) = \frac{k_i / n}{V_n}, \qquad \sum_{j=1}^{c} p_n(x, \omega_j) = \frac{k_n / n}{V_n}$$

Estimation of A Posteriori Probability

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$

- The value of V_n or k_n can be determined based on the Parzen-window or k_n-nearest-neighbor technique.
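The estimate P_n(ω_i | x) = k_i / k_n can be computed directly as the fraction of labels among the k_n nearest samples. A minimal sketch, assuming Python with NumPy and synthetic two-class Gaussian data (the function name and data are illustrative, not from the slides):

```python
# A minimal sketch: estimate P_n(w_i | x) as k_i / k_n, the fraction of the
# k_n nearest samples to x that carry label i.
import numpy as np

def knn_posterior(x, X, y, kn, num_classes):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:kn]           # indices of the kn nearest samples
    counts = np.bincount(y[nearest], minlength=num_classes)
    return counts / kn                         # k_i / k_n for each class i

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([1.5, 1.5]), X, y, kn=9, num_classes=2))
```
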


Nearest Neighbor Classifier

- Store all training examples
- Given a new example x to be classified, search for the training example (x_i, y_i) whose x_i is most similar (or closest) to x, and predict y_i. (Lazy learning)

$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k_n}$$
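A minimal sketch of this lazy-learning scheme, assuming Python with NumPy: "fitting" just memorizes the training pairs, and prediction returns the label of the closest stored example. The class name and toy data are illustrative.

```python
# A minimal sketch of the nearest-neighbor rule: store all training pairs and,
# for a query x, return the label of the single closest stored example.
import numpy as np

class NearestNeighbor:
    def fit(self, X, y):                      # "training" = memorizing the data
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X - x, axis=1)
        return self.y[np.argmin(dists)]       # label of the closest example

X = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.3]]
y = [0, 0, 1, 1]
clf = NearestNeighbor().fit(X, y)
print(clf.predict(np.array([0.2, 0.1])), clf.predict(np.array([3.2, 3.0])))
```
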


Decision Boundaries

- Decision boundaries
  - The Voronoi diagram
  - Given a set of points, a Voronoi diagram describes the areas that are nearest to each given point.
  - These areas can be viewed as zones of control.


Decision Boundaries

- The decision boundary is formed by retaining only those line segments that separate different classes.
- The more training examples we have stored, the more complex the decision boundaries can become.


Decision Boundaries

- With a large number of examples and noise in the labels, the decision boundary can become nasty!
  - Note the islands in the figure; they are formed by noisy examples.
- If the nearest neighbor happens to be a noisy point, the prediction will be incorrect. How to deal with this?


Effect of k

- Different k values give different results:
  - Large k produces smoother boundaries
    - The impact of class-label noise is canceled out.
  - When k is too large, what will happen?
    - Oversimplified boundaries; e.g., with k = N we always predict the majority class.
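The effect of k can be seen in a small experiment. The sketch below (illustrative, assuming Python with NumPy) uses majority voting among the k nearest neighbors, injects one noisy label, and queries a point next to it: with k = 1 the noisy label is likely echoed, a moderate k tends to vote it away, and k = n always returns the overall majority class.

```python
# A minimal sketch of the k-NN rule with majority vote; larger k averages out
# label noise, while k = n collapses to predicting the majority class.
import numpy as np

def knn_predict(x, X, y, k):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y[nearest]).argmax()    # majority vote among k neighbors

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2.5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
y[0] = 1                                       # inject one noisy label into class 0
query = X[0] + 0.01                            # a query right next to the noisy point
for k in (1, 5, 60):                           # k = n = 60 votes with everyone
    print(k, knn_predict(query, X, y, k))
```
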


How to Choose k?

- Can we choose k to minimize the mistakes that we make on training examples? (training error)
  - What is the training error of the nearest-neighbor rule?
- Can we choose k to minimize the mistakes that we make on test examples? (test error)


How to Choose k?

- How do training error and test error change as we change the value of k?


Model Selection

- Choosing k for k-NN is just one of the many model selection problems we face in machine learning.
- Model selection is about choosing among different models
  - Linear regression vs. quadratic regression
  - k-NN vs. decision tree
- Heavily studied in machine learning; of crucial importance in practice.
- If we use training error to select models, we will always choose more complex ones.

Model Selection

- We can keep part of the labeled data apart as validation data.
  - Evaluate different k values based on the prediction accuracy on the validation data
  - Choose the k that minimizes the validation error

Validation can be viewed as another name for testing, but the name "testing" is typically reserved for final evaluation, whereas validation is mostly used for model selection.
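A minimal sketch of this selection procedure, assuming Python with NumPy: split the labeled data into a training part and a validation part, evaluate several candidate k values on the validation part, and keep the k with the lowest validation error. The split sizes and candidate values are arbitrary choices for illustration.

```python
# A minimal sketch of selecting k with a held-out validation set.
import numpy as np

def knn_predict(x, X, y, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[nearest]).argmax()

def validation_error(k, X_tr, y_tr, X_val, y_val):
    preds = np.array([knn_predict(x, X_tr, y_tr, k) for x in X_val])
    return np.mean(preds != y_val)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(X))
train, val = idx[:150], idx[150:]              # hold out 50 points for validation
errors = {k: validation_error(k, X[train], y[train], X[val], y[val])
          for k in (1, 3, 5, 9, 15)}
print(errors, "best k:", min(errors, key=errors.get))
```
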


Model Selection

- The impact of validation set size
  - If we reserve only one point for our validation set, should we trust the validation error as a reliable estimate of our classifier's performance?
  - The larger the validation set, the more reliable our model selection choices are.
  - When the total labeled set is small, we might not be able to get a big enough validation set, leading to unreliable model selection decisions.


Model Selection

- K-fold Cross-Validation
  - Perform learning/testing K times
  - Each time, reserve one subset as the validation set and train on the rest

Special case: leave-one-out cross-validation
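A minimal sketch of K-fold cross-validation for choosing k, assuming Python with NumPy; the fold count K = 5, the synthetic data, and the candidate k values are illustrative.

```python
# A minimal sketch of K-fold cross-validation: split the data into K folds,
# hold each fold out once for validation, and average the validation errors.
import numpy as np

def knn_predict(x, X, y, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[nearest]).argmax()

def cv_error(k, X, y, K=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    errs = []
    for i in range(K):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(K) if j != i])
        preds = np.array([knn_predict(x, X[tr], y[tr], k) for x in X[val]])
        errs.append(np.mean(preds != y[val]))
    return np.mean(errs)                       # average over the K held-out folds

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
for k in (1, 3, 7):
    print(k, round(cv_error(k, X, y), 3))
```
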


Other issues of kNN

- It can be computationally expensive to find the nearest neighbors!
  - Speed up the computation by using smart data structures to quickly search for approximate solutions
- For large data sets, it requires a lot of memory
  - Remove unimportant examples


Final words on kNN

- kNN is what we call lazy learning (vs. eager learning)
  - Lazy: learning only occurs when you see the test example
  - Eager: learn a model before you see the test example; training examples can be thrown away after learning
- Advantages:
  - Conceptually simple, easy to understand and explain
  - Very flexible decision boundaries
  - Not much learning at all!
- Disadvantages:
  - It can be hard to find a good distance measure
  - Irrelevant features and noise can be very detrimental
  - Typically cannot handle more than 30 attributes
  - Computational cost: requires a lot of computation and memory


Distance Metrics

- The distance measure is an important factor for the nearest-neighbor classifier, e.g.,
  - To achieve invariant pattern recognition and data mining results.

(Figure: the effect of changing units on nearest-neighbor relationships.)




Properties of a Distance Metric

- Nonnegativity: $D(a, b) \ge 0$
- Reflexivity: $D(a, b) = 0$ iff $a = b$
- Symmetry: $D(a, b) = D(b, a)$
- Triangle Inequality: $D(a, b) + D(b, c) \ge D(a, c)$
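These four properties can be spot-checked numerically for a given distance function. The sketch below (illustrative, assuming Python with NumPy) checks them for the Euclidean distance on random vectors; it is a sanity check on sampled points, not a proof.

```python
# A minimal sketch: numerically spot-checking the four metric properties
# for the Euclidean distance on random vectors.
import numpy as np

def D(a, b):
    return np.linalg.norm(a - b)

rng = np.random.default_rng(8)
for _ in range(1000):
    a, b, c = rng.normal(size=(3, 4))
    assert D(a, b) >= 0                           # nonnegativity
    assert np.isclose(D(a, a), 0)                 # reflexivity: D(a, a) = 0
    assert np.isclose(D(a, b), D(b, a))           # symmetry
    assert D(a, b) + D(b, c) >= D(a, c) - 1e-12   # triangle inequality (with slack)
print("all checks passed")
```
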


Minkowski Metric (Lp Norm)

$$\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$$

1. L1 norm (Manhattan or city block distance):
$$L_1(a, b) = \|a - b\|_1 = \sum_{i=1}^{d} |a_i - b_i|$$

2. L2 norm (Euclidean distance):
$$L_2(a, b) = \|a - b\|_2 = \left(\sum_{i=1}^{d} |a_i - b_i|^2\right)^{1/2}$$

3. L∞ norm (chessboard distance):
$$L_\infty(a, b) = \|a - b\|_\infty = \lim_{p \to \infty}\left(\sum_{i=1}^{d} |a_i - b_i|^p\right)^{1/p} = \max_i |a_i - b_i|$$
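A minimal sketch of the Minkowski family, assuming Python with NumPy; the helper minkowski is an illustrative name, with the p = ∞ case handled as the coordinate-wise maximum.

```python
# A minimal sketch of the Minkowski family: L1 (Manhattan), L2 (Euclidean)
# and the L-infinity (chessboard) limit max_i |a_i - b_i|.
import numpy as np

def minkowski(a, b, p):
    if np.isinf(p):
        return np.max(np.abs(a - b))           # L_inf: chessboard distance
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1),       # L1 = 7
      minkowski(a, b, 2),       # L2 = 5
      minkowski(a, b, np.inf))  # L_inf = 4
```
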
Summary

- Basic setting for non-parametric techniques
  - Let the data speak for themselves
  - No parametric form is assumed for the class-conditional pdf
  - Estimate the class-conditional pdf from training examples
  - Make predictions based on Bayes' theorem
- Fundamental result in density estimation

$$p_n(x) = \frac{k_n / n}{V_n}$$


Summary

$$p_n(x) = \frac{k_n / n}{V_n}$$

- Parzen Windows
  - Fix V_n and then determine k_n

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$

Window function being a pdf + window width + training data → Parzen pdf


Summary

- k_n-Nearest-Neighbor
  - Fix k_n and then determine V_n
  - To estimate p(x), we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g.,

$$k_n = \sqrt{n} \quad \text{(a principled rule to choose } k_n\text{)} \qquad \Rightarrow \qquad \lim_{n \to \infty} k_n = \infty, \quad \lim_{n \to \infty} V_n = 0$$


MIMA Group

Any Questions?
