
Nonparametric Techniques

Dr. Manoj BR
Assistant Professor
Department of Electronics & Electrical Engineering
Indian Institute of Technology Guwahati
Introduction

 So far, we treated supervised learning under the assumption that the forms of the underlying

density functions were known

 When the data are generated from a density function whose form is known (e.g., a normal distribution),

except for a finite number of unknown parameters, the model is called a parametric model

 In most pattern recognition applications, the common parametric probability density models rarely fit

the densities actually encountered in practice

 Here, we shall examine nonparametric procedures that can be used with arbitrary distributions

and without the assumption that the forms of the underlying densities are known
Nonparametric Techniques

 Types of nonparametric methods of interest in pattern recognition:

• Generative: Estimation of the class-conditional density function from samples. If these

estimates are satisfactory, they can be substituted for the true densities when designing the

classifier

• Discriminative: Estimation of the a posteriori probabilities from the samples. This is

closely related to nonparametric design procedures such as the nearest-neighbor rule,

which bypass probability estimation and go directly to decision functions

• As we have more data, we should be able to learn more complex models


Histogram-based Density Estimation
 Consider a set of N i.i.d. samples {x^t}, t = 1, ..., N, drawn from an unknown distribution p(x)

 The histogram is obtained by dividing each feature axis into a number of bins of width h and
approximating the density at each value x by the fraction of the points that fall inside the
corresponding bin:

$$\hat{p}(x) = \frac{\#\{x^{t} \text{ in the same bin as } x\}}{N h}$$
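As a concrete illustration, here is a minimal sketch of this estimator in Python (using NumPy; the function name `histogram_density`, the synthetic data, and the chosen bin width are illustrative assumptions, not part of the slides):

```python
import numpy as np

def histogram_density(samples, x, bin_width):
    """Estimate p(x) as the fraction of samples in the same bin as x,
    divided by N * h (1-D histogram density estimate)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Bin edges start at the smallest sample and advance in steps of h
    origin = samples.min()
    bin_index = np.floor((x - origin) / bin_width)
    in_same_bin = np.floor((samples - origin) / bin_width) == bin_index
    return in_same_bin.sum() / (n * bin_width)

# Example: density of a standard normal sample, evaluated at x = 0
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(histogram_density(data, x=0.0, bin_width=0.5))  # roughly 0.4 for a standard normal at 0
```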
Histogram-based Density Estimation
 The number of bins (or the bin width h) acts as a smoothing parameter

• If the bin width h is very small (a large number of bins), then the estimated density is very spiky
(i.e., noisy), so the main trends in the data may be obscured (masked)

• If the bin width h is very large (a small number of bins), then the true structure of the density is smoothed out and
smaller features in the distribution of the data are not captured

 In practice, we need to find a value of the bin width h that compromises between these two issues
Histogram-based Density Estimation
 Example
• To give a concrete example, we will consider the passengers of the Titanic
• There were approximately 1300 passengers on the Titanic (not counting crew), and we
have reported ages for 756 of them
• We call the relative proportions of different ages among the passengers the age
distribution of the passengers
Histogram-based Density Estimation

Histograms depend on the chosen bin width. Here, the same age distribution of Titanic passengers is shown
with four different bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years. For the age
distribution of Titanic passengers, a bin width of one year is too small and a bin width of fifteen years is
too large, whereas bin widths between three and five years work well
Drawbacks of Histogram-based Density Estimation

 The estimated density is not smooth and has discontinuities at the boundaries of the histogram bins

 Histograms do not generalize well in high dimensions

• Consider a D-dimensional feature space. If we divide each variable into M intervals, we will end
up with M^D bins; for example, M = 10 intervals in D = 10 dimensions already gives 10^10 bins
• A huge number of examples would be required to obtain good estimates. Otherwise, most
bins would be empty and the density would be approximated by zero
Density Estimation
 The most fundamental techniques rely on the fact that the probability P that a vector x will fall in a
region R is given by:

$$P = \int_{R} p(\mathbf{x}')\, d\mathbf{x}'$$

 Thus P is a smoothed or averaged version of the density function p(x). We can estimate this smoothed
value of p(x) by estimating the probability P

 Suppose that n samples x_1, ..., x_n are drawn i.i.d. according to the probability law p(x)

• The probability that k of these n samples fall in R is given by the binomial law, with P as the probability of
each sample falling in R:

$$P_k = \binom{n}{k} P^{k} (1 - P)^{n-k}$$

The expected value of k is:

$$E[k] = n P$$

Density Estimation
 This binomial distribution for k peaks very sharply about the mean

 So we expect that the ratio k/n will be a very good estimate for the probability P, and hence for the
smoothed density function. This estimate is especially accurate when n is very large

$$E\!\left[\frac{k}{n}\right] = P$$

Figure: As n grows from 20 to 100, the binomial distribution peaks more sharply around the mean
Density Estimation
 Next, we assume that p(x) is continuous and that the region R is so small that p does not vary appreciably
within it, so we can write:

$$P = \int_{R} p(\mathbf{x}')\, d\mathbf{x}' \cong p(\mathbf{x}) V$$

where x is a point within R and V is the volume enclosed by R

Figure: The probability of a data point x falling in the green region R is approximately p(x)V
Density Estimation
 Combining the equations above:

$$\frac{k}{n} \cong p(\mathbf{x}) V$$

 We arrive at the following estimate for p(x):

$$p(\mathbf{x}) \cong \frac{k/n}{V}$$

 The validity of our estimate depends on two contradictory assumptions:

• The region R must be sufficiently small so that the density is approximately constant over the
region
• The region R must be sufficiently large so that the number k of points falling inside it is sufficient
to yield a sharply peaked binomial
Density Estimation

 To estimate the density at x, we form a sequence of regions R_1, R_2, ..., each containing x

• The first region R_1 is to be used with 1 sample

• The second region R_2 with 2 samples, and so on

 Let V_n be the volume of R_n, k_n be the number of samples falling in R_n, and p_n(x) be the n-th estimate for p(x):

$$p_n(\mathbf{x}) = \frac{k_n / n}{V_n}$$
Density Estimation

 For p_n(x) to converge to p(x), the following conditions must be satisfied:

• $\lim_{n \to \infty} V_n = 0$: this ensures that the space-averaged P/V will converge to p(x)

• $\lim_{n \to \infty} k_n = \infty$: this ensures that the binomial distribution will peak sharply around P

• $\lim_{n \to \infty} k_n / n = 0$: this ensures that p_n(x) converges at all. It also says that although a huge number of samples will fall
within the region R_n, they will form a negligibly small fraction of the total number of samples
Parzen Windows
 The Parzen-window approach to estimating densities can be introduced by temporarily assuming
that the region R_n is a d-dimensional hypercube

 If h_n is the length of an edge of that hypercube, then its volume is given by:

$$V_n = h_n^{d}$$

 Window function

• We can obtain an analytic expression for k_n, the number of samples falling in the hypercube, by
defining the following window function:

$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \quad j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$

 φ(u) defines a unit hypercube centered at the origin. It follows that φ((x − x_i)/h_n) is equal to unity if x_i falls within the
hypercube of volume V_n centered at x, and is zero otherwise
Parzen Windows
 The number of samples in this hypercube is therefore given by:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$

 We know that:

$$p_n(\mathbf{x}) = \frac{k_n / n}{V_n}$$

 Therefore, substituting k_n into this expression, we obtain the Parzen-window estimate:

$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$
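A minimal sketch of this hypercube Parzen-window estimate, assuming NumPy is available; the function name and the synthetic Gaussian data are illustrative, not from the slides:

```python
import numpy as np

def parzen_window_estimate(x, samples, h):
    """Hypercube Parzen-window density estimate at point x.
    phi(u) = 1 if every component |u_j| <= 1/2, else 0; V = h^d."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    n, d = samples.shape
    u = (x - samples) / h                      # shape (n, d)
    phi = np.all(np.abs(u) <= 0.5, axis=1)     # 1 if the sample falls in the cube around x
    k = phi.sum()                              # number of samples in the hypercube
    volume = h ** d
    return k / (n * volume)

# Example: 2-D standard normal samples, density estimated at the origin
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 2))
print(parzen_window_estimate([0.0, 0.0], data, h=0.5))  # near 1/(2*pi) ≈ 0.159
```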
Nearest Neighbor Estimation

 A potential remedy for the problem of the unknown ‘best’ window function is to let the cell volume
be a function of the training data, rather than some arbitrary function of the overall number of samples

 One way: specify k_n as some function of n, such as k_n = √n. Then, we grow the volume until it encloses k_n
neighbors of x. That is, the volume is a function of the training data
Nearest Neighbor Estimation
 To estimate p(x) from n training samples:

• Center a hypersphere or cell around x and let it grow until it captures k_n samples, where k_n is a
function of n. These samples are the k_n nearest neighbors of x

• p_n(x) is determined by the volume V_n of the hypersphere:

$$p_n(\mathbf{x}) = \frac{k_n / n}{V_n}$$

Because V_n depends on k_n, which depends on n, it is a function of the training data size

 If the density is high near x, the cell will be relatively small, which leads to good resolution

 If the density is low, the cell will grow large, but it will stop soon after it enters regions of higher
density
Nearest Neighbor Estimation
 In either case, we want k_n to go to infinity as n goes to infinity, since this assures us that k_n/n will be a good
estimate of the probability that a point will fall in the cell of volume V_n

 However, we also want k_n to grow sufficiently slowly that the size of the cell needed to capture k_n training
samples will shrink to zero

 Thus, it is clear from the estimate above that the ratio k_n/n must go to zero

 The conditions $\lim_{n \to \infty} k_n = \infty$ and $\lim_{n \to \infty} k_n / n = 0$ are necessary and sufficient for p_n(x) to converge to p(x)
in probability at all points where p(x) is continuous

Figure: The size of the cell depends on the local density
Nearest Neighbor Estimation
 The parameter k acts as a smoothing parameter and needs to be optimized

Figure: Eight points in one dimension and the k-nearest-neighbor density
estimates, for k = 3 and 5
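A minimal sketch of the corresponding k-nearest-neighbor density estimate in one dimension, following p_n(x) = (k_n/n)/V_n with V taken as the length of the interval reaching the k-th neighbor; names and data are illustrative assumptions:

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """k-NN density estimate in 1-D: grow an interval around x until it
    contains k samples, then return (k/n) / V with V = 2 * r_k."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    distances = np.sort(np.abs(samples - x))
    r_k = distances[k - 1]          # distance to the k-th nearest neighbor
    volume = 2.0 * r_k              # length of the interval [x - r_k, x + r_k]
    return (k / n) / volume

rng = np.random.default_rng(2)
data = rng.normal(size=1000)
for k in (3, 5, 31):
    print(k, knn_density_1d(0.0, data, k))   # small k gives a noisier estimate
```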
 
Estimation of A Posteriori Probabilities
 A posteriori probabilities P(ω_i | x) can be estimated from a set of n labelled samples by using the samples
to estimate the densities involved

 Suppose that we place a cell of volume V around x and capture k samples, k_i of which turn out to be
labelled ω_i. Then the estimate for the joint probability p(x, ω_i) is:

$$p_n(\mathbf{x}, \omega_i) = \frac{k_i / n}{V}$$

 A reasonable estimate for P(ω_i | x) is:

$$P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}$$

 That is, the estimate of the a posteriori probability that ω_i is the state of nature is merely the
fraction of the samples within the cell that are labelled ω_i
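A minimal sketch of this posterior estimate, taking the fraction k_i/k of the k nearest labelled samples that belong to each class; the helper name and synthetic data are illustrative assumptions:

```python
import numpy as np

def knn_posterior(x, samples, labels, k):
    """Estimate P(omega_i | x) as k_i / k, where k_i is the number of the
    k nearest samples to x that carry label omega_i."""
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    labels = np.asarray(labels)
    distances = np.linalg.norm(samples - x, axis=1)
    nearest = labels[np.argsort(distances)[:k]]
    return {int(c): float((nearest == c).mean()) for c in np.unique(labels)}

# Two Gaussian classes in 2-D; query a point between them
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, size=(200, 2)), rng.normal(+1, 1, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
print(knn_posterior(np.array([0.2, 0.0]), X, y, k=15))
```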
Estimation of A Posteriori Probabilities
 Two approaches for choosing the size of the cell:

1. The Parzen-window approach: V_n would be some specified function of n, such as V_n = 1/√n

2. The k_n-nearest-neighbor approach: V_n would be expanded until some specified number of samples
were captured, such as k_n = √n

• In either case, as n goes to infinity, an infinite number of samples will fall within the infinitely
small cell
The Nearest-Neighbor Rule
 Let D^n = {x_1, ..., x_n} denote a set of n labelled prototypes

 Let x' ∈ D^n be the prototype nearest to a test point x. Then the nearest-neighbor rule for classifying x is to

assign it the label associated with x'

 The distance from the test point to the prototypes is calculated using a distance measure, such

as the Euclidean distance, Manhattan distance, or Minkowski distance

 The nearest-neighbor rule is a sub-optimal procedure; its use will usually lead to an error rate

greater than the minimum possible, the Bayes rate


The Nearest-Neighbor Rule
 When the number of samples is very large, it is reasonable to assume that x' is sufficiently close to x that
P(ω_i | x') ≅ P(ω_i | x). Since this is exactly the probability that nature will be in state ω_i, the nearest-
neighbor rule is effectively matching probabilities with nature
The Nearest-Neighbor Rule

Visual representation of how the kNN algorithm works. The distance measure can be Euclidean,
Manhattan, Minkowski, etc. The value of k is usually chosen to be odd to avoid ties
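A minimal sketch of the (k-)nearest-neighbor rule itself, using Euclidean distance and a majority vote; the function name and toy prototypes are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=1):
    """Assign x the label most common among its k nearest prototypes
    (k = 1 gives the plain nearest-neighbor rule)."""
    prototypes = np.atleast_2d(np.asarray(prototypes, dtype=float))
    distances = np.linalg.norm(prototypes - x, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]
    votes = Counter(np.asarray(labels)[nearest])
    return votes.most_common(1)[0][0]

# Tiny example with two classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["blue", "blue", "orange", "orange"])
print(knn_classify(np.array([0.2, 0.1]), X, y, k=1))   # 'blue'
print(knn_classify(np.array([0.8, 0.9]), X, y, k=3))   # 'orange'
```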
The Nearest-Neighbor Rule
 Let ω_m be the class with the maximum a posteriori probability, given by:

$$P(\omega_m \mid \mathbf{x}) = \max_{i} P(\omega_i \mid \mathbf{x})$$

 The Bayes decision rule always selects the class which results in minimum risk, i.e., the highest posterior

probability, which is ω_m. The nearest-neighbor rule partitions the feature space into cells consisting of all points

closer to a given training point than to any other training point. All points in such a cell are thus labelled

by the category of that training point, a so-called Voronoi tessellation of the space

In two dimensions, the nearest-neighbor algorithm leads to


a partitioning of the input space into Voronoi cells, each
labelled by the category of the training point it contains. In
three dimensions, the cells are three-dimensional, and the
decision boundary resembles the surface of a crystal
The Nearest-Neighbor Rule
 When P(ω_m | x) is close to unity, the nearest-neighbor selection is almost always the same as the Bayes
selection

 When P(ω_m | x) is close to 1/c, so that all classes are essentially equally likely, the selections made by the
nearest-neighbor rule and the Bayes decision rule are rarely the same

 The average probability of error P(e) is found by averaging the conditional error P(e | x) over all x:

$$P(e) = \int P(e \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

 Let P*(e | x) be the minimum possible value of P(e | x), and P* be the minimum possible value of P(e); then

$$P^*(e \mid \mathbf{x}) = 1 - P(\omega_m \mid \mathbf{x}) \qquad \text{and} \qquad P^* = \int P^*(e \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$
The k-Nearest-Neighbor Rule
 It is an extension of the nearest-neighbor rule

 This rule classifies x by assigning it the label most frequently represented among the k nearest samples

 In other words, a decision is made by examining the labels of the k nearest neighbors and taking a
vote

The k-nearest-neighbor query starts at the test
point x and grows a spherical region until it
encloses k training samples, and labels the test
point by a majority vote of these samples. In
this case, the test point would be labelled the
category of the black points

k is typically chosen to be an odd number to prevent ties


The k-Nearest-Neighbor Rule
 The k-nearest-neighbor rule attempts to match probabilities with nature

 As in the single-nearest-neighbor case, the labels on each of the k nearest neighbors are random
variables, which independently assume the value ω_i with probability P(ω_i | x), i = 1, 2

 If P(ω_m | x) is the larger a posteriori probability, then the Bayes decision rule always selects ω_m

 The single-nearest-neighbor rule selects ω_m with probability P(ω_m | x)

 The k-nearest-neighbor rule selects ω_m if a majority of the k nearest neighbors are labelled ω_m, an event of
probability:

$$\sum_{i=(k+1)/2}^{k} \binom{k}{i}\, P(\omega_m \mid \mathbf{x})^{i} \,\bigl[1 - P(\omega_m \mid \mathbf{x})\bigr]^{k-i}$$

 In general, the larger the value of k, the greater the probability that ω_m will be selected
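A short sketch that evaluates this probability, i.e., the chance that at least (k+1)/2 of the k neighbors carry the label ω_m when each carries it independently with probability P(ω_m | x); the function name is an illustrative assumption:

```python
from math import comb

def prob_majority(p_m, k):
    """Probability that a majority of k neighbors are labelled omega_m,
    where each label is omega_m independently with probability p_m (k odd)."""
    return sum(comb(k, i) * p_m**i * (1 - p_m)**(k - i)
               for i in range((k + 1) // 2, k + 1))

# As k grows, the k-NN rule selects omega_m with higher probability
for k in (1, 3, 5, 11, 51):
    print(k, round(prob_majority(0.7, k), 4))
```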
The k-Nearest-Neighbor Rule
 Consider a U-shaped dataset as shown:

 In this dataset, we can observe that the classes are in the form of an interlocked
U. This is a non-linear dataset
 The blue points belong to class 0 and the orange points belong to class 1
The k-Nearest-Neighbor Rule
 Now, when the kNN algorithm is applied with different values of k, the decision boundary changes
accordingly:

 In the top row, we have k = 1, 5, 20 (from left to right). In the bottom row, we have k = 30, 40, 60 (from left to
right)
 In the case of k = 1, we are overfitting the model, whereas in the case of k = 60, we are underfitting the model.
Underfitting refers to a model that can neither model the training data nor generalize to new data
The k-Nearest-Neighbor Rule
 There is no single optimal value of k that suits all kinds of datasets. Each dataset has its own

requirements

 In the case of a small k, the noise will have a higher influence on the result, while a large k will

make the method computationally expensive

 A small k gives the most flexible fit, which has low bias but high variance, whereas a large k gives

a smoother decision boundary, which means lower variance but higher bias

 A higher k averages more points in each prediction and hence is more resilient to outliers
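One common practical way to choose k, not prescribed in these slides, is cross-validation; a minimal sketch assuming scikit-learn is available and using a synthetic two-class dataset purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic non-linear (interlocked) two-class data, loosely like the U-shaped example
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Score a range of k values with 5-fold cross-validation and keep the best one
candidate_ks = [1, 3, 5, 11, 21, 41, 61]
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)   # very small k overfits, very large k underfits
```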
