Nonparametric Techniques
Dr. Manoj BR
Assistant Professor
Department of Electronics & Electrical Engineering
Indian Institute of Technology Guwahati
Introduction
So far, we treated supervised learning under the assumption that the forms of the underlying probability densities were known
When the data are generated from a density function whose form is known (for example, a normal distribution) except for a finite number of unknown parameters, the model is called a parametric model
In most pattern recognition applications, the common parametric probability density models rarely fit the densities actually encountered in practice
Here, we shall examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known
Nonparametric Techniques
If the density estimates are satisfactory, they can be substituted for the true densities when designing the classifier
The histogram is obtained by dividing each feature axis into a number of bins of width $\Delta$ and approximating the density at each value $x$ by the fraction of the $n$ points that fall inside the corresponding bin:
$$p(x) \approx \frac{k}{n\Delta},$$
where $k$ is the number of points in the bin containing $x$
• If the bin width $\Delta$ is very small (a large number of bins), then the estimated density is very spiky (i.e., noisy), and the main trends in the data may be obscured (masked)
• If the bin width $\Delta$ is very large (a small number of bins), then the true structure of the density is smoothed out and smaller features in the distribution of the data are not captured
In practice, we need to find a value of $\Delta$ that compromises between these two issues, as the sketch below illustrates
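To make this trade-off concrete, here is a minimal sketch of histogram density estimation; the N(0, 1) sample data and the specific bin widths are illustrative assumptions, not taken from the slides.

```python
# A minimal sketch of histogram density estimation, assuming illustrative
# N(0, 1) sample data and bin widths (not taken from the slides).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)                       # n = 500 one-dimensional samples

for width in (0.05, 0.5, 2.0):                 # too small, moderate, too large
    bins = np.arange(x.min(), x.max() + width, width)
    counts, _ = np.histogram(x, bins=bins)
    density = counts / (len(x) * width)        # fraction in bin, divided by bin width
    print(f"width={width}: {len(counts)} bins, peak density={density.max():.3f}")
```

With the smallest width the estimate is spiky (a high, noisy peak); with the largest it is flattened out, matching the two failure modes above.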
Histogram-based Density Estimation
Example
• To give a concrete example, we will consider the passengers of the Titanic
• There were approximately 1300 passengers on the Titanic (not counting crew), and we
have reported ages for 756 of them
• We call the relative proportions of different ages among the passengers the age distribution of the passengers
Histogram-based Density Estimation
Histograms depend on the chosen bin width. Here, the same age distribution of Titanic passengers is shown with four different bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years. For this distribution, a bin width of one year is too small and a bin width of fifteen years is too large, whereas bin widths between three and five years work well
Drawbacks of Histogram-based Density Estimation
The estimated density is not smooth and has discontinuities at the boundaries of the histogram bins
• Consider a $d$-dimensional feature space. If we divide each variable into $M$ intervals, we will end up with $M^d$ bins; for example, $M = 10$ intervals in $d = 10$ dimensions already gives $10^{10}$ bins
• A huge number of examples would be required to obtain good estimates. Otherwise, most bins would be empty and the density would be approximated by zero there
Density Estimation
The most fundamental techniques rely on the fact that the probability $P$ that a vector $\mathbf{x}$ will fall in a region $\mathcal{R}$ is given by:
$$P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}'$$
Thus $P$ is a smoothed or averaged version of the density function $p(\mathbf{x})$. We can estimate this smoothed value of $p(\mathbf{x})$ by estimating the probability $P$
Suppose that $n$ samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are drawn i.i.d. according to the probability law $p(\mathbf{x})$
• The probability that $k$ of these $n$ fall in $\mathcal{R}$ is given by the binomial law, with $P$ as the probability of each sample falling in $\mathcal{R}$:
$$P_k = \binom{n}{k} P^k (1 - P)^{n-k}$$
So we expect that the ratio $k/n$ will be a very good estimate for the probability $P$, and hence for the smoothed density function. This estimate is especially accurate when $n$ is very large, since
$$E\!\left[\frac{k}{n}\right] = P$$
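A small simulation makes this concrete; the region $\mathcal{R} = [-0.5, 0.5]$ and the N(0, 1) samples below are illustrative assumptions, not from the slides.

```python
# A small simulation (illustrative) showing that the relative frequency k/n
# concentrates around P as n grows.
import numpy as np

rng = np.random.default_rng(1)
# Region R = [-0.5, 0.5] for N(0, 1) samples; true P = 2*Phi(0.5) - 1 ≈ 0.3829
for n in (100, 1_000, 100_000):
    x = rng.normal(size=n)
    k = np.count_nonzero(np.abs(x) <= 0.5)     # number of samples falling in R
    print(f"n={n}: k/n = {k / n:.4f}")
```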
• The region $\mathcal{R}$ must be sufficiently small so that the density $p(\mathbf{x})$ is approximately constant over the region
• The region $\mathcal{R}$ must be sufficiently large so that the number $k$ of points falling inside it is sufficient to yield a sharply peaked binomial
Density Estimation
Let $V_n$ be the volume of $\mathcal{R}_n$, $k_n$ be the number of samples falling in $\mathcal{R}_n$, and $p_n(\mathbf{x})$ be the $n$-th estimate for $p(\mathbf{x})$:
$$p_n(\mathbf{x}) = \frac{k_n / n}{V_n}$$
Density Estimation
If $p_n(\mathbf{x})$ is to converge to $p(\mathbf{x})$, three conditions are required:
• $\lim_{n \to \infty} V_n = 0$, so that the space-averaged $P/V$ converges to $p(\mathbf{x})$
• $\lim_{n \to \infty} k_n = \infty$, so that the frequency ratio $k_n/n$ converges to $P$
• $\lim_{n \to \infty} k_n / n = 0$; this ensures that $p_n(\mathbf{x})$ converges at all. It also says that although a huge number of samples will fall within the region $\mathcal{R}_n$, they will form a negligibly small fraction of the total number of samples
Parzen Windows
The Parzen-window approach to estimating densities can be introduced by temporarily assuming that the region $\mathcal{R}_n$ is a $d$-dimensional hypercube
If $h_n$ is the length of an edge of that hypercube, then its volume is given by:
$$V_n = h_n^d$$
Window function
• We can obtain an analytic expression for $k_n$, the number of samples falling in the hypercube, by defining the following window function:
$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \quad j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$
$\varphi(\mathbf{u})$ defines a unit hypercube centered at the origin. It follows that $\varphi((\mathbf{x} - \mathbf{x}_i)/h_n)$ is equal to unity if $\mathbf{x}_i$ falls within the hypercube of volume $V_n$ centered at $\mathbf{x}$, and is zero otherwise
Parzen Windows
The number of samples in this hypercube is therefore given by:
$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$
We know that $p_n(\mathbf{x}) = (k_n/n)/V_n$; substituting $k_n$ gives the Parzen-window estimate:
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$
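The following is a minimal sketch of this estimate with the hypercube window; the sample data, dimensionality, and edge length h are assumptions for illustration.

```python
# A minimal sketch of the Parzen-window estimate with a hypercube window;
# the sample data, dimensionality, and edge length h are assumptions.
import numpy as np

def phi(u):
    """Unit hypercube window: 1 if every coordinate of u lies in [-1/2, 1/2]."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    n, d = samples.shape
    V = h ** d                                  # V_n = h_n^d, volume of the cell
    k = phi((x - samples) / h).sum()            # k_n: samples inside the cube at x
    return k / (n * V)                          # p_n(x) = (k_n / n) / V_n

rng = np.random.default_rng(2)
samples = rng.normal(size=(1000, 2))            # n = 1000 points in d = 2 dimensions
print(parzen_estimate(np.zeros(2), samples, h=0.5))  # true p(0) = 1/(2*pi) ≈ 0.159
```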
A potential remedy for the problem of the unknown 'best' window function is to let the cell volume be a function of the training data, rather than some arbitrary function of the overall number of samples
One way: Specify $k_n$ as some function of $n$, such as $k_n = \sqrt{n}$. Then, we grow the volume $V_n$ until it encloses $k_n$ neighbors of $\mathbf{x}$. That is, let the volume be a function of the training data
Nearest Neighbor Estimation
To estimate $p(\mathbf{x})$ from $n$ training samples:
• Center a hypersphere or cell around $\mathbf{x}$ and let it grow until it captures $k_n$ samples, where $k_n$ is a function of $n$. These samples are the $k_n$ nearest neighbors of $\mathbf{x}$
If the density is high near $\mathbf{x}$, the cell will be relatively small, which leads to good resolution
If the density is low, the cell will grow large, but it will stop soon after it enters regions of higher
density
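A minimal sketch of this estimator follows; the data, the choice $k_n = \sqrt{n}$, and the use of a Euclidean ball as the cell are illustrative assumptions.

```python
# A minimal sketch of k_n-nearest-neighbor density estimation; the data and
# the choice k_n = sqrt(n) are illustrative assumptions.
import numpy as np
from math import gamma, pi

def knn_density(x, samples, k):
    n, d = samples.shape
    r = np.sort(np.linalg.norm(samples - x, axis=1))[k - 1]  # radius to the k-th neighbor
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d          # volume of the d-ball of radius r
    return k / (n * V)                                       # p_n(x) = (k / n) / V

rng = np.random.default_rng(3)
samples = rng.normal(size=(1000, 2))
k = int(np.sqrt(len(samples)))                 # k_n = sqrt(n)
print(knn_density(np.zeros(2), samples, k))    # true p(0) = 1/(2*pi) ≈ 0.159
```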
Nearest Neighbor Estimation
In either case, we want $k_n$ to go to infinity as $n$ goes to infinity, since this assures us that $k_n/n$ will be a good estimate of the probability that a point will fall in the cell of volume $V_n$
However, we also want $k_n$ to grow sufficiently slowly that the size of the cell needed to capture $k_n$ training samples will shrink to zero
The conditions $\lim_{n \to \infty} k_n = \infty$ and $\lim_{n \to \infty} k_n / n = 0$ are necessary and sufficient for $p_n(\mathbf{x})$ to converge to $p(\mathbf{x})$ in probability at all points where $p(\mathbf{x})$ is continuous
Suppose that we place a cell of volume $V$ around $\mathbf{x}$ and capture $k$ samples, $k_i$ of which turn out to be labelled $\omega_i$. Then the estimate for the joint probability $p(\mathbf{x}, \omega_i)$ is:
$$p_n(\mathbf{x}, \omega_i) = \frac{k_i / n}{V}$$
and a reasonable estimate for $P(\omega_i \mid \mathbf{x})$ is:
$$P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}$$
That is, the estimate of the a posteriori probability that $\omega_i$ is the state of nature is merely the fraction of the samples within the cell that are labelled $\omega_i$
Estimation of A Posteriori Probabilities
Two approaches for choosing the size of the cell:
• Shrink the regions as some specified function of $n$, such as $V_n = 1/\sqrt{n}$ (Parzen windows)
• Specify $k_n$ as some function of $n$, such as $k_n = \sqrt{n}$ ($k_n$-nearest neighbors)
• In either case, as $n$ goes to infinity, an infinite number of samples will fall within the infinitely small cell
The Nearest-Neighbor Rule
Let $\mathcal{D}^n = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ denote a set of $n$ labelled prototypes, and let $\mathbf{x}' \in \mathcal{D}^n$ be the prototype nearest to a test point $\mathbf{x}$. Then the nearest-neighbor rule for classifying $\mathbf{x}$ is to assign it the label associated with $\mathbf{x}'$
The distance from the test point to the prototypes is calculated using a distance measure, such as the Euclidean distance
The nearest-neighbor rule is a sub-optimal procedure; its use will usually lead to an error rate greater than the minimum possible, the Bayes rate
Visual representation of how the kNN algorithm works. The distance measure can be Euclidean, Manhattan, Minkowski, etc. The value of $k$ is usually chosen to be odd so that majority votes cannot tie
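A from-scratch sketch of the rule with Euclidean distance follows; the prototypes, labels, query point, and function name are illustrative, and setting k = 1 reduces it to the nearest-neighbor rule.

```python
# A from-scratch sketch of the (k-)nearest-neighbor rule with Euclidean
# distance; the prototypes, labels, and query point are illustrative.
import numpy as np

def knn_classify(x, prototypes, labels, k=1):
    dist = np.linalg.norm(prototypes - x, axis=1)   # distance to every prototype
    nearest = labels[np.argsort(dist)[:k]]          # labels of the k nearest prototypes
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]                # majority vote (k = 1: NN rule)

prototypes = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.9, 1.0]), prototypes, labels, k=3))  # -> 1
```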
The Nearest-Neighbor Rule
Let $\omega_m$ be the class with maximum a posteriori probability, given by:
$$P(\omega_m \mid \mathbf{x}) = \max_i P(\omega_i \mid \mathbf{x})$$
The Bayes decision rule always selects the class which results in minimum risk, i.e., the highest posterior probability, which is $\omega_m$. The nearest-neighbor rule allows us to partition the feature space into cells consisting of all points closer to a given training point than to any other training points. All points in such a cell are thus labelled by the category of that training point: a so-called Voronoi tessellation of the space
When $P(\omega_m \mid \mathbf{x})$ is close to unity, the nearest-neighbor selection is almost always the same as the Bayes selection
When $P(\omega_m \mid \mathbf{x})$ is close to $1/c$, so that all classes are essentially equally likely, the selections made by the nearest-neighbor rule and the Bayes decision rule are rarely the same
Let $P^*$ be the minimum possible error rate (the Bayes rate), and $P$ be the asymptotic error rate of the nearest-neighbor rule; then
$$P^* \le P \le P^*\left(2 - \frac{c}{c-1}\,P^*\right)$$
The k-Nearest-Neighbor Rule
It is an extension of the nearest-neighbor rule
This rule classifies $\mathbf{x}$ by assigning it the label most frequently represented among the $k$ nearest samples
In other words, a decision is made by examining the labels on the $k$ nearest neighbors and taking a vote
As in the single-nearest-neighbor case, the labels on each of the $k$ nearest neighbors are random variables, which independently assume the values $\omega_1$ and $\omega_2$ with probabilities $P(\omega_1 \mid \mathbf{x})$ and $P(\omega_2 \mid \mathbf{x})$ (considering a two-class problem with $k$ odd to avoid ties)
If $P(\omega_m \mid \mathbf{x})$ is the larger a posteriori probability, then the Bayes decision rule always selects $\omega_m$
The $k$-nearest-neighbor rule selects $\omega_m$ if a majority of the $k$ nearest neighbors are labeled $\omega_m$, an event of probability
$$\sum_{i=(k+1)/2}^{k} \binom{k}{i}\, P(\omega_m \mid \mathbf{x})^{i}\, \left[1 - P(\omega_m \mid \mathbf{x})\right]^{k-i}$$
In general, the larger the value of $k$, the greater the probability that $\omega_m$ will be selected
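To see how the vote sharpens with $k$, this short sketch evaluates the majority-vote probability above for an assumed posterior $P(\omega_m \mid \mathbf{x}) = 0.7$.

```python
# Evaluating the majority-vote probability for an assumed posterior
# P(omega_m | x) = 0.7 in a two-class problem with k odd.
from math import comb

def prob_majority(P, k):
    return sum(comb(k, i) * P**i * (1 - P)**(k - i)
               for i in range((k + 1) // 2, k + 1))

for k in (1, 3, 5, 11, 51):
    print(k, round(prob_majority(0.7, k), 4))   # approaches 1 as k grows
```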
The k-Nearest-Neighbor Rule
Consider a U-shaped dataset as shown:
In this dataset, we can observe that the classes are in the form of an interlocked
U. This is a non-linear dataset
The blue points belong to class 0 and the orange points belong to class 1
The k-Nearest-Neighbor Rule
Now, when the kNN algorithm is applied with different values of $k$, the decision boundary changes accordingly:
In the top row, we have $k$ = 1, 5, 20 (from left to right). In the bottom row, we have $k$ = 30, 40, 60 (from left to right)
In the case of $k$ = 1, we are overfitting the model, whereas in the case of $k$ = 60, we are underfitting the model. Underfitting refers to a model that can neither model the training data nor generalize to new data
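Plots of this kind can be sketched with scikit-learn; the two-moons data below merely stands in for the interlocked-U dataset, and the noise level and grid are illustrative assumptions rather than the slides' exact setup.

```python
# A sketch of kNN decision boundaries for the six values of k shown above;
# make_moons stands in for the interlocked-U dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, k in zip(axes.ravel(), (1, 5, 20, 30, 40, 60)):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)            # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)      # training points
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()
```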
The k-Nearest-Neighbor Rule
There is no optimal value of $k$ that suits all kinds of datasets. Each dataset has its own requirements (one practical recipe for choosing $k$ is sketched after this list)
In the case of a small $k$, the noise will have a higher influence on the result, and a large $k$ will make the computation more expensive
A small $k$ gives the most flexible fit, which will have low bias but high variance, while a large $k$ will produce a smoother decision boundary, which means lower variance but higher bias
A higher $k$ averages more points in each prediction and hence is more resilient to outliers
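Since no single $k$ suits every dataset, one common practical recipe is to pick $k$ by cross-validated accuracy over a grid of candidate values; the data and grid below are assumptions for illustration, not part of the slides.

```python
# Picking k by 5-fold cross-validated accuracy over a grid of candidates
# (a sketch with assumed data).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 11, 21, 41, 61)}
best_k = max(scores, key=scores.get)
print(f"best k by 5-fold CV: {best_k}")
```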