
CHAPTER ONE

1.0 INTRODUCTION

This project seeks to explore the concept of density estimation, especially kernel density estimation, with a close look at bandwidth selection. There are several methods of density estimation, namely: the histogram, the naïve estimator, the nearest neighbour estimator, the variable kernel, orthogonal series estimators, maximum penalized likelihood estimators, general weight function estimators and the kernel density estimator. In this study we shall make use of the kernel density estimator as an improvement over the other methods mentioned above.

Moreover, knowing full well the influence of the choice of k and h in kernel density estimation, we also critically examine the various methods of choosing h, the smoothing parameter used in constructing the density estimate. Five methods of estimating the bandwidth are considered in this study: least squares cross-validation, maximum likelihood cross-validation, the Sheather and Jones plug-in, subjective choice and Silverman's rule of thumb. The optimal value of h for each of these methods is calculated, and the mean integrated square errors (MISEs) are calculated by estimating the squared bias and the variance for each of the selected window widths.

1.1 DENSITY ESTIMATION

Density estimation has experienced a wide explosion of interest over the last

two decades. Among various methods of density estimation, this study

provides a practical description of density estimation based on kernel

methods.

Suppose that we have a set of observed data points, assumed to be a sample from an unknown probability density function; then density estimation is the construction of an estimate of the density function from the observed data.

1.2 KERNEL DENSITY ESTIMATION

Let $x_1, x_2, \dots, x_n$ denote a sample of size n from a random variable X with density function f.

The kernel density estimate of f at the point x is given by

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right),$

where the kernel k satisfies $\int k(x)\,dx = 1$ and the smoothing parameter h is known as the bandwidth. Hence the performance of the kernel density estimator depends on the value of the smoothing parameter. It is also expedient to point out that this study considers the popular choice of k which is the Gaussian kernel, namely

$k(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$
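To make this definition concrete, the following is a minimal sketch in R (the language used for the numerical work in Chapter Four) of the estimator with the Gaussian kernel, evaluated on a grid; the sample and the trial bandwidth are illustrative only.

# Minimal sketch of the kernel density estimate with a Gaussian kernel.
# The sample x and the trial bandwidth h are illustrative only.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # observed data x_1, ..., x_n
h <- 0.5                            # a trial bandwidth (smoothing parameter)

kde <- function(t, x, h) {
  # f_hat(t) = (1/(n*h)) * sum_i k((t - x_i)/h), with k the standard normal density
  sapply(t, function(tt) sum(dnorm((tt - x) / h)) / (length(x) * h))
}

grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)
plot(grid, kde(grid, x, h), type = "l", xlab = "x", ylab = "estimated density")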

There exist two approaches to density estimation, namely the parametric and the non-parametric approach.

1.2.1 PARAMETRIC ESTIMATION

Assume that the data are drawn from one of a known family of parametric distributions, for example the normal distribution with mean $\mu$ and variance $\sigma^2$. The density f underlying the data could then be estimated by finding estimates of $\mu$ and $\sigma^2$ from the data and substituting these estimates into the formula for the normal density. In this work we shall not be considering parametric estimates of this kind.

1.2.2 NON-PARAMETRIC DENSITY ESTIMATION

In non-parametric density estimation, the data are allowed to speak for themselves in determining the estimate of f(x); this is more flexible than the case where f is constrained to fall in a given parametric family. No rigid assumption is made about the distribution of the observed data.

1.3 ASSUMPTION AND NOTATION

We have a sample $x_1, x_2, \dots, x_n$ of independent and identically distributed random variables from a continuous univariate distribution with probability density function f(x), which is to be estimated. The following notation is used:

$\int$ - integral over the range $(-\infty, \infty)$

$\sum$ - summation over $i = 1, 2, \dots, n$

n - sample size

f(x) - probability density function

$\hat{f}(x)$ - estimate of f(x)

h - bandwidth

k - Gaussian kernel

1.4 DEFINITION OF TERMS

Some basic terms which are relevant to this study are briefly defined in this

section.

Kernel Density Estimation: this is the construction of an estimate of the density function from the observed data using a kernel function.

Bandwidth: the bandwidth in kernel density estimation, denoted h, is the smoothing parameter. It determines the width of the bumps and the smoothness of the estimated density constructed.

Kernel Function: the kernel function k determines the shape of the bumps; the popular choice of k used in this study is the Gaussian kernel.

1.5 SIGNIFICANCE OF DENSITY ESTIMATION

 Density estimation is used in the presentation and exploration of data.

 It is used in informal investigation of the properties of a given set of

data, such as indications of features of the data like skewness and modality (unimodal, bimodal or multimodal).

 It can also be used for decision making, drawing conclusions, further analysis of data and presentation.

1.6 BASIC K.D.E PROPERTIES

In practice, the kernel k is generally chosen to be a unimodal probability density symmetric about zero. In this case k satisfies the conditions:

1. $\int_{-\infty}^{\infty} k(t)\,dt = 1$  (k is a probability density)

2. $\int_{-\infty}^{\infty} t\,k(t)\,dt = 0$  (symmetry)

3. $\int_{-\infty}^{\infty} t^2\,k(t)\,dt = k_2 \neq 0$

There are several approaches to density estimation. These are itemized below:

 Histograms (the oldest and most widely used density estimator)

 Naïve estimator

 The nearest neighbourhood estimator

 The variable kernel

 Orthogonal series estimators

 Maximum penalized likelihood estimators

 General weight function estimators

 The kernel density estimator.

In this study we shall focus mainly on the kernel method of density estimation for a univariate data set. As defined earlier, the kernel density estimator is

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right).$

This estimates the density f from the random sample $x_1, x_2, \dots, x_n$ at the point x. The selection of the smoothing parameter h and the choice of the kernel function are central to the estimation of the density f; however, there is no universally accepted way of choosing h and k. Over the years several authors have suggested different kernels, namely the uniform, triangle, Epanechnikov, quartic, triweight and Gaussian kernels.

The Gaussian kernel will be used throughout this study; most importantly, we will concentrate on the different methods of selecting the bandwidth, also known as the window width.
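As an illustrative aside (not part of the project's own analysis), R's built-in density() function implements several of these kernels, so their visual effect can be compared on the same simulated data with the same amount of smoothing:

# Sketch: comparing kernel shapes at a fixed bandwidth on simulated data.
set.seed(2)
x <- rnorm(200)
plot(density(x, bw = 0.4, kernel = "gaussian"), main = "Kernel comparison")
lines(density(x, bw = 0.4, kernel = "epanechnikov"), lty = 2)
lines(density(x, bw = 0.4, kernel = "triangular"), lty = 3)
legend("topright", legend = c("gaussian", "epanechnikov", "triangular"), lty = 1:3)

In practice the choice of kernel matters much less than the choice of bandwidth, which is why the rest of this study fixes k to the Gaussian kernel and concentrates on h.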

1.7 SCOPE AND LIMITATION OF THE STUDY

This study covers the non-parametric approach to density estimation using a univariate data set. The aim of this work is to calculate the mean integrated square error (MISE), which is done by calculating the bias and variance for five different methods of choosing h. A graphical representation of the behaviour of the different choices of h is also given.

1.8 SUMMARY

In this chapter we have described kernel density estimation using the Gaussian kernel and a selected choice of the smoothing parameter. We discussed the two approaches to density estimation, namely the parametric and non-parametric approaches. We make use of five different methods of bandwidth selection: the subjective choice of bandwidth, Silverman's rule of thumb, the least squares cross-validation method, the maximum likelihood cross-validation method and the plug-in method. We derived the approximate MISE, which comprises the sum of the squared bias and the variance.

In conclusion, there is apparently no single best technique of bandwidth selection.

CHAPTER TWO

LITERATURE REVIEW

2.0 Introduction

Density estimation is an established method in the fields of statistics and pattern recognition. Historically, the histogram was used to convey the general flavour of non-parametric theory and practice of density estimation. Kernel density estimation has become a common tool for empirical studies in many research areas. This goes hand in hand with the fact that this kind of estimator is now provided by many software packages. The discussion on the choice of bandwidth has been going on for about three decades, although a good part of the discussion is about non-parametric regression. New contributions typically provide simulations only to show that their own selector outperforms some of the existing methods. This section of the study reviews the literature produced by scholars in the following fields:

 Density estimation

 Kernel density estimation

 Bandwidth selection

2.1 REVIEWS ON DENSITY ESTIMATION

Density estimation has experienced a wide explosion of interest over the last two decades (Silverman, 1986). Density estimation has been applied in many fields, including archaeology (e.g. Baxter, Beardah and Westwood, 2000). It is well known that the performance of kernel density estimators depends crucially on the value of the smoothing parameter.

2.1.1 Loftsgaarden and Quesenberry (1965) propounded the nearest neighbour method of density estimation. This method can be generalized to provide an estimate related to the kernel estimate. Let k(x) be a kernel function integrating to one, i.e. $\int k(x)\,dx = 1$. Then the generalized nearest neighbour estimate is defined as

$\hat{f}(t) = \frac{1}{n\,d_k(t)} \sum_{i=1}^{n} k\left(\frac{t - x_i}{d_k(t)}\right),$

where $d_k(t)$ is the distance from t to its k-th nearest data point.

2.1.2 Breiman et al. (1977) modified the nearest neighbour estimate. In their work, they let k be a kernel function and k a positive integer, and defined $d_{j,k}$ to be the distance from $x_j$ to the k-th nearest point in the set comprising the other n−1 data points. Then the variable kernel estimate with smoothing parameter h is defined by

$\hat{f}(t) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\,d_{j,k}}\, k\left(\frac{t - x_j}{h\,d_{j,k}}\right).$

The variable kernel density estimate is due to Breiman et al. (1977), though Wertz (1978) refers to presumably independent earlier work. The window width of the kernel placed on the point $x_j$ is proportional to $d_{j,k}$, so that data points in regions where the data are sparse will have flatter kernels associated with them. For any fixed k, the overall degree of smoothing depends on the parameter h, while the choice of k determines how responsive the window width choice will be to local detail. Thus the amount of smoothing is also influenced by the choice of the integer k.

The naïve density estimator was introduced by Fix and Hodges in an unpublished report. They expressed the naïve density estimator as

$\hat{f}(x) = \frac{1}{2hn}\,\#\{\,x_1, x_2, \dots, x_n \text{ falling in } (x-h,\, x+h)\,\}.$

Let w be the weight function

$w(x) = \tfrac{1}{2} \text{ if } |x| < 1, \quad 0 \text{ otherwise}.$

Then the naïve estimator can be written as

$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, w\left(\frac{x - x_i}{h}\right).$

Essentially, the first published paper to deal explicitly with probability density

estimation was Rosenblatt (1956), who discussed both the naive estimator and

the more general kernel estimator.

2.1.3 Good and Gaskins (1971)

The maximum penalized likelihood approach was first applied to density estimation by Good and Gaskins. Maximum penalized likelihood is a standard statistical method. Meanwhile, some other density estimators have been derived directly from the definition of a density.

2.1.4 Whittle (1958)

It is possible to define a general class of density estimators which includes several of the estimators above. Suppose that w(x, y) is a function satisfying the conditions

$\int_{-\infty}^{\infty} w(x, y)\,dy = 1 \quad \text{for all } x,$

$w(x, y) \ge 0 \quad \text{for all } (x, y).$

Estimates of the form

$\hat{f}(t) = \frac{1}{n} \sum_{i=1}^{n} w(x_i, t)$

are referred to as general weight function estimates.

2.1.5 Silverman (1986)

He said that density estimation is the construction of an estimate of the density function from the observed data. He described the parametric approach to density estimation as follows: consider any random quantity X that has probability density function f. Specifying the function f gives a natural description of the distribution of X, and allows probabilities associated with X to be found from the relation

$P(a < X < b) = \int_{a}^{b} f(x)\,dx \quad \text{for all } a < b.$

2.1.6 Fix and Hodges (1951b)

When we assume only that the distribution has a probability density f, the data are allowed to speak for themselves in determining their estimate of f more than would be the case if f were constrained to fall in a given parametric family. Density estimates of this kind were first proposed by Fix and Hodges (1951).

2.2 Review on Kernel Density Estimation

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental smoothing problem where inferences about the population are made based on a finite data sample. A more recent approach was introduced by Chaudhuri and Marron (2000), whose objective is to analyze the visible features representing important underlying structures for different bandwidths.

2.2.1 Silverman (1986)

He considered measures of discrepancy between f(x) and $\hat{f}(x)$, which he expressed within the confines of the mean square error (MSE) and the mean integrated square error (MISE). According to him, when considering estimation at a single point, a natural measure is the mean square error (abbreviated MSE), defined by

$\mathrm{MSE}_x(\hat{f}) = E\left(\hat{f}(x) - f(x)\right)^2.$

By standard elementary properties of mean and variance,

$\mathrm{MSE}_x(\hat{f}) = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var}\hat{f}(x),$

the sum of the squared bias and the variance at x.

The first and most widely used way of placing a measure on the global accuracy of $\hat{f}(x)$ as an estimator of f(x) is the mean integrated square error (abbreviated MISE), introduced by Rosenblatt (1956) and defined by

$\mathrm{MISE}(\hat{f}) = E\int \left(\hat{f}(x) - f(x)\right)^2 dx.$

Though there are other global measures of discrepancy which may be more appropriate to one's intuitive ideas about what constitutes a globally good estimate, the MISE is by far the most tractable global measure of discrepancy. He also discussed the methodology for the theoretical treatment of the closeness of the estimator $\hat{f}$ to the true density f in various senses. The estimate $\hat{f}$ of course depends on the data as well as on the kernel and the window width.

In many branches of statistics there is a trade-off between the bias and variance terms in

$\mathrm{MSE}_x(\hat{f}) = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var}\hat{f}(x).$

The bias can be reduced at the expense of increasing the variance, and vice versa, by adjusting the amount of smoothing.

It is useful to note that, since the integrand is non-negative, the order of integration and expectation can be reversed, which gives the MISE as the sum of the integrated squared bias and the integrated variance. The bias does not depend on the sample size directly, but only on the weight function. This is important conceptually because it shows that taking larger and larger samples will not, alone, reduce the bias; it will be necessary to adjust the weight function to obtain unbiased estimates.

2.2.2 Tarn (2001)

He provided an insight into histograms (how to construct them and their features). He gave his own view on how kernel density estimators can be regarded as an improvement over the histogram method of density estimation. He recommended how to choose an appropriate, "nice" kernel so as to extract all the important features of the data.

He defined the histogram as the simplest non-parametric density estimator and the one most frequently used. He stated that when constructing a histogram, we must consider the size of the bins (the binwidth) and the end points of the bins.

Turning to the kernel density estimator, he showed that the first two problems of the histogram, lack of smoothness and dependence on the end points of the bins, are solved by the kernel density estimator. He also asserted that the histogram's dependence on the width of the bins remains a problem in kernel density estimation, in the form of dependence on the bandwidth.

2.2.3 Scott (1981)

He obtained the optimal window width for specific kernels. The amount of smoothing in density estimation is determined by the window width (bandwidth).

2.2.4 Ahmad and Ran (2003)

A contrast method for the kernel density estimate $\hat{f}(x)$ was proposed by Ahmad and Ran (2003). They stated that if, in an attempt to eliminate the bias, a very small value of h is used, the integrated variance will become large; on the other hand, choosing a large value of h will reduce the random variation, as quantified by the variance, at the expense of introducing systematic error, or bias, into the estimation.

2.2.5 Wand and Schucany (1989)

They worked on Gaussian-based kernels, a class of higher-order kernels for curve estimation and window width selection which can be viewed as extensions of the second-order Gaussian kernel. These kernels have some attractive properties such as smoothness, and are a widely accepted means of estimating curves such as densities, regression functions and failure rates without parametric assumptions. Derivatives of these functions can be estimated by straightforward extensions of kernel estimators. In their note they confined attention to the estimation of densities and their derivatives.

There have been several proposals for automatic selection of h; a feature of these selection rules is that they also require the use of kernels. They further note that in the simplest case the particular choice of kernel in the kernel density estimator is the Gaussian kernel.

2.2.6 Müller and Mammitzsch (1985)

The kernel density estimator was also studied by Müller (1984) and by Gasser, Müller and Mammitzsch (1985). Their motivation for using higher-order Gaussian-based kernels is to reduce the order of magnitude of the bias of the curve estimate, leading to a faster rate of convergence of the integrated mean squared error.

2.2.7 Fryer (1976) and Deheuvels (1977)

They showed that the mean integrated square error (MISE) can be calculated exactly when both the underlying density and the kernel function are Gaussian, by writing it in terms of convolutions, which are simply evaluated in the Gaussian case. They observed that if f (the density function) is normal and the kernel satisfying $\int k = 1$ is the Gaussian (standard normal) kernel, then closed-form expressions are available for these convolutions and their integrals, and exact MISE calculations are possible.

2.2.8 Rosenblatt (1956)

According to him, the kernel density estimate is given by

$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i),$

where h is called the bandwidth, k is a kernel, and $k_h(u) = \frac{1}{h}\,k\left(\frac{u}{h}\right)$. Usually the following assumptions are imposed on the kernel:

k is symmetric, i.e. $k(v) = k(-v)$;

$\int k(u)\,du = 1$;

$\int u^{j}\,k(u)\,du = 0$ for $j = 1, 2, \dots, \kappa - 1$, with $\int u^{\kappa}\,k(u)\,du \neq 0$.

In this case k is called a kernel of order $\kappa$. Note that because of the symmetry, $\kappa$ is necessarily even, and the second assumption guarantees that $\hat{f}_h(x)$ is a density, i.e. $\int \hat{f}_h(x)\,dx = 1$.

2.2.9 Marron and Wand (1992)

Their work was on the higher-order Gaussian-based kernels, which they gave as

$G_{2r}(x) = \frac{(-1)^{r}\,\phi^{(2r-1)}(x)}{2^{\,r-1}\,(r-1)!\,x}.$

They gave the corresponding density estimate in the form

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} w\left(\frac{x - x_i}{h}\right).$

2.2.10 Hall, Sheather, Jones and Marron (1991)

An expression for the mean integrated square error (MISE) was obtained by evaluating the higher-order terms of the kernel expansion, while the asymptotic expression for the window width was given as

$\hat{h} = \left(\frac{\hat{p}_1}{n}\right)^{1/5} + \hat{p}_2\left(\frac{\hat{p}_1}{n}\right)^{3/5}.$
2.3 Review on Bandwidth

Part of the community working on non-parametric statistics has accepted that there may not be a perfect procedure for selecting the optimal bandwidth; nevertheless, one should be able to say what a reasonable bandwidth selection is, at least for a particular problem. The development of bandwidth selectors is ongoing, so a review and comparison of existing selectors is helpful for getting an idea of their objectives and performance. Moreover, when the variable x is transformed such that similar smoothness can be assumed over the whole (transformed) support, using a global bandwidth is a quite reasonable choice.

The idea of cross-validation methods goes back to Rudemo (1982) and Bowman (1984), but we should also mention in this context the so-called pseudo-likelihood cross-validation methods introduced by Habbema et al. (1974) and by Duin (1976), which, however, lack stability; see Wand and Jones (1995). The biased cross-validation of Scott and Terrell (1987) minimizes the asymptotic MISE, as plug-in methods do, but uses a jack-knife procedure (therefore called cross-validation) to avoid the use of prior information. Methods that combine different selectors or density estimators were proposed by Ahmad and Ran (2004), who called theirs the kernel contrast method, and by Mammen et al. (2011), who proposed the do-validation method.

Compared to cross-validation, plug-in methods minimize a different objective function, namely the MISE instead of the ISE. They are less volatile but not entirely data adaptive, as they require some pilot information. In contrast, cross-validation allows one to choose the bandwidth without making assumptions about the smoothness class (or the like) to which the unknown density belongs. Plug-in methods have a faster convergence rate compared to cross-validation, and their performance is generally good. Among these selectors, Silverman's (1986) rule of thumb is probably the most popular. Various refinements were introduced, for example by Park and Marron (1990), Hall et al. (1991) and Taylor (1989); these count as plug-in methods as they aim to minimize the MISE.

There are already several papers dealing with a comparison of different automatic data-driven bandwidth selection methods, though most of them are now rather dated. In the 1970s and early 1980s, survey papers about density estimation were published by Wegman (1972), by Tarter, and by Wertz and Schneider; surveys of various methods of smoothing parameter selection were provided by Marron (1988a) and by Park and Marron (1990). A brief survey was provided by Jones et al. (1996a), with a comprehensive simulation study published in Jones et al. (1996b). However, they concentrated on bootstrap methods and only compared them with classical cross-validation and the plug-in version of Sheather and Jones (1991). Devroye and Lugosi (1996) focus on optimal bandwidth choice; see Devroye (1997) for a comprehensive comparison study.

In the context of the asymptotic properties of bandwidth selectors, there is a trade-off between the classical plug-in method and standard cross-validation. The plug-in always has a smaller (asymptotic) variance than cross-validation (see Hall and Marron, 1987a), but often a larger bias in practice.

Sheather (2004) gave a practical description of kernel density estimation, reviewing the estimation and bandwidth selection methods which he considered to be the most popular at that time. Devroye (1997) considered different kernel density estimators. Marron (1986) made the point that "the harder the estimation problem, the better cross-validation works"; based on this idea, Martínez-Miranda et al. (2009) proposed first applying cross-validation to a harder estimation problem, and afterwards calculating the corresponding bandwidth for the underlying real estimation problem. Hall et al. (1992) introduced smoothed cross-validation; the general idea is a kind of pre-smoothing of the data before applying the cross-validation criterion. This pre-smoothing results in smaller sample variability but enlarges the bias; therefore the resulting bandwidth often oversmooths and cuts off some important features of the underlying density. Kim et al. (1994) obtained asymptotically best bandwidth selectors based on an exact MISE expansion. Ahmad and Ran (2004) proposed a kernel contrast method for choosing the bandwidth by minimizing either the ISE or, alternatively, the MISE.

Silverman (1986) proposed his ideal h from the point of view of minimizing the approximate mean integrated square error (AMISE). He stated that the optimal window width is somewhat disappointing since it itself depends on the unknown density being estimated; the optimal window width converges to zero as the sample size increases, but at a very slow rate. Oman et al. (2008) considered least squares cross-validation (LSCV), biased cross-validation (BCV) and the contrast method for selecting h, using different underlying normal mixture densities.

Ogbonmwan (1999) discussed a general method of obtaining an optimal window width in the sense of minimizing the mean integrated square error (MISE); he was also able to obtain the optimal window width for some specific kernels, drawing on the work of Terrell and Scott (1985) and Terrell (1990). Terrell (1990) proposed the maximal smoothing principle for histograms and frequency polygons. According to them, the maximal smoothing principle (MSP) works fairly well for unimodal densities. However, for multimodal densities it tends to oversmooth the data and hide features of the underlying density, which can be viewed as a drawback. But Terrell (1992) advises the use of the MSP "because they start with a sort of null hypothesis that there is no structure of interest and let the data force us to conclude otherwise". However, Silverman (1986) stated that it should never be forgotten that the appropriate choice of smoothing parameter will always be influenced by the purpose for which the density estimate is to be used.

He also said that when using density estimation for presenting conclusions, there is a case for undersmoothing somewhat; the reader can do further smoothing "by eye", but cannot easily unsmooth.

2.4 Summary of Chapter

The author's reviews in this work concentrate on density estimation which is

of different types and its various views as considered by different authors.

Also, we looked at kernel density estimation with a concise but

comprehensive discussion as put forward by different scholars in that field of

study.

Finally, an important aspect of density estimation was reviewed. This is the

25
selection of bandwidth, which translate to the smoothing parameter in the

construction of density estimation.

CHAPTER THREE

METHODOLOGY

3.1 Introduction to Bandwidth Selection

In several surveys, many scholars and authors point out that smoothing methods provide a powerful methodology for gaining insight into data. Many examples of this may be found in the monographs of Eubank (1988), Hardle (1990), Muller (1988), Scott (1992), Silverman (1986) and Wand and Jones (1995). But effective use of these methods requires the choice of a smoothing parameter. According to Scott (1992), the choice of h is crucial for the effective performance of the kernel estimator. When insufficient smoothing is done, the resulting density or regression estimate is too rough and contains spurious features that are artifacts of the sampling process. When excessive smoothing is done, important features of the underlying structure are smoothed away. A method that uses the data $x_1, x_2, \dots, x_n$ to produce a value for the bandwidth h is called a bandwidth selector or data-driven selector.

In the hands of an expert, interactive visual choice of the smoothing parameter is a very powerful way to analyze data. But there are a number of reasons why it is important to be able to choose the amount of smoothing automatically from the data. One is that software packages need a default; this is useful in saving the time of experts by providing a sensible starting point, but it becomes imperative when smoothing is used by non-experts. Another reason this is important is that in a number of situations many estimates are required, and it can be impractical to manually select smoothing parameters for all of them (e.g. see the income data in Park and Marron, 1990). It should never be forgotten that there is no universally accepted approach to this problem. Hence, the following methods are discussed below.

3.2 Subjective Choice of Bandwidth

The subjective choice of bandwidth requires examining several plots of the data, all smoothed by different amounts; this may well give more insight into the data than merely considering a single automatically produced curve. It is a natural method for choosing the smoothing parameter, as it entails plotting several curves and choosing the estimate that is most in accordance with one's prior ideas about the density; for many applications this approach will be perfectly satisfactory. There are many situations where it is satisfactory to select the most acceptable density in this way (Ahmad and Muqdadi, 2003). One strategy for doing so is to start with a small (or large) bandwidth and then go up (or down) until a suitable estimate is reached.

Moreover, when we choose the bandwidth we have to consider the error in our choice. For many purposes, particularly for model and hypothesis generation, it is by no means unhelpful for the statistician to supply the scientist with a range of possible presentations of the data. A choice among several alternative models affords the user a very useful step forward from the enormous number of possible explanations that could conceivably be considered.
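As a rough sketch of this subjective procedure (with simulated, illustrative data), one can plot the same sample at several trial bandwidths in R and pick the curve most in accordance with prior ideas about the density:

# Sketch of the subjective approach: one plot per trial bandwidth,
# from heavy to light smoothing. Data and bandwidths are illustrative.
set.seed(3)
x <- c(rnorm(60, 0, 1), rnorm(40, 4, 0.8))       # a bimodal sample
hs <- c(1.6, 0.8, 0.4, 0.2)                       # large to small bandwidths
dens <- lapply(hs, function(h) density(x, bw = h))
plot(dens[[1]], main = "Trial bandwidths",
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))))
for (i in 2:length(dens)) lines(dens[[i]], lty = i)
legend("topright", legend = paste("h =", hs), lty = seq_along(hs))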

3.3 Rules of Thumb

The computationally simplest method for choosing a global bandwidth h is based on replacing $R(f'') = \int f''(x)^2\,dx$, the unknown part of $h_{AMISE}$, by its value for a parametric family expressed as a multiple of a scale parameter, which is then estimated from the data. The method seems to date back to Deheuvels (1977) and Scott (1979), who each proposed it for the histogram. However, the method was popularized for kernel density estimates by Silverman (1986, Section 3.2), who used the normal distribution as the parametric family.

Let $\sigma$ and IQR denote the standard deviation and interquartile range of X, respectively, and take the kernel k to be the Gaussian kernel. Assuming that the underlying distribution is normal, Silverman (1986, pages 45 and 47) showed that bandwidth selection for the kernel density estimate reduces to
$h_{AMISE,\,normal} = 1.06\,\sigma\,n^{-1/5}.$

Jones, Marron and Sheather (1996) studied the Monte Carlo performance of

the normal reference bandwidth based on the standard deviation, that is, they

considered

$h_{SNR} = 1.06\,s\,n^{-1/5},$

where s is the sample standard deviation and n is the sample size. According to Jones et al. (1996), this method is called the sample normal reference method (SNR). They found that $h_{SNR}$ had a mean that was usually unacceptably large and thus often produced oversmoothed density estimates. Furthermore, Silverman (1986, page 48) recommended reducing the factor 1.06 in the previous equation to 0.9, in an attempt not to miss bimodality, and using the smaller of two scale estimates. This rule is commonly used in practice and is often referred to as Silverman's reference bandwidth or Silverman's rule of thumb. It is given by

$h_{SROT} = 0.9\,A\,n^{-1/5},$

where $A = \min(s,\ \mathrm{IQR}/1.34)$ is the smaller of the sample standard deviation and the scaled interquartile range. This method is called the Silverman rule of thumb method (SROT).
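As a brief sketch (with simulated data), these reference rules are easy to compute directly; base R's default bandwidth bw.nrd0() implements the 0.9 rule, while bw.nrd() implements the 1.06 rule.

# Sketch: normal reference and Silverman's rule of thumb on illustrative data.
set.seed(4)
x <- rnorm(200, mean = 10, sd = 3)
n <- length(x)
A <- min(sd(x), IQR(x) / 1.34)        # smaller of the two scale estimates
h_snr  <- 1.06 * sd(x) * n^(-1/5)     # sample normal reference bandwidth
h_srot <- 0.9  * A     * n^(-1/5)     # Silverman's rule of thumb
c(h_snr = h_snr, h_srot = h_srot, bw.nrd = bw.nrd(x), bw.nrd0 = bw.nrd0(x))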

Terrell and Scott (1985) and Terrell (1990) developed a bandwidth selection method based on the maximal smoothing principle, so as to produce oversmoothed density estimates. The method is based on choosing the largest degree of smoothing compatible with the estimated scale of the data, i.e. the density with the smallest value of $\int f''(x)^2\,dx$. Taking the variance $\sigma^2$ as the scale parameter, Terrell (1990, page 471) found the family of distributions with variance $\sigma^2$ that minimizes $\int f''(x)^2\,dx$; for the standard Gaussian kernel this leads to the oversmoothed bandwidth

$h_{OS} = 1.144\,s\,n^{-1/5}.$

This method is called the oversmoothed method. According to Silverman (1986), comparing the oversmoothed bandwidth with the normal reference bandwidth, we see that the oversmoothed bandwidth is 1.08 times larger. Thus in practice there is often very little visual difference between density estimates produced using either the oversmoothed bandwidth or the normal reference bandwidth.
3.3.1 MEASURE OF DISCREPANCY

Naturally, there exists a discrepancy between the density estimate $\hat{f}(x)$ and the actual density f(x). One way of measuring this discrepancy is the Mean Integrated Squared Error (MISE); alternatively, the Mean Squared Error (MSE) can be used. While the MISE measures the error globally, considering all points of the density estimate, the MSE measures the local error, that is, the error at a point.

The MISE is the sum of the integrated squared bias and the integrated variance. That is,

$\mathrm{MISE} = \int \mathrm{bias}_h(x)^2\,dx + \int \operatorname{var}\hat{f}(x)\,dx \qquad (1)$

$\mathrm{bias}_h(x) = E\hat{f}(x) - f(x) = \int \frac{1}{h}\,k\left(\frac{x - y}{h}\right) f(y)\,dy - f(x) \qquad (2)$

We perform a change of variable. Let $t = \frac{x - y}{h}$, so that $y = x - ht$. Using the assumption that k integrates to unity, we have

$\mathrm{bias}_h(x) = \int k(t)\,f(x - ht)\,dt - f(x).$

Using a Taylor series expansion of $f(x - ht)$, we have

$f(x - ht) = f(x) - ht\,f'(x) + \tfrac{1}{2} h^2 t^2 f''(x) - \tfrac{1}{6} h^3 t^3 f'''(x) + \text{higher order terms}.$

Thus, by the assumptions on the kernel function, we have

$\mathrm{bias}_h(x) = -h f'(x) \int t\,k(t)\,dt + \tfrac{1}{2} h^2 f''(x) \int t^2 k(t)\,dt + \cdots$

Using the assumptions $\int_{-\infty}^{\infty} t\,k(t)\,dt = 0$ and $\int_{-\infty}^{\infty} t^2 k(t)\,dt = k_2 \neq 0$, this gives

$\mathrm{bias}_h(x) = \tfrac{1}{2} h^2 f''(x)\,k_2 + \text{higher order terms}. \qquad (3)$

Hence the integrated squared bias, needed for the mean integrated square error, is

$\int \mathrm{bias}_h(x)^2\,dx \approx \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx. \qquad (4)$

Similarly, the integrated variance is

$\int \operatorname{var}\hat{f}(x)\,dx \approx n^{-1} h^{-1} \int k(t)^2\,dt. \qquad (5)$

Therefore, the asymptotic mean integrated squared error (AMISE) is

$\mathrm{AMISE} = \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx + (nh)^{-1} \int k(t)^2\,dt. \qquad (6)$

Essentially, we want to choose h to make the mean integrated square error as small as possible. Equations (4) and (5) demonstrate one of the fundamental problems of density estimation: the trade-off between the bias and the variance. If, in an attempt to eliminate the bias, a very small value of h is used, then the integrated variance will become large. On the other hand, choosing a large value of h will reduce the random variation (integrated variance) at the expense of introducing systematic error (bias) into the estimation. It should be mentioned here that whatever method of density estimation is used, the choice of smoothing parameter implies a trade-off between random error (variance) and systematic error (bias).

The optimal bandwidth h can be obtained from the approximate mean integrated squared error given above,

$\mathrm{AMISE}(\hat{f}) = \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx + (nh)^{-1} \int k(t)^2\,dt.$

Differentiating with respect to h and equating to zero, we have

$\frac{d}{dh}\,\mathrm{AMISE}(\hat{f}) = h^3 k_2^2 \int f''(x)^2\,dx - n^{-1} h^{-2} \int k(t)^2\,dt = 0.$

Multiplying through by $h^2$ gives

$h^5 k_2^2 \int f''(x)^2\,dx - n^{-1} \int k(t)^2\,dt = 0,$

so that

$h^5 k_2^2 \int f''(x)^2\,dx = n^{-1} \int k(t)^2\,dt,$

$\therefore\; h_{opt} = k_2^{-2/5} \left(\int f''(x)^2\,dx\right)^{-1/5} \left(\int k(t)^2\,dt\right)^{1/5} n^{-1/5}. \qquad (7)$

The formula (7) above for the optimal window width is somewhat disappointing, since it shows that $h_{opt}$ itself depends on the unknown density f(x) being estimated. The following observations and deductions are made:

$\lim_{n \to \infty} h_{opt} = 0$ ($h_{opt}$ converges to zero, but at a slow rate);

$\int f''(x)^2\,dx$ measures, in a sense, the rapidity of fluctuations in the density f(x); and

$\int k(t)^2\,dt = \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-t^2}\,dt$ for the Gaussian kernel.

Let $y = t^2$, so that $dy = 2t\,dt$ and $t = \sqrt{y}$. Then

$\int_{-\infty}^{\infty} k(t)^2\,dt = 2 \int_{0}^{\infty} \frac{1}{2\pi}\, e^{-y}\, \frac{dy}{2\sqrt{y}} = \frac{1}{2\pi} \int_{0}^{\infty} y^{-1/2} e^{-y}\,dy = \frac{1}{2\pi}\,\Gamma\!\left(\tfrac{1}{2}\right) = \frac{\sqrt{\pi}}{2\pi} = \tfrac{1}{2}\pi^{-1/2}.$

$\therefore\; \int_{-\infty}^{\infty} k(t)^2\,dt = \frac{1}{2\sqrt{\pi}}.$
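As a small numerical sketch connecting formula (7) with the reference rules of Section 3.3: if the unknown density f is taken to be $N(0, \sigma^2)$, then $\int f''(x)^2\,dx = 3/(8\sqrt{\pi}\,\sigma^5)$ (a standard result), and with the Gaussian kernel ($k_2 = 1$, $\int k(t)^2\,dt = 1/(2\sqrt{\pi})$) formula (7) reduces to approximately $1.06\,\sigma\,n^{-1/5}$, the factor used in the normal reference rule. The R lines below check this numerically.

# Sketch: formula (7) evaluated under a normal reference density, sigma = 1.
h_opt_normal <- function(n, sigma = 1) {
  k2  <- 1                              # second moment of the Gaussian kernel
  Rk  <- 1 / (2 * sqrt(pi))             # integral of k(t)^2
  Rf2 <- 3 / (8 * sqrt(pi) * sigma^5)   # integral of f''(x)^2 for N(0, sigma^2)
  k2^(-2/5) * Rf2^(-1/5) * Rk^(1/5) * n^(-1/5)
}
h_opt_normal(100)        # approximately 0.422
1.06 * 100^(-1/5)        # approximately 0.422 as well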

3.4 Cross-Validation Methods

This method is also known as least squares cross-validation. It was proposed by Rudemo (1982) and by Bowman (1984) and is probably the most popular and best studied selector. It is a completely automatic method for choosing the smoothing parameter (Silverman, 1986). It was only formulated in relatively recent years but is based on an extremely simple idea; it depends on the kernel and the bandwidth, not on the sample size. The method is based on the so-called leave-one-out density estimator.

Scott and Terrell (1987) called this method unbiased cross-validation since, for fixed h (i.e. a non-random bandwidth), it is easy to check that $E[\mathrm{LSCV}(h)] = \mathrm{MISE}(h) - R(f)$. The popularity of this method is due to this intuitive motivation and to the fact that it is asymptotically optimal (Hall, 1983; Stone, 1984). The idea is to consider the expansion of the integrated square error (ISE) in the following way:

$\mathrm{ISE}(h) = \int \hat{f}_h^2(x)\,dx - 2\int \hat{f}_h(x) f(x)\,dx + \int f^2(x)\,dx.$

^^
Note that the last tem does not depend on f h−, hence on h, so that we only

need to consider the first two terms the ideal choice of bandwidth is the one

which minimizes.

L ( h )=ISE ( h )−∫ f 2 ( x ) dx=∫ ^f h2 ( x ) dx −∫ f^ ( x )∫ ( x ) dx


the principle of the LSCV

method is to find an estimate f L(h) from the data and minimize it over h, and

also minimize the BCVCh over h.

A measure of the closeness of $\hat{f}(x)$ to the density f(x) is the mean integrated square error (MISE),

$\mathrm{MISE} = E\int \left(\hat{f}(x) - f(x)\right)^2 dx \qquad (1)$

$= E\int \left[\hat{f}^2(x) - 2\hat{f}(x) f(x) + f^2(x)\right] dx. \qquad (2)$

Notice that the last term on the right-hand side of the expression above does not involve the estimated density (that is, $\int f^2(x)\,dx$ does not depend on h); therefore minimization of the MISE is equivalent to minimization of

$\mathrm{MISE} - \int f^2(x)\,dx = E\left[\int \hat{f}^2(x)\,dx - 2\int \hat{f}(x) f(x)\,dx\right].$

The LSCV bandwidth selection method is based on obtaining an unbiased estimate of this quantity; thus the motivation of least squares cross-validation (LSCV) comes from expanding the MISE in equation (1).

Thus, in practice it is prudent to plot LSCV(h) and not just rely on the result of a minimization routine. Jones, Marron and Sheather (1996) recommended that the largest local minimizer of LSCV(h) be used as $h_{LSCV}$, since this value produces better empirical performance than the global minimizer. Hall and Marron (1987b) and Scott and Terrell (1987) showed that the LSCV bandwidth $h_{LSCV}$ achieves the best possible rate of convergence in this setting, although this rate is slow.
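As a sketch (with simulated data and a hand-rolled criterion), the LSCV bandwidth can be obtained in base R with bw.ucv(), and the LSCV(h) curve itself can be plotted as recommended above. The closed form used below for the Gaussian kernel is a standard one, but its minimizer may differ slightly from bw.ucv(), which uses its own internal approximations.

# Sketch: LSCV criterion for a Gaussian kernel, plus R's built-in selector.
set.seed(6)
x <- c(rnorm(80), rnorm(40, mean = 3))

lscv <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-") / h
  term1 <- sum(dnorm(d, sd = sqrt(2))) / (n^2 * h)   # integral of fhat_h^2
  K <- dnorm(d); diag(K) <- 0                        # leave-one-out kernel sums
  term2 <- 2 * sum(K) / (n * (n - 1) * h)
  term1 - term2
}

h_grid <- seq(0.05, 1.5, length.out = 60)
plot(h_grid, sapply(h_grid, lscv, x = x), type = "l",
     xlab = "h", ylab = "LSCV(h)")
abline(v = bw.ucv(x), lty = 2)   # minimizer found by R's unbiased CV selector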

3.5 Plug-In Method

The slow rate of convergence of LSCV and BCV encouraged much research on faster-converging methods. The plug-in method is commonly thought to date back to Woodroofe (1970), who proposed it for estimating the density at a given point. Estimating $R(f'')$ by $R(\hat{f}''_g)$ requires the user to choose the bandwidth g for this so-called pilot estimate. There are many ways this can be done; we next describe the "solve-the-equation" plug-in approach developed by Sheather and Jones (1991), since this method is widely recommended (e.g. Simonoff, 1996, page 77; Bowman and Azzalini, 1997; Venables and Ripley, 2002, page 129).

Different versions of the plug-in approach depend on the exact form of the estimate of $R(f'')$. The Sheather and Jones (1991) approach is based on writing the pilot bandwidth for the estimate of $R(f'')$ as a function of h, namely

$g(h) = C(k)\left[\frac{R(f'')}{R(f''')}\right]^{1/7} h^{5/7},$

and estimating the resulting unknown functionals of f using kernel density estimates with bandwidths based on normal rules of thumb. In this situation, the only unknown in the following equation is h:

$h = \left[\frac{R(k)}{\mu_2(k)^2\, R\!\left(\hat{f}''_{g(h)}\right)}\right]^{1/5} n^{-1/5}.$

The Sheather-Jones plug-in bandwidth $h_{SJ}$ is the solution to this equation.

Under smoothness assumptions on the underlying density,

$n^{5/14}\left(\frac{h_{SJ}}{h_{AMISE}} - 1\right)$

has an asymptotic $N(0, \sigma^2_{SJ})$ distribution. Thus the Sheather-Jones plug-in bandwidth has a relative convergence rate of order $n^{-5/14}$, which is much higher than that of BCV. Most of the improvement is because BCV effectively uses the same bandwidth to estimate $R(f'')$ as it does to estimate f, while the Sheather-Jones plug-in approach assumes more smoothness of the underlying density than either LSCV or BCV.

Jones, Marron and Sheather (1996) found that for easy-to-estimate densities (i.e. those for which $R(f'')$ is relatively small) the distribution of $h_{SJ}$ tends to be centred near $h_{AMISE}$ and has much lower variability than the distribution of $h_{LSCV}$.
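In base R, the Sheather-Jones bandwidth is available through bw.SJ(); the sketch below (on simulated data) computes both the solve-the-equation and the direct plug-in variants and uses the former to draw the density estimate.

# Sketch: Sheather-Jones plug-in bandwidths on illustrative data.
set.seed(7)
x <- c(rnorm(120), rnorm(80, mean = 4, sd = 0.7))
h_sj_ste <- bw.SJ(x, method = "ste")   # "solve-the-equation" variant
h_sj_dpi <- bw.SJ(x, method = "dpi")   # direct plug-in variant
c(ste = h_sj_ste, dpi = h_sj_dpi)
plot(density(x, bw = h_sj_ste), main = "Sheather-Jones plug-in bandwidth")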

A number of authors have recommended that density estimates be drawn with more than one value of the bandwidth. Scott (1992, page 161) advised looking at a sequence of density estimates based on the sequence of smoothing parameters

$h = \frac{h_{OS}}{1.05^{k}} \quad \text{for } k = 0, 1, 2, \dots,$

starting with the oversmoothed bandwidth; the later estimates in such a sequence can display some instability and very local noise near the peaks. Marron and Chung (2001) also recommend looking at a family of density estimates for the given data set based on different values of the smoothing parameter. Marron and Chung (2001, page 198) advised that this family be based around a "centre point" which is an effective choice of the global smoothing parameter; the Sheather-Jones plug-in bandwidth is a natural choice for this purpose. Silverman (1981) showed that an important advantage of using the Gaussian kernel is that, in this case, the number of modes in the density estimate decreases monotonically as the bandwidth h increases. This means that the number of features in the estimated density is a decreasing function of the amount of smoothing.
Biased Cross-Validation

Biased cross-validation was proposed by Scott and Terrell (1987). It is based on choosing the bandwidth that minimizes an estimate of the asymptotic mean integrated square error (AMISE), rather than an estimate of the integrated square error (ISE). Consider again the asymptotic mean integrated squared error,

$\mathrm{AMISE}(h) = (nh)^{-1} R(k) + \frac{h^4 \mu_2(k)^2}{4}\, R(f'').$

The BCV bandwidth is obtained by substituting an estimate for $R(f'')$: instead of using a reference distribution, Scott and Terrell estimate this functional (essentially) by $R(\hat{f}''_h)$ to derive a score function BCV(h), which is minimized with respect to h. Here $\hat{f}''_h$ is the second derivative of the kernel density estimate, and the subscript h denotes the fact that the bandwidth used for this estimate is the same one used to estimate the density f(x) itself. Scott and Terrell (1987) showed that $R(\hat{f}''_h)$ is a biased estimate of $R(f'')$. The BCV objective is given by

$\mathrm{BCV}(h) = \frac{R(k)}{nh} + \frac{\mu_2(k)^2}{2 n^2 h} \sum_{i<j} \left(k'' \ast k''\right)\!\left(\frac{x_i - x_j}{h}\right),$

where $\ast$ denotes convolution.

Scott and Terrell (1987) proposed to use the minimizer $\hat{h}_{BCV}$ of BCV(h) as the bandwidth. They showed that $\hat{h}_{BCV}$ has the same relative rate of convergence to $\hat{h}_0$ as $\hat{h}_{LSCV}$, but that the constant is often much smaller. The best performance is obtained by choosing the smallest value of h for which a local minimum occurs.

According to Wand and Jones (1995, page 80), the ratio of the two asymptotic variances for the Gaussian kernel is

$\frac{\sigma^2_{LSCV}}{\sigma^2_{BCV}} = 15.7,$

indicating that bandwidths obtained from least squares cross-validation are expected to be much more variable than those obtained from biased cross-validation.
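For completeness, a short sketch: base R also provides the BCV selector through bw.bcv(), so the two cross-validation bandwidths can be compared directly on the same (simulated) data.

# Sketch: BCV versus LSCV bandwidths on illustrative data.
set.seed(8)
x <- rnorm(150, mean = 2, sd = 1.5)
c(h_bcv = bw.bcv(x), h_lscv = bw.ucv(x))
# Across repeated samples the BCV bandwidth tends to vary less than the LSCV
# one, in line with the asymptotic variance ratio quoted above.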

Comparison Between Subjective Choice, Rules of Thumb and Least

Squares Cross-Validation

First, the least squares cross-validation approach to bandwidth selection can ultimately oversmooth the data when used with heavy-tailed distributions: if an outlier is left out of the data set, then smoothing the remaining observations may produce essentially no estimate at that point, thereby forcing a larger bandwidth to be selected. A reference for this point is Schuster and Gregory (1981), who point out that LCV produces inconsistent fixed-bandwidth kernel estimates when the underlying distribution has heavy tails. However, choosing the bandwidth h subjectively can moderate the defects of the LCV in terms of oversmoothing, and can show clearly the features of the estimate made possible by the observations provided. In the same vein, reference rules of thumb afford near-optimal results through the h that minimizes the AMISE under the reference distribution. The slow $n^{-1/10}$ rate of convergence means that $h_{LSCV}$ is highly variable in practice. In addition to high variability, least squares cross-validation often undersmooths in practice, in that it leads to spurious bumpiness in the estimated density (Simonoff, 1996). A major advantage of LSCV over other methods is that it is widely applicable.

SUMMARY

In this chapter we discussed some of the methods used in selecting the bandwidth and their procedures. We also looked at the measure of discrepancy for these methods, using the mean integrated square error (MISE) to quantify how far the estimates produced by these methods are from their target values.

CHAPTER FOUR

DATA ANALYSIS

PRESENTATION OF DATA, NUMERICAL VERIFICATION AND

GRAPHICAL ILLUSTRATIONS

This chapter focuses predominantly on numerical verification and graphical illustration of the data and the data analysis; for the presentation of the data see Appendices A, B and C on pages 63, 64 and 65 respectively. Taking into cognizance the obvious importance of the choice of the bandwidth h, and also of the Gaussian kernel k, in the construction of the density estimate, we calculate the optimal h for each of the bandwidth selection methods. Their corresponding MISEs are also derived. Both the optimal h values and the MISEs are estimated using the R programming language, version 4.8, on an HP G62 laptop (AMD Athlon II dual-core / i3-370M, 3 GB RAM, 373DX).
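The following R sketch illustrates the kind of computation performed in this chapter; the sample here is simulated as a stand-in for the appendix data, the selector names map onto the methods of Chapter Three where base R provides them, and the maximum likelihood cross-validation routine is a small custom function written for the sketch (base R has no built-in selector for it).

# Sketch: the bandwidth selectors of Chapter Three applied to one sample.
set.seed(9)
x <- rnorm(100, mean = 50, sd = 10)   # placeholder for an appendix data set

# Maximum likelihood (leave-one-out) cross-validation score, Gaussian kernel.
mlcv <- function(h, x) {
  n <- length(x)
  K <- dnorm(outer(x, x, "-") / h); diag(K) <- 0
  mean(log(rowSums(K) / ((n - 1) * h)))
}
h_mlcv <- optimize(mlcv, interval = c(0.1, 10) * bw.nrd0(x),
                   x = x, maximum = TRUE)$maximum

c(silverman_rot  = bw.nrd0(x),   # 0.9 * min(s, IQR/1.34) * n^(-1/5)
  normal_ref     = bw.nrd(x),    # 1.06-based reference rule
  lscv           = bw.ucv(x),    # least squares cross-validation
  mlcv           = h_mlcv,       # maximum likelihood cross-validation
  sheather_jones = bw.SJ(x))     # solve-the-equation plug-in
# A subjective bandwidth would be chosen by inspecting plots, as in Section 3.2.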

