CHAPTER ONE
1.0 INTRODUCTION
This study examines kernel density estimation with a close look at bandwidth selection. There are general weight function estimators and kernel density estimators. Knowing fully well the influence of the choices of k and h in kernel density estimation, several bandwidth selection methods are considered in this study: least squares cross-validation, maximum likelihood cross-validation, the Sheather and Jones plug-in, and subjective selection. For each method the optimal bandwidth is calculated, and the mean integrated squared errors (MISEs) are calculated by estimating the squared bias and the variance of each of the selected window widths.
Density estimation has experienced a wide explosion of interest over the last two decades, giving rise to a variety of methods. The kernel method is among the most widely used for estimating an unknown density function f.
The kernel density estimate of f at the point x is given by

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right).
Here h is known as the bandwidth. Hence the performance of the kernel density estimator depends on the choices of k and h. It is expedient to point out that this study considers the popular choice of k, the Gaussian kernel, namely

k(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}.
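As a concrete illustration (a minimal sketch, not part of the original text), the Gaussian kernel and the estimator above can be written in Python; the sample xs and the evaluation point are hypothetical:

```python
import math

def gaussian_kernel(x):
    # k(x) = exp(-x^2/2) / sqrt(2*pi)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    # f_hat(x) = (1/(n*h)) * sum_i k((x - x_i)/h)
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

xs = [1.2, 1.9, 2.3, 2.8, 3.1, 4.0]  # hypothetical sample
h = 0.5
density_at_2 = kde(2.0, xs, h)
```

Since the kernel integrates to one, the resulting estimate is itself a density: summing kde over a fine grid covering the data (plus the kernel tails) gives approximately 1.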
There exist two approaches to density estimation, namely the parametric and non-parametric approaches.
The parametric approach assumes that the data are drawn from a known family of parametric distributions, for example the normal distribution with mean μ and variance σ². The density f underlying the data could then be estimated by finding estimates of μ and σ² from the data and substituting these estimates into the formula for the normal density. In this work we shall not be considering the parametric approach; the non-parametric approach allows the data to speak for themselves more than the case where f is constrained to fall in a given parametric family.
\int_{-\infty}^{\infty} - integral over the range (−∞, ∞)
\sum_{i=1}^{n} - summation over i = 1, 2, …, n
n - sample size
\hat{f}(x) - estimate of f(x)
h - bandwidth
k - Gaussian kernel
Some basic terms which are relevant to this study are briefly defined in this
section.
Bandwidth: The bandwidth h is also known as the smoothing parameter. This determines the width of the bumps and the amount of smoothing applied to the estimate.
Kernel Function: The kernel function k determines the shape of the bumps, and the popular choice of k highlighted in this study is the Gaussian kernel.
Density Estimation: This provides a presentation and exploration of the data, such as the indication of features of the data like skewness and multimodality. It can also be used for decision making, drawing conclusions, and further analysis of the data.
2. \int_{-\infty}^{\infty} t\, k(t)\, dt = 0 \quad (symmetry)
3. \int_{-\infty}^{\infty} t^{2} k(t)\, dt = k_2 \neq 0
There are several approaches to density estimation. These are itemized below:
Naïve estimator
Kernel estimator
Nearest neighbour estimator
Variable kernel estimator
Maximum penalized likelihood estimators
The kernel estimator is represented as

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right).

This estimates the density f from the random sample x_1, x_2, \ldots, x_n at the point x. The selection of the smoothing parameter h and the choice of the kernel function k are the central problems, and there is no universally acceptable way of choosing the variables h and k; over the years several methods have been proposed. Here the Gaussian kernel will be used throughout this study; most importantly, we will concern ourselves with the different methods of selecting the bandwidth h.
1.7 SCOPE AND LIMITATION OF THE STUDY
This study covers the non-parametric approach to density estimation using a univariate data set. The aim of this work is to calculate the mean integrated squared error (MISE) of each bandwidth selector, which is done by calculating the bias and the variance of the estimate.
1.8 SUMMARY
In this chapter we have described kernel density estimation using the Gaussian kernel and talked about the two approaches to density estimation, the parametric and the non-parametric methods. We noted that the approximate MISE comprises the sum of the integrated squared bias and the integrated variance. In conclusion, there is apparently no single best technique of bandwidth selection.
CHAPTER 2
LITERATURE REVIEW
2.0 Introduction
Kernel density estimation has become a standard tool in empirical studies in any research area. This goes hand in hand with the fact that this kind of estimator is now provided by many software packages. For about three decades the discussion on the choice of bandwidth has been going on, although a good part of the discussion is about non-parametric regression. New contributions typically provide simulations only to show that their own selector performs well.
Density estimation
Bandwidth selection
Density estimation has experienced a wide explosion of interest over the last two decades, with applications in many fields including archaeology (e.g. Baxter, Beardah and Westwood).
Let k(x) be a kernel function integrating to one, i.e. \int k(x)\, dx = 1. The generalized nearest neighbour estimate is then defined as

\hat{f}(t) = \frac{1}{n\, d_k(t)} \sum_{i=1}^{n} k\!\left(\frac{t - x_i}{d_k(t)}\right),

where d_k(t) is the distance from t to its k-th nearest data point.
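As a hedged illustration of the generalized nearest neighbour estimate above (the sample below is hypothetical, and the Gaussian kernel is used for k):

```python
import math

def gaussian_kernel(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def nn_density(t, data, k):
    # d_k(t): distance from t to its k-th nearest data point
    d = sorted(abs(t - xi) for xi in data)[k - 1]
    n = len(data)
    # f_hat(t) = (1/(n*d_k(t))) * sum_i K((t - x_i)/d_k(t))
    return sum(gaussian_kernel((t - xi) / d) for xi in data) / (n * d)

sample = [0.4, 1.0, 1.6, 2.1, 3.0]  # hypothetical observations
estimate = nn_density(1.5, sample, k=2)
```

The local distance d_k(t) plays the role of the bandwidth: it is small where the data are dense and large where they are sparse.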
In his work he assumed that k is a kernel function and that k is also a positive integer, and defined d_{j,k} to be the distance from x_j to the k-th nearest point in the set comprising the other n−1 data points. The variable kernel estimate with smoothing parameter h is then defined by

\hat{f}(t) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\, d_{j,k}}\, k\!\left(\frac{t - x_j}{h\, d_{j,k}}\right).
The variable kernel density estimate is due to Breiman et al. (1977). The window width associated with each observation is proportional to d_{j,k}, so that data points in regions where the data are sparse will have flatter kernels associated with them. For any fixed k, the overall degree of smoothing is governed by the parameter h, while the choice of k determines how responsive the window width choice will be to local detail. Thus the amount of smoothing adapts to the local density of the data.
The naïve estimator is

\hat{f}(x) = \frac{1}{2hn}\, \#\{x_1, x_2, \ldots, x_n \text{ falling in } (x-h,\, x+h)\},

which can be written in terms of the weight function

w(x) = \tfrac{1}{2} \text{ if } |x| < 1, \; 0 \text{ otherwise}.
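A minimal sketch of the naïve estimator above, on a small hypothetical sample:

```python
def naive_estimator(x, data, h):
    # count of observations falling in (x - h, x + h)
    count = sum(1 for xi in data if x - h < xi < x + h)
    # f_hat(x) = count / (2*h*n)
    return count / (2 * h * len(data))

data = [1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical sample
est = naive_estimator(2.0, data, 0.6)  # 3 of the 5 points fall in (1.4, 2.6)
```

Because the count jumps as x crosses an observation shifted by h, the resulting estimate is a step function, which motivates the smooth kernel estimator.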
Essentially, the first published paper to deal explicitly with probability density estimation was Rosenblatt (1956), who discussed both the naïve estimator and the general weight function class of estimators. Suppose that w(x, y) is a function that satisfies the conditions

\int_{-\infty}^{\infty} w(x, y)\, dx = 1 \quad \text{and} \quad w(x, y) \ge 0 \;\; \forall\, (x, y).

Estimators built from such a w are called general weight function estimates. If the data have a probability density f, then

P(a < X < b) = \int_{a}^{b} f(x)\, dx \quad \forall\, a < b.
When we assume only that the distribution has a probability density f, the data are allowed to speak for themselves more than would be the case if f were constrained to fall in a given parametric family. Density estimates of this kind were first proposed by Fix and Hodges (1951). The performance of a density estimate, and hence of the selected bandwidth, is usually assessed within the confines of the mean squared error (MSE) and the mean integrated squared error (MISE), where

\text{MSE}_x\, \hat{f} = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var} \hat{f}(x),

the sum of the squared bias and the variance at x.
The MISE (Rosenblatt, 1956) is the first and most widely used way of placing a global measure on the accuracy of \hat{f} as an estimator of f. Though there are other global measures of discrepancy which may be more appropriate for a given estimate, the MISE is by far the most tractable. The estimate of course depends on the data as well as on the kernel and the window width. In many branches of statistics there is a trade-off between the bias and variance terms in

\text{MSE}_x\, \hat{f} = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var} \hat{f}(x).

The bias can be reduced at the expense of increasing the variance, and vice versa. The order of integration and expectation can be reversed, which gives the MISE as the sum of the integrated squared bias and the integrated variance. This is important conceptually because it shows that taking larger and larger samples will not, alone, reduce the bias; it will be necessary to adjust the weight function to achieve that.
He gave his own view on kernel density estimators: they smooth the data with a "nice" kernel so as to extract all the important properties of the data. For the histogram, we consider the size of the bins (the binwidth) and the end points of the bins. Turning to the kernel density estimator, he showed that the first two problems of the histogram, namely that it is not smooth and that it depends on the end points of the bins, are solved by the kernel density estimator. He also asserted that the remaining problem of the histogram, the choice of the amount of smoothing, carries over to the kernel estimator as the choice of bandwidth.
He obtained the optimal window width for specific kernels. A related estimator was studied by Ahmad and Ran (2003), and the argument extends to that case. If, in an attempt to eliminate the bias, a very small value of h is used, the integrated variance will become large; on the other hand, choosing a large value of h will reduce the random variation as quantified by the variance, at the expense of increased bias. Curve estimation and window width selection can thus be viewed as a compromise between these two effects. There have been several proposals for the automatic selection of h; a feature of these selection rules is that they also require the choice of a kernel.
Higher order kernels were studied by Muller (1984) and Gasser, Muller and Mammitzsch (1985). Their motivation for using higher order kernels was the reduction of the asymptotic mean integrated squared error. They proposed that the mean integrated squared error (MISE) could be calculated exactly when both the underlying density and the kernel function are Gaussian-based.
In that case closed form expressions are available for the required convolutions. The kernel density estimator is

\hat{f}_h(x) = n^{-1} \sum_{i=1}^{n} k_h(x - x_i),

where h is called the bandwidth, k is a kernel, and k_h(u) = h^{-1} k(u/h). Usually the kernel satisfies

\int k(u)\, du = 1,

so that the estimate is itself a density, i.e. \int \hat{f}_h(x)\, dx = 1. A kernel whose first r−1 moments vanish while the r-th does not is called a kernel of order r; note that because of symmetry, a symmetric kernel has even order.
Their work was on the higher order Gaussian kernels, which they gave as

G_{2r}(x) = \frac{(-1)^r\, \phi^{(2r-1)}(x)}{2^{\,r-1}\,(r-1)!\; x}.
An expression for the mean integrated squared error (MISE) was obtained by evaluating the higher order terms of the kernel, while the asymptotic expression for the bandwidth takes the form

\hat{h} = \left(\frac{\hat{p}_1}{n}\right)^{3/5} + \hat{p}_2\left(\frac{\hat{p}_1}{n}\right).
2.3 Review on Bandwidth
A review of the existing bandwidth selectors would be quite helpful to get an idea of their objectives and performance. The idea of cross-validation methods goes back to Rudemo (1982) and Bowman (1984), but we should also mention in this context the so-called likelihood cross-validation, proposed among others by Duin (1976); on the lack of stability of this method, see Wand and Jones (1995). The biased cross-validation of Scott and Terrell (1987) minimizes the asymptotic MISE, as plug-in methods do, but uses a jack-knife procedure (therefore called cross-validation) to avoid the use of prior information. A kernel contrast method was proposed by Ahmad and Ran (2004); by targeting a different objective function, namely the MISE instead of the ISE, such methods are less volatile but not entirely data adaptive, as they require some pilot information.
Plug-in methods rely on assumptions about the smoothness class (or the like) to which the unknown density belongs; the plug-in approach of Sheather and Jones (1991) is perhaps the most popular one. Various refinements were introduced, for example by Park and Marron (1990), Hall et al. (1991) and Taylor (1989). These selectors are counted among the automatic data-driven bandwidth selection methods, but they are actually much older. In the 1970s and early 1980s, survey papers about density estimation were already available; reviews of bandwidth selection were given by Marron (1988a) and by Park and Marron (1990). A brief survey was provided by Jones et al. (1996b), with a comprehensive simulation study, although they only compared the new methods with classical cross-validation and the plug-in version of Sheather and Jones (1991). Devroye and Lugosi (1996) focus on an optimal tradeoff between the classical plug-in method and standard cross-validation. Cross-validation has a small bias, see Hall and Marron (1987a), but often a large variability in practice across different kernel density estimators. Marron (1986) made the point that "the harder the estimation problem, the better cross-validation works". Based on this, smoothed cross-validation was introduced; the general idea was a kind of pre-smoothing of the data. This procedure of pre-smoothing results in smaller sample variability but enlarges the bias; therefore the resulting bandwidth is often oversmoothing and cuts off relevant features. Ahmad and Ran (2004) proposed a kernel contrast method for bandwidth selection.
Silverman (1986) proposed his ideal h from the point of view of minimizing the approximate mean integrated squared error (AMISE); he stated that the optimal window width will converge to zero as the sample size increases, but at a very slow rate. Oman et al. (2008) considered the least squares cross-validation (LSCV), the biased cross-validation (BCV) and the contrast method. Silverman also derived the optimal window width in the sense of minimizing the mean integrated squared error (MISE), and was able to obtain the optimal window width for some specific kernels. Looking at the work of Terrell (1990) and Scott (1985), the maximal smoothing principle (MSP) works fairly well for unimodal densities.
However, for multimodal densities it tends to oversmooth the data and hide important features. Terrell (1992) advises the use of the MSP "because they start with a sort of null hypothesis that there is no structure of interest and let the data force us" to conclude otherwise. He also said that when using density estimation for presenting conclusions, there is a case for undersmoothing somewhat, so that the reader can do further smoothing by eye.
This chapter has reviewed the literature on the selection of the bandwidth, which translates to the smoothing parameter in the kernel density estimate.
CHAPTER THREE
METHODOLOGY
In several surveys many scholars and authors point out that smoothing methods have been studied extensively, for example by Muller (1988), Scott (1992), Silverman (1986) and Wand and Jones (1995). When too little smoothing is done, the resulting density or regression estimate is too rough and contains spurious features that are artifacts of the sampling process. When too much smoothing is done, important features are smoothed away. A method that uses the data x_1, x_2, \ldots, x_n to produce a bandwidth is a very powerful way to analyze data, and there are a number of reasons to select the smoothing parameter automatically from the data. One is that software packages need a default. This is useful in saving the time of experts by providing a sensible starting point, and it can be impractical to manually select smoothing parameters for all data sets (e.g. see the income data in Park and Marron, 1990). It should never be forgotten, however, that no single rule is the accepted approach to this problem. Hence, the following methods are discussed below.
Several plots of the data, all smoothed by different amounts, may well give more insight into the data than merely considering a single automatically produced curve. This is a natural method for choosing the smoothing parameter, as it entails plotting out several curves and choosing the estimate that is most in accordance with one's prior ideas about the density. For many applications this is satisfactory, and one selects the most acceptable density (Ahmad and Muqdadi, 2003).
Moreover, when we choose the bandwidth we have to consider the error in our choice. For many purposes, particularly for model and hypothesis generation, it is by no means unhelpful for the statistician to supply the scientist with a range of possible models; this affords the user a very useful step forward.

The normal reference rule is based on replacing R(f''), the unknown part of h_{\text{AMISE}}, by its value for a reference distribution, with scale estimated from the data. The method seems to date back to Deheuvels (1977) and Scott (1979), who each proposed it for the histogram.
Let σ and IQR denote the standard deviation and interquartile range of x, respectively, and take the kernel k to be the Gaussian kernel. Assuming that the true density is normal, the AMISE-optimal bandwidth becomes

h_{\text{AMISE,normal}} = 1.06\, \sigma\, n^{-1/5}.
Jones, Marron and Sheather (1996) studied the Monte Carlo performance of the normal reference bandwidth based on the standard deviation; that is, they considered

\hat{h}_{\text{SNR}} = 1.06\, s\, n^{-1/5}.
Following Jones et al. (1996), this method is called the sample normal reference (SNR) method. They found that \hat{h}_{\text{SNR}} had a mean that was usually unacceptably large and thus often produced oversmoothed density estimates. Furthermore, Silverman (1986, page 48) recommended reducing the factor 1.06 in the previous equation to 0.9, in an attempt not to miss bimodality, and using the smaller of two scale estimates. This rule is commonly used in practice; known as the Silverman rule of thumb, it is given by

h_{\text{SROT}} = 0.9\, A\, n^{-1/5},

where A is the smaller of the sample standard deviation and IQR/1.34.
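The rule of thumb is easy to compute. A sketch follows; the quartile convention used below is one of several and is an assumption, as is the sample:

```python
import math

def silverman_rot(data):
    n = len(data)
    mean = sum(data) / n
    # sample standard deviation
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    xs = sorted(data)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # crude quartiles, for illustration
    a = min(s, (q3 - q1) / 1.34)           # smaller of the two scale estimates
    # h_SROT = 0.9 * A * n^(-1/5)
    return 0.9 * a * n ** (-0.2)

h_rot = silverman_rot(list(range(20)))  # hypothetical data 0, 1, ..., 19
```

Production code would normally use an interpolated quantile definition, but the 0.9·A·n^(−1/5) structure is the essential part.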
Terrell and Scott (1985) and Terrell (1990) developed a bandwidth selection method based on choosing the maximal degree of smoothing compatible with the estimated scale, i.e. using the density with the smallest value of \int f''(x)^2\, dx. Taking the variance σ² as the scale parameter, Terrell (1990, page 471) found the distribution with variance σ² that minimizes \int f''(x)^2\, dx; for the standard Gaussian kernel this leads to the oversmoothed bandwidth

h_{\text{OS}} = 1.144\, S\, n^{-1/5}.

This method is called the oversmoothed method. According to Silverman (1986), comparing the oversmoothed bandwidth with the normal reference bandwidth, we see that the oversmoothed bandwidth is 1.08 times larger. Thus in practice there is often very little visual difference between density estimates produced using either the oversmoothed bandwidth or the normal reference bandwidth.
3.3.1 MEASURE OF DISCREPANCY
Naturally, there exists a discrepancy between the density estimate \hat{f}(x) and the true density f(x), which is commonly measured by the mean integrated squared error (MISE). On the other hand, the mean squared error (MSE) can also be used. While the MISE measures the error globally, considering all points of the density estimate, the MSE measures the error locally, at a single point. The MISE is the sum of the integrated squared bias and the integrated variance. The bias is

\text{bias}_h(x) = E\hat{f}(x) - f(x) = \int \frac{1}{h}\, k\!\left(\frac{x - y}{h}\right) f(y)\, dy - f(x). \quad (2)
We do some transformation. Let

t = \frac{x - y}{h} \;\Rightarrow\; y = x - ht,

and, with the assumption that k integrates to unity, we have

\text{bias}_h(x) = \int k(t)\, f(x - ht)\, dt - f(x).

Using Taylor's series to expand f(x - ht), we have

f(x - ht) = f(x) - ht\, f'(x) + \frac{1}{2} h^2 t^2 f''(x) - \frac{h^3 t^3}{6} f'''(x) + \cdots,

so that

\text{bias}_h(x) = -h f'(x) \int t\, k(t)\, dt + \frac{1}{2} h^2 f''(x) \int t^2 k(t)\, dt + \cdots.

Using the assumptions

\int_{-\infty}^{\infty} t\, k(t)\, dt = 0 \quad \text{and} \quad \int_{-\infty}^{\infty} t^2 k(t)\, dt = k_2 \neq 0,

we obtain

\text{bias}_h(x) = \frac{1}{2} h^2 f''(x)\, k_2 + \text{higher order terms}.
Hence, the integrated squared bias term of the mean integrated squared error is

\int \text{bias}_h(x)^2\, dx \approx \frac{1}{4} h^4 k_2^2 \int f''(x)^2\, dx. \quad (4)

Similarly,

\int \operatorname{Var} \hat{f}(x)\, dx \approx n^{-1} h^{-1} \int k(t)^2\, dt. \quad (5)
Therefore,

\text{AMISE}(h) = \frac{1}{4} h^4 k_2^2 \int f''(x)^2\, dx + (nh)^{-1} \int k(t)^2\, dt, \quad (6)

and the aim is to choose h to make this as small as possible.
The choice of h is the central problem of kernel density estimation. This problem is revealed in the tradeoff between the bias and the variance: if, in an attempt to eliminate the bias, a very small value of h is used, then the integrated variance will become large. On the other hand, a large h inflates the bias. We therefore minimize

\text{MISE}\, \hat{f}(x) \approx \frac{1}{4} h^4 k_2^2 \int f''(x)^2\, dx + (nh)^{-1} \int k(t)^2\, dt.
Differentiating with respect to h and setting the derivative to zero,

\frac{d}{dh}\, \text{MISE}\, \hat{f}(x) = \frac{d}{dh} \left( \frac{1}{4} h^4 k_2^2 \int f''(x)^2\, dx + (nh)^{-1} \int k(t)^2\, dt \right) = 0.
This gives

h_{\text{opt}} = k_2^{-2/5} \left( \int f''(x)^2\, dx \right)^{-1/5} \left( \int k(t)^2\, dt \right)^{1/5} n^{-1/5}. \quad (7)
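To see formula (7) in action, one can plug in the quantities for a Gaussian kernel and a normal density f with standard deviation σ (k_2 = 1, \int k^2 = 1/(2\sqrt{\pi}), \int f''^2 = 3/(8\sqrt{\pi}\sigma^5)); this recovers the familiar normal reference constant (4/3)^{1/5} ≈ 1.06. A sketch:

```python
import math

def h_opt_normal(sigma, n):
    # normal-reference quantities for the Gaussian kernel
    k2 = 1.0                                              # ∫ t^2 k(t) dt
    r_k = 1.0 / (2.0 * math.sqrt(math.pi))                # ∫ k(t)^2 dt
    r_f2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)  # ∫ f''(x)^2 dx
    # formula (7): h = k2^(-2/5) * (∫f''^2)^(-1/5) * (∫k^2)^(1/5) * n^(-1/5)
    return k2 ** (-0.4) * r_f2 ** (-0.2) * r_k ** 0.2 * n ** (-0.2)
```

For σ = 1 this reduces to (4/3)^{1/5} n^{-1/5}, roughly 1.06 n^{-1/5}, matching the normal reference bandwidth discussed in Chapter 3.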
The formula (7) above for the optimal window width is somewhat disappointing, since it shows that h_{\text{opt}} itself depends on the unknown density f(x) being estimated. The following observations and deductions are made:

\lim_{n \to \infty} h_{\text{opt}} = 0 \quad (h_{\text{opt}} \text{ converges to zero, but at a slow rate}).
For the Gaussian kernel,

\int k(t)^2\, dt = \int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-t^2}\, dt.

Let y = t^2 \Rightarrow dy = 2t\, dt, with t = \sqrt{y}. Then

\int_{-\infty}^{\infty} k(t)^2\, dt = 2 \int_{0}^{\infty} \frac{1}{2\pi}\, e^{-y}\, \frac{dy}{2\sqrt{y}} = \frac{1}{2\pi} \int_{0}^{\infty} y^{-1/2} e^{-y}\, dy = \frac{1}{2\pi}\, \Gamma\!\left(\frac{1}{2}\right) = \frac{\sqrt{\pi}}{2\pi} = \frac{1}{2}\, \pi^{-1/2}.

Therefore

\int_{-\infty}^{\infty} k(t)^2\, dt = \frac{1}{2\sqrt{\pi}}.
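The value 1/(2√π) ≈ 0.2821 can be checked numerically; a simple Riemann sum over [−8, 8] suffices, since the integrand is negligible outside that range:

```python
import math

def k(t):
    # Gaussian kernel
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

a, b, m = -8.0, 8.0, 4000
step = (b - a) / m
integral = step * sum(k(a + i * step) ** 2 for i in range(m + 1))
```

The sum agrees with the closed-form value to high accuracy because the integrand is smooth and decays rapidly.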
Least squares cross-validation is the most popular and best studied method. It is a completely automatic method for choosing the smoothing parameter (Silverman, 1986). The method is based on the so-called leave-one-out density estimates; Scott and Terrell (1987) called it unbiased cross-validation, and related large-sample results were given by Hall (1983) and Stone (1984). The idea is to consider the expansion of the integrated squared error of \hat{f}_h. Note that the last term of the expansion does not depend on \hat{f}_h, hence not on h, so that we only need to consider the first two terms; the ideal choice of bandwidth is the one which minimizes them.
The method is therefore to construct an estimate LSCV(h) of these terms from the data and minimize it over h.
A measure of the closeness of \hat{f}(x) to the density f(x) is the mean integrated squared error,

\text{MISE} = E \int \left( \hat{f}(x) - f(x) \right)^2 dx \quad (1)

= E \int \left( \hat{f}^2(x) - 2 \hat{f}(x) f(x) + f^2(x) \right) dx. \quad (2)
Notice that the last term on the right-hand side of the expression above does not involve the estimated density (meaning that \int f^2(x)\, dx does not depend on h). The motivation for least squares cross-validation (LSCV) comes from expanding the square and estimating the remaining two terms from the data. Thus, in practice it is prudent to plot LSCV(h) rather than just rely on the result of a minimization routine, and it is recommended that the largest local minimizer of LSCV(h) be used as h_{\text{LSCV}}, since this value produces better empirical performance than the global minimizer. Hall and Marron (1987b) and Scott and Terrell (1987) showed that the LSCV bandwidth has a very slow relative rate of convergence.
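The LSCV criterion estimates the first two terms of (2): LSCV(h) = \int \hat{f}_h^2 - (2/n) \sum_i \hat{f}_{h,-i}(x_i), where \hat{f}_{h,-i} is the leave-one-out estimate. For the Gaussian kernel the integral term has a closed form, since the convolution of two normal densities is again normal. A sketch (the sample is hypothetical):

```python
import math

def normal_pdf(d, var):
    # density of N(0, var) evaluated at d
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lscv(h, data):
    n = len(data)
    # ∫ f_hat_h(x)^2 dx = (1/n^2) * sum_{i,j} N(x_i - x_j; 0, 2h^2)
    term1 = sum(normal_pdf(xi - xj, 2.0 * h * h)
                for xi in data for xj in data) / n ** 2
    # (2/n) * sum_i f_hat_{h,-i}(x_i), the leave-one-out term
    loo = 0.0
    for i, xi in enumerate(data):
        loo += sum(normal_pdf(xi - xj, h * h)
                   for j, xj in enumerate(data) if j != i) / (n - 1)
    return term1 - 2.0 * loo / n

sample = [0.9, 1.1, 1.8, 2.2, 2.9, 3.3]  # hypothetical data
score = lscv(0.5, sample)
```

In practice one evaluates lscv over a grid of h values and, as recommended above, inspects the plotted curve rather than trusting a single optimizer run.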
The slow rate of convergence of LSCV and BCV encouraged much research into plug-in methods, which require a pilot estimate. There are many ways this can be done; we next describe the approach of Sheather and Jones (1991), since this method is widely recommended (e.g. Simonoff, 1996, page 77; Bowman and Azzalini, 1997; Venables and Ripley, 2002, page 129).
Plug-in methods require an estimate of R(f''). The Sheather and Jones (1991) approach is based on writing the pilot bandwidth g for the estimate of R(f'') as a function of h, namely

g(h) = c(k) \left[ \frac{R(f'')}{R(f''')} \right]^{1/7} h^{5/7},

and estimating the resulting unknown functionals of f using kernel density estimates. The Sheather and Jones bandwidth \hat{h}_{\text{SJ}} is then the solution of the fixed-point equation

h = \left[ \frac{R(k)}{\mu_2(k)^2\, R\big(\hat{f}''_{g(h)}\big)} \right]^{1/5} n^{-1/5}.
The rate of convergence of \hat{h}_{\text{SJ}} is faster than that of BCV; most of the improvement is because BCV effectively uses a pilot bandwidth equal to h itself. Jones, Marron and Sheather (1996) found that for easy-to-estimate densities (i.e., those for which R(f'') is relatively small) the distribution of \hat{h}_{\text{SJ}} tends to be centered near h_{\text{AMISE}} and has much lower variability than the distribution of h_{\text{LSCV}}.
It is often useful to look at more than one value of the bandwidth. Scott (1992, page 161) advised looking at the sequence of estimates with bandwidths

h = \frac{h_{\text{OS}}}{1.05^k} \quad \text{for } k = 0, 1, 2, \ldots,

starting with the oversmoothed bandwidth and decreasing h until the estimate displays some instability and very local noise near the peaks. Marron and Chung (2001) also recommend looking at a family of density estimates for the given data set based on different values of the smoothing parameter, and Marron and Chung (2001, page 198) advised using the Sheather and Jones plug-in bandwidth as the centre of such a family. Silverman (1986) noted, in this case, that the number of modes in the density estimate decreases with increased smoothing.
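Scott's sequence above is straightforward to generate; a sketch with a hypothetical starting value for h_OS:

```python
def bandwidth_family(h_os, num):
    # h_OS / 1.05^k for k = 0, 1, ..., num-1
    return [h_os / 1.05 ** k for k in range(num)]

ladder = bandwidth_family(1.2, 4)  # hypothetical oversmoothed bandwidth 1.2
```

Each successive bandwidth is 5% smaller, so the family moves gradually from the oversmoothed estimate toward rougher ones.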
Biased Cross-Validation
Biased cross-validation was proposed by Scott and Terrell (1987). It is based on the asymptotic mean integrated squared error rather than the integrated squared error (ISE). Consider again the asymptotic mean integrated squared error,

\text{AMISE}(h) = (nh)^{-1} R(k) + \frac{h^4 \mu_2(k)^2}{4}\, R(f'').
They estimate the unknown functional R(f'') (essentially) by R(\hat{f}''_h) to derive a score function BCV(h), which is minimized with respect to h. Here \hat{f}''_h is the second derivative of the kernel density estimate, and the subscript h denotes the fact that the bandwidth used for this estimate is the same one used to estimate the density f(x) itself. Scott and Terrell (1987) showed that R(\hat{f}''_h) is a biased estimator of R(f'').
\text{BCV}(h) = \frac{R(k)}{nh} + \frac{k_2^2}{2 n^2 h} \sum_{i \neq j} \varphi\!\left( \frac{x_i - x_j}{h} \right),

where \varphi is a known function determined by the kernel.
Scott and Terrell (1987) proposed to use the minimizer \hat{h}_{\text{BCV}} of BCV(h) as the bandwidth. They showed that \hat{h}_{\text{BCV}} has the same relative rate of convergence to h_0 as \hat{h}_{\text{LSCV}}, but that the constant is often much smaller, so that better performance occurs.
According to Wand and Jones (1995, page 80), the ratio of the two asymptotic variances is

\frac{\sigma^2_{\text{LSCV}}}{\sigma^2_{\text{BCV}}} = 15.7,

indicating that the bandwidths obtained from least squares cross-validation are expected to be much more variable than those obtained from biased cross-validation.
Comparison Between Subjective Choice, Rules of Thumb and Least
Squares Cross-Validation
The rules of thumb ultimately result in oversmoothing when used with heavy-tailed distributions; if an outlier is left out of the data set, then the smoothing improves. A reference for this point is Schuster and Gregory (1981), who point out the dangers of oversmoothing for heavy-tailed distributions, showing clearly how feature estimates are damaged. The reference rules of thumb produce near-optimal results only when the true density resembles the reference density. LSCV, by contrast, is fully data driven, but its relative rate of convergence is of order n^{-1/10}; this slow rate of convergence means that \hat{h}_{\text{LSCV}} is highly variable in practice. The principal advantage of LSCV over other methods is that it is widely applicable.
SUMMARY
In this chapter we have described the bandwidth selection methods considered in this study and the measure of discrepancy for these methods, for which we used the mean integrated squared error (MISE).
CHAPTER FOUR
DATA ANALYSIS
GRAPHICAL ILLUSTRATIONS
This chapter presents the graphical illustration of the data and the data analysis; for the presentation of the data see Appendices A, B and C on pages 63, 64 and 65 respectively. Taking into cognizance the sample data and k, the Gaussian kernel, we calculate the optimal h for each of the bandwidth selection methods, and their corresponding MISEs are also derived. The computations were carried out with software version 4.8 on an HP G62 laptop (AMD Athlon II dual core, 3 GB RAM).