
CHAPTER ONE

1.0 INTRODUCTION

This project seeks to explore the concept of density estimation, especially kernel density estimation, with a close look at bandwidth selection. There are several methods of density estimation, namely: the histogram, the naïve estimator, the nearest neighbour estimator, the variable kernel, orthogonal series estimators, maximum penalized likelihood estimators, general weight function estimators and the kernel density estimator. In this study we shall make use of the kernel density estimator as an improvement over the other methods mentioned above.

Moreover, knowing full well the influence of the choice of k and h in kernel density estimation, we also critically examine the various methods of choosing h, the smoothing parameter used in constructing the density estimate. Five methods of estimating the bandwidth are considered in this study: least squares cross-validation, maximum likelihood cross-validation, the Sheather and Jones plug-in, subjective choice and Silverman's rule of thumb. The optimal value of h for each of these methods is calculated, and the mean integrated square errors (MISEs) are calculated by estimating the squared bias and the variance for each of the selected window widths.

1.1 DENSITY ESTIMATION

Density estimation has experienced a wide explosion of interest over the last

two decades. Among various methods of density estimation, this study

provides a practical description of density estimation based on kernel

methods.

Suppose that we have a set of observed data points, assumed to be a sample from an unknown probability density function; then density estimation is the construction of an estimate of the density function from the observed data.

1.2 KERNEL DENSITY ESTIMATION

Let $x_1, x_2, \dots, x_n$ denote a sample of size n from a random variable X with density function f.

The kernel density estimate of f at the point x is given by

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right),$

where the kernel k satisfies $\int k(x)\,dx = 1$ and the smoothing parameter h is known as the bandwidth. Hence the performance of the kernel density estimator depends on the value of the smoothing parameter. It is also expedient to point out that this study considers the popular choice of k which is the Gaussian kernel, namely

$k(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$
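To make this definition concrete, the following is a minimal sketch in R (the language used for the numerical work in Chapter Four) of the estimator with the Gaussian kernel, evaluated on a grid; the sample and the trial bandwidth are illustrative only.

# Minimal sketch of the kernel density estimate with a Gaussian kernel.
# The sample x and the trial bandwidth h are illustrative only.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # observed data x_1, ..., x_n
h <- 0.5                            # a trial bandwidth (smoothing parameter)

kde <- function(t, x, h) {
  # f_hat(t) = (1/(n*h)) * sum_i k((t - x_i)/h), with k the standard normal density
  sapply(t, function(tt) sum(dnorm((tt - x) / h)) / (length(x) * h))
}

grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)
plot(grid, kde(grid, x, h), type = "l", xlab = "x", ylab = "estimated density")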

There exist two approaches to density estimation, namely the parametric and the non-parametric approach.

1.2.1 PARAMETRIC ESTIMATION

Assume that the data are drawn from one of a known family of parametric distributions, for example the normal distribution with mean $\mu$ and variance $\sigma^2$. The density f underlying the data could then be estimated by finding estimates of $\mu$ and $\sigma^2$ from the data and substituting these estimates into the formula for the normal density. In this work we shall not be considering parametric estimates of this kind.

1.2.2 NON-PARAMETRIC DENSITY ESTIMATION

In non-parametric density estimation, the data are allowed to speak for themselves in determining the estimate of f(x); this is more flexible than the case where f is constrained to fall in a given parametric family. No rigid assumption is made about the distribution of the observed data.

1.3 ASSUMPTION AND NOTATION

We have a sample $x_1, x_2, \dots, x_n$ of independent and identically distributed random variables from a continuous univariate distribution with probability density function f(x), which is to be estimated. The following notation is used:

$\int$ - integral over the range $(-\infty, \infty)$

$\sum$ - summation over $i = 1, 2, \dots, n$

n - sample size

f(x) - probability density function

$\hat{f}(x)$ - estimate of f(x)

h - bandwidth

k - Gaussian kernel

1.4 DEFINITION OF TERMS

Some basic terms which are relevant to this study are briefly defined in this

section.

Kernel Density Estimation: this is the construction of an estimate of the density function from the observed data using a kernel function.

Bandwidth: the bandwidth in kernel density estimation, denoted h, is the smoothing parameter. It determines the width of the bumps and the smoothness of the estimated density constructed.

Kernel Function: the kernel function k determines the shape of the bumps; the popular choice of k used in this study is the Gaussian kernel.

1.5 SIGNIFICANCE OF DENSITY ESTIMATION

 Density estimation is used in the presentation and exploration of data.

 It is used in informal investigation of the properties of a given set of

data, such as indications of features of the data like skewness and modality (unimodal, bimodal or multimodal).

 It can also be used for decision making, drawing conclusions, further analysis of data and presentation.

1.6 BASIC K.D.E PROPERTIES

In practice, the kernel k is generally chosen to be a unimodal probability density symmetric about zero. In this case k satisfies the conditions:

1. $\int_{-\infty}^{\infty} k(t)\,dt = 1$  (k is a probability density)

2. $\int_{-\infty}^{\infty} t\,k(t)\,dt = 0$  (symmetry)

3. $\int_{-\infty}^{\infty} t^2\,k(t)\,dt = k_2 \neq 0$

There are several approaches to density estimation. These are itemized below:

 Histograms (the oldest and most widely used density estimator)

 Naïve estimator

 The nearest neighbourhood estimator

 The variable kernel

 Orthogonal series estimators

 Maximum penalized likelihood estimators

 General weight function estimators

 The kernel density estimator.

In this study we shall focus mainly on the kernel method of density estimation for a univariate data set. As defined earlier, the kernel density estimator is

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right).$

This estimates the density f from the random sample $x_1, x_2, \dots, x_n$ at the point x. The selection of the smoothing parameter h and the choice of the kernel function are central to the estimation of the density f; however, there is no universally accepted way of choosing h and k. Over the years several authors have suggested different kernels, namely the uniform, triangle, Epanechnikov, quartic, triweight and Gaussian kernels.

The Gaussian kernel will be used throughout this study; most importantly, we will concentrate on the different methods of selecting the bandwidth, also known as the window width.
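As an illustrative aside (not part of the project's own analysis), R's built-in density() function implements several of these kernels, so their visual effect can be compared on the same simulated data with the same amount of smoothing:

# Sketch: comparing kernel shapes at a fixed bandwidth on simulated data.
set.seed(2)
x <- rnorm(200)
plot(density(x, bw = 0.4, kernel = "gaussian"), main = "Kernel comparison")
lines(density(x, bw = 0.4, kernel = "epanechnikov"), lty = 2)
lines(density(x, bw = 0.4, kernel = "triangular"), lty = 3)
legend("topright", legend = c("gaussian", "epanechnikov", "triangular"), lty = 1:3)

In practice the choice of kernel matters much less than the choice of bandwidth, which is why the rest of this study fixes k to the Gaussian kernel and concentrates on h.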

1.7 SCOPE AND LIMITATION OF THE STUDY

This study covers the non-parametric approach to density estimation using a univariate data set. The aim of this work is to calculate the mean integrated square error (MISE), which is done by calculating the bias and variance for five different methods of choosing h. A graphical representation of the behaviour of the different choices of h is also given.

1.8 SUMMARY

In this chapter we have described kernel density estimation using the Gaussian kernel and a selected choice of the smoothing parameter. We discussed the two approaches to density estimation, namely the parametric and non-parametric approaches. We make use of five different methods of bandwidth selection: the subjective choice of bandwidth, Silverman's rule of thumb, the least squares cross-validation method, the maximum likelihood cross-validation method and the plug-in method. We derived the approximate MISE, which comprises the sum of the squared bias and the variance.

In conclusion, there is apparently no single best technique of bandwidth selection.

CHAPTER TWO

LITERATURE REVIEW

2.0 Introduction

Density estimation is an established method in the fields of statistics and pattern recognition. Historically, the histogram was used to convey the general flavour of non-parametric theory and practice of density estimation. Kernel density estimation has become a common tool for empirical studies in many research areas. This goes hand in hand with the fact that this kind of estimator is now provided by many software packages. The discussion on the choice of bandwidth has been going on for about three decades, although a good part of the discussion is about non-parametric regression. New contributions typically provide simulations only to show that their own selector outperforms some of the existing methods. This section of the study reviews the literature produced by scholars in the following fields:

 Density estimation

 Kernel density estimation

 Bandwidth selection

2.1 REVIEWS ON DENSITY ESTIMATION

Density estimation has experienced a wide explosion of interest over the last two decades (Silverman, 1986). Density estimation has been applied in many fields, including archaeology (e.g. Baxter, Beardah and Westwood, 2000). It is well known that the performance of kernel density estimators depends crucially on the value of the smoothing parameter.

2.1.1 Loftsgaarden and Quesenberry (1965) propounded the nearest neighbour method of density estimation. This method can be generalized to provide an estimate related to the kernel estimate. Let k(x) be a kernel function integrating to one, i.e. $\int k(x)\,dx = 1$. Then the generalized nearest neighbour estimate is defined as

$\hat{f}(t) = \frac{1}{n\,d_k(t)} \sum_{i=1}^{n} k\left(\frac{t - x_i}{d_k(t)}\right),$

where $d_k(t)$ is the distance from t to its k-th nearest data point.

2.1.2 Breiman et al. (1977) modified the nearest neighbour estimate. In their work, they let k be a kernel function and k a positive integer, and defined $d_{j,k}$ to be the distance from $x_j$ to the k-th nearest point in the set comprising the other n−1 data points. Then the variable kernel estimate with smoothing parameter h is defined by

$\hat{f}(t) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\,d_{j,k}}\, k\left(\frac{t - x_j}{h\,d_{j,k}}\right).$

The variable kernel density estimate is due to Breiman et al. (1977), though Wertz (1978) refers to presumably independent earlier work. The window width of the kernel placed on the point $x_j$ is proportional to $d_{j,k}$, so that data points in regions where the data are sparse will have flatter kernels associated with them. For any fixed k, the overall degree of smoothing depends on the parameter h, while the choice of k determines how responsive the window width choice will be to local detail. Thus the amount of smoothing is also influenced by the choice of the integer k.

The naïve density estimator was introduced by Fix and Hodges in an unpublished report. They expressed the naïve density estimator as

$\hat{f}(x) = \frac{1}{2hn}\,\#\{\,x_1, x_2, \dots, x_n \text{ falling in } (x-h,\, x+h)\,\}.$

Let w be the weight function

$w(x) = \tfrac{1}{2} \text{ if } |x| < 1, \quad 0 \text{ otherwise}.$

Then the naïve estimator can be written as

$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, w\left(\frac{x - x_i}{h}\right).$

Essentially, the first published paper to deal explicitly with probability density

estimation was Rosenblatt (1956), who discussed both the naive estimator and

the more general kernel estimator.

2.1.3 Good and Gaskins (1971)

The maximum penalized likelihood approach was first applied to density estimation by Good and Gaskins. Maximum penalized likelihood is a standard statistical method. Meanwhile, some other density estimators have been derived directly from the definition of a density.

2.1.4 Whittle (1958)

It is possible to define a general class of density estimators which includes several of the estimators above. Suppose that w(x, y) is a function satisfying the conditions

$\int_{-\infty}^{\infty} w(x, y)\,dy = 1 \quad \text{for all } x,$

$w(x, y) \ge 0 \quad \text{for all } (x, y).$

Estimates of the form

$\hat{f}(t) = \frac{1}{n} \sum_{i=1}^{n} w(x_i, t)$

are referred to as general weight function estimates.

2.1.5 Silverman (1986)

He said that density estimation is the construction of an estimate of the density function from the observed data. He described the parametric approach to density estimation as follows: consider any random quantity X that has probability density function f. Specifying the function f gives a natural description of the distribution of X, and allows probabilities associated with X to be found from the relation

$P(a < X < b) = \int_{a}^{b} f(x)\,dx \quad \text{for all } a < b.$

2.1.6 Fix and Hodges (1951b)

When we assume only that the distribution has a probability density f, the data are allowed to speak for themselves in determining their estimate of f more than would be the case if f were constrained to fall in a given parametric family. Density estimates of this kind were first proposed by Fix and Hodges (1951).

2.2 Review on Kernel Density Estimation

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental smoothing problem where inferences about the population are made based on a finite data sample. A more recent approach was introduced by Chaudhuri and Marron (2000), whose objective is to analyze the visible features representing important underlying structures for different bandwidths.

2.2.1 Silverman (1986)

He considered measures of discrepancy between f(x) and $\hat{f}(x)$, which he expressed within the confines of the mean square error (MSE) and the mean integrated square error (MISE). According to him, when considering estimation at a single point, a natural measure is the mean square error (abbreviated MSE), defined by

$\mathrm{MSE}_x(\hat{f}) = E\left(\hat{f}(x) - f(x)\right)^2.$

By standard elementary properties of mean and variance,

$\mathrm{MSE}_x(\hat{f}) = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var}\hat{f}(x),$

the sum of the squared bias and the variance at x.

The first and most widely used way of placing a measure on the global accuracy of $\hat{f}(x)$ as an estimator of f(x) is the mean integrated square error (abbreviated MISE), introduced by Rosenblatt (1956) and defined by

$\mathrm{MISE}(\hat{f}) = E\int \left(\hat{f}(x) - f(x)\right)^2 dx.$

Though there are other global measures of discrepancy which may be more appropriate to one's intuitive ideas about what constitutes a globally good estimate, the MISE is by far the most tractable global measure of discrepancy. He also discussed the methodology for the theoretical treatment of the closeness of the estimator $\hat{f}$ to the true density f in various senses. The estimate $\hat{f}$ of course depends on the data as well as on the kernel and the window width.

In many branches of statistics there is a trade-off between the bias and variance terms in

$\mathrm{MSE}_x(\hat{f}) = \left(E\hat{f}(x) - f(x)\right)^2 + \operatorname{var}\hat{f}(x).$

The bias can be reduced at the expense of increasing the variance, and vice versa, by adjusting the amount of smoothing.

It is useful to note that, since the integrand is non-negative, the order of integration and expectation can be reversed, which gives the MISE as the sum of the integrated squared bias and the integrated variance. The bias does not depend on the sample size directly, but only on the weight function. This is important conceptually because it shows that taking larger and larger samples will not, alone, reduce the bias; it will be necessary to adjust the weight function to obtain unbiased estimates.

2.2.2 Tarn (2001)

He provided an insight into histograms (how to construct them and their features). He gave his own view on how kernel density estimators can be regarded as an improvement over the histogram method of density estimation. He recommended how to choose an appropriate, "nice" kernel so as to extract all the important features of the data.

He defined the histogram as the simplest non-parametric density estimator and the one most frequently used. He stated that when constructing a histogram, we must consider the size of the bins (the binwidth) and the end points of the bins.

Turning to the kernel density estimator, he showed that the first two problems of the histogram, lack of smoothness and dependence on the end points of the bins, are solved by the kernel density estimator. He also asserted that the histogram's dependence on the width of the bins remains a problem in kernel density estimation, in the form of dependence on the bandwidth.

2.2.3 Scott (1981)

He obtained the optimal window width for specific kernels. The amount of smoothing in density estimation is determined by the window width (bandwidth).

2.2.4 Ahmad and Ran (2003)

A contrast method for the kernel density estimate $\hat{f}(x)$ was proposed by Ahmad and Ran (2003). They stated that if, in an attempt to eliminate the bias, a very small value of h is used, the integrated variance will become large; on the other hand, choosing a large value of h will reduce the random variation, as quantified by the variance, at the expense of introducing systematic error, or bias, into the estimation.

2.2.5 Wand and Schucany (1989)

They worked on Gaussian-based kernels, a class of higher-order kernels for curve estimation and window width selection which can be viewed as extensions of the second-order Gaussian kernel. These kernels have some attractive properties such as smoothness, and are a widely accepted means of estimating curves such as densities, regression functions and failure rates without parametric assumptions. Derivatives of these functions can be estimated by straightforward extensions of kernel estimators. In their note they confined attention to the estimation of densities and their derivatives.

There have been several proposals for automatic selection of h; a feature of these selection rules is that they also require the use of kernels. They further note that in the simplest case the particular choice of kernel in the kernel density estimator is the Gaussian kernel.

2.2.6 Müller and Mammitzsch (1985)

The kernel density estimator was also studied by Müller (1984) and by Gasser, Müller and Mammitzsch (1985). Their motivation for using higher-order Gaussian-based kernels is to reduce the order of magnitude of the bias of the curve estimate, leading to a faster rate of convergence of the integrated mean squared error.

2.2.7 Fryer (1976) and Deheuvels (1977)

They showed that the mean integrated square error (MISE) can be calculated exactly when both the underlying density and the kernel function are Gaussian, by writing it in terms of convolutions, which are simply evaluated in the Gaussian case. They observed that if f (the density function) is normal and the kernel satisfying $\int k = 1$ is the Gaussian (standard normal) kernel, then closed-form expressions are available for these convolutions and their integrals, and exact MISE calculations are possible.

2.2.8 Rosenblatt (1956)

According to him, the kernel density estimate is given by

$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i),$

where h is called the bandwidth, k is a kernel, and $k_h(u) = \frac{1}{h}\,k\left(\frac{u}{h}\right)$. Usually the following assumptions are imposed on the kernel:

k is symmetric, i.e. $k(v) = k(-v)$;

$\int k(u)\,du = 1$;

$\int u^{j}\,k(u)\,du = 0$ for $j = 1, 2, \dots, \kappa - 1$, with $\int u^{\kappa}\,k(u)\,du \neq 0$.

In this case k is called a kernel of order $\kappa$. Note that because of the symmetry, $\kappa$ is necessarily even, and the second assumption guarantees that $\hat{f}_h(x)$ is a density, i.e. $\int \hat{f}_h(x)\,dx = 1$.

2.2.9 Marron and Wand (1992)

Their work was on the higher-order Gaussian-based kernels, which they gave as

$G_{2r}(x) = \frac{(-1)^{r}\,\phi^{(2r-1)}(x)}{2^{\,r-1}\,(r-1)!\,x}.$

They gave the corresponding density estimate in the form

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} w\left(\frac{x - x_i}{h}\right).$

2.2.10 Hall, Sheather, Jones and Marron (1991)

An expression for the mean integrated square error (MISE) was obtained by evaluating the higher-order terms of the kernel expansion, while the asymptotic expression for the window width was given as

$\hat{h} = \left(\frac{\hat{p}_1}{n}\right)^{1/5} + \hat{p}_2\left(\frac{\hat{p}_1}{n}\right)^{3/5}.$
2.3 Review on Bandwidth

Part of the community working on non-parametric statistics has accepted that there may not be a perfect procedure for selecting the optimal bandwidth; nevertheless, one should be able to say what a reasonable bandwidth selection is, at least for a particular problem. The development of bandwidth selectors is ongoing, so a review and comparison of existing selectors is helpful for getting an idea of their objectives and performance. Moreover, when the variable x is transformed such that similar smoothness can be assumed over the whole (transformed) support, using a global bandwidth is a quite reasonable choice.

The idea of cross-validation methods goes back to Rudemo (1982) and Bowman (1984), but we should also mention in this context the so-called pseudo-likelihood cross-validation methods introduced by Habbema et al. (1974) and by Duin (1976), which, however, lack stability; see Wand and Jones (1995). The biased cross-validation of Scott and Terrell (1987) minimizes the asymptotic MISE, as plug-in methods do, but uses a jack-knife procedure (therefore called cross-validation) to avoid the use of prior information. Methods that combine different selectors or density estimators were proposed by Ahmad and Ran (2004), who called theirs the kernel contrast method, and by Mammen et al. (2011), who proposed the do-validation method.

Compared to cross-validation, plug-in methods minimize a different objective function, namely the MISE instead of the ISE. They are less volatile but not entirely data adaptive, as they require some pilot information. In contrast, cross-validation allows one to choose the bandwidth without making assumptions about the smoothness class (or the like) to which the unknown density belongs. Plug-in methods have a faster convergence rate compared to cross-validation, and their performance is generally good. Among these selectors, Silverman's (1986) rule of thumb is probably the most popular. Various refinements were introduced, for example by Park and Marron (1990), Hall et al. (1991) and Taylor (1989); these count as plug-in methods as they aim to minimize the MISE.

There are already several papers dealing with a comparison of different automatic data-driven bandwidth selection methods, though most of them are now rather dated. In the 1970s and early 1980s, survey papers about density estimation were published by Wegman (1972), by Tarter, and by Wertz and Schneider; surveys of various methods of smoothing parameter selection were provided by Marron (1988a) and by Park and Marron (1990). A brief survey was provided by Jones et al. (1996a), with a comprehensive simulation study published in Jones et al. (1996b). However, they concentrated on bootstrap methods and only compared them with classical cross-validation and the plug-in version of Sheather and Jones (1991). Devroye and Lugosi (1996) focus on optimal bandwidth choice; see Devroye (1997) for a comprehensive comparison study.

In the context of the asymptotic properties of bandwidth selectors, there is a trade-off between the classical plug-in method and standard cross-validation. The plug-in always has a smaller (asymptotic) variance than cross-validation (see Hall and Marron, 1987a), but often a larger bias in practice.

Sheather (2004) gave a practical description of kernel density estimation, reviewing the estimation and bandwidth selection methods which he considered to be the most popular at that time. Devroye (1997) considered different kernel density estimators. Marron (1986) made the point that "the harder the estimation problem, the better cross-validation works"; based on this idea, Martínez-Miranda et al. (2009) proposed first applying cross-validation to a harder estimation problem, and afterwards calculating the corresponding bandwidth for the underlying real estimation problem. Hall et al. (1992) introduced smoothed cross-validation; the general idea is a kind of pre-smoothing of the data before applying the cross-validation criterion. This pre-smoothing results in smaller sample variability but enlarges the bias; therefore the resulting bandwidth often oversmooths and cuts off some important features of the underlying density. Kim et al. (1994) obtained asymptotically best bandwidth selectors based on an exact MISE expansion. Ahmad and Ran (2004) proposed a kernel contrast method for choosing the bandwidth by minimizing either the ISE or, alternatively, the MISE.

Silverman (1986) proposed his ideal h from the point of view of minimizing the approximate mean integrated square error (AMISE). He stated that the optimal window width is somewhat disappointing since it itself depends on the unknown density being estimated; the optimal window width converges to zero as the sample size increases, but at a very slow rate. Oman et al. (2008) considered least squares cross-validation (LSCV), biased cross-validation (BCV) and the contrast method for selecting h, using different underlying normal mixture densities.

Ogbonmwan (1999) discussed a general method of obtaining an optimal window width in the sense of minimizing the mean integrated square error (MISE); he was also able to obtain the optimal window width for some specific kernels, drawing on the work of Terrell and Scott (1985) and Terrell (1990). Terrell (1990) proposed the maximal smoothing principle for histograms and frequency polygons. According to them, the maximal smoothing principle (MSP) works fairly well for unimodal densities. However, for multimodal densities it tends to oversmooth the data and hide features of the underlying density, which can be viewed as a drawback. But Terrell (1992) advises the use of the MSP "because they start with a sort of null hypothesis that there is no structure of interest and let the data force us to conclude otherwise". However, Silverman (1986) stated that it should never be forgotten that the appropriate choice of smoothing parameter will always be influenced by the purpose for which the density estimate is to be used.

He also said that when using density estimation for presenting conclusions, there is a case for undersmoothing somewhat; the reader can do further smoothing "by eye", but cannot easily unsmooth.

2.4 Summary of Chapter

The author's reviews in this work concentrate on density estimation which is

of different types and its various views as considered by different authors.

Also, we looked at kernel density estimation with a concise but

comprehensive discussion as put forward by different scholars in that field of

study.

Finally, an important aspect of density estimation was reviewed. This is the

25
selection of bandwidth, which translate to the smoothing parameter in the

construction of density estimation.

CHAPTER THREE

METHODOLOGY

3.1 Introduction to Bandwidth Selection

In several surveys, many scholars and authors point out that smoothing methods provide a powerful methodology for gaining insight into data. Many examples of this may be found in the monographs of Eubank (1988), Hardle (1990), Muller (1988), Scott (1992), Silverman (1986) and Wand and Jones (1995). But effective use of these methods requires the choice of a smoothing parameter. According to Scott (1992), the choice of h is crucial for the effective performance of the kernel estimator. When insufficient smoothing is done, the resulting density or regression estimate is too rough and contains spurious features that are artifacts of the sampling process. When excessive smoothing is done, important features of the underlying structure are smoothed away. A method that uses the data $x_1, x_2, \dots, x_n$ to produce a value for the bandwidth h is called a bandwidth selector or data-driven selector.

In the hands of an expert, interactive visual choice of the smoothing parameter is a very powerful way to analyze data. But there are a number of reasons why it is important to be able to choose the amount of smoothing automatically from the data. One is that software packages need a default; this is useful in saving the time of experts by providing a sensible starting point, but it becomes imperative when smoothing is used by non-experts. Another reason this is important is that in a number of situations many estimates are required, and it can be impractical to manually select smoothing parameters for all of them (e.g. see the income data in Park and Marron, 1990). It should never be forgotten that there is no universally accepted approach to this problem. Hence, the following methods are discussed below.

3.2 Subjective Choice of Bandwidth

The subjective choice of bandwidth requires examining several plots of the data, all smoothed by different amounts; this may well give more insight into the data than merely considering a single automatically produced curve. It is a natural method for choosing the smoothing parameter, as it entails plotting several curves and choosing the estimate that is most in accordance with one's prior ideas about the density; for many applications this approach will be perfectly satisfactory. There are many situations where it is satisfactory to select the most acceptable density in this way (Ahmad and Muqdadi, 2003). One strategy for doing so is to start with a small (or large) bandwidth and then go up (or down) until a suitable estimate is reached.

Moreover, when we choose the bandwidth we have to consider the error in our choice. For many purposes, particularly for model and hypothesis generation, it is by no means unhelpful for the statistician to supply the scientist with a range of possible presentations of the data. A choice among several alternative models affords the user a very useful step forward from the enormous number of possible explanations that could conceivably be considered.
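As a rough sketch of this subjective procedure (with simulated, illustrative data), one can plot the same sample at several trial bandwidths in R and pick the curve most in accordance with prior ideas about the density:

# Sketch of the subjective approach: one plot per trial bandwidth,
# from heavy to light smoothing. Data and bandwidths are illustrative.
set.seed(3)
x <- c(rnorm(60, 0, 1), rnorm(40, 4, 0.8))       # a bimodal sample
hs <- c(1.6, 0.8, 0.4, 0.2)                       # large to small bandwidths
dens <- lapply(hs, function(h) density(x, bw = h))
plot(dens[[1]], main = "Trial bandwidths",
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))))
for (i in 2:length(dens)) lines(dens[[i]], lty = i)
legend("topright", legend = paste("h =", hs), lty = seq_along(hs))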

3.3 Rules of Thumb

The computationally simplest method for choosing a global bandwidth h is based on replacing $R(f'') = \int f''(x)^2\,dx$, the unknown part of $h_{AMISE}$, by its value for a parametric family expressed as a multiple of a scale parameter, which is then estimated from the data. The method seems to date back to Deheuvels (1977) and Scott (1979), who each proposed it for the histogram. However, the method was popularized for kernel density estimates by Silverman (1986, Section 3.2), who used the normal distribution as the parametric family.

Let $\sigma$ and IQR denote the standard deviation and interquartile range of X, respectively, and take the kernel k to be the Gaussian kernel. Assuming that the underlying distribution is normal, Silverman (1986, pages 45 and 47) showed that bandwidth selection for the kernel density estimate reduces to
$h_{AMISE,\,normal} = 1.06\,\sigma\,n^{-1/5}.$

Jones, Marron and Sheather (1996) studied the Monte Carlo performance of

the normal reference bandwidth based on the standard deviation, that is, they

considered

$h_{SNR} = 1.06\,s\,n^{-1/5},$

where s is the sample standard deviation and n is the sample size. According to Jones et al. (1996), this method is called the sample normal reference method (SNR). They found that $h_{SNR}$ had a mean that was usually unacceptably large and thus often produced oversmoothed density estimates. Furthermore, Silverman (1986, page 48) recommended reducing the factor 1.06 in the previous equation to 0.9, in an attempt not to miss bimodality, and using the smaller of two scale estimates. This rule is commonly used in practice and is often referred to as Silverman's reference bandwidth or Silverman's rule of thumb. It is given by

$h_{SROT} = 0.9\,A\,n^{-1/5},$

where $A = \min(s,\ \mathrm{IQR}/1.34)$ is the smaller of the sample standard deviation and the scaled interquartile range. This method is called the Silverman rule of thumb method (SROT).
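As a brief sketch (with simulated data), these reference rules are easy to compute directly; base R's default bandwidth bw.nrd0() implements the 0.9 rule, while bw.nrd() implements the 1.06 rule.

# Sketch: normal reference and Silverman's rule of thumb on illustrative data.
set.seed(4)
x <- rnorm(200, mean = 10, sd = 3)
n <- length(x)
A <- min(sd(x), IQR(x) / 1.34)        # smaller of the two scale estimates
h_snr  <- 1.06 * sd(x) * n^(-1/5)     # sample normal reference bandwidth
h_srot <- 0.9  * A     * n^(-1/5)     # Silverman's rule of thumb
c(h_snr = h_snr, h_srot = h_srot, bw.nrd = bw.nrd(x), bw.nrd0 = bw.nrd0(x))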

Terrell and Scott (1985) and Terrell (1990) developed a bandwidth selection method based on the maximal smoothing principle, so as to produce oversmoothed density estimates. The method is based on choosing the largest degree of smoothing compatible with the estimated scale of the data, i.e. the density with the smallest value of $\int f''(x)^2\,dx$. Taking the variance $\sigma^2$ as the scale parameter, Terrell (1990, page 471) found the family of distributions with variance $\sigma^2$ that minimizes $\int f''(x)^2\,dx$; for the standard Gaussian kernel this leads to the oversmoothed bandwidth

$h_{OS} = 1.144\,s\,n^{-1/5}.$

This method is called the oversmoothed method. According to Silverman (1986), comparing the oversmoothed bandwidth with the normal reference bandwidth, we see that the oversmoothed bandwidth is 1.08 times larger. Thus in practice there is often very little visual difference between density estimates produced using either the oversmoothed bandwidth or the normal reference bandwidth.
3.3.1 MEASURE OF DISCREPANCY

Naturally, there exists a discrepancy between the density estimate $\hat{f}(x)$ and the actual density f(x). One way of measuring this discrepancy is the Mean Integrated Squared Error (MISE); alternatively, the Mean Squared Error (MSE) can be used. While the MISE measures the error globally, considering all points of the density estimate, the MSE measures the local error, that is, the error at a point.

The MISE is the sum of the integrated squared bias and the integrated variance. That is,

$\mathrm{MISE} = \int \mathrm{bias}_h(x)^2\,dx + \int \operatorname{var}\hat{f}(x)\,dx \qquad (1)$

$\mathrm{bias}_h(x) = E\hat{f}(x) - f(x) = \int \frac{1}{h}\,k\left(\frac{x - y}{h}\right) f(y)\,dy - f(x) \qquad (2)$

We perform a change of variable. Let $t = \frac{x - y}{h}$, so that $y = x - ht$. Using the assumption that k integrates to unity, we have

$\mathrm{bias}_h(x) = \int k(t)\,f(x - ht)\,dt - f(x).$

Using a Taylor series expansion of $f(x - ht)$, we have

$f(x - ht) = f(x) - ht\,f'(x) + \tfrac{1}{2} h^2 t^2 f''(x) - \tfrac{1}{6} h^3 t^3 f'''(x) + \text{higher order terms}.$

Thus, by the assumptions on the kernel function, we have

$\mathrm{bias}_h(x) = -h f'(x) \int t\,k(t)\,dt + \tfrac{1}{2} h^2 f''(x) \int t^2 k(t)\,dt + \cdots$

Using the assumptions $\int_{-\infty}^{\infty} t\,k(t)\,dt = 0$ and $\int_{-\infty}^{\infty} t^2 k(t)\,dt = k_2 \neq 0$, this gives

$\mathrm{bias}_h(x) = \tfrac{1}{2} h^2 f''(x)\,k_2 + \text{higher order terms}. \qquad (3)$

Hence the integrated squared bias, needed for the mean integrated square error, is

$\int \mathrm{bias}_h(x)^2\,dx \approx \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx. \qquad (4)$

Similarly, the integrated variance is

$\int \operatorname{var}\hat{f}(x)\,dx \approx n^{-1} h^{-1} \int k(t)^2\,dt. \qquad (5)$

Therefore, the asymptotic mean integrated squared error (AMISE) is

$\mathrm{AMISE} = \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx + (nh)^{-1} \int k(t)^2\,dt. \qquad (6)$

Essentially, we want to choose h to make the mean integrated square error as small as possible. Equations (4) and (5) demonstrate one of the fundamental problems of density estimation: the trade-off between the bias and the variance. If, in an attempt to eliminate the bias, a very small value of h is used, then the integrated variance will become large. On the other hand, choosing a large value of h will reduce the random variation (integrated variance) at the expense of introducing systematic error (bias) into the estimation. It should be mentioned here that whatever method of density estimation is used, the choice of smoothing parameter implies a trade-off between random error (variance) and systematic error (bias).

The optimal bandwidth h can be obtained from the approximate mean integrated squared error given above,

$\mathrm{AMISE}(\hat{f}) = \tfrac{1}{4} h^4 k_2^2 \int f''(x)^2\,dx + (nh)^{-1} \int k(t)^2\,dt.$

Differentiating with respect to h and equating to zero, we have

$\frac{d}{dh}\,\mathrm{AMISE}(\hat{f}) = h^3 k_2^2 \int f''(x)^2\,dx - n^{-1} h^{-2} \int k(t)^2\,dt = 0.$

Multiplying through by $h^2$ gives

$h^5 k_2^2 \int f''(x)^2\,dx - n^{-1} \int k(t)^2\,dt = 0,$

so that

$h^5 k_2^2 \int f''(x)^2\,dx = n^{-1} \int k(t)^2\,dt,$

$\therefore\; h_{opt} = k_2^{-2/5} \left(\int f''(x)^2\,dx\right)^{-1/5} \left(\int k(t)^2\,dt\right)^{1/5} n^{-1/5}. \qquad (7)$

The formula (7) above for the optimal window width is somewhat disappointing, since it shows that $h_{opt}$ itself depends on the unknown density f(x) being estimated. The following observations and deductions are made:

$\lim_{n \to \infty} h_{opt} = 0$ ($h_{opt}$ converges to zero, but at a slow rate);

$\int f''(x)^2\,dx$ measures, in a sense, the rapidity of fluctuations in the density f(x); and

$\int k(t)^2\,dt = \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-t^2}\,dt$ for the Gaussian kernel.

Let $y = t^2$, so that $dy = 2t\,dt$ and $t = \sqrt{y}$. Then

$\int_{-\infty}^{\infty} k(t)^2\,dt = 2 \int_{0}^{\infty} \frac{1}{2\pi}\, e^{-y}\, \frac{dy}{2\sqrt{y}} = \frac{1}{2\pi} \int_{0}^{\infty} y^{-1/2} e^{-y}\,dy = \frac{1}{2\pi}\,\Gamma\!\left(\tfrac{1}{2}\right) = \frac{\sqrt{\pi}}{2\pi} = \tfrac{1}{2}\pi^{-1/2}.$

$\therefore\; \int_{-\infty}^{\infty} k(t)^2\,dt = \frac{1}{2\sqrt{\pi}}.$
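As a small numerical sketch connecting formula (7) with the reference rules of Section 3.3: if the unknown density f is taken to be $N(0, \sigma^2)$, then $\int f''(x)^2\,dx = 3/(8\sqrt{\pi}\,\sigma^5)$ (a standard result), and with the Gaussian kernel ($k_2 = 1$, $\int k(t)^2\,dt = 1/(2\sqrt{\pi})$) formula (7) reduces to approximately $1.06\,\sigma\,n^{-1/5}$, the factor used in the normal reference rule. The R lines below check this numerically.

# Sketch: formula (7) evaluated under a normal reference density, sigma = 1.
h_opt_normal <- function(n, sigma = 1) {
  k2  <- 1                              # second moment of the Gaussian kernel
  Rk  <- 1 / (2 * sqrt(pi))             # integral of k(t)^2
  Rf2 <- 3 / (8 * sqrt(pi) * sigma^5)   # integral of f''(x)^2 for N(0, sigma^2)
  k2^(-2/5) * Rf2^(-1/5) * Rk^(1/5) * n^(-1/5)
}
h_opt_normal(100)        # approximately 0.422
1.06 * 100^(-1/5)        # approximately 0.422 as well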

3.4 Cross-Validation Methods

This method is also known as least squares cross-validation. It was proposed by Rudemo (1982) and by Bowman (1984) and is probably the most popular and best studied selector. It is a completely automatic method for choosing the smoothing parameter (Silverman, 1986). It was only formulated in relatively recent years but is based on an extremely simple idea; it depends on the kernel and the bandwidth, not on the sample size. The method is based on the so-called leave-one-out density estimator.

Scott and Terrell (1987) called this method unbiased cross-validation since, for fixed h (i.e. a non-random bandwidth), it is easy to check that $E[\mathrm{LSCV}(h)] = \mathrm{MISE}(h) - R(f)$. The popularity of this method is due to this intuitive motivation and to the fact that it is asymptotically optimal (Hall, 1983; Stone, 1984). The idea is to consider the expansion of the integrated square error (ISE) in the following way:

$\mathrm{ISE}(h) = \int \hat{f}_h^2(x)\,dx - 2\int \hat{f}_h(x) f(x)\,dx + \int f^2(x)\,dx.$

^^
Note that the last tem does not depend on f h−, hence on h, so that we only

need to consider the first two terms the ideal choice of bandwidth is the one

which minimizes.

L ( h )=ISE ( h )−∫ f 2 ( x ) dx=∫ ^f h2 ( x ) dx −∫ f^ ( x )∫ ( x ) dx


the principle of the LSCV

method is to find an estimate f L(h) from the data and minimize it over h, and

also minimize the BCVCh over h.

A measure of the closeness of $\hat{f}(x)$ to the density f(x) is the mean integrated square error (MISE),

$\mathrm{MISE} = E\int \left(\hat{f}(x) - f(x)\right)^2 dx \qquad (1)$

$= E\int \left[\hat{f}^2(x) - 2\hat{f}(x) f(x) + f^2(x)\right] dx. \qquad (2)$

Notice that the last term on the right-hand side of the expression above does not involve the estimated density (that is, $\int f^2(x)\,dx$ does not depend on h); therefore minimization of the MISE is equivalent to minimization of

$\mathrm{MISE} - \int f^2(x)\,dx = E\left[\int \hat{f}^2(x)\,dx - 2\int \hat{f}(x) f(x)\,dx\right].$

The LSCV bandwidth selection method is based on obtaining an unbiased estimate of this quantity; thus the motivation of least squares cross-validation (LSCV) comes from expanding the MISE in equation (1).

Thus, in practice it is prudent to plot LSCV(h) and not just rely on the result of a minimization routine. Jones, Marron and Sheather (1996) recommended that the largest local minimizer of LSCV(h) be used as $h_{LSCV}$, since this value produces better empirical performance than the global minimizer. Hall and Marron (1987b) and Scott and Terrell (1987) showed that the LSCV bandwidth $h_{LSCV}$ achieves the best possible rate of convergence in this setting, although this rate is slow.
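As a sketch (with simulated data and a hand-rolled criterion), the LSCV bandwidth can be obtained in base R with bw.ucv(), and the LSCV(h) curve itself can be plotted as recommended above. The closed form used below for the Gaussian kernel is a standard one, but its minimizer may differ slightly from bw.ucv(), which uses its own internal approximations.

# Sketch: LSCV criterion for a Gaussian kernel, plus R's built-in selector.
set.seed(6)
x <- c(rnorm(80), rnorm(40, mean = 3))

lscv <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-") / h
  term1 <- sum(dnorm(d, sd = sqrt(2))) / (n^2 * h)   # integral of fhat_h^2
  K <- dnorm(d); diag(K) <- 0                        # leave-one-out kernel sums
  term2 <- 2 * sum(K) / (n * (n - 1) * h)
  term1 - term2
}

h_grid <- seq(0.05, 1.5, length.out = 60)
plot(h_grid, sapply(h_grid, lscv, x = x), type = "l",
     xlab = "h", ylab = "LSCV(h)")
abline(v = bw.ucv(x), lty = 2)   # minimizer found by R's unbiased CV selector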

3.5 Plug-In Method

The slow rate of convergence of LSCV and BCV encouraged much research on faster-converging methods. The plug-in method is commonly thought to date back to Woodroofe (1970), who proposed it for estimating the density at a given point. Estimating $R(f'')$ by $R(\hat{f}''_g)$ requires the user to choose the bandwidth g for this so-called pilot estimate. There are many ways this can be done; we next describe the "solve-the-equation" plug-in approach developed by Sheather and Jones (1991), since this method is widely recommended (e.g. Simonoff, 1996, page 77; Bowman and Azzalini, 1997; Venables and Ripley, 2002, page 129).

Different versions of the plug-in approach depend on the exact form of the estimate of $R(f'')$. The Sheather and Jones (1991) approach is based on writing the pilot bandwidth for the estimate of $R(f'')$ as a function of h, namely

$g(h) = C(k)\left[\frac{R(f'')}{R(f''')}\right]^{1/7} h^{5/7},$

and estimating the resulting unknown functionals of f using kernel density estimates with bandwidths based on normal rules of thumb. In this situation, the only unknown in the following equation is h:

$h = \left[\frac{R(k)}{\mu_2(k)^2\, R\!\left(\hat{f}''_{g(h)}\right)}\right]^{1/5} n^{-1/5}.$

The Sheather-Jones plug-in bandwidth $h_{SJ}$ is the solution to this equation.

Under smoothness assumptions on the underlying density,

$n^{5/14}\left(\frac{h_{SJ}}{h_{AMISE}} - 1\right)$

has an asymptotic $N(0, \sigma^2_{SJ})$ distribution. Thus the Sheather-Jones plug-in bandwidth has a relative convergence rate of order $n^{-5/14}$, which is much higher than that of BCV. Most of the improvement is because BCV effectively uses the same bandwidth to estimate $R(f'')$ as it does to estimate f, while the Sheather-Jones plug-in approach assumes more smoothness of the underlying density than either LSCV or BCV.

Jones, Marron and Sheather (1996) found that for easy-to-estimate densities (i.e. those for which $R(f'')$ is relatively small) the distribution of $h_{SJ}$ tends to be centred near $h_{AMISE}$ and has much lower variability than the distribution of $h_{LSCV}$.
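In base R, the Sheather-Jones bandwidth is available through bw.SJ(); the sketch below (on simulated data) computes both the solve-the-equation and the direct plug-in variants and uses the former to draw the density estimate.

# Sketch: Sheather-Jones plug-in bandwidths on illustrative data.
set.seed(7)
x <- c(rnorm(120), rnorm(80, mean = 4, sd = 0.7))
h_sj_ste <- bw.SJ(x, method = "ste")   # "solve-the-equation" variant
h_sj_dpi <- bw.SJ(x, method = "dpi")   # direct plug-in variant
c(ste = h_sj_ste, dpi = h_sj_dpi)
plot(density(x, bw = h_sj_ste), main = "Sheather-Jones plug-in bandwidth")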

A number of authors have recommended that density estimates be drawn with more than one value of the bandwidth. Scott (1992, page 161) advised looking at a sequence of density estimates based on the sequence of smoothing parameters

$h = \frac{h_{OS}}{1.05^{k}} \quad \text{for } k = 0, 1, 2, \dots,$

starting with the oversmoothed bandwidth; the later estimates in such a sequence can display some instability and very local noise near the peaks. Marron and Chung (2001) also recommend looking at a family of density estimates for the given data set based on different values of the smoothing parameter. Marron and Chung (2001, page 198) advised that this family be based around a "centre point" which is an effective choice of the global smoothing parameter; the Sheather-Jones plug-in bandwidth is a natural choice for this purpose. Silverman (1981) showed that an important advantage of using the Gaussian kernel is that, in this case, the number of modes in the density estimate decreases monotonically as the bandwidth h increases. This means that the number of features in the estimated density is a decreasing function of the amount of smoothing.
Biased Cross-Validation

Biased cross-validation was proposed by Scott and Terrell (1987). It is based on choosing the bandwidth that minimizes an estimate of the asymptotic mean integrated square error (AMISE), rather than an estimate of the integrated square error (ISE). Consider again the asymptotic mean integrated squared error,

$\mathrm{AMISE}(h) = (nh)^{-1} R(k) + \frac{h^4 \mu_2(k)^2}{4}\, R(f'').$

The BCV bandwidth is obtained by substituting an estimate for $R(f'')$: instead of using a reference distribution, Scott and Terrell estimate this functional (essentially) by $R(\hat{f}''_h)$ to derive a score function BCV(h), which is minimized with respect to h. Here $\hat{f}''_h$ is the second derivative of the kernel density estimate, and the subscript h denotes the fact that the bandwidth used for this estimate is the same one used to estimate the density f(x) itself. Scott and Terrell (1987) showed that $R(\hat{f}''_h)$ is a biased estimate of $R(f'')$. The BCV objective is given by

$\mathrm{BCV}(h) = \frac{R(k)}{nh} + \frac{\mu_2(k)^2}{2 n^2 h} \sum_{i<j} \left(k'' \ast k''\right)\!\left(\frac{x_i - x_j}{h}\right),$

where $\ast$ denotes convolution.

Scott and Terrell (1987) proposed to use the minimizer $\hat{h}_{BCV}$ of BCV(h) as the bandwidth. They showed that $\hat{h}_{BCV}$ has the same relative rate of convergence to $\hat{h}_0$ as $\hat{h}_{LSCV}$, but that the constant is often much smaller. The best performance is obtained by choosing the smallest value of h for which a local minimum occurs.

According to Wand and Jones (1995, page 80), the ratio of the two asymptotic variances for the Gaussian kernel is

$\frac{\sigma^2_{LSCV}}{\sigma^2_{BCV}} = 15.7,$

indicating that bandwidths obtained from least squares cross-validation are expected to be much more variable than those obtained from biased cross-validation.
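For completeness, a short sketch: base R also provides the BCV selector through bw.bcv(), so the two cross-validation bandwidths can be compared directly on the same (simulated) data.

# Sketch: BCV versus LSCV bandwidths on illustrative data.
set.seed(8)
x <- rnorm(150, mean = 2, sd = 1.5)
c(h_bcv = bw.bcv(x), h_lscv = bw.ucv(x))
# Across repeated samples the BCV bandwidth tends to vary less than the LSCV
# one, in line with the asymptotic variance ratio quoted above.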

Comparison Between Subjective Choice, Rules of Thumb and Least

Squares Cross-Validation

First, the least squares cross-validation approach to bandwidth selection can ultimately oversmooth the data when used with heavy-tailed distributions: if an outlier is left out of the data set, then smoothing the remaining observations may produce essentially no estimate at that point, thereby forcing a larger bandwidth to be selected. A reference for this point is Schuster and Gregory (1981), who point out that LCV produces inconsistent fixed-bandwidth kernel estimates when the underlying distribution has heavy tails. However, choosing the bandwidth h subjectively can moderate the defects of the LCV in terms of oversmoothing, and can show clearly the features of the estimate made possible by the observations provided. In the same vein, reference rules of thumb afford near-optimal results through the h that minimizes the AMISE under the reference distribution. The slow $n^{-1/10}$ rate of convergence means that $h_{LSCV}$ is highly variable in practice. In addition to high variability, least squares cross-validation often undersmooths in practice, in that it leads to spurious bumpiness in the estimated density (Simonoff, 1996). A major advantage of LSCV over other methods is that it is widely applicable.

SUMMARY

In this chapter we discussed some of the methods used in selecting the bandwidth and their procedures. We also looked at the measure of discrepancy for these methods, using the mean integrated square error (MISE) to quantify how far the estimates produced by these methods are from their target values.

CHAPTER FOUR

DATA ANALYSIS

PRESENTATION OF DATA, NUMERICAL VERIFICATION AND

GRAPHICAL ILLUSTRATIONS

This chapter focuses predominantly on numerical verification and graphical illustration of the data and the data analysis; for the presentation of the data see Appendices A, B and C on pages 63, 64 and 65 respectively. Taking into cognizance the obvious importance of the choice of the bandwidth h, and also of the Gaussian kernel k, in the construction of the density estimate, we calculate the optimal h for each of the bandwidth selection methods. Their corresponding MISEs are also derived. Both the optimal h values and the MISEs are estimated using the R programming language, version 4.8, on an HP G62 laptop (AMD Athlon II dual-core / i3-370M, 3 GB RAM, 373DX).
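The following R sketch illustrates the kind of computation performed in this chapter; the sample here is simulated as a stand-in for the appendix data, the selector names map onto the methods of Chapter Three where base R provides them, and the maximum likelihood cross-validation routine is a small custom function written for the sketch (base R has no built-in selector for it).

# Sketch: the bandwidth selectors of Chapter Three applied to one sample.
set.seed(9)
x <- rnorm(100, mean = 50, sd = 10)   # placeholder for an appendix data set

# Maximum likelihood (leave-one-out) cross-validation score, Gaussian kernel.
mlcv <- function(h, x) {
  n <- length(x)
  K <- dnorm(outer(x, x, "-") / h); diag(K) <- 0
  mean(log(rowSums(K) / ((n - 1) * h)))
}
h_mlcv <- optimize(mlcv, interval = c(0.1, 10) * bw.nrd0(x),
                   x = x, maximum = TRUE)$maximum

c(silverman_rot  = bw.nrd0(x),   # 0.9 * min(s, IQR/1.34) * n^(-1/5)
  normal_ref     = bw.nrd(x),    # 1.06-based reference rule
  lscv           = bw.ucv(x),    # least squares cross-validation
  mlcv           = h_mlcv,       # maximum likelihood cross-validation
  sheather_jones = bw.SJ(x))     # solve-the-equation plug-in
# A subjective bandwidth would be chosen by inspecting plots, as in Section 3.2.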

