Asymptotically Optimal Histogram Density Approximations

Asymptotically Optimal Cells for a Historgram
Author(s): Atsuyuki Kogure

Source: The Annals of Statistics, Vol. 15, No. 3 (Sep., 1987), pp. 1023-1030
Published by: Institute of Mathematical Statistics
Stable URL: https://www.jstor.org/stable/2241813
Accessed: 02-06-2023 20:46 +00:00
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms
Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and

extend access to The Annals of Statistics
This content downloaded from 128.197.229.194 on Fri, 02 Jun 2023 20:46:02 +00:00
All use subject to https://about.jstor.org/terms
The Annals of Statistics
1987, Vol. 15, No. 3, 1023-1030
ASYMPTOTICALLY OPTIMAL CELLS FOR A HISTOGRAM
BY ATSUYUKI KOGURE
Fukushima University
The purpose of this paper is to examine the properties of the histogram

when the cells are allowed to be arbitrary. Given a random sample from an
unknown probability density f on I, we wish to construct a histogram. Any
partition of I can be used as cells. The optimal partition minimizes the mean
integrated squared error (MISE) of the histogram from f. An expression is
found for the infimum of MISE over all partitions. It is proved that the
infimum is attained asymptotically by minimizing MISE over a class of
partitions of locally equisized cells.
1. Introduction. Let f be a continuous probability density on an interval I

in the real line. I includes its endpoints if it is finite. Given a random sample
X1, X2,..., X. of size n from f, we wish to construct a histogram. Prior to the
actual drawing of a histogram we need to choose cells into which the Xi's are
grouped. Any partition of I can be used as cells. (We call a collection {C>) of
intervals of I a partition of I if Ci n Cj = 0 whenever i 1 j and if I = UjCj.)
When a partition Q = {Cj> is chosen, the histogram on Cj is
Hn(xIQ) := Fn(Cj)1ICI9
where ICjl denotes the width of cell Cj and

n
Fn(C)= n- lE {XiE C ).
i=l
Considering the histogram an estimate of f, we adopt the

squared error,
(1.1) MISE(n,Q) := E I(Hn(xIQ) -f(x))dXj^
as the risk is using Q. Here E denotes the expectation with

The risk is decomposed into two components as
(1.2) MISE(n, Q) = F(C- )(i F(C)) +-
where fQ(x) = E[Hn(xIQ)]. The first component

the sampling variability and the second the bias.
the bias at the cost of increasing the sampling v
is to minimize MISE by compromising the confli
Received April 1986; revised January 1987.

AMS 1980 subject classifications. Primary 62G05; secondary 62E20.
Key words and phrases. Histogram, cells, partition, mean integrated squared error.
1023
1024 A. KOGURE
In this paper we examine the properties of the histogram when the cells are
allowed to be arbitrary. We ask how close, in terms of the MISE, the histogram
comes to f. Several authors-including Tapia and Thompson (1978), Scott (1979)
and Freedman and Diaconis (1981)-have investigated this problem. However,
they deal with the subject under the restrictive setting that all cells in a
partition have a common cell width, i.e., for all j, I (Cl = h for some h > 0.
pursue this problem with no restrictions on cells. We derive an asymptotic
expression for the infimum of MISE over all partitions and show that the
infimum can be achieved by a "partition of locally equisized cells."
We analyze the problem under the following conditions on f:
(C.1) f is twice continuously differentiable;
(C.2) f, f ' and f " all belong to L2;
(C.3) f I(x)f(x)I2/3 dX> 0.
An important class of densities excluded by the conditions is that of step
functions. For any such density, the optimal histogram is the density itself.
2. Asymptotic behaviour of MISE. How small can MISE be made as the

sample size tends to infinity? Let 9 denote the class of all partitions of L We
have
THEOREM 1. Let (C.1) and (C.2) be satisfied. Then
(2.1) inf MISE(n, Q) = [(a)( )1/3 lf f(x)f(x)12/3 dx] n -2/3 + (n - 2/3
Under the restrictive setting that all cells in a partition have a common width
h, Freedman and Diaconis (1981) obtained a relation parallel to (2.1):
(2-2) inf MISE(n,

(.)h>O h) = )(_I) / f t(x))2 dx n-2/3 + o(n-2/3),
[(2i)/((613
where MISE(n, h) is the MISE of the histogram with common cell width h. See
also Scott (1979). The ratio of the leading term of (2.1) to that of (2.2) is
J f '(X)f(X)12/3 dx/{Jt( f '(X))2 dX}l/3, which is seen to be no more than 1 by
H6lder's inequality. It is approximately 0.89 for the normal density and 0.24 for
the Cauchy density.
Theorem 1 will be derived by proving the following two propositions. The first
of these gives an asymptotic lower bound for the MISE.
PROPOSITION 1. Under (C.1) and (C.2) we have
(2.3) liminfn2/3 inf {MISE(n, Q)} 2 (1)6'/ - f1 t '(x)f(x)12/3dx.

n - oo Q 9
All we need then is to find a partition whose MISE behaves like the lower
bound as n increases. Our scheme for getting such a partition is based on three
OPTIMAL CELLS FOR A HISTOGRAM 1025
observations:
(i) on a small interval f is well approximated by a linear function;

(ii) the bias term f( fQ(x) - f(x))2 cdx of MISE(n, Q) is minimized for a
partition of equisized cells if f is linear; and
(iii) if f does not change much over the region, then the magnitude of the
sampling variation (1/n)EjF(Cj)(1 - F(Cj))/lCjl is mostly determined by
number of cells, regardless of the cell sizes being equal or unequal.
With these observations we might expect that the desired partition can be
found in a class of partitions of locally equisized cells (POLEC). A POLEC is
constructed by "dividing I twice," a process separated into two steps:
Step 1. Choose a reference point x0 E I and a mesh width w > 0 to obtain an
equally spaced net {A i E Z}, where Z is a set of integers and
Ai:= [x0 + (i - 1)W, xO + iw),
(2.4) u i = I.
ieZ
Step 2. For each i E Z choose a positive integer ki and

cells, allowing different ki's for different Ai's. Put k:=
Q( w, k) denote the resultant POLEC.
Let {w,j and {U,} be sequences of positive numbers such that
(2.5) n -l << wn << n-/ Un << n /, (WnlUn) << n-/
where an << bn means limsupn :(a,,/bn) = 0. Define a class of POLEC
(2.6) Y(Wn) Un) {Q(wn,k)Iki < Un for all i E Z}.

Then we have
PROPOSITION 2. Under conditions (C.1) and (C.2) we have
inf{MISE(n, Q): Q E 9 (wn, Un))
(2.7) = ()6/3 f '(x) f (x) 12/3 n -2/3 + o( n2/3).
Clearly, Propositions 1 and 2 together imply Theorem 1. The proofs of the

propositions occupy the next section.
Another implication of the two propositions is
inf{MISE(n, Q): Q e g(wn, Un)} = inf {MISE(n, Q)} + o(n-2/3).
This means that there is a POLEC whose associated MISE is nearly as small as
that of the optimal partition as the sample size grows large. This may call forth
some interest in developing a data-driven rule for choosing a POLEC. Some
discussion on this can be found in Kogure (1986). For the related subject of the
data-based choice of a cell width, see Scott (1979), Rudemo (1982), Chow, Geman
and Wu (1983) and Stone (1985).
1026 A. KOGURE
3. Proofs.
PROOF OF PROPOSITION 1. For each interval C let
y(C) = (3)6-1/3JcIf /(x)f(x)I2/3 dx

and for each Q = {C.} E 9 let a?,(Q) - n2/3MISE(n, Q) - y(I).
Put hn (n2/9logn)-' and define a subclass gn of .2 as
:n = (Q = {C)jl0 < ICIl < hn forall I).

For each interval C, 0 < ICI < so, let
Sn(C) = (.i)(nl/31C1)2 + JC(f'(x))2dx + (n l/31CI)-1F(C)

and for each Q = {Cj} e Yn let f3n(Q) = Ej(Sn(Cj) - y(Cj)). Observe that
inf an(Q) ? inf f3,(Q) + inf (an(Q) -in(Q))
(3.1) ~Qe Q r eQ =g
+ (inf an(Q) - inf an(Q))
Then it suffices to show that each term on the right of (3.1) has a nonnegative
lower limit. First, we will show
(3.2) liminf inf f,1(Q) 2 0.

n --#oo Q g?n
Fix Q= {Cj) E- 'n For each C. E Q

1/3
n( Cj)2 ( 23)6-/3{(f(x))2 dx) (F(Cj)) /
> (3)6-/3 f '(x)f(x) 12/3 dx (by H6lder's inequaity)
Thus, fln(Q) ? 0, which implies (3.2).

Second, we will show
(3.3) liminf inf (an(Q) - f3n(Q)) 2 0.
Fix Q = (Cj) E 96,. Then

an(Q) - n(Q) = n2/3MISE(n, Q) - n(Cj)
I
(3.4) = fl2/3( (fQ(X) - f(X)) dx - (d )2 l9 ( f J (X))2 dx)

-1n/3 E (F(Cj))21,Cj.
It is easy to see that for each j there is a point xj inside cj such that
j( fQ(X) - f(X))2 dC 2 (?)( f '(Xj))21C l.
Then, by Lemma 2.22 of Freedman and Diaconis (1981), we
f( '(xj)) 2ICj - | f(X))2 dx < 21c11f I t '(x)f "
Thus,
f( Q(x) - f(x))2 dx - ( EI) lCjq2 ( f t(X))2 dx

(3.5) 1 1
> -2h3 f '(x)f "(x) I dx.

n
Note that
(3.6) =f(I(F(Cj))21Cj, fQ(x))2 dx < f(X))2 dx.

Combining (3.4), (3.5) and (3.6) and recalling that h, = (n2/9log n)1 we have
(3.3).
Lastly, we will show
(3.7) lim inf ( inf an(Q) - inf an(Q)} ? 0.

n oQeg =-g
It would suffice to show that for each Q e 9, there

sufficiently large n
(3.8) an(Q) - an(Qo) 2 rn,

where {rn} is a sequence of real numbers such
then for all sufficiently large n: infQ ef gAf
construct such Q? as follows. Fix C E Q. If I
Q?. If ICI > h , then divide C equally into s
equal to hn if ICI = oo. If ICI < so, then ther
that hn(m - 1) < ICI < hnm. Set h* equal to
subintervals thus made be included in Q?. Repeat this for each C E Q. Then we
have
an(Q) - an(Q?)
-2/3(J1( IQ(x) - f(x))2 _X fQO(X) _ f X))2 d
(3*9) + (1/n)( E F(C)(1 - F(C))IICI

CEQ, CEQO
- E F(C)(1 - F(C))IICl))
C4Q, CEQ0
Note that Q? is finer than Q in the sense that for each C E Q there is C0 E Q?
such that CO c C. Thus
(3.10) j( fQ(x) f(x))2 dx > ( fQo(x) - f(x))2 dx.
1028 A. KOGURE
Observe that
(3.11) E F(c)/C < (inf ICI) < Qhn.

CeQ, CE CeQ0
Combining (3.6), (3.9), (3.10) and (3.11) we have
an(Q) - a(Q n) ? 1-1/3(2h-1 + f(f(X))2 dx)
This in turn implies the existence of {rn) in (3.8). The proof of the proposition is
now complete. O
PROOF OF PROPOSITION 2. The mean value theorem implies that for each
i E Z there is xi E Ai such that
(3.12) l I '(x)f(x) 2 dX = W, f (Xi) f (Xi) 12/3,
where A, is defined as at (2.4). For each i E Z define a real-valued function Hi(.)

on (0, oo) as
(3.13) Hi(t) 1= (12)Wn3( f '(x))2/t2 + f (xi)t/n.

Put Z? := {i E Zlf(xi) # 0 and f '(xi) 0). For each i E Z? let
t*= 6- 1/3f{( I 1(Xi))2/f(xi)} /n1/3wl .
t x minimizes Hi(t) and Hi(t*) = (3)6-1/3JAf'(x)f(x)22/3 n-2/3 By (312)

JAIf '(x)f(x)I2/3 dx = 0 for i E Z?. Thus,
(3.14) ? Hi(t") = (1)6-1/3 f '(x) f(x) 12/3 x n-2/3

ieZI
Let Q(wn, k) be a partition in 9(wn, Un), where 9(wn, Un) is defi

It is easy to see that
(3.15) MISE(n, Q(wnjk)) = ? Hi(ki) + o(n 2/3),
where the o(n -2/3) term is bounded by a quantity which converges to ze

than n-2/3, uniformly in both {xi} and k. In the light of (3.14) and (3.1
conclusion follows if we verify the existence of a sequence of positive
{k,?i E Z, k* < U)n such that
(3.16) Hi(k#*) - Hi(t?*) = o(n-2/3).

ieZ ieZ0
If both f '(xi) and f(xi) ar

of f '(xi) and f(x ) is not ze
integer such that H.(k9) is closest to Hj(t"). F
if i E Z and kp < U,
Un= -1 if i E Z and k9 > Un,i
g Un if f '(xi) # 0 and f(xi) 0,
1 if f '(xi) = 0 and f (xi) #0.
Now consider the four possible cases and the respective inequalities below. For
the first two cases we utilize the following relation: If i E Z?, then
(3.17) 2( 1 )( f ,(Xi))2W3t-2 > (f(xi)/n)t, iff t' >< t.

Case 1. i E ZO and k9 < Un. By (3.17)
Hi(t?* + 1) < (3)( f (xj)1n)(t?* + 1)
and
Hi(t*) = (3)(f(xi)/n)t*.
Thus,
(3.18) Hi(k?H ) - Hi(t*) < Hi(tVI + 1) - Hj(0)
= (3)( f(xi)/n).
Case 2. ieZOand k9 > Un.Then tV, + 1> Un Thus,
(3.19) Hi(U -1) < (j)( f (X))2Wn3(U -_) 1K.
Case 3. f (xi) # 0 and f(xt) = 0. Then
(3.20) Hi(Un) = ( 11)( f '(x ))2wn,U,-.

Case 4. f '(xi) = 0 and f(xi) # 0. Then
(3.21) Hi(l) = f (xi)/n.
Combining (3.18)-(3.21) we have
Hi (k* )- Hi(ti)
ieZ ieZ0
(3.22) < (E)( Ef(xi))/n + (4)( ( ((Xi))2)W(U- 1)
By Corollary 2.24 of Freedman and Diaconis (1981) we have
(3.23) E f (xi)wn- (x) dx < wnj I '(xf ) I dx

and
(3.24) | (I f(xi))2w,-j( f(x))2 dx < 2wnjI f '(x) f (x) I dx.

iEr=Z
1030 A. KOGURE
Equation (3.22) together with (3.23) and (3.24) implies
H E (k*)- E Hi(t*) = o((nwn) 1 + (wJUn)2)

ieZ iEZ0
= O(n- 2/3)
The last relation holds because of (2.5). This completes the proof of Proposition
2. ri
Acknowledgments. This work is part of the author's Ph.D. thesis at Yale

University, written under the supervision of Professor J. A. Hartigan, whose
guidance and suggestions are gratefully acknowledged and appreciated. Thanks
are also due to Professors K. Takeuchi and Y. Hosoya for helpful comments.
REFERENCES
CHOW, Y.-S., GEMAN, S. and Wu, L. D. (1983). Consistent crossvalidated density estimation. Ann.
Statist. 11 25-38.
FREEDMAN, D. and DIACONIS, P. (1981). On the histogram as a density estimator: L2 theory. Z.
Wahrsch. verw. Gebiete 57 453-476.
KOGURE, A. (1986). Optimal cells for a histogram. Ph.D. thesis, Yale Univ.
RUDEMO, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J.
Statist. 9 65-78.
ScoTr, D. W. (1979). On optimal and data-based histograms. Biometrika 66 605-610.
STONE, C. (1985). An asymptotically optimal histogram selection rule. In Proc. of the Berkeley
Conference in Honor of Jerzy Neymnan and Jack Kiefer (L. M. Le Cam and R. A. Olshen,
eds.) 2 513-520. Wadsworth, Belmont, Calif.
TAPIA, R. A. and THOMPSON, J. R. (1978). Nonparametric Probability Density Estimation. Johns
Hopkins Univ. Press, Baltimore, Md.
FACULTY OF ECONOMICS
FUKUSHIMA UNIVERSITY
MATSUKAWA-MACHI, FUKUSHIMA 960-12
JAPAN

Asymptotically Optimal Histogram Density Approximations

Uploaded by

Copyright:

Available Formats

You might also like

Asymptotically Optimal Histogram Density Approximations

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Asymptotically Optimal Histogram Density Approximations

Uploaded by

Copyright:

Available Formats

Asymptotically Optimal Cells for a Historgram

Author(s): Atsuyuki Kogure

Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and

ASYMPTOTICALLY OPTIMAL CELLS FOR A HISTOGRAM

The purpose of this paper is to examine the properties of the histogram

1. Introduction. Let f be a continuous probability density on an interval I

where ICjl denotes the width of cell Cj and

Considering the histogram an estimate of f, we adopt the

(1.1) MISE(n,Q) := E I(Hn(xIQ) -f(x))dXj^

as the risk is using Q. Here E denotes the expectation with

(1.2) MISE(n, Q) = F(C- )(i F(C)) +-

where fQ(x) = E[Hn(xIQ)]. The first component

Received April 1986; revised January 1987.

2. Asymptotic behaviour of MISE. How small can MISE be made as the

THEOREM 1. Let (C.1) and (C.2) be satisfied. Then

(2.1) inf MISE(n, Q) = [(a)( )1/3 lf f(x)f(x)12/3 dx] n -2/3 + (n - 2/3

(2-2) inf MISE(n,

PROPOSITION 1. Under (C.1) and (C.2) we have

(2.3) liminfn2/3 inf {MISE(n, Q)} 2 (1)6'/ - f1 t '(x)f(x)12/3dx.

(i) on a small interval f is well approximated by a linear function;

Ai:= [x0 + (i - 1)W, xO + iw),

Step 2. For each i E Z choose a positive integer ki and

(2.6) Y(Wn) Un) {Q(wn,k)Iki < Un for all i E Z}.

PROPOSITION 2. Under conditions (C.1) and (C.2) we have

inf{MISE(n, Q): Q E 9 (wn, Un))

(2.7) = ()6/3 f '(x) f (x) 12/3 n -2/3 + o( n2/3).

Clearly, Propositions 1 and 2 together imply Theorem 1. The proofs of the

inf{MISE(n, Q): Q e g(wn, Un)} = inf {MISE(n, Q)} + o(n-2/3).

PROOF OF PROPOSITION 1. For each interval C let

y(C) = (3)6-1/3JcIf /(x)f(x)I2/3 dx

:n = (Q = {C)jl0 < ICIl < hn forall I).

Sn(C) = (.i)(nl/31C1)2 + JC(f'(x))2dx + (n l/31CI)-1F(C)

(3.2) liminf inf f,1(Q) 2 0.

Fix Q= {Cj) E- 'n For each C. E Q

n( Cj)2 ( 23)6-/3{(f(x))2 dx) (F(Cj)) /

> (3)6-/3 f '(x)f(x) 12/3 dx (by H6lder's inequaity)

Thus, fln(Q) ? 0, which implies (3.2).

(3.3) liminf inf (an(Q) - f3n(Q)) 2 0.

Fix Q = (Cj) E 96,. Then

(3.4) = fl2/3( (fQ(X) - f(X)) dx - (d )2 l9 ( f J (X))2 dx)

j( fQ(X) - f(X))2 dC 2 (?)( f '(Xj))21C l.

Then, by Lemma 2.22 of Freedman and Diaconis (1981), we

f( '(xj)) 2ICj - | f(X))2 dx < 21c11f I t '(x)f "

f( Q(x) - f(x))2 dx - ( EI) lCjq2 ( f t(X))2 dx

> -2h3 f '(x)f "(x) I dx.

(3.6) =f(I(F(Cj))21Cj, fQ(x))2 dx < f(X))2 dx.

(3.7) lim inf ( inf an(Q) - inf an(Q)} ? 0.

It would suffice to show that for each Q e 9, there

(3.8) an(Q) - an(Qo) 2 rn,

-2/3(J1( IQ(x) - f(x))2 _X fQO(X) _ f X))2 d

(3*9) + (1/n)( E F(C)(1 - F(C))IICI

(3.10) j( fQ(x) f(x))2 dx > ( fQo(x) - f(x))2 dx.

(3.11) E F(c)/C < (inf ICI) < Qhn.

Combining (3.6), (3.9), (3.10) and (3.11) we have

an(Q) - a(Q n) ? 1-1/3(2h-1 + f(f(X))2 dx)

(3.12) l I '(x)f(x) 2 dX = W, f (Xi) f (Xi) 12/3,

where A, is defined as at (2.4). For each i E Z define a real-valued function Hi(.)

(3.16) Hi(k#) - Hi(t?) = o(n-2/3).

H E (k)- E Hi(t) = o((nwn) 1 + (wJUn)2)