Averaged Shifted Histogram: Wiley Interdisciplinary Reviews: Computational Statistics December 2009
David W. Scott
Department of Statistics MS-138
Rice University
P. O. Box 1892
Houston, TX 77251-1892
scottdw@rice.edu
Keywords
Frequency chart, Kernel Density Estimation, Bin Width, Bin Origin
Abstract
The averaged shifted histogram (ASH) is a nonparametric probability density
estimator derived from a collection of histograms. The ASH enjoys several
advantages over a single histogram: visual smoothness and better approximation,
with nearly the same computational efficiency. The ASH provides not only a
bridge between the histogram and advanced kernel methods but also a method of
choice for implementation.
This work was partially supported by NSF award DMS-05-05584 and ONR contract
N00014-06-1-0060.
It is easy to verify that $\hat f(x) \ge 0$ and $\int \hat f(x)\,dx = 1$.
Scott (1979) and Freedman and Diaconis (1981) analyzed the statistical error of a
histogram and derived the optimal bin width formula
\[ h^* = \left[ \frac{6}{n \int f'(x)^2 \, dx} \right]^{1/3} , \tag{1} \]
where $f(x)$ is the (unknown) sampling density. While most parametric estimates
converge at the rate $O(n^{-1})$, nonparametric estimates converge at slower rates
because of unavoidable estimation bias. Histograms achieve the rate of
$O(n^{-2/3})$. This is the slowest rate available; however, histograms are
perfectly useful density estimators even for modest sample sizes.
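Equation (1) cannot be used directly because $f$ is unknown. One common way to make it concrete, shown here only as a hedged sketch, is to assume a normal reference density with standard deviation $\sigma$, for which $\int f'(x)^2\,dx = 1/(4\sqrt{\pi}\,\sigma^3)$ and hence $h^* \approx 3.49\,\sigma\,n^{-1/3}$. The function names below are illustrative and do not come from the article.

```python
import numpy as np

def roughness_normal(sigma):
    """Closed form of the integral of f'(x)^2 for a normal density with sd sigma."""
    return 1.0 / (4.0 * np.sqrt(np.pi) * sigma**3)

def optimal_bin_width(n, roughness):
    """Equation (1): h* = (6 / (n * integral of f'(x)^2))^(1/3)."""
    return (6.0 / (n * roughness)) ** (1.0 / 3.0)

def normal_reference_bin_width(data):
    """Plug the sample standard deviation into the normal-reference roughness.

    Assumption: the data are roughly normal; this is a rule of thumb,
    not an estimate of the true optimal width for an arbitrary density.
    """
    data = np.asarray(data, dtype=float)
    sigma = data.std(ddof=1)
    return optimal_bin_width(data.size, roughness_normal(sigma))
```

For strongly multimodal data the normal reference tends to oversmooth, so the result is better read as an upper guide than as a final choice.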
A histogram has two parameters, the bin width $h$ and the bin origin $t_0$. As an
example, consider Sammy Sosa’s 66 home runs during the 1998 baseball season.
Thirty-six were hit at home in Chicago’s Wrigley Field. The data values are the
estimated distances in feet of those home runs. In Figure 1, we display three
histograms (as frequency curves) with bin widths of 12.5, 25, and 50 feet,
respectively. With a sample size of only $n = 36$, the data do not support the
histogram with 13 bins. Even the middle histogram is a bit rough, while the right
histogram seems a little oversmoothed.
Figure 1: Three histogram estimates of Sosa’s home run distances with bin widths of
12.5, 25, and 50 feet (left to right).
While the bin width is the critical parameter, the histogram is quite affected
visually by different choices of $t_0$ for this particular sample. In Figure 2, we
display three histograms, all with $h = 25$ feet but with three different choices
of $t_0$. For such a small sample size, the variability due to the bin origin is
of about the same magnitude as that due to the bin width. We view the bin origin
as a nuisance parameter, as the interactions between $h$ and $t_0$ are quite
nonlinear and difficult to predict.
Figure 2: Three histograms of the Sosa home run data using the same bin width of 25
feet, but different bin origins.
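The sensitivity to the bin origin is easy to reproduce. The sketch below computes a histogram density estimate for a given bin width and origin; `histogram_density` is a hypothetical helper name, and any data passed to it in an experiment would be the reader's own (the Sosa distances are not reproduced here).

```python
import numpy as np

def histogram_density(data, h, t0):
    """Histogram density estimate with bins [t0 + k*h, t0 + (k+1)*h).

    Returns the bin edges spanning the data and the density height in each bin.
    """
    data = np.asarray(data, dtype=float)
    # Integer index k of the bin that contains each observation.
    k = np.floor((data - t0) / h).astype(int)
    k_min, k_max = k.min(), k.max()
    counts = np.bincount(k - k_min, minlength=k_max - k_min + 1)
    edges = t0 + h * np.arange(k_min, k_max + 2)
    density = counts / (data.size * h)   # heights integrate to one
    return edges, density
```

Calling it twice with the same `h` and two different values of `t0` on a small sample gives two sets of heights that can differ noticeably, which is the effect Figure 2 illustrates.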
Let us focus on the interval $[0, \delta)$. The first (leftmost) shifted histogram
that includes this interval spans the bin interval
$[\frac{1-m}{m} h, \frac{1}{m} h) \equiv [(1-m)\delta, \delta)$ and is given by
and the $m$th and final shifted histogram covers $[0, m\delta) = [0, h)$ and is
given by
\[ \hat f_m(x) = \frac{\nu_0^{(1)} + \nu_0^{(2)} + \cdots + \nu_0^{(m-2)} + \nu_0^{(m-1)} + \nu_0^{(m)}}{nh} \equiv \frac{v_0}{nh} . \]
The averaged shifted histogram (ASH) is defined to be the equally-weighted average
of these $m$ shifted histograms:
\[ \hat f_{ASH}(x) = \frac{1}{m} \sum_{j=1}^{m} \hat f_j(x) , \qquad x \in [0, \delta) ; \tag{2} \]
\[ \frac{1}{m}, \frac{2}{m}, \cdots, \frac{m-1}{m}, 1, \frac{m-1}{m}, \cdots, \frac{2}{m}, \frac{1}{m} , \tag{3} \]
respectively. In Figure 3, we display the ASH of the Sosa home data in the range
1 ≤ m ≤ 6. The trimodal nature of the data shows more clearly as m increases.
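Equation (2) can be evaluated literally, by averaging $m$ histograms whose origins differ by $\delta = h/m$. The following is a minimal sketch of that definition, not an efficient implementation; the names `shifted_histogram` and `ash_naive` are illustrative, and the origins $0, \delta, \ldots, (m-1)\delta$ are an assumption that is equivalent, modulo $h$, to the shifted bins described above.

```python
import numpy as np

def shifted_histogram(x, data, h, t0):
    """Histogram density with bin width h and origin t0, evaluated at points x."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    bin_of_x = np.floor((x - t0) / h)        # bin index of each evaluation point
    bin_of_data = np.floor((data - t0) / h)  # bin index of each observation
    # Count the observations that share a bin with each evaluation point.
    counts = (bin_of_x[:, None] == bin_of_data[None, :]).sum(axis=1)
    return counts / (data.size * h)

def ash_naive(x, data, h, m):
    """Equation (2): equally weighted average of m shifted histograms."""
    delta = h / m
    origins = delta * np.arange(m)
    return np.mean([shifted_histogram(x, data, h, t0) for t0 in origins], axis=0)
```

Within any interval of width $\delta$ the result is constant, and it equals the bin counts of the narrow bins combined with the weights in Equation (3), divided by $nh$. This direct form costs $m$ passes over the data, which motivates the binned computation discussed later.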
The equally-weighted average in Equation (2) and the weights in Equation (3) are
equivalent to sampling from the isosceles triangle probability density
\[ K(t) = (1 - |t|) \, I_{(-1,1)}(t) \]
at the $2m - 1$ points
\[ t = \frac{1-m}{m}, \frac{2-m}{m}, \ldots, -\frac{1}{m}, 0, \frac{1}{m}, \ldots, \frac{m-2}{m}, \frac{m-1}{m} , \tag{4} \]
in the interval $(-1, 1)$. This density or kernel is not particularly smooth. Smoother
weights may be obtained by sampling at the points in Equation (4) using the triweight
kernel
\[ K(t) = \frac{35}{32} (1 - t^2)^3 \, I_{(-1,1)}(t) . \]
The corresponding ASH is also much smoother; see Figure 4. The kernel weights
should be renormalized to sum to m so that the ASH is a bona fide density estimate.
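The weight construction just described can be sketched as follows: sample a kernel at the $2m - 1$ points of Equation (4), then rescale so the weights sum to $m$. The function names are illustrative and are not taken from the article.

```python
import numpy as np

def triangle_kernel(t):
    """Isosceles triangle density on (-1, 1)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1, 1.0 - np.abs(t), 0.0)

def triweight_kernel(t):
    """Triweight kernel K(t) = (35/32) (1 - t^2)^3 on (-1, 1)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1, (35.0 / 32.0) * (1.0 - t**2) ** 3, 0.0)

def ash_weights(m, kernel):
    """Sample `kernel` at t = i/m, i = 1-m, ..., m-1, and rescale to sum to m."""
    i = np.arange(1 - m, m)
    w = kernel(i / m)
    return m * w / w.sum()
```

With the triangle kernel the sampled values already sum to $m$ and reproduce Equation (3), so the rescaling changes nothing. With the triweight kernel the rescaling matters, because the raw samples do not sum to $m$ for finite $m$.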
Visually, the smoothness of the ASH allows for easier comparisons and understanding.
For example, we may ask if the trimodal density for Sosa’s home runs at home is
mirrored by the 30 home runs Sosa hit when away; see Figure 5. Sosa hit more home
runs in the middle cluster when he was away from Wrigley Field.
Sosa’s magnificent achievement of 66 home runs in 1998 was overshadowed by Mark
McGwire’s record-setting 70 home runs that same year. In Figure 6 we compare their
home run distances. McGwire’s density is also multimodal. Clearly McGwire hit more
long home runs than Sosa did; see Keating and Scott (1999).
Theoretical Considerations
Does the visual smoothness afforded by the ASH translate into any theoretical
advantage compared to the histogram? The answer is affirmative. The bias of the
histogram is of order $O(h)$, but the ASH transitions to the higher order $O(h^2)$
as $m \to \infty$; see Scott (1985b) for more details. The resulting rate of
convergence improves to $O(n^{-4/5})$. In fact, if the midpoints of the ASH are
connected by piecewise linear segments, the resulting figure always has the
convergence rate of $O(n^{-4/5})$. When $m = 1$, this result was proved for the
frequency polygon by Scott (1985a), who first demonstrated it had the faster rate
of convergence. However, the frequency polygon still requires specification of a
bin origin.

[Figure panels: m = 1, m = 2, m = 4, m = 8, m = 16, m = 32]

Figure 5: Comparison of Sosa’s home (gray) and away (red hatched) home runs with a
histogram and an ASH (m = 32).

Figure 6: Comparison of Sosa’s (gray) and McGwire’s (red hatched) home runs; ASH
(m = 32).
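The two convergence rates quoted in this article follow from a standard bias-variance balance. The following is a sketch under the usual asymptotic approximations; the constants $a$ and $b$ are placeholders that depend on the unknown density and are not given in the text. Suppose the pointwise bias is of order $h^p$ and the variance is of order $1/(nh)$, so that
\[ \mathrm{MSE}(h) \approx a h^{2p} + \frac{b}{nh} . \]
Setting the derivative with respect to $h$ to zero gives
\[ h^* = \left( \frac{b}{2 p a n} \right)^{1/(2p+1)} \propto n^{-1/(2p+1)} , \qquad \mathrm{MSE}(h^*) \propto n^{-2p/(2p+1)} . \]
With $p = 1$ (the histogram) this yields $h^* \propto n^{-1/3}$, in agreement with Equation (1), and the rate $n^{-2/3}$. With $p = 2$ (the frequency polygon, or the ASH as $m \to \infty$) it yields $h^* \propto n^{-1/5}$ and the rate $n^{-4/5}$.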
As $m \to \infty$, the ASH converges pointwise to the kernel density estimator with
the corresponding kernel. When $m \ge 8$ and the frequency polygon is drawn, the
ASH and kernel estimates are indistinguishable. The computational savings with the
ASH can be substantial. As with the histogram, the primary savings come from the
transformation of the raw data into bin counts. One then convolves these bin counts
with kernel weights such as those in Equation (3). An alternative to the ASH is to
use the FFT with a normal kernel; see Silverman (1982).
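The bin-then-convolve computation described above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the author's code: the name `ash_binned` is hypothetical, the mesh is padded by one bin width $h$ on each side of the data so that no weight falls off the ends, and the triangle weights of Equation (3) are used.

```python
import numpy as np

def ash_binned(data, h, m):
    """ASH on a mesh of narrow bins of width delta = h/m.

    Returns the narrow-bin centers and the ASH height in each narrow bin.
    """
    data = np.asarray(data, dtype=float)
    delta = h / m
    lo = data.min() - h                       # pad by h on the left
    n_bins = int(np.ceil((data.max() + h - lo) / delta))
    edges = lo + delta * np.arange(n_bins + 1)
    # The only pass over the raw data: reduce it to narrow-bin counts.
    nu, _ = np.histogram(data, bins=edges)
    # Triangle weights 1 - |i|/m for i = 1-m, ..., m-1 (Equation (3)).
    i = np.arange(1 - m, m)
    weights = 1.0 - np.abs(i) / m
    # Weighted moving sum of the counts, centered on each narrow bin.
    smoothed = np.convolve(nu, weights, mode="same")
    density = smoothed / (data.size * h)
    centers = edges[:-1] + delta / 2
    return centers, density
```

After binning, the cost depends on the number of narrow bins and on $m$ rather than on the sample size. Other weights that sum to $m$, such as renormalized triweight weights, could be substituted for `weights`.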
References
J. M. F. Chamayou. Averaging shifted histograms. Computer Physics Communications, 21:145–161, 1980.
D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57:453–476, 1981.
J. P. Keating and D. W. Scott. A primer on density estimation for the great home run race of ’98. STATS, 25:16–22, 1999.
D. W. Scott. On optimal and data-based histograms. Biometrika, 66:605–610, 1979.
D. W. Scott. Frequency polygons. Journal of the American Statistical Association, 80:348–354, 1985a.
D. W. Scott. Averaged shifted histograms: Effective nonparametric density estimators in several dimensions. Annals of Statistics, 13:1024–1040, 1985b.
D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, New York, 1992.
B. W. Silverman. [Algorithm AS 176] Kernel density estimation using the fast Fourier transform. Applied Statistics, 31:93–99, 1982.
Cross-References
Nonparametric curve estimation, Bandwidth in smoothing, Exploratory data
analysis