
Article type: Focus Article

Averaged Shifted Histogram

Wiley Interdisciplinary Reviews: Computational Statistics, December 2009. DOI: 10.1002/wics.54

David W. Scott^1
Department of Statistics MS-138
Rice University
P. O. Box 1892
Houston, TX 77251-1892
scottdw@rice.edu

Keywords
frequency chart, kernel density estimation, bin width, bin origin

Abstract
The averaged shifted histogram, or ASH, is a nonparametric probability density estimator derived from a collection of histograms. The ASH enjoys several advantages over a single histogram: visual smoothness and better approximation, with nearly the same computational efficiency. The ASH provides not only a bridge between the histogram and advanced kernel methods but also a method of choice for implementation.

The Ordinary Histogram


The histogram is especially noteworthy because of its intuitive appeal and universal
availability. Whether dealing with massive data sets or with real-time streaming data, the histogram is the default choice because of its computational simplicity.
Given a set of intervals or bins $B_k = [t_k, t_{k+1})$ with fixed bin width $h = t_{k+1} - t_k$ and a random sample $\{x_1, x_2, \ldots, x_n\}$, the bin counts are given by
$$\nu_k = \sum_{i=1}^{n} I(x_i \in B_k).$$
Clearly $\sum_k \nu_k = n$. The histogram in density form is defined as
$$\hat f(x) = \frac{\nu_k}{nh}, \qquad x \in B_k.$$
^1 David W. Scott is Noah Harding Professor, Department of Statistics MS-138, Rice University, Houston, TX 77251-1892 (email: scottdw@rice.edu). This work was partially supported by NSF award DMS-05-05584 and ONR contract N00014-06-1-0060.
It is easy to verify that $\hat f(x) \ge 0$ and $\int \hat f(x)\, dx = 1$.
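As a minimal illustration (a hedged sketch of ours, not part of the original article; the function and variable names are assumptions), the following Python code computes the bin counts and the density-form histogram for a given bin origin t0 and bin width h:

```python
import numpy as np

def histogram_density(x, t0, h, nbins):
    """Bin counts nu_k and density values nu_k/(n*h) for bins [t0 + k*h, t0 + (k+1)*h)."""
    x = np.asarray(x)
    k = np.floor((x - t0) / h).astype(int)       # bin index of each observation
    k = k[(k >= 0) & (k < nbins)]                # keep points that fall inside the mesh
    nu = np.bincount(k, minlength=nbins)         # bin counts nu_k
    fhat = nu / (len(x) * h)                     # histogram in density form
    return nu, fhat

# Example: 36 simulated distances (feet) binned with h = 25 starting at t0 = 325
rng = np.random.default_rng(0)
sample = rng.normal(410, 35, size=36)
nu, fhat = histogram_density(sample, t0=325.0, h=25.0, nbins=8)
print(nu, fhat.sum() * 25.0)   # the density values integrate to 1 when all points fall in the mesh
```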

Scott (1979) and Freedman and Diaconis (1981) analyzed the statistical error of a histogram and derived the optimal bin width formula
$$h^* = \left( \frac{6}{n \int f'(x)^2 \, dx} \right)^{1/3}, \qquad (1)$$
where $f(x)$ is the (unknown) sampling density. While most parametric estimates converge at the rate $O(n^{-1})$, nonparametric estimates converge at slower rates because of unavoidable estimation bias. Histograms achieve the rate of $O(n^{-2/3})$. This is the slowest rate available; however, histograms are perfectly useful density estimators even for modest sample sizes.
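Equation (1) depends on the unknown roughness $\int f'(x)^2\,dx$. A common way to make it operational, shown here as a hedged sketch of ours rather than material from the article, is to plug in a normal reference density, which yields the familiar rule $h \approx 3.49\,\hat\sigma\, n^{-1/3}$ of Scott (1979):

```python
import numpy as np

def normal_reference_bin_width(x):
    """Plug a N(mu, sigma^2) reference into Equation (1):
    int f'(x)^2 dx = 1/(4 sqrt(pi) sigma^3), so h* = (24 sqrt(pi))^(1/3) * sigma * n^(-1/3)."""
    x = np.asarray(x)
    n, sigma = len(x), np.std(x, ddof=1)
    return (24.0 * np.sqrt(np.pi)) ** (1.0 / 3.0) * sigma * n ** (-1.0 / 3.0)  # about 3.49*sigma*n^(-1/3)
```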
A histogram has two parameters, the bin width h and the bin origin t0 . As an example,
consider Sammy Sosa’s 66 home runs during the 1998 baseball season. Thirty-six were
hit at home in Chicago’s Wrigley Field. The data values are the estimated distance in
feet of those home runs. In Figure 1, we display three histograms (as frequency curves) with bin widths of 12.5, 25, and 50 feet, respectively. With a sample size of only n = 36, the data do not support the histogram with 13 bins. Even the middle histogram is a bit rough, while the right histogram seems a little oversmoothed.
Figure 1: Three histogram estimates of Sosa's home run distances with bin widths of 12.5, 25, and 50 feet (left to right).

While the bin width is the critical parameter, the histogram is quite affected visually by different choices of t0 for this particular sample. In Figure 2, we display three histograms, all with h = 25 feet but with three different choices of t0. For such a small sample size, the variability due to the bin origin is of about the same magnitude as that due to the bin width. We view the choice of bin origin as a nuisance parameter, as the interactions between h and t0 are quite nonlinear and difficult to predict.
Figure 2: Three histograms of the Sosa home run data using the same bin width of 25 feet, but different bin origins.

Eliminating the Bin Origin Nuisance Parameter


Consider the problem of selecting both the bin width and the bin origin, either theoretically or from data. A small change in the bin width leads to large changes in the mesh away from the bin origin. Small changes in the bin origin may leave the bin counts unchanged, or shift a single data point into an adjacent bin. Nevertheless, we have seen how the histogram can change dramatically with either parameter. Therefore, it is unlikely that there could be enough evidence to reliably select one pair (h, t0) over another.
A careful examination of the optimal bin width in Equation (1) shows that the specific
choice of a bin origin does not appear relevant. Now this formula is only asymptotically
optimal, but the fact remains that the bin origin is not the relevant parameter. As a
result, rather than try to select among alternative values of t0 for a fixed bin width h,
we view these alternative choices for t0 as equally likely.
Therefore, we propose to average across these shifted alternatives. Specifically, we propose to select m shifted histograms to average. We shift the bin origin by multiples of the quantity h/m, which we denote by $\delta$. Thus the bin $B_0 = [0, h)$ is subdivided into the m intervals $[0, \delta), [\delta, 2\delta), \ldots, [(m-1)\delta, m\delta)$. The bin count $\nu_0$ is similarly subdivided into $\{\nu_0^{(1)}, \nu_0^{(2)}, \ldots, \nu_0^{(m)}\}$. Other bins are likewise subdivided and denoted.

Let us focus on the interval $[0, \delta)$. The first (leftmost) shifted histogram that includes this interval spans the bin interval $[\tfrac{1-m}{m}h, \tfrac{1}{m}h) \equiv [(1-m)\delta, \delta)$ and is given by
$$\hat f_1(x) = \frac{\nu_{-1}^{(2)} + \nu_{-1}^{(3)} + \cdots + \nu_{-1}^{(m-1)} + \nu_{-1}^{(m)} + \nu_0^{(1)}}{nh};$$
the second shifted histogram covers $[(2-m)\delta, 2\delta)$ and is given by
$$\hat f_2(x) = \frac{\nu_{-1}^{(3)} + \nu_{-1}^{(4)} + \cdots + \nu_{-1}^{(m)} + \nu_0^{(1)} + \nu_0^{(2)}}{nh};$$
the $(m-1)$th shifted histogram covers $[-\delta, (m-1)\delta)$ and is given by
$$\hat f_{m-1}(x) = \frac{\nu_{-1}^{(m)} + \nu_0^{(1)} + \cdots + \nu_0^{(m-3)} + \nu_0^{(m-2)} + \nu_0^{(m-1)}}{nh};$$
and the $m$th and final shifted histogram covers $[0, m\delta) = [0, h)$ and is given by
$$\hat f_m(x) = \frac{\nu_0^{(1)} + \nu_0^{(2)} + \cdots + \nu_0^{(m-2)} + \nu_0^{(m-1)} + \nu_0^{(m)}}{nh} \equiv \frac{\nu_0}{nh}.$$
The averaged shifted histogram (ASH) is defined to be the equally-weighted average of these m shifted histograms:
$$\hat f_{\mathrm{ASH}}(x) = \frac{1}{m} \sum_{j=1}^{m} \hat f_j(x), \qquad x \in [0, \delta); \qquad (2)$$
the definition in other shifted bins of width $\delta$ is immediate.
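As an illustration of Equation (2), the following is a hedged sketch of ours (not code from the article; the function and variable names are assumptions). It computes the ASH naively by building the m shifted histograms on a common fine mesh of width delta = h/m and averaging them:

```python
import numpy as np

def ash_by_averaging(x, h, m, lo, hi):
    """Naive ASH: average m histograms whose bin origins are shifted by delta = h/m.
    Returns the ASH evaluated on the fine mesh of width delta covering [lo, hi)."""
    x = np.asarray(x)
    n, delta = len(x), h / m
    grid = np.arange(lo, hi, delta)              # left edges of the fine bins
    fhat = np.zeros_like(grid, dtype=float)
    for j in range(m):
        t0 = lo - j * delta                      # shift the bin origin by j*delta
        k = np.floor((grid - t0) / h)            # coarse bin index containing each fine bin
        edges_left = t0 + k * h
        # count of sample points falling in the coarse bin containing each fine bin
        counts = np.array([np.sum((x >= a) & (x < a + h)) for a in edges_left])
        fhat += counts / (n * h)                 # add the j-th shifted histogram
    return grid, fhat / m                        # equally-weighted average, Equation (2)
```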


Observe that the bin count $\nu_0^{(1)}$ appears in all $m$ of the shifted histograms; the bin count $\nu_0^{(2)}$ appears in $m-1$ of the shifted histograms, as does $\nu_{-1}^{(m)}$; finally, the bin count $\nu_0^{(m)}$ appears in only one of the shifted histograms, as does $\nu_{-1}^{(2)}$. Thus the weights on the $2m-1$ bin counts $\nu_{-1}^{(2)}, \nu_{-1}^{(3)}, \ldots, \nu_{-1}^{(m)}, \nu_0^{(1)}, \nu_0^{(2)}, \ldots, \nu_0^{(m)}$ are
$$\frac{1}{m}, \ \frac{2}{m}, \ \cdots, \ \frac{m-1}{m}, \ 1, \ \frac{m-1}{m}, \ \cdots, \ \frac{2}{m}, \ \frac{1}{m}, \qquad (3)$$
respectively.
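As a small worked instance (ours, not the article's), for m = 4 the weights in Equation (3) on the seven relevant fine-bin counts are 1/4, 2/4, 3/4, 1, 3/4, 2/4, 1/4, and they sum to m:

```python
import numpy as np

m = 4
weights = (m - np.abs(np.arange(1 - m, m))) / m   # Equation (3): 1/m, ..., (m-1)/m, 1, (m-1)/m, ..., 1/m
print(weights, weights.sum())                     # [0.25 0.5 0.75 1. 0.75 0.5 0.25], sums to m = 4
```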
In Figure 3, we display the ASH of the Sosa home run data for 1 ≤ m ≤ 6. The trimodal nature of the data shows more clearly as m increases.
The equally-weighted average in Equation (2) and the weights in Equation (3) are equivalent to sampling from an isosceles triangle probability density
$$K(t) = (1 - |t|)\, I_{(-1,1)}(t)$$
at the $2m-1$ points
$$t = \frac{1-m}{m}, \ \frac{2-m}{m}, \ \ldots, \ -\frac{1}{m}, \ 0, \ \frac{1}{m}, \ \ldots, \ \frac{m-2}{m}, \ \frac{m-1}{m} \qquad (4)$$
in the interval $(-1, 1)$.
Figure 3: Triangle ASH estimates of the Sosa data with 1 ≤ m ≤ 6.

This density or kernel is not particularly smooth. Smoother weights may be obtained by sampling at the points in Equation (4) using the triweight kernel
$$K(t) = \frac{35}{32}\,(1 - t^2)^3\, I_{(-1,1)}(t).$$
The corresponding ASH is also much smoother; see Figure 4. The kernel weights should be renormalized to sum to m so that the ASH is a bona fide density estimate.
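To make the renormalization concrete, here is a hedged sketch of ours (the helper name kernel_weights is an assumption): evaluate an arbitrary kernel at the points of Equation (4) and rescale the weights so that they sum to m.

```python
import numpy as np

def kernel_weights(m, kernel=lambda t: 35.0 / 32.0 * (1.0 - t * t) ** 3):
    """Evaluate a kernel on (-1, 1) at the 2m-1 points i/m, i = 1-m, ..., m-1,
    then renormalize so the weights sum to m (triweight kernel by default)."""
    t = np.arange(1 - m, m) / m          # the points of Equation (4)
    w = kernel(t)
    return w * m / w.sum()               # weights sum to m

print(kernel_weights(3))                 # symmetric weights peaking at t = 0
```

For the triangle kernel, this construction reproduces exactly the weights of Equation (3), since they already sum to m.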
Visually, the smoothness of the ASH allows for easier comparisons and understanding.
For example, we may ask if the trimodal density for Sosa’s home runs at home is
mirrored by the 30 home runs Sosa hit when away; see Figure 5. Sosa hit more home
runs in the middle cluster when he was away from Wrigley Field.
Sosa’s magnificent achievement of 66 home runs in 1998 was overshadowed by Mark
McGwire’s record-setting 70 home runs that same year. In Figure 6 we compare their
home run distances. McGwire’s density is also multimodal. Clearly McGwire hit more
long home runs than Sosa did; see Keating and Scott (1999).

Theoretical Considerations
Does the visual smoothness afforded by the ASH translate into any theoretical advantage over the histogram? The answer is affirmative. The bias of the histogram is of order $O(h)$, but the ASH transitions to the higher order $O(h^2)$ as $m \to \infty$; see Scott (1985b) for more details. The resulting rate of convergence improves to $O(n^{-4/5})$. In fact, if the midpoints of the ASH are connected by piecewise linear segments, the resulting figure always has the convergence rate $O(n^{-4/5})$. When $m = 1$, this result was proved for the frequency polygon by Scott (1985a), who first demonstrated that it had the faster rate of convergence. However, the frequency polygon still requires specification of a bin origin.
Figure 4: Triweight ASH estimates with m = 1, 2, 4, 8, 16, 32.


0.015
0.010
0.005
0.000

300 350 400 450 500 550 300 350 400 450 500 550

Figure 5: Comparison of Sosa’s home (gray) and away (red hatched) home runs with a
histogram and an ASH (m = 32).

Figure 6: Comparison of Sosa's (gray) and McGwire's (red hatched) home runs; ASH (m = 32).

As $m \to \infty$, the ASH converges pointwise to the kernel density estimator with the corresponding kernel. When m ≥ 8 and the frequency polygon is drawn, the ASH and kernel estimates are indistinguishable. The computational savings of the ASH can be substantial. As with the histogram, the primary savings comes from the transformation of the raw data into bin counts. One then convolves these bin counts with kernel weights such as those in Equation (3). An alternative to the ASH is to use the FFT with a normal kernel; see Silverman (1982).
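The following hedged sketch (ours, not the article's software; names are assumptions) illustrates this binned strategy: bin the data once on the fine mesh of width delta = h/m, then convolve the fine bin counts with the 2m − 1 weights. The optional weights argument can take the output of the kernel_weights sketch above; the default reproduces the triangle weights of Equation (3).

```python
import numpy as np

def ash_by_convolution(x, h, m, lo, hi, weights=None):
    """ASH on the fine mesh of width delta = h/m over [lo, hi):
    bin once, then convolve the fine bin counts with 2m-1 kernel weights."""
    x = np.asarray(x)
    n, delta = len(x), h / m
    nbins = int(np.ceil((hi - lo) / delta))
    idx = np.floor((x - lo) / delta).astype(int)
    idx = idx[(idx >= 0) & (idx < nbins)]
    counts = np.bincount(idx, minlength=nbins)        # fine bin counts, computed once
    if weights is None:                               # triangle weights of Equation (3)
        weights = (m - np.abs(np.arange(1 - m, m))) / m
    fhat = np.convolve(counts, weights, mode="same") / (n * h)
    grid = lo + (np.arange(nbins) + 0.5) * delta      # fine bin centers
    return grid, fhat
```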

Multivariate ASH and Discussion


The construction of the one-dimensional ASH is easily extended to several dimensions.
One simply shifts the bin origin several times along each coordinate axis and then
averages the resulting histograms. One may also generalize to arbitrary kernel weights.
The multivariate ASH is equivalent to a so-called product kernel density estimator as
m → ∞; see Scott (1992).
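As a hedged bivariate sketch (ours, not the article's software; it assumes SciPy is available for the 2-D convolution), one can bin on a fine two-dimensional mesh and convolve with the outer product of two one-dimensional weight vectors, mirroring the product-kernel form:

```python
import numpy as np
from scipy.signal import convolve2d   # assumes SciPy is available

def ash2d(x, y, hx, hy, m, xlim, ylim):
    """Bivariate ASH: fine 2-d bin counts convolved with the outer product
    of two 1-d triangle weight vectors (a product-kernel form)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    dx, dy = hx / m, hy / m
    nx = int(np.ceil((xlim[1] - xlim[0]) / dx))
    ny = int(np.ceil((ylim[1] - ylim[0]) / dy))
    ix = np.floor((x - xlim[0]) / dx).astype(int)
    iy = np.floor((y - ylim[0]) / dy).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    counts = np.zeros((nx, ny))
    np.add.at(counts, (ix[keep], iy[keep]), 1.0)      # fine 2-d bin counts
    w = (m - np.abs(np.arange(1 - m, m))) / m         # 1-d triangle weights
    W = np.outer(w, w)                                # product-kernel weights
    return convolve2d(counts, W, mode="same") / (n * hx * hy)
```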
An ASH with random mesh points was described by Chamayou (1980) for the analysis of high-energy physics data. Software for the ASH is generally available on StatLib and in R. Software for the advanced surface representation of ASH density estimates of high-dimensional data is updated from time to time.

References

7
J.M.F. Chamayou. Averaging shifted histograms. Computer Physics Commu-
nications, 21:145–161, 1980.
D. Freedman and P. Diaconis. On the histogram as a density estimator: l2
theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57:
453–476, 1981.
J.P. Keating and D.W. Scott. A primer on density estimation for the great home
run race of ‘98. STATS, 25:16–22, 1999.
D. W. Scott. On optimal and data-based histograms. Biometrika, 66:605–610,
1979.
D. W. Scott. Frequency polygons. Journal of the American Statistical Associa-
tion, 80:348–354, 1985a.
D. W. Scott. Multivariate density estimation: theory, practice, and visualization.
John Wiley & Sons, New York, 1992.
D.W. Scott. Averaged shifted histograms: Effective nonparametric density esti-
mators in several dimensions. Annals of Statistics, 13:1024–1040, 1985b.
B. W. Silverman. [Algorithm AS 176] Kernel density estimation using the fast
Fourier transform. Applied Statistics, 31:93–99, 1982.

Cross-References
Nonparametric curve estimation, Bandwidth in smoothing, Exploratory data
analysis
