PM R 11 (2019) 309–312

www.pmrjournal.org

Statistically Speaking
Histograms: A Useful Data Analysis Visualization
Regina L. Nuzzo, PhD

© 2019 American Academy of Physical Medicine and Rehabilitation
https://dx.doi.org/10.1002/pmrj.12145

Introduction

Visualizing data with histograms is an excellent first step in any analysis of quantitative data, but many researchers fail to take advantage of this exploratory data analysis tool. This article gives an overview of histograms and uses examples to illustrate important data features that they can help reveal.

The Nature of Histograms

Histograms were one of the earliest types of data visualizations, with references to their use dating back to the 19th century.1 The goal of these graphs is to visualize the shape (distribution) of data for a single quantitative variable such as systolic blood pressure, age, or birthweight. (Notice that histograms are not bar charts. Bar charts are properly used only for displaying counts of categorical variables. Histograms and boxplots display quantitative data.)2

Derived from the Latin root words for “drawn fences,” histograms typically consist of a number of adjacent, equal-width vertical columns, drawn so that there is no space between the columns. The columns correspond to “bins” that together span the range of the data. The data are divided among these bins, with the height of each bar corresponding to the number of data points falling into each bin. The taller the bar, the more data points fall into the range of that bin.

Skewness

Figure 1 shows the length of hospital stays for a random sample of 1000 patients enrolled in Phase I of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment).3 (The dataset is available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets) The hospital stays in this study ranged from 3 days to 241 days. The histogram shows 25 bins extending from 0 to 250 days, so that each bin covers a period of 10 days.

[Figure 1 plot: "Histogram of Hospital Stay Durations (each binwidth = 10 days)"; x-axis: Days until Hospital Discharge; y-axis: Number of patients]

Figure 1. Histogram of length of hospital stays for a random sample of 1000 patients enrolled in Phase I of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment).3 The majority (approximately 75%) of the patients had stays of less than 20 days, whereas the remaining quarter had stays ranging from 20 to 241 days. There is a floor effect and a strong positive skew.

The data in Figure 1 can be seen to exhibit a strong right or positive skew. This happens when the majority of the data have low values, but a few values are extremely high relative to the rest (ie, the data have a long right tail). Here, about 75% of the 1000 study patients had short hospital stays with a duration of less than 20 days (almost 475 patients stayed up to 10 days, and nearly 300 stayed between 10 and 20 days). After 20 days, the probability of a patient staying even longer drops dramatically over time, which can be seen in the long, tapering right tail. In fact, only 5 patients stayed longer than 140 days. Additionally, because hospital stays must be nonnegative, there is a floor effect (or basement effect) at zero. This happens when data values tend to “pile up” near the lowest possible limit of the variable.

The combination of skewness and floor effect suggests to researchers that they should avoid most traditional significance tests for these data (such as t-tests, analyses of variance, and tests for correlation coefficients). This is because these tests often rely on the underlying population being relatively symmetric. Using these tests when the data are skewed can lead to making incorrect inferences such as rejecting the null hypothesis when the null hypothesis is actually true (type I error). Researchers should also report the median and IQR (interquartile range) of the sample instead of (or at least in addition to) the mean and SD. This is because the mean and SD can be highly sensitive to skewness and outliers, and the median and IQR are more robust measures of center and spread.

Comparing Subpopulations

Histograms are also useful to compare the distributions of two subpopulations on the same axes. For example, in 1964 cancer researcher Frits de Waard and colleagues published a paper that investigated the bimodality in age-specific breast cancer incidence that had been observed since before World War II.4 Figure 2 shows a histogram with data simulated to recreate their published histograms.

The researchers stratified their sample of 240 patients with mammary carcinoma into two groups: those who exhibited obesity, hypertension, and/or decreased glucose tolerance and those who exhibited none of these symptoms. The former group had its peak around 65-69 years of age, whereas the latter peaked at around 45-49 years. The dark shaded area shows the overlap in the distributions of the two groups. After the age of 75 breast cancer was seen only in women with one of the indications. These histograms combined with other data led the researchers to propose that breast cancers develop along two different pathways, ones that researchers now realize correspond to estrogen-receptor-positive cancers and estrogen-receptor-negative cancers.

[Figure 2 plot: "Overlaid histograms"; legend (group): metabolic disorder, normal; x-axis: Age of breast cancer incidence; y-axis: Number of participants]

Figure 2. Overlaid histograms of age-specific breast cancer incidence, reconstructed from published histograms in de Waard et al.4 The group of patients with indications of metabolic disorder ranged from 30 to 85 years of age, whereas the group without indications ranged from 30 to only 74 years. The dark shaded area shows the overlap between the two groups. Using histograms to compare these subgroups helped lead the researchers to propose the existence of two distinct mechanisms for breast cancer.

Bin Width: The Tuning Parameter for Histograms

The most important decision a researcher faces when developing a histogram is the width of the bins. (Alternatively, the number of bins can be chosen: the greater the number of bins, the smaller the bin width.) Bin widths control the “resolution” of the histogram. If the bins are too wide, the histogram becomes very “soft focus,” without a clear shape and with many interesting data features obscured. On the other hand, using too-narrow bins will result in an overly choppy histogram; this tends to accentuate random artifacts in the data sample and makes it difficult to discern the true distribution of the underlying population data.

See Figure 3, for example, which shows the effect of different bin widths on the interpretation of the data. The variable plotted here is average systolic blood pressure (SBP) of each patient early in the SUPPORT study.
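To make the role of bin width concrete, the following is a minimal Python sketch of how equal-width binning can be done and how the same sample looks under narrow, moderate, and wide bins. It is an illustration only: the function names (bin_counts, print_histogram) are hypothetical, and the sample is simulated as a mixture of two normal distributions, not taken from the SUPPORT data.

```python
import math
import random


def bin_counts(values, lo, hi, width):
    """Count values per equal-width bin; bin k covers [lo + k*width, lo + (k+1)*width)."""
    n_bins = math.ceil((hi - lo) / width)
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp the index so a value exactly equal to hi lands in the last bin.
            k = min(int((v - lo) // width), n_bins - 1)
            counts[k] += 1
    return counts


def print_histogram(counts, lo, width, per_mark=5):
    """Print one text row per bin: lower edge, count, and a bar of '#' marks."""
    for k, c in enumerate(counts):
        print(f"{lo + k * width:>6} {c:>5} {'#' * (c // per_mark)}")


if __name__ == "__main__":
    # Simulated (hypothetical) bimodal sample; NOT the SUPPORT data.
    rng = random.Random(0)
    sample = ([rng.gauss(70, 10) for _ in range(400)]
              + [rng.gauss(110, 15) for _ in range(600)])

    for width in (2, 9, 30):  # narrow, moderate, wide
        counts = bin_counts(sample, 0, 180, width)
        print(f"\nbin width = {width} ({len(counts)} bins)")
        print_histogram(counts, 0, width)
```

Running the sketch prints three text histograms of the same simulated sample, so the choppiness of narrow bins and the loss of detail with wide bins can be compared directly. In practice a plotting library would normally replace the text output.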

[Figure 3 plots: (A) Undersmoothed Histogram (70 bins); (B) Oversmoothed Histogram (7 bins); (C) Balanced Histogram (20 bins). x-axis: Mean Systolic BP; y-axis: Number of participants]

Figure 3. Histograms (of mean systolic blood pressure from participants enrolled in Phase I of the SUPPORT study) showing the effect of bin width on
interpretability.2 The bins in panel A are too narrow, resulting in a plot that accentuates random artifacts in the data. The bins in panel B are too wide,
obscuring the bimodality in the data. Panel C follows the Rice rule and displays a balanced plot that reveals an almost-smooth bimodal shape with
peaks at about 70 mm Hg and 110 mm Hg.
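
The Rice rule named in the caption is one of several rules of thumb for choosing the number of bins. As a rough sketch, the five rules listed in the article's Box can be computed from a handful of summary statistics. The function name suggested_bins and its return layout are hypothetical, the constants (3.5 for Scott, 2 for Freedman-Diaconis) follow the Box, and rounding the bin count to the nearest integer is an assumption made here for display.

```python
import math


def suggested_bins(n, data_min, data_max, sd, iqr):
    """Return {rule: (number_of_bins, bin_width)} for five common rules of thumb."""
    data_range = data_max - data_min
    cube_root = n ** (1 / 3)

    # Rules that fix the number of bins first; the width follows from the range.
    k_sqrt = math.sqrt(n)
    k_sturges = math.ceil(math.log2(n)) + 1
    k_rice = 2 * cube_root

    # Rules that fix the bin width first; the number of bins follows from the range.
    w_scott = 3.5 * sd / cube_root
    w_fd = 2 * iqr / cube_root

    return {
        "square root": (round(k_sqrt), data_range / k_sqrt),
        "Sturges": (k_sturges, data_range / k_sturges),
        "Rice": (round(k_rice), data_range / k_rice),
        "Scott": (round(data_range / w_scott), w_scott),
        "Freedman-Diaconis": (round(data_range / w_fd), w_fd),
    }


if __name__ == "__main__":
    # Summary values quoted in the Box for the mean SBP example:
    # n = 1000, range 0 to 180, SD = 27.3, IQR = 42.25.
    for rule, (k, w) in suggested_bins(1000, 0, 180, 27.3, 42.25).items():
        print(f"{rule:<18} bins = {k:>3}   width = {w:.2f}")
```

The rules split into two families: the first three pick a bin count from the sample size alone, whereas Scott and Freedman-Diaconis pick a width from the spread of the data, which makes them respond to how dispersed the sample is.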

Figure 3A shows an undersmoothed histogram of the data, with 90 bins each covering 2 mm Hg. We can see what appears to be bimodality (ie, 2 modes, or clusters of data) in the sample, with one peak around 70 mm Hg and another around 110 mm Hg. The high resolution makes it difficult to determine the overall shape and height of each peak, however. In Figure 3B the same data is plotted in an overly smoothed histogram with only 7 bins, each one covering 30 mm Hg. In this version we have lost interesting and important data features, such as the bimodality.

There is no magic formula for determining the perfect bin width, but over the years statisticians have developed rules of thumb to help guide researchers. See Box for common formulas for bin width and the corresponding number of bins.5–8 The resulting estimates for the mean SBP example in Figure 3 are also given. In this case, there are three approaches that are close in their recommendations: Rice at 20, Scott at 19, and Freedman/Diaconis at 21 for the number of bins. There was not much visual difference between histograms with each of these bin numbers (not shown).

Box

Method | Number of bins | Bin width
Square root | $\sqrt{n}$ | $(\mathrm{Max} - \mathrm{Min}) / \sqrt{n}$
  SBP example | $\sqrt{1000} \approx 32$ | $(180 - 0) / \sqrt{1000} \approx 6$
Sturges5 | $\mathrm{ceil}(\log_2 n) + 1$ | $(\mathrm{Max} - \mathrm{Min}) / (\mathrm{ceil}(\log_2 n) + 1)$
  SBP example | $\mathrm{ceil}(\log_2 1000) + 1 = 11$ | $(180 - 0) / (\mathrm{ceil}(\log_2 1000) + 1) = 16.4$
Rice6 | $2 \cdot n^{1/3}$ | $(\mathrm{Max} - \mathrm{Min}) / (2 \cdot n^{1/3})$
  SBP example | $2 \cdot 1000^{1/3} = 20$ | $(180 - 0) / (2 \cdot 1000^{1/3}) = 9$
Scott7 | $(\mathrm{Max} - \mathrm{Min}) \cdot n^{1/3} / (3.5 \cdot \mathrm{SD})$ | $3.5 \cdot \mathrm{SD} / n^{1/3}$
  SBP example | $(180 - 0) \cdot 1000^{1/3} / (3.5 \cdot 27.3) \approx 19$ | $3.5 \cdot 27.3 / 1000^{1/3} = 9.6$
Freedman-Diaconis8 | $(\mathrm{Max} - \mathrm{Min}) \cdot n^{1/3} / (2 \cdot \mathrm{IQR})$ | $2 \cdot \mathrm{IQR} / n^{1/3}$
  SBP example | $(180 - 0) \cdot 1000^{1/3} / (2 \cdot 42.25) \approx 21$ | $2 \cdot 42.25 / 1000^{1/3} = 8.45$

Notes
ceil(x) refers to the ceiling function, which returns the smallest integer that is greater than or equal to x.
SD refers to the standard deviation.
IQR refers to the interquartile range.

Figure 3C shows a histogram using 20 bins, each covering 9 mm Hg (this choice gave the most interpretable bin width in whole numbers). Here the bimodality is clear, with one peak at around 70 mm Hg and another at 110 mm Hg. This shape should be a clear sign to researchers that further investigation is warranted to investigate if there are distinct subgroups in this sample.

Conclusion

Histograms are useful exploratory data visualizations for spotting outliers, skew, bimodality, and other shape features in the distribution as well as for comparing subgroups in the data. The presence of strong skewness or outliers should lead researchers to investigate the use of median and IQR as summary statistics and nonparametric hypothesis tests instead of traditional parametric tests. Bin width/bin number is a tuning parameter that should be experimented with to find the right balance to allow interesting features to emerge from the data. Even if no histograms are included in a final publication, they are a quick and indispensable tool to help researchers catch potential problems in the data and reveal interesting features.

References

1. Beniger JR, Robyn DL. Quantitative graphics in statistics: a brief history. Am Stat. 1978;32(1):1-11.
2. Nuzzo RL. The box plots alternative for visualizing quantitative data. PM R. 2016;8(3):268-272.
3. Knaus WA, Harrell FE, Lynn J, et al. The SUPPORT prognostic model: objective estimates of survival for seriously ill hospitalized adults. Ann Intern Med. 1995;122:191-203.
4. De Waard F, Baanders-Van Halewijn EA, Huizinga J. The bimodal age distribution of patients with mammary carcinoma. Evidence for the existence of 2 types of human breast cancer. Cancer. 1964;17(2):141-151.
5. Sturges HA. The choice of a class interval. J Am Stat Assoc. 1926;21(153):65-66.
6. Terrell GR, Scott DW. Oversmoothed nonparametric density estimates. J Am Stat Assoc. 1985;80(389):209-214.
7. Scott DW. On optimal and data-based histograms. Biometrika. 1979;66(3):605-610.
8. Freedman D, Diaconis P. On the histogram as a density estimator: L2 theory. Probability theory and related fields. 1981;57(4):453-476.

Disclosure

R.L.N. Department of Science, Technology, and Mathematics, Department of Psychology, HMB S340F Gallaudet University, Washington, DC, NE. Address correspondence to: R.L.N.; e-mail: regina.nuzzo@gallaudet.edu
Disclosure: nothing to disclose

Submitted for publication February 8, 2019; accepted February 11, 2019.
