02 Good Charts
(under construction)
Gary Klass
Illinois State University
© 2002
display as it draws on the talents of both the scientist and the artist. You
have to know and understand your data, but you also need a good sense of
how the reader will visualize the chart’s graphical elements.
Two problems arise in charting that are less common when data are
displayed in tables. Poor, or deliberately deceptive, choices in
graphic design can provide a distorted picture of the numbers and relationships
they represent. A more common problem is that charts are often designed
in ways that hide what the data might tell us, or that distract the reader
from quickly discerning the meaning of the evidence presented in the
chart. Each of these problems is illustrated in the two classic texts on data
presentation: Darrell Huff’s How to Lie with Statistics (1994) and Edward
Tufte’s The Visual Display of Quantitative Information (1983).
Huff’s little paperback, first published in 1954 and reissued many times
thereafter, condemned graphical representations of data that “lied”. Here,
the two numbers, one 3 times the magnitude of the other, are represented
by two cows, one 27 times larger than the other, resulting in a Lie Factor
of 9.
Here the figure depicts the increase in the number of milk cows in the
United States, from 8 million in 1860 to 25 million in 1936. The
larger cow is thus represented as three times the height of the 1860 cow. But
she is also three times as wide, thus taking up nine times the area of the
page. Moreover, the graphic is a depiction of a three-dimensional figure:
when we take the depth of the cow into account, she is twenty-seven times
larger in 1936. Later, Tufte developed the “Lie Factor”: a numerical
measure of the data distortion. Here, representing numbers that are 3 times
different in magnitude with images that are 27 times different in size
produces a “Lie Factor” of 9.
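The arithmetic behind the cow example can be checked in a few lines. This is a minimal sketch that follows the ratio-of-ratios calculation used in the text and its rounded data ratio of 3; the variable names are illustrative only.

```python
# Lie-factor arithmetic for the milk-cow pictogram, using the text's rounded figures.
data_ratio = 3              # the 1936 herd is about 3 times the 1860 herd
height_ratio = data_ratio   # the larger cow is drawn 3 times as tall

area_ratio = height_ratio ** 2     # width scales with height: area on the page
volume_ratio = height_ratio ** 3   # implied depth scales too: apparent "size"

# Ratio of the apparent size difference to the actual data difference,
# as the text computes it.
lie_factor = volume_ratio / data_ratio

print(area_ratio, volume_ratio, lie_factor)
```

A value of 1 would mean the picture changes in proportion to the data; anything much larger signals exaggeration.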
Such visual distortions are not as common as they once were, but modern
computer technology has made possible all sorts of new ways of lying with
charts.
· the labeling that defines the data: the title, axis titles and labels,
legends defining separate data series, and notes (often, to
indicate the data source),
· scales defining the range of the Y (and sometimes the X) axis, and
· the graphical elements that represent the data: the bars in bar
charts, the lines in time series plots, the points in scatterplots, or
the slices of a pie chart.
Figure 2: Components of a chart
Axis titles. Axis titles should be brief and should not be used at all if the
information merely repeats what is clear from the title and axis labels. It
would be redundant to repeat the phrase “Tuition and fees” in the Y axis of
figure 2, and the X-axis title, “University”, is completely unnecessary. If the
title of the chart has the subtitle “% of GDP”, it is not necessary to repeat
either the phrase or the word “percent” in the axis title.
Axis scale and data labels. The values or magnitudes of the main graphical
elements of the chart are defined by the axis scale, by individual data
labels, or by both. Avoid using too many numbers to define the data
points. A chart that labels the value of each individual data point does not
need labeling on the y axis. If it seems necessary to label every value in a
chart, consider that a table is probably a more efficient way of presenting
the data.
Legends. Legends are used in charts with more than one data series.
They should not be placed on the outside of the chart in a way that reduces
the plot area, the amount of space given to represent the data. In figure 2,
the legend is placed inside the chart (although some think that detracts
from the main graphical elements); it could also be placed at the bottom of
the chart (where the unnecessary “university” now stands).
The source. Specifying the source of the data is important for proper
academic citation, but it can also give knowledgeable readers, who are
often familiar with common data sources, important insights into the
reliability and validity of the data. For example, knowing that crime
statistics come from the FBI rather than the National Crime
Victimization Survey can be a crucial bit of information.
Other chart elements. The amount of ink given over to the non-data
elements of a chart that are not necessary for defining the meaning and
values of the data should be kept to an absolute minimum. Plot area
borders and plot area shading are unnecessary. Keep the shading of the
graphical elements simple and always avoid using unnecessary 3-D effects.
In most of the charts that follow, even the vertical line defining the
Y-axis has been removed, following the commendable charting standards of
The Economist magazine.
The most general standards of charting data are thus the following:
To see what happens when these rules are violated, consider figure 3, taken
from Robert Putnam’s Bowling Alone (where it is labeled figure 47), a work
that contains many good and bad examples of graphical data display (and
unfortunately, no tables at all). In just one chart, Putnam violates the three
fundamental rules of data presentation: the chart does not depict
meaningful data; the data it does depict are ambiguous, and the chart
design is seriously inefficient. One can’t accuse Putnam of distorting the
data only because his main conclusions are not derived from the data
presented in the chart.
Of these, let’s consider the inefficiency first: the first thing you notice about
the chart is that the graphical elements are represented in three
dimensions. On both efficiency and truthfulness this is unfortunate; the
3-D effect is entirely unnecessary and in this case serves to distort the
visual representation of the data. Had the data labels not been shown on
the top of each bar, it would not be readily apparent that column A is in fact
bigger than column F, or that C is the same size as B. In addition, the chart
suffers from what might be called “numbering inefficiency”: Putnam uses 13
numbers to represent 6 data points. Eliminating the 3-D, as shown in figure
4, offers a more exact representation of the data with a lot less ink.
Figure 4: Revised chart, without 3-D effects.
There are two problems of ambiguous data in the chart. The first is partly
resolved in Putnam’s text, where it is explained that bar E is the percentage
of women who are homemakers out of concern for their kids while bar A is
the percentage of women who are working full-time because they need the
money. It’s not quite clear what the numbers for those who work part-time
mean. In the case of bar C, for example, are the women working only
part-time because of the kids, or are they not full-time homemakers because
of the money?
The other ambiguity, however, is not for lack of proper labeling. If one
looks at the chart quickly, the first impression one would get
is that only 11% of women who work full-time do so for reasons of personal
satisfaction. But that is not the case. Look at the Y-axis title. Or notice that
all of the percentages add up to 100. Of all the women in the survey, 11%
were in the single category of “employed full-time for reasons of personal
satisfaction.” This is not what one expects in a bar chart, but given the
data Putnam has decided to display, there isn’t a whole lot that can be done
with the chart to fix it.
Still, we have to ask, “what does this chart mean?” In particular, what data
do the arrows on the bars represent?
explanatory. That there are problems with this chart becomes apparent to
the reader as soon as one encounters Putnam’s page and a half of
accompanying text devoted, not to explaining the significance of the data,
but to explaining what the elements of the chart represent. A careful
reading of the text tells us that there are basically three conclusions Putnam
would have us draw from this chart:
· Over time (the 1980s and 90s), more women are working.
Based on the textual discussion that Putnam offers it becomes clear that the
most meaningful data is represented in the chart, not by the height of the
bars, but by the direction of the arrows on the bars. Recall that as a general
rule data presentations that include more than one time point provide for
much more meaningful analyses than cross sectional or single time point
presentations. Although most of the data analysis in Bowling Alone is time
series data, in this case Putnam averages 21 years of data down to single
data points represented by the chart’s bars, with the time series change
represented by directional arrows. Thus, the most meaningful comparison
in the chart (the comparison that supports the conclusion that Putnam seeks
to draw from the data) is not that bar A is higher than bar B or F, but that
the arrow for bar A is going up while the arrow for bar F is going down.
Figure 5: Revised chart, with data from text.
Note also that figure 5 simplifies the data presentation by eliminating the
ambiguous part-time category: for part-timers, is “personal satisfaction” the
reason for not staying at home or the reason for not working full-time? And
it clarifies that the “necessity” refers to “kids” in the case of homemakers
and to “money” in the case of full time workers.
Types of Charts
Most charts are a variation on one of four basic types: pie charts, bar
charts, time series charts and scatterplots. Choosing the right type of chart
depends on the characteristics of the data and the relationships you want
displayed.
Pie Charts
· Use pie charts only for data that add up to some
meaningful total.
Pie charts should rarely be used. Pie charts usually contain more ink than is
necessary to display the data and the slices provide for a poor
representation of the magnitude of the data points. Do you remember as a
kid trying to decide which slice of your birthday cake was the largest? It is
more difficult for the eye to discern the relative size of pie slices than it is to
assess relative bar length. Forcing the reader to draw comparisons across
the two pie charts shown in figure 6 is also a bad idea: without looking at
the data label percentages in the figure, one cannot easily determine
whether the FY 2000 slices are larger or smaller than the corresponding FY
2007 slices.
3-D pie charts are even worse, as they also add a visual distortion of the
data (in figure 7, the thick 3-D band exaggerates the size of the corporate
income tax slice).
Figure 7: Exploding 3-D pie charts.
All the information in the pie charts above can be conveyed more precisely
and with far less ink in the simple bar chart shown in figure 8.
Nevertheless, people like pie charts. Readers expect to see one or two pie
charts similar to those in figure 6 at the very beginning of an annual agency
budget report. But it would be a big mistake to rely on several pie charts
for the primary data analysis in a report.
For those who would ignore all the advice given here and insist that good
charts must look pretty, the most recent version of the Microsoft Excel
charting software (in Office 2007, beta) will satisfy all your foolish desires:
3-D pie charts that gleam and glisten like Christmas tree ornaments, to say
nothing about what you can do with the 3-D pie chart’s cousins, the donut,
cylinder, cone, radar and pyramid charts.
As a general rule 3-D charts are not a good idea even when the data are
three dimensional. In theory they provide for a precise representation of
data, but it is rare that they provide a basis for drawing a simple conclusion.
Bar Charts
Bar charts often contain little data, a lot of ink, and rarely reveal ideas that
cannot be presented much more simply in a table. Minimizing the
ink-to-data ratio is especially important in the case of bar charts. Never use
a 3-D bar chart. Keep the gridlines faint. Display no more than seven
numbers on the Y-axis scale. If there are fewer than five bars, consider
using data labels rather than a Y-axis scale; it doesn't make sense to use a
five-numbered scale when the exact values can be shown with four
numbers.
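The two labeling rules above can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the 1-2-5 step sequence is a common convention assumed here, not something the text prescribes.

```python
import math

def nice_ticks(data_max, max_ticks=7):
    """Return evenly spaced tick values from 0 up to at least data_max,
    choosing a 1-2-5 style step so that no more than max_ticks values appear."""
    if data_max <= 0:
        raise ValueError("data_max must be positive")
    raw_step = data_max / (max_ticks - 1)
    magnitude = 10 ** math.floor(math.log10(raw_step))
    # Pick the smallest "round" step that is at least as large as raw_step.
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if step >= raw_step:
            break
    n_steps = math.ceil(data_max / step)
    return [round(i * step, 10) for i in range(n_steps + 1)]

def label_strategy(n_bars):
    """Fewer than five bars: label the bars directly instead of using a Y-axis scale."""
    return "data labels" if n_bars < 5 else "axis scale"

print(nice_ticks(23))      # a scale for data topping out at 23
print(label_strategy(4))
```

With a zero-based bar chart, the tick list always starts at 0 and ends at or just above the largest value.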
Figure 9: Rotated bar chart, two data series
Look at figure 9 and you can quickly grasp the main point (the United
States has the highest child poverty rate among developed nations), but
then spend some time with it and you'll discover other interesting things.
Note, for example, the differences in child and elderly poverty across
nations, or that the three countries at the top, with the lowest child poverty
rates, are Scandinavian countries; five of the seven countries with the
highest child poverty are English-speaking countries.
As with tables, sorting the data on the most significant variable greatly
eases the interpretation of the data. The data in figure 9 are sorted on the
child rather than the elderly poverty rates only because most of the
research on the topic has focused on child poverty. Note also that if the
sorted variable represents time, time should always run from left to right
on the X-axis.
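Sorting before plotting takes only a line or two in most tools. The sketch below uses made-up country names and rates (not the figure 9 data) to show sorting on the variable of interest, and sorting chronologically when the category is time.

```python
# Hypothetical poverty rates in percent; placeholder values, not the figure 9 data.
rates = [
    {"country": "Country A", "child": 14.2, "elderly": 9.1},
    {"country": "Country B", "child": 3.4, "elderly": 6.8},
    {"country": "Country C", "child": 21.7, "elderly": 12.3},
    {"country": "Country D", "child": 8.9, "elderly": 15.0},
]

# Sort on the variable the chart is mainly about (child poverty), lowest first,
# so the first row is drawn at the top of a rotated bar chart.
by_child = sorted(rates, key=lambda row: row["child"])
order = [row["country"] for row in by_child]

# When the category is time, sort chronologically so time runs left to right.
yearly = {2000: 67.0, 1970: 60.0, 1985: 64.5}   # hypothetical values keyed by year
chronological = sorted(yearly.items())

print(order)
print(chronological)
```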
One variation of the bar chart, the stacked bar chart, should be used with
caution, especially when there is no implicit order to the categories (i.e.,
when the categorical variable is nominal rather than ordinal) that make up
the bar, as is the case in figure 10. Note how difficult it is to discern the
differences in the size of the components on the upper parts of the bar. The
same difficulty occurs with stacked line and area charts.
The stacked bar chart works best when the primary comparisons are to be
made across the data series represented at the bottom of the bar. Thus,
placing the “teachers” data series at the bottom of the bars in figure 11
(and sorting the data on that series) forces the reader’s attention on the
crucial comparison and the obvious conclusion: American teachers are
fortunate to have such a large supervisory and support staff.
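Preparing data for a 100% stacked bar chart along these lines might look like the following sketch. The staff counts and country names are invented for illustration and are not the figure 11 data.

```python
# Hypothetical staff counts per country; placeholder values, not the figure 11 data.
staff = {
    "Country A": {"teachers": 600, "support": 250, "administrators": 150},
    "Country B": {"teachers": 800, "support": 120, "administrators": 80},
    "Country C": {"teachers": 500, "support": 300, "administrators": 200},
}

# The series listed first is the one drawn at the bottom of each bar.
series_order = ["teachers", "support", "administrators"]

def to_shares(counts):
    """Convert raw counts to percentages that sum to 100."""
    total = sum(counts.values())
    return {name: 100 * counts[name] / total for name in series_order}

shares = {country: to_shares(counts) for country, counts in staff.items()}

# Sort the bars on the bottom series so the key comparison reads in order.
bar_order = sorted(shares, key=lambda c: shares[c]["teachers"], reverse=True)

print(bar_order)
```

Because the bottom series shares a common baseline across bars, it is the only one whose lengths can be compared directly.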
Figure 11: Stacked (100%) bar chart
One common bar charting mistake is including the legend on the right-hand
side of the plot area (shown in figure 12). Placing the legend inside the plot
area, as in figure 9, or horizontally under the chart title (as in figure 11)
maximizes the size of the area given over to displaying the data.
Figure 12: Scaling effects in a bar chart with two data series
Scaling effects occur when a bar chart (or a line chart, as we will see) contains two
data series with numbers of substantially different magnitude, obscuring the
variation in the data series containing the smaller numbers. Figure 12, for
example, depicts the increase in the labor force participation rate (the
percent of the adult population in the labor force) from 60% in 1970 to 67%
in 2000, and the increase in the unemployment rate from 5.3% to 7.1%. The
immediate visual impression the chart gives is that the labor force
participation rate is larger than the unemployment rate (a relatively
meaningless comparison), while the important variation in the
unemployment rate (an increase of about a third) is hardly noticeable. Including an
additional bar representing the sum of the other bars in a chart (as shown
in figure 13) has the same effect of reducing the variation in the main
graphical elements.
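The scaling effect can be made concrete by comparing relative change with what the eye sees on a shared axis. The sketch below uses the rates quoted in the text; the 0 to 70 axis range is an assumption for illustration, not a measurement of figure 12.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100

# Rates quoted in the text for 1970 and 2000.
participation = (60.0, 67.0)
unemployment = (5.3, 7.1)

participation_change = percent_change(*participation)
unemployment_change = percent_change(*unemployment)

# On a shared axis, the visible change is the difference in bar height
# as a fraction of the axis range (assumed here to run from 0 to 70).
axis_max = 70.0
visible_participation = (participation[1] - participation[0]) / axis_max
visible_unemployment = (unemployment[1] - unemployment[0]) / axis_max

print(round(participation_change, 1), round(unemployment_change, 1))
print(round(visible_participation, 3), round(visible_unemployment, 3))
```

The series with the larger relative change ends up with the smaller visible change, which is the distortion the shared scale introduces.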
To see what happens when most of the bar charting rules are violated,
consider the example in figure 14, produced by the Illinois Board of Higher
Education (IBHE), (conflict of interest disclosure) my employer.
Figure 14: A really bad bar chart.
source: IBHE 2002.
It’s not just the 3-D. Look carefully at the X-axis. Using comparable data
(the only available data: Fall headcounts rather than 12-month headcounts),
eliminating the 3-D effects, sorting time from left to right, removing the
community college data series, and adjusting the bottom of the scale, we
see something in figure 15 that the IBHE chart obscured: private institution
enrollments are responding to public demand for higher education; public
universities are not.
Figure 15: Revised enrollment chart
Some would object to not using a zero base for the
Y-axis scale in figure 15, but I don’t think that the depiction is all that
unfair. It is fair to say, I think, that private institutions have accounted for
most of the growth in university and college enrollments in the state, a
disparity that would appear even more dramatic if annual change measures
were depicted as in figure 16, with a zero base.
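Annual change measures of this kind are simple differences between consecutive years. The sketch below uses invented headcounts (not the IBHE data) purely to show the calculation.

```python
# Hypothetical fall headcounts by year; placeholder values, not the IBHE data.
years = [1997, 1998, 1999, 2000, 2001]
public = [190_000, 190_500, 191_000, 190_800, 191_500]
private = [100_000, 103_000, 106_500, 110_000, 114_000]

def annual_change(values):
    """Difference between each year's value and the previous year's."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

public_change = annual_change(public)
private_change = annual_change(private)
change_years = years[1:]   # the first year has no prior year to compare against

for year, pub, priv in zip(change_years, public_change, private_change):
    print(year, pub, priv)
```

Because changes can be positive or negative, a zero base is the natural reference line when they are charted.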
Time Series Line Charts
The time series chart is one of the most efficient means of displaying large
amounts of data in ways that provide for meaningful analysis. The typical
time series line chart is a scatterplot chart with time represented on the
X-axis and lines connecting the data points.
· Make sure the reader can clearly distinguish the lines for
separate data series.
Figure 17: Presidential approval: time series trend with annotations
Figure 19: Time series chart with second Y-axis
Many who have written about graphical distortion condemn the use of
two-scale charts because the relative sizes of the two scales are completely
arbitrary. This is true; had job approval and unemployment been plotted on
the same 0 to 90 Y-axis scale, the unemployment rate would be an almost
flat line at the bottom of the chart.
When several time series lines are printed in black and white, it is
sometimes difficult to separate out the different trend lines. Mixing solid,
dotted, and dashed lines for each variable may solve this problem, although
it is sometimes difficult to distinguish between dotted and dashed lines.
Scatterplots
Rules for Scatterplots
· The chart title should identify the two variables and
the cases (e.g., cities or states).
With good labeling of the variables and cases and common-sense scaling of
the X and Y-axes, there's not a lot that can go wrong with a scatterplot,
although extreme outliers on one or more of the variables can obscure
patterns in the data.
Figure 20: Scatterplot with data labels and trendline
In figure 20, TV viewing is the independent variable. (If you were trying to
predict which types of students watch the most TV, the axes would be
reversed.) The scatterplot contains two optional plotting features: a
regression trendline denoting the linear relationship between the two
variables and the use of State postal ID data labels to indicate each state's
position on the chart (these labels require a special add-in to the Excel
program). Although the chart suffers from overlapping data labels, the
interpretation is straightforward; the higher the percentage of students in a
state watching more than 6 hours of TV each day, the lower the state's math
scores.
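A regression trendline of the kind described is an ordinary least-squares fit of Y on X. The sketch below computes the slope and intercept by hand; the five data points are invented for illustration and are not the state data plotted in figure 20.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values: percent of students watching 6+ hours of TV (X)
# and average math score (Y). Not the figure 20 data.
tv_percent = [10, 15, 20, 25, 30]
math_score = [280, 276, 270, 268, 260]

slope, intercept = fit_line(tv_percent, math_score)
print(round(slope, 2), round(intercept, 1))
```

A negative slope corresponds to the downward-sloping trendline: higher values of the independent variable go with lower values of the dependent one.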
Boxplots
· Boxplots are best used for comparing the distribution
of the same variable for two or more groups or two or
more time points.
The simple boxplot, as shown in figure 21, displays the four quartiles of the
data, with the "box" comprising the two middle quartiles, separated by the
median. The upper and lower quartiles are represented by the single lines
extending from the box. More detailed versions of the boxplot restrict the
“whiskers” on the plot to 1.5 times the size of the boxes and plot the higher
or lower values (outliers) as individual points. Some versions also plot the
mean in addition to the median.
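The numbers behind the more detailed boxplot can be computed as in the sketch below. It assumes the "inclusive" quartile method from Python's statistics module (other tools may use slightly different quartile conventions) and takes "1.5 times the size of the box" to mean 1.5 times the interquartile range; the sample values are made up.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers for a detailed boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # the height of the box
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(data),
        "whisker_low": min(inside),    # whiskers stop at the last point inside the fences
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical values with one extreme case
stats = boxplot_stats(sample)
print(stats)
```

Setting `whisker` to a very large number reproduces the simple boxplot, where the lines extend all the way to the minimum and maximum.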
A single boxplot box (as in figure 21) rarely reveals much about the data,
and graphs of single variable data distributions (using stem-and-leaf or
histogram charts) rarely offer a more detailed graphic representation of the
data distribution. The real advantage of the boxplot graphic comes
through, however, in single charts using several boxplots to compare the
distribution of a variable across groups or over time. An especially useful
elaboration of the boxplot graph involves plotting an individual case over
the boxplot to compare single cases to the overall distribution (see figure
22).
Figure 22: Comparing boxplots, with labels for individual cases (Nevada)
Thus, figure 22 displays the percentage Democratic vote for the 50 states
over the past seven presidential elections. Labeling a single case, we can
see that the Democratic vote in Nevada has moved steadily higher relative
to the other states. One can easily imagine applying the same plotting
strategy in a variety of other settings, for example, comparing one school
district's test scores to the distribution of test scores across other school
districts.
House, the Office of Management and Budget, and the Government Printing
Office websites.
Poverty. The poverty data shown in figure 9 were obtained from the
Luxembourg Income Study website and are described in the Poverty chapter
that follows.
Census Bureau’s Statistical Abstract.
All of the charts shown in this chapter were prepared using the Microsoft
Excel charting software, but some of the charts required modifications to
the normal charting options offered by Excel.
Figure 2c contains two levels of category labels on the X-axis that are easily
done with Excel but not explained in the documentation. The trick is to
incorporate an additional label series in the data range, as shown below:
Nor does Excel offer a boxplot (also called “box and whiskers”) chart type as
shown in figure 17, although Microsoft does have on-line instructions that
explain how to modify existing chart types to derive a boxplot. More
simply, Jon Peltier provides a “box chart maker”, an Excel add-in that
greatly simplifies the process. A number of other add-ins and work-arounds
to extend the Excel charting capabilities are freely available on the
Internet.
Excel provides two different approaches for formatting the X-axis for time
series line graphs. Excel line graphs treat the X-axis as if “X” is not a
numerical variable, much the same way that the bar chart graphs define the
X-axis. The X-axis labels and the data points are positioned between the
tick marks and the sequencing of the data points depends on the order of
the cases in the Excel spreadsheet (see figure 19).
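The practical difference between a category axis and a numeric axis shows up when the time points are unevenly spaced. The sketch below is a plain-Python illustration, not Excel code, and the years are hypothetical.

```python
# Hypothetical, unevenly spaced years in spreadsheet order.
years = [1990, 1995, 1996, 2005]

# Category axis: each case gets the next slot, whatever its value.
category_positions = list(range(len(years)))

# Numeric axis: positions are proportional to the values themselves,
# rescaled here to the same 0-to-3 width for comparison.
span = years[-1] - years[0]
width = len(years) - 1
numeric_positions = [(year - years[0]) / span * width for year in years]

print(category_positions)
print([round(p, 2) for p in numeric_positions])
```

On the category axis the one-year gap between 1995 and 1996 is drawn as wide as the nine-year gap that follows it, which can distort the apparent slope of a trend.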
Figure 21: Excel XY scatter: X-axis is a labeled data series
To get around this, a) eliminate the X-axis labels, b) plot an extra data series
with all the values set to the axis’ minimum value, c) use a “+” sign as the
data markers, and d) use the J-Walk Chart tools to label the data points.
Also available on the Internet are a variety of macros and instructions for
modifying the scaling of the Y-axis (Excel does a really poor job when it
comes to log scaled axes).
One would have hoped that Microsoft would incorporate such functions into
the newer versions of the Excel software. Unfortunately, the Microsoft
developers seem to have chosen to go in a different direction with the latest
version of the software (Office 2007), incorporating all sorts of new chart
styles that involve all sorts of advanced 3-D effects, shading, glow, soft edges
and shadows.
References:
Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to
Enlighten (Analytics Press).
Miller, Jane E. 2004. “Creating Effective Charts,” The Chicago Guide to Writing
about Numbers, (University of Chicago Press). Chapter 7.
Wallgren, Anders, et al. 1996. Graphing Statistics & Data (Sage Publications).
Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception
from Napoleon Bonaparte to Ross Perot. (Mahwah, NJ: Lawrence Erlbaum
Associates).