
Presenting Data: Tabular and graphic display of social indicators

(under construction)
Gary Klass
Illinois State University
© 2002

display as it draws on the talents of both the scientist and the artist. You
have to know and understand your data, but you also need a good sense of
how the reader will visualize the chart’s graphical elements.

Two problems arise in charting that are less common when data are
displayed in tables. Poor, or deliberately deceptive, choices in
graphic design can provide a distorted picture of the numbers and relationships
they represent. A more common problem is that charts are often designed
in ways that hide what the data might tell us, or that distract the reader
from quickly discerning the meaning of the evidence presented in the
chart. Each of these problems is illustrated in the two classic texts on data
presentation: Darrell Huff’s How to Lie with Statistics (1993) and Edward
Tufte’s The Visual Display of Quantitative Information (1983).

Huff’s little paperback, first published in 1954 and reissued many times
thereafter, condemned graphical representations of data that “lied”. Here,
the two numbers, one 3 times the magnitude of the other, are represented
by two cows, one 27 times larger than the other, resulting in a Lie Factor
of 9.

Figure 1: Graphical distortion of data
SOURCE: Darrell Huff. 1993. How to Lie with Statistics. WW Norton & Co, 72.

Here the figure depicts the increase in the number of milk cows in the
United States, from 8 million in 1860 to twenty five million in 1936. The
larger cow is thus represented as three times the height of the 1860 cow. But
she is also three times as wide, thus taking up nine times the area of the
page. Moreover the graphic is a depiction of a three dimensional figure:
when we take the depth of the cow into account, she is twenty seven times
larger in 1936. Later, Tufte developed the “Lie Factor”: a numerical
measure of the data distortion. Here, representing numbers that are 3 times
different in magnitude with images that are 27 times different in size
produces a “Lie Factor” of 9.
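The arithmetic behind the cow example can be sketched in a few lines of Python. This is only an illustration of the calculation as this chapter describes it (the ratio shown in the graphic divided by the ratio in the data); the function name lie_factor is an assumption made for this sketch and is not taken from Huff or Tufte.

```python
def lie_factor(data_ratio, graphic_ratio):
    """Ratio shown by the graphic divided by the ratio in the data.

    Follows the simple ratio-of-ratios arithmetic used in this chapter.
    """
    return graphic_ratio / data_ratio


# The 1936 herd is about 3 times the size of the 1860 herd.
data_ratio = 3

# Scaling a drawing by 3 in each dimension multiplies its area by 3**2
# and its apparent volume by 3**3.
height_ratio = data_ratio
area_ratio = data_ratio ** 2
volume_ratio = data_ratio ** 3

print(lie_factor(data_ratio, height_ratio))  # height only: no distortion
print(lie_factor(data_ratio, area_ratio))    # area on the page
print(lie_factor(data_ratio, volume_ratio))  # implied three-dimensional cow
```

Scaling only the height of the cow would have kept the factor at 1; it is the extra width and implied depth that inflate it.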

Such visual distortions are not as common as they once were, but modern
computer technology has made possible all sorts of new ways of lying with
charts.

Edward Tufte would second Emperor Joseph II’s famous complaint to a
young composer: “too many notes, Mozart.” Tufte’s unique contribution to
the art of chart design was to stress the virtue of efficient data presentation.
His fundamental rule of efficient graphical design is to minimize the ratio
of ink to data by minimizing or eliminating any elements from the chart
that do not aid in conveying what the numbers mean. Tufte’s advice to those
who would chart is essentially the same advice offered by Strunk and White
to would-be writers:

"A sentence should contain no unnecessary words, a paragraph no
unnecessary sentences for the same reason that a drawing should
contain no unnecessary lines and a machine no unnecessary parts."
(23)

Just as the purpose of any statistic is to simplify, to represent in one number
a larger set of numbers, the purpose of a chart is to simplify numerical
comparisons: to represent several numerical comparisons in a single
graphic. The most common error in chart design is to include elements
in the graphical display that have nothing to do with the presentation of the
numerical comparisons. Below we will see how this standard applies to the
components of charts in general.

The Components of a Chart

There are three basic components to most charts:

the labeling that defines the data: the title, axis titles and labels,
legends defining separate data series, and notes (often, to
indicate the data source),

scales defining the range of the Y (and sometimes the X) axis, and

the graphical elements that represent the data: the bars in bar
charts, the lines in time series plots, the points in scatterplots, or
the slices of a pie chart.

Figure 2: Components of a chart

Titles. In journalistic writing a chart title will sometimes state the
conclusion the writer would have the reader draw from the chart. If figure
2 were used in a Governors State University press release, the title, “Tuition
and Fees Lowest at GSU” might be appropriate. In academic writing, the
title should be used to define the data series, as is shown in figure 2,
without imposing a data interpretation on the reader. Often, the units of
measurement are specified at the end of the title after a colon or in
parentheses in a subtitle (e.g. “constant dollars”, “% of GDP”, or “billions of
US dollars”).

Axis titles. Axis titles should be brief and should not be used at all if the
information merely repeats what is clear from the title and axis labels. It
would be redundant to repeat the phrase “Tuition and fees” in the Y axis of
figure 2, and the X-axis title, “University”, is completely unnecessary. If the
title of the chart has the subtitle “% of GDP”, it is not necessary to repeat
either the phrase or the word “percent” in the axis title.

Axis scale and data labels. The values or magnitudes of the main graphical
elements of the chart are defined by either or both the axis scale and
individual data labels. Avoid using too many numbers to define the data
points. A chart that labels the value of each individual data point does not
need labeling on the Y-axis. If it seems necessary to label every value in a
chart, consider that a table is probably a more efficient way of presenting
the data.

Legends. Legends are used in charts with more than one data series.
They should not be placed on the outside of the chart in a way that reduces
the plot area, the amount of space given to represent the data. In figure 2,
the legend is placed inside the chart (although some think that detracts
from the main graphical elements); it could also be placed at the bottom of
the chart (where the unnecessary “University” now stands).

Gridlines. If used at all, gridlines should use as little ink as possible so as
to not overwhelm the main graphical elements of the chart.

The source. Specifying the source of the data is important for proper
academic citation, but it can also give knowledgeable readers, who are
often familiar with common data sources, important insights into the
reliability and validity of the data. For example, knowing that crime
statistics come from the FBI rather than the National Crime
Victimization Survey can be a crucial bit of information.

Other chart elements. The amount of ink given over to the non-data
elements of a chart that are not necessary for defining the meaning and
values of the data should be kept to an absolute minimum. Plot area
borders and plot area shading are unnecessary. Keep the shading of the
graphical elements simple and always avoid using unnecessary 3-D effects.
In most of the charts that follow, even the vertical line defining the
Y-axis has been removed, following the commendable charting standards of
The Economist magazine.

When Graphic Design Goes Badly.

The most general standards of charting data are thus the following:

Present meaningful data.

Define the data unambiguously.

Do not distort the data.

Present the data efficiently.

To see what happens when these rules are violated, consider figure 3, taken
from Robert Putnam’s Bowling Alone (where it is labeled figure 47), a work
that contains many good and bad examples of graphical data display (and
unfortunately, no tables at all). In just one chart, Putnam violates three
of the four rules listed above: the chart does not depict
meaningful data, the data it does depict are ambiguous, and the chart
design is seriously inefficient. One can’t accuse Putnam of distorting the
data only because his main conclusions are not derived from the data
presented in the chart.

Figure 3: Very bad graphical display

Of these, let’s consider the inefficiency first: the first thing you notice about
the chart is that the graphical elements are represented in three
dimensions. On both efficiency and truthfulness this is unfortunate; the
3-D effect is entirely unnecessary and in this case serves to distort the
visual representation of the data. Had not the data labels been shown on
the top of each bar, it would not be readily apparent that column A is in fact
bigger than column F, or that C is the same size as B. In addition the chart
suffers from what might be called “numbering inefficiency”: Putnam uses 13
numbers to represent 6 data points. Eliminating the 3-D, as shown in figure
4, offers a more exact representation of the data with a lot less ink.

Figure 4: Revised chart, without 3-D effects.

There are two problems of ambiguous data in the chart. Partly this is
resolved in Putnam’s text where it is explained that bar E is the percentage
of women who are homemakers out of concern for their kids while bar A is
the percentage of women who are working full-time because they need the
money. It’s not quite clear what the numbers for those who work part-time
mean. In the case of bar C, for example, are the women working only
part-time because of the kids, or are they not full-time homemakers because
of the money?

The other ambiguity, however, is not for the lack of proper labeling. If one
looks at the chart quickly, the first impression one would get is that
only 11% of women who work full-time do so for reasons of personal
satisfaction. But that is not the case. Look at the Y-axis title. Or notice that
all of the percentages add up to 100. Of all the women in the survey, 11%
were in the single category of “employed full-time for reasons of personal
satisfaction.” This is not what one expects in a bar chart, but given the
data Putnam has decided to display, there isn’t a whole lot that can be done
with the chart to fix it.
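The difference between the two readings is easy to see with a little arithmetic. The sketch below uses hypothetical percentages (only the 11% figure comes from the discussion above; the other five values are invented so that the total is 100) to contrast a share of all women surveyed with a share of the women who work full-time.

```python
# Hypothetical shares of ALL women surveyed, in percent. Only the 11 is
# taken from the text; the other values are made up for illustration.
shares_of_all = {
    ("full-time", "satisfaction"): 11,
    ("full-time", "necessity"): 31,
    ("part-time", "satisfaction"): 10,
    ("part-time", "necessity"): 11,
    ("homemaker", "satisfaction"): 20,
    ("homemaker", "necessity"): 17,
}

# The six categories partition the whole sample, so they sum to 100.
total = sum(shares_of_all.values())

# Share of full-time workers (not of all women) who cite satisfaction.
full_time_total = sum(
    share for (status, _), share in shares_of_all.items() if status == "full-time"
)
within_full_time = 100 * shares_of_all[("full-time", "satisfaction")] / full_time_total

print(total)
print(round(within_full_time, 1))
```

With these invented numbers, 11% of all women corresponds to roughly a quarter of full-time workers, which is the kind of gap a hurried reader of the chart would miss.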

Still, we have to ask, “what does this chart mean?” In particular, what data
do the arrows on the bars represent?

A critical standard of good charting is that the chart should be
self-explanatory. That there are problems with this chart becomes apparent to
the reader as soon as one encounters Putnam’s page and a half of
accompanying text devoted, not to explaining the significance of the data,
but to explaining what the elements of the chart represent. A careful
reading of the text tells us that there are basically three conclusions Putnam
would have us draw from this chart:

· Over time (the 1980s and 90s), more women are working.

· They are doing so less for reasons of personal satisfaction and
more out of necessity (i.e., to earn money).

· Correspondingly, there has been a significant decline in the
number of women who choose to be homemakers for reasons of
personal satisfaction.

These three conclusions are directly relevant to Putnam’s general thesis:
that over time there has been a decline in social capital (adults are spending
less time raising children and developing the social capital of future
generations) driven in part by the demands of the expanding work force.

Based on the textual discussion that Putnam offers it becomes clear that the
most meaningful data is represented in the chart, not by the height of the
bars, but by the direction of the arrows on the bars. Recall that as a general
rule data presentations that include more than one time point provide for
much more meaningful analyses than cross sectional or single time point
presentations. Although most of the data analysis in Bowling Alone is time
series data, in this case Putnam averages 21 years of data down to single
data points represented by the chart’s bars, with the time series change
represented by directional arrows. Thus, the most meaningful comparison
in the chart (the comparison that supports the conclusion that Putnam seeks
to draw from the data) is not that bar A is higher than bar B or F, but that
the arrow for bar A is going up while the arrow for bar F is going down.

The crucial comparison is made directly in figure 5, based on the data
presented in the textual discussion. Moreover, it directly illustrates several
points that neither the text nor the original chart makes clear: in 1978, a
plurality of women were homemakers who stayed home out of personal
satisfaction; in 1999, women who worked full time for financial reasons were
the plurality.

Figure 5: Revised chart, with data from text.

Note also that figure 5 simplifies the data presentation by eliminating the
ambiguous part-time category: for part-timers, is “personal satisfaction” the
reason for not staying at home or the reason for not working full-time? And
it clarifies that the “necessity” refers to “kids” in the case of homemakers
and to “money” in the case of full time workers.

Types of Charts

Most charts are a variation on one of four basic types: pie charts, bar
charts, time series charts and scatterplots. Choosing the right type of chart
depends on the characteristics of the data and the relationships you want
displayed.

Pie Charts

Pie charts are used to represent the distribution of the categorical
components of a single variable. Note that as a general rule, multivariate
comparisons provide for more meaningful analysis than do single variable
distributions and for this and other reasons pie charts should be rarely
used, if at all.

Rules for pie charts:

· Avoid using pie charts.

· Use pie charts only for data that add up to some
meaningful total.

· Never ever use three dimensional pie charts; they are
even worse than two dimensional pies.

· Avoid forcing comparisons across more than one pie
chart.

Figure 6: Comparing two pie charts

Pie charts should rarely be used. Pie charts usually contain more ink than is
necessary to display the data and the slices provide for a poor
representation of the magnitude of the data points. Do you remember as a
kid trying to decide which slice of your birthday cake was the largest? It is
more difficult for the eye to discern the relative size of pie slices than it is to
assess relative bar length. Forcing the reader to draw comparisons across
the two pie charts shown in figure 6 is also a bad idea: without looking at
the data label percentages in the above figures one cannot easily determine
whether the FY 2000 slices are larger or smaller than the corresponding FY
2007 slices.

3-D pie charts are even worse, as they also add a visual distortion of the
data (in figure 7, the thick 3-D band exaggerates the size of the corporate
income tax slice).

Figure 7: Exploding 3-D pie charts.

All the information in the pie charts above can be conveyed more precisely
and with far less ink in the simple bar chart shown in figure 8.

Figure 8: Bar charts are better than pie charts.

Nevertheless, people like pie charts. Readers expect to see one or two pie
charts similar to those in figure 6 at the very beginning of an annual agency
budget report. But it would be a big mistake to rely on several pie charts
for the primary data analysis in a report.

For those who would ignore all the advice given here and insist that good
charts must look pretty, the most recent version of the Microsoft Excel
charting software (in Office 2007, beta) will satisfy all your foolish desires:
3-D pie charts that gleam and glisten like Christmas tree ornaments, to say
nothing about what you can do with the 3-D pie chart’s cousins, the donut,
cylinder, cone, radar and pyramid charts.

As a general rule 3-D charts are not a good idea even when the data are
three dimensional. In theory they provide for a precise representation of
data, but it is rare that they provide a basis for drawing a simple conclusion.

Bar Charts:

Bar charts typically display the relationship between one or more
categorical variables and one or more quantitative variables represented by
the length of the bars. The categorical variables are usually defined by the
categories displayed on the X-axis and, if there is more than one data series,
by the legend.

Rules for bar charts:

· Minimize the ink, do not use 3-D effects.

· Sort the data on the most significant variable.

· Use rotated bar charts if there are more than 8 to 10 categories.

· Place legends inside or below the plot area.

· With more than one data series, beware of scaling distortions.

Bar charts often contain little data, a lot of ink, and rarely reveal ideas that
cannot be presented much more simply in a table. Minimizing the
ink-to-data ratio is especially important in the case of bar charts. Never use
a 3-D bar chart. Keep the gridlines faint. Display no more than seven
numbers on the Y-axis scale. If there are fewer than five bars, consider
using data labels rather than a Y-axis scale; it doesn't make sense to use a
five-numbered scale when the exact values can be shown with four
numbers.

Figure 9: Rotated bar chart, two data series

Look at figure 9 and you can quickly grasp the main point (the United
States has the highest child poverty rate among developed nations), but
then spend some time with it and you'll discover other interesting things.
Note, for example, the differences in child and elderly poverty across
nations, or that the three countries at the top, with the lowest child poverty
rates, are Scandinavian countries, while five of the seven countries with the
highest child poverty are English-language countries.

As with tables, sorting the data on the most significant variable greatly
eases the interpretation of the data. The data in figure 9 are sorted on the
child rather than the elderly poverty rates only because most of the
research on the topic has focused on child poverty. Note also that if the
sorted variable represents time, time should always go from left to right
on the X-axis.
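Sorting before charting is a one-line operation in most tools. A minimal Python sketch, using made-up poverty rates for hypothetical countries A to D (these are not the figure 9 data), might look like this:

```python
# (country, child poverty %, elderly poverty %): hypothetical values.
rows = [
    ("Country A", 12.5, 9.0),
    ("Country B", 3.1, 6.4),
    ("Country C", 21.9, 20.3),
    ("Country D", 7.8, 4.2),
]

# Sort on the most significant variable (here, child poverty), lowest first,
# so the bars appear in rank order rather than alphabetical order.
by_child_poverty = sorted(rows, key=lambda row: row[1])

for country, child, elderly in by_child_poverty:
    print(f"{country:<10} {child:5.1f} {elderly:5.1f}")
```

The second series (elderly poverty) simply travels along with each row; only one variable can determine the order.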

One variation of the bar chart, the stacked bar chart, should be used with
caution, especially when there is no implicit order to the categories (i.e.,
when the categorical variable is nominal rather than ordinal) that make up
the bar, as is the case in figure 10. Note how difficult it is to discern the
differences in the size of the components on the upper parts of the bar. The
same difficulty occurs with stacked line and area charts.

Figure 10: Stacked bar chart with nominal categories

The stacked bar chart works best when the primary comparisons are to be
made across the data series represented at the bottom of the bar. Thus,
placing the “teachers” data series at the bottom of the bars in figure 11
(and sorting the data on that series) forces the reader’s attention on the
crucial comparison and the obvious conclusion: American teachers are
fortunate to have such a large supervisory and support staff.

Figure 11: Stacked (100%) bar chart

One common bar charting mistake is including the legend on the right-hand
side of the plot area (shown in figure 12). Placing the legend inside the plot
area, as in figure 9, or horizontally under the chart title (as in figure 11)
maximizes the size of the area given over to displaying the data.

Figure 12: Scaling effects in a bar chart with two data series

Scaling effects occur when a bar chart (or a line chart, as we will see) displays
two data series with numbers of substantially different magnitudes: the chart
obscures the variation in the data series containing the smaller numbers.
Figure 12, for example, depicts the increase in the labor force participation
rate (the percent of the adult population in the labor force) from 60% in 1970
to 67% in 2000, and the increase in the unemployment rate from 5.3% to 7.1%. The
immediate visual impression the chart gives is that the labor force
participation rate is larger than the unemployment rate (a relatively
meaningless comparison), while the important variation in the
unemployment rate (a 34% increase) is hardly noticeable. Including an
additional bar representing the sum of the other bars in a chart (as shown
in figure 13) has the same effect of reducing the variation in the main
graphical elements.
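One way to check whether a chart is hiding variation is to compute the percentage change in each series directly. The sketch below uses the figure 12 values quoted above; the helper name pct_change is an assumption for this example.

```python
def pct_change(start, end):
    """Percentage change from start to end."""
    return 100 * (end - start) / start


participation = pct_change(60, 67)    # labor force participation rate
unemployment = pct_change(5.3, 7.1)   # unemployment rate

print(f"Participation rate: {participation:.1f}% increase")
print(f"Unemployment rate:  {unemployment:.1f}% increase")
```

On a shared axis the 7-point rise in participation dominates the picture, even though the unemployment rate changed proportionally about three times as much.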

Figure 13: Scaling effect in a bar chart.

To see what happens when most of the bar charting rules are violated,
consider the example in figure 14, produced by the Illinois Board of Higher
Education (IBHE), (conflict of interest disclosure) my employer.

Figure 14: A really bad bar chart.
source: IBHE 2002.

It’s not just the 3-D. Look carefully at the X-axis. Using comparable data
(the only available data: Fall headcounts rather than 12 month headcounts),
eliminating the 3-D effects, sorting time from left to right, removing the
community college data series, and adjusting the bottom of the scale, we
see something in figure 15 that the IBHE chart obscured: private institution
enrollments are responding to public demand for higher education; public
universities are not.

Figure 15: Revised enrollment chart

Note, however, that some would object to not using a zero base for the
Y-axis scale in figure 15, but I don’t think that the depiction is all that
unfair. It is fair to say, I think, that private institutions have accounted for
most of the growth in university and college enrollments in the state, a
disparity that would appear even more dramatic if annual change measures
were depicted as in figure 16, with a zero base.

Figure 16: Bar chart with annual change data

Time Series Line Charts:

The time series chart is one of the most efficient means of displaying large
amounts of data in ways that provide for meaningful analysis. The typical
time series line chart is a scatterplot chart with time represented on the
X-axis and lines connecting the data points.

Rules for Time Series (Line) Charts

Time is almost always displayed on the X-axis from left to
right.

Display as much data with as little ink as possible.

Make sure the reader can clearly distinguish the lines for
separate data series.

Beware of scaling effects.

When displaying fiscal or monetary data over time, it is
often best to use deflated data (e.g., inflation-adjusted or
% of GDP).

Figure 17: Presidential approval: times series trend with annotations

Scaling effects. When two variables with numbers of different magnitudes
are graphed on the same chart, the variable with the larger scale will
generally appear to have a greater degree of variation; the smaller-scale
variable will appear relatively "flat" even though the percentage change is
the same. In figure 18, ABCorp’s stock seems to be growing much faster
than XYZCOM's, yet the rate of increase is identical.

Figure 18: Illustration of scaling distortion

When the differences in scale are so great as to eliminate most of the
perceived variation in the smaller-scale variable, using a second scale,
displayed on the right-hand side as in figure 19, is sometimes preferable,
although this may make the interpretation of the graph more complicated.

Figure 19: Time series chart with second Y-axis

Many who have written about graphical distortion condemn the use of
two-scale charts because the relative sizes of the two scales are completely
arbitrary. This is true; had job approval and unemployment been plotted on
the same 0 to 90 Y-axis scale, the unemployment rate would be an almost
flat line at the bottom of the chart.

One solution to trendlines of different magnitudes is to rescale the
variables, calculating the percentage change from a base year. Note, however,
that the selection of the base year can produce dramatically different
results.
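Rescaling to a base year is straightforward to compute. The following sketch, with invented values for two hypothetical series, expresses each series as a percentage change from a chosen base year and shows how moving the base year changes the picture. The function name rebase is an assumption for this example.

```python
def rebase(series, base_year):
    """Percentage change of each value from the base-year value."""
    base = series[base_year]
    return {year: 100 * (value - base) / base for year, value in series.items()}


# Hypothetical data: one large-scale and one small-scale series.
large = {2000: 400.0, 2001: 300.0, 2002: 350.0}
small = {2000: 4.0, 2001: 3.0, 2002: 3.5}

print(rebase(large, 2000))  # change relative to 2000
print(rebase(small, 2000))  # same percentages despite the smaller scale
print(rebase(large, 2001))  # same data, different base year
```

With 2000 as the base, the 2002 value appears as a decline; with 2001 as the base, the same 2002 value appears as an increase.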

When several time series lines are printed in black and white, it is
sometimes difficult to separate out the different trend lines. Mixing solid,
dotted, and dashed lines for each variable may solve this problem, although
it is sometimes difficult to distinguish between dotted and dashed lines.

Scatterplots

The two-dimensional scatterplot is the most efficient medium for the
graphical display of data. A simple scatterplot will tell you more about the
relationship between two interval-level variables than any other method of
presenting or summarizing such data.

Rules for Scatterplots

· Use two interval-level variables.

· Fully define the variables with the axis titles.

· The chart title should identify the two variables and
the cases (e.g., cities or states).

· If there is an implied causal relationship between the
variables, place the independent variable (the one that
causes the other) on the X-axis and the dependent
variable (the one that may be caused by the other) on the
Y-axis.

· Scale the axes to maximize the use of the plot area
for displaying the data points.

· It’s a good idea to add data labels to identify the
cases.

With good labeling of the variables and cases and common-sense scaling of
the X and Y-axes, there's not a lot that can go wrong with a scatterplot,
although extreme outliers on one or more of the variables can obscure
patterns in the data.

Figure 20: Scatterplot with data labels and trendline

In figure 20, TV viewing is the independent variable. (If you were trying to
predict which types of students watch the most TV, the axes would be
reversed.) The scatterplot contains two optional plotting features: a
regression trendline denoting the linear relationship between the two
variables and the use of State postal ID data labels to indicate each state's
position on the chart (these labels require a special add-in to the Excel
program). Although the chart suffers from overlapping data labels, the
interpretation is straightforward; the higher the percentage of students in a
state watching more than 6 hours of TV each day, the lower the state's math
scores.
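A regression trendline like the one in figure 20 is an ordinary least-squares line. The sketch below fits one in plain Python; the data points are invented for illustration (they are not the state-level data behind figure 20), and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical cases: % of students watching 6+ hours of TV (x), math score (y).
tv_pct = [10, 15, 20, 25, 30]
math_score = [282, 268, 262, 248, 240]

slope, intercept = fit_line(tv_pct, math_score)
print(f"score = {intercept:.1f} + ({slope:.2f} * tv_pct)")
```

A negative slope corresponds to the downward-sloping trendline described above: higher TV viewing goes with lower scores.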

Boxplots

John W. Tukey invented the boxplot as a convenient method of displaying
the distribution of interval-level variables.

Rules for Boxplots:

· A simple boxplot plots the median and four quartiles of
data for an interval level variable.

· Boxplots are best used for comparing the distribution
of the same variable for two or more groups or two or
more time points.

· Boxplots are an excellent means of displaying how a
single case compares to a large number of other cases.

Figure 21: Components of a boxplot

The simple boxplot, as shown in figure 21, displays the four quartiles of the
data, with the "box" comprising the two middle quartiles, separated by the
median. The upper and lower quartiles are represented by the single lines
extending from the box. More detailed versions of the boxplot restrict the
“whiskers” on the plot to 1.5 times the size of the boxes and plot the higher
or lower values (outliers) as individual points. Some versions also plot the
mean in addition to the median.
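The numbers a boxplot displays can be computed with Python's standard statistics module. This sketch uses invented values and assumes the common convention that whiskers extend at most 1.5 times the interquartile range beyond the box, which is one reading of the rule described above; different programs compute quartiles slightly differently.

```python
import statistics

# Hypothetical values for one group (for example, a vote percentage by state).
values = sorted([38, 41, 43, 45, 46, 48, 50, 52, 55, 71])

# Three cut points divide the data into four quartiles.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Whiskers reach the most extreme values within 1.5 box-heights of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("box:", q1, median, q3)
print("whiskers:", inside[0], inside[-1])
print("outliers:", outliers)
```

Any value that falls outside the fences is plotted as an individual point rather than absorbed into the whisker.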

A single boxplot (as in figure 21) rarely reveals much about the data,
and graphs of single variable data distributions (using stem-and-leaf or
histogram charts) rarely offer a more detailed graphic representation of the
data distribution. The real advantages of the boxplot come
through, however, in single charts using several boxplots to compare the
distribution of a variable across groups or over time. An especially useful
elaboration of the boxplot involves plotting an individual case over
the boxplot to compare single cases to the overall distribution (see figure
22).

Figure 22: Comparing boxplots, with labels for individual cases (Nevada)

Thus, figure 22 displays the percentage Democratic vote for the 50 states
over the past seven presidential elections. Labeling a single case, we can
see that the Democratic vote in Nevada has moved steadily higher relative
to the other states. One can easily imagine applying the same plotting
strategy in a variety of other settings, for example, comparing one school
district's test scores to the distribution of test scores across other school
districts.

Notes on data sources:

Higher education. The higher education data in figures 2, 14, 15 and 16
are compiled by the Illinois Board of Higher Education and are readily
available on the Board’s website. Most states have a similar governing board
for higher education, but the governance structure varies from state to
state. Most colleges and universities have an institutional research
department responsible for compiling data and preparing reports on
enrollments, tuition and fees, staffing, expenditures and student academic
performance (which are then forwarded to the governing boards). Often the
data are presented in an annual data profile.

Federal government revenue (figures 6, 7, 8, and 10). The president’s Office
of Management and Budget submits the proposed federal budget (for the budget
year beginning October 1) to Congress in January of each year. The
actual budget documents and spreadsheet files are available on the White
House, Office of Management and Budget, and Government Printing
Office websites.

Poverty. The poverty data shown in figure 9 were obtained from the
Luxembourg Income Study website and are described in the Poverty chapter
that follows.

Presidential approval (figures 17, 19). Many polling agencies
regularly conduct surveys asking the more or less standard presidential
approval question: “How would you rate President Bush's performance on
the job: excellent – good – fair or poor?” The Gallup poll website has the
most complete historical data on presidential approval and would be the
best source for comparing several administrations’ approval data, but
access to their data requires a subscription fee. The “Professor PollKatz
Pool of Polls” website http://www.pollkatz.homestead.com/ contains time
series charts (but not the actual data) on presidential approval surveys
conducted by 15 polling organizations. The Pollingreport.com website is an
excellent source for political polling data, reporting data from several
polling organizations.

Unemployment. The unemployment measure used in figure 19 is the
standard monthly Bureau of Labor Statistics measure available on the
Bureau’s website. It measures the percent of the labor force (those
employed or looking for work) who are unemployed and actively seeking
employment.

Education. The Organization for Economic Cooperation and Development
(OECD) is an excellent source of governmental and social data for the
world’s developed nations. The staffing data in figure 11 were reported in
one of their annual reports on education. The TV viewing and math score
data in figure 20 were obtained from the National Center for Education
Statistics website. As discussed in the Education chapter, this is a prime
source for US educational statistics.

Elections. There is no US government agency responsible for compiling
even federal election data, although the Clerk of the House (of
Representatives) does report official tabulations for presidential and
congressional elections and the Census Bureau does do a post-congressional
election survey on voter turnout. Congressional Quarterly’s biennial
America Votes is the most comprehensive source of US election data.
Figure 22 is based on the Congressional Quarterly data obtained from the
Census Bureau’s Statistical Abstract.

Notes to Excel users:

All of the charts shown in this chapter were prepared using the Microsoft
Excel charting software, but some of the charts required modifications to
the normal charting options offered by Excel.

Figure 2c contains two levels of category labels on the X-axis that are easily
done with Excel but not explained in the documentation. The trick is to
incorporate an additional label series in the data range, as shown below:

Figure 22: Specifying two levels of labels in the data range

To produce scatterplots with case labeling, as in figure 20, use one of
several add-ins or macros freely available on the Internet. The “J-Walk Chart
Tools” add-in allows users to specify a data range to label any chart’s data
points and includes other options for controlling chart and text size.

Nor does Excel offer a boxplot (also called “box and whiskers”) chart type as
shown in figure 21, although Microsoft does have on-line instructions that
explain how to modify existing chart types to derive a boxplot. More
simply, Jon Peltier provides a “box chart maker”, an Excel add-in that
greatly simplifies the process. A number of other add-ins and work-arounds
to extend the Excel charting capabilities are freely available on the
Internet.

Excel provides two different approaches for formatting the X-axis for time
series line graphs. Excel line graphs treat the X-axis as if “X” is not a
numerical variable, much the same way that the bar chart graphs define the
X-axis. The X-axis labels and the data points are positioned between the
tick marks and the sequencing of the data points depends on the order of
the cases in the Excel spreadsheet (see figure 23).

Figure 23: Excel line chart

In most situations, Excel’s “XY scatter chart with lines” has several
advantages over the line chart. As shown in figure 24, the axis is treated as
a numerical variable, the axis labels are placed under the axis tick marks,
and any gaps in the time series are correctly spaced.

Figure 24: Excel XY Scatter chart with lines

The disadvantage of the scatter-with-lines chart is that it offers little control
over the labeling of the axis: the tick marks and labels must start with the
minimum value of the series and are evenly spaced. If the data series starts
in 1959 (e.g., most US poverty data) and you wish to have tick marks every
five years, the axis labels will be the series beginning 1959-1964-1969…
rather than 1960-1965-1970.
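The labeling problem is simple arithmetic. The sketch below (plain Python, not Excel; both function names are made up for this example) contrasts tick marks that start at the series minimum with tick marks placed on round multiples of the interval.

```python
def ticks_from_minimum(first, last, step):
    """Tick positions that start at the first value of the series."""
    return list(range(first, last + 1, step))


def ticks_on_multiples(first, last, step):
    """Tick positions on round multiples of step within the series range."""
    start = -(-first // step) * step  # round first up to a multiple of step
    return list(range(start, last + 1, step))


print(ticks_from_minimum(1959, 1979, 5))
print(ticks_on_multiples(1959, 1979, 5))
```

The second list is the labeling most readers expect for a series that happens to begin in 1959.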

Figure 21: Excel XY scatter: X-axis is a labeled data series

To get around this, a) eliminate the X-axis labels, b) plot an extra data series
with all the values set to the axis’ minimum value, c) use a “+” sign as the
data markers, and d) use the J-Walk Chart tools to label the data points.

Also available on the Internet are a variety of macros and instructions for
modifying the scaling of the Y-axis (Excel does a really poor job when it
comes to log scaled axes).

One would have hoped that Microsoft would incorporate such functions into
the newer versions of the Excel software. Unfortunately, the Microsoft
developers seem to have chosen to go in a different direction with the latest
version of the software (Office 2007), incorporating all sorts of new chart
styles that involve all sorts of advanced 3-D effects, shading, glow, soft edges
and shadows.

References:

Huff, Darrell. 1993. How to Lie with Statistics (WW Norton & Co).

Putnam, Robert D. 2000. Bowling Alone (Simon and Schuster).

Tufte, Edward. 1983. The Visual Display of Quantitative Information (Cheshire,
Connecticut: Graphics Press).

Other useful books on graphing data:

Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to
Enlighten (Analytics Press).

Jones, Gerald E. How to Lie With Charts (iUniverse.com, 2000)

Kosslyn, Stephen M. Elements of Graph Design (NY: W. H. Freeman, 1994).

Miller, Jane E. 2004. “Creating Effective Charts,” The Chicago Guide to Writing
about Numbers, (University of Chicago Press). Chapter 7.

Wallgren, Anders, et al. Graphing Statistics & Data (Sage Publications, 1996).

Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception
from Napoleon Bonaparte to Ross Perot. (Mahwah, NJ: Lawrence Erlbaum
Associates).
