
Making pretty graphs

In this document, I will be analyzing two different data sets. Both data sets were given to Statistics
3011 students in Fall of 2011. The first data set involved two quantitative variables, and the other used
two categorical variables. In both cases, the student was asked to explore the relationship between the
two variables.
Quantitative variables
Background
The first set of data explored the relationship between weight and percent body fat for 18 randomly
selected adult males. The description was as follows:
Hydrostatic body fat testing is the best way to measure percent body fat. Hydrostatic body fat testing
weighs a person on land, and then in water. Since body fat is less dense than water (and therefore
floats) and all other components of the human body (bone, muscle, and connective tissue) are denser
than water, the difference between one's weight on land and one's weight in water is used to
determine the percent body fat.
Researchers wanted to see if there was an easier way to measure percent body fat in men.
Specifically, they wanted to see if body weight alone was a significant predictor of percent body fat,
to the extent that it could be used to calculate percent body fat. They randomly selected 18 men,
weighed them, and then found their percent body fat. The results are listed as "Weight" (in pounds)
and "BodyFat" (in percentage of total weight). Does a man's weight significantly predict the man's
percent body fat? If so, would you recommend using weight instead of percent body fat? Why or why
not?

Graphing one quantitative variable


Graphing each variable in R, without any additional modifications, resulted in the following graphs:

R does a nice job with the graph itself, in my opinion. The graph is simple and uncluttered, and
universally understood. However, R doesn't have the ability to understand the research and ideas
behind the data, so you will need to make some cosmetic modifications.

The first and easiest modifications to make are to label the axes and graph so that they are clearer and
more descriptive. The following arguments, added inside the graphing command, do that:
xlab="name for the horizontal axis"
ylab="name for the vertical axis"
main="title for the graph"
Keeping in mind the rules for good graphs, I edited the label for percent body fat. Note that I labeled it
"Figure 2". I did this because we want to see if weight is predictive of percent body fat, so percent
body fat is our dependent variable.
> hist(bodyfat, xlab="Percent body fat", ylab="number of men",
main="Figure 2: The percent body fat for the 18 men")

You can add colors to the bars, if you'd like, with the command col = "name of color":
> hist(bodyfat, xlab="Percent body fat", ylab="number of men",
main="Figure 2: The percent body fat for the 18 men", col="blue")

Remember to pick a color that provides a nice contrast. You can also create a vector of colors to apply,
like so:
> bfcolor<-c("purple", "blue", "green", "yellow", "orange", "red")
> hist(bodyfat, xlab="Percent body fat", ylab="number of men",
main="Figure 2: The percent body fat for the 18 men", col=bfcolor)
But my personal opinion is this vector will violate the rule of keeping it simple and uncluttered.

One thing that is bothering me about these two graphs is the scale that R chose: We have 8 ranges for
the weight, but only 6 for the percent body fat. I understand what R is doing: it's using ranges that
make sense numerically, since the bars for body fat are for every 5% and the bars for weight are for
every 10 pounds. But 8 bars for 18 data points makes it hard to see an overall shape.
If you want to change the number of bars in your histogram, there are several ways to do so:
The easiest way is to specify the number of bars with the argument breaks = n, where n is
the number of bars that you want for your histogram. This doesn't always work, because R treats n
only as a suggestion and still rounds the breaks to convenient values:
> hist(weight, xlab="Weight (pounds)", ylab="number of men",
main="Figure 1: Weight of the 18 randomly selected men", breaks=6)

The safer way to specify the number of bars is to define a new vector that specifies exactly
where the breaks will be. Since I still want the bars to be evenly spaced, I first sorted the data
to get a sense of the overall range, then used the info to find the best increment:

> sort(weight)
[1] 146 159 159 160 168 173 175 175 181 187 188 188 192 196 200 205
[17] 215 219
> (max(weight) - min(weight))/6
[1] 12.16667
> wranges<-c(145, 158, 171, 184, 197, 210, 223)
> hist(weight, xlab="Weight (pounds)", ylab="number of men",
main="Figure 1: Weight of the 18 randomly selected men",
breaks=wranges)
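The break points above can also be generated instead of typed. Here is a minimal sketch, assuming the 18 weights printed by sort(weight) above. seq() and hist(..., plot = FALSE) are base R; the names wranges_seq, custom_bins, default_bins, and suggest_bins are mine. It also lets you check for yourself whether breaks = 6 changes anything compared with R's default.

```r
# The 18 weights, copied from the sort(weight) output above.
weight <- c(146, 159, 159, 160, 168, 173, 175, 175, 181,
            187, 188, 188, 192, 196, 200, 205, 215, 219)

# Build evenly spaced break points instead of typing them by hand.
wranges_seq <- seq(from = 145, to = 223, by = 13)

# plot = FALSE returns the histogram object without drawing it,
# which is handy for checking the bins before making the figure.
custom_bins  <- hist(weight, breaks = wranges_seq, plot = FALSE)
default_bins <- hist(weight, plot = FALSE)
suggest_bins <- hist(weight, breaks = 6, plot = FALSE)

custom_bins$counts            # number of men in each custom bar
length(default_bins$counts)   # number of bars R picks on its own
length(suggest_bins$counts)   # number of bars when we only suggest 6
```

If the last two numbers match, the suggestion made no difference, and the explicit vector of breaks is the way to go.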

If you'd rather do boxplots, go for it. I personally like them better because they highlight the outliers.
But I tend to stick with histograms because they don't require a lot of a priori knowledge on the part of
the reader. With a boxplot, your reader needs to know (1) what a five number summary is and (2) how
it is translated into the graph. You're welcome to explain it to them in your document, but that distracts
from the message at hand.
Similarly, a Q-Q plot is super handy, especially for regression, but it requires the reader to know what a
distribution is, and what your distribution of interest (most likely the normal distribution) looks like.
If you want to combine the two histograms into one figure, you can do so in R with the
par(mfrow=c(#ofRows, #ofColumns)) command. For example, if we wanted to have one figure
with the graph of the weight on top and the graph of the percent body fat on bottom:
> par(mfrow=c(2,1))
> hist(weight, xlab="Weight (pounds)", ylab="number of men",
main="Part A: Weight of the 18 randomly selected men",
breaks=wranges, col="blue")
> hist(bodyfat, xlab="Percent body fat", ylab="number of men",
main="Part B: The percent body fat for the 18 men", col="blue")
> par(mfrow=c(1,1))
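A slightly safer variant of this pattern saves the old layout and restores it afterwards, instead of assuming it was c(1,1). This is only a sketch: it uses the 18 weights from the sort(weight) output, and because the body fat values are not listed in this handout, the bottom panel shows the weights again with R's default breaks rather than the body fat histogram. The name old_par is mine.

```r
# The 18 weights, copied from the sort(weight) output earlier.
weight <- c(146, 159, 159, 160, 168, 173, 175, 175, 181,
            187, 188, 188, 192, 196, 200, 205, 215, 219)
wranges <- c(145, 158, 171, 184, 197, 210, 223)

# par() hands back the previous settings, so keep them for later.
old_par <- par(mfrow = c(2, 1))   # 2 rows, 1 column

hist(weight, xlab = "Weight (pounds)", ylab = "number of men",
     main = "Part A: custom breaks", breaks = wranges, col = "blue")
hist(weight, xlab = "Weight (pounds)", ylab = "number of men",
     main = "Part B: R's default breaks", col = "blue")

par(old_par)                      # put the layout back the way it was
```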

Variables measured on the men in our study

Graphing two variables


The first choice is to make a scatterplot to explore the relationship between the two variables. Using
the command > plot(weight, bodyfat) I got the following graph:

Again, I'm pretty happy with the overall look of the graph, but the labels really do need to be updated:
> plot(weight, bodyfat,
main="Figure 3: Weight Versus Percent Body Fat for 18 Adult Males",
xlab="weight(pounds)", ylab = "percent body fat")

Don't like circles? You can use the command pch = 2 to get triangles.
Don't like black? You can use the col command we used above to choose another color.
Want a filled in circle? Use the command pch = 19. pch = 20 gives you a smaller filled circle.
Want to have the circles filled with a different color? Use pch = 21 to specify a filled circle, col
= "colorname" to change the outline of the circle, and bg="colorname" to fill the circle.
There are other shapes that you can add, and different shapes you can fill (see Note 1).
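If you'd rather see the shapes than look them up, you can have R draw its own cheat sheet. A minimal sketch, assuming nothing beyond base R graphics; the 5 by 5 grid layout and the names codes, x, and y are just my choices for the illustration.

```r
# Draw one point for each of the 25 standard symbol codes.
codes <- 1:25
x <- (codes - 1) %% 5 + 1          # column 1 to 5
y <- 5 - (codes - 1) %/% 5         # row 5 (top) down to row 1

plot(x, y, pch = codes, cex = 2,
     col = "black", bg = "blue",   # bg only shows for codes 21 to 25
     xlim = c(0.5, 5.5), ylim = c(0.5, 5.8),
     axes = FALSE, xlab = "", ylab = "",
     main = "pch codes 1 to 25")
text(x, y + 0.35, labels = codes)  # print each code above its symbol
```

The bottom row (21 to 25) should be the only one that picks up the blue fill.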

Personally, I like the filled circles (pch = 19), especially for small data sets. When the sets get really
large, I like using pch = 3 (plus signs) or pch = 4 (x marks) so that I can see if there's more than
one subject at a given point.
Just to be super fancy, I did pch = 21.
> plot(weight, bodyfat,
main="Figure 3: Weight Versus Percent Body Fat for 18 Adult Males",
xlab="weight(pounds)", ylab = "percent body fat", pch=21, bg = "blue")

Of course, no graph of the relationship between two quantitative variables would be complete without a
regression line. The basic line R gives you is adequate:
> wvbfreg<-lm(bodyfat~weight)
> abline(wvbfreg)

But you can certainly modify it. You can adjust the color with the col= argument and the line type
with the argument lty = some number in the abline command. Remember that the color must
be in quotation marks:
> abline(wvbfreg, col = "red", lty = 4)

You can also make the line thicker, which I think will help with clarity a lot more than a dotted line
will. The code is lwd = some number, where the number corresponds to the thickness of the line. This
is how I would put a line on the graph for clarity:
> abline(wvbfreg, col = "red", lwd = 3)

I think it's critical to put a legend on to identify the line. We know it's the regression line, but not
everyone else does, so it's a good idea to explain the line. The legend is a separate command, and
before you add it, you'll need to decide on the following information:

The value of the X variable where you want the legend to start. R will put the left edge of the
legend box at this value. I'll call this value "xbox" in the command below.

The value of the Y variable where you want the legend to start. R will put the top edge of the
legend box at this value. I'll call this value "ybox" in the command below.

In addition, any modifications to the basic black R line will need to be reiterated. The format for the
command is as follows:
> legend(xbox, ybox, "WhateverYouWantToCallTheLine", any
modifications to lty or lwd or col)
For my graph, I'm going to add the following legend starting at weight = 145 lbs and body fat = 32%:
> legend(145,32,"regression line", lwd = 3, col = "red")

You might want to actually give the regression formula, especially if you're presenting to other
statisticians. If you're presenting the information to people who don't even know what a regression line
is (and there are a lot of people out there who don't), it's probably better to provide a brief description in
your writeup or as a legend below the figure. Keep the legend as simple as possible.
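If you do want the formula on the graph, you can build the text from the fitted model instead of typing the numbers. A minimal sketch under stated assumptions: the body fat measurements are not listed in this handout, so it uses R's built-in women data set (height and weight) as a stand-in; swap in bodyfat ~ weight for the real thing. The names fit, b, and eq_label are mine. It also uses the keyword "topleft" in place of xbox and ybox, which legend() accepts.

```r
# Stand-in data: 'women' ships with R (heights in inches, weights in pounds).
fit <- lm(weight ~ height, data = women)

# coef() returns the intercept first, then the slope.
b <- coef(fit)
eq_label <- sprintf("weight = %.2f + %.2f * height", b[1], b[2])

plot(women$height, women$weight, pch = 21, bg = "blue",
     xlab = "height (inches)", ylab = "weight (pounds)",
     main = "Regression line labeled with its formula")
abline(fit, col = "red", lwd = 3)

# A keyword such as "topleft" positions the box without picking coordinates.
legend("topleft", legend = eq_label, lwd = 3, col = "red")
```

Rounding to two decimals is a judgment call; for a general audience I'd still lean toward the plain "regression line" label.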

You can also use the abline command to put non-regression lines on the graph. Let's say that
someone has decided "healthy" is a percent body fat at 20% or less, so you want to put a line on the
graph at 20%. Since this is a horizontal line, we enter it into the command as h=20 in the same
location where we put the regression model before. If we wanted a line at some value of the X variable,
we'd want a vertical line, so we'd use v=. You also need to specify any changes you want to make to
the line, such as type (lty), width (lwd) or color (col). The general formula is:
> abline( specifications for location of the line, any modifications
to the shape or color of the line)
I'm adding the line as a dashed line (lty = 2), since it's just a cutoff. I've decided to not make it any thicker, or
change the color. I will also update the legend so that people know what the line is for. When doing so,
since I've got two lines, I have to specify the parameters for each, in order. I am also re-graphing the
graph because otherwise we end up with legends on top of each other.
> plot(weight, bodyfat,
main="Figure 3: Weight Versus Percent Body Fat for 18 Adult Males",
xlab="weight(pounds)", ylab = "percent body fat", pch=21, bg = "blue")
> abline(wvbfreg, col = "red", lwd = 3)
> abline(h=20, lty = 2)
> legend(145,32,c("regression line", "cutoff for healthy"), lty =
c(1,2), lwd = c(3,1), col = c("red","black"))

Please note that each condition was listed in order.



Categorical variables
Background
This data set compared the different seat locations on a bus in terms of nausea. The description is as
follows:
Researchers wanted to see if the location of one's seat on a bus had any effect on his or her likelihood
of being nauseous at some point during the ride. They randomly sampled 3256 bus riders, noting their
chosen seat: the front of the bus (the seats against the walls of the bus near the driver, or the first
row), the middle of the bus (the seats behind the first row, but in front of the back exit doors), and the
back/rear of the bus (seats at or behind the back exit doors). When the randomly chosen person
stood up to depart, he or she was asked if, at any point during the ride, the rider had felt nauseous.
The answers were recorded as "yes" or "no". Is there evidence that the location of the seat on the
bus has a significant effect on the passenger's nausea? If so, where is the passenger most likely to
become nauseous? Least likely to become nauseous?

Tabulating one variable


Categorical data is a lot harder to work with in R. To do anything with it, you first need to tabulate it.
People often want to jump in and table the two variables together. I think it's worthwhile in your
preliminary data analysis to table each variable separately as well to get a sense of the distribution of
the data. I tend to lay out the table first by hand, then enter it into R. In addition, whenever I
present a table in a document, I don't even mess around with R's tables-- I make my own.
In order to make a pretty table, I use borders and shading to make it as easy to read as possible. For me,
that means using borders only to demarcate variables from counts, and using different borders and/or
shading for the title so that people can easily distinguish it from the variables or categories.
Also, just as with figures, each table should have a reference number/letter. I like to use letters for
tables if I've used numbers for figures. That way, each graphic has its own reference symbol.
Finally, I think it's really good to center the titles. Whether or not you choose to do that, however, any
numbers in a column should be lined up by their ones place. I do this by splitting the cell vertically,
typing the values into the first column of the split, choosing right alignment for the column, and then
adjusting the border until it looks centered.
Table A: Bus Riders By Seat Location
Front of Bus    Middle of Bus    Back of Bus
         928             1329            999

Table A: Bus Riders By Seat Location
Front of Bus       928
Middle of Bus     1329
Back of Bus        999

I use the tables as a guide when I load the data into R. Although I like the look of the second table I
made, and would probably use that one in a paper, I'm going to use the first table as my template for my
table matrix in R.
> seattable<-matrix(c(928, 1329,999),nrow=1, byrow=T)
> seattable
     [,1] [,2] [,3]
[1,]  928 1329  999
> colnames(seattable)<-c("front", "middle", "rear")
> seattable
     front middle rear
[1,]   928   1329  999
Graphing the variable
Once the matrix is in R, you can graph it. This is what a bar graph will look like if you just type
> barplot(seattable)

The contrast between the bars and the background is nice, and R does start the frequency axis at 0,
which is ideal. But the labeling stinks. Easy enough to fix with the arguments main, xlab, and
ylab. Recall from our discussion of quantitative variables that:
main = "WhateverYouWantToNametheGraph" puts a title on the graph
xlab = "NameOfYourCategoricalVariable" labels the x axis
ylab = "FrequencyOfSubjects" labels the y axis


You can also change the color of the bars with the col command.
For the bus data, I think I want these bars to be blue.
> barplot(seattable,
main="Figure 4: Bus passengers by seat location",
xlab="Location of seat on bus",
ylab="Number of passengers by seat location", col="blue")

If you want to present relative frequencies, you can, but the graph will look the same:
> propseattable<-prop.table(seattable)
> propseattable
         front    middle      rear
[1,] 0.2850123 0.4081695 0.3068182
> barplot(propseattable,
main="Figure 5: Bus passengers by seat location",
xlab="Location of seat on bus",
ylab="Proportion of passengers by seat location", col="red")
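Since the relative-frequency graph has the same shape as the count graph, one way to make it earn its place is to print the percentages on the bars. A minimal sketch, assuming the seat counts from Table A; it relies on the fact that barplot() returns the x position of each bar. The names mids and pct, and the ylim head room, are my own choices.

```r
seattable <- matrix(c(928, 1329, 999), nrow = 1,
                    dimnames = list(NULL, c("front", "middle", "rear")))
propseattable <- prop.table(seattable)

# barplot() returns the x position of the middle of each bar.
# ylim leaves head room so the labels are not clipped.
mids <- barplot(propseattable, col = "red", ylim = c(0, 0.5),
                main = "Bus passengers by seat location",
                xlab = "Location of seat on bus",
                ylab = "Proportion of passengers")

# Write the rounded percentage just above each bar (pos = 3 means "above").
pct <- paste0(round(100 * propseattable), "%")
text(x = mids, y = as.vector(propseattable), labels = pct, pos = 3)
```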


You can also do pie graphs. I'm not a big fan, mainly because of the extra work involved. The pie
chart R gives you is not well labeled and hard to read. Also, I personally find it much easier to see
differences between groups with a bar graph than with a pie chart. Nonetheless, people seem to like
them, so here's how to make them better.
> pie(seattable)

R always has a lot of white space around pie charts. I don't know how to get rid of that. But I can label
the pieces of the pie and make the different pieces more distinct.
When labeling the pieces of the pie, keep in mind that wedge (1) is the first number listed, wedge (2)
is the second, and so on. So, since the first column in our table is "front", that is represented in
the pie by wedge 1. We need to add the category names in the order that they are for the pie. We do
this with the argument labels=VectorOfNames:
> pie(seattable, labels=c("front", "middle", "back"))


Give the graph a title with the main command so it's clear what "front", "middle", and "back" mean.
I'd also recommend making the colors a bit more vivid with the col command. Keep in mind that col
will need to have a vector after it, and will be applied in the same order that the labels were:
> pie(seattable, labels=c("front", "middle", "back"),
main="Figure 6: Seat location for bus passengers in the study",
col=c("blue", "red", "yellow"))

Finally, it's really important to not just label the pie pieces, but also list the number or proportion
that are in each piece. I think the viewer won't be able to see, from this graph, that the front and back of
the bus did NOT have equal amounts.
The issue is how to do it without violating the rule of simplicity. You can add it to the labels, but I
think it's a bit fussy. If you do, it's more helpful, in my mind, to add the relative frequencies rather than
the raw counts, and round as much as possible:
> pie(seattable, labels=c("front (29%)", "middle (41%)", "back (31%)"),
main="Figure 6: Seat location for bus passengers in the study",
col=c("blue", "red", "yellow"))

When we rounded, we ended up with 101%, which bugs non-mathy people.
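Typing the percentages by hand invites typos, so here is a sketch that builds the same labels from the counts. It assumes the three seat counts from Table A; the names seat_counts, pct, and pie_labels are mine, and paste0() and round() are base R.

```r
seat_counts <- c(front = 928, middle = 1329, back = 999)

# Rounded percentages, then labels of the form "front (29%)".
pct <- round(100 * seat_counts / sum(seat_counts))
pie_labels <- paste0(names(seat_counts), " (", pct, "%)")

pie(seat_counts, labels = pie_labels,
    main = "Figure 6: Seat location for bus passengers in the study",
    col = c("blue", "red", "yellow"))

sum(pct)   # worth checking: rounded percentages need not total 100
```

If the total bothers your audience, showing one decimal place (round(..., 1)) is one option, at the cost of slightly busier labels.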



Alternatively, you can use the legend command to put a legend on the graph, like we did for the regression line.
Be sure to put the labels in the correct order, as well as the colors. Use the fill argument to make
boxes for the colors and add the colors, instead of lty, lwd, and col, like we did on the scatterplot.
I'm not sure how the location of the legend works, since we don't have any axes, but I think using 0,0
would put the box in the middle of the graph itself, negative numbers move the legend to the left and
down, and positive numbers move it to the right and up. I'm going to graph the other variable, which
measured whether or not the passenger was nauseous after the bus ride.
> pie(tablenausea, labels=c("no nausea", "nausea"),
main="Figure 7: Nausea for passengers in the study", col=c("blue", "red"))
> legend(1.5, 0.5, c("87.2%", "12.8%"), fill=c("blue", "red"))

I personally prefer seeing the relative frequencies on the graph, and having the categories in the legend.
The cex command adjusts the size of the legend. Here, by typing cex = 0.75, I'm saying that I
want the legend to be 75% of the normal size. I've also moved the legend to the upper left corner:
> pie(tablenausea, labels=c("87%", "13%"),
main="Figure 7: Nausea for passengers in the study", col=c("blue", "red"))
> legend(-2.5, 1, c("not nauseous", "nauseous"), cex=0.75,
fill=c("blue", "red"))
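One way to sidestep the coordinate guessing is to give legend() a keyword such as "topleft" instead of numbers. A minimal sketch, with assumptions stated: tablenausea is never built in this handout, so I reconstruct the totals from the seat-by-nausea counts in Table B and call the result nausea_counts (my name). As far as I can tell, pie() draws in a region running roughly from -1 to 1 in each direction with the pie centered at 0,0, which would explain why the numeric positions behave the way they do.

```r
# Column totals of the seat-by-nausea counts in Table B.
nausea_counts <- c(no = 870 + 1163 + 806, yes = 58 + 166 + 193)
pct <- round(100 * nausea_counts / sum(nausea_counts))

pie(nausea_counts, labels = paste0(pct, "%"),
    main = "Figure 7: Nausea for passengers in the study",
    col = c("blue", "red"))

# "topleft" asks legend() to work out the coordinates itself.
legend("topleft", legend = c("not nauseous", "nauseous"),
       cex = 0.75, fill = c("blue", "red"))
```

Other keywords include "topright", "bottomleft", "bottomright", and "center".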


Tabulating two categorical variables


Again, R needs a table to do anything, from graphs to hypothesis tests. But R's table is notoriously
hard to read. So any table I'll make for publication I'll make in a word processing program.
Since I want to use the bus seat location to predict nausea, seat location is the explanatory variable and
nausea is the response variable. Since we read left to right in this country, I have the explanatory
variable categories form the rows of the table and the response variable's categories form the columns.
Again, I'll use borders sparingly and different widths of borders to delineate groups. Also, I personally
like all information to be in the middle of the cell, both horizontally and vertically. But I also want the
ones digits to line up, so I've actually made two columns under the "yes" heading and under the "no"
heading. I then right aligned the text in the leftmost column, and adjusted the division between the two
columns so that it was to the right of the "yes" title. That way, I can type the data with right alignment,
which means that the numbers line up as I want them to, but I can still enter the numbers.

Table B: Seat location versus nausea
                   Nauseous when exiting bus?
Seat location          No        Yes
front of bus          870         58
middle of bus        1163        166
back of bus           806        193

This is OK, but the point of any figure is to illustrate the concept, and I can't tell if there's a difference
in nausea levels. To see if there are differences in the nausea for the different seat locations, it'd be nice
to know the conditional proportions. It really is easiest to do this in R.
I'm going to use data.frame to construct the table in R this time. Please note: you can do
hypothesis testing on a dataframe, but you have to have it be a matrix for graphing purposes.
> no<-c(870, 1163, 806)
> yes<-c(58, 166, 193)
> busbarftable<-data.frame(no, yes)
> busbarftable
    no yes
1  870  58
2 1163 166
3  806 193
> rownames(busbarftable)<-c("front", "middle", "back")
> busbarftable
         no yes
front   870  58
middle 1163 166
back    806 193
> busbarftable<-as.matrix(busbarftable)


As I stated above, I want the conditional proportions; that is, for each seat location, I want to know
the proportion that are/aren't nauseous. So I can use the following command to get that information,
where the "1" indicates you find the proportions within each row. If you don't include the "1", you'll get
each cell as a proportion of the whole table (the joint proportions) instead:
> busbarfprop<-prop.table(busbarftable, 1)
> busbarfprop
              no       yes
front  0.9375000 0.0625000
middle 0.8750941 0.1249059
back   0.8068068 0.1931932
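To see what the second argument of prop.table is doing, it helps to compare the three versions side by side and check what each one adds up to. A minimal sketch using the counts from Table B; the names joint, by_row, and by_col are mine.

```r
busbarftable <- matrix(c(870, 1163, 806, 58, 166, 193), ncol = 2,
                       dimnames = list(c("front", "middle", "back"),
                                       c("no", "yes")))

joint  <- prop.table(busbarftable)      # each cell / grand total
by_row <- prop.table(busbarftable, 1)   # each cell / its row total
by_col <- prop.table(busbarftable, 2)   # each cell / its column total

sum(joint)        # the whole table adds to 1
rowSums(by_row)   # each seat location adds to 1
colSums(by_col)   # each nausea answer adds to 1
```

Only by_row answers the question "of the people in this seat location, what proportion felt nauseous?", which is the one we care about here.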
I'm going to make a new table with the conditional proportions. In order for you to see the formatting
of the table, I've left all the borders of the cells in place.

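Table C: Percent with or without nausea in each location of the bus
+---------------+----------------------+
|               |      Nauseous?       |
| Seat location +-----------+----------+
|               |    No     |   Yes    |
+---------------+-----------+----------+
| Front         |   93.75%  |   6.25%  |
| Middle        |   87.51%  |  12.49%  |
| Back          |   80.68%  |  19.32%  |
+---------------+-----------+----------+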

And with the borders out:

Table C: Percent with or without nausea in each location of the bus
                      Nauseous?
Seat location        No        Yes
Front              93.75%     6.25%
Middle             87.51%    12.49%
Back               80.68%    19.32%

This is extreme border removal. You don't have to go this far.


To graph a table of two categorical variables, you've pretty much got to use a bar graph. Here's what it
looks like for the basic counts:
> barplot(busbarftable)


I have three issues with the graph. The first should be obvious-- it's not labelled. We don't know what
"no" and "yes" stand for, and we don't know what the different shades of grey mean.
> barplot(busbarftable, xlab="Nauseous when exiting the bus?",
ylab = "number of passengers",
main = "Nausea in different seat locations on a bus",
col=c("blue", "red", "yellow"))
> legend(1.5, 2500, c("front", "middle", "back"), cex=0.75, fill=c("blue",
"red", "yellow"))

But the second and third issues pertain to the way the data is presented.
If the goal is to see if there are any differences in nausea between the three groups, how are we to compare the
medium grey and light grey categories? They're not even across from each other! To have the bars
next to each other instead of on top of each other, use the command beside = T. When the bars
are beside each other, the y axis will get smaller, so I needed to change the location of the legend. Why
I had to type "5.5" instead of "1.5" for the x axis location, I have no clue. I basically just put in
numbers until the legend showed up where I wanted it.
> barplot(busbarftable, xlab="Nauseous when exiting the bus?",
ylab = "number of passengers",
main = "Nausea in different seat locations on a bus",
col=c("blue", "red", "yellow"), beside=T)
> legend(5.5, 1000, c("front", "middle", "back"), cex=0.75, fill=c("blue",
"red", "yellow"))


This graph is OK, but just as the table of conditional proportions made the difference between the
groups more obvious, I think a graph of the conditional proportions table will make the differences
in nausea for the different seat locations more apparent. It's not critical here, since the number in each
location is similar, but it does help.
> barplot(busbarfprop, xlab="Nauseous when exiting the bus?",
ylab = "proportion in each location",
main = "Nausea in different seat locations on a bus",
col=c("blue", "red", "yellow"), beside=T)
> legend(5.5, 0.8, c("front", "middle", "back"), cex=0.75,
fill=c("blue", "red", "yellow"))
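The 5.5 in the legend() calls for the side-by-side graphs seems to come from the x axis itself changing: barplot() returns the x position of every bar, and those positions are different for stacked and side-by-side bars. A minimal sketch, assuming the counts from Table B; the names cols, stacked_mids, and beside_mids are mine. The last call uses barplot's own legend.text and args.legend arguments so that no coordinates are needed at all.

```r
busbarftable <- matrix(c(870, 1163, 806, 58, 166, 193), ncol = 2,
                       dimnames = list(c("front", "middle", "back"),
                                       c("no", "yes")))
cols <- c("blue", "red", "yellow")

# barplot() returns the x positions of the bar centers.
stacked_mids <- barplot(busbarftable, col = cols)
beside_mids  <- barplot(busbarftable, col = cols, beside = TRUE)

stacked_mids   # one position per stacked bar
beside_mids    # one position per individual bar (a 3 x 2 matrix)

# Letting barplot() draw the legend avoids guessing coordinates.
# legend.text = TRUE uses the row names of the matrix as labels.
barplot(busbarftable, col = cols, beside = TRUE,
        xlab = "Nauseous when exiting the bus?",
        ylab = "number of passengers",
        main = "Nausea in different seat locations on a bus",
        legend.text = TRUE,
        args.legend = list(x = "topright", cex = 0.75))
```

With the bars side by side, the second group starts further to the right than it did when stacked, so an x value that sat over the "yes" bars before now lands in the middle of the "no" group.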

Conclusion
I realize that your data may not be exactly the same as the data presented here, especially if I gave you
an ANOVA to do. But hopefully, you'll at least have some insight into how to make graphs and tables that
are easy to read.
Please feel free to come see me for advice.


Some resources
I've looked at a lot of websites and books for help, and to be honest, they only really help when you
know what you're doing. Here are a couple of resources that I find helpful even if you're not ensconced
in the world of R programming:
http://www.harding.edu/fmccown/r/
This website is nice because it not only gives you the commands, but it also gives you the
graphs that result, which I find missing from a lot of R help documents or question boards.
Dalgaard, Peter. Introductory Statistics with R. Springer, 2002.
This was the single most important and useful book I bought in grad school. If you are a
registered student, and log into www.lib.umn.edu, you can access this book (and all other
Springer books) for free online. Just type "Peter Dalgaard" into the MNCAT search window,
and you should see this book in your search results. Choose the online version.
Note
1. The shapes for a given number are listed here:
http://www.win-vector.com/blog/2012/04/how-to-remember-point-shape-codes-in-r/
And the shapes you can fill:
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/points.html
