Intro Stats
Richard D. De Veaux, Paul F. Velleman, David E. Bock
Fourth Edition
ISBN 978-1-29202-250-5
Stats Starts Here
Statistics in a Word
It can be fun, and sometimes useful, to summarize a discipline in only a few words. So,
Economics is about . . . Money (and why it is good).
Psychology: Why we think what we think (we think).
Paleontology: Previous Life.
Biology: Life.
Religion: After Life.
Anthropology: Who?
History: What, where, and when?
Philosophy: Why?
Engineering: How?
Accounting: How much?
In such a caricature, Statistics is about . . . Variation.
Statistics is about variation. Data vary because we don’t see everything and because even what we do see and measure, we measure imperfectly. So, in a very basic way, Statistics is about the real, imperfect world in which we live.
2 Data
Amazon.com opened for business in July 1995, billing itself as “Earth’s Biggest Bookstore.”
By 1997, Amazon had a catalog of more than 2.5 million book titles and had sold books to
more than 1.5 million customers in 150 countries. In 2010, the company’s sales reached
$34.2 billion (a nearly 40% increase from the previous year). Amazon has sold a wide vari-
ety of merchandise, including a $400,000 necklace, yak cheese from Tibet, and the largest
book in the world. How did Amazon become so successful and how can it keep track of so
many customers and such a wide variety of products? The answer to both questions is data.
But what are data? Think about it for a minute. What exactly do we mean by “data”?
Do data have to be numbers? The amount of your last purchase in dollars is numerical
data. But your name and address in Amazon’s database are also data even though they
are not numerical. What about your ZIP code? That’s a number, but would Amazon care
about, say, the average ZIP code of its customers?
Let’s look at some hypothetical values that Amazon might collect:
Try to guess what they represent. Why is that hard? Because there is no context. If we
don’t know what values are measured and what is measured about them, the values are
meaningless. We can make the meaning clear if we organize the values into a data table
such as this one:
Now we can see that these are purchase records for album download orders from Amazon.
The column titles tell what has been recorded. Each row is about a particular purchase.
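The row-and-column idea can be sketched in a few lines of Python. This is only an illustration: the variable names come from the Amazon example in the text, but every value, and the helper name `column`, is hypothetical and invented here.

```python
# Hypothetical purchase records, invented for illustration only.
# Each dict is one row (a case: one purchase); each key is a column (a variable).
records = [
    {"Order Number": "A-1001", "Name": "Customer A", "State/Country": "Ohio", "Price": 9.99},
    {"Order Number": "A-1002", "Name": "Customer B", "State/Country": "Canada", "Price": 12.49},
    {"Order Number": "A-1003", "Name": "Customer C", "State/Country": "Texas", "Price": 9.99},
]

def column(table, variable):
    """Return every case's value for one variable (one column of the table)."""
    return [row[variable] for row in table]

variables = list(records[0])        # the column titles: what was recorded
prices = column(records, "Price")   # one variable, read across all cases

print(variables)
print(len(records), "cases")
print(prices)
```

Without the keys, a bare value such as 9.99 could be a price, a weight, or a rating; the column title is what supplies the context.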
What information would provide a context? Newspaper journalists know that the lead
paragraph of a good story should establish the “Five W’s”: who, what, when, where, and
(if possible) why. Often, we add how to the list as well. The answers to the first two ques-
tions are essential. If we don’t know what values are measured and who those values are
measured on, the values are meaningless.
3 Variables
The characteristics recorded about each individual are called variables. They are usu-
ally found as the columns of a data table with a name in the header that identifies what
has been recorded. In the Amazon data table we find the variables Order Number, Name,
State/Country, Price, and so on.
Categorical Variables
Some variables just tell us what group or category each individual belongs to. Are you
male or female? Pierced or not? We call variables like these categorical, or qualitative,
variables. (You may also see them called nominal variables because they name catego-
ries.) Some variables are clearly categorical, like the variable State/Country. Its values
are text and those values tell us what category the particular case falls into. But numerals
are often used to label categories, so categorical variable values can also be numerals. For
example, Amazon collects telephone area codes that categorize each phone number into a
geographical region. So area code is considered a categorical variable even though it has
numeric values. (But see the story in the following box.)
Area codes: numbers or categories?
The What and Why of area codes are
not as simple as they may first seem. When area codes were first introduced, AT&T was still
the source of all telephone equipment, and phones had dials.
To reduce wear and tear on the dials, the area codes with the lowest digits (for which the
dial would have to spin least) were assigned to the most populous regions—those with the most
phone numbers and thus the area codes most likely to be dialed. New York City was assigned
212, Chicago 312, and Los Angeles 213, but rural upstate New York was given 607, Joliet was
815, and San Diego 619. For that reason, at one time the numerical value of an area code could
be used to guess something about the population of its region. Since the advent of push-button
phones, area codes have finally become just categories.
“Far too many scientists have only a shaky grasp of the statistical techniques they are using. They employ them as an amateur chef employs a cookbook, believing the recipes will work without understanding why. A more cordon bleu attitude . . . might lead to fewer statistical soufflés failing to rise.”
The Economist, June 3, 2004, “Sloppy stats shame science”
Descriptive responses to questions are often categories. For example, the responses to the questions “Who is your cell phone provider?” or “What is your marital status?” yield categorical values. When Amazon considers a special offer of free shipping to customers, it might first analyze how purchases have been shipped in the recent past. Amazon might start by counting the number of purchases shipped in each category: ground transportation, second-day air, and overnight air. Counting is a natural way to summarize a categorical variable like Shipping Method.
Quantitative Variables
When a variable contains measured numerical values with measurement units, we call it a quantitative variable. Quantitative variables typically record an amount or degree of something. For a quantitative variable, the measurement units provide a meaning for the numbers. Even more important, units such as yen, cubits, carats, angstroms, nanoseconds, miles per hour, or degrees Celsius tell us the scale of measurement, so we know how far apart two values are. Without units, the values of a measured variable have no meaning. It does little good to be promised a raise of 5000 a year if you don’t know whether it will be paid in Euros, dollars, pennies, yen, or Estonian krooni.
Sometimes a variable with numeric values can be treated as either categorical or
quantitative depending on what we want to know from it. Amazon could record your Age
in years. That seems quantitative, and it would be if the company wanted to know the
average age of those customers who visit their site after 3 am. But suppose Amazon wants
to decide which album to feature on its site when you visit. Then thinking of your age in
one of the categories Child, Teen, Adult, or Senior might be more useful. So, sometimes
whether a variable is treated as categorical or quantitative is more about the question we
want to ask rather than an intrinsic property of the variable itself.
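A short Python sketch can show the same Age values serving both purposes. The ages are hypothetical, and the cut points that separate Child, Teen, Adult, and Senior are assumptions chosen for illustration; the text does not define them.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer ages in years (invented for illustration).
ages = [8, 15, 16, 23, 34, 41, 67, 70]

def age_group(age):
    """Map an age in years to a category; the cut points are assumptions."""
    if age < 13:
        return "Child"
    if age < 20:
        return "Teen"
    if age < 65:
        return "Adult"
    return "Senior"

# Quantitative use: the units (years) make an average meaningful.
average_age = mean(ages)

# Categorical use: count how many customers fall in each group.
group_counts = Counter(age_group(a) for a in ages)

print(average_age)
print(dict(group_counts))
```

The data are identical in both computations; only the question changes, which is why the treatment belongs to the analysis rather than to the variable.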
Identifiers
For a categorical variable like Sex, each individual is assigned one of two possible values, say M or F. But for a variable with ID numbers, such as a student ID, each individual receives a unique value. We call a variable like this, which has exactly as many values as cases, an identifier variable. Identifiers are useful, but not typically for analysis.
Amazon wants to know who you are when you sign in again and doesn’t want to confuse you with some other customer. So it assigns you a unique identifier. Amazon also wants to send you the right product, so it assigns a unique Amazon Standard Identification Number (ASIN) to each item it carries. You’ll want to recognize when a variable is playing the role of an identifier so you aren’t tempted to analyze it.
Identifier variables themselves don’t tell us anything useful about their categories because we know there is exactly one individual in each. However, they are crucial in this era of large data sets because by uniquely identifying the cases, they make it possible to combine data from different sources, protect (or violate) privacy, and provide unique labels. Many large databases are relational databases. In a relational database, different data tables link to one another by matching identifiers. In the Amazon example, the Customer Number, ASIN, and Transaction Number are all identifiers. The IP (Internet protocol) address of your computer is another identifier, needed so that the electronic messages sent to you can find you.
Privacy and the Internet
You have many identifiers: a social security number, a student ID number, possibly a passport number, a health insurance number, and probably a Facebook account name. Privacy experts are worried that Internet thieves may match your identity in these different areas of your life, allowing, for example, your health, education, and financial records to be merged. Even online companies such as Facebook and Google are able to link your online behavior to some of these identifiers, which carries with it both advantages and dangers. The National Strategy for Trusted Identities in Cyberspace (www.wired.com/images_blogs/threatlevel/2011/04/NSTICstrategy_041511.pdf) proposes ways that we may address this challenge in the near future.
Ordinal Variables
A typical course evaluation survey asks, “How valuable do you think this course will be to you?” 1 = Worthless; 2 = Slightly; 3 = Middling; 4 = Reasonably; 5 = Invaluable. Is Educational Value categorical or quantitative? Often the best way to tell is to look to the why of the study. A teacher might just count the number of students who gave each
response for her course, treating Educational Value as a categorical variable. When she
wants to see whether the course is improving, she might treat the responses as the amount
of perceived value—in effect, treating the variable as quantitative.
But what are the units? There is certainly an order of perceived worth: Higher num-
bers indicate higher perceived worth. A course that averages 4.5 seems more valuable
than one that averages 2, but we should be careful about treating Educational Value as
purely quantitative. To treat it as quantitative, she’ll have to imagine that it has “educa-
tional value units” or some similar arbitrary construct. Because there are no natural units,
she should be cautious. Variables that report order without natural units are often called
ordinal variables. But saying “that’s an ordinal variable” doesn’t get you off the hook.
You must still look to the why of your study and understand what you want to learn from
the variable to decide whether to treat it as categorical or quantitative.
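The two treatments of Educational Value can be compared side by side in a small Python sketch. The response codes and labels are the ones in the survey question above; the list of responses is hypothetical.

```python
from collections import Counter
from statistics import mean

# Response codes and labels from the course evaluation question.
LABELS = {1: "Worthless", 2: "Slightly", 3: "Middling", 4: "Reasonably", 5: "Invaluable"}

# Hypothetical responses from one class (invented for illustration).
responses = [4, 5, 3, 4, 2, 5, 4, 1, 4, 3]

# Categorical treatment: how many students chose each response?
counts = Counter(responses)
table = {LABELS[code]: counts.get(code, 0) for code in sorted(LABELS)}

# Quantitative treatment: an average, which quietly assumes that the steps
# between adjacent responses are equal "educational value units".
average_rating = mean(responses)

print(table)
print(average_rating)
```

The count table makes no assumption about spacing between responses, while the average does. That assumption is the reason for caution when an ordinal variable is treated as quantitative.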
✓ Just Checking
In the 2004 Tour de France, Lance Armstrong made history by winning the race for an
unprecedented sixth time. In 2005, he became the only 7-time winner and set a new
record for the fastest average speed—41.65 kilometers per hour—that stands to this
day. You can find data on all the Tour de France races on the DVD. Here are the first
three and last ten lines of the data set. Keep in mind that the entire data set has nearly
100 entries.
1. List as many of the W’s as you can for this data set.
Year  Winner             Country of Origin  Total Time (h/min/s)  Avg. Speed (km/h)  Stages  Total Distance Ridden (km)  Starting Riders  Finishing Riders
1903  Maurice Garin      France             94.33.00              25.3               6       2428                        60               21
1904  Henri Cornet       France             96.05.00              24.3               6       2388                        88               23
1905  Louis Trousselier  France             112.18.09             27.3               11      2975                        60               24
…
2002  Lance Armstrong    USA                82.05.12              39.93              20      3278                        189              153
2003  Lance Armstrong    USA                83.41.12              40.94              20      3427                        189              147
2004  Lance Armstrong    USA                83.36.02              40.53              20      3391                        188              147
2005  Lance Armstrong    USA                86.15.02              41.65              21      3608                        189              155
2006  Óscar Pereiro      Spain              89.40.27              40.78              20      3657                        176              139
2007  Alberto Contador   Spain              91.00.26              38.97              20      3547                        189              141
2008  Carlos Sastre      Spain              87.52.52              40.50              21      3559                        199              145
2009  Alberto Contador   Spain              85.48.35              40.32              21      3460                        180              156
2010  Andy Schleck       Luxembourg         91.59.27              39.590             20      3642                        180              170
2011  Cadel Evans        Australia          86.12.22              39.788             21      3430                        198              167
There’s a world of data on the Internet
These days, one of the richest sources of data is the Internet. With a bit of practice, you can learn to find data on almost any subject. The Internet has both advantages and disadvantages as a source of data. Among the advantages is the fact that often you’ll be able to find even more current data than those we present. The disadvantage is that references to Internet addresses can “break” as sites evolve, move, and die.
Our solution to these challenges is to offer the best advice we can to help you search for
the data, wherever they may be residing. We usually point you to a website. We’ll sometimes
suggest search terms and offer other guidance.
Some words of caution, though: Data found on Internet sites may not be formatted in the
best way for use in statistics software. Although you may see a data table in standard form, an
attempt to copy the data may leave you with a single column of values. You may have to work
in your favorite statistics or spreadsheet program to reformat the data into variables. You will
also probably want to remove commas from large numbers and extra symbols such as money
indicators ($, ¥, £); few statistics packages can handle these.
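The cleanup steps described above might look like this in Python. This is a minimal sketch, not a general-purpose parser: it assumes US-style numbers (comma as thousands separator, period as decimal point), handles only the three currency symbols mentioned in the text, and the helper names `to_number` and `reshape` are hypothetical.

```python
def to_number(text):
    """Strip thousands separators and currency symbols, then parse as a float.

    Assumes US-style formatting; the set of symbols removed is an assumption.
    """
    cleaned = text.strip()
    for symbol in ("$", "¥", "£", ","):
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)

def reshape(flat, n_columns):
    """Rebuild rows from a single column of copied values, n_columns per row."""
    return [flat[i:i + n_columns] for i in range(0, len(flat), n_columns)]

# Hypothetical values as they might look after copying from a web page.
copied = ["$1,234.50", "¥34,200", "£99", " 2,500 "]
numbers = [to_number(value) for value in copied]

# A copied table that collapsed into one column: two variables per case.
single_column = ["2010", "39.590", "2011", "39.788"]
rows = reshape(single_column, 2)

print(numbers)
print(rows)
```

The `reshape` step only works if you already know how many variables each case has, which is one more reason to identify the W's before touching the data.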
the data.
■ We must know who, what, and why to be able to say anything useful based on the
data. The who are the cases. The what are the variables. A variable gives informa-
tion about each of the cases. The why helps us decide which way to treat the
variables.
■ Stop and identify the W’s whenever you have data, and be sure you can identify the
Consider the source of your data and the reasons the data were collected. That can help
you understand what you might be able to learn from the data.
Identify whether a variable is being used as categorical or quantitative.
■ Categorical variables identify a category for each case. Usually we think about the
counts of cases that fall in each category. (An exception is an identifier variable that
just names each case.)
■ Quantitative variables record measurements or amounts of something; they must
have units.
■ Sometimes we may treat the same variable as categorical or quantitative depending
on what we want to learn from it, which means some variables can’t be pigeonholed
as one type or the other.
Review of Terms
The key terms are in chapter order so you can use this list to review the material in the chapter.
Data: Recorded values, whether numbers or labels, together with their context.
Data table: An arrangement of data in which each row represents a case and each column represents a variable.
Context: The context ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed.
Experimental unit: An individual in a study for which or for whom data values are recorded. Human experimental units are usually called subjects or participants.
Population: The entire group of individuals or instances about whom we hope to learn.
Variable: A variable holds information about the same characteristic for many cases.
Categorical (or qualitative) variable: A variable that names categories with words or numerals.
Nominal variable: The term “nominal” can be applied to a variable whose values are used only to name categories.
Quantitative variable: A variable in which the numbers are values of measured quantities with units.
Identifier variable: A categorical variable that records a unique value for each case, used to name or identify it.
Ordinal variable: The term “ordinal” can be applied to a variable whose categorical values possess some kind of order.
Activity: Examine the Data. Take a look at your own data from your experiment earlier in the chapter and get comfortable with your statistics package as you find out about the experiment test results.
“Computers are useless. They can only give you answers.”
Pablo Picasso
Most often we find statistics on a computer using a program, or package, designed for that purpose. There are many different statistics packages, but they all do essentially the same things. If you understand what the computer needs to know to do what you want and what it needs to show you in return, you can figure out the specific details of most packages pretty easily.
For example, to get your data into a computer statistics package, you need to tell the computer:
■ Where to find the data. This usually means directing the computer to a file stored on your computer’s disk or to data on a database. Or it might just mean that you have copied the data from a spreadsheet program or Internet site and it is currently on your computer’s clipboard. Usually, the data should be in the form of a data table. Most computer statistics packages prefer the delimiter that marks the division between elements of a data table to be a tab character and the delimiter that marks the end of a case to be a return character.
■ Where to put the data. (Usually this is handled automatically.)
■ What to call the variables. Some data tables have variable names as the first row of
the data, and often statistics packages can take the variable names from the first
row automatically.
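The three items above can be sketched with Python's standard `csv` module standing in for a statistics package. The two data rows are taken from the Tour de France table earlier in the chapter; reading from an in-memory string instead of a file is an assumption made so the example is self-contained.

```python
import csv
import io

# Tab characters separate the values, newlines end each case,
# and the first row holds the variable names.
text = (
    "Year\tWinner\tAvg. Speed (km/h)\n"
    "2010\tAndy Schleck\t39.590\n"
    "2011\tCadel Evans\t39.788\n"
)

# Where to find the data: here an in-memory string standing in for a file.
reader = csv.DictReader(io.StringIO(text), delimiter="\t")

# Where to put the data: a list of rows, one dict per case.
rows = list(reader)

# What to call the variables: taken automatically from the first row.
variable_names = reader.fieldnames

# Values arrive as text; a quantitative variable must be converted to numbers.
speeds = [float(row["Avg. Speed (km/h)"]) for row in rows]

print(variable_names)
print(speeds)
```

To read from a file instead, the `io.StringIO(text)` argument would be replaced by an open file object.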
Exercises
Section 1
1. Grocery shopping. Many grocery store chains offer customers a card they can scan when they check out and offer discounts to people who do so. To get the card, customers must give information, including a mailing address and e-mail address. The actual purpose is not to reward loyal customers but to gather data. What data do these cards allow stores to gather, and why would they want that data?
2. Online shopping. Online retailers such as Amazon.com keep data on products that customers buy, and even products they look at. What does Amazon hope to gain from such information?
Section 2
3. Super Bowl. Sports announcers love to quote statistics. During the Super Bowl, they particularly love to announce when a record has been broken. They might have a list of all Super Bowl games, along with the scores of each team, total scores for the two teams, margin of victory, passing yards for the quarterbacks, and many more bits of information. Identify the who in this list.
4. Nobel laureates. The website www.nobelprize.org allows you to look up all the Nobel prizes awarded in any year. The data are not listed in a table. Rather you drag a slider to the year and see a list of the awardees for that year. Describe the who in this scenario.
Section 3
5. Grade levels. A person’s grade in school is generally identified by a number.
a) Give an example of a why in which grade level is treated as categorical.
b) Give an example of a why in which grade level is treated as quantitative.
6. ZIP codes. The Postal Service uses five-digit ZIP codes to identify locations to assist in delivering mail.
a) In what sense are ZIP codes categorical?
b) Is there any ordinal sense to ZIP codes? In other words, does a larger ZIP code tell you anything about a location compared to a smaller ZIP code?
7. Voters. A February 2010 Gallup Poll question asked, “In politics, as of today, do you consider yourself a Republican, a Democrat, or an Independent?” The possible responses were “Democrat,” “Republican,” “Independent,” “Other,” and “No Response.” What kind of variable is the response?
8. Job hunting. A June 2011 Gallup Poll asked Americans, “Thinking about the job situation in America today, would you say that it is now a good time or a bad time to find a quality job?” The choices were “Good time” or “Bad time.” What kind of variable is the response?
9. Medicine. A pharmaceutical company conducts an experiment in which a subject takes 100 mg of a substance orally. The researchers measure how many minutes it takes for half of the substance to exit the bloodstream. What kind of variable is the company studying?
10. Stress. A medical researcher measures the increase in heart rate of patients under a stress test. What kind of variable is the researcher studying?
Chapter Exercises
11. The news. Find a newspaper or magazine article in which some data are reported. For the data discussed in the article, answer the same questions as for Exercises 7–10. Include a copy of the article with your report.
12. The Internet. Find an Internet source that reports on a study and describes the data. Print out the description and answer the same questions as for Exercises 7–10.
(Exercises 13–20) For each description of data, identify Who and What were investigated and the Population of interest.
13. Gaydar. A study conducted by a team of American and Canadian researchers found that during ovulation, a woman can tell whether a man is gay or straight by looking at his face. To explore the subject, the authors conducted three investigations, the first of which involved 40 undergraduate women who were asked to guess the sexual orientation of 80 men based on photos of their face. Half of the men were gay, and the other half were straight. All held similar expressions in the photos or were deemed to be equally attractive. None of the women were using any contraceptive drugs at the time of the test. The result: the closer a woman was to her peak ovulation the more accurate her guess. (news.yahoo.com/does-ovulation-boost-womans-gaydar-210405621.html)
14. Hula-hoops. The hula-hoop, a popular children’s toy in the 1950s, has gained popularity as an exercise in recent years. But does it work? To answer this question, the American Council on Exercise conducted a study to evaluate the cardio and calorie-burning benefits of “hooping.” Researchers recorded heart rate and oxygen consumption of participants, as well as their individual ratings of perceived exertion, at regular intervals during a 30-minute workout. (www.acefitness.org/certifiednewsarticle/1094/)
15. Bicycle safety. Ian Walker, a psychologist at the University of Bath, wondered whether drivers treat bicycle riders differently when they wear helmets. He rigged his bicycle with an ultrasonic sensor that could measure how close each car was that passed him. He then rode on alternating days with and without a helmet. Out of 2500 cars passing him, he found that when he wore his helmet, motorists passed 3.35 inches closer to him, on average, than when his head was bare. (Source: NY Times, Dec. 10, 2006)
16. Investments. Some companies offer 401(k) retirement plans to employees, permitting them to shift part of their before-tax salaries into investments such as mutual funds. Employers typically match 50% of the employees’ contribution up to about 6% of salary. One company, concerned with what it believed was a low employee participation rate in its 401(k) plan, sampled 30 other companies with similar plans and asked for their 401(k) participation rates.
17. Honesty. Coffee stations in offices often just ask users to leave money in a tray to pay for their coffee, but many people cheat. Researchers at Newcastle University alternately taped two posters over the coffee station. During one week, it was a picture of flowers; during the other, it was a pair of staring eyes. They found that the average contribution was significantly higher when the eyes poster was up than when the flowers were there. Apparently, the mere feeling of being watched, even by eyes that were not real, was enough to encourage people to behave more honestly. (Source: NY Times, Dec. 10, 2006)
18. Blindness. A study begun in 2011 examines the use of stem cells in treating two forms of blindness, Stargardt’s disease and dry age-related macular degeneration. Each of the 24 patients entered one of two separate trials in which embryonic stem cells were to be used to treat the condition. (beta.news.yahoo.com/first-patients-enroll-stem-cell-trials-blindness-134757558.html)
19. Not-so-diet soda. A look at 474 participants in the San Antonio Longitudinal Study of Aging found that participants who drank two or more diet sodas a day “experienced waist size increases six times greater than those of people who didn’t drink diet soda.” (news.yahoo.com/diet-sodas-dont-help-dieting-175007737.html)
20. Molten iron. The Cleveland Casting Plant is a large, highly automated producer of gray and nodular iron automotive castings for Ford Motor Company. The company is interested in keeping the pouring temperature of the molten iron (in degrees Fahrenheit) close to the specified value of 2550 degrees. Cleveland Casting measured the pouring temperature for 10 randomly selected crankshafts.
(Exercises 21–34) For each description of data, identify the W’s, name the variables, specify for each variable whether its use indicates that it should be treated as categorical or quantitative, and, for any quantitative variable, identify the units in which it was measured (or note that they were not provided).
21. Weighing bears. Because of the difficulty of weighing a bear in the woods, researchers caught and measured 54 bears, recording their weight, neck size, length, and sex. They hoped to find a way to estimate weight from the other, more easily determined quantities.
22. Schools. The State Education Department requires local school districts to keep these records on all students: age, race or ethnicity, days absent, current grade level, standardized test scores in reading and mathematics, and any disabilities or special educational needs.
23. Arby’s menu. A listing posted by the Arby’s restaurant chain gives, for each of the sandwiches it sells, the type of meat in the sandwich, the number of calories, and the serving size in ounces. The data might be used to assess the nutritional value of the different sandwiches.
24. Age and party. The Gallup Poll conducted a representative telephone survey of 1180 American voters during the first quarter of 2007. Among the reported results were the voter’s region (Northeast, South, etc.), age, party affiliation, and whether or not the person had voted in the 2006 midterm congressional election.
25. Babies. Medical researchers at a large city hospital investigating the impact of prenatal care on newborn health collected data from 882 births during 1998–2000. They kept track of the mother’s age, the number of weeks the pregnancy lasted, the type of birth (cesarean, induced, natural), the level of prenatal care the mother had (none, minimal, adequate), the birth weight and sex of the baby, and whether the baby exhibited health problems (none, minor, major).
26. Flowers. In a study appearing in the journal Science, a research team reports that plants in southern England are flowering earlier in the spring. Records of the first flowering dates for 385 species over a period of 47 years show that flowering has advanced an average of 15 days per decade, an indication of climate warming, according to the authors.
27. Herbal medicine. Scientists at a major pharmaceutical firm conducted an experiment to study the effectiveness of an herbal compound to treat the common cold. They exposed each patient to a cold virus, then gave them either the herbal compound or a sugar solution known to have no effect on colds. Several days later they assessed each patient’s condition, using a cold severity scale ranging from 0 to 5. They found no evidence of benefits of the compound.
28. Vineyards. Business analysts hoping to provide information helpful to American grape growers compiled these data about vineyards: size (acres), number of years in existence, state, varieties of grapes grown, average case price, gross sales, and percent profit.
29. Streams. In performing research for an ecology class, students at a college in upstate New York collect data on streams each year. They record a number of biological, chemical, and physical variables, including the stream name, the substrate of the stream (limestone, shale, or mixed), the acidity of the water (pH), the temperature (°C), and the BCI (a numerical measure of biological diversity).
30. Fuel economy. The Environmental Protection Agency (EPA) tracks fuel economy of automobiles based on information from the manufacturers (Ford, Toyota, etc.). Among the data the agency collects are the manufacturer, vehicle type (car, SUV, etc.), weight, horsepower, and gas mileage (mpg) for city and highway driving.
31. Refrigerators. In 2006, Consumer Reports published an article evaluating refrigerators. It listed 41 models, giving the brand, cost, size (cu ft), type (such as top freezer), estimated annual energy cost, an overall rating (good, excellent, etc.), and the repair history for that brand (percentage requiring repairs over the past 5 years).
32. Walking in circles. People who get lost in the desert, mountains, or woods often seem to wander in circles rather than walk in straight lines. To see whether people naturally walk in circles in the absence of visual clues, researcher Andrea Axtell tested 32 people on a football field. One at a time, they stood at the center of one goal line, were blindfolded, and then tried to walk to the other goal line. She recorded each individual’s sex, height, handedness, the number of yards each was able to walk before going out of bounds, and whether each wandered off course to the left or the right. No one made it all the way to the far end of the field without crossing one of the sidelines. (Source: STATS No. 39, Winter 2004)
33. Kentucky Derby 2012. The Kentucky Derby is a horse race that has been run every year since 1875 at Churchill Downs, Louisville, Kentucky. The race started as a 1.5-mile race, but in 1896, it was shortened to 1.25 miles because experts felt that 3-year-old horses shouldn’t run such a long race that early in the season. (It has been run in May every year but one, 1901, when it took place on April 29.) Here are the data for the first four and several recent races.
34. Indy 2012. The 2.5-mile Indianapolis Motor Speedway has been the home to a race on Memorial Day nearly every year since 1911. Even during the first race, there were controversies. Ralph Mulford was given the checkered flag first but took three extra laps just to make sure he’d completed 500 miles. When he finished, another driver, Ray Harroun, was being presented with the winner’s trophy, and Mulford’s protests were ignored. Harroun averaged 74.6 mph for the 500 miles. In 2011, the winner, Dan Wheldon, averaged 170.265 mph. Here are the data for the first five races and five recent Indianapolis 500 races.
Year  Driver             Time          Speed (mph)
1911  Ray Harroun        6:42:08       74.602
1912  Joe Dawson         6:21:06       78.719
1913  Jules Goux         6:35:05       75.933
1914  René Thomas        6:03:45       82.474
1915  Ralph DePalma      5:33:55.51    89.840
…
2008  Scott Dixon        3:28:57.6792  143.567
2009  Hélio Castroneves  3:19:34.6427  150.318
2010  Dario Franchitti   3:05:37.0131  161.623
2011  Dan Wheldon        2:56:11.7267  170.265
2012  Dario Franchitti   2:58:51.2532  167.734
✓ Just Checking ANSWERS
1. Who: Tour de France races; What: year, winner, country of origin, total time, average speed, stages, total distance ridden, starting riders, finishing riders; How: official statistics at race; Where: France (for the most part); When: 1903 to 2011; Why: not specified (To see progress in speeds of cycling racing?)
2.
Variable           Type                        Units
Year               Quantitative or Identifier  Years
Winner             Categorical
Country of Origin  Categorical
Total Time         Quantitative                Hours/minutes/seconds
Average Speed      Quantitative                Kilometers per hour
Stages             Quantitative                Counts (stages)
Total Distance     Quantitative                Kilometers
Starting Riders    Quantitative                Counts (riders)
Finishing Riders   Quantitative                Counts (riders)
Answers
Here are the “answers” to the exercises for this chapter. The answers are outlines of the complete solution. Your solution should follow the
model of the Step-by-Step examples, where appropriate. You should explain the context, show your reasoning and calculations, and draw
conclusions. For some problems, what you decide to include in an argument may differ somewhat from the answers here. But, of course, the
numerical part of your answer should match the numbers in the answers shown.
1. Retailers, and suppliers to whom they sell the information, will use the information about what products consumers buy to target their advertisements to customers more likely to buy their products.
3. The individual games.
5. a) Sample—A principal wants to know how many from each grade will be attending a performance of a school play.
   b) Sample—A principal wants to see the trend in the amount of mathematics learned as students leave each grade.
7. Categorical.
9. Quantitative.
11. Answers will vary.
13. Who—40 undergraduate women; What—Ability to differentiate gay men from straight men; Population—All women.
15. Who—2500 cars; What—Distance from car to bicycle; Population—All cars passing bicyclists.
17. Who—Coffee drinkers at a Newcastle University coffee station; What—Amount of money contributed; Population—All people in honor system payment situations.
19. Who—474 participants in the San Antonio Longitudinal Study of Aging; What—Diet soda consumption and waist size change; Population—All diet soda drinkers.
21. Who—54 bears; Cases—Each bear is a case; What—Weight, neck size, length, and sex; When—Not specified; Where—Not specified; Why—To estimate weight from easier-to-measure variables; How—Researchers collected data on 54 bears they were able to catch. Variable—Weight; Type—Quantitative; Units—Not specified; Variable—Neck size; Type—Quantitative; Units—Not specified; Variable—Length; Type—Quantitative; Units—Not specified; Variable—Sex; Type—Categorical.
23. Who—Arby’s sandwiches; Cases—Each sandwich is a case; What—Type of meat, number of calories, and serving size; When—Not specified; Where—Arby’s restaurants; Why—To assess nutritional value of sandwiches; How—Report by Arby’s restaurants; Variable—Type of meat; Type—Categorical; Variable—Number of calories; Type—Quantitative; Units—Calories; Variable—Serving size; Type—Quantitative; Units—Ounces.
25. Who—882 births; Cases—Each of the 882 births is a case; What—Mother’s age, length of pregnancy, type of birth, level of prenatal care, birth weight of baby, sex of baby, and baby’s health problems; When—1998–2000; Where—Large city hospital; Why—Researchers were investigating the impact of prenatal care on newborn health; How—Not specified exactly, but probably from hospital records; Variable—Mother’s age; Type—Quantitative; Units—Not specified, probably years; Variable—Length of pregnancy; Type—Quantitative; Units—Weeks; Variable—Birth weight of baby; Type—Quantitative; Units—Not specified, probably pounds and ounces; Variable—Type of birth; Type—Categorical; Variable—Level of prenatal care; Type—Categorical; Variable—Sex; Type—Categorical; Variable—Baby’s health problems; Type—Categorical.
27. Who—Experiment subjects; Cases—Each subject is a case; What—Treatment (herbal cold remedy or sugar solution) and cold severity; When—Not specified; Where—Not specified; Why—To test efficacy of herbal remedy on common cold; How—The scientists set up an experiment; Variable—Treatment; Type—Categorical; Variable—Cold severity rating; Type—Quantitative (perhaps ordinal categorical); Units—Scale from 0 to 5; Concerns—The severity of a cold seems subjective and difficult to quantify. Scientists may feel pressure to report negative findings of herbal product.
29. Who—Streams; Cases—Each stream is a case; What—Name of stream, substrate of the stream, acidity of the water, temperature, BCI; When—Not specified; Where—Upstate New York; Why—To study ecology of streams; How—Not specified; Variable—Stream name; Type—Identifier; Variable—Substrate; Type—Categorical; Variable—Acidity of water; Type—Quantitative; Units—pH; Variable—Temperature; Type—Quantitative; Units—Degrees Celsius; Variable—BCI; Type—Quantitative; Units—Not specified
31. Who—41 refrigerator models; Cases—Each of the 41 refrigerator models is a case; What—Brand, cost, size, type, estimated annual energy cost, overall rating, and repair history; When—2006; Where—United States; Why—To provide information to the readers of Consumer Reports; How—Not specified; Variable—Brand; Type—Categorical; Variable—Cost; Type—Quantitative; Units—Not specified (dollars); Variable—Size; Type—Quantitative; Units—Cubic feet; Variable—Type; Type—Categorical; Variable—Estimated annual energy cost; Type—Quantitative; Units—Not specified (dollars); Variable—Overall rating; Type—Categorical (ordinal); Variable—Percent requiring repair in the past 5 years; Type—Quantitative; Units—Percent.
33. Who—Kentucky Derby races; What—Date, winner, jockey, trainer, owner, and time; When—1875 to 2012; Where—Churchill Downs, Louisville, Kentucky; Why—Not specified (To see trends in horse racing?); How—Official statistics collected at race; Variable—Year; Type—
Displaying and Describing
Categorical Data
From Chapter 2 of Intro Stats, Fourth Edition. Richard D. De Veaux, Paul F. Velleman, David E. Bock. Copyright © 2014 by Pearson
Education, Inc. All rights reserved.
1 Summarizing and Displaying a Single Categorical Variable
2 Exploring the Relationship Between Two Categorical Variables
… categorical variables work together. Are men or women more likely to be Democrats? Are people with blue eyes more likely to be left-handed? You often see categorical data displayed in pie charts or bar charts, or summarized in tables, sometimes in confusing ways. But, with a little skill, it’s not hard to do it right.

What happened on the Titanic at 11:40 on the night of April 14, 1912, is well known. Frederick Fleet’s cry of “Iceberg, right ahead” and the three accompanying pulls of the crow’s nest bell signaled the beginning of a nightmare that has become legend. By 2:15 a.m., the Titanic, thought by many to be unsinkable, had sunk, leaving more than 1500 passengers and crew members on board to meet their icy fate.
Below are some data about the passengers and crew aboard the Titanic. Each case (row) of the data table represents a person on board the ship. The variables are the person’s Survival status (Dead or Alive), Age (Adult or Child), Sex (Male or Female), and ticket Class (First, Second, Third, or Crew).
Table 1
Part of a data table showing four variables for nine people aboard the Titanic.

Survival   Age     Sex      Class
Dead       Adult   Male     Third
Dead       Adult   Male     Crew
Dead       Adult   Male     Third
Dead       Adult   Male     Crew
Dead       Adult   Male     Crew
Dead       Adult   Male     Crew
Alive      Adult   Female   First
Dead       Adult   Male     Third
Dead       Adult   Male     Crew
The problem with a data table like this—and in fact with all data tables—is that you can’t see what’s going on. And seeing is just what we want to do. We need ways to show the data so that we can see patterns, relationships, trends, and exceptions.

Video: The Incident tells the story of the Titanic, and includes rare film footage.
How   A variety of sources and Internet sites
Why   Historical interest

patterns in your data. It could also show you things you did not expect to see: extraordinary (possibly wrong) data values or unexpected patterns.
3. Make a picture. The best way to Tell others about your data is with a well-chosen picture.
These are the three rules of data analysis. There are pictures of data throughout the
text, and new kinds keep showing up. These days, technology makes drawing pictures of
data easy, so there is no reason not to follow the three rules.
Figure 1
A picture to tell a story. In November 2008, Barack Obama was elected the 44th president of the United States. News reports commonly showed the election results with maps like the one on top, coloring states won by Obama blue and those won by his opponent John McCain red. Even though McCain lost, doesn’t it look like there’s more red than blue? That’s because some of the larger states like Montana and Wyoming have far fewer voters than some of the smaller states like Maryland and Connecticut. The strange-looking map on the bottom cleverly distorts the states to resize them proportional to their populations. By sacrificing an accurate display of the land areas, we get a better impression of the votes cast, giving us a clear picture of Obama’s historic victory. (Source: www-personal.umich.edu/~mejn/election/2008/)
Figure 2
How many people were in each class on the Titanic? From this display, it looks as though the service must have been great, since most aboard were crew members. Although the length of each ship here corresponds to the correct number, the impression is all wrong. In fact, only about 40% were crew.
one of the totals, it’s the associated area in the image that we notice. There were about 3 times as many crew as second-class passengers, and the ship depicting the number of crew is about 3 times longer than the ship depicting second-class passengers. The problem is that it occupies about 9 times the area. That just isn’t a correct impression.
The best data displays observe a fundamental principle of graphing data called the area principle. The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents. Violations of the area principle are a common way to lie (or, since most mistakes are unintentional, we should say err) with Statistics.

Table 2
A frequency table of the Titanic passengers.

Class    Count
First    325
Second   285
Third    706
Crew     885
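The "3 times longer, 9 times the area" comparison between the crew and second-class ships can be checked from the counts in Table 2. The sketch below is only an illustration: it assumes each ship image is scaled uniformly (height growing in step with length), and the variable names are made up for this example.

```python
# Counts from Table 2.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

# Ratio of crew to second-class passengers: the quantity the display should convey.
count_ratio = counts["Crew"] / counts["Second"]

# Assumption: the ship images are scaled uniformly, so a ship that is
# k times longer is also k times taller and covers k * k times the area.
area_ratio = count_ratio ** 2

print(f"Count (and length) ratio: {count_ratio:.2f}")
print(f"Area ratio under uniform scaling: {area_ratio:.2f}")
```

With the exact counts the ratios come out to roughly 3.1 and 9.6; rounding the first to 3 gives the "about 9 times" quoted in the text. A display that follows the area principle would keep the area ratio equal to the count ratio.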
Frequency Tables
A display that distorts the number of cases can be misleading. But before we can follow the area principle to accurately display a categorical variable, we need to organize the number of cases associated with each category. Even when we have thousands of cases, a variable like ticket Class, with only a few categories, is easy to organize. We just count the number of cases corresponding to each category and put them in a table. This frequency table also records the totals and uses the category names to label each row. For ticket Class, these are “First,” “Second,” “Third,” and “Crew.”
For a variable with dozens or hundreds of categories, a frequency table will be much harder to read. You might want to combine categories into larger headings. For example, instead of counting the number of students from each state, you might group the states into regions like “Northeast,” “South,” “Midwest,” “Mountain States,” and “West.” If the number of cases in several categories is relatively small, you can put them together into one category labeled “Other.”
Counts are useful, but sometimes we want to know the fraction or proportion of the data in each category, so we divide the counts by the total number of cases. Usually we multiply by 100 to express these proportions as percentages. A relative frequency table displays the percentages, rather than the counts, of the values in each category. Both types of tables show how the cases are distributed across the categories. In this way, they describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.

Table 3
A relative frequency table for the same data.

Class    %
First    14.77
Second   12.95
Third    32.08
Crew     40.21

[Bar chart: Frequency (0 to 1000) by Class (First, Second, Third, Crew)]
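As a small illustration of the step from counts to percentages, the sketch below starts from the counts in Table 2, divides each by the total number of cases, and multiplies by 100. The dictionary and variable names are assumptions made for this example only.

```python
# Counts from Table 2 (ticket Class aboard the Titanic).
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

# Total number of cases.
total = sum(counts.values())

# Relative frequency: each count divided by the total, times 100 for a percentage.
percents = {cls: 100 * n / total for cls, n in counts.items()}

# Print a combined frequency and relative frequency table.
print(f"{'Class':<8}{'Count':>7}{'%':>8}")
for cls, n in counts.items():
    print(f"{cls:<8}{n:>7}{percents[cls]:>8.2f}")
print(f"{'Total':<8}{total:>7}{sum(percents.values()):>8.2f}")
```

Rounded to two decimals, the percentages agree with Table 3. Keeping the counts and percentages side by side makes it easy to see that both describe the same distribution.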
the wonders and omens that are observed before the ruin of states, princes, and private persons are merciful reminders from good angels, which careless people call merely phenomena of chance and nature.
As for spirits, then, so far from denying their existence, I rather believe that not only whole states but also private persons have their guiding and guardian angel. This is not a new opinion of the Church of Rome, but one already put forward by Pythagoras and Plato. There is nothing heretical in it, and although it is not expressly mentioned in Scripture, it is nevertheless a good and useful idea for human life. It could be a good hypothesis for removing many doubts for which ordinary philosophy can give no resolution.