Statistics
Tenth Edition
David S. Moore, George P. McCabe, Bruce Craig
Chapter 1
Looking at Data—Distributions
Lecture Slides
© 2021 W.H. Freeman and Company
Introduction
1.1 Data
1.1 Data
¡ Cases, variables, values, and labels
¡ Categorical and quantitative variables
¡ Key characteristics of a data set
Cases, Variables, Values, and Labels
Variables
§ What? How many variables does the data set have? What
are the exact definitions of these variables? What are the
units of measurement for each quantitative variable?
§ Pie charts
§ Histograms
§ Examining distributions
§ Dealing with outliers
§ Time plots
Exploratory Data Analysis
Exploring Data
Variable: Characteristic of the individual
Quantitative variable: Takes numerical values for which arithmetic operations make sense
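Distribution of a Variable
[Figure: pie chart and bar graph of the same categorical variable. Pie chart shares: Google 74%, Library 14%, Wikipedia 9%, Other 3%. Bar graph: vertical axis Number of Students (0 to 300); categories Google, Library, Wikipedia, Other.]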
Quantitative Variables
To construct a stemplot:
§ Separate each observation into a stem (all but the rightmost digit)
and a leaf (the remaining digit).
§ Write the stems in a vertical column; draw a vertical line to the right
of the stems.
§ Write each leaf in the row to the right of its stem; order the leaves,
if desired.
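The three steps above can be sketched in Python. This is only an illustration for non-negative whole numbers: the function name `stemplot` is hypothetical, and the data are the 24 business start times (in days) from the median example later in these slides.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting orders the leaves within each stem
        stem, leaf = divmod(v, 10)      # stem: all but the rightmost digit; leaf: the rightmost digit
        leaves[stem].append(leaf)
    lowest, highest = min(leaves), max(leaves)
    rows = []
    for stem in range(lowest, highest + 1):   # keep stems that have no leaves
        rows.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return rows

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

rows = stemplot(times)
print("\n".join(rows))
```

Empty stems are kept on purpose, so gaps in the data remain visible in the plot.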
Stemplots 2
If there are very few stems (when the data cover only a very small range
of values), then you might want to create more stems by splitting the
original stems.
Example: If all the data values are between 150 and 179, then we may choose to use the following stems:
15
15
16
16
17
17
Leaves 0–4 would go on each upper stem (first “15”) and leaves 5–9 would go on each lower stem (second “15”).
Histograms 1
For large datasets and/or quantitative variables that take many values:
§ Divide the possible values into classes, or intervals of equal width.
§ Count how many observations fall into each interval. Instead of
counts, one may also use percents.
§ Draw a picture representing the distribution: the height of each bar equals
the number (or percent) of observations in its interval.
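The counting step can be sketched as follows. The helper name `class_counts`, the choice of classes of width 10 starting at 0, and the text-only bars are assumptions made for illustration; the data are the 24 business start times from the median example later in these slides.

```python
def class_counts(values, start, width, n_classes):
    """Count observations in the classes [start, start + width), [start + width, ...)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class that contains v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

counts = class_counts(times, start=0, width=10, n_classes=7)
percents = [100 * c / len(times) for c in counts]

# A rough text histogram: one '#' per observation in the class.
for k, c in enumerate(counts):
    lo = k * 10
    print(f"{lo:>2} <= x < {lo + 10:<2}  {'#' * c} ({c})")
```

Each class includes its left endpoint and excludes its right endpoint, matching the "75 ≤ IQ Score < 85" style of class used in the IQ example.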
Histograms 2
Distribution of IQ Scores
Class                     Count
75 ≤ IQ Score < 85        2
85 ≤ IQ Score < 95        3
95 ≤ IQ Score < 105       10
105 ≤ IQ Score < 115      16
115 ≤ IQ Score < 125      13
125 ≤ IQ Score < 135      10
135 ≤ IQ Score < 145      5
145 ≤ IQ Score < 155      1
[Figure: histogram of the IQ scores; horizontal axis: IQ Score; vertical axis: Number of Students]
Examining Distributions 1
§ In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
§ You can describe the overall pattern by its shape, center, and
spread.
§ An important kind of deviation is an outlier, an individual that falls
outside the overall pattern.
Examining Distributions 2
§ A distribution is symmetric if the right and left sides of the graph are
approximately mirror images of each other.
§ A distribution is skewed to the right (right-skewed) if the right side of
the graph (containing the half of the observations with larger values) is
much longer than the left side.
§ It is skewed to the left (left-skewed) if the left side of the graph is
much longer than the right side.
[Figure: time plot; horizontal axis: years 1985 to 1993; vertical axis: percent]
$$\bar{x} = \frac{1}{n} \sum x_i$$
Measuring Center: The Median
Use the data below to calculate the mean and median of the time to
start a business (in days) of 24 randomly selected countries.
19 17 43 7 12 27 67 49 6 6 29 12
12 9 17 23 1 12 14 18 6 7 9 31
$$\bar{x} = \frac{19 + 17 + \cdots + 31}{24} = 18.875 \text{ days}$$
0 | 16667799
1 | 222247789
2 | 379
3 | 1
4 | 39
5 |
6 | 7
Key: 4|9 represents a country that had a 49-day time to start a business.
$$M = \frac{12 + 14}{2} = 13 \text{ days}$$
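The two results above can be checked with Python's standard `statistics` module. This is a quick verification sketch; the variable names are chosen here for illustration.

```python
import statistics

# Time to start a business (days) in 24 countries, as listed above.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

mean = statistics.mean(times)       # sum of the 24 values divided by 24
median = statistics.median(times)   # n is even, so the two middle ordered values are averaged

print(f"mean = {mean} days, median = {median} days")
```

The mean is pulled above the median by the few large values (43, 49, 67), which is what a right-skewed distribution tends to produce.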
Comparing Mean and Median
The minimum and maximum values alone tell us little about the
distribution as a whole. Likewise, the median and quartiles tell us little
about the tails of a distribution.
To get a quick summary of both center and spread, combine all five
numbers.
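As a sketch, the five numbers (minimum, first quartile, median, third quartile, maximum) can be computed for the business start times. The quartile convention used here, taking the medians of the lower and upper halves of the ordered data, is an assumption; statistical software often uses slightly different definitions. The function name is hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    xs = sorted(values)
    n = len(xs)
    lower = xs[: n // 2]          # observations below the median's position
    upper = xs[(n + 1) // 2 :]    # observations above the median's position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

summary = five_number_summary(times)
print(summary)
```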
The median and quartiles divide the distribution roughly into quarters.
This leads to a new way to display quantitative data, the boxplot.
§ Draw and label a number line that includes the range of the
distribution.
§ Extend lines (whiskers) from the box out to the minimum and
maximum values.
Note. For a modified boxplot, the whiskers extend from the box to the minimum
and maximum values that are not outliers, with the outliers themselves marked
individually, e.g., as little stars.
Suspected Outliers: 1.5 × IQR Rule
[Figure: boxplot of Start Time]
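The rule named in this heading is usually stated as: flag an observation as a suspected outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch for the business start times follows; the function name is hypothetical, and the quartiles are computed as medians of the two halves of the ordered data, which is one of several conventions.

```python
import statistics

def suspected_outliers(values):
    """Return the observations flagged by the 1.5 x IQR rule."""
    xs = sorted(values)
    n = len(xs)
    q1 = statistics.median(xs[: n // 2])          # first quartile
    q3 = statistics.median(xs[(n + 1) // 2 :])    # third quartile
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

flagged = suspected_outliers(times)
print(flagged)
```

A flagged value is only a candidate for closer inspection; the rule by itself does not say the observation is wrong.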
Measuring Spread: The Standard Deviation
Standard deviation:
$$s_x = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$$
Calculating the Standard Deviation 1
[Figure: dotplot of Number of Pets (axis from 0 to 10) with the mean x̄ = 5 marked and one deviation shown: 8 – 5 = 3]
Calculating the Standard Deviation 2
3. Square each deviation.
4. Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n – 1). This is called the variance.
5. Calculate the square root of the variance. This is the standard deviation.

xi     (xi – mean)     (xi – mean)²
1      1 – 5 = –4      (–4)² = 16
3      3 – 5 = –2      (–2)² = 4
4      4 – 5 = –1      (–1)² = 1
4      4 – 5 = –1      (–1)² = 1
4      4 – 5 = –1      (–1)² = 1
5      5 – 5 = 0       (0)² = 0
7      7 – 5 = 2       (2)² = 4
8      8 – 5 = 3       (3)² = 9
9      9 – 5 = 4       (4)² = 16
       Sum = ?         Sum = ?
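The calculation in the table can be followed step by step in Python. This is a sketch using the nine pet counts from the table; the variable names are chosen here for illustration, and running it fills in the two sums the table leaves open.

```python
import math

# Number of pets for the nine individuals in the table.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
mean = sum(pets) / n                        # the mean, 5 in the table
deviations = [x - mean for x in pets]       # each observation minus the mean
squared = [d ** 2 for d in deviations]      # square each deviation
variance = sum(squared) / (n - 1)           # sum of squared deviations divided by (n - 1)
std_dev = math.sqrt(variance)               # square root of the variance

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

The deviations always sum to zero, which is why they are squared before being averaged.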
and Spread
We now have a choice between two descriptions for center and spread:
✓ Mean and standard deviation
✓ Median and interquartile range
The median and IQR are usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with
outliers.
¡ Normal distributions
¡ Standardizing observations
The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for
7th-grade students in Gary, Indiana, is close to Normal. Suppose the
distribution is N(6.84, 1.55).
✓ Sketch the Normal density curve for this distribution.
✓ What percent of ITBS vocabulary scores are less than 3.74?
✓ What percent of the scores are between 5.29 and 9.94?
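The two percent questions can be explored with `statistics.NormalDist` from the Python standard library. The sketch assumes that N(6.84, 1.55) means a mean of 6.84 and a standard deviation of 1.55. The values 3.74, 5.29, and 9.94 sit 2 standard deviations below, 1 below, and 2 above the mean, so the 68–95–99.7 rule gives rough answers of 2.5% and 81.5%; the code computes the corresponding areas from the Normal curve directly.

```python
from statistics import NormalDist

mu, sigma = 6.84, 1.55            # assumed reading of N(6.84, 1.55): mean, standard deviation
itbs = NormalDist(mu, sigma)

# Standardized values of the three cut points.
z_374 = (3.74 - mu) / sigma
z_529 = (5.29 - mu) / sigma
z_994 = (9.94 - mu) / sigma

p_below = itbs.cdf(3.74)                       # proportion of scores less than 3.74
p_between = itbs.cdf(9.94) - itbs.cdf(5.29)    # proportion between 5.29 and 9.94

print(f"z-scores: {z_374:.2f}, {z_529:.2f}, {z_994:.2f}")
print(f"P(score < 3.74)        = {p_below:.4f}")
print(f"P(5.29 < score < 9.94) = {p_between:.4f}")
```

The computed areas differ slightly from the rule-of-thumb answers because 68, 95, and 99.7 are rounded percentages.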
Standardizing Observations 2
$$z = \frac{x - \mu}{\sigma}$$
1 – (0.1056+0.2090) = 1 – 0.3146
= 0.6854
Normal Calculations 2
How to Solve Problems Involving Normal Distributions
3. Perform calculations.
§ Standardize x to restate the problem in terms of a standard
Normal variable z.
§ Use Table A and the fact that the total area under the curve is 1
to find the required area under the standard Normal curve.
If exactly 10% of men aged 18–24 are shorter than a particular man, how
tall is he?
[Figure: Normal curve N(70, 2.8) centered at 70, with area .10 shaded to the left of the unknown height, marked “?”]
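Normal Calculations 4
z = –1.28
Normal Calculations 5
[Figure: Normal curve N(70, 2.8) centered at 70, with area .10 to the left of z = –1.28; the corresponding height is marked “?”]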
We need to “unstandardize” the z-score to find the observed value (x):
$$z = \frac{x - \mu}{\sigma} \implies x = \mu + z\sigma$$
x = 70 + z(2.8)
= 70 + [(–1.28) × (2.8)]
= 70 + (–3.58) = 66.42
A man would have to be about 66.42 inches tall or shorter to be in the
lowest 10% of all men in the population.
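The same backward calculation can be sketched with `statistics.NormalDist`, which finds z from the Normal curve itself rather than from Table A. The variable names are illustrative. Because the software uses an unrounded z (about –1.2816 instead of –1.28), the result differs from 66.42 in the second decimal place.

```python
from statistics import NormalDist

mu, sigma = 70, 2.8                      # N(70, 2.8): mean and standard deviation in inches
heights = NormalDist(mu, sigma)

z = NormalDist().inv_cdf(0.10)           # standard Normal value with area 0.10 to its left
x = mu + z * sigma                       # "unstandardize": x = mu + z * sigma

# The same answer in one step, asking the N(70, 2.8) distribution directly.
x_direct = heights.inv_cdf(0.10)

print(f"z = {z:.4f}")
print(f"x = {x:.2f} inches")
```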
Normal Quantile Plots 1
Note. Normal quantile plots are complex to do by hand, but they are standard
features in most statistical software.
Good fit to a straight line: The distribution of rainwater pH values is close to Normal.
Curved pattern: The data are not Normally distributed. Instead, the data are right-skewed: A few individuals have particularly long survival times.
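For readers curious about what software does, one common construction of the points in a Normal quantile plot is sketched below. The plotting positions (i – 0.5)/n are an assumption, since packages differ on this detail, and the function name is hypothetical. The data are the business start times from earlier in these slides.

```python
from statistics import NormalDist

def normal_quantile_pairs(values):
    """Pair each ordered observation with the standard Normal quantile of its rank."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()   # mean 0, standard deviation 1
    # (i - 0.5) / n is one of several plotting-position conventions.
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

pairs = normal_quantile_pairs(times)
for z, x in pairs:
    print(f"{z:6.2f}  {x}")
```

Plotting the second column against the first gives the Normal quantile plot; points that follow a straight line suggest a distribution close to Normal.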
Density Estimation 1
A density estimator does not start with any specific shape, such as the
Normal shape. It looks at the data and draws a density curve that
describes the overall shape of the data.
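One widely used density estimator is the Gaussian kernel density estimate, sketched below in plain Python. The slides do not say which estimator they use, so the kernel, the rule-of-thumb bandwidth, and the function name are all assumptions made for illustration. The data are the business start times from earlier in these slides.

```python
import math
import statistics

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    scale = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        # Average of Normal bumps of width `bandwidth`, one centered at each observation.
        return scale * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return density

# Time to start a business (days) in 24 countries, taken from these slides.
times = [19, 17, 43, 7, 12, 27, 67, 49, 6, 6, 29, 12,
         12, 9, 17, 23, 1, 12, 14, 18, 6, 7, 9, 31]

# Rule-of-thumb bandwidth (an assumption; other choices give smoother or rougher curves).
h = 1.06 * statistics.stdev(times) * len(times) ** (-1 / 5)
f = gaussian_kde(times, h)

# Check that the estimated curve has total area close to 1, as a density curve should.
step = 0.5
grid = [k * step for k in range(-200, 401)]      # from -100 to 200
area = sum(f(x) * step for x in grid)

print(f"bandwidth = {h:.2f}, area under curve = {area:.3f}")
```

A Gaussian kernel spreads some area below zero even though times cannot be negative, a known limitation of this simple version.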
Recall the example concerning the time to start a business in 24 countries, for
which we created a boxplot.
From the boxplot, it is evident that the data are right-skewed.
We conclude that the Normal curve does not approximate these data
well.
Chapter 1
Looking at Data—
Distributions (Review)
Introduction
1.1 Data