
Introduction to Probability

and Statistics
Fourteenth Edition

Chapter 2
Describing Data
with Numerical Measures
Describing Data with Numerical
Measures
• Graphs can help you describe the basic shape of a data
distribution; “a picture is worth a thousand words.”
• Graphical methods may not always be sufficient for
describing data.
• Numerical measures can be created for both
populations and samples.
– A parameter is a numerical descriptive measure calculated for a population.
– A statistic is a numerical descriptive measure calculated for a sample.
Measures of Center
• Dotplots, stem and leaf plots, and histograms are
used to describe the distribution of a set of
measurements on a quantitative variable.
• Measure of center - a measure along the
horizontal axis of the data distribution that locates
the center of the distribution.

Arithmetic Mean or Average
• Definition: The arithmetic mean or
average of a set of n measurements is equal
to the sum of the measurements divided by n.

x̄ = (Σxᵢ)/n = (1/n) Σᵢ₌₁ⁿ xᵢ

where n = number of measurements and Σxᵢ = the sum of all the measurements.
Trimmed Mean
• Weighted arithmetic mean:

  x̄_w = (Σᵢ₌₁ⁿ wᵢxᵢ) / (Σᵢ₌₁ⁿ wᵢ)

• Trimmed mean:
  – Chop off the extreme values before averaging (e.g., Olympic
    gymnastics score computation). A sketch of both is shown below.

Winsorized Mean
• Winsorized mean is a method of averaging that
initially replaces the smallest and largest values
with the observations closest to them.
• This is done to limit the effect of abnormal extreme values, or
outliers, on the calculation.

Winsorized Mean
• Initial data set: 1, 5, 7, 8, 9, 10, 34.
• Replacing the smallest value (1) and the largest value (34) with their
nearest neighbors, the data set now appears as follows: 5, 5, 7, 8, 9,
10, 10.
• Taking an arithmetic average of the new set
produces a winsorized mean of 7.7, or (5 + 5 + 7 +
8 + 9 + 10 + 10) divided by 7.
• Note that the arithmetic mean would have been
higher—10.6.
• The winsorized mean effectively reduces the
influence of the 34 value as an outlier.
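A minimal Python sketch of this one-value winsorization (not from the text; scipy.stats.mstats.winsorize offers a more general, proportion-based version). The data are the slide's example set.

```python
# Sketch: replace the single smallest and largest observations with
# their nearest neighbors, then average.
def winsorized_mean_1(x):
    xs = sorted(x)
    xs[0] = xs[1]      # smallest value takes its neighbor's value
    xs[-1] = xs[-2]    # largest value takes its neighbor's value
    return sum(xs) / len(xs)

data = [1, 5, 7, 8, 9, 10, 34]
print(round(winsorized_mean_1(data), 1))   # 7.7
print(round(sum(data) / len(data), 1))     # 10.6 (ordinary arithmetic mean)
```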

Winsorized Mean
• Consider a 20% winsorized mean that takes the top 10%
and bottom 10% and replaces them with their next
closest value.
• We will winsorize the following data set: 2, 4, 7, 8, 11,
14, 18, 23, 23, 27, 35, 40, 49, 50, 55, 60, 61, 61, 62, 75.
• The two smallest and two largest data points—20% of
the 20 data points—will be replaced with their next
closest value.
• Thus, the new data set is as follows: 7, 7, 7, 8, 11, 14,
18, 23, 23, 27, 35, 40, 49, 50, 55, 60, 61, 61, 61, 61.
• The winsorized mean is 33.9, or the total of the data
(678) divided by the total number of data points (20).
Winsorized Mean
• It mitigates the effects of outliers by replacing them
with less extreme values.
• The winsorized mean is not the same as the trimmed
mean, which involves removing data points as
opposed to replacing them—although the results of
the two tend to be close.
• The winsorized mean is also not the same as the arithmetic mean,
which does not adjust for outliers.

Example
•The set: 2, 9, 11, 5, 6

 xi 2  9  11  5  6 33
x    6 .6
n 5 5

If we were able to enumerate the whole


population, the population mean would be
called (the Greek letter “mu”).

Median
Definition: The median m of a set of n
measurements is the value of x that falls in
the middle position when the measurements are
ordered from smallest to largest.
•The position of the median is .5(n + 1)
once the measurements have been ordered .
• Although both the mean and the median are
good measures of the center of a distribution,
the median is less sensitive to extreme values
or outliers.
Example
• The set: 2, 4, 9, 8, 6, 5, 3 n=7
• Sort:2, 3, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th ordered measurement = 5
• The set: 2, 4, 9, 8, 6, 5 n=6
• Sort:2, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5 - average of the 3rd and 4th measurements
Mode
• Definition: The mode is the category that
occurs most frequently, or the most frequently
occurring value of x.
• The set: 2, 4, 9, 8, 8, 5, 3
– The mode is 8, which occurs twice
• The set: 2, 2, 9, 8, 8, 5, 3
– There are two modes—8 and 2 (bimodal)
• The set: 2, 4, 9, 8, 5, 3
– There is no mode (each value is unique).
Mode
• When measurements on a continuous variable have
been grouped as a frequency or relative frequency
histogram, the class with the highest peak or
frequency is called the modal class, and the midpoint
of that class is taken to be the mode.
• The mode is generally used to describe large data sets,
whereas the mean and median are used for both large
and small data sets.
• It is possible for a distribution of measurements to
have more than one mode. These modes would appear
as “local peaks” in the relative frequency distribution.
Mode
• For example, if we were to tabulate the length of fish
taken from a lake during one season, we might get a
bimodal distribution, possibly reflecting a mixture of
young and old fish in the population.
• Bimodal distributions often represent a mixture of two
different populations in the data set.
• Sometimes bimodal distributions of sizes or weights
reflect a mixture of measurements taken on males and
females.
• In any case, a set or distribution of measurements may
have more than one mode.
Mode
• Mode: Value that occurs most frequently in the data
• Unimodal
– Empirical formula: mean − mode ≈ 3 × (mean − median)
• Multi-modal
– Bimodal
– Trimodal

Example
The number of quarts of milk purchased by
25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 4 4 4 5
• Mean?
  x̄ = (Σxᵢ)/n = 55/25 = 2.2
• Median?
  m = 2
• Mode? (Highest peak)
  mode = 2

[Relative frequency histogram of quarts purchased (0 to 5), with the
highest bar at 2 quarts.]
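For reference, the same three measures can be reproduced with Python's standard statistics module (an illustrative sketch, not part of the text).

```python
# Reproducing the milk example.
import statistics

quarts = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
          2, 3, 3, 3, 3, 3, 4, 4, 4, 5]

print(statistics.mean(quarts))    # 2.2
print(statistics.median(quarts))  # 2
print(statistics.mode(quarts))    # 2
```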

Extreme Values
• The mean is more easily affected by extremely large
or small values than the median.

• The median is often used as a measure of center when the distribution
is strongly skewed by one or more extreme values.

Extreme Values
• If a distribution is skewed to the right, the mean shifts
to the right; if a distribution is skewed to the left, the
mean shifts to the left.
• When a distribution is symmetric, the mean and the
median are equal.
• The median is not affected by these extreme values
because the numerical values of the measurements are
not used in its calculation.

Extreme Values

Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median

Symmetric vs. Skewed Data
• Median, mean, and mode of symmetric, positively skewed, and
negatively skewed data.

Measures of Variability
• Data sets may have the same center but look different
because of the way the numbers spread out from the center.
• A measure of variability is a measure along the horizontal axis of
the data distribution that describes the spread of the distribution
from the center.

Measures of Variability
• Measures of variability (dispersion) can help you
create a mental picture of the spread of the data.

The Range
• The range, R, of a set of n measurements is
the difference between the largest and
smallest measurements.
• Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
• The range is R = 14 – 5 = 9.
• Quick and easy, but only uses 2 of the 5 measurements.
The Range
• The range is easy to calculate, easy to interpret,
and is an adequate measure of variation for small
sets of data.
• But, for large data sets, the range is not an
adequate measure of variability. For example, the
two relative frequency distributions have the same
range but very different shapes and variability.

The Variance
• The variance is a measure of variability that uses all the
measurements. It measures the average squared deviation of the
measurements about their mean.
• The variance will be relatively large for highly vari-
able data and relatively small for less variable data.
• Flower petals: 5, 12, 6, 8, 14

  x̄ = 45/5 = 9

[Dotplot of the five measurements on a scale from 4 to 14, centered at
the mean of 9.]
The Variance
• Definition The variance of a population of N measurements is the
average of the squared deviations of the measurements about their
mean μ:

  σ² = Σ(xᵢ - μ)² / N

• Definition The variance of a sample of n measurements is the sum of
the squared deviations of the measurements about their mean x̄,
divided by (n - 1):

  s² = Σ(xᵢ - x̄)² / (n - 1)
The Variance
• Flower petals: 5, 7, 1, 2, 4

  x̄ = (5 + 7 + 1 + 2 + 4)/5 = 3.8

• The sample variance is:

  s² = Σ(xᵢ - x̄)²/(n - 1) = 22.8/4 = 5.7
The Standard Deviation
• In calculating the variance, we squared all of the
deviations, and in doing so changed the scale of
the measurements.
• To return this measure of variability to the original
units of measure, we use the standard deviation.
• Definition The standard deviation of a set of
measurements is equal to the positive square root
of the variance.
Population standard deviation: σ = √σ²
Sample standard deviation: s = √s²
Two Ways to Calculate
the Sample Variance
Use the Definition Formula:

  xᵢ      xᵢ - x̄     (xᵢ - x̄)²
   5        -4          16
  12         3           9
   6        -3           9
   8        -1           1
  14         5          25
Sum 45       0          60

  x̄ = 45/5 = 9

  s² = Σ(xᵢ - x̄)²/(n - 1) = 60/4 = 15

  s = √s² = √15 = 3.87
Two Ways to Calculate
the Sample Variance
Use the Computing Formula:

  xᵢ      xᵢ²
   5       25
  12      144
   6       36
   8       64
  14      196
Sum 45    465

  s² = [Σxᵢ² - (Σxᵢ)²/n] / (n - 1) = (465 - 45²/5)/4 = 60/4 = 15

  s = √s² = √15 = 3.87
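Both formulas can be checked with a short Python sketch (illustrative, using the petal counts from the table above).

```python
# Sample variance two ways: definition formula vs. computing formula.
import math

x = [5, 12, 6, 8, 14]
n = len(x)
x_bar = sum(x) / n                                     # 9.0

# Definition formula: sum of squared deviations / (n - 1)
s2_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Computing formula: (sum of x_i^2 - (sum of x_i)^2 / n) / (n - 1)
s2_comp = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

print(s2_def, s2_comp)        # 15.0 15.0
print(math.sqrt(s2_def))      # 3.87... (the sample standard deviation)
```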
Some Notes
• The value of s is always greater than or equal to zero.
• The larger the value of s2 or s, the greater the variability of the
data set.
• If s2 or s is equal to zero, all the measurements must have the
same value.
• In order to measure the variability in the same units as the original
observations, we compute the standard deviation.
• Why divide by n - 1?
– The sample standard deviation s is often used to estimate the
population standard deviation σ. Dividing by n - 1 gives us a better
estimate of σ.

Using Measures of Center and
Spread: Tchebysheff’s Theorem
Given a number k greater than or equal to 1 and a set of n
measurements, at least 1 - (1/k²) of the measurements will lie within
k standard deviations of the mean.
 Can be used either for samples (x̄ and s) or for a population (μ and σ).
Important results:
 If k = 1, at least 1 - 1/1² = 0 (none) of the measurements are
guaranteed to be within 1 standard deviation of the mean.
 If k = 2, at least 1 - 1/2² = 3/4 (75%) of the measurements are
within 2 standard deviations of the mean.
 If k = 3, at least 1 - 1/3² = 8/9 (88.88%) of the measurements are
within 3 standard deviations of the mean.
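A small Python sketch (illustrative; the data values are arbitrary, not from the text) showing how actual proportions compare with the Tchebysheff bound.

```python
# Compare the observed proportion within k standard deviations of the
# mean with Tchebysheff's lower bound 1 - 1/k^2.
import statistics

data = [5, 12, 6, 8, 14, 7, 9, 3, 20, 6]   # arbitrary example measurements
x_bar = statistics.mean(data)
s = statistics.stdev(data)                  # sample standard deviation

for k in (1, 2, 3):
    inside = sum(1 for x in data if abs(x - x_bar) <= k * s)
    bound = 1 - 1 / k ** 2
    print(f"k={k}: actual {inside / len(data):.2f}, bound {bound:.2f}")
```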
Illustrating Tchebysheff’s Theorem
• Tchebysheff’s Theorem applies to any set of measurements and
can be used to describe either a sample or a population.
• Since Tchebysheff’s Theorem applies to any distribution, it is very
conservative, hence the phrase “at least 1 - (1/k²)” in its statement.

Example: Tchebysheff’s Theorem
• Although the first statement is not at all helpful, the other two
values of k provide valuable information about the proportion of
measurements that fall in certain intervals.
• The values k = 2 and k = 3 are not the only ones that can be used;
for example, k = 2.5 also works.

Empirical Rule
• Another rule for describing the variability of a data set
does not work for all data sets, but it does work very
well for data that “pile up” in the familiar mound shape.
• The closer your data distribution is to the mound-
shaped (normal distribution) curve, the more accurate
the rule will be.
• Since mound-shaped data
distributions occur quite
frequently in nature, the rule
can often be used in practical
applications (Empirical
Rule).
Using Measures of
Center and Spread:
The Empirical Rule
Given a distribution of measurements
that is approximately mound-shaped:
The interval μ ± σ contains approximately 68% of the measurements.
The interval μ ± 2σ contains approximately 95% of the measurements.
The interval μ ± 3σ contains approximately 99.7% of the measurements.
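A quick simulation sketch (hypothetical parameters: mean 100, standard deviation 15) showing how closely mound-shaped data follow the rule; it is illustrative and not from the text.

```python
# Check the Empirical Rule against simulated, roughly normal data.
import random
import statistics

random.seed(1)
data = [random.gauss(100, 15) for _ in range(10_000)]
mu, sigma = statistics.mean(data), statistics.stdev(data)

for k, target in ((1, 0.68), (2, 0.95), (3, 0.997)):
    frac = sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)
    print(f"within {k} std dev: {frac:.3f} (rule says about {target})")
```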
Example 2.7

Copyright ©2006 Brooks/Cole


A division of Thomson Learning, Inc.
Example
The ages of 50 tenured faculty at a
state university.
• 34 48 70 63 52 52 35 50 37 43 53 43 52 44
• 42 31 36 48 43 26 58 62 49 34 48 53 39 45
• 34 59 34 66 40 59 36 41 35 36 62 34 38 28
• 43 50 30 43 32 44 58 53
x̄ = 44.9

s = 10.73

Shape? Skewed right.

[Relative frequency histogram of the ages, with class boundaries at
25, 33, 41, 49, 57, 65, and 73.]
Age          Frequency   Relative Frequency   Percent
25 to < 33       5           5/50 = .10         10%
33 to < 41      14          14/50 = .28         28%
41 to < 49      13          13/50 = .26         26%
49 to < 57       9           9/50 = .18         18%
57 to < 65       7           7/50 = .14         14%
65 to < 73       2           2/50 = .04          4%

• We choose to use 6 intervals.


• Minimum class width = (70 – 26)/6 = 7.33
• Convenient class width = 8
• Use 6 classes of length 8, starting at 25.
k x ks Interval Proportion Tchebysheff Empirical
in Interval Rule
1 44.9 10.73 34.17 to 55.63 31/50 (.62) At least 0  .68
2 44.9 21.46 23.44 to 66.36 49/50 (.98) At least .75  .95
3 44.9 32.19 12.71 to 77.09 50/50 (1.00) At least .89  .997

•Yes. Tchebysheff’s
•Do the actual proportions in the three
Theorem must be
intervals agree with those given by true for any data
Tchebysheff’s Theorem? set.
•Do they agree with the Empirical •No. Not very well.
Rule?
•The data distribution is not very
•Why or why not? mound-shaped, but skewed right.

Example 2.8 from Book


Example
•The length of time for a worker to
complete a specified operation averages
12.8 minutes with a standard deviation of 1.7 minutes.

• If the distribution of times is approximately mound-shaped, what
proportion of workers will take longer than 16.2 minutes to complete
the task?

95% of the times lie between 9.4 and 16.2 minutes (x̄ ± 2s = 12.8 ± 3.4),
so 47.5% lie between 12.8 and 16.2, and
(50 - 47.5)% = 2.5% of workers will take longer than 16.2 minutes.
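Under the stronger assumption that the times are approximately normal, the same answer can be checked with Python's statistics.NormalDist (an illustrative sketch, not part of the slides).

```python
# Proportion of completion times above 16.2 minutes under a normal model.
from statistics import NormalDist

times = NormalDist(mu=12.8, sigma=1.7)
p_longer = 1 - times.cdf(16.2)
print(round(p_longer, 3))   # about 0.023, close to the 2.5% rule-of-thumb answer
```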
Using Tchebysheff’s Theorem and
Empirical Rule
• Tchebysheff’s Theorem applies to any set of
measurements—sample or population, large or small,
mound-shaped or skewed.
• Tchebysheff’s Theorem gives a lower bound to the fraction of
measurements to be found in an interval constructed as x̄ ± ks. At
least 1 - (1/k²) of the measurements will fall into this interval, and
probably more!
• The Empirical Rule is a “rule of thumb” that can be
used as a descriptive tool only when the data tend to be
roughly mound-shaped (the data tend to pile up near
the center of the distribution).
Using Tchebysheff’s Theorem and
Empirical Rule
• When you use these two tools for describing a set of
measurements, Tchebysheff’s Theorem will always be
satisfied, but it is a very conservative estimate of the
fraction of measurements that fall into a particular
interval.
• If it is appropriate to use the Empirical Rule (mound-
shaped data), this rule will give you a more accurate
estimate of the fraction of measurements that fall into
the interval.

Approximating s
• Tchebysheff’s Theorem and the Empirical Rule
can be used to detect gross errors in the
calculation of s.
• Most measurements (at least 75%, and roughly 95% for mound-shaped
data) lie within two standard deviations of their mean.
• This implies that the total range of the measurements, from smallest
to largest, should be somewhere around four standard deviations.

Approximating s
• From Tchebysheff’s Theorem and the Empirical
Rule, we know that
R  4-6 s
• To approximate the standard deviation of a set of
measurements, we can use:
s  R/4
or s  R / 6 for a large data set.
• This is, of course, a very rough approximation, but
it can be very useful in checking for large errors in
your calculation of s. Copyright ©2006 Brooks/Cole
A division of Thomson Learning, Inc.
Approximating s
The ages of 50 tenured faculty at a
state university.
• 34 48 70 63 52 52 35 50 37 43 53 43 52 44
• 42 31 36 48 43 26 58 62 49 34 48 53 39 45
• 34 59 34 66 40 59 36 41 35 36 62 34 38 28
• 43 50 30 43 32 44 58 53
R = 70 – 26 = 44
s  R / 4  44 / 4  11

Actual s = 10.73
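The same check in a short Python sketch (illustrative), using the 50 ages listed above.

```python
# Range approximation s ~ R/4 versus the actual sample standard deviation.
import statistics

ages = [34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
        42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
        34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
        43, 50, 30, 43, 32, 44, 58, 53]

R = max(ages) - min(ages)                  # 70 - 26 = 44
print(R / 4)                               # 11.0, the rough approximation
print(round(statistics.stdev(ages), 2))    # 10.73, the actual sample s
```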
Approximating s
• The range approximation is not intended to provide
an accurate value for s.
• Rather, its purpose is to detect gross errors in
calculating, such as the failure to divide the sum of
squares of deviations by (n - 1) or the failure to take
the square root of s2.
• For one of these mistakes, the answer will be many
times larger than the range approximation of s.
• The range approximation for s can be improved if it
is known that the sample is drawn from a mound-
shaped distribution of data.
Approximating s
• The range for a sample of n measurements will
depend on the sample size, n. For larger values of n,
a larger range of the x values is expected.
• Divisor for the Range Approximation of s
Number of Measurements    Expected Ratio of Range to s
         5                          2.5
        10                          3
        25                          4
    50 or more                      6

Measures of Relative Standing
• Where does one particular measurement stand in
relation to the other measurements in the data set?
• How does your score of 30 compare to the scores of the other
students in the class?
• How many standard deviations away from the
mean does the measurement lie? This is measured
by the z-score.
• Definition The sample z-score is a measure of relative standing
defined by

  z-score = (x - x̄)/s

• Example: Suppose x̄ = 5 and s = 2. Then x = 9 lies z = (9 - 5)/2 = 2
standard deviations above the mean.
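A minimal Python sketch of the definition (the second call uses made-up values for the exam-score question raised earlier).

```python
# z-score: how many standard deviations x lies from the mean.
def z_score(x, x_bar, s):
    return (x - x_bar) / s

print(z_score(9, x_bar=5, s=2))    # 2.0: x = 9 lies 2 std deviations above the mean
print(z_score(30, x_bar=25, s=4))  # 1.25 (hypothetical class mean and std deviation)
```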
z-Scores
• From Tchebysheff’s Theorem and the Empirical Rule
– At least 3/4 (75%) and more likely 95% of measurements lie
within 2 standard deviations of the mean.
– At least 8/9 (88.88%) and more likely 99.7% of measurements lie
within 3 standard deviations of the mean.
• For mound-shaped data, z-scores between –2 and 2 are not unusual. z-
scores exceeding 2 in absolute value are considered somewhat
unlikely. z-scores larger than 3 in absolute value would indicate a
possible outlier. (Example 2.11 from Book)

[Number line of z from -3 to 3: values with |z| ≤ 2 are not unusual,
values with 2 < |z| ≤ 3 are somewhat unusual, and values beyond ±3 are
possible outliers.]
Measures of Relative Standing
• A percentile is another measure of relative standing
and is most often used for large data sets.
(Percentiles are not very useful for small data sets.)
• Definition A set of n measurements on the variable
x has been arranged in order of magnitude. The pth
percentile is the value of x that is greater than p% of
the measurements and is less than the remaining
(100 - p)%.

Examples
• 90% of all men (16 and older) earn more than $319 per week (Bureau
of Labor Statistics). Since only 10% of weekly earnings fall below it,
$319 is the 10th percentile.

• 50th percentile = Median
• 25th percentile = Lower quartile (Q1)
• 75th percentile = Upper quartile (Q3)
Quartiles and the IQR
• Definition A set of n measurements on the variable x
has been arranged in order of magnitude.
• The lower quartile (Q1) is the value of x which is
larger than 25% and less than 75% of the ordered
measurements.
• The upper quartile (Q3) is the value of x which is
larger than 75% and less than 25% of the ordered
measurements.
• Definition The interquartile range (IQR) for a set
of measurements is the difference between the upper
and lower quartiles; that is,
IQR = Q3 – Q1
Calculating Sample Quartiles
• The lower and upper quartiles (Q1 and Q3), can be
calculated as follows:
• The position of Q1 is .25(n + 1)
• The position of Q3 is .75(n + 1)
once the measurements have been ordered. If the
positions are not integers, find the quartiles by
interpolation, using the values of two adjacent
positions.

Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25

Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or
Q1 = 65 + .75(65 - 65) = 65.

Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25

Q3 is 1/4 of the way between the 14th and 15th ordered measurements, or
Q3 = 74 + .25(75 - 74) = 74.25
and
IQR = Q3 – Q1 = 74.25 - 65 = 9.25
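A Python sketch of this position-based rule applied to the shoe prices (illustrative; note that library routines such as numpy.percentile use a different default interpolation convention, so their answers can differ slightly from this textbook rule).

```python
# Quartiles by the .25(n + 1) / .75(n + 1) position rule with interpolation.
def quartile(data, p):
    xs = sorted(data)
    pos = p * (len(xs) + 1)              # e.g. .25(n + 1) for Q1
    lo, frac = int(pos), pos - int(pos)
    if frac == 0:                        # whole-number position: no interpolation
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

prices = [40, 60, 65, 65, 65, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]

q1 = quartile(prices, 0.25)    # 65.0
q3 = quartile(prices, 0.75)    # 74.25
print(q1, q3, q3 - q1)         # 65.0 74.25 9.25
```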
Using Measures of Center and
Spread: The Box Plot
The Five-Number Summary:
Min Q1 Median Q3 Max
• Divides the data into 4 sets containing an equal
number of measurements.
• A quick summary of the data distribution.
• Use to form a box plot to describe the shape of the distribution and
to detect outliers.
• From the box plot, you can quickly detect any skewness in the shape
of the distribution and see whether there are any outliers in the
data set.
Outliers: Not Always Unwanted
• An outlier may result from transposing digits when
recording a measurement, from incorrectly reading an
instrument dial, from a malfunctioning piece of equipment,
or from other problems.
• A data set may also contain one or more valid measurements that, for
one reason or another, differ markedly from the others in the set.
• These outliers can cause a marked distortion in commonly used
numerical measures such as x̄ and s.
• In fact, outliers may themselves contain important
information not shared with the other measurements in the
set. Therefore, isolating outliers, if they are present, is an
important step in any preliminary analysis of a data set.
• The box plot is designed expressly for this purpose.
Constructing a Box Plot
Calculate Q1, the median (Q2), Q3, and the IQR for the data set.
Draw a horizontal line to represent the scale of
measurement.
Form a box just above the horizontal line with the right
and left ends at Q3 and Q1.
Draw vertical lines through the box at the locations of Q1, the
median, and Q3.

[Diagram: a box extending from Q1 to Q3 with a vertical line at the
median m.]
Constructing a Box Plot
Isolate outliers by calculating
Lower fence: Q1 - 1.5(IQR)
Upper fence: Q3 + 1.5(IQR)
Measurements beyond the upper or lower fence are
outliers and are marked (*).

[Diagram: the same box plot with a measurement beyond the upper fence
marked with an asterisk (*).]

Constructing a Box Plot
Extend horizontal lines called “whiskers” from the
ends of the box to the smallest and largest
observations that are not outliers.

[Diagram: box plot with whiskers extending to the smallest and largest
observations that are not outliers; the outlier remains marked with *.]

Example 2.14
Amount of sodium in 8 brands of cheese:
260 290 300 320 330 340 340 520

Q1 = 292.5 m = 325 Q3 = 340

Example
IQR = 340-292.5 = 47.5
Lower fence = 292.5-1.5(47.5) = 221.25
Upper fence = 340 + 1.5(47.5) = 411.25
Outlier: x = 520
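The fence calculation can be reproduced with a short Python sketch (illustrative; matplotlib's pyplot.boxplot draws the plot itself and, by default, uses this same 1.5 × IQR convention for its whiskers).

```python
# Fences and outlier detection for the sodium example (values from the slide).
sodium = [260, 290, 300, 320, 330, 340, 340, 520]

q1, m, q3 = 292.5, 325, 340      # quartiles from the .25(n+1) / .75(n+1) rule
iqr = q3 - q1                     # 47.5
lower_fence = q1 - 1.5 * iqr      # 221.25
upper_fence = q3 + 1.5 * iqr      # 411.25

outliers = [x for x in sodium if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)   # 221.25 411.25 [520]
```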

Interpreting Box Plots
Box plot can be used to describe the shape of a data
distribution by looking at the position of the median line
compared to Q1 and Q3.
Median line in center of box and whiskers of equal
length—symmetric distribution
Median line left of center and long right whisker—
skewed right
Median line right of center and long left whisker—
skewed left
For most skewed distributions, the whisker on the
skewed side of the box tends to be longer than the
whisker on the other side.

Key Concepts
I. Measures of Center of a Data Distribution
1. Arithmetic mean (mean) or average
   a. Population: μ
   b. Sample of size n: x̄ = (Σxᵢ)/n
2. Median: position of the median = .5(n + 1)
3. Mode
4. The median may be preferred to the mean if the data are highly
   skewed.
II. Measures of Variability
1. Range: R = largest - smallest

Key Concepts
2. Variance
   a. Population of N measurements: σ² = Σ(xᵢ - μ)²/N
   b. Sample of n measurements:
      s² = Σ(xᵢ - x̄)²/(n - 1) = [Σxᵢ² - (Σxᵢ)²/n]/(n - 1)
3. Standard deviation
   Population standard deviation: σ = √σ²
   Sample standard deviation: s = √s²
4. A rough approximation for s can be calculated as s ≈ R/4.
   The divisor can be adjusted depending on the sample size.
Key Concepts
III. Tchebysheff’s Theorem and the Empirical Rule
1. Use Tchebysheff’s Theorem for any data set, regardless of
its shape or size.
a. At least 1 - (1/k²) of the measurements lie within k standard
deviations of the mean.
b. This is only a lower bound; there may be more
measurements in the interval.
2. The Empirical Rule can be used only for relatively mound-
shaped data sets.
– Approximately 68%, 95%, and 99.7% of the measurements
are within one, two, and three standard deviations of the
mean, respectively.
Key Concepts
IV. Measures of Relative Standing
1. Sample z-score: z = (x - x̄)/s
2. pth percentile: p% of the measurements are smaller, and (100 - p)%
   are larger.
3. Lower quartile, Q1; position of Q1 = .25(n + 1)
4. Upper quartile, Q3; position of Q3 = .75(n + 1)
5. Interquartile range: IQR = Q3 - Q1
V. Box Plots
1. The five-number summary:
Min Q1 Median Q3 Max
One-fourth of the measurements in the data set lie between each
of the four adjacent pairs of numbers.

Key Concepts
2. Box plots are used for detecting outliers and shapes of
distributions.
3. Q1 and Q3 form the ends of the box. The median line is in the
interior of the box.
4. Upper and lower fences are used to find outliers.
a. Lower fence: Q 1 1.5(IQR)
b. Outer fences: Q 3 1.5(IQR)
5. Outliers are marked on the box plot with an asterisk (*).
6. Whiskers are connected to the smallest and largest
measurements that are not outliers.
7. Skewed distributions usually have a long whisker in the
direction of the skewness, and the median line is drawn away
from the direction of the skewness.
