Lecture Notes Chapter 5

Stats Supplemental Lecture Notes for Chapter 5

What’s the main idea?

Statisticians commonly find a theoretical model that (approximately) fits their real data. Normal models
are very commonly used and can estimate the proportion of real values that will fall in a certain interval
(to get the actual proportion, you only need to count the number of the data values amongst the set of data
which fall within the specified interval, and then divide this by the total number of data values in the data
set). To use a Normal model, we must know its mean (µ) and standard deviation (σ). Normal
distributions arise naturally in certain physical phenomena (like heights and manufacturing errors).
Standardized tests are often scored so that the results follow a Normal model.

Why are there two symbols for mean and standard deviation?

x and y are typically used to represent variables in algebra and calculus courses. The same is true when
naming generic statistics symbols. But, typically, statisticians use entire words as variable names; for
example, PRICE, height, Salary. When working with a variable named x, there are two symbols for mean
(x̄ and µ) and two symbols for standard deviation (s and σ), but they are not interchangeable. The
symbols x̄ and s are used for the mean and standard deviation (s.d.) of a data set, an actual list of data
values gathered from a real study or experiment. The values of x̄ and s can be calculated by crunching
numbers (with a formula or a statistical function). The symbols µ and σ are used for the mean and s.d. in
two separate instances: (1) to represent the mean and standard deviation of a population’s data set (and
not just a subset of a population, called a sample), and (2) to represent the mean and standard deviation
of a theoretical mathematical model. In this latter case, the values of µ and σ may come from experience
and judgment, but also (as in the case of some standardized tests) may be predetermined so that the data
can be scaled to have the desired mean and standard deviation.

How does rescaling or standardizing affect summary statistics of a data set?

Let’s examine this by looking at classified ads for used Toyota Corollas. In recent classified ads, there are
twelve Corollas for sale.

Model Year   Model   Transmission   Mileage   Price, $
2006         S       Automatic      5000      14,000
2005         XRS                    lo
2003         LE      Auto           very lo   12,980
2003         LE      Auto           39000     12,980
2000         CE      Auto           70K       6900
1999                 5spd                     3650
1999         CE      Auto           60K       6300
1997                 Auto                     5850
1991                 Auto                     1050
1989                 Auto           200K      1200
1988                                          1750
1986                 5spd                     1500

Ignoring the 2005 Corolla, the mean price for used Corollas is about $6196 and the standard deviation of
the prices is about $5,033. The five-number summary is $1050, $1500, $5850, $12980, $14000. The
range is $12,950.
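
The summary values above can be rechecked with a short script. This is only a sketch in Python (a tool these notes do not otherwise use), with illustrative variable names. It assumes the sample standard deviation and the default "exclusive" quartile rule of `statistics.quantiles`, which matches the quartiles quoted above for this data set; other software may use a different quartile rule.

```python
import statistics

# Listed prices of the eleven Corollas that have a price (the 2005 car is omitted).
prices = [14000, 12980, 12980, 6900, 3650, 6300, 5850, 1050, 1200, 1750, 1500]

mean_price = statistics.mean(prices)
sd_price = statistics.stdev(prices)                 # sample s.d. (divides by n - 1)
q1, median, q3 = statistics.quantiles(prices, n=4)  # default method is "exclusive"
five_number = (min(prices), q1, median, q3, max(prices))
price_range = max(prices) - min(prices)

print(f"mean = {mean_price:.2f}, s.d. = {sd_price:.2f}")
print("five-number summary:", five_number)
print("range:", price_range)
```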

Questions:
(A) How would the above results change if all the prices increased by $1000?
(B) How would the above results change if all the prices decreased by 40%?
(C) How would the above results change if we standardized all the prices?

The table below shows how Corolla prices would change under each condition. The row for the 2005
Corolla with no listed price has been omitted.

Model Year   Original Price, $   Increased Price, $ (A)   Decreased Price, $ (B)   Standardized Price (C)
2006         14,000              15,000                   8400                     1.5505
2003         12,980              13,980                   7788                     1.3478
2003         12,980              13,980                   7788                     1.3478
2000         6900                7900                     4140                     0.1398
1999         3650                4650                     2190                     -0.5059
1999         6300                7300                     3780                     0.02059
1997         5850                6850                     3510                     -0.0688
1991         1050                2050                     630                      -1.023
1989         1200                2200                     720                      -0.9927
1988         1750                2750                     1050                     -0.8834
1986         1500                2500                     900                      -0.9331

For the increased prices (A), the mean and all the prices in the five-number summary increased by $1000.
However, the standard deviation didn’t change! Neither did the range, since we added $1000 to both the
min and the max. We translated the prices up by $1000 each, but didn’t change the spread of the data set.
Adding or subtracting a constant to every value in a data set affects measures of position (mean, median,
quartiles) but not measures of spread (range, IQR, standard deviation). The boxplot for the increased
prices is the same length as the boxplot for the original prices.

The decreased prices (B) are 60% of the original prices, i.e. decreased price = 0.60*(original price).
Multiplying every value by 0.60 makes decreased prices less spread out than the original prices. The
mean, median, and quartiles of the decreased prices are 60% of the statistics for the original prices, but so
are the range, IQR, and standard deviation. Multiplying or dividing every value in a data set affects both
measures of position (mean, median, quartiles) and measures of spread (range, IQR, standard deviation).
The boxplot for the decreased prices is 60% as long as the boxplot for the original prices.
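
The effect of adding a constant (A) and of multiplying by a constant (B) can be confirmed numerically. The sketch below is in Python, and the helper name `summarize` is purely illustrative. It compares the mean, sample standard deviation, and range before and after each change.

```python
import statistics

prices = [14000, 12980, 12980, 6900, 3650, 6300, 5850, 1050, 1200, 1750, 1500]
increased = [p + 1000 for p in prices]   # (A) every price goes up by $1000
decreased = [0.60 * p for p in prices]   # (B) every price drops by 40%

def summarize(values):
    """Return (mean, sample s.d., range) for a list of numbers."""
    return (statistics.mean(values),
            statistics.stdev(values),
            max(values) - min(values))

orig_mean, orig_sd, orig_range = summarize(prices)
inc_mean, inc_sd, inc_range = summarize(increased)
dec_mean, dec_sd, dec_range = summarize(decreased)

print("original :", orig_mean, orig_sd, orig_range)
print("increased:", inc_mean, inc_sd, inc_range)   # mean shifts; s.d. and range do not
print("decreased:", dec_mean, dec_sd, dec_range)   # all three are 60% of the original
```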

The standardized results (C), or z-scores, look totally different from the original data. The units are no
longer dollars! Z-scores are actually unitless, but they measure standard deviations from the mean. The
car with a price of $12,980 has a cost that is about 1.35 standard deviations above the mean. The car with
a price of $5,850 has a cost that is 0.07 standard deviations below the mean. The mean of the
standardized prices is 0 and the standard deviation of the standardized prices is 1. Standardizing a data
set always produces a new data set with a mean of 0 and a standard deviation of 1.

According to my statistical software, the mean of the standardized prices is about 7.2×10^-7 or 0.00000072.
The real mean is 0; since we rounded the mean and s.d. of the original prices we have slight round-off
error in our calculation. Similarly, the standard deviation of the standardized prices is 1 (not
0.999999457).
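
Here is a minimal Python sketch of the standardizing step (variable names are illustrative). It computes z-scores from the unrounded mean and s.d., then repeats the calculation with the rounded values $6196 and $5033 to show how rounding nudges the mean of the z-scores away from 0. The exact size of that round-off depends on how much the inputs were rounded, so it need not match the figure reported by any particular software.

```python
import statistics

prices = [14000, 12980, 12980, 6900, 3650, 6300, 5850, 1050, 1200, 1750, 1500]

mean_price = statistics.mean(prices)
sd_price = statistics.stdev(prices)

# z-score = (value - mean) / s.d., using unrounded summary statistics.
z_scores = [(p - mean_price) / sd_price for p in prices]
z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)

# Same calculation, but with the mean and s.d. rounded to whole dollars.
z_rounded = [(p - 6196) / 5033 for p in prices]
z_rounded_mean = statistics.mean(z_rounded)

print("mean and s.d. of z-scores:", z_mean, z_sd)
print("mean of z-scores from rounded inputs:", z_rounded_mean)
```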

How does standardizing allow useful comparisons of data from different distributions?

In today’s want ads, there are twenty-four Toyota Camrys for sale (see chapter 3 lecture notes). The
prices have a mean of about $11905 and a SD of about $6523. One 2003 Camry cost $14350; is this
Camry relatively more or less pricey than the 2003 Corolla that cost $12,980? On an absolute scale the
Camry is clearly more expensive, but z-scores allow us to make relative judgments. The z-score for the
Camry price is about 0.37. The z-score for the Corolla price is about 1.35. The Corolla is much more
expensive relative to the other listed Corollas than the Camry is to other listed Camrys. (To be truly fair,
we would have to account for the fact that the ages of listed Corollas and Camrys have different
distributions.)
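
The comparison above comes down to two small calculations. The Python sketch below uses the rounded means and standard deviations quoted in these notes; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

camry_z = z_score(14350, mean=11905, sd=6523)    # 2003 Camry among listed Camrys
corolla_z = z_score(12980, mean=6196, sd=5033)   # 2003 Corolla among listed Corollas

print(f"Camry z-score:   {camry_z:.2f}")
print(f"Corolla z-score: {corolla_z:.2f}")
print("Corolla is relatively pricier:", corolla_z > camry_z)
```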

What intuition should I have about z-scores (also called standard scores)?

The 68-95-99.7 Rule gives you an idea of how likely a score from a Normal model is to be one, two, or
three standard deviations from the mean. A different way of phrasing the rule is that, in a Normal model,
about 32% of values are more than one standard deviation from the mean, about 5% of values are more
than two standard deviations from the mean, and about 0.3% of values are more than three standard deviations
from the mean.

This includes an equal proportion of values that are lower than the mean and higher than the mean. For
example, in a Normal model, about 16% of values are higher than one standard deviation above the mean
(z-score > 1) and about 16% of values are lower than one standard deviation below the mean (z-score <
-1). About 2.5% of values in a Normal model are higher than two standard deviations above the mean (z-
score > 2) and about 0.15% (only 15 in 10,000) are lower than three standard deviations below the mean
(z-score < -3). Think very carefully about the graph of the Normal model on page 130.
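
The percentages in the Rule are rounded areas under the standard Normal curve. As a sketch, Python's standard-library `statistics.NormalDist` (not one of the tools named in these notes) can reproduce them; the helper name `within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a Normal model lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    inside = within(k)
    print(f"within {k} s.d.: {inside:.4f}   outside: {1 - inside:.4f}")

upper_tail_1 = 1 - standard_normal.cdf(1)   # z-score > 1
lower_tail_3 = standard_normal.cdf(-3)      # z-score < -3
print(f"P(z > 1) = {upper_tail_1:.4f},  P(z < -3) = {lower_tail_3:.5f}")
```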

The 68-95-99.7 Rule only applies to Normal models. The more skewed the distribution, the less the 68-
95-99.7 Rule fits. However, even though the numbers from the Rule don’t apply to non-Normal
distributions, the concept generally does. A value one standard deviation from the mean is common. A
value two standard deviations from the mean is unusual. A value three standard deviations from the mean
is exceptionally rare. In practice, for lots of real data sets (not models) we will analyze, all the data will
fall within two standard deviations of the mean.

Even if a Normal model is a good fit for a data set, do not expect the actual data to exactly follow the 68-
95-99.7 Rule. If you want to know the percent of actual data that fall within one standard deviation of the
mean, calculate the mean and s.d., use them to determine the interval from one s.d. below the mean to one
s.d. above the mean, then count the number of data values in that interval. Counting is a great technique
for analyzing data!
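
The counting technique just described can be sketched in Python. The Corolla prices from earlier in these notes are used here purely as an illustration, and the variable names are arbitrary.

```python
import statistics

prices = [14000, 12980, 12980, 6900, 3650, 6300, 5850, 1050, 1200, 1750, 1500]

mean_price = statistics.mean(prices)
sd_price = statistics.stdev(prices)

# Interval from one s.d. below the mean to one s.d. above the mean.
low, high = mean_price - sd_price, mean_price + sd_price

# Count the data values that fall in the interval, then divide by n.
count_inside = sum(1 for p in prices if low <= p <= high)
proportion = count_inside / len(prices)

print(f"interval: {low:.2f} to {high:.2f}")
print(f"{count_inside} of {len(prices)} prices ({proportion:.1%}) are within one s.d. of the mean")
```

The actual proportion for this small data set is not exactly 68%, which is the point made above.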

How can we calculate using a Normal model?


The equation for the graph of a Normal model with mean µ and standard deviation σ is
y = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²)), where e (≈ 2.71828) is the base of the natural logarithm.
Finding areas under a Normal curve requires a definite integral, i.e. the proportion of scores between a
and b can be expressed as the integral from a to b of (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²)) dx. Unfortunately, such an
integral is usually impossible to evaluate exactly so we must use a numerical integration technique to
approximate the value of the integral. Luckily, such a technique is built into statistical technology such as
StatCrunch, the TI-84, Excel, etc.
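
To make the idea of numerical integration concrete, here is a minimal Python sketch that approximates the area under a standard Normal curve between −1 and 1 with the trapezoid rule and compares it with the value from `statistics.NormalDist`. The trapezoid rule is just one simple technique chosen for illustration; the notes do not say which method StatCrunch, the TI-84, or Excel actually use, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the Normal curve with mean mu and s.d. sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def trapezoid_area(a, b, mu, sigma, steps=10_000):
    """Approximate the area under the Normal curve from a to b with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (normal_density(a, mu, sigma) + normal_density(b, mu, sigma))
    for i in range(1, steps):
        total += normal_density(a + i * width, mu, sigma)
    return total * width

approx = trapezoid_area(-1, 1, mu=0, sigma=1)
library_value = NormalDist(0, 1).cdf(1) - NormalDist(0, 1).cdf(-1)

print(f"trapezoid rule: {approx:.6f}")
print(f"NormalDist:     {library_value:.6f}")
```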

IQ scores follow a Normal model with mean µ = 100 and standard deviation σ = 15. Below is a picture of
the Normal model for the distribution of IQ scores. The area under this curve (between −∞ and ∞) is
scaled to be equal to 1 (think of 100%). The key idea is that areas under the curve correspond to
percentages or proportions of the population.

What proportion of people has IQs between 85 and 115?

Since the mean is 100 and the s.d. is 15, this question asks for the proportion of people whose IQs are
within 1 s.d. of the mean. Since IQs follow a Normal model we can apply the 68-95-99.7 Rule. Hence,
about 68% of people have IQs between 85 and 115.

NOTE: Many books and/or videos will first convert the values in a problem involving a Normal
Distribution to z-scores even when they intend to use technology such as
https://homepage.stat.uiowa.edu/~mbognar/applets/normal.html or
https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html or StatCrunch or Excel or a TI calculator, etc. to
perform the computations. Although this approach works, provided that no values are rounded until all
calculations are complete, it is not necessary for any problem where both the mean µ and standard
deviation σ are known and technology is used. If a Z-Table is to be used, then first converting to z-scores
is required. Until the invention of calculators and computers, using a Z-Table was the only alternative to
evaluating an integral (as described earlier in these notes) for finding an area under the Normal curve.

So, in this particular problem, you could (but it is not necessary) proceed as follows. Using z = (x − µ)/σ,
we see that x = 85 becomes z = (85 − 100)/15 = −1 and x = 115 becomes z = (115 − 100)/15 = 1.

Recall that the mean of the Standardized Normal model is µ_z = 0 and the standard deviation is σ_z = 1.

Using a table of Z values, such as the one in Appendix F of our textbook, we see that the area under the
Normal curve with mean 0 and standard deviation 1 between z = −1 and z = 1 is approximately
0.8413 − 0.1587 = 0.6826. For worked examples, see https://www.youtube.com/watch?v=awcD3xkV0XI.

Here’s a short video showing how to compute quantities involving Normal distributions using
StatCrunch: https://www.youtube.com/watch?v=jnVZpAAk6H0. Here’s another video illustrating the
use of StatCrunch in working with a Normal Distribution:
https://www.youtube.com/watch?v=mfF3LCajgjw&t=0s&list=PLBE055F65E43B4973&index=73. And another:
https://www.youtube.com/watch?v=rM8m95Mvrfs&t=0s&list=PLBE055F65E43B4973&index=74.

Using the Normal Distribution Calculator at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html
with 0 in the box next to “Mean”, 1 in the box next to “Standard deviation,” and -1 & 1 in the boxes
associated with the third selection within the “Calculate” options, we have:

Thus, about 68.2689% of people have IQs between 85 and 115.

We can compute this same value without having to first convert to z-scores if we use technology instead
of a Z Table.

Using the Normal Distribution Calculator at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html
with 100 in the box next to “Mean”, 15 in the box next to “Standard deviation,” and 85 & 115 in the
boxes associated with the third selection within the “Calculate” options, we can compute the requested
percentage by having technology determine the value of P(85 < x < 115):

[Figure: calculator output; horizontal axis: x, IQ Scores]
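
The same comparison can be sketched in Python with the standard-library `statistics.NormalDist`, used here as a stand-in for the calculators named in these notes. The area is computed once directly from µ = 100 and σ = 15 and once after converting to z-scores; variable names are illustrative.

```python
from statistics import NormalDist

# Direct approach: use the IQ model itself.
iq = NormalDist(mu=100, sigma=15)
direct = iq.cdf(115) - iq.cdf(85)

# z-score approach: convert the endpoints, then use the standard Normal model.
z_low = (85 - 100) / 15
z_high = (115 - 100) / 15
standard = NormalDist()          # mean 0, s.d. 1 by default
via_z = standard.cdf(z_high) - standard.cdf(z_low)

print(f"direct: {direct:.6f}")
print(f"via z:  {via_z:.6f}")
```

Both routes give the same area because no intermediate value was rounded.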

What proportion of people has IQs less than 90?

The 68-95-99.7 Rule isn’t as helpful here, but technology such as StatCrunch, a TI calculator, an app,
etc. will give us a great approximation.

Using the Normal Distribution Calculator at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html
with 100 in the box next to “Mean”, 15 in the box next to “Standard deviation,” and 90 in the box
associated with the first selection within the “Calculate” options, we have:

[Figure: calculator output; horizontal axis: x, IQ Scores]

If the technology you are using requires that a lower bound be entered, then use −∞. Typically,
calculators and computers don’t have negative (or positive) infinity on them so we use a very large negative
number, like -10^99. Be sure to use the negative sign and not the subtraction sign. Note that another way
to enter -10^99.
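
As a sketch of this calculation in Python (standard-library `statistics.NormalDist`, not one of the tools named above), the proportion of IQs below 90 is the cumulative area up to 90. The second calculation imitates a tool that insists on a lower bound by using -1e99, Python's way of writing −1×10^99.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Area to the left of 90 (no lower bound needed).
p_below_90 = iq.cdf(90)

# Same area written as "upper minus lower", with a huge negative lower bound.
p_with_bound = iq.cdf(90) - iq.cdf(-1e99)

print(f"P(x < 90)           = {p_below_90:.4f}")
print(f"P(-1e99 < x < 90)   = {p_with_bound:.4f}")
```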

What is the 70th percentile for IQ scores?

Now we are given a proportion, and want to work backward to find a data value (IQ score). We are
looking for the IQ that separates the bottom 70% of IQ scores from the top 30%. This requires a new tool
for the Normal distribution.

If you opt to use a Z-Table, you will need to work with the formula z = (x − µ)/σ, but with the quantities
rearranged so that you are finding a value of x. Algebraically, we see that
z = (x − µ)/σ ⇒ zσ = x − µ ⇒ zσ + µ = x. After looking up the value of z that’s associated with the 70th
percentile (~0.52), substitute this approximation of z’s value and the values of µ and σ into x = zσ + µ to
find the requested 70th percentile for IQ scores: x = (0.52)(15) + 100 = 107.8.

If, instead, you choose to use technology, such as the online calculator at
https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html or StatCrunch or a TI-84 calculator, etc., you may
not have to work with a z-score. For example, in this problem, assuming that the values of 100 and 15 are
still in the fields for the mean and standard deviation, select the appropriate function within the
“Inverse probability” list and enter 0.70.

[Figure: calculator output; horizontal axis: x, IQ Scores]

Thus, the 70th percentile for IQ scores is approximately 107.8660.
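
Both routes to the 70th percentile can be sketched in Python. `NormalDist.inv_cdf` from the standard library plays the role of the "inverse probability" function here, as an assumed stand-in for the tools named above; variable names are illustrative.

```python
from statistics import NormalDist

# Technology route: invert the IQ model directly.
iq = NormalDist(mu=100, sigma=15)
percentile_70 = iq.inv_cdf(0.70)

# Z-Table route: find z for the 70th percentile, then use x = z*sigma + mu.
z_70 = NormalDist().inv_cdf(0.70)    # unrounded z-score
via_z = z_70 * 15 + 100              # no rounding of z
via_table = 0.52 * 15 + 100          # z rounded to 0.52, as read from a table

print(f"inverse on IQ model:  {percentile_70:.4f}")
print(f"z = {z_70:.4f}, x = {via_z:.4f}")
print(f"with table value 0.52: x = {via_table:.1f}")
```

The small gap between 107.8 and the technology answer comes entirely from rounding z to two decimal places.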

How can I tell whether a Normal model is a good fit for a data set?

To tell whether a Normal model is a good fit for a data set, construct a Normal probability plot. This is
difficult to do by hand, but easy with technology such as StatCrunch, a TI calculator, Excel, an app, etc.

Here’s an instructional video illustrating how to perform this task in StatCrunch:
https://www.youtube.com/watch?v=AmB4iK0ia1w.

Here’s an instructional video illustrating how to perform this task in Excel:
https://www.youtube.com/watch?v=FA1u2LtNmiI.

Here’s an instructional video illustrating how to perform this task on a TI-84 calculator:
https://www.youtube.com/watch?v=cZjKdm4TzDo.

We will construct a Normal probability plot of the Camry prices using the online app at
https://mathcracker.com/normal-probability-plot-maker.php (enter the data into the box, separating each
value with a comma, and then click on the yellow button).

27500 18950 19500 14500 15850 16950 15990 15900 13995 19500 14350 14400
11950 11999 8700 8300 8499 7250 4988 2500 3700 3790 3700 2950

Camry Prices (in $)

The more linear a Normal Probability Plot is, the more closely the data follow a Normal Distribution model
(i.e. the data’s distribution is unimodal, symmetric, and bell-shaped). Check out
https://www.youtube.com/watch?v=smJBsZ4YQZw for more details about interpreting Normal
Probability Plots.

Thus, for the Camry prices data set, the Normal probability plot does not look very linear, and so a Normal
Distribution would not be a good model for this small data set (n = 24).
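
For readers who want to see what such a plot is made of, here is a rough Python sketch that pairs each sorted Camry price with a theoretical standard Normal quantile. The plotting positions (i − 0.375)/(n + 0.25) are an assumption made for this sketch; the online app and other tools may use a different formula, and the function names are hypothetical. The correlation printed at the end is only a crude numerical summary of linearity; judging the plot by eye remains the method used in these notes.

```python
import math
from statistics import NormalDist, mean

camry_prices = [27500, 18950, 19500, 14500, 15850, 16950, 15990, 15900,
                13995, 19500, 14350, 14400, 11950, 11999, 8700, 8300,
                8499, 7250, 4988, 2500, 3700, 3790, 3700, 2950]

def normal_plot_points(data):
    """Pair each sorted data value with a theoretical standard Normal quantile."""
    n = len(data)
    ordered = sorted(data)
    # Assumed plotting positions; other software may use a different rule.
    quantiles = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def correlation(pairs):
    """Pearson correlation between the two coordinates of the plotted points."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

points = normal_plot_points(camry_prices)
r = correlation(points)

for quantile, price in points:
    print(f"{quantile:8.3f}  {price}")
print(f"correlation of plotted points: {r:.4f}")
```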

If you’re curious about the nitty-gritty details of the construction of a Normal probability plot, check
out https://www.youtube.com/watch?v=g5ef9QyDf30.

What if µ and σ are not both given in a problem involving the use of a Normal Distribution model?

Well, you won’t be able to use either the normalcdf or invNorm functions with the values of the variable
that’s known to be Normally Distributed since both of these calculator functions require the variable’s
mean µ and standard deviation σ.

When you aren’t given both µ and σ for a Normally Distributed variable (let’s call the variable x), you
will have to use the following facts: z = (x − µ)/σ, and the mean and standard deviation of the Standardized
Normal Distribution are µ_z = 0 and σ_z = 1, respectively.

Example: A machine is set to dispense 16.9 fluid ounces, on average, of water into a bottle. The actual
amount of water varies from bottle to bottle; the Normal model applies. What should the standard
deviation be to ensure that only 1% of the bottles receive more than 17.0 ounces of water? Report your
answer to the nearest hundredth of an ounce.

Let x represent the volume of water in a bottle (in ounces).

[Figure: Normal model for x with an area of 1% = 0.01 marked]

Although µ_x = 16.9 is known, σ_x is not. Consequently, we’ll have to use z = (x − µ_x)/σ_x to find the value
of σ_x. Via algebra, we see that σ_x = (x − µ_x)/z.

We are told that the 99th percentile is x = 17.0 ounces. Using a TI-84 calculator, we see that the z score
for the 99th percentile is invNorm(0.99, 0, 1) = 2.326347877 since µ_z = 0 and σ_z = 1 are always known
(by definition).

If instead, you choose to use the app at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html, then
with Mean = 0, Standard deviation = 1, and 0.99 for the first entry under the “Inverse probability” portion
of the “Calculate” choices, you obtain the same z-score.

The z-score corresponding to the 99th percentile is approximately 2.326347877 (from the TI) and
approximately 2.326348 (from the online app at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html):
the same value, just approximated to a different number of digits after the decimal point.

Hence, σ_x = (x − µ_x)/z = (17.0 − 16.9)/2.326347877 ≈ (17.0 − 16.9)/2.326348 ≈ 0.04.

Remember: Don’t round any intermediate calculations until all calculations are completed (the third step
shown in the above string of calculations is shown only as a reference for you to check your work).

Answer: σ ≈ 0.04 ounces (reported to the nearest hundredth of an ounce)
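
The same steps can be sketched in Python, with `NormalDist().inv_cdf` standing in for the TI-84's invNorm (an assumed equivalent for this illustration). The last lines check the answer by computing the proportion of bottles that would exceed 17.0 ounces with the unrounded σ.

```python
from statistics import NormalDist

target_mean = 16.9    # ounces, the machine's setting
cutoff = 17.0         # ounces, the 99th percentile

# z-score for the 99th percentile of the standard Normal model (mean 0, s.d. 1).
z_99 = NormalDist().inv_cdf(0.99)

# Rearranged z-score formula: sigma = (x - mu) / z. Round only at the end.
sigma = (cutoff - target_mean) / z_99

# Check: proportion of bottles above the cutoff under this model.
proportion_over = 1 - NormalDist(target_mean, sigma).cdf(cutoff)

print(f"z for 99th percentile: {z_99:.6f}")
print(f"sigma = {sigma:.6f}, reported as {sigma:.2f} ounces")
print(f"proportion over {cutoff} oz: {proportion_over:.4f}")
```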

Another Example: Find the mean weight of watermelons if the standard deviation of all watermelons’
weights is 3 ounces and it’s known that a watermelon weighing 8 pounds (128 ounces) is in the 85th
percentile. Assume watermelon weights are Normally distributed. Report your answer to the nearest
hundredth of an ounce.

Let x represent the weight of a watermelon (in ounces).

[Figure: Normal model for x with an area of 15% = 0.15 marked]

Although σ_x = 3 is known, µ_x is not. Consequently, we’ll have to use z = (x − µ_x)/σ_x to find
the value of µ_x. Via algebra, we see that µ_x = x − zσ_x.

We are told that the 85th percentile is x = 128 ounces. The z score for the 85th percentile, as determined
by a TI-84 calculator, is invNorm(0.85, 0, 1) = 1.03643338 since µ_z = 0 and σ_z = 1 are always known.

According to the app at https://s3-us-west-2.amazonaws.com/oervm/stats/probs.html, the z-score
corresponding to the 85th percentile is approximately 1.036433.

Hence, µ_x = x − zσ_x = 128 − (invNorm(0.85, 0, 1))(3) ≈ 128 − (1.03643338)(3) ≈ 124.89.

Remember: Don’t round any intermediate calculations until all calculations are completed (the third step
shown in the above string of calculations is shown only as a reference for you to check your work).

Answer: µ ≈ 124.89 ounces (reported to the nearest hundredth of an ounce)
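
A matching Python sketch for the watermelon example follows, again using `NormalDist().inv_cdf` as an assumed stand-in for invNorm. It solves µ = x − zσ and then checks that 128 ounces really is the 85th percentile of the resulting model.

```python
from statistics import NormalDist

sd_weight = 3         # ounces
weight_85th = 128     # ounces (8 pounds), the 85th percentile

# z-score for the 85th percentile of the standard Normal model.
z_85 = NormalDist().inv_cdf(0.85)

# Rearranged z-score formula: mu = x - z * sigma. Round only at the end.
mean_weight = weight_85th - z_85 * sd_weight

# Check: percentile rank of 128 ounces under the fitted model.
rank_of_128 = NormalDist(mean_weight, sd_weight).cdf(weight_85th)

print(f"z for 85th percentile: {z_85:.6f}")
print(f"mean = {mean_weight:.2f} ounces")
print(f"percentile rank of 128 oz: {rank_of_128:.4f}")
```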
