
Descriptive Statistics

One important use of descriptive statistics is to summarize a collection of data in a clear and understandable way. For example, assume a psychologist gave a personality test measuring shyness to all 2500 students attending a small college. How might these measurements be summarized? There are two basic methods: numerical and graphical. Using the numerical approach one might compute statistics such as the mean and standard deviation. These statistics convey information about the average degree of shyness and the degree to which people differ in shyness. Using the graphical approach one might create a stem and leaf display and a box plot. These plots contain detailed information about the distribution of shyness scores.

Graphical methods are better suited than numerical methods for identifying patterns in the data. Numerical approaches are more precise and objective.

Since the numerical and graphical approaches complement each other, it is wise to use both.
Inferential Statistics

Inferential statistics are used to draw inferences about a population from a sample. Consider an experiment in which 10 subjects who performed a task after 24 hours of sleep deprivation scored 12 points lower than 10 subjects who performed after a normal night's sleep. Is the difference real or could it be due to chance? How much larger could the real difference be than the 12 points found in the sample? These are the types of questions answered by inferential statistics.

There are two main methods used in inferential statistics: estimation and hypothesis testing. In estimation, the sample is used to estimate a parameter and a confidence interval about the estimate is constructed.

In the most common use of hypothesis testing, a "straw man" null hypothesis is put forward and it is determined whether the data are strong enough to reject it. For the sleep deprivation study, the null hypothesis would be that sleep deprivation has no effect on performance.
Variable (1 of 2)

A variable is any measured characteristic or attribute that differs for different subjects. For example, if the weight of 30 subjects were measured, then weight would be a variable.

Quantitative and Qualitative


Variables can be quantitative or qualitative. (Qualitative variables are sometimes called "categorical variables.") Quantitative variables are measured on an ordinal, interval, or ratio scale; qualitative variables are measured on a nominal scale. If five-year-old subjects were asked to name their favorite color, then the variable would be qualitative. If the time it took them to respond were measured, then the variable would be quantitative.

Independent and Dependent


When an experiment is conducted, some variables are manipulated by
the experimenter and others are measured from the subjects. The
former variables are called "independent variables" or "factors"
whereas the latter are called "dependent variables" or "dependent
measures."

Variable (2 of 2)

For example, consider a hypothetical experiment on the effect of drinking alcohol on reaction time: Subjects drank either water, one beer, three beers, or six beers and then had their reaction times to the onset of a stimulus measured. The independent variable would be the number of beers drunk (0, 1, 3, or 6) and the dependent variable would be reaction time.

Continuous and Discrete


Some variables (such as reaction time) are measured on a continuous scale. There is an infinite number of possible values these variables can take on. Other variables can only take on a limited number of values. For example, if a dependent variable were a subject's rating on a five-point scale where only the values 1, 2, 3, 4, and 5 were allowed, then only five possible values could occur. Such variables are called "discrete" variables.

Parameter

A parameter is a numerical quantity measuring some aspect of a population of scores. For example, the mean is a measure of central tendency.

Greek letters are used to designate parameters. Below are several parameters of great importance in statistical analyses and the Greek symbol that represents each one. Parameters are rarely known and are usually estimated by statistics computed in samples. To the right of each Greek symbol is the symbol for the associated statistic used to estimate it from a sample.

Quantity             Parameter   Statistic
Mean                 μ           M
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r
Statistics

The word "statistics" is used in several different senses. In the broadest sense, "statistics" refers to a range of techniques and procedures for analyzing data, interpreting data, displaying data, and making decisions based on data. This is what courses in "statistics" generally cover.

In a second usage, a "statistic" is defined as a numerical quantity (such as the mean) calculated in a sample. Such statistics are used to estimate parameters.

The term "statistics" sometimes refers to calculated quantities regardless of whether or not they are from a sample. For example, one might ask about a baseball player's statistics and be referring to his or her batting average, runs batted in, number of home runs, etc. Or, "government statistics" can refer to any numerical indexes calculated by a governmental agency.

Although the different meanings of "statistics" have the potential for confusion, a careful consideration of the context in which the word is used should make its intended meaning clear.

Summation Notation (1 of 4)

The Greek letter Σ (a capital sigma) is used to designate summation. For example, suppose an experimenter measured the performance of four subjects on a memory task. Subject 1's score will be referred to as X1, Subject 2's as X2, and so on. The scores are shown below:

Subject   Score
1         X1 = 7
2         X2 = 6
3         X3 = 5
4         X4 = 8

The way to use the summation sign to indicate the sum of all four X's is:

ΣXi (i = 1 to 4)

This notation is read as follows: Sum the values of X from X1 through X4.

The index i (shown just under the Σ sign) indicates which values of X are to be summed. The index i takes on values beginning with the value to the right of the "=" sign (1 in this case) and continues sequentially until it reaches the value above the Σ sign (4 in this case). Therefore i takes on the values 1, 2, 3, and 4 and the values of X1, X2, X3, and X4 are summed (7 + 6 + 5 + 8 = 26).

Summation Notation (2 of 4)

In order to make formulas more general, variables can be used with the summation notation. For example,

ΣXi (i = 1 to N)

means to sum up values of X from 1 to N, where N can be any number but usually indicates the sample size.

Often an abbreviated form of the summation notation is used. For example, ΣX means to sum all the values of X. When only a subset of the values of X are to be summed then the full version is required. Thus, the sum of all elements of X except the first and the last (the Nth) would be indicated as:

ΣXi (i = 2 to N-1)

which would be read as the sum of X with i going from 2 to N-1.


Some formulas require that each number be squared before the numbers are summed. This is indicated by:

ΣXi² (i = 1 to 4)

and is equal to 7² + 6² + 5² + 8² = 174.

Summation Notation (3 of 4)

The abbreviated version is simply ΣX². It is very important to note that it makes a big difference whether the numbers are squared first and then summed or summed first and then squared. The symbol (ΣX)² indicates that the numbers should be summed first and then squared. For the present example, this equals:

(7 + 6 + 5 + 8)² = 26² = 676. This, of course, is quite different from 174.

Sometimes a formula requires that the sum of cross products be computed. For instance, if 3 subjects were each tested twice, they might each have a score on X and on Y.

Subject   X   Y
1         2   3
2         1   6
3         4   5

The sum of cross products (2 x 3) + (1 x 6) + (4 x 5) = 32 can be represented in summation notation simply as ΣXY.

Summation Notation (4 of 4)
Basic Theorems:

The following data will be used to illustrate the theorems:

X   Y
3   8
2   3
4   1

Σ(X + Y) = ΣX + ΣY
Σ(X + Y) = 11 + 5 + 5 = 21
ΣX = 3 + 2 + 4 = 9
ΣY = 8 + 3 + 1 = 12
ΣX + ΣY = 9 + 12 = 21

ΣaX = aΣX (a is a constant)
For an example, let a = 2.
ΣaX = (2)(3) + (2)(2) + (2)(4) = 18
aΣX = (2)(9) = 18

Σ(X - M)² = ΣX² - (ΣX)²/N
(N is the number of numbers, 3 in this case, and M is the mean, which is also equal to 3 in this case.)
Σ(X - M)² = (3-3)² + (2-3)² + (4-3)² = 2
ΣX² = 3² + 2² + 4² = 29
(ΣX)²/N = 9²/3 = 27
ΣX² - (ΣX)²/N = 29 - 27 = 2
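These theorems are easy to verify numerically. The following Python sketch (variable names are illustrative) checks each theorem against the data above:

# Verify the three summation theorems for X = [3, 2, 4] and Y = [8, 3, 1].
X = [3, 2, 4]
Y = [8, 3, 1]
a = 2
N = len(X)
M = sum(X) / N  # mean of X; equals 3 here

# Theorem 1: the sum of (X + Y) equals the sum of X plus the sum of Y
print(sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y))   # True

# Theorem 2: the sum of aX equals a times the sum of X
print(sum(a * x for x in X) == a * sum(X))                   # True

# Theorem 3: sum of squared deviations equals sum of squares minus (sum)^2 / N
print(sum((x - M) ** 2 for x in X)
      == sum(x ** 2 for x in X) - sum(X) ** 2 / N)           # True; both sides equal 2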

Measurement Scales (1 of 6)

Measurement is the assignment of numbers to objects or events in a systematic fashion. Four levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio. There is a relationship between the level of measurement and the appropriateness of various statistical procedures. For example, it would be silly to compute the mean of nominal measurements. However, the appropriateness of statistical analyses involving means for ordinal level data has been controversial. One position is that data must be measured on an interval or a ratio scale for the computation of means and other statistics to be valid. Therefore, if data are measured on an ordinal scale, the median but not the mean can serve as a measure of central tendency.

The arguments on both sides of this issue will be examined in the context of a hypothetical experiment designed to determine whether people prefer to work with color or with black and white computer displays. Twenty subjects viewed black and white displays and 20 subjects viewed color displays.

Measurement Scales (2 of 6)

Displays were rated on a 7-point scale where a 1 was the lowest rating and a 7 was the highest rating. This rating scale is only an ordinal scale since there is no assurance that the difference between a rating of 1 and a rating of 2 represents the same degree of difference in preference as the difference between a rating of 5 and a rating of 6. The mean rating of the color display was 5.5 and the mean rating of the black and white display was 3.9. The first question the experimenter would ask is how likely it is that this big a difference between means could have occurred just because of chance factors, such as which subjects saw the black and white display and which subjects saw the color display. Standard methods of statistical inference can answer this question. Assume these methods led to the conclusion that the difference was not due to chance but represented a "real" difference in means. Does the fact that the rating scale was ordinal instead of interval have any implications for the validity of the statistical conclusion that the difference between means was not due to chance?

Measurement Scales (3 of 6)

The answer is an unequivocal "NO." There is really no room for argument here. What can be questioned, however, is whether it is worth knowing that the mean rating of color displays is higher than the mean rating for B & W displays.

The argument that it is not worth knowing assumes that means of ordinal data are meaningless. Supporting the notion that means of ordinal data are meaningless is the fact that examples can be made up showing that a difference between means on an ordinal scale can be in the opposite direction of what it would have been if the "true" measurement scale had been used.

If means of ordinal data are meaningless, why should anyone care whether the difference between two meaningless quantities (the two means) is due to chance or not? Naturally enough, the answer lies in challenging the proposition that means of ordinal data are meaningless. There are two counter arguments to the example showing that using an ordinal scale can reverse the direction of the difference between means.

Measurement Scales (4 of 6)

The first is philosophical and challenges the validity of the notion that there is some unseen "true" measurement scale that is only being approximated by the rating scale. The second counter argument accepts the notion of an underlying scale but considers the examples to be very contrived and unlikely to occur in real data. Measurement scales used in behavioral research are invariably somewhere between ordinal and interval scales. In the preference experiment, it may not be the case that the difference between the ratings one and two is exactly the same as the difference between five and six, but it is unlikely to be many times larger either. The scale is roughly interval, and it is exceedingly unlikely that the means on this scale would favor color displays while the means on the "true" scale would favor the B & W displays.

There are some cases where one can validly argue that the use of an ordinal instead of a ratio scale seriously distorts the conclusions. Consider an experiment designed to determine whether 5-year-old children are more distractible than 10-year-old children.

Measurement Scales (5 of 6)

              No Distraction   Distraction
5-yr olds          6               3
10-yr olds        12               8

It looks as though the 10-year-olds are more distractible since distraction cost them 4 points but only cost the 5-year-olds 3 points. However, it might be that a change from 3 to 6 represents a larger difference than a change from 8 to 12. Consider that the performance of 5-year-olds dropped 50% from distraction but the performance of 10-year-olds dropped only 33%.

Which age group is "really" more distractible? Unfortunately, there is no clearly right or wrong answer. If proportional change is considered, then 5-year-olds are more distractible; if the amount of change is considered, then 10-year-olds are more distractible. Keep in mind that statistical conclusions are not affected by the choice of measurement scale even though the all-important interpretation of these conclusions can be.

In this example, a statistical test could validly rule out chance as an explanation of the finding that 10-year-olds lost more points from distraction than did 5-year-olds. However, the statistical test will not reveal whether a greater drop necessarily means 10-year-olds are more distractible. So, the conclusion that distraction costs 10-year-olds more points than it costs 5-year-olds is valid. The interpretation depends on measurement issues.

In summary, statistical analyses provide conclusions about the numbers entered into them. Relating these conclusions to the substantive research issues depends on the measurement operations.

Chapter 1

1. A teacher wishes to know whether the males in his/her class have more favorable attitudes toward gun control than do the females. All students in the class are given a questionnaire about gun control and the mean responses of the males and the females are compared. Is this an example of descriptive or inferential statistics?

2. A medical researcher is testing the effectiveness of a new drug for treating Parkinson's disease. Ten subjects with the disease are given the new drug and 10 are given a placebo. Improvement in symptomology is measured. What would be the roles of descriptive and inferential statistics in the analysis of these data?

3. What are the advantages and disadvantages of graphical as opposed to numerical approaches to descriptive statistics?

4. Distinguish between random and stratified sampling.

5. A study is conducted to determine whether people learn better with spaced or massed practice. Subjects volunteer from an introductory psychology class. The first 10 subjects who volunteer are assigned to the massed-practice condition; the next 10 are assigned to the spaced-practice condition. Discuss the consequences and seriousness of each of the following two kinds of non-random sampling: (1) Subjects are not randomly sampled from some specified population and (2) subjects are not randomly assigned to conditions. In general, which type of non-random sampling is more serious?

6. Define independent and dependent variables.

7. Categorize the following variables as being qualitative or quantitative:
Response time
Rating of job satisfaction
Favorite color
Occupation aspired to
Number of words remembered

8. Specify the level of measurement used for the items in Question 7.

9. Categorize the variables in Question 7 as being continuous or discrete.

10. Are Greek letters used for statistics or for parameters?


11. When would the mean score of a class on a final exam be considered a statistic? When would it be considered a parameter?

12. An experiment is conducted to examine the effect of punishment on learning speed in rats. What are the independent and dependent variables?

13. For the numbers 1, 2, 4, 8, compute:
ΣX
ΣX²
(ΣX)²

14. ΣX = 7 and ΣX² = 21. A new variable Y is created by multiplying each X by 3. What are ΣY and ΣY² equal to?

Selected Answers

Answers: Chapter 1
1. This is an example of descriptive statistics since there is no attempt to generalize the results from the class to a population. The key is that the teacher is interested only in his/her class and not in other students. If the teacher's class were thought of as a random sample of students from some larger population, and the teacher wished to draw an inference about the population and not simply describe the sample, then it would be an example of inferential statistics.

4. In random sampling, each element of the population has an equal chance of being drawn each time an element is sampled. The characteristics of a random sample may be very different from the characteristics of a population. In stratified sampling, attempts are made to make the sample as close to the population as possible on the key attributes.

7.
Response time: Quantitative
Rating of job satisfaction: Quantitative
Favorite color: Qualitative
Occupation aspired to: Qualitative as it stands, but could be considered quantitative if rated in terms of expected income, prestige, etc.
Number of words remembered: Quantitative

13.
ΣX = 15
ΣX² = 85
(ΣX)² = 225

14.
ΣY = 21
ΣY² = 189

1. Central Tendency
1. Mean
2. Median
3. Mode
4. Trimean
5. Trimmed Mean
6. Summary

Arithmetic Mean
The arithmetic mean is what is commonly called the average. When the word "mean" is used without a modifier, it can be assumed that it refers to the arithmetic mean. The mean is the sum of all the scores divided by the number of scores. The formula in summation notation is:

μ = ΣX/N

where μ is the population mean and N is the number of scores.

If the scores are from a sample, then the symbol M refers to the mean and N refers to the sample size. The formula for M is the same as the formula for μ:

M = ΣX/N

The mean is a good measure of central tendency for roughly symmetric distributions but can be misleading in skewed distributions since it can be greatly influenced by scores in the tail. Therefore, other statistics such as the median may be more informative for distributions such as reaction time or family income that are frequently very skewed.


The sum of squared deviations of scores from their mean is lower than their squared deviations from any other number.

For normal distributions, the mean is the most efficient, and therefore the least subject to sample fluctuations, of all measures of central tendency.

The formal definition of the arithmetic mean is μ = E[X], where μ is the population mean and E[X] is the expected value of X.

Mean (2 of 4)
The geometric mean is the nth root of the product of the scores. Thus, the geometric mean of the scores 1, 2, 3, and 10 is the fourth root of 1 x 2 x 3 x 10, which is the fourth root of 60, which equals 2.78. The formula can be written as:

Geometric mean = (ΠX)^(1/N)

where ΠX means to take the product of all the values of X. The geometric mean can also be computed by:

1. taking the logarithm of each number
2. computing the arithmetic mean of the logarithms
3. raising the base used to take the logarithms to the arithmetic mean

The next page shows an example of this method using natural logarithms.
Mean (3 of 4)

X    Ln(X)
1    0
2    0.693147
3    1.098612
10   2.302585

Arithmetic mean of Ln(X) = 1.024.
Geometric mean = EXP[1.024] = 2.78.

The base of natural logarithms is 2.718. The expression EXP[1.024] means that 2.718 is raised to the 1.024th power. Ln(X) is the natural log of X.

Naturally, you get the same result using logs base 10 as shown below.

X    Log(X)
1    0.00000
2    0.30103
3    0.47712
10   1.00000

Arithmetic mean of Log(X) = 0.44454.
Geometric mean = 10^0.44454 = 2.78.

If any one of the scores is zero then the geometric mean is zero. The
geometric mean does not make sense if any scores are less than zero.

The geometric mean is less affected by extreme values than is the arithmetic mean and is useful as a measure of central tendency for some positively skewed distributions.

The geometric mean is an appropriate measure to use for averaging rates. For example, consider a stock portfolio that began with a value of $1,000 and had annual returns of 13%, 22%, 12%, -5%, and -13%. The table below shows the value after each of the five years.

Year   Return   Value
1      13%      1,130
2      22%      1,379
3      12%      1,544
4      -5%      1,467
5      -13%     1,276
The question is how to compute the annual rate of return. The answer is to compute the geometric mean of the returns. Instead of using the percents, each return is represented as a multiplier indicating how much higher the value is after the year. This multiplier is 1.13 for a 13% return and 0.95 for a 5% loss. The multipliers for this example are 1.13, 1.22, 1.12, 0.95, and 0.87. The geometric mean of these multipliers is 1.05. Therefore, the average annual rate of return is 5%. The following table shows how a portfolio gaining 5% a year would end up with the same value ($1,276) as the one shown above.

Year   Return   Value
1      5%       1,050
2      5%       1,103
3      5%       1,158
4      5%       1,216
5      5%       1,276
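The same computation can be sketched in a few lines of Python (variable names are illustrative):

import math

returns = [0.13, 0.22, 0.12, -0.05, -0.13]
multipliers = [1 + r for r in returns]            # 1.13, 1.22, 1.12, 0.95, 0.87

# Geometric mean: the fifth root of the product of the five multipliers
geometric_mean = math.prod(multipliers) ** (1 / len(multipliers))
print(round(geometric_mean, 2))                   # 1.05, i.e. an average annual return of 5%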
Mean (4 of 4)

Harmonic Mean
The harmonic mean is used to take the mean of sample sizes. If there are k samples with sizes n1, n2, ..., nk, then the harmonic mean is defined as:

nh = k / (1/n1 + 1/n2 + ... + 1/nk)

For the numbers 1, 2, 3, and 10, the harmonic mean is:

4 / (1/1 + 1/2 + 1/3 + 1/10) = 2.069

This is less than the geometric mean of 2.78 and the arithmetic mean of 4.
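A minimal Python sketch of this calculation (the function name is illustrative; Python's standard library also provides statistics.harmonic_mean):

# Harmonic mean: the number of values divided by the sum of their reciprocals
def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)

print(round(harmonic_mean([1, 2, 3, 10]), 3))     # 2.069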


Median


The median is the middle of a distribution: half the scores are above the
median and half are below the median. The median is less sensitive to
extreme scores than the mean and this makes it a better measure than
the mean for highly skewed distributions. The median income is usually
more informative than the mean income, for example.

The sum of the absolute deviations of each number from the median is lower than the sum of absolute deviations from any other number.

The mean, median, and mode are equal in symmetric distributions. The
mean is typically higher than the median in positively skewed
distributions and lower than the median in negatively skewed
distributions, although this may not be the case in bimodal
distributions.
Computation of Median
When there is an odd number of numbers, the median is simply the
middle number. For example, the median of 2, 4, and 7 is 4.

When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
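Both rules can be captured in a short Python sketch (the function name is illustrative; the standard library's statistics.median does the same thing):

# Median: the middle number for an odd count, the mean of the two
# middle numbers for an even count
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([2, 4, 7]))        # 4
print(median([2, 4, 7, 12]))    # 5.5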

Mode


The mode is the most frequently occurring score in a distribution and is used as a measure of central tendency. The advantage of the mode as a measure of central tendency is that its meaning is obvious. Further, it is the only measure of central tendency that can be used with nominal data.

The mode is greatly subject to sample fluctuations and is therefore not recommended to be used as the only measure of central tendency. A further disadvantage of the mode is that many distributions have more than one mode. These distributions are called "multimodal." In a normal distribution, the mean, median, and mode are identical.

Trimean


The trimean is computed by adding the 25th percentile plus twice the 50th percentile plus the 75th percentile and dividing by four. What follows is an example of how to compute the trimean. The 25th, 50th, and 75th percentiles of the dataset "Example 1" are 51, 55, and 63 respectively. Therefore, the trimean is computed as:

(51 + 2 x 55 + 63)/4 = 224/4 = 56

The trimean is almost as resistant to extreme scores as the median and is less subject to sampling fluctuations than the arithmetic mean in extremely skewed distributions. It is less efficient than the mean for normal distributions. The trimean is a good measure of central tendency and is probably not used as much as it should be.
Trimmed Mean

A trimmed mean is calculated by discarding a certain percentage of the lowest and the highest scores and then computing the mean of the remaining scores. For example, a mean trimmed 50% is computed by discarding the lower and higher 25% of the scores and taking the mean of the remaining scores. The median is the mean trimmed 100% and the arithmetic mean is the mean trimmed 0%. A trimmed mean is obviously less susceptible to the effects of extreme scores than is the arithmetic mean. It is therefore less susceptible to sampling fluctuation than the mean for extremely skewed distributions. It is less efficient than the mean for normal distributions.
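A minimal Python sketch of a trimmed mean (the function name and data are made up for illustration; scipy users can reach for scipy.stats.trim_mean instead):

# Mean trimmed by a given percentage: drop half that percentage of the
# scores from each end, then average what remains.
def trimmed_mean(values, percent):
    s = sorted(values)
    k = int(len(s) * percent / 200)   # number of scores to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

data = [1, 3, 4, 5, 6, 7, 8, 40]      # made-up data with one extreme score
print(trimmed_mean(data, 0))          # 9.25, the ordinary arithmetic mean
print(trimmed_mean(data, 50))         # 5.5, unaffected by the extreme score 40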

Trimmed means are often used in Olympic scoring to minimize the effects of extreme ratings possibly caused by biased judges.

Summary of Measures of Central Tendency (1 of 2)

Of the five measures of central tendency discussed, the mean is by far the most widely used. It takes every score into account, is the most efficient measure of central tendency for normal distributions, and is mathematically tractable, making it possible for statisticians to develop statistical procedures for drawing inferences about means. On the other hand, the mean is not appropriate for highly skewed distributions and is less efficient than other measures of central tendency when extreme scores are possible. The geometric mean is a viable alternative if all the scores are positive and the distribution has a positive skew.

The median is useful because its meaning is clear and it is more efficient than the mean in highly skewed distributions. However, it ignores many scores and is generally less efficient than the mean, the trimean, and trimmed means.

The mode can be informative but should almost never be used as the
only measure of central tendency since it is highly susceptible to
sampling fluctuations.

1. Spread
1. Range
2. Semi-Interquartile Range
3. Variance
4. Standard Deviation
5. Summary
Range


The range is the simplest measure of spread or dispersion: it is equal to the difference between the largest and the smallest values. The range can be a useful measure of spread because it is so easily understood. However, it is very sensitive to extreme scores since it is based on only two values. The range should almost never be used as the only measure of spread, but can be informative if used as a supplement to other measures of spread such as the standard deviation or semi-interquartile range.

Example: The range of the numbers 1, 2, 4, 6, 12, 15, 19, 26 is 26 - 1 = 25.
Semi-Interquartile Range


The semi-interquartile range is a measure of spread or dispersion. It is computed as one half the difference between the 75th percentile (often called Q3) and the 25th percentile (Q1). The formula for the semi-interquartile range is therefore: (Q3 - Q1)/2.

Since half the scores in a distribution lie between Q3 and Q1, the semi-interquartile range is 1/2 the distance needed to cover 1/2 the scores. In a symmetric distribution, an interval stretching from one semi-interquartile range below the median to one semi-interquartile range above the median will contain 1/2 of the scores. This will not be true for a skewed distribution, however.

The semi-interquartile range is little affected by extreme scores, so it is a good measure of spread for skewed distributions. However, it is more subject to sampling fluctuation in normal distributions than is the standard deviation, and it is therefore not often used for data that are approximately normally distributed.

Standard Deviation and Variance (1 of 2)

The variance and the closely related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.

The variance is computed as the average squared deviation of each number from its mean. For example, for the numbers 1, 2, and 3, the mean is 2 and the variance is:

[(1 - 2)² + (2 - 2)² + (3 - 2)²]/3 = 0.667

The formula (in summation notation) for the variance in a population is:

σ² = Σ(X - μ)²/N

where μ is the mean and N is the number of scores. When the variance is computed in a sample, the statistic

S² = Σ(X - M)²/N

(where M is the mean of the sample) can be used. S² is a biased estimate of σ², however. By far the most common formula for computing variance in a sample is:

s² = Σ(X - M)²/(N - 1)

which gives an unbiased estimate of σ². Since samples are usually used to estimate parameters, s² is the most commonly used measure of variance. Calculating the variance is an important part of many statistical applications and analyses. It is the first step in calculating the standard deviation.

Standard Deviation
The standard deviation formula is very simple: it is the square
root of the variance. It is the most commonly used measure of
spread.

An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score. In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.
The standard deviation has proven to be an extremely useful
measure of spread in part because it is mathematically tractable.
Many formulas in inferential statistics use the standard deviation.
(See next page for applications to risk analysis and stock portfolio
volatility.)

Standard Deviation and Variance (2 of 2)

Although less sensitive to extreme scores than the range, the standard deviation is more sensitive than the semi-interquartile range. Thus, the standard deviation should be supplemented by the semi-interquartile range when the possibility of extreme scores is present.

If variable Y is a linear transformation of X such that Y = bX + A, then the variance of Y is:

σ²Y = b²σ²X

where σ²X is the variance of X.

The standard deviation of Y is bσX, where σX is the standard deviation of X.

Standard Deviation as a Measure of Risk

The standard deviation is often used by investors to measure the risk of a stock or a stock portfolio. The basic idea is that the standard deviation is a measure of volatility: the more a stock's returns vary from the stock's average return, the more volatile the stock. Consider the following two stock portfolios and their respective returns (in percent) over the last six months. Both portfolios end up increasing in value from $1,000 to $1,058. However, they clearly differ in volatility. Portfolio A's monthly returns range from -1.5% to 3% whereas Portfolio B's range from -9% to 12%. The standard deviation of the returns is a better measure of volatility than the range because it takes all the values into account. The standard deviation of the six returns for Portfolio A is 1.52; for Portfolio B it is 7.24.

Summary of Measures of Spread (Variability)


The standard deviation is by far the most widely used measure of spread. It takes every score into account, has extremely useful properties when used with a normal distribution, and is mathematically tractable; therefore, it appears in many formulas in inferential statistics. The standard deviation is not a good measure of spread in highly skewed distributions and should be supplemented in those cases by the semi-interquartile range.

The range is a useful statistic, but it cannot stand alone as a measure of spread since it takes into account only two scores.

The semi-interquartile range is rarely used as a measure of spread, in part because it is not very mathematically tractable. However, it is influenced less by extreme scores than the standard deviation, is less subject to sampling fluctuations in highly skewed distributions, and has a good intuitive meaning. It should be used to supplement the standard deviation in most cases.

Shape

Skew (1 of 3)

A distribution is skewed if one of its tails is longer than the other. The
first distribution shown has a positive skew. This means that it has a
long tail in the positive direction. The distribution below it has a
negative skew since it has a long tail in the negative direction. Finally,
the third distribution is symmetric and has no skew. Distributions with
positive skew are sometimes called "skewed to the right" whereas
distributions with negative skew are called "skewed to the left."
Skew (2 of 3)

Distributions with positive skew are more common than distributions with negative skew. One example is the distribution of income. Most people make under $40,000 a year, but some make quite a bit more, with a small number making many millions of dollars per year. The positive tail therefore extends out quite a long way whereas the negative tail stops at zero.

For a more psychological example, a distribution with a positive skew typically results if the time it takes to make a response is measured. The longest response times are usually much longer than typical response times whereas the shortest response times are seldom much less than the typical response time. A histogram of the author's performance on a perceptual motor task, in which the goal is to move the mouse to and click on a small target as quickly as possible, is shown below. The X axis shows times in milliseconds.

Negatively skewed distributions do occur, however. Consider the following frequency polygon of test grades on a statistics test where most students did very well but a few did poorly. It has a large negative skew.

Skew (3 of 3)

Skew can be calculated as:

skew = Σ(X - μ)³ / (Nσ³)

where μ is the mean and σ is the standard deviation.
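Assuming the third-moment formula above, skew can be sketched in Python as follows (the function name and data are made up for illustration):

# Skew: the average cubed deviation from the mean, in units of the
# population standard deviation cubed
def skew(scores):
    n = len(scores)
    m = sum(scores) / n
    sd = (sum((x - m) ** 2 for x in scores) / n) ** 0.5
    return sum((x - m) ** 3 for x in scores) / (n * sd ** 3)

print(round(skew([1, 2, 3, 4, 10]), 2))   # 1.14: positive, reflecting the long right tail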

The normal distribution has a skew of 0 since it is a symmetric distribution.

As a general rule, the mean is larger than the median in positively skewed distributions and less than the median in negatively skewed distributions. There are counterexamples. For example, it is not uncommon for the median to be higher than the mean in a positively skewed bimodal distribution or with discrete distributions. See "Mean, Median, and Skew: Correcting a Textbook Rule" by Paul von Hippel for more details.
Graphs

Frequency Polygon (1 of 2)

A frequency polygon is a graphical display of a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a point located above the middle of the interval. The points are connected so that together with the X-axis they form a polygon.

A frequency table and a relative frequency polygon for response times in a study on weapons and aggression are shown below. The times are in hundredths of a second.

Lower   Upper   Count   Cumulative Count   Percent   Cumulative Percent
25      30      1       1                  3.1       3.1
30      35      4       5                  12.5      15.6
35      40      8       13                 25.0      40.6

Note: Values in each category are greater than the lower limit and less than or equal to the upper limit.
Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets. The figure below provides an example. The data come from a task in which the goal is to move a computer mouse to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The two distributions (one for each target) are plotted together. The figure shows that although there is some overlap in times, it generally took longer to move the mouse to the small target than to the large one.

Frequency Polygon (2 of 2)


Frequency polygons can be based on the actual frequencies or the relative frequencies. When based on relative frequencies, the percentage of scores instead of the number of scores in each category is plotted.

In a cumulative frequency polygon, the number of scores (or the percentage of scores) up to and including the category in question is plotted. A cumulative frequency polygon is shown below.

Histogram

A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval. A histogram of the response times from the dataset Target RT is shown below.

The shapes of histograms will vary depending on the choice of the size of the intervals.

A bar graph is much like a histogram, differing in that the columns are
separated from each other by a small distance. Bar graphs are commonly
used for qualitative variables.

Stem and Leaf Plots (1 of 2)

A stem and leaf display (also called a stem and leaf plot) is a graphical
method of displaying data. It is particularly useful when the data are
not too numerous. A stem and leaf plot is similar to a histogram except
it portrays a little more precision. A stem and leaf plot of the
tournament players from the dataset "chess" as well as the data
themselves are shown below:

85.3
80.3
75.9
74.9
71.1
58.1
56.4
51.2
45.6
40.1

The largest value, 85.3, is approximated as:

10 x 8 + 5.

This is represented in the plot as a stem of 8 and a leaf of 5. It is shown


as the 5 in the first line of the plot. Similarly, 80.3 is approximated as 10
x 8 + 0; it has a stem of 8 and a leaf of 0. It is shown as the "0" in the
first line of the plot.
Depending on the data, each stem is displayed 1, 2, or 5 times. When
a stem is displayed only once (as on the plot shown above), the leaves
can take on the values from 0-9.
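A minimal Python sketch that builds this kind of display for the data above (each stem shown once, with the stem taken from the tens digit and the leaf from the units digit, dropping the decimal part; a full display would also print stems with no leaves):

from collections import defaultdict

data = [85.3, 80.3, 75.9, 74.9, 71.1, 58.1, 56.4, 51.2, 45.6, 40.1]

stems = defaultdict(list)
for value in data:
    stems[int(value) // 10].append(int(value) % 10)   # stem = tens, leaf = units

for stem in sorted(stems, reverse=True):
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(stem, "|", leaves)

# Output:
# 8 | 0 5
# 7 | 1 4 5
# 5 | 1 6 8
# 4 | 0 5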

Stem and Leaf (2 of 2)


When a stem is displayed twice, (as in the example shown below) one
stem is associated with the leaves 5-9 and the other stem is associated
with the leaves 0-4.

Finally, when a stem is displayed five times, the first has the leaves 8-9, the second 6-7, the third 4-5, and so on.

If positive and negative numbers are both present, +0 and -0 are used as stems, as they are in the plot. A stem of -0 and a leaf of 7 is a value of (-0 x 1) + (-0.1 x 7) = -0.7.

There is a variation of stem and leaf displays that is useful for comparing distributions. The two distributions are placed back to back along a common column of stems. The figure below shows such a graph. It compares two distributions. The stems are in the middle, the leaves to the left are for one distribution, and the leaves to the right are for the other. For example, the second-to-last row shows that the distribution on the left contains the values 11, 12, and 13 whereas the distribution on the right contains two 12's and three 14's.

11 4
3 7
33 3 233
886 2 889
443311 2 0011122
9877766 1 5688889
321 1 22444
7 0 69

Box Plot (1 of 2)

A box plot provides an excellent visual summary of many important aspects of a distribution. The box stretches from the lower hinge (defined as the 25th percentile) to the upper hinge (the 75th percentile) and therefore contains the middle half of the scores in the distribution.

The median is shown as a line across the box. Therefore 1/4 of the distribution is between this line and the top of the box and 1/4 of the distribution is between this line and the bottom of the box. The "H-spread" is defined as the difference between the hinges, and a "step" is defined as 1.5 times the H-spread.

Inner fences are 1 step beyond the hinges. Outer fences are 2 steps beyond the hinges.

There are two adjacent values: the largest value below the upper inner fence and the smallest value above the lower inner fence. For the data plotted in the figure, the minimum value is above the lower inner fence and is therefore the lower adjacent value. The maximum value is above the upper inner fence, so it is not the upper adjacent value.

As shown in the figure, a line is drawn from the upper hinge to the upper adjacent value and from the lower hinge to the lower adjacent value.
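The hinge and fence arithmetic can be sketched in Python (the data are made up for illustration; hinge definitions vary slightly between textbooks and programs, so simple interpolated quartiles are used here):

data = sorted([2, 3, 4, 5, 6, 7, 8, 9, 30])   # illustrative data with one outlier

def percentile(s, p):
    # quartile by linear interpolation between closest ranks
    k = (len(s) - 1) * p
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (k - f) * (s[c] - s[f])

lower_hinge = percentile(data, 0.25)           # 4
upper_hinge = percentile(data, 0.75)           # 8
step = 1.5 * (upper_hinge - lower_hinge)       # 1.5 times the H-spread = 6
print(lower_hinge - step, upper_hinge + step)          # inner fences: -2, 14
print(lower_hinge - 2 * step, upper_hinge + 2 * step)  # outer fences: -8, 20
# The value 30 lies beyond the outer fence, so it would be marked with "*"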

Box Plots (2 of 2)



Every score between the inner and outer fences is indicated by an "o" whereas a score beyond the outer fences is indicated by a "*".

It is often useful to compare data from two or more groups by viewing box plots from the groups side by side. Plotted are data from Example 2a and Example 2b. The data from 2b are higher, more spread out, and have a positive skew. That the skew is positive can be determined by the fact that the mean is higher than the median and the upper whisker is longer than the lower whisker.

Some computer programs present their own variations on box plots. For
example, SPSS does not include the mean. JMP distinguishes between
"outlier" box plots which are the same as those described here and
quantile box plots that show the 10th, 25th, 50th, 75th, and 90th
percentiles.


Exercises

Chapter 2

1. Which of the frequency polygons has a large positive skew? Which has a large negative skew?

2. Which of the box plots has a large positive skew? Which has a large negative skew?
3. Make up a dataset of 20 numbers with a positive skew. Use a statistical program to compute the skew and to create a box plot. Is the mean larger than the median, as it should be for distributions with a positive skew? What is the value for skew? Plot a frequency polygon with these data.

4. Repeat Problem 3 only this time make the dataset have a negative
skew.

5. Make up two data sets that have:
(a) the same mean but differ in standard deviations.
(b) the same mean but have different medians.
(c) the same median but different means.
(d) the same semi-interquartile range but differ in standard deviations.
6. Assume the variable X has a mean of 10 and a standard deviation of 2. What would be the mean and standard deviation of a new variable (Y) that was created by multiplying each element of X by 5 and then adding 4?

7. The dataset below has data on memory for chess positions for players of three levels of expertise. The numbers represent the number of pieces correctly recalled from three chess positions. Create side-by-side box plots for these three groups. What can you say about the differences between these groups from the box plots? Consider spread as well as central tendency.

22.1 32.5 40.1


22.3 37.1 45.6
26.2 39.1 51.2
29.6 40.5 56.4
31.7 45.5 58.1
33.5 51.3 71.1

8. For the numbers 1, 3, 4, 5, and 12, find the value (v) for which Σ(X - v)² is minimized. Find the value (v) for which Σ|X - v| is minimized.
9. Experiment with the sampling distribution simulation and do the
exercises associated with it.
10. What is more likely to have a skewed distribution: time to solve
an anagram problem or scores on a vocabulary test?

Selected Answers

Chapter 2

6.
Standard deviation = 10.
Mean = 54.

8.
The value (v) for which Σ(X - v)² is minimized is 5. The value (v) for which Σ|X - v| is minimized is 4.
Scatterplots


A scatterplot shows the scores on one variable plotted against scores on a second variable. Below is a plot showing the relationship between grip strength and arm strength for 147 people working at physically-demanding jobs. The data are from a case study in the Rice Virtual Lab in Statistics. The plot shows a very strong but certainly not a perfect relationship between these two variables.

Scatterplots should almost always be constructed when the relationship between two variables is of interest. Statistical summaries are no substitute for a full plot of the data.

Pearson's Correlation (1 of 3)

The correlation between two variables reflects the degree to which the variables are related. The most common measure of correlation is the Pearson Product Moment Correlation (called Pearson's correlation for short). When measured in a population the Pearson Product Moment correlation is designated by the Greek letter rho (ρ). When computed in a sample, it is designated by the letter "r" and is sometimes called "Pearson's r." Pearson's correlation reflects the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables. The scatterplot shown on this page depicts such a relationship. It is a positive relationship because high scores on the X-axis are associated with high scores on the Y-axis.
Pearson's Correlation (2 of 3)

A correlation of -1 means that there is a perfect negative linear relationship between variables. The scatterplot shown below depicts a negative relationship. It is a negative relationship because high scores on the X-axis are associated with low scores on the Y-axis.

A correlation of 0 means there is no linear relationship between the two variables. The second graph shows a Pearson correlation of 0. Correlations are rarely if ever 0, 1, or -1. Some real data showing a moderately high correlation are shown on the next page.

Pearson's Correlation (3 of 3)


The scatterplot below shows arm strength as a function of grip strength for 147 people working in physically-demanding jobs. The plot reveals a strong positive relationship. The value of Pearson's correlation is 0.63.

Computing Pearson's Correlation Coefficient



The formula for Pearson's correlation takes on many forms. A commonly used computational formula is:

r = [ΣXY - (ΣX)(ΣY)/N] / sqrt{[ΣX² - (ΣX)²/N][ΣY² - (ΣY)²/N]}

The formula looks a bit complicated, but taken step by step, it is really quite simple.

A simpler-looking formula can be used if the numbers are converted into z scores:

r = Σ(zX zY) / N

where zX is the variable X converted into z scores and zY is the variable Y converted into z scores.
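A minimal Python sketch of the z-score version (names are illustrative; the data are the pairs used in exercise 2 of the Chapter 3 problems below):

# Pearson's r from z scores: r = sum(zX * zY) / N, using
# population-style standard deviations
def pearson_r(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sx = (sum((x - mx) ** 2 for x in X) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in Y) / n) ** 0.5
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)) / n

print(round(pearson_r([1, 5, 6, 12], [2, 9, 8, 20]), 3))   # 0.987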


Example Correlations

[Example scatterplots illustrating a range of correlation values appeared here.]

Restricted Range


Consider a study that investigated the correlation between arm strength and grip strength for 147 people working in physically-demanding jobs. Do you think the correlation was higher for all 147 workers tested, or for the workers who were above the median in grip strength? The upper portion of the figure below shows the scatterplot for the entire sample of 147 workers. The lower portion of the figure shows the scatterplot for the 73 workers who scored highest on grip strength. The correlation is 0.63 for the sample of 147 but only 0.47 for the sample of 73. Notice that the scales of the X-axes are different.
Whenever a sample has a restricted range of scores, the correlation will
be reduced. To take the most extreme example, consider what the
correlation between high-school GPA and college GPA would be in a
sample where every student had the same high-school GPA. The
correlation would necessarily be 0.0.


Effect of Linear Transformations on Pearson's r



Linear transformations have no effect on Pearson's correlation coefficient. Thus, the correlation between height and weight is the same regardless of whether height is measured in inches, feet, centimeters, or even miles. This is a very desirable property since, with the exception of ratio scales, choices among measurement scales that are linear transformations of each other are arbitrary. For instance, scores on the Scholastic Aptitude Test (SAT) range from 200-800. It was an arbitrary decision to set 200 to 800 as the range. The test would not be any different if 100 points were subtracted from each score and then each score were multiplied by 3. Scores on the SAT would then range from 300-2100. The Pearson's correlation between SAT and some other variable (such as college grade point average) would not be affected by this linear transformation.


Spearman's rho

Spearman's rho is a measure of the linear relationship between two variables. It differs from Pearson's correlation only in that the computations are done after the numbers are converted to ranks. When converting to ranks, the smallest value on X becomes a rank of 1, etc. Consider the following X-Y pairs:

X   Y
7   4
5   7
8   9
9   8

Converting these to ranks would result in the following:

X   Y
2   1
1   2
3   4
4   3

The first value of X (which was a 7) is converted into a 2 because 7 is the second lowest value of X. The X value of 5 is converted into a 1 since it is the lowest. Spearman's rho can be computed with the formula for Pearson's r using the ranked data. For this example, Spearman's rho = 0.60.
Spearman's rho is an example of a "rank-randomization" test.
Chapter 3

1. Make up 3 datasets each with 10 observations (rows) and 2 variables (columns). One should have a correlation greater than .8; one should have a correlation between .4 and .7; one should have a correlation between -.20 and +.20.

2. Compute Pearson's r and Spearman's rho for the following data:

1, 2
5, 9
6, 8
12, 20
Selected Answers

2a. 0.987
2b. 0.800
Simple probability (1 of 2)

What is the probability that a card drawn at random from a deck of cards will be an ace? Since 4 of the 52 cards in the deck are aces, the probability is 4/52. In general, the probability of an event is the number of favorable outcomes divided by the total number of possible outcomes. (This assumes the outcomes are all equally likely.) In this case there are four favorable outcomes: (1) the ace of spades, (2) the ace of hearts, (3) the ace of diamonds, and (4) the ace of clubs. Since each of the 52 cards in the deck represents a possible outcome, there are 52 possible outcomes.

Simple Probability (2 of 2)



The same principle can be applied to the problem of determining the probability of obtaining different totals from a pair of dice. As shown below, there are 36 possible outcomes when a pair of dice is thrown.

To calculate the probability that the sum of the two dice will equal 5, calculate the number of outcomes that sum to 5 and divide by the total number of outcomes (36). Since four of the outcomes have a total of 5 (1,4; 2,3; 3,2; 4,1), the probability of the two dice adding up to 5 is 4/36 = 1/9. In like manner, the probability of obtaining a sum of 12 is computed by dividing the number of favorable outcomes (there is only one) by the total number of outcomes (36). The probability is therefore 1/36.

Conditional Probability
A conditional probability is the probability of an event given that another event has occurred. For example, what is the probability that the total of two dice will be greater than 8 given that the first die is a 6? This can be computed by considering only outcomes for which the first die is a 6. Then, determine the proportion of these outcomes that total more than 8. All the possible outcomes for two dice are shown below:

There are 6 outcomes for which the first die is a 6, and of these, there are four that total more than 8 (6,3; 6,4; 6,5; 6,6). The probability of a total greater than 8 given that the first die is 6 is therefore 4/6 = 2/3.
More formally, this probability can be written as:

p(total>8 | Die 1 = 6) = 2/3.

In this equation, the expression to the left of the vertical bar represents the event and the expression to the right of the vertical bar represents the condition. Thus it would be read as "The probability that the total is greater than 8 given that Die 1 is 6 is 2/3." In more abstract form, p(A|B) is the probability of event A given that event B occurred.

Probability of A and B (1 of 2)
If A and B are Independent
A and B are two events. If A and B are independent, then the
probability that events A and B both occur is:

p(A and B) = p(A) x p(B).

In other words, the probability of A and B both occurring is the product of the probability of A and the probability of B.

What is the probability that a fair coin will come up with heads twice in a row? Two events must occur: a head on the first toss and a head on the second toss. Since the probability of each event is 1/2, the probability of both events is: 1/2 x 1/2 = 1/4.

Now consider a similar problem: Someone draws a card at random out of a deck, replaces it, and then draws another card at random. What is the probability that the first card is the ace of clubs and the second card is a club (any club)? Since there is only one ace of clubs in the deck, the probability of the first event is 1/52. Since 13/52 = 1/4 of the deck is composed of clubs, the probability of the second event is 1/4. Therefore, the probability of both events is: 1/52 x 1/4 = 1/208.

Probability of A and B (2 of 2)


If A and B are Not Independent

If A and B are not independent, then the probability of A and B is:

p(A and B) = p(A) x p(B|A)

where p(B|A) is the conditional probability of B given A.

If someone draws a card at random from a deck and then, without replacing the first card, draws a second card, what is the probability that both cards will be aces? Event A is that the first card is an ace. Since 4 of the 52 cards are aces, p(A) = 4/52 = 1/13. Given that the first card is an ace, what is the probability that the second card will be an ace as well? Of the 51 remaining cards, 3 are aces. Therefore, p(B|A) = 3/51 = 1/17 and the probability of A and B is: 1/13 x 1/17 = 1/221.


Binomial distribution (1 of 3)

When a coin is flipped, the outcome is either a head or a tail; when a magician guesses the card selected from a deck, the magician can either be correct or incorrect; when a baby is born, the baby is either born in the month of March or is not. In each of these examples, an event has two mutually exclusive possible outcomes. For convenience, one of the outcomes can be labeled "success" and the other outcome "failure." If an event occurs N times (for example, a coin is flipped N times), then the binomial distribution can be used to determine the probability of obtaining exactly r successes in the N outcomes. The binomial probability for obtaining r successes in N trials is:

P(r) = [N! / (r!(N - r)!)] π^r (1 - π)^(N - r)

where P(r) is the probability of exactly r successes, N is the number of events, and π is the probability of success on any one trial. This formula for the binomial distribution assumes that the events:
for the binomial distribution assumes that the events:

1. are dichotomous (fall into only two categories)


2. are mutually exclusive
3. are independent and
4. are randomly selected

Consider this simple application of the binomial distribution: What is the probability of obtaining exactly 3 heads if a fair coin is flipped 6 times?
Binomial Distribution (2 of 3)

For this problem, N = 6, r = 3, and π = 0.5. Therefore,

P(3) = [6! / (3!3!)] (0.5)³(0.5)³ = 20 x 0.015625 = 0.3125.

Two binomial distributions are shown below. Notice that for π = 0.5, the distribution is symmetric whereas for π = 0.3, the distribution has a positive skew.

Binomial distribution (3 of 3)

Often the cumulative form of the binomial distribution is used. To determine the probability of obtaining 3 or more successes with N = 6 and π = 0.3, you compute P(3) + P(4) + P(5) + P(6). This can also be written as:

Σ P(r) for r = 3 to 6

and is equal to 0.1852 + 0.0595 + 0.0102 + 0.0007 = 0.2556. The binomial distribution can be approximated by a normal distribution.
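Both calculations can be reproduced in Python with math.comb (a minimal sketch; the function name is illustrative):

import math

# P(r) = C(N, r) * pi^r * (1 - pi)^(N - r)
def binomial(r, N, pi):
    return math.comb(N, r) * pi ** r * (1 - pi) ** (N - r)

print(binomial(3, 6, 0.5))                            # 0.3125: exactly 3 heads in 6 flips
print(sum(binomial(r, 6, 0.3) for r in range(3, 7)))  # about 0.2557: 3 or more successes
                                                      # (0.2556 above reflects rounding of each term)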
Subjective probability (1 of 1)


For some purposes, probability is best thought of as subjective. Questions such as "What is the probability that Boston will defeat New York in an upcoming baseball game?" cannot be calculated by dividing the number of favorable outcomes by the number of possible outcomes. Rather, assigning probability 0.6 (say) to this event seems to reflect the speaker's personal opinion, perhaps his or her willingness to bet according to certain odds. Such an approach to probability, however, seems to lose the objective content of the idea of chance; probability becomes mere opinion. Two people might attach different probabilities to the outcome, yet there would be no criterion for calling one "right" and the other "wrong." We cannot call one of the two people right simply because he or she assigned a higher probability to the outcome that actually occurred. After all, you would be right to attribute probability 1/6 to throwing a six with a fair die, and your friend who attributes 2/3 to this event would be wrong. And you are still right (and your friend is still wrong) even if the die ends up showing a six!

The following example illustrates the present approach to probabilities.
Suppose you wish to know what the weather will be like next Saturday
because you are planning a picnic. You turn on your radio, and the
weather person says, "There is a 10% chance of rain." You decide to
have the picnic outdoors and, lo and behold, it rains. You are furious
with the weather person. But was he or she wrong? No, they did not say
it would not rain, only that rain was unlikely. The weather person would
have been flatly wrong only if they said that the probability is 0 and it
subsequently rained. However, if you kept track of the weather
predictions over a long period of time and found that it rained on 50%
of the days that the weather person said the probability was 0.10, you
could say his or her probability assessments are wrong.

So when is it sensible to say that the probability of rain is 0.10?


According to a frequency interpretation, it means that it will rain 10%
of the days on which rain is forecast with this probability.
Chapter 4

1. What is the probability of rolling a pair of dice and obtaining a total
score of 10 or more?
2. A box contains three black pieces of cloth, two striped pieces,
and four dotted pieces. A piece is selected randomly and then
placed back in the box. A second piece is selected randomly.
What is the probability that:
(a) both pieces are dotted?
(b) the first piece is black and the second piece is dotted?
(c) one piece is black and one piece is striped?
3. A card is drawn at random from a deck. What is the probability that it
is an ace or a king?

4. A card is drawn at random from a deck. What is the probability it is
either a red card, an ace, or both?

5. Two cards are drawn from a deck (without replacement). What is the
probability they are both diamonds?

1. 1/6

2a. 16/81
2b. 4/27
2c. 4/27
3. 2/13

4. 7/13

5. 1/17
Sampling Distribution (1 of 3)

If you compute the mean of a sample of 10 numbers, the value you
obtain will not equal the population mean exactly; by chance it will be a
little bit higher or a little bit lower. If you sampled sets of 10 numbers
over and over again (computing the mean for each set), you would find
that some sample means come much closer to the population mean than
others. Some would be higher than the population mean and some would
be lower. Imagine sampling 10 numbers and computing the mean over
and over again, say about 1,000 times, and then constructing a relative
frequency distribution of those 1,000 means. This distribution of means
is a very good approximation to the sampling distribution of the mean.
The sampling distribution of the mean is a theoretical distribution that is
approached as the number of samples in the relative frequency
distribution increases. With 1,000 samples, the relative frequency
distribution is quite close; with 10,000 it is even closer. As the number
of samples approaches infinity, the relative frequency distribution
approaches the sampling distribution.
Sampling Distribution (2 of 3)

The sampling distribution of the mean for a sample size of 10 was just
an example; there is a different sampling distribution for other sample
sizes. Also, keep in mind that the relative frequency distribution
approaches a sampling distribution as the number of samples increases,
not as the sample size increases, since there is a different sampling
distribution for each sample size.

A sampling distribution can also be defined as the relative frequency
distribution that would be obtained if all possible samples of a
particular sample size were taken. For example, the sampling
distribution of the mean for a sample size of 10 would be constructed
by computing the mean for each of the possible ways in which 10
scores could be sampled from the population and creating a relative
frequency distribution of these means. Although these two definitions
may seem different, they are actually the same: Both procedures
produce exactly the same sampling distribution.

Sampling Distribution (3 of 3)
Next section: Sampling distribution of the mean

Statistics other than the mean have sampling distributions too. The
sampling distribution of the median is the distribution that would result
if the median instead of the mean were computed in each sample.
Students often define "sampling distribution" as the sampling
distribution of the mean. That is a serious mistake.

Sampling distributions are very important since almost all inferential
statistics are based on sampling distributions.


Sampling Distribution of the Mean

Next section: Standard error

The sampling distribution of the mean is a very important distribution. In
later chapters you will see that it is used to construct confidence
intervals for the mean and for significance testing. Given a population
with a mean of μ and a standard deviation of σ, the sampling
distribution of the mean has a mean of μ and a standard deviation of

σM = σ/√n

where n is the sample size. The standard deviation of the sampling
distribution of the mean is called the standard error of the mean. It is
designated by the symbol σM. Note that the spread of the sampling
distribution of the mean decreases as the sample size increases; the
mean of the distribution is not affected by sample size.
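
A short simulation makes the formula concrete. The following Python
sketch (numpy assumed available; the population values are illustrative,
not from the text) draws many samples of size n and compares the
standard deviation of the sample means with σ/√n:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n = 500, 100, 10               # illustrative population values

    # Draw 10,000 samples of size n; compute the mean of each sample.
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

    print(means.mean())       # close to mu = 500
    print(means.std())        # close to sigma/sqrt(n) = 31.62
    print(sigma / n**0.5)     # 31.62...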

Standard Error

Next section: Central limit theorem


The standard error of a statistic is the standard deviation of the
sampling distribution of that statistic. Standard errors are
important because they reflect how much sampling fluctuation a
statistic will show. The inferential statistics involved in the
construction of confidence intervals and significance testing are
based on standard errors. The standard
error of a statistic depends on the sample size. In general, the
larger the sample size the smaller the standard error. The
standard error of a statistic is usually designated by the Greek
letter sigma (σ) with a subscript indicating the statistic. For
instance, the standard error of the mean is indicated by the
symbol: σM.

Standard errors for several statistics are given in the sections that
follow: the mean, the difference between means, a linear combination
of means, Pearson's correlation, the median, the standard deviation,
the proportion, and the difference between proportions.

Central Limit Theorem (1 of 2)


The central limit theorem states that given a distribution with a mean μ
and variance σ², the sampling distribution of the mean approaches a
normal distribution with a mean (μ) and a variance σ²/N as N, the
sample size, increases. The amazing and counter-intuitive thing about
the central limit theorem is that no matter what the shape of the original
distribution, the sampling distribution of the mean approaches a normal
distribution. Furthermore, for most distributions, a normal distribution is
approached very quickly as N increases. Keep in mind that N is the
sample size for each mean and not the number of samples. Remember in
a sampling distribution the number of samples is assumed to be infinite.
The sample size is the number of scores in each sample; it is the number
of scores that goes into the computation of each mean.

On the next page are shown the results of a simulation exercise to


demonstrate the central limit theorem. The computer sampled N scores
from a uniform distribution and computed the mean. This procedure
was performed 500 times for each of the sample sizes 1, 4, 7, and 10.

Central Limit Theorem (2 of 2)


Next section: Area under sampling distribution of the
mean

Below are shown the resulting frequency distributions each based on


500 means. For N = 4, 4 scores were sampled from a uniform
distribution 500 times and the mean computed each time. The same
method was followed with means of 7 scores for N = 7 and 10 scores
for N = 10.
Two things should be noted about the effect of increasing N:

1. The distributions become more and more normal.

2. The spread of the distributions decreases.
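
A minimal Python sketch of the simulation just described (numpy assumed
available, not part of the text): sample N scores from a uniform
distribution, compute the mean, and repeat 500 times for each sample size.

    import numpy as np

    rng = np.random.default_rng(0)
    for N in (1, 4, 7, 10):
        # 500 means, each based on N scores from a uniform distribution.
        means = rng.uniform(0, 1, size=(500, N)).mean(axis=1)
        # As N increases, the means cluster more tightly around 0.5 and
        # their distribution looks more and more normal.
        print(N, round(float(means.std()), 3))
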
Area Under the Sampling Distribution of the Mean (1 of 4)

Assume a test with a mean of 500 and a standard deviation of 100.
Which is more likely: (1) that the mean of a sample of 5 people is
greater than 580 or (2) that the mean of a sample of 10 people is
greater than 580? Using your intuition, you may have been able to
figure out that a mean over 580 is more likely to occur with the
smaller sample.

One way to approach problems of this kind is to think of the extremes.
What is the probability that the mean of a sample of 1,000 people would
be greater than 580? The probability is practically zero since the mean
of a sample that large will almost certainly be very close to the
population mean. The chance that it is more than 80 points away is
practically nil. On the other hand, with a small sample, the sample mean
could very well be as many as 80 points from the population mean.
Therefore, the larger the sample size, the less likely it is that a sample
mean will deviate greatly from the population mean. It follows that it is
more likely that the sample of 5 people will have a mean greater than
580 than will the sample of 10 people.

Area Under the Sampling Distribution of the Mean (2 of 4)

To figure out the probabilities exactly, it is necessary to make the
assumption that the distribution is normal. Given normality and the
formula for the standard error of the mean, the probability that the
mean of 5 students is over 580 can be calculated in a manner almost
identical to that used in calculating the area under portions of the
normal curve.

Since the question involves the probability of a mean of 5 numbers
being over 580, it is necessary to know the distribution of means of 5
numbers. But that is simply the sampling distribution of the mean with
an N of 5. The mean of the sampling distribution of the mean is μ (500
in this example) and the standard deviation is

σM = σ/√N = 100/2.236 = 44.72.

The sampling distribution of the mean is shown below. The area to the
left of 580 is shaded. What proportion of the curve is below 580? Since
580 is 80 points above the mean and the standard deviation is 44.72,
580 is 80/44.72 = 1.79 standard deviations above the mean.
Area under Sampling Distribution of the Mean (3 of 4)

The formula for z used here is a special case of the general formula
for z:

z = (X - μ)/σ.

Since the distribution of interest is the distribution of means, the
formula can be rewritten as

z = (M - μ)/σM

where M is a sample mean, μ is the mean of the distribution of means
(which is equal to the population mean), and σM is the standard error
of the mean. In general, when a problem asks about the probability of
a mean, the sampling distribution of the mean should be used, and the
standard error of the mean is used as the standard deviation.
Continuing with the calculations,

M = 580, μ = 500, σM = 44.72

and therefore,

z = (580 - 500)/44.72 = 1.79.

Area under Sampling Distribution of the Mean (4 of 4)

Next section: Difference between means


From a z table, it can be determined that 0.96 of the distribution is
below 1.79. Therefore the probability that the mean of 5 numbers will
be greater than 580 is only 0.04. The calculation of the probability
with N = 10 is similar. The standard error of the mean (σM) is equal to:

σM = 100/√10 = 31.62

which, of course, is smaller than the value of 44.72 obtained for N = 5.
Using the formula:

z = (580 - 500)/31.62 = 2.53

and a z table to calculate the probability, it can be determined that the
probability of obtaining a mean based on N = 10 that is greater than 580
is only 0.01. As expected, this is much lower than the probability of .04
obtained for N = 5.

Summing up, finding an area under the sampling distribution of the
mean is the same as finding an area below any normal curve. In this
case, the normal curve is the sampling distribution of the mean: it has a
mean of μ and a standard deviation of σ/√N.
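
The two probabilities can be checked directly. A short sketch using
Python and scipy (assumed available; not part of the text):

    from math import sqrt
    from scipy.stats import norm

    mu, sigma = 500, 100
    for N in (5, 10):
        se = sigma / sqrt(N)                  # standard error of the mean
        p = norm.sf(580, loc=mu, scale=se)    # P(sample mean > 580)
        print(N, round(se, 2), round(p, 3))   # about 0.037 and 0.006

(These round to the .04 and .01 given above, which were computed from
z values rounded to two places.)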

Sampling Distribution, Difference Between Independent Means (1 of 5)

This section applies only when the means are computed from
independent samples. The formulas are more complicated when the two
means are not independent. Let's say that a researcher has come up with
a drug that improves memory. Consider two hypothetical populations:
the performance of people on a memory test if they had taken the drug
and the performance of people if they had not. Assume that the mean
(μ1) and the variance (σ1²) of the distribution of people taking the drug
are 50 and 25 respectively, and that the mean (μ2) and the variance (σ2²)
of the distribution of people not taking the drug are 40 and 24
respectively. It follows that the drug, on average, improves performance
on the memory test by 10 points. This 10-point improvement is for the
whole population. Now consider the sampling distribution of the
difference between means. This distribution can be understood by
thinking of the following sampling plan:

Sampling Distribution, Difference between Independent Means (2 of 5)

Sample n1 scores from the population of people taking the drug and
compute the mean. This mean will be designated as M1. Then, sample
n2 scores from the population of people not taking the drug and
compute the mean. This mean will be designated as M2. Finally,
compute the difference between M1 and M2. This difference will be
called Md, where the "d" stands for "difference." This is the statistic
whose sampling distribution is of interest.

The sampling distribution could be approximated by repeating the
above sampling procedure over and over while plotting each value of
Md. The resulting frequency distribution would be an approximation to
the sampling distribution. The mean and the variance of the sampling
distribution of Md are:

μMd = μ1 - μ2

and

σMd² = σ1²/n1 + σ2²/n2.

If σ1² = σ2² = σ² and n1 = n2 = n, then

σMd² = 2σ²/n.

For the present example, μMd = 50 - 40 = 10.
Sampling Distribution, Difference between Independent Means (3 of 5)

If n1 = 10 and n2 = 8, then

σMd² = 25/10 + 24/8 = 2.5 + 3.0 = 5.5.

Finally, the standard error of Md is simply the square root of the
variance of the sampling distribution of Md. So,

σMd = √5.5 = 2.35.

Once you know the mean and the standard error of the sampling
distribution of the difference between means, you can answer questions
such as the following: If an experiment with the memory drug just
described is conducted, what is the probability that the mean of the
group of 10 subjects getting the drug will be 15 or more points higher
than the mean of the 8 subjects not getting the drug? Right away it can
be determined that the chances are not very high, since on average the
difference between the drug and no-drug groups is only 10.

Sampling Distribution, Difference between Independent Means (4 of 5)

A look at a graph of the sampling distribution of Md makes the problem
more concrete. The mean of the sampling distribution is 10 and the
standard deviation is 2.35. The graph shows the distribution. Back to the
question: What is the probability that the drug group will score 15 or
more points higher?

The blue region of the graph is 15 or higher and is a small portion of
the area. The probability can be determined by computing the number
of standard deviations above the mean 15 is. Since the mean is 10 and
the standard deviation is 2.35, 15 is (15 - 10)/2.35 = 2.13 standard
deviations above the mean. From a z table it can be determined that
0.983 of the area is below 2.13; therefore, 0.017 of the area is above. It
follows that the probability of a difference between the drug and
no-drug means of 15 or larger is .017.

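This value can also be computed directly, without z scores. A Python
sketch (scipy assumed available), using the example's population values:

    from math import sqrt
    from scipy.stats import norm

    var1, n1 = 25, 10        # drug group: population variance, sample size
    var2, n2 = 24, 8         # no-drug group
    mean_d = 50 - 40         # mean of the sampling distribution of Md
    se_d = sqrt(var1/n1 + var2/n2)                         # 2.35

    print(round(norm.sf(15, loc=mean_d, scale=se_d), 3))   # about 0.017
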
Sampling Distribution, Difference between Independent Means (5 of 5)

Next section: Linear combination of means

The main difference between this problem and simple problems
involving the area under the normal distribution is that in this problem
you first had to figure out the mean and standard deviation of the
sampling distribution of the difference between two means. In the
simpler problems involving the area under the normal distribution, you
were always given the mean and standard deviation of the distribution.
The formula used there:

z = (X - μ)/σ

is used here in a different form:

z = (Md - μMd)/σMd.

Since the problem now concerns differences between means, the
relevant statistic is Md, the difference between means obtained in the
sample (experiment); the relevant population mean is μMd, the mean of
the sampling distribution of the difference between means; and the
relevant standard deviation is σMd, the standard deviation of the
sampling distribution of the difference between means.

Sampling Distribution of a Linear Combination of Means (1 of 4)

Assume there are k populations each with the same variance (σ²).
Further assume that (1) n subjects are sampled randomly from
each population with the mean computed for each sample and (2) a
linear combination of these means is computed by multiplying
each mean by a coefficient and summing the results. Let the linear
combination be designated by the letter "L." If this sampling
procedure were repeated over and over again, a
different value of L would be obtained each time. It is this
distribution of the values of L that makes up the sampling
distribution of a linear combination of means. The importance of
linear combinations of means can be seen in the section
"Confidence interval on linear combination of means " where it is
shown that many experimental hypotheses can be stated in terms
of linear combinations of the mean and that the choice of
coefficients determines the hypothesis tested. The formula for L
is:
L = a1M1 + a2M2 + ... + akMk

where M1 is the mean of the numbers sampled from Population 1, M2 is
the mean of the numbers sampled from Population 2, etc. The
coefficient a1 is used to multiply the first mean, a2 is used to multiply
the second mean, etc.

Sampling Distribution of a Linear Combination of Means (2 of 4)

Assuming the means from which L is computed are independent, the
mean and standard deviation of the sampling distribution of L are:

μL = a1μ1 + a2μ2 + ... + akμk

and

σL = √(σ² Σai²/n)

where μi is the mean for population i, σ² is the variance of each
population, and n is the number of elements sampled from each
population. Consider an example application using the sampling
distribution of L. Assume that on a test of reading ability, the population
means for 10, 12, and 14 year olds are 60, 68, and 80 respectively.
Further assume that the variance within each of these three populations
is 100. Then, μ1 = 60, μ2 = 68, μ3 = 80, and σ² = 100. If eight
10-year-olds, eight 12-year-olds, and eight 14-year-olds are sampled
randomly, what is the probability that the mean for the 14-year-olds will
be 15 or more points higher than the average of the means for the two
younger groups?

Sampling Distribution of a Linear Combination of Means (3 of 4)

Or, symbolically, what is the probability that

M3 - (M1 + M2)/2 ≥ 15?

The answer lies in the sampling distribution of M3 - (M1 + M2)/2.
Letting a1 = -.5, a2 = -.5, and a3 = 1,

L = (-.5)M1 + (-.5)M2 + (1)M3.

The mean and standard deviation of the sampling distribution of L for
this example are:

μL = a1μ1 + a2μ2 + ... + akμk = (-.5)(60) + (-.5)(68) + (1)(80) = 16

and

σL = √(σ² Σai²/n) = √((100)(1.5)/8) = 4.33.

Sampling Distribution of a Linear Combination of Means (4 of 4)


Next section: Pearson's correlation
The sampling distribution of L therefore has a mean of 16 and a
standard deviation of 4.33. The question is, what is the probability of
getting a value of L greater than or equal to 15? The formula

z = (L - μL)/σL = (15 - 16)/4.33 = -0.23

can be used to find out how many standard deviations above μL an L of
15 is. Using a z table, it can be determined that 0.41 of the time a z of
-0.23 or lower would occur. Therefore, the probability of a z of -0.23 or
higher occurring is (1 - 0.41) = 0.59. So, the probability that
M3 - (M1 + M2)/2 is greater than or equal to 15 is 0.59.
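
The same answer can be reached directly. A Python sketch (scipy assumed
available), using the example's coefficients and population values:

    from math import sqrt
    from scipy.stats import norm

    a = [-0.5, -0.5, 1.0]      # coefficients for the linear combination
    mu = [60, 68, 80]          # population means for the three age groups
    var, n = 100, 8            # common variance and per-group sample size

    mu_L = sum(ai*mi for ai, mi in zip(a, mu))      # 16
    se_L = sqrt(var * sum(ai**2 for ai in a) / n)   # 4.33

    print(round(norm.sf(15, loc=mu_L, scale=se_L), 2))   # about 0.59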

Sampling Distribution of Pearson's r (1 of 3)


Just like any other statistic, Pearson's r has a sampling distribution. If
N pairs of scores were sampled over and over again, the resulting
Pearson r's would form a distribution. When the absolute value of the
correlation in the population is low (say, less than about 0.4), then the
sampling distribution of Pearson's r is approximately normal. However,
with high values of correlation, the distribution has a negative skew.
The graph below shows the sampling distribution of Pearson's r when
the population correlation is 0.60 and when N = 12. The negative skew
is apparent.

A transformation called Fisher's z' transformation converts Pearson's r
to a value that is normally distributed and with a standard error of:

1/√(N - 3).

Sampling Distribution of Pearson's r (2 of 3)

It stands to reason that the greater the sample size (N, the number of
pairs of scores), the smaller the standard error: since N is in the
denominator of the formula, the larger the sample size, the smaller the
standard error. Consider the following problem: If the population
correlation (rho) between scores on an aptitude test and grades in
school were 0.5, what is the probability that a correlation based on 19
students would be larger than 0.75? The first step is to convert a
correlation of 0.5 to z'. This can be done with the r to z' table. The
value of z' is 0.55. The standard error is:

1/√(19 - 3) = 1/4 = .25.
Sampling Distribution of Pearson's r (3 of 3)

Next section: Difference between correlations

The number of standard deviations from the mean can be calculated with
the formula:

z = (z' - μz')/σz'

where z is the number of standard deviations above the z' associated
with the population correlation; z' is the value of Fisher's z' for the
sample correlation (z' = .97 in this case); μz' is the value of z' for the
population correlation (.55 in this case), which is the mean of the
sampling distribution of z'; and σz' is the standard error of Fisher's z',
previously calculated to be .25 for N = 19.

Plugging the numbers into the formula: z = (.97 - .55)/.25 = 1.68.
Therefore, a correlation of .75 is associated with a value 1.68 standard
deviations above the mean. As shown previously, a z table can be used
to find that .95 of a normal distribution lies below a value 1.68 standard
deviations above the mean. Therefore there is a .05 probability of
obtaining a Pearson's r of .75 or greater when the "true" correlation is
only .50.

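Fisher's z' is the inverse hyperbolic tangent of r, so the whole
calculation can be sketched in a few lines of Python (numpy and scipy
assumed available):

    import numpy as np
    from scipy.stats import norm

    rho, r, N = 0.50, 0.75, 19
    z_pop = np.arctanh(rho)      # z' for the population correlation: 0.549
    z_samp = np.arctanh(r)       # z' for the sample correlation: 0.973
    se = 1 / np.sqrt(N - 3)      # 0.25

    z = (z_samp - z_pop) / se    # about 1.69 (1.68 with rounded z' values)
    print(round(float(norm.sf(z)), 3))    # about 0.045, i.e. roughly .05
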
Sampling Distribution of the Difference Between Independent Pearson
r's (1 of 2)

The sampling distribution of the difference between two independent
Pearson r's can be approached in terms of the sampling distribution of
z'. First both r's are converted to z'. The standard error of the
difference between independent z's is:

√[1/(N1 - 3) + 1/(N2 - 3)]

where N1 is the number of pairs of scores the first correlation is based
on and N2 is the number of pairs of scores the second correlation is
based upon. It is important to keep in mind that this formula only holds
when the two correlations are independent. This means that different
subjects must be used for each correlation. If three tests are given to the
same subjects, then the correlation between tests one and two is not
independent of the correlation between tests one and three.

Assume that in the population of females the correlation between a test
of verbal ability and a test of spatial ability is 0.6, whereas in the
population of males the correlation is 0.5. If a random sample of 40
females and 35 males is taken, what is the probability that the
correlation for the female sample will be lower than the correlation for
the male sample?

Sampling Distribution of the Difference Between Independent Pearson
r's (2 of 2)

Next section: Median

Start by computing the mean and standard deviation of the sampling
distribution of the difference between z's. As can be calculated with the
help of the r to z' procedure, r's of .6 and .5 correspond to z's of .69 and
.55 respectively. Therefore the mean of the sampling distribution is:

0.69 - 0.55 = 0.14.

The standard deviation is:

√(1/37 + 1/32) = 0.24.

The portion of the distribution for which the difference is negative (the
correlation in the female sample is lower) is shaded. What proportion of
the area is this? A difference of 0 is (0 - 0.14)/0.24 = -0.58 standard
deviations above (0.58 sd's below) the mean. A z table can be used to
calculate that the probability of a z less than or equal to -0.58 is 0.28.
Therefore the probability that the correlation will be lower in the
female sample is 0.28.
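
A sketch of the same calculation in Python (numpy and scipy assumed
available):

    import numpy as np
    from scipy.stats import norm

    z_f, z_m = np.arctanh(0.6), np.arctanh(0.5)   # z' values: ~.69 and ~.55
    n_f, n_m = 40, 35
    mean_diff = z_f - z_m                          # about 0.14
    se = np.sqrt(1/(n_f - 3) + 1/(n_m - 3))        # about 0.24

    # P(difference <= 0): the female sample correlation is the lower one.
    print(round(float(norm.cdf((0 - mean_diff) / se)), 2))   # about 0.28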

Sampling Distribution of Median

Next section: Standard deviation

The standard error of the median for large samples and normal
distributions is:

σmedian = (1.253)σ/√N.

Thus, the standard error of the median is about 25% larger than that for
the mean; the median is less efficient and more subject to sampling
fluctuations. This formula is fairly accurate even for small samples but
can be very wrong for extremely non-normal distributions. For
non-normal distributions, the standard error of the median is difficult
to compute.


Sampling Distribution of the Standard Deviation

Next section: proportion

The standard error of the standard deviation is approximately

σs = σ/√(2N) = (0.71)σ/√N.

The approximation is more accurate for larger sample sizes (N > 16)
and when the population is normally distributed.

The distribution of the standard deviation is positively skewed


for small N but is approximately normal if N is 25 or greater.
Thus, procedures for calculating the area under the normal curve
work for the sampling distribution of the standard deviation as
long as N is at least 25 and the distribution is approximately
normal.

Sampling Distribution of a Proportion (1 of 4)


Assume that 0.80 of all third grade students can pass a test of physical
fitness. A random sample of 20 students is chosen: 13 passed and 7
failed. The parameter π is used to designate the proportion of subjects
in the population that pass (.80 in this case) and the statistic p is used to
designate the proportion who pass in a sample (13/20 = .65 in this case).
The sample size (N) in this example is 20. If repeated samples of size N
were taken from the population and the proportion passing (p) were
determined for each sample, a distribution of values of p would be
formed. If the sampling went on forever, the distribution would be the
sampling distribution of a proportion. The sampling distribution of a
proportion is equal to the binomial distribution. The mean and standard
deviation of the binomial distribution are:

μ = π

and

σp = √[π(1 - π)/N].

For the present example, N = 20, π = 0.80, the mean of the sampling
distribution of p (μ) is .8 and the standard error of p (σp) is 0.089. The
shape of the binomial distribution depends on both N and π. With large
values of N and values of π in the neighborhood of .5, the sampling
distribution is very close to a normal distribution.
Sampling Distribution of a Proportion (2 of 4)

The plot shown here is the sampling distribution for the present
example. As you can see, the distribution is not far from the normal
distribution, although it does have some negative skew.

Assume that for the population of people applying for a job at a bank in
a major city, .40 are able to pass a basic literacy test required to get the
job. Out of a group of 20 applicants, what is the probability that 50% or
more of them will pass? This problem involves the sampling
distribution of p with π = .40 and N = 20. The mean of the sampling
distribution is π = .40. The standard deviation is:

σp = √[(.40)(.60)/20] = .11.
Sampling Distribution of a Proportion (3 of 4)

Using the normal approximation, a proportion of .50 is
(.50 - .40)/.11 = 0.909 standard deviations above the mean. From a
z table it can be calculated that 0.818 of the area is below a z of 0.909.
Therefore the probability that 50% or more will pass the literacy test is
only about 1 - 0.818 = 0.182.

Correction for Continuity


Since the normal distribution is a continuous distribution, the
probability that a sample value will exactly equal any specific
value is zero. However, this is not true when the normal
distribution is used to approximate the sampling distribution of a
proportion. A correction called the "correction for continuity"
can be used to improve the approximation.
The basic idea is that to estimate the probability of, say, 10
successes out of 20 when π is
0.4, one should compute the area between 9.5 and 10.5 as shown
below.

Sampling Distribution of a Proportion (4 of 4)

Next section: Difference between proportions

Therefore, to compute the probability of 10 or more successes, compute
the area above 9.5 successes. In terms of proportions, 9.5 successes is
9.5/20 = 0.475. Therefore, 0.475 is (0.475 - 0.40)/.11 = 0.682 standard
deviations above the mean. The probability of being 0.682 or more
standard deviations above the mean is 0.247 rather than the 0.182 that
was obtained previously. The exact answer calculated using the
binomial distribution is 0.245. For small sample sizes the correction can
make a much bigger difference than it did here.
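
A sketch comparing the three answers in Python (scipy assumed
available): the exact binomial probability, the plain normal
approximation, and the approximation with the correction for continuity.

    from math import sqrt
    from scipy.stats import binom, norm

    pi, N = 0.40, 20
    se = sqrt(pi * (1 - pi) / N)                       # about 0.11

    print(round(binom.sf(9, N, pi), 3))                # exact P(10+): 0.245
    print(round(norm.sf(10/N, loc=pi, scale=se), 3))   # uncorrected: ~0.181
    print(round(norm.sf(9.5/N, loc=pi, scale=se), 3))  # corrected: ~0.247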

Sampling Distribution of the Difference Between Two Proportions (1 of 2)

The mean of the sampling distribution of the difference between two
independent proportions (p1 - p2) is:

μ = π1 - π2.

The standard error of p1 - p2 is:

√[π1(1 - π1)/n1 + π2(1 - π2)/n2].

The sampling distribution of p1 - p2 is approximately normal as long as
the proportions are not too close to 1 or 0 and the sample sizes are not
too small. As a rule of thumb, if n1 and n2 are both at least 10 and
neither is within 0.10 of 0 or 1, then the approximation is satisfactory
for most purposes. An alternative rule of thumb is that the
approximation is good if both Nπ and N(1 - π) are greater than 10 for
both π1 and π2.

To see the application of this sampling distribution, assume that 0.8 of
high school graduates but only 0.4 of high school dropouts are able to
pass a basic literacy test. If 20 students are sampled from the population
of high school graduates and 25 students are sampled from the
population of high school dropouts, what is the probability that the
proportion of dropouts that pass will be as high as the proportion of
graduates?

Sampling Distribution of the Difference Between Two Proportions (2 of 2)

Next chapter: Point estimation

For this example, the mean of the sampling distribution of p1 - p2 is:

μ = π1 - π2 = 0.8 - 0.4 = 0.4.

The standard error is:

√[(.8)(.2)/20 + (.4)(.6)/25] = 0.133.

The solution to the problem is the probability that p1 - p2 is less than
or equal to 0. The number of standard deviations above the mean
associated with a difference in proportions of 0 is:

z = (0 - 0.4)/0.133 = -3.01.

From a z table it can be determined that only 0.0013 of the time would
p1 - p2 be 3.01 or more standard deviations below the mean.
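
A sketch of the computation in Python (scipy assumed available):

    from math import sqrt
    from scipy.stats import norm

    pi1, n1 = 0.8, 20       # high school graduates
    pi2, n2 = 0.4, 25       # high school dropouts
    mean_d = pi1 - pi2                                   # 0.4
    se = sqrt(pi1*(1 - pi1)/n1 + pi2*(1 - pi2)/n2)       # about 0.133

    # P(p1 - p2 <= 0): dropouts do at least as well as graduates.
    print(round(float(norm.cdf((0 - mean_d)/se)), 4))    # about 0.0013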

Chapter 6

1. What are the mean and standard deviation of the sampling distribution
of the mean?
2. Given a test that is normally distributed with a mean of 30 and a
standard deviation of 6, (a) What is the probability that a single
score drawn at random will be greater than 34?
(b) What is the probability that a sample of 9 scores will have a mean
greater than 34?

(c) What is the probability that the mean of a sample of 16 scores will
be either less than 28 or greater than 32?

3. What is a standard error and why is it important?

4. What is the relationship between sample size and the standard error of
the mean?

5. What is the symbol used to designate the standard error of the mean?
The standard error of the median?

6. Young children typically do not know that memory for a list of words
is better if you rehearse the words. Assume two populations: (1) four-
year-old children instructed to rehearse the words and (2) four-year-old
children not given any specific instructions. Assume that the mean and
standard deviation of the number of words recalled by Population 1 are
3.5 and .8. For Population 2, the mean and standard deviation are 2.4
and 0.9. If both populations are normally
distributed, what is the probability that the mean of a sample of 10 from
Population 1 will exceed the mean of a sample of 12 from Population 2
by 1.8 or more?

7. Assume four normally-distributed populations with means of 9.8, 11,


12, and 10.4 all with the same standard deviation of 5. Nine subjects are
sampled from each population and the mean of each sample computed.
What is the probability that average of the means of the samples from
Populations 1 and 2 will be greater than the average of the means of the
samples from Populations 3 and 4?

8. If the correlation between reading achievement and math


achievement in the population of fifth graders were .45, what would be
the probability that in a sample of 12 students, the sample correlation
coefficient would be greater than .7?

9. If numerous samples of N = 15 are taken from a uniform distribution


and a relative frequency distribution of the means drawn, what would
be the shape of the frequency distribution?

10. If a fair coin is flipped 18 times, what is the probability it will


come up heads 14 or more times?
11. Some computer programs require the user to type in commands to
get the program to do what he or she wants. Others allow the user to
choose from menus and push buttons. Assume that .45 of office workers
are able to solve a particular problem using a program that is "command
based" and that .83 of workers can solve the problem using a
"menu/button based" program. If two samples of 12 subjects are tested,
one with each program, what is the probability that the difference in the
proportions solving the problem will be greater than .5?

Selected Answers

Chapter 6, Selected Answers

2a. 0.2524
2b. 0.0228
2c. 0.1826

6. 0.027

8. 0.125

10. Exact binomial probability = 0.0154; normal approximation (with
the correction for continuity) = 0.0169.
Overview of Point Estimation

Next section: Characteristics of estimators

When a parameter is being estimated, the estimate can be either a single
number or a range of scores. When the estimate is a single number, the
estimate is called a "point estimate"; when the estimate is a range of
scores, the estimate is called an interval estimate. Confidence intervals
are used for interval estimates.

As an example of a point estimate, assume you wanted to estimate the
mean time it takes 12-year-olds to run 100 yards. The mean running
time of a random sample of 12-year-olds would be an estimate of the
mean running time for all 12-year-olds. Thus, the sample mean, M,
would be a point estimate of the population mean, μ.

Often point estimates are used as parts of other statistical calculations.


For example, a point estimate of the standard deviation is used in the
calculation of a confidence interval for μ. Point
estimates of parameters are often used in the formulas for significance
testing.
Point estimates are not usually as informative as confidence intervals.
Their importance lies in the fact that many statistical formulas are
based on them.

Characteristics of Estimators

Next section: Estimating variance

Statistics are used to estimate parameters. Three important


attributes of statistics as estimators are covered in this text:
unbiasedness, consistency, and relative efficiency.

Most statistics you will see in this text are unbiased estimates of
the parameter they estimate. For example, the sample mean, M, is
an unbiased estimate of the population mean, μ.

All statistics covered will be consistent estimators. It is hard to


imagine a reasonably-chosen statistic that is not consistent.
When more than one statistic can be used to estimate a parameter,
one will naturally be more efficient than the other(s). In general the
relative efficiency of two statistics differs depending on the shape of
the distribution of the numbers in the population. Statistics that
minimize the sum of squared deviations
such as the mean are generally the most efficient estimators for
normal distributions but may not be for highly skewed distributions.

Estimating Variance (1 of 4)
The formula for the variance computed in the population, σ², is
different from the formula for an unbiased estimate of variance,
s², computed in a sample. The two formulas are shown below:

σ² = Σ(X-μ)²/N

s² = Σ(X-M)²/(N-1)

The unexpected difference between the two formulas is that the


denominator is N for σ² and is N-1 for s². That there should be a
difference in formulas is very counterintuitive. To understand the
reason that N-1 rather than N is needed in the denominator of the
formula for s², consider the problem of estimating σ² when the
population mean, μ, is already known.
Assume that you knew that the mean amount of practice it takes
student pilots to master a particular maneuver is 12 hours. If you
sampled one pilot and found he or she took 14 hours to master the
maneuver, what would be your estimate of σ²? The answer lies in
considering the definition of variance: It is the average squared
deviation of individual scores from μ.

Estimating Variance (2 of 4)

With only one score, you have one squared deviation of a score
from μ. In this example, the one squared deviation is: (X - μ)² =
(14-12)²= 4.

This single squared deviation from the mean is the best estimate of the
average squared deviation and is an unbiased estimate of σ². Since it is
based on only one score, the estimate is not a very good estimate
although it is still unbiased. It follows that if μ is known and N scores are
sampled from the population, then an unbiased estimate of σ² could be
computed with the following formula:
Σ(X - μ)²/N.

Now it is time to consider what happens when μ is not known and M is


used as an estimate of μ. Which value is going to be larger for a sample
of N values of X:

Σ(X - M)²/N or Σ(X - μ)²/N?

Since M is the mean of the N values of X and since the sum of squared
deviations of a set of numbers
from their own mean is smaller than the sum of squared
deviations from any other number, the quantity Σ(X - M)²/N will
always be smaller than Σ(X - μ)²/N.

Estimating variance (3 of 4)

The argument goes that since Σ(X - μ)²/N is an unbiased estimate of σ²
and since Σ(X - M)²/N is always smaller than Σ(X - μ)²/N, then
Σ(X - M)²/N must be biased and will have a tendency to underestimate
σ². It turns out that dividing by N-1 rather than by N increases the
estimate just enough to eliminate the bias exactly.

Another way to think about why you divide by N-1 rather than by N has
to do with the concept of degrees of freedom. When μ is known, each
value of (X - μ)² is an independent estimate of σ². The estimate of σ²
based on N X's is simply the average of these N independent estimates,
so it can be written as:

Σ(X - μ)²/df

where there are N degrees of freedom and therefore df = N. When μ is
not known and has to be estimated with M, the N values of (X - M)² are
not independent because if you know the value of M and the values of
N-1 of the X's, then you can compute the value of the Nth X exactly.

Estimating variance (4 of 4)

Next chapter: Confidence intervals

The number of degrees of freedom an estimate is based upon is equal
to the number of independent scores that went into the estimate minus
the number of parameters estimated en route to the estimation of the
parameter of interest. In this case, there are N independent scores and
one parameter (μ) is estimated en route to the estimation of the
parameter of interest, σ². Therefore the estimate has N-1 degrees of
freedom. The formula for s² can then be written as:

s² = Σ(X - M)²/df

where df = N-1. Naturally, the greater the degrees of freedom, the
closer the estimate is likely to be to σ².
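
The bias is easy to see in a simulation. A minimal Python sketch (numpy
assumed available; the population values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, n = 25.0, 5
    x = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))

    dev2 = (x - x.mean(axis=1, keepdims=True))**2   # squared deviations from M
    biased = dev2.sum(axis=1) / n                   # divide by N
    unbiased = dev2.sum(axis=1) / (n - 1)           # divide by N - 1

    print(biased.mean())     # about 20: (N-1)/N x sigma2, an underestimate
    print(unbiased.mean())   # about 25, matching sigma2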

Chapter 7

1. Define bias in terms of expected value.

2. Is it possible for a statistic to be unbiased yet be very inefficient?


How about being very efficient but biased?

3. You know that μ = 10 and sample two numbers. If the two numbers
are 6 and 10, what is your best estimate of σ²? How many degrees of
freedom does your estimate have?

4. You do not know μ and sample two numbers. If the two numbers
are 6 and 10, what is your best estimate of σ²? How many degrees of
freedom does your estimate have?
Overview of Confidence Intervals (1 of 2)

Before a simple research question such as "What is the mean number


of digits that can be remembered?" can be answered, it is necessary to
specify the population of people to which it is addressed. The
researcher could be interested in, for example, adults over the age of
18, all people regardless of age, or students attending high school. For
the present example, assume the researcher is interested in students
attending high school.

Once the population is specified, the next step is to take a random


sample from it. In this example, let's say that a sample of 10 students
were drawn and each student's memory tested. The way to estimate the
mean of all high school students would be to compute the mean of the
10 students in the sample. Indeed, the sample mean is an unbiased
estimate of μ, the population mean. But it will certainly not be a
perfect estimate. By chance it is bound to be at least either a little bit
too high or a little bit too low (or, perhaps, much too high or much too
low) .
For the estimate of μ to be of value, one must have some idea of how
precise it is. That is, how close to μ is the estimate likely to be?

Overview of Confidence Intervals (2 of 2)

Next section: Mean, σ known

An excellent way to specify the precision is to construct a confidence
interval. If the number of digits remembered for the 10 students were:
4, 4, 5, 5, 5, 6, 6, 7, 8, 9, then the estimated value of μ would be 5.9 and
the 95% confidence interval would range from 4.71 to 7.09.

The wider the interval, the more confident you are that it contains the
parameter. The 99% confidence interval is therefore wider than the 95%
confidence interval and extends from 4.19 to 7.61.

Below are shown some examples of possible confidence intervals.
Although the parameter μ1 - μ2 represents the difference between two
means, it is still valid to think of it as one parameter; π1 - π2 can also be
thought of as one parameter. In each line, the lower limit is on the left,
the parameter in the middle, and the upper limit on the right:

0.2 ≤ π ≤ 0.7
-3.2 ≤ μ ≤ 4.5
3.5 ≤ μ1 - μ2 ≤ 7.9
0.4 ≤ π ≤ 0.8
0.3 ≤ π1 - π2 ≤ 0.7

Confidence Interval for μ, Standard Deviation Known (1 of 3)

This section explains how to compute a confidence interval for the
mean of a normally distributed variable for which the population
standard deviation is known. In practice, the population standard
deviation is rarely known. However, learning how to compute a
confidence interval when the standard deviation is known is an
excellent introduction to how to compute a confidence interval when
the standard deviation has to be estimated.

Three values are used to construct a confidence interval for μ: the
sample mean (M), the value of z (which depends on the level of
confidence), and the standard error of the mean (σM). The confidence
interval has M for its center and extends a distance equal to the product
of z and σM in both directions. Therefore, the formula for a confidence
interval is:

M - zσM ≤ μ ≤ M + zσM.
Assume that the standard deviation of SAT verbal scores in a school
system is known to be 100. A researcher wishes to estimate the mean
SAT score and compute a 95% confidence interval from a random
sample of 10 scores.
Confidence Interval for μ, Standard Deviation Known (2 of 3)

The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780.
Therefore, M = 530, N = 10, and

σM = 100/√10 = 31.62.

The value of z for the 95% confidence interval is the number of


standard deviations one must go from the mean (in both directions) to
contain 0.95 of the scores.
It turns out that one must go 1.96 standard deviations from the mean in
both directions to contain
0.95 of the scores. The value of 1.96 was found using a z table. Since
each tail is to contain 0.025 of the scores, you find the value of z for
which 1-0.025 = 0.975 of the scores are below. This value is 1.96.

All the components of the confidence interval are now known: M = 530,
σM = 31.62, z = 1.96.

Lower limit = 530 - (1.96)(31.62) = 468.02
Upper limit = 530 + (1.96)(31.62) = 591.98


Confidence Interval for μ, Standard Deviation Known (3 of 3)


Next section: Mean, σ estimated
Therefore, 468.02 ≤ μ ≤ 591.98. Naturally, if a larger sample size had
been used, the range of scores would have been smaller.

The computation of the 99% confidence interval is exactly the same


except that 2.58 rather than
1.96 is used for z. The 99% confidence interval is: 448.54 ≤ μ ≤ 611.46.
As it must be, the 99%
confidence interval is even wider than the 95% confidence interval.

Summary of Computations

1. Compute M = ΣX/N.
2. Compute σM = σ/√N.
3. Find z (1.96 for 95% interval; 2.58 for 99% interval).
4. Lower limit = M - zσM
5. Upper limit = M + zσM
6. Lower limit ≤ μ ≤ Upper limit

Assumptions:
1. Normal distribution
2. σ is known
3. Scores are sampled randomly and are independent
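
These steps can be sketched in Python (numpy and scipy assumed
available), using the SAT example above:

    import numpy as np
    from scipy.stats import norm

    scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]
    sigma = 100                            # known population sd
    M = np.mean(scores)                    # 530
    se = sigma / np.sqrt(len(scores))      # 31.62
    z = norm.ppf(0.975)                    # 1.96 for a 95% interval

    print(M - z*se, M + z*se)              # about 468.02 and 591.98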

Confidence Interval for μ, Standard Deviation Estimated (1 of 3)

It is very rare for a researcher wishing to estimate the mean of a


population to already know its standard deviation. Therefore, the
construction of a confidence interval almost always involves the
estimation of both μ and σ.

When σ is known, the formula

M - zσM ≤ μ ≤ M + zσM

is used for a confidence interval. When σ is not known,

sM = s/√N (where N is the sample size)

is used as an estimate of σM. Whenever the standard deviation is
estimated, the t rather than the normal (z) distribution should be used.
The values of t are larger than the values of z, so confidence intervals
when σ is estimated are wider than confidence intervals when σ is
known.

The formula for a confidence interval for μ when σ is estimated is:

M - t sM ≤ μ ≤ M + t sM

where M is the sample mean, sM is an estimate of σM, and t depends on
the degrees of freedom and the level of confidence.

Confidence Interval for μ, Standard Deviation Estimated (2 of 3)

The value of t can be determined from a t table. The degrees of


freedom for t is equal to the degrees of freedom for the estimate of
σM which is equal to N-1.

Suppose a researcher were interested in estimating the mean reading


speed (number of words per minute) of high-school graduates and
computing the 95% confidence interval. A sample of 6 graduates was
taken and the reading speeds were: 200, 240, 300, 410, 450, and 600.
For these data,
M = 366.6667
sM = 60.9736
df = 6 - 1 = 5
t = 2.571

Therefore, the lower limit is M - (t)(sM) = 209.904 and the upper limit
is M + (t)(sM) = 523.430, and the 95% confidence interval is:

209.904 ≤ μ ≤ 523.430.

Thus, the researcher can conclude, based on the rounded-off 95%
confidence interval, that the mean reading speed of high-school
graduates is between 210 and 523.

Confidence Interval for μ, Standard Deviation Estimated (3 of 3)

Next section: General formula

Summary of
Computations

1. Compute M = ΣX/N.

2. Compute s, the estimated standard deviation.

3. Compute sM = s/√N.

4. Compute df = N - 1.

5. Find t for these df using a t table.

6. Lower limit = M - t sM

7. Upper limit = M + t sM

8. Lower limit ≤ μ ≤ Upper limit

Assumptions:

1. Normal distribution

2. Scores are sampled randomly and are independent
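
A sketch of the reading-speed example in Python (numpy and scipy
assumed available):

    import numpy as np
    from scipy.stats import t

    speeds = [200, 240, 300, 410, 450, 600]
    M = np.mean(speeds)                                  # 366.67
    sM = np.std(speeds, ddof=1) / np.sqrt(len(speeds))   # 60.97
    df = len(speeds) - 1                                 # 5
    t_val = t.ppf(0.975, df)                             # 2.571

    print(M - t_val*sM, M + t_val*sM)                    # about 209.9, 523.4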

General Formula for Confidence Intervals (1 of 2)

If a statistic is normally distributed and the standard error of the


statistic is known, then a confidence interval for that statistic can
be computed as follows:
statistic ± (z) (σstat)
where σstat is the standard error of the statistic. For instance, the
confidence interval for the mean
is:
M ± z σM.
where M is the sample mean and σM is the standard error of the mean. If
the standard error has to be estimated then the formula uses t:

statistic ± (t)(sstat)

where sstat is the estimated standard error of the statistic. For example,
the confidence interval for the mean when σ is estimated is

M ± t sM.

General Formula for Confidence Intervals (2 of 2)

Next section: Differences between means of independent


groups, σ known

The general formula applies to a whole host of other statistics, such as
the difference between means and Pearson's correlation. The one
exception to this rule you will encounter is that when confidence
intervals on proportions (or differences between proportions) are
computed, z rather than t is used even though the standard error is
estimated.

Confidence Interval on Difference between Means, Independent
Groups, Standard Deviation Known (1 of 3)

Following the general formula for a confidence interval, the formula for
a confidence interval on the difference between means is:

Md ± (z)(σMd)

where Md = M1 - M2 (the difference between means) is the statistic and
σMd (the standard error of the difference between means) is the
standard error. z depends on the level of confidence desired (1.96 for
the 95% interval and 2.58 for the 99% interval).

For a specific example, consider a hypothetical experiment on reaction
time (the time it takes to push a button in response to a signal). One
group is given a simple reaction time task (a light comes on and they
push the button). A second group is given a choice reaction time task
(an "A" or a "B" is presented and the subjects push one button for "A"
and a different button for "B"). Assume that it is known that the standard
deviation (σ) for the simple reaction time task is 75 and the standard
deviation for the choice reaction time task is 100. Five subjects are
tested with the simple reaction time task and six subjects with the choice
reaction time task.
Confidence Interval on Difference between Means, Independent
Groups, Standard Deviation Known (2 of 3)

Their scores (in msec) are shown below:

Simple RT: 205, 230, 280, 320, 390
Choice RT: 300, 310, 380, 410, 502, 570

MSimple = 285 and MChoice = 412.

The problem is to estimate the difference between simple and choice
reaction time by computing the 95% confidence interval on the
difference. The elements required to compute the interval are: the z for
the 95% confidence interval, the mean difference (Md), and the
standard error of the mean difference (σMd).

z = 1.96
Md = 285 - 412 = -127
σMd = √(75²/5 + 100²/6) = √(2791.67) = 52.836
Confidence Interval on Difference between Means, Independent
Groups, Standard Deviation Known (3 of 3)

Next section: Difference between independent groups, σ estimated

The 95% confidence interval is: -127 ± (1.96)(52.836) = -127 ± 103.559

-230.56 ≤ μ1 - μ2 ≤ -23.44

where μ1 is the mean for simple RT and μ2 is the mean for choice RT.

Summary of Computations

1. Compute Md = M1 - M2.

2. Compute σMd = √(σ1²/n1 + σ2²/n2).

3. Find z (1.96 for 95% interval; 2.58 for 99% interval).

4. Lower limit = Md - zσMd

5. Upper limit = Md + zσMd

6. Lower limit ≤ μ1 - μ2 ≤ Upper limit

Assumptions:

1. The populations are each normally distributed.

2. σ is known

3. Scores are sampled randomly and are independent
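
A sketch of the reaction-time example in Python (standard library only):

    from math import sqrt

    simple = [205, 230, 280, 320, 390]         # sigma known to be 75
    choice = [300, 310, 380, 410, 502, 570]    # sigma known to be 100

    Md = sum(simple)/len(simple) - sum(choice)/len(choice)   # -127
    se = sqrt(75**2/len(simple) + 100**2/len(choice))        # 52.84
    z = 1.96                                                 # 95% interval

    print(Md - z*se, Md + z*se)    # about -230.6 and -23.4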

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (1 of 7)

Following the general formula for a confidence interval, the formula for
a confidence interval on the difference between means (M1 - M2) is:

Md ± (t)(sMd)

where Md = M1 - M2 is the statistic and sMd is an estimate of σMd (the
standard error of the difference between means). t depends on the level
of confidence desired and on the degrees of freedom. The estimated
standard error, sMd, is computed assuming that the variances in the two
populations are equal. If the two sample sizes are equal (n1 = n2 = n),
then the population variance σ² (it is the same in both populations) is
estimated by using the following formula:

MSE = (s1² + s2²)/2

where MSE (which stands for mean square error) is an estimate of σ².
Once MSE is calculated, sMd can be computed as follows:

sMd = √(2MSE/n).

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (2 of 7)

A concrete example should make the procedure for computing the
confidence interval clearer. Assume that an experimenter was interested
in computing the 99% confidence interval on the difference between the
memory spans of seven- and nine-year-old children. Four children at
each age level are tested and their memory spans are shown below:

Age 7: 3, 3, 4, 5
Age 9: 5, 6, 6, 7

The first step is to compute the means of each group: MAge 7 = 3.75
and MAge 9 = 6.00. Therefore, Md = 3.75 - 6.00 = -2.25.

To obtain a value of t, one must first compute the degrees of freedom


(df). The degrees of freedom is equal to the degrees of freedom for
MSE (MSE is used to estimate σ²). Since MSE is made up of two
estimates of σ² (one for each sample), the df for MSE is the sum of the
df for
these two estimates. Therefore, the df for MSE is (n -1) + (n - 1) = 3 + 3
= 6.

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (3 of 7)

A t table shows that the value of t for a 99% confidence interval for
6 df is 3.707. The only remaining term is sMd. The first step is to
compute s1² and s2²:

s1² = 0.917 and s2² = 0.667.

MSE = (0.917 + 0.667)/2 = .792. From the formula:

sMd = √(2MSE/n) = √(2 x .792/4) = 0.629.

All the terms needed to construct the confidence interval have now
been computed. The lower limit (LL) of the interval is:

LL = Md - t sMd = -2.25 - (3.707)(0.629) = -4.58.

UL = Md + t sMd = -2.25 + (3.707)(0.629) = 0.08.

Therefore, -4.58 ≤ μ1 - μ2 ≤ 0.08.

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (4 of 7)

The calculations are only slightly more complicated when the sample
sizes are different (n1 does not equal n2). The first difference in the
calculations is that MSE is computed differently. If the two values of s²
were simply averaged as they are in the case of equal sample sizes, then
the estimate based on the smaller sample size would count as much as
the estimate based on the larger sample size. Instead, the formula for
MSE is:

MSE = SSE/df

where df is the degrees of freedom and SSE is the sum of squares error,
defined as:

SSE = SSE1 + SSE2

SSE1 = Σ(X - M1)², where the X's are from the first group (sample) and
M1 is the mean of the first group. Similarly, SSE2 = Σ(X - M2)², where
the X's are from the second group and M2 is the mean of the second
group. The formula

sMd = √(2MSE/n)

cannot be used without modification since there is not one value of n
but two (n1 and n2).

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (5 of 7)

The solution is to use the harmonic mean of the two sample sizes for n.
The harmonic mean (nh) of n1 and n2 is:

nh = 2/(1/n1 + 1/n2).

Therefore, the formula for the estimated standard error of the
difference between means is:

sMd = √(2MSE/nh).

For the example of the confidence interval on the difference between
memory spans of seven-year-olds and nine-year-olds, assume that one
more seven-year-old was tested and the resulting memory span score
was 3.

For these data, M1 = 3.6, M2 = 6.0, and Md = -2.4.

SSE1 = (3 - 3.6)² + (3 - 3.6)² + (3 - 3.6)² + (4 - 3.6)² + (5 - 3.6)² = 3.2.

SSE2 = (5 - 6)² + (6 - 6)² + (6 - 6)² + (7 - 6)² = 2.0.

Confidence Interval on Difference Between Means, Independent
Groups, Standard Deviation Estimated (6 of 7)

df = (n1 - 1) + (n2 - 1) = 4 + 3 = 7.

Since MSE = SSE/df, MSE = 5.2/7 = 0.743. The harmonic mean is:

nh = 2/(1/5 + 1/4) = 2/(0.2 + 0.25) = 4.4444.

Therefore,

sMd = √(2 x 0.743/4.4444) = 0.578.

Finally, a t table can be used to find that the value of t (with 7 df) to be
used for the 99% confidence interval is 3.499.

The confidence interval is therefore:

LL = -2.4 - (3.499)(0.578) = -4.42
UL = -2.4 + (3.499)(0.578) = -0.38

-4.42 ≤ µ1 - µ2 ≤ -0.38
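The arithmetic above is easy to check by machine. The sketch below
(Python with NumPy and SciPy, an illustration added to this text
rather than part of the original procedure) reproduces the unequal-n
example from the data given above.

import numpy as np
from scipy import stats

age7 = np.array([3, 3, 4, 5, 3])   # original four scores plus the added score of 3
age9 = np.array([5, 6, 6, 7])

md  = age7.mean() - age9.mean()                        # Md = -2.4
sse = ((age7 - age7.mean())**2).sum() + ((age9 - age9.mean())**2).sum()
df  = len(age7) + len(age9) - 2                        # 7
mse = sse / df                                         # 0.743
nh  = 2 / (1/len(age7) + 1/len(age9))                  # harmonic mean, 4.4444
s_md = np.sqrt(2 * mse / nh)                           # 0.578

t = stats.t.ppf(0.995, df)                             # 3.499 for 99% confidence
print(md - t * s_md, md + t * s_md)                    # about -4.42 and -0.38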

Confidence Interval on Difference Between Means, Independent Groups,
Standard Deviation Estimated (7 of 7)

Next section: Linear combinations of means from independent groups

Summary of Computations

1. Compute Md = M1 - M2

2. Compute SSE1 = Σ(X - M1)² for Group 1 and SSE2 = Σ(X - M2)² for
Group 2

3. Compute SSE = SSE1 + SSE2

4. Compute df = N - 2 where N = n1 + n2

5. Compute MSE = SSE/df

6. Find t from a t table.

7. Compute nh = 2/(1/n1 + 1/n2). (If the sample sizes are equal then
nh = n1 = n2.)

8. Compute sMd = √(2MSE/nh)

9. Lower limit = Md - t(sMd)

10. Upper limit = Md + t(sMd)

11. Lower limit ≤ µ1 - µ2 ≤ Upper limit

Assumptions:

1. The populations each are normally distributed.

2. Homogeneity of variance.

3. Scores are sampled randomly and independently from 2 different
populations.

Confidence Interval on Linear Combination of Means, Independent
Groups (1 of 5)

Following the general formula for a confidence interval, the formula
for a confidence interval on a linear combination of means is:

L ± t(sL)

where L is the linear combination of the sample means, t depends on
the level of confidence desired and the degrees of freedom, and sL is
an estimate of σL, the standard error of a linear combination of
means.

The formula for sL is:

sL = √(Σai²MSE/n)

where n is the number of subjects in each group. This is the same as
the formula for σL except that MSE (an estimate of σ²) is used in
place of σ². The formula for MSE here is very similar to the formula
for MSE used in the calculation of a confidence interval on the
difference between two means. The formula is:

MSE = Σsi²/k

where si² is the sample variance of the ith group and k is the number
of groups. This formula assumes homogeneity of variance and that the
k sample sizes are equal.

Confidence Interval on Linear Combination of Means, Independent
Groups (2 of 5)

As an example, consider a hypothetical experiment in which aspirin,
Tylenol, and a placebo are tested to see how much pain relief each
provides. The amount of pain relief is rated on a five-point scale.
Four subjects are tested in each group. The data are shown below:

Aspirin   Tylenol   Placebo
   3         2         2
   5         2         1
   3         4         3
   5         4         2

The problem is to calculate the 95% confidence interval on the
difference between the average of the aspirin and Tylenol groups and
the placebo. The parameter being estimated is:

(μaspirin + μTylenol)/2 - μplacebo

where μaspirin is the population mean for aspirin, μTylenol is the
population mean for Tylenol, and μplacebo is the population mean for
the placebo.

The first step is to find the coefficients (a's) so that

Σaiμi = (μaspirin + μTylenol)/2 - μplacebo

where aspirin is Condition 1, Tylenol is Condition 2, and placebo is
Condition 3. The solution is: a1 = 0.5, a2 = 0.5, and a3 = -1. The
next step is to compute the 3 sample means.

Confidence Interval on Linear Combination of Means, Independent
Groups (3 of 5)

They are: Maspirin = 4, MTylenol = 3, and Mplacebo = 2.

L can now be computed as:

L = (0.5)(4) + (0.5)(3) + (-1)(2) = 1.5.

The estimated value of (μaspirin + μTylenol)/2 - μplacebo is
therefore 1.5.

In order to compute MSE, it is necessary to compute s² for each group.
The values are: 1.334, 1.334, and 0.666. Therefore,

MSE = (1.334 + 1.334 + 0.666)/3 = 1.111

Σa² = 0.5² + 0.5² + (-1)² = 1.5

Since there are four subjects in each group, n = 4.

The component parts of sL can now be combined:

sL = √(Σa²MSE/n) = √((1.5)(1.111)/4) = 0.6455


Confidence Interval on Linear Combination of Means, Independent
Groups (4 of 5)

Finally, to compute t it is necessary to know the degrees of freedom.
Since three values of s² went into MSE and each s² was based on
n - 1 = 3 df, the degrees of freedom for MSE and for t are 3 x 3 = 9.
The value of t for the 95% confidence interval with 9 df can be found
from a t table: it is 2.262. The lower limit (LL) of the confidence
interval is:

LL = 1.5 - (2.262)(0.6455) = 0.04.

The upper limit (UL) is:

UL = 1.5 + (2.262)(0.6455) = 2.96.

The 95% confidence interval is therefore:

0.04 ≤ (μaspirin + μTylenol)/2 - μplacebo ≤ 2.96

The results of this experiment show that, based on the 95% confidence
interval, the average effect of aspirin and Tylenol is somewhere
between 0.04 and 2.96 units more effective than a placebo.
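The same check can be run for the linear combination. The sketch below
(Python with NumPy and SciPy, an added illustration and not part of
the original text) computes L, sL, and the 95% interval from the
pain-relief data above.

import numpy as np
from scipy import stats

aspirin = np.array([3, 5, 3, 5])
tylenol = np.array([2, 2, 4, 4])
placebo = np.array([2, 1, 3, 2])
a = np.array([0.5, 0.5, -1.0])                       # contrast coefficients

groups = [aspirin, tylenol, placebo]
means  = np.array([g.mean() for g in groups])        # 4, 3, 2
L      = (a * means).sum()                           # 1.5
mse    = np.mean([g.var(ddof=1) for g in groups])    # 1.111
n      = 4                                           # subjects per group
s_L    = np.sqrt((a**2).sum() * mse / n)             # 0.6455

df = len(groups) * (n - 1)                           # 9
t  = stats.t.ppf(0.975, df)                          # 2.262 for 95% confidence
print(L - t * s_L, L + t * s_L)                      # about 0.04 and 2.96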

Confidence Interval on Linear Combination of Means, Independent
Groups (5 of 5)

Next section: Pearson's correlation

Summary of Computations

1. Compute the sample mean (M) for each group.

2. Compute the sample variance (s²) for each of the k groups.

3. Find the coefficients (a's) so that Σaiμi is the parameter to be
estimated.

4. Compute L = a1M1 + a2M2 + ... + akMk

5. Compute MSE = Σs²/k

6. Compute sL = √(Σa²MSE/n)

7. Compute df = k(n-1) where k is the number of groups and n is the
number of subjects in each group.

8. Find t for the df and level of confidence desired using a t table

9. Lower limit = L - t(sL)

10. Upper limit = L + t(sL)

11. Lower limit ≤ Σaiμi ≤ Upper limit

Assumptions:

1. All populations are normally distributed.

2. All population variances are equal (homogeneity of variance)


3. Scores are sampled randomly and independently from k different
populations.

4. The sample sizes are equal.

Confidence Interval on Pearson's Correlation (1 of 3)

Since the sampling distribution of Pearson's r is not normally
distributed, Pearson's r is converted to Fisher's z' and the
confidence interval is computed using Fisher's z'. The values of
Fisher's z' in the confidence interval are then converted back to
Pearson's r's. For example, assume a researcher wished to construct a
99% confidence interval on the correlation between SAT scores and
grades in the first year in college at a large state university. The
researcher obtained data from 100 students chosen at random and found
that the sample value of Pearson's r was 0.60. The first step in
computing the confidence interval is to convert 0.60 to a value of z'
using the r to z' table. The value is: z' = 0.69.

The sampling distribution of z' is known to be approximately normal
with a standard error of:

σz' = 1/√(N - 3)

where N is the number of pairs of scores.

Confidence Interval on Pearson's Correlation (2 of 3)

From the general formula for a confidence interval, the formula for a
confidence interval on z' is:

z' ± (z)(σz')

which equals

z' ± (z)(1/√(N - 3))

A z table can be used to find that for the 99% confidence interval you
should use a z of 2.58. For N = 100, the standard error of z' is
0.102. The confidence interval for z' can be computed as:

Lower limit = 0.69 - (2.58)(0.102) = 0.43
Upper limit = 0.69 + (2.58)(0.102) = 0.95
Using the r to z' table to convert the values of 0.43 and 0.95 back
to Pearson r's, it turns out that the confidence interval for the
population value of Pearson's correlation, rho (ρ) is:

0.41 ≤ ρ ≤ 0.74

Therefore, the correlation between SAT scores and college grades is
highly likely to be somewhere between 0.41 and 0.74.
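Because the r to z' transformation is the inverse hyperbolic tangent,
the interval can be checked without a table. The sketch below (Python
with NumPy and SciPy, an added illustration) reproduces the SAT
example:

import numpy as np
from scipy import stats

r, N = 0.60, 100
z_prime = np.arctanh(r)                 # Fisher's z' of 0.60, about 0.69
se      = 1 / np.sqrt(N - 3)            # standard error of z', about 0.102
z       = stats.norm.ppf(0.995)         # 2.58 for a 99% interval

lo, hi = z_prime - z * se, z_prime + z * se   # limits in z' units
print(np.tanh(lo), np.tanh(hi))               # back to r: about 0.41 and 0.74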

Confidence Interval on Pearson's Correlation (3 of 3)

Next section: Difference between correlations

Summary of Computations

1. Compute the sample r.

2. Use the r to z' table to convert the value of r to z'.

3. Use a z table to find the value of z for the level of confidence
desired.

4. Compute σz' = 1/√(N - 3).

5. Lower limit = z' - (z)(σz')

6. Upper limit = z' + (z)(σz')

7. Use the r to z' table to convert the lower and upper limits from
z' to r.

Assumptions

1. Each subject (observation) is sampled independently from each other
subject.

2. Subjects are sampled randomly.

Confidence Interval, Difference between Independent Correlations
(1 of 4)

The procedure for computing a confidence interval on the difference
between two independent correlations is similar to the procedure for
computing a confidence interval on one correlation. The first step is
to convert both values of r to z'. Then a confidence interval is
constructed based on the general formula for a confidence interval
where the statistic is z'1 - z'2, and the standard error of the
statistic is:

σ(z'1 - z'2) = √(1/(N1 - 3) + 1/(N2 - 3))

where N1 is the number of pairs of scores in r1 and N2 is the number
of pairs of scores in r2. Once the confidence interval is computed,
the upper and lower limits are converted back from z' to r. As an
example, assume a researcher were interested in whether the
correlation between verbal and quantitative SAT scores (VSAT and
QSAT) is different for females than it is for males. Samples of 80
females and 75 males are tested and the correlation for females is
0.55 and the correlation for males is 0.42.

Confidence Interval, Difference between Independent Correlations
(2 of 4)

The problem is to construct the 95% confidence interval on the
difference between correlations. The formula is:

(z'1 - z'2) ± (z)(σ(z'1 - z'2))

The first step is to use the r to z' table to convert the two r's. The
r of 0.55 corresponds to a z' of 0.62 and the r of 0.42 corresponds to
a z' of 0.45. Therefore,

z'1 - z'2 = 0.62 - 0.45 = 0.17.

The next step is to find z. From a z table, it can be determined that
the z for 95% confidence intervals is 1.96. Finally,

σ(z'1 - z'2) = √(1/(80 - 3) + 1/(75 - 3)) = 0.164.

Therefore,

Lower limit = 0.17 - (1.96)(0.164) = -0.15
Upper limit = 0.17 + (1.96)(0.164) = 0.49

Converting the lower and upper limits back to r results in r's of
-0.15 and 0.45 respectively. Therefore, the confidence interval is:

-0.15 ≤ ρ1 - ρ2 ≤ 0.45

where ρ1 is the population correlation for females and ρ2 is the
population correlation for males.

Confidence Interval, Difference between Independent Correlations
(3 of 4)

The experiment shows that there is still uncertainty about the
difference in correlations. It could be that the correlation for
females is as much as 0.45 higher. However, it could also be 0.15
lower.
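The interval is small enough to verify directly. The sketch below
(Python with NumPy and SciPy, an added illustration) reproduces the
VSAT/QSAT example:

import numpy as np
from scipy import stats

r1, n1 = 0.55, 80    # females
r2, n2 = 0.42, 75    # males

d  = np.arctanh(r1) - np.arctanh(r2)          # z'1 - z'2, about 0.17
se = np.sqrt(1/(n1 - 3) + 1/(n2 - 3))         # about 0.164
z  = stats.norm.ppf(0.975)                    # 1.96 for 95% confidence
print(np.tanh(d - z * se), np.tanh(d + z * se))
# about -0.15 and 0.46 (0.45 with the table rounding used above)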

Summary of Computations

1. Compute the sample r's.

2. Use an r to z' table to convert the values of r to z'.

3. Compute z'1 - z'2.

4. Use a z table to find the value of z for the level of confidence
desired.

5. Compute σ(z'1 - z'2) = √(1/(N1 - 3) + 1/(N2 - 3)).

6. Lower limit = (z'1 - z'2) - (z)(σ(z'1 - z'2))

7. Upper limit = (z'1 - z'2) + (z)(σ(z'1 - z'2))

8. Use an r to z' table to convert the lower and upper limits to r's.

Confidence Interval, Difference between Independent Correlations
(4 of 4)

Next section: Proportion

Assumptions:

1. Each subject (observation) is sampled independently from each other
subject.

2. Subjects are sampled randomly.

3. The two correlations are from independent samples (different groups
of subjects). If the same subjects were used for both correlations
then the assumption of independence would be violated.

Confidence Interval for a Proportion (1 of 3)

Applying the general formula for a confidence interval, the confidence
interval for a proportion, π, is:

p ± z σp

where p is the proportion in the sample, z depends on the level of
confidence desired, and σp, the standard error of a proportion, is
equal to:

σp = √(π(1 - π)/N)

where π is the proportion in the population and N is the sample size.
Since π is not known, p is used to estimate it. Therefore the
estimated value of σp is:

√(p(1 - p)/N)

As an example, consider a researcher wishing to estimate the
proportion of X-ray machines that malfunction and produce excess
radiation. A random sample of 40 machines is taken and 12 of the
machines malfunction. The problem is to compute the 95% confidence
interval on π, the proportion that malfunction in the population.

Confidence Interval for a Proportion (2 of 3)

The value of p is 12/40 = 0.30. The estimated value of σp is:

√((0.30)(0.70)/40) = 0.072.

A z table can be used to determine that the z for a 95% confidence
interval is 1.96. The limits of the confidence interval are therefore:

Lower limit = 0.30 - (1.96)(0.072) = 0.16
Upper limit = 0.30 + (1.96)(0.072) = 0.44

The confidence interval is: 0.16 ≤ π ≤ 0.44.

Correction for Continuity

Since the sampling distribution of a proportion is not a continuous
distribution, a slightly more accurate answer can be arrived at by
applying the correction for continuity. This is done simply by
subtracting 0.5/N from the lower limit and adding 0.5/N to the upper
limit. For the present example, 0.5/N = 0.5/40 = 0.0125. Therefore the
corrected interval is:

0.15 ≤ π ≤ 0.45.
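The corrected interval can be checked with a few lines of code. The
sketch below (Python with NumPy and SciPy, an added illustration)
reproduces the X-ray machine example:

import numpy as np
from scipy import stats

p, N = 12 / 40, 40
se = np.sqrt(p * (1 - p) / N)            # estimated standard error, about 0.072
z  = stats.norm.ppf(0.975)               # 1.96 for 95% confidence
cc = 0.5 / N                             # correction for continuity, 0.0125

print(p - z * se - cc, p + z * se + cc)  # about 0.15 and 0.45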

Confidence Interval for a Proportion (3 of 3)

Next section: Difference between proportions

Summary of Computations

1. Compute p.

2. Estimate σp by √(p(1 - p)/N).

3. Find z for the level of confidence desired with a z table.

4. Lower limit = p - (z)(estimated σp) - 0.5/N

5. Upper limit = p + (z)(estimated σp) + 0.5/N

6. Lower limit ≤ π ≤ Upper limit

Assumptions

1. Observations are sampled randomly and independently.

2. The adequacy of the normal approximation depends on the sample size
(N) and π. Although there are no hard and fast rules, the following is
a guide to needed sample size:

If π is between 0.4 and 0.6 then an N of 10 is adequate. If π is as
low as 0.2 or as high as 0.8 then N should be at least 25. For π as
low as 0.1 or as high as 0.9, N should be at least 30.

A more conservative rule of thumb that is often recommended is that
Nπ and N(1 - π) should both be at least 10.
Confidence Interval on the Difference Between Proportions (1 of 4)

The confidence interval on the difference between proportions is based
on the same general formula as are other confidence intervals. A
confidence interval on the difference between two proportions is
computed in the following situation: There are two populations and the
members of each population can be classified as falling into one of
two categories. For example, the categories might be such things as
whether or not one has a high-school degree or whether or not one has
ever been arrested.

Consider a researcher interested in whether people who majored in
psychology are more or less likely than physics majors to solve a
problem that involves a certain type of statistical reasoning. The
researcher is interested in estimating the difference in the
proportions of people in the two populations that can solve the
problem and in computing a 99% confidence interval on the difference.
Random samples of 100 psychology majors and 110 physics majors are
taken and each person is given a chance to solve the problem. Of the
100 psychology majors, 65 solve the problem; of the 110 physics majors
only 45 solve it.

Confidence Interval on the Difference Between Proportions (2 of 4)

Therefore the proportions who solve the problem are 0.65 for the
psychology majors (p1) and 0.41 for the physics majors (p2). The goal
of the experiment is to estimate the difference between the proportion
of psychology majors in the population that can solve the problem (π1)
and the proportion of physics majors in the population that can solve
it (π2). The statistic p1 - p2 is used as an estimate of π1 - π2. In
this experiment, the estimated value of π1 - π2 is 0.65 - 0.41 = 0.24.

To compute the confidence interval, the standard error of p1 - p2 is
needed. The standard error is:

σ(p1 - p2) = √(π1(1 - π1)/N1 + π2(1 - π2)/N2)

The estimated standard error (using p1 and p2 in place of π1 and π2)
is:

√((0.65)(0.35)/100 + (0.41)(0.59)/110) = 0.067

Confidence Interval on the Difference Between Proportions (3 of 4)

Finally, the value of z for the 99% confidence interval is found using
a z table; it is 2.58. The lower limit of the confidence interval is
simply:

p1 - p2 - (z)(estimated σ(p1 - p2)) = 0.65 - 0.41 - (2.58)(0.067)
                                    = 0.07

The upper limit is:

p1 - p2 + (z)(estimated σ(p1 - p2)) = 0.65 - 0.41 + (2.58)(0.067)
                                    = 0.41

The 99% confidence interval is therefore:

0.07 ≤ π1 - π2 ≤ 0.41.
This indicates that the proportion of psychology majors that can solve
the problem is from 0.07 to 0.41 higher than the proportion of physics
majors that can solve it.
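A short script verifies the interval. The sketch below (Python with
NumPy and SciPy, an added illustration) reproduces the example:

import numpy as np
from scipy import stats

p1, n1 = 65 / 100, 100    # psychology majors
p2, n2 = 45 / 110, 110    # physics majors

d  = p1 - p2                                     # about 0.24
se = np.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)    # about 0.067
z  = stats.norm.ppf(0.995)                       # 2.58 for 99% confidence
print(d - z * se, d + z * se)                    # about 0.07 and 0.41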

Summary of Computations

1. Compute p1 - p2.

2. Find z for the confidence interval using a z table.

3. Estimate σ(p1 - p2) with the formula:

√(p1(1 - p1)/N1 + p2(1 - p2)/N2)

Confidence Interval on the Difference Between Proportions (4 of 4)

Next chapter: Logic of hypothesis testing

4. Lower limit = p1 - p2 - (z)(estimated σ(p1 - p2))

5. Upper limit = p1 - p2 + (z)(estimated σ(p1 - p2))

6. Confidence interval: lower limit ≤ π1 - π2 ≤ upper limit.

Assumptions

1. The two proportions are independent.

2. The adequacy of the normal approximation depends on the sample size
(N) and π. Although there are no hard and fast rules, the following is
a guide to needed sample size:

If π is between 0.4 and 0.6 then an N of 10 is adequate. If π is as
low as 0.2 or as high as 0.8 then N should be at least 25. For π as
low as 0.1 or as high as 0.9, N should be at least 30.

A more conservative rule of thumb that is often recommended is that
Nπ and N(1 - π) should both be at least 10.


Chapter 8

1. Which is wider, a 95% or a 99% confidence interval?

2. When you construct a 95% confidence interval, what are you 95%
confident about?
3. When computing a confidence interval, when do you use t and when
do you use z?

4. Assume a researcher found that the correlation between a test he or
she developed and job performance was .5 in a study of 25 employees.
If correlations under .30 are considered unacceptable, would you have
any reservations about using this test to screen job applicants?

5. What is the effect of sample size on the width of a confidence
interval?

6. How might an experimenter demonstrate that a drug has a negligible
effect?
7. When is Fisher's z' used? Why is it necessary to use Fisher's z'?

8. What assumptions are made in the construction of a confidence
interval on µ?

9. Under what circumstances would a researcher find it useful to
construct a confidence interval?

10. Why is the construction of confidence intervals considered part of
inferential statistics?

11. How does the t distribution compare with the normal distribution?
How does the difference affect the size of confidence intervals
constructed using z relative to those constructed using t? Does sample
size make a difference?

12. Which would be more likely to be wider, a confidence interval on
the mean based on a sample of 10 subjects or a confidence interval on
the difference between means based on a sample of 10 subjects from
each population? Assume that the variances in these populations are
the same.

13. A population is known to be normally distributed with a standard
deviation of 2.8. Compute the 95% confidence interval on the mean
based on the following sample of 6 numbers: 8, 9, 12, 13, 14, 16.

14. A person claims to be able to predict the outcome of flipping a
coin. They are correct 12/18 times. Compute the 95% confidence
interval on the proportion of times this person can predict coin flips
correctly. What conclusion can you draw about this test of his ability
to predict the future?

15. A hypothetical experiment is conducted to see whether cognitive
therapy is more effective for relieving depression than is
psychodynamic psychotherapy. Out of the population of depressed people
in a certain area, 10 are sampled for the cognitive therapy group and
10 are sampled for the psychodynamic therapy group. After 6 weeks of
therapy, the improvement in each patient is assessed. The improvement
scores are shown below:

Cognitive   Psychodynamic
9 3
7 2
7 4
8 0
3 5
8 2
7 4
5 3
6 2
8 5
Estimate the mean difference in effectiveness and compute a 95%
confidence interval on this estimate. What conclusion can you draw
about the relative effectiveness of the two treatments in the
population?

16. An experiment compared the ability of 7 and 9 year olds to solve a
problem requiring a certain type of abstract reasoning. Five of 21
seven-year olds and 14 of 16 nine-year olds solved the problem.
Construct the 95% confidence interval on the difference in
proportions.

17. In an experiment on the relationship between chess skill and
memory for chess positions, one group of 18 chess players was
presented with a meaningful chess position to memorize while another
group of 20 chess players was given a random position to memorize. The
correlation in the first group between each chess player's skill (as
measured by his or her chess rating) and memory performance was .70;
the correlation in the second group was only .11. Compute the 95%
confidence interval on the difference in correlations.

18. An experiment is conducted comparing the effectiveness of two
methods of teaching algebra. Ten gifted and 10 average students are
taught using each method. There are therefore four groups of subjects.
Their scores on a final exam are shown below. Compute the 99%
confidence interval on the difference between the mean of the students
being taught by Method 1 and the mean of the students taught by
Method 2. Hint: do this using a linear combination of means. Do not
ignore the grouping based on ability.

Method 1 Method 2

Average   Gifted    Average   Gifted
67 87 3 9
56 78 4 8
55 86 5 8
61 90 3 8
67 77 5 9
56 78 3 9
68 81 4 9
53 91 3 8

1. Which is wider, a 95% or a 99% confidence interval?

99% is wider

2. When you construct a 95% confidence interval, what are you 95%
confident about?

You are 95% confident that the interval contains the parameter.
3. When computing a confidence interval, when do you use t and when
do you use z?

You use t when the standard error is estimated and z when it is known.
One exception is for a confidence interval for a proportion when z is
used even though the standard error is estimated.
4. Assume a researcher found that the correlation between a test he or
she developed and job performance was .5 in a study of 25
employees. If correlations under .30 are considered unacceptable,
would you have any reservations about using this test to screen job
applicants?

You should have reservations: since the lower limit of the 95%
confidence interval is 0.133, you cannot be confident that the true
correlation is at least 0.30.

5. What is the effect of sample size on the width of a confidence
interval?

The larger the sample size, the smaller the width of the confidence
interval.

6. How might an experimenter demonstrate that a drug has a negligible
effect?

By showing that all values in the confidence interval are negligible.

7. When is Fisher's z' used? Why is it necessary to use Fisher's z'?

It is used to construct a confidence interval on Pearson's
correlation. It is used because the sampling distribution of r is not
normal.

8. What assumptions are made in the construction of a confidence
interval on µ?

Normality of the parent population and that each observation is
sampled randomly and independently.
General Formula for Testing Hypotheses (1 of 3)

The formula shown below is used for testing hypotheses about a
parameter.

z = (statistic - hypothesized value)/(standard error of the statistic)

The "statistic" is an estimate of the parameter in question. The
"hypothesized value" is the value of the parameter specified in the
null hypothesis. The standard error of the statistic is assumed to be
known and the sampling distribution of the statistic is assumed to be
normal.

Consider an experiment designed to test the null hypothesis that
µ = 10. The test would be conducted with the following formula:

z = (M - 10)/σM

where M (the statistic) is the sample mean, 10 is the hypothesized
value of µ, and σM is the standard error of the mean. Once z is
determined, the probability value can be found using a z table. For
example, if M = 15 and σM = 2, then z would be (15 - 10)/2 = 2.5. The
two-tailed probability value would be 0.0124. The one-tailed
probability value would be 0.0124/2 = 0.0062.

General Formula for Testing Hypotheses (2 of 3)

The t rather than the z (normal) distribution is used if the standard
error has to be estimated from the data. The formula then becomes:

t = (statistic - hypothesized value)/(estimated standard error)

(The one exception to this rule you will encounter is that in tests of
differences between proportions, z is used even when the standard
error is estimated.) For this example, the formula is:

t = (M - 10)/sM

where sM is the estimated standard error of the mean. If M = 15 and
sM = 2, then t = 2.5. If N were 12 then the degrees of freedom would
be 11. A t table can be used to calculate that the two-tailed
probability value for a t of 2.5 with 11 df is 0.0295. The one-tailed
probability is 0.0148.
General Formula for Testing Hypotheses (3 of 3)

Next section: Tests of μ, σ known

Why the Formula Works


The formula is basically a method for calculating the area under the
normal curve. If the mean and standard deviation of the sampling
distribution of a statistic are known, it is easy to calculate the probability
of a statistic being greater than or less than a specific value. That is what
is done in hypothesis testing.

(See the section on the area under the sampling distribution of the mean
for an example.)

Next section: Tests of μ, σ known

Tests of μ, Standard Deviation Known (1 of 4)


This section explains how to compute a significance test for the mean
of a normally-distributed variable for which the population standard
deviation (σ) is known. In practice, the standard deviation is rarely
known. However, learning how to compute a significance test when the
standard deviation is known is an excellent introduction to how to
compute a significance test in the more realistic situation in which the
standard deviation has to be estimated.

1. The first step in hypothesis testing is to specify the null
hypothesis and the alternate hypothesis. In testing hypotheses about
µ, the null hypothesis is a hypothesized value of µ. Suppose the mean
score of all 10-year old children on an anxiety scale were 7. If a
researcher were interested in whether 10-year old children with
alcoholic parents had a different mean score on the anxiety scale,
then the null and alternative hypotheses would be:

H0: µalcoholic = 7
H1: µalcoholic ≠ 7
2. The second step is to choose a significance level. Assume the 0.05
level is chosen.

3. The third step is to compute the mean. Assume M = 8.1.

Tests of μ, Standard Deviation Known (2 of 4)


4. The fourth step is to compute p, the probability (or probability
value) of obtaining a difference between M and the hypothesized value
of µ (7.0) as large or larger than the difference obtained in the
experiment. Applying the general formula to this problem,

z = (M - µ)/σM

The sample size (N) and the population standard deviation (σ) are
needed to calculate σM. Assume that N = 16 and σ = 2.0. Then,

σM = σ/√N = 2/√16 = 0.5

and

z = (8.1 - 7.0)/0.5 = 2.2.

A z table can be used to compute the probability value, p = 0.028. A
graph of the sampling distribution of the mean is shown below. The
area 8.1 - 7.0 = 1.1 or more units from the mean (in both directions)
is shaded. The shaded area is 0.028 of the total area.
Tests of μ, Standard Deviation Known (3 of 4)

5. In step five, the probability computed in Step 4 is compared to the
significance level stated in Step 2. Since the probability value
(0.028) is less than the significance level (0.05), the effect is
statistically significant.

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the mean anxiety score of 10-year-old children
with alcoholic parents is higher than the population mean.

7. The results might be described in a report as follows:

The mean score of children of alcoholic parents (M = 8.1) was
significantly higher than the population mean (µ = 7.0), z = 2.2,
p = 0.028.

Summary of Computational Steps

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute M = ΣX/N.

3. Compute σM = σ/√N.

4. Compute z = (M - µ)/σM where M is the sample mean and µ is the
hypothesized value of the population mean.

5. Use a z table to determine p from z.

Tests of μ, Standard Deviation Known (4 of 4)

Next section: Tests of μ, σ estimated


Assumptions

1. Normal distribution

2. Scores are independent

3. σ is known.

See also: "Confidence interval for mean, σ known"

Tests of µ, Standard Deviation Estimated (1 of 4)

Rarely does a researcher wishing to test a hypothesis about a
population's mean already know the population's standard deviation
(σ). Therefore, testing hypotheses about means almost always involves
the estimation of σ. When σ is known, the formula:

z = (M - µ)/σM

is used for the significance test. When σ is not known,

sM = s/√N

is used as an estimate of σM (s is an estimate of the standard
deviation and N is the sample size). When σM is estimated by sM, the
significance test uses the t distribution instead of the normal
distribution.

Suppose a researcher wished to test whether the mean score of fifth
graders on a test of reading achievement in his or her city differed
from the national mean of 76. The researcher randomly sampled the
scores of 20 students. The scores are shown below.

72  69  98  87
78  76  78  66
85  97  84  86
88  76  79  82
82  91  69  74

Tests of µ, Standard Deviation Estimated (2 of 4)


1. The first step in hypothesis testing is to specify the null
hypothesis and an alternative hypothesis. When testing hypotheses
about µ, the null hypothesis is a hypothesized value of µ. In this
example, the null hypothesis is µ = 76. The alternative hypothesis is:
µ ≠ 76.

2. The second step is to choose a significance level. Assume the .05
level is chosen.

3. The third step is to compute the mean. For this example,
M = 80.85.

4. The fourth step is to compute p, the probability (or probability
value) of obtaining a difference between M and the hypothesized value
of µ (76) as large or larger than the difference obtained in the
experiment. Applying the general formula to this problem,

t = (M - µ)/sM = (80.85 - 76)/1.984 = 2.44

The estimated standard error of the mean (sM) was computed using the
formula:

sM = s/√N = 8.87/4.47 = 1.984

where s is the estimated standard deviation and N is the sample size.
The probability value for t can be determined using a t table. The
degrees of freedom for t is equal to the degrees of freedom for the
estimate of σM, which is N - 1 = 20 - 1 = 19. A t table can be used to
calculate that the two-tailed probability value of a t of 2.44 with 19
df is 0.025.

Tests of µ, Standard Deviation Estimated (3 of 4)

5. In the fifth step, the probability computed in Step 4 is compared
to the significance level stated in Step 2. Since the probability
value (0.025) is less than the significance level (0.05), the effect
is statistically significant.

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the mean reading achievement score of children in
the city in question is higher than the population mean.

7. A report of this experimental result might be as follows:

The mean reading-achievement score of fifth grade children in the
sample (M = 80.85) was significantly higher than the mean
reading-achievement score nationally (µ = 76), t(19) = 2.44,
p = 0.025.

The expression "t(19) = 2.44" means that a t test with 19 degrees of
freedom was equal to 2.44. The probability value is given by
"p = 0.025." Since it was not mentioned whether the test was one- or
two-tailed, a two-tailed test is assumed.
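SciPy packages this entire test as a single call. The sketch below
(Python, an added illustration using the 20 reading scores above)
reproduces the result:

import numpy as np
from scipy import stats

scores = np.array([72, 69, 98, 87, 78, 76, 78, 66, 85, 97,
                   84, 86, 88, 76, 79, 82, 82, 91, 69, 74])
t, p = stats.ttest_1samp(scores, popmean=76)   # two-tailed test of µ = 76
print(t, p)                                    # about 2.44 and 0.025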

Tests of μ, Standard Deviation Estimated (4 of 4)

Next section: μ1 -μ2, independent groups, σ estimated

Summary of Computations

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute M = ΣX/N.

3. Compute s, the estimated standard deviation.

4. Compute sM = s/√N.

5. Compute t = (M - µ)/sM where µ is the hypothesized value of the
population mean.

6. Compute df = N - 1.

7. Use a t table to compute p from t and df.

Assumptions

1. Normal distribution

2. Scores are independent

See also: "Confidence interval for mean, σ estimated"

Next section: μ1 -μ2, independent groups, σ estimated

Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (1 of 8)

This section explains how to test the difference between group means
for significance. The formulas are slightly simpler when the sample
sizes are equal. These formulas are given first.

Equal Sample Sizes

An experiment was conducted comparing the memory of expert and novice
chess players. The mean number of pieces correctly placed across
several chess positions was computed for each subject. (These data are
from the novices and tournament players in the dataset "Chess.") The
question is whether the difference between the means of these two
groups of subjects is statistically significant.

1. The first step is to specify the null hypothesis and an alternative
hypothesis. For experiments testing differences between means, the
null hypothesis is that the difference between means is some specified
value. Usually the null hypothesis is that the difference is zero. For
this example, the null and alternative hypotheses are:

H0: µ1 - µ2 = 0
H1: µ1 - µ2 ≠ 0
Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (2 of 8)

2. The second step is to choose a significance level. Assume the .05
level is chosen.

3. The third step is to compute the difference between sample means
(Md). In this example, MT = 63.89, MN = 46.79 and Md = MT - MN =
17.10.

4. The fourth step is to compute p, the probability (or probability
value) of obtaining a difference between Md and the value specified by
the null hypothesis (0) as large or larger than the difference
obtained in the experiment. Applying the general formula,

t = (Md - (µ1 - µ2))/sMd

where Md is the difference between sample means, µ1 - µ2 is the
difference between population means specified by the null hypothesis
(usually zero), and sMd is the estimated standard error of the
difference between means.
Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (3 of 8)

The estimated standard error, sMd, is computed assuming that the
variances in the two populations are equal. If the two sample sizes
are equal (n1 = n2) then the population variance σ² (it is the same in
both populations) is estimated by using the following formula:

MSE = (s1² + s2²)/2

where MSE (which stands for mean square error) is an estimate of σ².
Once MSE is calculated, sMd can be computed as follows:

sMd = √(2MSE/n)

where n = n1 = n2. This formula is derived from the formula for the
standard error of the difference between means when the variance is
known.
Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (4 of 8)

4. (continued) For the present example,

MSE = (81.54 + 244.03)/2 = 162.78

sMd = √((2)(162.78)/10) = 5.71

t = (17.10 - 0)/5.71 = 2.99

The probability value for t can be determined using a t table. The
degrees of freedom for t is equal to the degrees of freedom for MSE,
which is equal to df = (n1 - 1) + (n2 - 1) = 18 or, df = N - 2 where
N = n1 + n2. The probability is: p = 0.008.

5. In step 5, the probability computed in Step 4 is compared to the
significance level stated in Step 2. Since the probability value
(0.008) is less than the significance level (0.05), the effect is
significant.

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the mean memory score for experts is higher than
the mean memory score for novices.

7. A report of this experimental result might be as follows:

The mean number of pieces recalled by tournament players (MT = 63.89)
was significantly higher than the mean number of pieces recalled by
novices (MN = 46.79), t(18) = 2.99, p = 0.008.

The expression "t(18) = 2.99" means that a t test with 18 degrees of
freedom was equal to 2.99. The probability value is given by
"p = 0.008." Since it was not mentioned whether the test was one- or
two-tailed, it is assumed the test was two-tailed.

Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (5 of 8)

Unequal Sample Sizes

The calculations in Step 4 are slightly more complex when n1 ≠ n2. The
first difference is that MSE is computed differently. If the two
values of s² were simply averaged as they are for equal sample sizes,
then the estimate based on the smaller sample size would count as much
as the estimate based on the larger sample size. Instead the formula
for MSE is:

MSE = SSE/df

where df is the degrees of freedom (n1 - 1 + n2 - 1) and SSE is:

SSE = SSE1 + SSE2

SSE1 = Σ(X - M1)² where the X's are from the first group (sample) and
M1 is the mean of the first group. Similarly, SSE2 = Σ(X - M2)² where
the X's are from the second group and M2 is the mean of the second
group.
Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (6 of 8)

The formula:

sMd = √(2MSE/n)

cannot be used without modification since there is not one value of n
but two (n1 and n2). The solution is to use the harmonic mean of the
two sample sizes. The harmonic mean (nh) of n1 and n2 is:

nh = 2/(1/n1 + 1/n2)

The formula for the estimated standard error of the difference between
means becomes:

sMd = √(2MSE/nh)

The hypothetical data shown below are from an experimental group and a
control group.

Experimental   Control
     3            3
     5            4
     7            6
     7            7
     8

n1 = 5, n2 = 4, M1 = 6, M2 = 5,

SSE1 = (3-6)² + (5-6)² + (7-6)² + (7-6)² + (8-6)² = 16

SSE2 = (3-5)² + (4-5)² + (6-5)² + (7-5)² = 10

Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (7 of 8)

SSE = SSE1 + SSE2 = 16 + 10 = 26
df = n1 - 1 + n2 - 1 = 7
MSE = SSE/df = 26/7 = 3.71
nh = 2/(1/5 + 1/4) = 4.44
sMd = √((2)(3.71)/4.44) = 1.29
t = (6 - 5)/1.29 = 0.77

The p value associated with a t of 0.77 with 7 df is 0.47. Therefore,
the difference between groups is not significant.
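Note that √(2MSE/nh) is algebraically identical to the usual pooled
standard error √(MSE(1/n1 + 1/n2)), so the result can be checked
against SciPy's pooled-variance t test. A sketch (Python, an added
illustration):

import numpy as np
from scipy import stats

experimental = np.array([3, 5, 7, 7, 8])
control      = np.array([3, 4, 6, 7])

# equal_var=True requests the pooled-variance (homogeneity of variance) test
t, p = stats.ttest_ind(experimental, control, equal_var=True)
print(t, p)                                    # about 0.77 and 0.47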

Summary of Computations

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute Md = M1 - M2

3. Compute SSE1 = Σ(X - M1)² for Group 1 and SSE2 = Σ(X - M2)² for
Group 2

4. Compute SSE = SSE1 + SSE2

5. Compute df = N - 2 where N = n1 + n2

6. Compute MSE = SSE/df

7. Compute nh = 2/(1/n1 + 1/n2). (If the sample sizes are equal then
nh = n1 = n2.)

8. Compute sMd = √(2MSE/nh)

9. Compute t = (Md - (µ1 - µ2))/sMd where µ1 - µ2 is the difference
between population means specified by the null hypothesis (and is
usually 0).

10. Use a t table to compute p from t (step 9) and df (step 5).

Tests of Differences between Means, Independent Groups, Standard
Deviation Estimated (8 of 8)

Next section: Tests of differences between dependent means

Assumptions

1. The populations are normally distributed.

2. Variances in the two populations are equal.

3. Scores are independent (each subject provides only one score).

See also: Confidence interval on µ1 - µ2, independent groups, σ
estimated

Next section: Tests of differences between dependent means

Tests of Differences between Means, Dependent Means (1 of 4)

When the same subjects are tested in two experimental conditions,
scores in the two conditions are not independent because subjects who
score well in one condition tend to score well in the other condition.
This non-independence must be taken into account by the statistical
analysis. The t test used when the scores are not independent is
sometimes called a correlated t test and sometimes called a
related-pairs t test.

Consider an experimenter interested in whether the time it takes to
respond to a visual signal is different from the time it takes to
respond to an auditory signal. Ten subjects are tested with both the
visual signal and with the auditory signal. (To avoid confounding with
practice effects, half are in the auditory condition first and the
other half are in the visual task first.) The reaction times (in
milliseconds) of the ten subjects in the two conditions are shown
below.

Subject   Visual   Auditory
   1        420       380
   2        235       230
   3        280       300
   4        360       260
   5        305       295
   6        215       190
   7        200       200
   8        460       410
   9        345       330
  10        375       380

Tests of Differences between Means, Dependent Means (2 of 4)

The first step in testing the difference between means is to compute a
difference score for each subject. The difference score is simply the
difference between the two conditions. The difference scores are shown
below:

Subject   Visual   Auditory   Vis - Aud
   1        420       380         40
   2        235       230          5
   3        280       300        -20
   4        360       260        100
   5        305       295         10
   6        215       190         25
   7        200       200          0
   8        460       410         50
   9        345       330         15
  10        375       380         -5
Mean      319.5     297.5         22

Notice that the difference between the visual mean (319.5) and the
auditory mean (297.5) is the same as the mean of the difference scores
(22). This will always be the case: The difference between the means
will always equal the mean of the difference scores. The significance
test of the difference between means consists of determining whether
the mean of difference scores is significantly different from zero. If
it is, then the difference between means is significant. The procedure
for testing whether a mean is significantly different from zero is
given in the section "Tests of µ, standard deviation estimated."

Tests of Differences between Means, Dependent Means (3 of 4)

For this example, the mean of the difference scores is 22 and the
standard deviation of the difference scores is 34.42, so

sM = 34.42/√10 = 10.88

t = (22 - 0)/10.88 = 2.02

There are 10 - 1 = 9 degrees of freedom. (Note that the sample size is
the number of difference scores, not the total number of scores.) A t
table can be used to find that the probability value is 0.074.

An alternate formula for sM is:

sM = √((s1² + s2² - 2 r s1 s2)/n)

where s1 is the standard deviation of Condition 1, s2 is the standard
deviation of Condition 2, n is the number of subjects, and r is the
correlation between scores in Conditions 1 and 2. For the example
data, s1 = 87.764, s2 = 77.719, n = 10, and r = 0.9206 and therefore
sM = 10.88.

An example of a report of this result is shown below.

The mean time to respond to a visual stimulus (M = 319.5) was longer
than the mean time to respond to an auditory stimulus (M = 297.5).
However, this difference was not statistically significant, t(9) =
2.02, p = 0.074.
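The correlated t test is available directly as a related-pairs test.
The sketch below (Python with NumPy and SciPy, an added illustration)
uses the ten reaction times from the table:

import numpy as np
from scipy import stats

visual   = np.array([420, 235, 280, 360, 305, 215, 200, 460, 345, 375])
auditory = np.array([380, 230, 300, 260, 295, 190, 200, 410, 330, 380])

# Equivalent to a one-sample t test that the mean difference score is zero
t, p = stats.ttest_rel(visual, auditory)
print(t, p)                                    # about 2.02 and 0.074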

Tests of Differences between Means, Dependent Means (4 of 4)

Next section: Linear combination of means

Summary of Computations

1. Compute a difference score for each subject.

2. Test whether the mean difference score is significantly different
from zero. (The computational steps for this test are summarized in
the section "Tests of µ, standard deviation estimated.")

Assumptions

1. Each subject is sampled independently from each other subject.

2. The difference scores are normally distributed. (If both raw scores
are normally distributed then the difference score will be normally
distributed.) The two raw scores from each subject do not have to be
independent of each other.

Tests of Linear Combinations of Means, Independent Groups (1 of 7)

In a hypothetical experiment, aspirin, Tylenol, and a placebo were
tested to see how much pain relief each provides. Pain relief was
rated on a five-point scale. Four subjects were tested in each group
and their data are shown below:

Aspirin   Tylenol   Placebo
   3         2         2
   5         2         1
   3         4         3
   5         4         2

Before conducting the experiment, the researcher planned to make two
comparisons: (1) a comparison of the average of the aspirin and
Tylenol groups with the placebo control and (2) a comparison between
the aspirin and Tylenol groups. The comparisons can be framed in terms
of linear combinations of means: The first comparison is:

(μaspirin + μTylenol)/2 - μplacebo

In terms of coefficients:

(0.5)(μaspirin) + (0.5)(μTylenol) + (-1)(μplacebo)

Tests of Linear Combinations of Means, Independent Groups (2 of 7)

The steps for testing a linear contrast follow:

1. The first step is to specify the null hypothesis and an alternative
hypothesis. For experiments testing linear combinations of means, the
null hypothesis is: Σaiµi is equal to some specified value, usually
zero. In this experiment, a1 = 0.5, a2 = 0.5, and a3 = -1. The null
hypothesis is Σaiµi = 0 which can be written as:

a1µ1 + a2µ2 + a3µ3 = 0.

For this experiment, the null hypothesis is:
(0.5)(μaspirin) + (0.5)(μTylenol) + (-1)(μplacebo) = 0.

The alternative hypothesis is:
a1µ1 + a2µ2 + a3µ3 ≠ 0.

2. The second step is to choose a significance level. Assume the .05
level is chosen.

3. The third step is to compute the value of the linear combination
(L) based on the samples.

L = a1M1 + a2M2 + ... + aaMa

For these data,

L = (0.5)(4) + (0.5)(3) + (-1)(2) = 1.5.


Tests of Linear Combinations of Means, Independent Groups (3 of 7)

4. The fourth step is to calculate the probability of a value of L
this different or more different from the value specified in the null
hypothesis (zero). Applying the general formula, one obtains

t = (L - 0)/sL

where L is the value of the linear combination in the sample and sL is
the standard error of L. The formula for the standard error of L is
given below:

sL = √(Σai²MSE/n)

This formula is the same as the formula for σL except that MSE (an
estimate of σ²) is used in place of σ². The formula for MSE (and sL)
is the same as the formula used in the calculation of a confidence
interval on a linear combination of means:

MSE = Σsi²/a

where si² is the sample variance of the ith group and "a" is the
number of groups. This formula assumes homogeneity of variance and
that the "a" sample sizes are equal. A related formula is described
elsewhere that can be used with unequal sample sizes.

The degrees of freedom are equal to the sum of the degrees of freedom
in the a = 3 groups. Therefore, df = a(n-1) = 3(4-1) = 9. A t table
shows that the two-tailed probability value for a t of 2.33 with 9 df
is 0.045.

Tests of Linear Combinations of Means, Independent Groups (4 of 7)

4. (continued) For this example,

s1² = 1.334, s2² = 1.334, and s3² = 0.666 and, therefore,

MSE = (1.334 + 1.334 + 0.666)/3 = 1.111

The sample size, n, is 4 since there are four subjects in each group.
Therefore,

Σai² = 0.5² + 0.5² + (-1)² = 1.5

Plugging these values into the formula:

sL = √(Σai²MSE/n) = √((1.5)(1.111)/4) = 0.6455

results in:

t = 1.5/0.6455 = 2.33

Tests of Linear Combinations of Means, Independent Groups (5 of 7)

5. The probability computed in Step 4 is compared to the significance
level stated in Step 2. Since the probability value (0.045) is less
than the significance level (0.05), the effect is significant.

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the average relief provided by the two drugs is
greater than the relief provided by the placebo.

7. An example of how the results of the experiment could be described
in a report is shown below.

Both aspirin (M = 4) and Tylenol (M = 3) provided more pain relief
than the placebo (M = 2). A planned comparison of the mean of the two
experimental groups with the placebo control was significant, t(9) =
2.33, p = 0.045. The means of the two experimental groups were not
significantly different from each other, t(9) = 1.34, p = 0.213.

It is important to state that the combination of means was planned in
advance since a different procedure is used if the comparison is
decided on after viewing the data.

The expression "t(9) = 2.33, p = 0.045" means that a t test with 9
degrees of freedom was conducted and that the probability value was
0.045. The test for the difference between the two experimental groups
was conducted using the coefficients: a1 = 1, a2 = -1, a3 = 0.
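Both planned comparisons can be computed with one small helper
function. The sketch below (Python with NumPy and SciPy, an added
illustration; contrast_test is a hypothetical helper written for this
example) assumes equal n and homogeneity of variance, as the procedure
above does:

import numpy as np
from scipy import stats

aspirin = np.array([3, 5, 3, 5])
tylenol = np.array([2, 2, 4, 4])
placebo = np.array([2, 1, 3, 2])

def contrast_test(groups, coef):
    # t test of the planned contrast sum(a_i * mu_i) = 0, equal n per group
    n = len(groups[0])
    L = sum(a * g.mean() for a, g in zip(coef, groups))
    mse = np.mean([g.var(ddof=1) for g in groups])     # MSE = mean of the s² values
    s_L = np.sqrt(sum(a**2 for a in coef) * mse / n)
    df = len(groups) * (n - 1)
    t = L / s_L
    return t, 2 * stats.t.sf(abs(t), df)               # two-tailed p

print(contrast_test([aspirin, tylenol, placebo], [0.5, 0.5, -1]))  # t about 2.32, p about 0.045
print(contrast_test([aspirin, tylenol, placebo], [1, -1, 0]))      # t about 1.34, p about 0.213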
Tests of Linear Combinations of Means, Independent Groups (6 of 7)

Summary of Computations

1. Specify the null and alternate hypotheses. This is done by choosing
the coefficients (a's). The null hypothesis is Σaiµi = 0. The
alternate hypothesis is Σaiµi ≠ 0. (On rare occasions, the null
hypothesis will be that Σaiµi is some value other than zero.)

2. Compute L = ΣaiMi.

3. Compute MSE = Σsi²/a where "a" is the number of groups.

4. Compute sL = √(Σai²MSE/n) where n is the number of subjects in each
group (there must be equal n).

5. Compute t = L/sL.

6. Compute df = a(n-1) where "a" is the number of groups.

7. Use a t table to find the probability value of t.

Tests of Linear Combinations of Means, Independent Groups (7 of 7)

Next section: Pearson's correlation

Assumptions

1. All populations are normally distributed.

2. Variances in the "a" populations are equal.

3. Scores are independent (Each subject provides only one score).


4. There is an equal number of subjects in each group.

5. The comparison was planned in advance.

See also: "Confidence interval on linear combination of means,
independent groups"

Next section: Pearson's correlation

Tests of Pearson's Correlation (1 of 6)

The sampling distribution of Pearson's r is normal only if the
population correlation (ρ) equals zero; it is skewed if ρ is not equal
to 0. Therefore, different formulas are used to test the null
hypothesis that ρ = 0 and other null hypotheses.

Null Hypothesis: ρ = 0

A hypothetical experiment is conducted on the relationship between job
satisfaction and job performance. A sample of 100 employees rate their
own level of job satisfaction. This measure of job satisfaction is
correlated with supervisors' ratings of performance. The question is
whether there is a relationship between these two measures in the
population.

1. The first step is to specify the null hypothesis and an alternative
hypothesis. The null hypothesis is ρ = 0; the alternative hypothesis
is ρ ≠ 0.

2. The second step is to choose a significance level. Assume the 0.05
level is chosen.

3. The third step is to compute the sample value of Pearson's
correlation. In this experiment, r = 0.27.

Tests of Pearson's Correlation (2 of 6)

4. The fourth step is to compute p, the probability (or probability
value) of obtaining a difference between r and the value specified by
the null hypothesis (zero) as large or larger than the difference
obtained in the experiment. This can be calculated using the following
formula:

t = (r√(N - 2))/√(1 - r²)

The degrees of freedom are N - 2 where N is the sample size. For this
example,

t = (0.27)√(98)/√(1 - 0.27²) = 2.776

The df are N - 2 = 100 - 2 = 98. A t table can be used to find that
the two-tailed probability value for a t of 2.776 with 98 df is 0.007.

Tests of Pearson's Correlation (3 of 6)

5. The probability computed in Step 4 is compared to the significance
level stated in Step 2. Since the probability value (0.007) is less
than the significance level (0.05), the correlation is significant.

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the correlation between job satisfaction and job
performance is greater than zero.

7. A report of this finding might be as follows:

There was a small but significant relationship between job
satisfaction and job performance, r = .27, t(98) = 2.78, p = .007.

The expression "t(98) = 2.78" means that a t test with 98 degrees of
freedom was equal to 2.78. The probability value is given by
"p = .007." Since it was not mentioned whether the test was one- or
two-tailed, it is assumed the test was two-tailed. The relationship
between job satisfaction and job performance was described as "small."
Refer to the section on "scatterplots of example values of r" to see
what a relationship of that size looks like.

Tests of Pearson's Correlation (4 of 6)

Other Null Hypotheses

For other null hypotheses, the significance test of r is done using
Fisher's z' transformation. The formula used in the fourth step of
hypothesis testing is shown below:

z = (z'r - z'ρ)/σz'

where z'r is the value of r converted to z', z'ρ is the value of ρ
converted to z', and σz' is the standard error of z'. This standard
error can be computed as:

σz' = 1/√(N - 3)

Once z is computed, a z table is used to calculate the p value.

Tests of Pearson's Correlation (5 of 6)

Suppose a theory of intelligence predicts that the correlation between
intelligence and spatial ability is 0.50. A sample of 73 high school
students was given both a test of intelligence and spatial ability.
The correlation between the tests was 0.677. Is the value of 0.677
significantly higher than 0.50? The null hypothesis therefore is
ρ = 0.5.

The r to z' table can be used to convert both the r of 0.677 and the ρ
of 0.50 to Fisher's z'. The values of z' are 0.824 and 0.549
respectively. The standard error of z' is:

σz' = 1/√(73 - 3) = 0.1195

and therefore,

z = (0.824 - 0.549)/0.1195 = 2.30.

A z table shows that the two-tailed probability value for a z of 2.30
is 0.021. Therefore, the null hypothesis that the population
correlation is 0.50 can be rejected.
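Both forms of the test are short enough to verify directly. The sketch
below (Python with NumPy and SciPy, an added illustration) reproduces
the two examples:

import numpy as np
from scipy import stats

# Null hypothesis rho = 0 (job satisfaction example)
r, N = 0.27, 100
t = r * np.sqrt(N - 2) / np.sqrt(1 - r**2)       # about 2.78
print(2 * stats.t.sf(abs(t), N - 2))             # about 0.007

# Null hypothesis rho = 0.50 (spatial ability example), via Fisher's z'
r, rho, N = 0.677, 0.50, 73
z = (np.arctanh(r) - np.arctanh(rho)) * np.sqrt(N - 3)   # about 2.30
print(2 * stats.norm.sf(abs(z)))                         # about 0.021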

Summary of Computations

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute Pearson's r.

3. If the null hypothesis is ρ = 0 then compute t using the following
formula:

t = (r√(N - 2))/√(1 - r²)

4. Use a t table to compute p for t and N - 2 df.

Tests of Pearson's Correlation (6 of 6)

Next section: Differences between correlations

4. If the null hypothesis is that ρ is some specific value other than
zero, then compute

z = (z'r - z'ρ)/σz'

and use a z table to find p.

Assumptions

1. The N pairs of scores are sampled randomly and independently.


2. The distribution of the two variables is bivariate normal.

Tests of Differences between Independent Pearson Correlations (1 of 5)
A researcher was interested in whether the relationship between the
quantitative portion of the SAT (QSAT) and grades in the first year of
college (GPA) is different for engineering majors than it is for
humanities majors. One hundred engineering majors and 100
humanities majors were sampled and each student's QSAT and GPA
were recorded. The Pearson's correlation between these two measures
was 0.61 for the engineering majors and 0.35 for the humanities
majors. Is this difference in correlations significant?

1. The first step is to specify the null hypothesis and an alternative
hypothesis. The null hypothesis and alternative hypotheses for this
problem are:

H0: ρengineers = ρhumanities
H1: ρengineers ≠ ρhumanities

where ρengineers is the population correlation between QSAT and GPA
for engineering majors and ρhumanities is the population correlation
between QSAT and GPA for the humanities majors.

2. The second step is to choose a significance level. Assume the 0.05
level is chosen.
Tests of Differences between Independent Pearson Correlations (2 of 5)

3. The third step is to compute the sample correlations. In this
experiment, rengineers = 0.61 and rhumanities = 0.35.

4. The fourth step is to compute p, the probability (or probability
value). It is the probability of obtaining a difference between the
statistic

r1 - r2 (rengineers - rhumanities in this example)

and the value specified by the null hypothesis (zero) as large or
larger than the difference observed in the experiment. Since the
sampling distribution of Pearson's r is not normal, the sample
correlations are transformed to Fisher's z's.

The general formula applied to this problem is:

z = (z'1 - z'2)/σ(z'1 - z'2)

where z'1 is the first correlation transformed to z', z'2 is the
second correlation converted to z', and σ(z'1 - z'2) is the standard
error of the difference between z's and is equal to:

σ(z'1 - z'2) = √(1/(N1 - 3) + 1/(N2 - 3))

where N1 is the sample size for the first correlation and N2 is the
sample size for the second correlation.

Tests of Differences between Independent Pearson Correlations (3 of 5)

4. (continued) For the experiment on the correlation between QSAT and
GPA, the standard error of the difference between z's is:

σ(z'1 - z'2) = √(1/(100 - 3) + 1/(100 - 3)) = 0.1436

The correlations of 0.61 and 0.35 can be transformed using the r to z'
table to z's of 0.709 and 0.365 respectively. Therefore,

z = (0.709 - 0.365)/0.1436 = 2.40

A z table can be used to find that the two-tailed probability value
for a z of 2.40 is 0.016.

5. The probability computed in Step 4 is compared to the significance
level stated in Step 2. Since the probability value (0.016) is less
than the significance level (.05), the effect is significant.

Tests of Differences between Independent Pearson Correlations (4 of 5)

6. Since the effect is significant, the null hypothesis is rejected.
It is concluded that the correlation between QSAT and GPA is higher
for engineering majors than for humanities majors.

7. An example of a report of this finding is shown below.

The correlation between QSAT and GPA was significantly higher for the
engineering majors (r = 0.61) than for the humanities majors
(r = 0.35), z = 2.40, p = 0.016.

Summary of Computations

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute Pearson's r for both samples.

3. Use the r to z' table to convert the r's to z's.

4. Compute σ(z'1 - z'2) = √(1/(N1 - 3) + 1/(N2 - 3)) where N1 is the
sample size for the first correlation and N2 is the sample size for
the second correlation.

5. Compute z = (z'1 - z'2)/σ(z'1 - z'2).

6. Use a z table to find the probability value.

Tests of Differences between Independent Pearson Correlations (5 of 5)

Next section: Proportions

Assumptions

1. The two correlations are from two independent groups of subjects.

2. The N pairs of scores in each group are sampled randomly and
independently.

3. The distribution of the two variables is bivariate normal.

Tests of Proportions (1 of 4)

A manufacturer is interested in whether people can tell the difference
between a new formulation of a soft drink and the original
formulation. The new formulation is cheaper to produce so if people
cannot tell the difference, the new formulation will be manufactured.
A sample of 100 people is taken. Each person is given a taste of both
formulations and asked to identify the original. Sixty-two percent of
the subjects correctly identified the original formulation. Is this
proportion significantly different from 50%?

1. The first step in hypothesis testing is to specify the null
hypothesis and an alternative hypothesis. In testing proportions, the
null hypothesis is that π, the proportion in the population, is equal
to some specific value. In this example, the null hypothesis is that
π = 0.5. The alternate hypothesis is π ≠ 0.5.

2. The second step is to choose a significance level. Assume the 0.05


level is chosen.
3. The third step is to compute the difference between the sample
proportion (p) and the value of π specified in the null
hypothesis. In this example, p - π = 0.62 - 0.5 = 0.12.

Tests of Proportions (2 of 4)

4. The fourth step is to compute p, the probability (or probability value). It is the probability of obtaining a difference between the proportion and the value specified by the null hypothesis as large or larger than the difference observed in the experiment. The general formula for significance testing as applied to this problem depends on the direction of the difference. If p is greater than π then the formula is:

z = (p - π - 0.5/N)/σp

If p is less than π then the formula is:

z = (p - π + 0.5/N)/σp

Note that the correction always makes z smaller.

N is the sample size and σp is the standard error of a proportion. The formula for the standard error of a proportion is:

σp = √(π(1 - π)/N)

The term 0.5/N is the correction for continuity. For the present problem,

σp = √((0.5)(0.5)/100) = 0.05.

Therefore,

z = (0.62 - 0.5 - 0.005)/0.05 = 2.3.

A z table can be used to determine that the two-tailed probability value for a z of 2.3 is 0.0214.
Tests of Proportions (3 of 4)

5. The probability computed in Step 4 is compared to the significance


level stated in Step 2.
Since the probability value (0.0214) is less than the significance
level of 0.05, the effect is statistically significant.

6. Since the effect is significant, the null hypothesis is rejected.


It is concluded that the proportion of people choosing the
original formulation is greater than 0.50.

7. This result might be described in a report as follows:

The proportion of subjects choosing the original formulation


(0.62) was significantly greater than 0.50, z = 2.3, p = 0.021.
Apparently at least some people are able to distinguish between
the original formulation and the new formulation.

Tests of Proportions (4 of 4)
Next section: Differences between proportions

Summary of Computations

1. Specify the null hypothesis and an alternative hypothesis.

2. Compute the proportion in the sample.

3. Compute σp = √(π(1 - π)/N).

4. If p > π then compute z = (p - π - 0.5/N)/σp; otherwise, compute z = (p - π + 0.5/N)/σp.
5. Use a z table to compute the probability value from z.

Assumptions

1. Observations are sampled randomly and independently.

2. The adequacy of the normal approximation depends on the sample size (N) and π. Although there are no hard and fast rules, the following is a guide to needed sample size: if π is between 0.4 and 0.6 then an N of 10 is adequate. If π is as low as 0.2 or as high as 0.8 then N should be at least 25. For π as low as 0.1 or as high as 0.9, N should be at least 30. A conservative rule of thumb is that both Nπ and N(1 - π) should be greater than 10.
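
As a concrete illustration of the summary above, here is a minimal sketch in Python (scipy supplies the normal CDF); it reproduces the soft-drink example.

    import math
    from scipy.stats import norm

    def single_proportion_z(p_hat, pi, n):
        # Standard error of a proportion under the null hypothesis
        sigma_p = math.sqrt(pi * (1 - pi) / n)
        correction = 0.5 / n  # correction for continuity; always shrinks z
        if p_hat > pi:
            z = (p_hat - pi - correction) / sigma_p
        else:
            z = (p_hat - pi + correction) / sigma_p
        return z, 2 * norm.sf(abs(z))  # two-tailed probability value

    z, p = single_proportion_z(0.62, 0.50, 100)
    print(round(z, 2), round(p, 4))  # 2.3 0.0214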

Tests of Differences between Proportions (1 of 5)

An experiment is conducted investigating the long-term effects of early childhood intervention programs (such as Head Start). In one (hypothetical) experiment, the high-school dropout rate of the experimental group (which attended the early childhood program) and the control group (which did not) were compared. In the experimental group, 73 of 85 students graduated from high school. In the control group, only 43 of 82 students graduated. Is this difference statistically significant?

1. The first step in hypothesis testing is to specify the null hypothesis and an alternative hypothesis. When testing differences between proportions, the null hypothesis is that the two population proportions are equal. That is, the null hypothesis is:

H0: π1 = π2.

The alternative hypothesis is:

H1: π1 ≠ π2.

In this example, the null hypothesis is:

H0: πintervention - πno intervention = 0.
Tests of Differences between Proportions (2 of 5)

2. The second step is to choose a significance level. Assume the 0.05 level is chosen.

3. The third step is to compute the difference between the sample proportions. In this example, p1 - p2 = 73/85 - 43/82 = 0.8588 - 0.5244 = 0.3344.

4. The fourth step is to compute p, the probability (or probability value). It is the probability of obtaining a difference between the proportions as large or larger than the difference observed in the experiment. Applying the general formula to the problem of differences between proportions,

z = (p1 - p2)/σ(p1 - p2)

where p1 - p2 is the difference between sample proportions and σ(p1 - p2) is the estimated standard error of the difference between proportions. The formula for the estimated standard error is:

σ(p1 - p2) = √(p(1 - p)(1/n1 + 1/n2))

where p is a weighted average of p1 and p2, n1 is the number of subjects sampled from the first population, and n2 is the number of subjects sampled from the second population.

Tests of Differences between Proportions (3 of 5)

4. (continued) If n1 = n2 then p is simply the average of p1 and p2:

p = (p1 + p2)/2.

The two p values are averaged since the computations are based on the assumption that the null hypothesis is true. When the null hypothesis is true, both p1 and p2 are estimates of the same value of π. The best estimate of π is then the average of the two p's. Naturally, if one p is based on more observations than the other, it should be counted more heavily. The formula for p when n1 ≠ n2 is:

p = (n1p1 + n2p2)/(n1 + n2).

The computations for the example on early childhood intervention are:

p = ((85)(0.8588) + (82)(0.5244))/(85 + 82) = 116/167 = 0.6946

σ(p1 - p2) = √((0.6946)(0.3054)(1/85 + 1/82)) = 0.0713

Therefore, z = 0.3344/0.0713 = 4.69. A z table can be used to find that the two-tailed probability value for a z of 4.69 is less than 0.0001.

Tests of Differences between Proportions (4 of 5)

5. The probability computed in Step 4 is compared to the significance


level stated in Step 2.
Since the probability value (<0.0001) is less than the significance
level of 0.05, the effect is significant.

6. Since the effect is significant, the null hypothesis is rejected. The conclusion is that the probability of graduating from high school is greater for students who have participated in the early childhood intervention program than for students who have not.

7. The results could be described in a report as:

The proportion of students from the early-intervention group who graduated from high school was 0.86 whereas the proportion from the control group who graduated was only 0.52. The difference in proportions is significant, z = 4.69, p < 0.001.

Tests of Differences between Proportions (5 of 5)

Next chapter: Power

Summary of Computations

1. Compute p1 and p2.

2. Compute p = (n1p1 + n2p2)/(n1 + n2).

3. Compute σ(p1 - p2) = √(p(1 - p)(1/n1 + 1/n2)).

4. Compute z = (p1 - p2)/σ(p1 - p2).

5. Use a z table to compute the probability value from z. Note that the correction for continuity is not used in the test for differences between proportions.

Assumptions

1. The two proportions are independent.

2. For the normal approximation to be adequate, π should not be too close to 0 or to 1. Values between 0.10 and 0.90 allow the approximation to be adequate.

3. For the normal approximation to be adequate, there should be at least 10 subjects per group.
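
Again, the summary of computations translates directly into code; here is a sketch in Python that reproduces the early-intervention example.

    import math
    from scipy.stats import norm

    def two_proportion_z(x1, n1, x2, n2):
        p1, p2 = x1 / n1, x2 / n2
        p = (n1 * p1 + n2 * p2) / (n1 + n2)  # weighted average of p1 and p2
        # Estimated standard error of the difference between proportions
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        return z, 2 * norm.sf(abs(z))  # two-tailed probability value

    z, p = two_proportion_z(73, 85, 43, 82)
    print(round(z, 2), p < 0.0001)  # 4.69 True
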
Power (1 of 3)

Power is the probability of correctly rejecting a false null hypothesis. Power is therefore defined as 1 - β, where β is the Type II error probability. If the power of an experiment is low, then there is a good chance that the experiment will be inconclusive. That is why it is so important to consider power in the design of experiments. There are methods for estimating the power of an experiment before the experiment is conducted. If the power is too low, then the experiment can be redesigned by changing one of the factors that determine power.

Consider a hypothetical experiment designed to test whether rats brought up in an enriched environment can learn mazes faster than rats brought up in the typical laboratory environment (the control condition). Two groups of 12 rats each are tested. Although the experimenter does not know it, the population mean number of trials it takes to learn the maze is 20 for the enriched condition and 32 for the control condition. The null hypothesis that the enriched environment makes no difference is therefore false.
Power (2 of 3)

The question is, "What is the probability that the experimenter is


going to be able to demonstrate that the null hypothesis is false by
rejecting it at the .05 level?" This is the same thing as asking
"What is the power of the test?" Before the power of the test can be
determined, the standard deviation (σ) must be known. If σ = 10
then the power of the significance test is 0.80. (Click here to see
how to compute power.) This means that there is a 0.80 probability
that the experimenter will be able to reject the null hypothesis.
Since power = 0.80,  = 1-.80 = .20.

It is important to keep in mind that power is not about whether or


not the null hypothesis is true (It is assumed to be false). It is the
probability the data gathered in an experiment will be sufficient to
reject the null hypothesis. The experimenter does not know that the
null hypothesis is false. The experimenter asks the question: If the
null hypothesis is false with specified population means and
standard deviation, what is the probability that the data from the
experiment will be sufficient to reject the null hypothesis?

Power (3 of 3)
Next section: Factors affecting power: Introduction

If the experimenter discovers that the probability of rejecting the null


hypothesis is low (power is low) even if the null hypothesis is false to
the degree expected (or hoped for), then it is likely that the experiment
should be redesigned. Otherwise, considerable time and expense will go
into a
project that has a small chance of being conclusive even if the
theoretical ideas behind it are correct.

Factors Affecting Power: Introduction (1 of 1)

Next section: Size of difference between means

The factors affecting power will be illustrated in terms of a simple


example. Suppose an experimenter were interested in testing whether a
drug affects reaction time. Subjects are tested once after taking the drug
and once after taking a placebo (naturally, half of the subjects take the
drug first and half take the placebo first). The difference in reaction time
between the drug and placebo conditions is calculated for each subject.
The null hypothesis is that the population mean difference score is zero: H0: µdiff = 0, where µdiff is the population mean difference score:

µdiff = µdrug - µplacebo.

To simplify the calculations, assume that the standard deviation of the difference scores (σ) is known to the experimenter. The experimenter samples N subjects at random and computes the mean difference score (Mdiff). A significance test is conducted using the formula:

z = Mdiff/(σ/√N)

Subsequent sections will use this example to investigate the factors affecting power.

Factors Affecting Power: Size of the Difference between Population Means (1 of 3)

The size of the difference between population means is an important factor in determining power. Naturally, the more the means differ from each other, the easier it is to detect the difference. In the example, the difference between means, µdiff, is the population mean difference score. It represents the size of the drug effect. For instance, if there were no difference between the drug and the placebo, then µdiff would be zero and there would be no effect of the drug. If the drug slows people down and, as a result, increases reaction time, µdiff would be a positive number. The larger the effect of the drug, the larger the value of µdiff.

Assume that the standard deviation and sample size are: σ = 50 and N = 25. (Also assume the experimenter knew the value of σ in order to make the calculations easier. The same principles would apply if the standard deviation had to be estimated.) The null hypothesis would then be rejected at the 0.05 level if M were larger than 19.6 or less than -19.6. (Click here for calculations.) The sampling distributions of M for four values of µdiff (0, 10, 20, and 30) are shown on the next page. As you will see, the farther the value of µdiff is from zero, the smaller the Type II error rate (β) and therefore the larger the power.

Factors Affecting Power: Size of the Difference between Population Means (2 of 3)
In figure A, the null hypothesis that µdiff = 0 is true. The area shaded in
black represents Type I errors; the striped area represents correct
decisions not to reject the null hypothesis. The mean of the sampling
distribution is µdiff which equals 0.

In figures B-D, the null hypothesis is false. The mean of each distribution is µdiff, which is 10 for Figure B, 20 for Figure C, and 30 for Figure D. The cutoff for significance is 19.6 and is the same for all figures. The area to the right of 19.6 is shaded and represents power. The area below 19.6 is shaded and represents Type II errors (the area below -19.6 is negligible). As µdiff increases, power (the proportion of the distribution larger than 19.6) increases.

Factors Affecting Power: Size of the Difference between Population Means (3 of 3)

Next section: Significance level

The experimenter usually has little control over the size of the effect, although sometimes the experimenter can choose the levels of the independent variable to increase the size of the effect. For instance, the size of the effect may be greater when a 500 mg dose is compared to a placebo than when a 300 mg dose is compared. Naturally, researchers hope that the variables they manipulate have large effects. When they do not, the power of their experiments is generally low.

Factors Affecting Power: Significance Level (2 of 2)

Next section: Sample size

The power for the three values of µdiff at the two significance levels is shown below.

µdiff   0.05   0.01
10      0.17   0.06
20      0.52   0.28
30      0.85   0.66

It stands to reason that the power would be higher for the 0.05 level than for the 0.01 level. The cost of stronger protection against Type I errors is more Type II errors.
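
These power values can be computed directly from the sampling distribution of M. The sketch below (Python, using scipy's normal distribution) regenerates the table and can be reused for the sample-size and variance sections that follow.

    import math
    from scipy.stats import norm

    def power_two_tailed_z(mu_diff, sigma, n, alpha):
        sd_m = sigma / math.sqrt(n)              # standard error of M
        cutoff = norm.ppf(1 - alpha / 2) * sd_m  # reject if M is beyond +/- cutoff
        # Probability that M lands in either rejection region
        return norm.sf((cutoff - mu_diff) / sd_m) + norm.cdf((-cutoff - mu_diff) / sd_m)

    for mu_diff in (10, 20, 30):
        print(mu_diff,
              round(power_two_tailed_z(mu_diff, 50, 25, 0.05), 2),
              round(power_two_tailed_z(mu_diff, 50, 25, 0.01), 2))
    # 10 0.17 0.06
    # 20 0.52 0.28
    # 30 0.85 0.66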

Factors Affecting Power: Sample Size (1 of 2)

Increasing N, the sample size, decreases the denominator of the equation for z:

z = Mdiff/(σ/√N)

and therefore increases power.

The graphs below show the sampling distribution of M (for the example) when σ = 50 and µdiff = 20. The cutoff points for being significant at the 0.05 level were determined using the formulas:

upper cutoff = 1.96σ/√N

and

lower cutoff = -1.96σ/√N

Note how the cutoff point and the variance of the distribution change as N increases. The mean of the distribution does not change.

Factors Affecting Power: Sample Size (2 of 2)


Next section: Variance

Naturally, power increases as the sample size increases. The effect of


sample size on power for this example is seen in more detail in the
following graph. Choosing a sample size is a difficult decision. There
is a tradeoff between the gain in power and the time and cost of testing
a large number of subjects.
Factors Affecting Power: Variance (1 of 2)

The larger the variance (σ²), the lower the power. In the formula for z:

z = Mdiff/(σ/√N)

increasing σ² increases the denominator and therefore lowers z and power. For the example, σ is the standard deviation of the difference scores. The power of the test using the .05 significance level, for N = 25 and µdiff = 20, is shown for various values of σ in the table below. The table shows that the power decreases as σ increases.

σ     Power
50    0.52
75    0.26
100   0.17

There are ways that an experimenter can reduce variance to


increase power. One is to define a relatively homogeneous
population. For instance, if one were studying reading speed, one
could begin by studying the population of people in their first year
at a selective college rather than the population of all English-
speaking adults. The variance would be much reduced.
Factors Affecting Power: Variance (2 of 2)

Next section: Other factors


The cost would be that the results would not be as generalizable. A second way to reduce variance is by using a within-subjects design. In these designs, the overall level of performance of each person is subtracted out. This usually reduces the variance substantially.

Other Factors Affecting Power

Next section: Estimating power

Several other factors affect power. One factor is whether the population distribution is normal. Deviations from the assumption of normality usually lower power. A second factor affecting power is the type of statistical procedure used. Some of the distribution-free tests are less powerful than other tests when the distribution is normal but more powerful when the distribution is highly skewed. One-tailed tests are more powerful than two-tailed tests as long as the effect is in the expected direction. Otherwise, their power is essentially zero.

The choice of an experimental design can have a profound effect on power. Within-subjects designs are usually much more powerful than between-subjects designs.



Estimating Power (1 of 4)
The primary purpose of power analysis is to guide in the choice of
sample size. First, the experimenter specifies the power that he or she
wishes to achieve. Then, the sample size needed for that level of power
can be estimated. The calculations for power depend on the size of the
effect in the population. Therefore, the first and most difficult step in
choosing a sample size is to estimate the size of the effect. If there are
published experiments similar to the one to be conducted, then the
effects obtained in these published studies can be used as a guide to the
size of the effect. There is a need for caution, however, since there is a
tendency for published studies to contain overestimates of effect sizes.

Often previous studies are not sufficiently similar to a new study to


provide a valid basis for estimating the effect size. In this case, it is
possible to specify the minimum effect size that is considered
important. For example, an experimenter interested in the
effectiveness of a course that prepares students for the quantitative
portion of the SAT might consider a difference of 30 points between
the treatment group and the control group to be the minimum effect
size worth detecting.

Estimating Power (2 of 4)
Estimating the effect size includes estimating the population
variance as well as the population means. In the SAT example,
the experimenter would have to estimate the
variance of the SAT (it is about 10,000) before the calculations could
be done.

Frequently it is easiest to specify the effect size in terms of the number of standard deviations separating the population means. Thus, one might find it easier to estimate that the population mean for the experimental group is 0.5 standard deviations above the population mean for the control group than to estimate the two population means and the population variance. Fortunately, the power is the same for any experiment in which the difference between population means is the same number of population standard deviations. The power for an experiment in which µ1 = 10, µ2 = 20, and σ = 20 is the same as the power for an experiment in which µ1 = 5, µ2 = 7, and σ = 4. In both cases, the means are 0.5 standard deviations apart.

Estimating Power (3 of 4)

How much power is enough power? Naturally, the more power the
better. However, in some experiments it is very time consuming and
expensive to run each subject. In these experiments, the experimenter
usually must accept less power than is typically found in experiments in
which subjects can be run cheaply and easily. In any case, power below
.25 would almost always be considered too low and power above .80
would be considered satisfactory. Keep in mind that a Type II error is
not necessarily so bad since a failure to reject the null hypothesis does
not mean that the research hypothesis should be abandoned. If the
results are suggestive, further experiments should be conducted to see if
the existence of the effect can be confirmed. The power of the two
experiments taken together will be greater than the power of either one.

Estimating Power (4 of 4)

Next chapter: Introduction to between-subjects ANOVA

The computation of power when the experimenter is going to use a t test rather than a z test (as used in the section on factors affecting power for computational convenience) is much more complicated. These formulas for power are not given here. Instead, you should use a computer program to compute power, such as Russ Lenth's power calculation page.
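
As a starting point, here is a sketch in Python of the power computation for a two-sample, two-tailed t test, based on the noncentral t distribution (scipy). It is offered as an illustration, not as the text's own program; applied to Exercise 2 below, it gives approximately the answers listed at the end of the chapter.

    import math
    from scipy.stats import t, nct

    def power_two_sample_t(mu1, mu2, sigma, n1, n2, alpha=0.05):
        # Noncentrality parameter: the true mean difference in standard-error units
        nc = (mu1 - mu2) / (sigma * math.sqrt(1 / n1 + 1 / n2))
        df = n1 + n2 - 2
        t_crit = t.ppf(1 - alpha / 2, df)  # two-tailed cutoff
        # Power: probability the t statistic falls in a rejection region
        return nct.sf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)

    # Exercise 2: digit span of 7.8 (drug) vs 6.5 (placebo), sigma = 1.5, 10 per group
    print(round(power_two_sample_t(7.8, 6.5, 1.5, 10, 10, alpha=0.05), 2))  # ~0.45
    print(round(power_two_sample_t(7.8, 6.5, 1.5, 10, 10, alpha=0.01), 2))  # ~0.21
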
Chapter 11

1. State the effect on the probability of a Type I and of a Type II error of:
(a) the difference between population means
(b) the variance
(c) the sample size
(d) the significance level

2. Suppose a new drug is discovered that improves digit span (the number of digits that can be remembered). The population mean digit span of people not taking the drug is 6.5, the population mean digit span of people taking the drug is 7.8, and the standard deviation in both populations is 1.5.
An experiment is conducted to establish the effectiveness of the drug.
Twenty subjects are chosen at random and are divided randomly into
two groups of 10. One group is given the new drug and the other group
is given a placebo. A two-tailed t-test is conducted to test the null
hypothesis that the drug has no effect. What is the probability that the
test will be significant at the .05 level? What is the probability it will
be significant at the .01 level?

3. A standardized achievement test has been known for years to have a population mean of 450. A researcher suspects that the mean has now dropped to 400. Assume the researcher is correct and that the variance of the test is 10,000. What sample size would be required so that a two-tailed test at the .05 level would have a .80 probability of showing that the population mean is not 450? What sample size would be necessary for power of .80 if a one-tailed test (using the .05 level) were used?

4. For the experiment described in Problem 3, what would the power (using the .05 level) have been for a sample size of 25 if the experimenter had known that the population variance was 10,000 and therefore conducted a z test instead of a t test?

5. Two population means differ by 1.5 standard deviations. If nine


subjects are sampled from each population, what is the chance that the
means will be significantly different at the .01 level?

6. What is the most difficult step in estimating power?

7. Why do published experiments tend to contain overestimates of effect


sizes?

8. An experiment is conducted in which 15 subjects are each tested in


each of two conditions. The population mean of each condition, the
sample size, and the variance of each condition are known. The scores
are normally distributed. There is one piece of information missing
without which power cannot be calculated. What is it?

Selected Answers
Chapter 11

2. 0.05 level: Power = 0.45
   0.01 level: Power = 0.21

3. two-tailed: 33 subjects
   one-tailed: 26 subjects

4. 0.71
Introduction to Between-Subjects Analysis of Variance: Preliminaries (1 of 4)

Analysis of variance (ANOVA) is used to test hypotheses about


differences between two or more means. The t-test based on the
standard error of the difference between two means can only be used to
test differences between two means. When there are more than two
means, it is possible
to compare each mean with each other mean using t-tests. However,
conducting multiple t-tests can lead to severe inflation of the Type I
error rate. (Click here to see why) Analysis of variance can be used to
test differences among several means for significance without
increasing the Type I error rate. This chapter covers designs with
between-subject variables. The next chapter covers designs with within-
subject variables.

Consider a hypothetical experiment on the effect of the intensity of


distracting background noise on reading comprehension. Subjects were
randomly assigned to one of three groups. Subjects in Group 1 were
given 30 minutes to read a story without any background noise.
Subjects in Group
2 read the story with moderate background noise, and subjects in
Group 3 read the story in the presence of loud background noise.

Introduction to Between-Subjects Analysis of Variance: Preliminaries (2 of 4)

The first question the experimenter was interested in was whether


background noise has any effect at all. That is, whether the null
hypothesis: µ1 = µ2 = µ3 is true where µ1 is the population mean for
the "no noise" condition, µ2 is the population mean for the "moderate
noise" condition, and µ3 is the population mean for the "loud noise"
condition. The experimental design therefore has one factor (noise
intensity) and this factor has three levels: no noise, moderate noise, and
loud noise.

Analysis of variance can be used to provide a significance test of the


null hypothesis that these three population means are equal. If the test
is significant, then the null hypothesis can be rejected and it can be
concluded that background noise has an effect.
In a one-factor between- subjects ANOVA, the letter "a" is used to
indicate the number of levels of the factor (a = 3 for the noise intensity
example). The number of subjects assigned to condition 1 is designated
as n1, the number of subjects assigned to condition 2 is designated by
n2, etc.

Introduction to Between-Subjects Analysis of Variance: Preliminaries (3 of 4)
If the sample size is the same for all of the treatment groups, then the
letter "n" (without a subscript) is used to indicate the number of
subjects in each group. The total number of subjects across all groups is
indicated by "N." If the sample sizes are equal, then N = (a)(n);
otherwise,

N = n1 + n2 + ... + na.

Some experiments have more than one between-subjects factor. For instance, consider a hypothetical experiment in which two age groups (8-year olds and 12-year olds) are asked to perform a task either with or without distracting background noise. The two factors are age and distraction.

Assumptions
Analysis of variance assumes normal distributions and homogeneity
of variance. Therefore, in a one-factor ANOVA, it is assumed that each
of the populations is normally distributed with the same variance (σ²).
In between-subjects analyses, it is assumed that each score is sampled
randomly and independently. Research has shown that ANOVA is
"robust" to violations of its assumptions.
Introduction to Between-Subjects Analysis of Variance: Preliminaries (4 of 4)

Next section: Two estimates of variance

This means that the probability values computed in an ANOVA are


satisfactorily accurate even if the assumptions are violated. Moreover,
ANOVA tends to be conservative when its assumptions are violated.
This means that although power is decreased, the probability of a Type
I error is as low or lower than it would be if its assumptions were met.
There are exceptions to this rule. For example, a combination of unequal
sample sizes and a violation of the assumption of homogeneity of
variance can lead to an inflated Type I error rate.

Two Estimates of Variance (1 of 5)

Analysis of variance tests the null hypothesis that all the


population means are equal: H0: µ1 = µ2 = ... = µa
by comparing two estimates of variance (σ²). (Recall that σ² is the
variance within each of the "a"
treatment populations.) One estimate (called the Mean Square Error or
"MSE" for short) is based on the variances within the samples. The
MSE is an estimate of σ² whether or not the null hypothesis is true. The
second estimate (Mean Square Between or "MSB" for short) is based on
the variance of the sample means. The MSB is only an estimate of σ² if
the null hypothesis is true. If the null hypothesis is false then MSB
estimates something larger than σ². (A later section discusses exactly
what MSB estimates when the null hypothesis is false.) The logic by
which
analysis of variance tests the null hypothesis is as follows: If the null
hypothesis is true, then MSE and MSB should be about the same since
they are both estimates of the same quantity (σ²); however, if the null
hypothesis is false then MSB can be expected to be larger than MSE
since MSB is estimating a quantity larger than σ².

Two Estimates of Variance (2 of 5)

Therefore, if MSB is sufficiently larger than MSE, the null hypothesis


can be rejected. If MSB is not sufficiently larger than MSE then the null
hypothesis cannot be rejected. How much larger is "sufficiently" larger?
That is determined by the statistical analysis explained in a later
section.

How σ² is Estimated by MSE


The estimation of σ² by MSE in the analysis of variance is exactly the
same as the estimation of σ² by MSE in the construction of confidence
intervals on a linear combination of means and in the testing of linear
combinations of means. It is very similar to the estimation of σ² by
MSE in the construction of confidence intervals on the difference
between means and in the testing of differences between means. The
only difference is that there are "a" groups instead of two groups.
For simplicity, assume that the sample sizes in each of the "a" samples
are equal. The variance (σ²) is estimated by s² in each of the samples.
The average of these "a" estimates of σ² is then used as an overall
estimate of σ².

Two Estimates of Variance (3 of 5)

Therefore,

MSE = (s1² + s2² + ... + sa²)/a

(Click here for the formula for unequal sample sizes.) The example data from the section on testing of linear combinations of means are reproduced here.

Aspirin  Tylenol  Placebo
3        2        2
5        2        1
3        4        3
5        4        2

The means of the Aspirin, Tylenol, and Placebo conditions are 4, 3, and 2 respectively. The sample variances are 1.334, 1.334, and 0.666. The MSE is the mean of the three sample variances and is therefore:

MSE = (1.334 + 1.334 + 0.666)/3 = 1.111.

How σ² is Estimated by MSB

Estimating σ² using the sample means is slightly more complicated than using the sample variances. The first step is to use the sample means to estimate the variance of the sampling distribution of the mean (σM²). In an experiment there are "a" means. The variance of these "a" means is used as the estimate of σM². For the example, the three sample means are: 4, 3, and 2.

Two Estimates of Variance (4 of 5)

Recall that the formula for s² is:

s² = Σ(X - M)²/(N - 1).

In this application, each "X" will be a sample mean and M will be the mean of the sample means. Since there are "a" means, N = a. Letting M designate the mean of the sample means,

M = (4 + 3 + 2)/3 = 3.

The variance of the sample means, sM², is the estimate of σM². For the example,

sM² = [(4 - 3)² + (3 - 3)² + (2 - 3)²]/(3 - 1) = 1.

This means that the estimate of σM² is 1. But what is really needed is not an estimate of σM² but an estimate of σ². Fortunately, there is a simple relationship between σ² and σM². Since the sampling distribution of the mean has a standard deviation of σ/√n (click here for details), the sampling distribution of the mean has a variance of:

σM² = σ²/n

where n is the number of scores each mean is based upon.

Two Estimates of Variance (5 of 5)

Next section: Significance test

Rearranging the equation,

σ² = nσM².

Therefore, to go from an estimate of σM² to an estimate of σ², you multiply the estimate of σM² by n. For this example, n = 4, so the estimate of σ² is: MSB = (4)(1) = 4. In short,

MSB = nsM²

where n is the number of subjects in each group and sM² is the variance of the sample means. The significance tests associated with analysis of variance are based on the ratio of MSB to MSE. If the ratio is large enough, then the null hypothesis that the population means are equal can be rejected. The next section discusses the distribution of MSB/MSE and how to conduct a significance test.

The Significance Test in ANOVA (1 of 2)

If the null hypothesis is true, then both MSB and MSE estimate the
same quantity. If the null hypothesis is false, then MSB is an estimate of
a larger quantity (click here to see what it is). The significance test
involves the statistic F which is the ratio of MSB to MSE: F =
MSB/MSE. If the null hypothesis is true, then the F ratio should be
approximately one since MSB and MSE should be about the same. If the
ratio is much larger than one, then it is likely that MSB is estimating a
larger quantity than is MSE and that the null hypothesis is false. In order
to conduct a
significance test, it is necessary to know the sampling distribution of F
given that the null hypothesis is true. From the sampling distribution, the
probability of obtaining an F as large or larger than the one calculated
from the data can be determined. This probability is the probability
value. If it is lower than the significance level, then the null hypothesis
can be rejected. The
mathematics of the sampling distribution were worked out by the
statistician R. A. Fisher and is called the F distribution in his honor.
(Click here for information about the F distribution.)

The Significance Test in ANOVA (2 of 2)

Next section: Partitioning the sums of squares

After an F is computed, the probability value can be computed from an


F table. To use this table, you need to know the two degrees of freedom
parameters: dfn and dfd:

dfn = a - 1
dfd = N - a

where a is the number of groups and N is the total number of subjects in all groups. The parameter dfd is often called degrees of freedom error, or dfe for short. For the example data, dfn = 3 - 1 = 2, dfd = 12 - 3 = 9, MSB = 4, and MSE = 1.111. Therefore,

F = MSB/MSE = 4/1.111 = 3.6.

An F table can be used to compute that the probability value for an F of 3.6 with 2 and 9 df is 0.071. Therefore the null hypothesis cannot be rejected at the 0.05 level.
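
Instead of an F table, the probability value can be read off the F distribution's survival function; for instance, in Python with scipy:

    from scipy.stats import f

    # Probability of an F of 3.6 or larger with dfn = 2 and dfd = 9
    print(round(f.sf(3.6, 2, 9), 3))  # 0.071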

Partitioning the Sums of Squares (1 of 7)

Data from a hypothetical experiment on pain relief described earlier are reproduced below.

Aspirin  Tylenol  Placebo
3        2        2
5        2        1
3        4        3
5        4        2

The experiment consists of one factor (Drug condition) which has


three levels (Aspirin, Tylenol, and Placebo). There are four subjects in
each of the three groups making a total of N = 12 subjects. The scores
for these 12 subjects obviously differ from each other: They range from
two to five. There are two sources of these differences among scores (1)
treatment effects and (2) error.

Treatment Effects
It is possible that some of the differences among the 12 scores obtained
in the experiment are due to the fact that different subjects were given
different drugs. If the drugs differ in their
effectiveness, then subjects receiving the more effective drugs will
have the higher pain-relief scores. The more different the effectiveness
of the drugs, the greater will be the differences among the scores. Since
the different drugs represent different ways the subjects were "treated,"
differences due to differences in drug effectiveness are said to be due to
"treatment effects."

Partitioning the Sums of Squares (2 of 7)

Error
All other differences among the subjects' scores are said to be due to
"error." Do not take the word "error" literally, however. It does not mean
that a mistake was made. For historical reasons, any differences among
subjects that cannot be explained by the experimental treatments are
called error. It would be better, perhaps, to call this source of differences
"unexplained
variation," but to be consistent with the terminology in common usage,
these differences will be called "error."

A major source of error is the simple fact that even if subjects are
treated exactly the same way in an experiment, they differ from each
other, often greatly. This stands to reason since subjects differ in both
their pre-experimental experiences and in their genetics. In terms of
this example, it means that subjects given the same drug will not
necessarily experience the same degree of pain relief. Indeed, there are
differences among subjects within each of the three groups.

Partitioning the Sums of Squares (3 of 7)

Total Sum of Squares


The variation among all the subjects in an experiment is measured
by what is called sum of squares total or SST. SST is the sum of the
squared differences of each score from the
mean of all the scores. Letting GM (standing for "grand mean")
represent the mean of all scores, then

SST = Σ(X - GM)²

where GM = ΣX/N and N is the total number of subjects in the experiment.

For the example data:

N = 12
GM = (3+5+3+5+2+2+4+4+2+1+3+2)/12 = 3
SST = (3-3)²+(5-3)²+(3-3)²+(5-3)² + (2-3)²+(2-3)²+(4-3)²+(4-3)² + (2-3)²+(1-3)²+(3-3)²+(2-3)² = 18

Sum of Squares Between Groups
The sum of squares due to differences between groups (SSB) is computed according to the following formula:

SSB = Σni(Mi - GM)²

where ni is the sample size of the ith group, Mi is the mean of the ith group, and GM is the mean of all scores in all groups.
Partitioning the Sums of Squares (4 of 7)

If the sample sizes are equal then the formula can be simplified somewhat:

SSB = nΣ(Mi - GM)²

For the example data,

M1 = (3+5+3+5)/4 = 4
M2 = (2+4+2+4)/4 = 3
M3 = (2+1+3+2)/4 = 2
GM = 3
n = 4
SSB = 4[(4-3)² + (3-3)² + (2-3)²] = 8

Sum of Squares Error


The sum of squares error is the sum of the squared differences between
the individual scores and their group means. The formula for sum of
squares error (SSE) for designs with two groups has already been given
in the section on confidence interval on the difference between two
independent means and in testing differences between two independent
means.

The SSE is computed separately for each of the groups in the experiment and then summed:

SSE = SSE1 + SSE2 + ... + SSEa

SSE1 = Σ(X - M1)²; SSE2 = Σ(X - M2)²; ... ; SSEa = Σ(X - Ma)²

where M1 is the mean of Group 1, M2 is the mean of Group 2, and Ma is the mean of Group a.

Partitioning the Sums of Squares (5 of 7)

For the example,

SSE1 = (3-4)² + (5-4)² + (3-4)² + (5-4)² = 4

SSE2 = (2-3)² + (4-3)² + (2-3)² + (4-3)² = 4

SSE3 = (2-2)² + (1-2)² + (3-2)² + (2-2)² = 2

Therefore,
SSE = SSE1 + SSE2 + SSE3 = 4 + 4 + 2 = 10.

They All Add Up

The sums of squares computed in this example are:
SST = 18
SSB = 8
SSE = 10.
Notice that SST = SSB + SSE. This is important because it shows that the total sum of squares can be divided into two components: the sum of squares due to treatments (SSB) and the sum of squares that is not due to treatments (SSE).

Partitioning the Sums of Squares (6 of 7)


The ANOVA Summary Table
It is convenient to view the results of an analysis of variance in a summary table. The summary table for the example data is shown below.

Source   df   Ssq     Ms      F      p
Groups   2    8.00    4.000   3.60   0.071
Error    9    10.00   1.111
Total    11   18.00   1.636

The first column contains the source of variation (or source for short). There are two sources of variation: differences among treatment groups and error. The second column contains the degrees of freedom (df). The df for groups is always equal to the number of groups minus one. The df for error is equal to the number of subjects minus the number of groups. The formulas for the degrees of freedom are shown below:

dfgroups = a - 1 = 3 - 1 = 2
dferror = N - a = 12 - 3 = 9
dftotal = N - 1 = 12 - 1 = 11

a is the number of groups
N is the total number of subjects.

dftotal = dfgroups + dferror
Partitioning the Sums of Squares (7 of 7)

Next section: Computational methods

Source   df   Ssq     Ms      F      p
Groups   2    8.00    4.000   3.60   0.071
Error    9    10.00   1.111
Total    11   18.00   1.636

The third column contains the sums of squares. Notice that the sum of squares total is equal to the sum of squares groups + sum of squares error. The fourth column contains the mean squares. Mean squares are estimates of variance and are computed by dividing the sum of squares by the degrees of freedom. The mean square for groups (4.00) was computed by dividing the sum of squares for groups (8.00) by the degrees of freedom for groups (2). The fifth column contains the F ratio. The F ratio is computed by dividing the mean square for groups by the mean square for error. In this example,

F = 4.000/1.111 = 3.60.

There is no F ratio for error or total. The last column contains the probability value. It is the probability of obtaining an F as large or larger than the one computed in the data assuming that the null hypothesis is true. It can be computed from an F table. The df for groups (2) is used as the degrees of freedom in the numerator and the df for error (9) is used as the degrees of freedom in the denominator. The probability of an F with 2 and 9 df as large or larger than 3.60 is 0.071.
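
The entire partition can be checked in a few lines of Python (scipy is used only for the probability value); the sketch below reproduces the summary table for the pain-relief data.

    from scipy.stats import f

    groups = [[3, 5, 3, 5], [2, 2, 4, 4], [2, 1, 3, 2]]  # Aspirin, Tylenol, Placebo
    scores = [x for g in groups for x in g]
    N, a = len(scores), len(groups)
    gm = sum(scores) / N  # grand mean

    sst = sum((x - gm) ** 2 for x in scores)  # total sum of squares
    ssb = sum(len(g) * (sum(g) / len(g) - gm) ** 2 for g in groups)  # between groups
    sse = sst - ssb  # error; "they all add up"

    msb, mse = ssb / (a - 1), sse / (N - a)
    F = msb / mse
    print(sst, ssb, sse, round(F, 2), round(f.sf(F, a - 1, N - a), 3))
    # 18.0 8.0 10.0 3.6 0.071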

Computational Methods (1 of 2)

The formulas for sums of squares given in the section on partitioning the variance are not the most efficient computationally. They were chosen because they help to convey the conceptual basis of analysis of variance. This section provides computational formulas. Most likely you will not often have to use these formulas since analysis of variance is usually done by computer. Nonetheless, if you ever need to do an ANOVA with a hand calculator, these formulas may help. Data from a hypothetical experiment on pain relief described earlier are reproduced below.

Aspirin  Tylenol  Placebo
3        2        2
5        2        1
3        4        3
5        4        2

The sum of squares total (SST) can be calculated as:

SST = ΣX² - CF

where

CF = (ΣX)²/N

and N is the total number of subjects.

For these data: CF = (3+5+3+5+2+2+4+4+2+1+3+2)²/12 = 36²/12 = 108.

Computational Methods (2 of 2)

Next section: Introduction to tests supplementing ANOVA

SST = 3²+5²+3²+5²+2²+2²+4²+4²+2²+1²+3²+2² - CF = 126 - 108 = 18

The sum of squares between (SSB) can be calculated as:

SSB = T1²/n1 + T2²/n2 + ... + Ta²/na - CF

where T1² is the total of the scores in group 1 squared, T2² is the total of the scores in group 2 squared, etc., and n1 is the sample size for group 1, n2 is the sample size for group 2, etc. For the example data,

SSB = 256/4 + 144/4 + 64/4 - 108 = 8

Since SST = SSB + SSE,

SSE = SST - SSB = 18 - 8 = 10

With equal sample sizes, the formula for SSB can be simplified somewhat:

SSB = (T1² + T2² + ... + Ta²)/n - CF

which in this case is:

SSB = (256 + 144 + 64)/4 - 108 = 8
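
When the analysis really is done by computer, the whole table comes from one call; for example, with scipy:

    from scipy.stats import f_oneway

    result = f_oneway([3, 5, 3, 5], [2, 2, 4, 4], [2, 1, 3, 2])
    print(round(result.statistic, 2), round(result.pvalue, 3))  # 3.6 0.071
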
Introduction to Tests Supplementing a One-factor Between-Subjects ANOVA (1 of 2)

The null hypothesis in a one-factor between-subjects ANOVA is that


all the population means are equal:

H0: µ1 = µ2 = ... = µa.

Unfortunately, when the analysis of variance is significant and the null


hypothesis is rejected, the only valid inference that can be made is that at
least one population mean is different from at
least one other population mean. The analysis of variance does not reveal
which population
means differ from which others. Experimenters usually are interested in
more information. They want to know precisely where the differences
lie. Consequently, further analyses are usually conducted after a
significant analysis of variance. These further analyses almost always
involve conducting a series of significance tests. This causes a very
serious problem: the more significance tests that are conducted, the
greater the chance that at least one of them will produce a Type I error.
The probability that a single significance test will result in a Type I error
is called the per-comparison error rate (PCER).

Introduction to Tests Supplementing a One-factor Between-Subjects ANOVA (2 of 2)

Next section: Introduction to pairwise comparisons

The probability that at least one of the tests will result in a Type I error
is called the experimentwise error rate (EER). Statisticians differ in
their views of how strictly the EER must be controlled. Some statistical
procedures provide strict control over the EER whereas others control it
to a lesser extent. Naturally there is a tradeoff between the Type I and
Type II error rates. The more strictly the EER is controlled, the lower
the power of the significance tests.

The remaining sections in this chapter discuss tests used to


supplement the finding of a significant ANOVA. Pay particular
attention to the tradeoffs among three conflicting goals:

1. to extract all information from the data that is meaningful,


2. to control the EER, and
3. to achieve a high level of power.
All Pairwise Comparisons among Means: Introduction (1 of 1)

Next section: All pairwise t tests

Treatment conditions are normally included in an experiment to


see how they compare to other treatment conditions. Therefore,
experimenters frequently wish to compare the mean of each
treatment condition with the mean of each other treatment
condition. The problem with performing comparisons of all pairs
of means (pairwise comparisons) is
that there can be quite a large number of comparisons and
therefore the danger of a high EER. The figure below shows the
number of possible pairwise comparisons as a function of the
number of conditions in the experiment. If "a" is the number of
conditions, then (a)(a-1)/2 pairwise comparisons among condition
means are possible. Therefore there are (a)(a-1)/2 chances to make
a Type I error.
All Pairwise Comparisons among Means: All t-tests

Next section: Fisher's LSD

If there were no need to be concerned about the EER, then, instead


of computing an analysis of variance, one could simply compute t-
tests among all pairs of means. However, the effect on the EER is
not trivial. For example, consider an experiment conducted with
eight subjects in each of six treatment conditions. If the null
hypothesis were true and all 15 t-tests were conducted using the
0.01 significance level, then the probability that at least one of the
15 tests would result in a Type I error is 0.10. Thus, the EER of
0.10 would be 10 times as high as the PCER of 0.01. Because
computing t-tests
on all pairs of means results in such a high EER, it is
generally not considered an acceptable approach.

All Pairwise Comparisons among Means: Fisher's LSD Procedure (1 of 2)

An approach suggested by the statistician R. A. Fisher (called the


"least significant difference method" or Fisher's LSD) is to first test
the null hypothesis that all the population means are equal (the
omnibus null hypothesis) with an analysis of variance. If the analysis
of variance is not significant, then neither the omnibus null hypothesis
nor any other null hypothesis about differences among means can be
rejected. If the analysis of variance is significant, then each mean is
compared with each other mean using a t-test. The advantage of this
approach is that there is some control over the EER. If the omnibus
null hypothesis is true, then the EER is equal to whatever
significance level was used in the analysis of variance. In the example
with the six groups of subjects given in the section on t-tests, if the
0.01 level were used in the analysis of variance, then the EER would
be 0.01. The problem with this approach is that it can lead to a high
EER if most population means are equal but one or two are different.
All Pairwise Comparisons among Means: Fisher's LSD Procedure (2 of 2)

Next section: Tukey's HSD

In the example, if a seventh treatment condition were included


and the population mean for the seventh condition were very
different from the other six population means, an analysis of
variance would be likely to reject the omnibus null hypothesis.
So far, so good, since the omnibus null hypothesis is false.
However, the probability of a Type I error in one or more of the
15 t-tests computed among the six treatments with equal
population means is about 0.10. Therefore, the LSD method
provides only minimal protection against a high EER.

Homogeneity of variance is typically assumed for Fisher's LSD procedure. Therefore, MSE, the estimate of variance, is based on all the data, not just on the data for the two groups being compared. In order to make the relationship between Fisher's LSD and other methods of computing pairwise comparisons clear, the formula for the studentized t (ts) rather than the usual formula for t is used. This makes no difference in the outcome since, for Fisher's LSD procedure, the critical value of t is computed as if there were only two means in the experiment, a situation in which t and ts result in identical probability values. Although ts will be 1.414 times t, the critical value of ts will also be 1.414 times the critical value of t.
All Pairwise Comparisons among Means: Tukey's HSD Procedure (1 of 2)

The "Honestly Significantly Different" (HSD) test proposed by the


statistician John Tukey is based on what is called the "studentized
range distribution." To test all pairwise comparisons among means
using the Tukey HSD, compute t for each pair of means using the
formula:

where Mi - Mj is the difference between the ith and jth means,


MSE is the Mean Square
Error, and nh is the harmonic mean of the sample sizes of groups i
and j.

All Pairwise Comparisons among Means: Tukey's HSD Procedure (2 of 2)

Next section: Newman-Keuls

The critical value of ts is determined from the distribution of the


studentized range. The number of means in the experiment is used
in the determination of the critical value, and this critical value is
used for all comparisons among means. Typically, the largest
mean is compared with the smallest mean first. If that difference is
not significant, no other comparisons will be significant either, so
the computations for these comparisons can be skipped.

The Tukey HSD procedure keeps the EER at the specified


significance level. This is a great advantage. This advantage
comes at a cost, however: the Tukey HSD is less powerful than
other methods of testing all pairwise comparisons.

All Pairwise Comparisons among Means: Newman-Keuls Procedure (1 of 4)

The Newman-Keuls method, like the Tukey HSD, is based on


the studentized range distribution. Consider an experiment in
which there are five treatment conditions. First
the means are rank ordered from smallest to largest. Then, the
smallest mean is compared to the largest mean using the
studentized t. If the test is not significant, then no pairwise tests are
significant and no more testing is done. So far, the Newman-Keuls
method is exactly the same as the Tukey HSD. If the difference
between the largest mean and the smallest mean is significant, then
the difference between the smallest mean (Mean 1) and
the second largest mean (Mean 4) as well as the difference between
the largest mean
(Mean 5) and the second smallest mean (Mean 2) are tested. Unlike
the Tukey HSD,
these comparisons are done using a critical value based on only
four means rather than all five. The rationale is that the comparison
of Mean 1 to Mean 4 only spans four means so the lower critical
value associated with four rather than five means is used. The basic
idea is that when a comparison that spans k means is significant,
comparisons that span k-1 means within the original span of k
means are performed.

All Pairwise Comparisons among Means: Newman-Keuls Procedure (2 of 4)

The critical value of t associated with k - 1 means is used for these comparisons. If a comparison spanning k means is not significant, then no further comparisons within the span of k means are performed. For instance, assume that the following five sample means were obtained from an experiment in which there were 5 subjects in each group: 10, 11, 14, 18, 25. Further assume that MSE = 30. The critical values for the studentized t with 20 degrees of freedom (df = a(n - 1) = 5 x 4 = 20) at the 0.05 level are shown below. (Use a table of the studentized range to determine critical values.)

Span  Critical value
2     2.95
3     3.58
4     3.96
5     4.23

The first comparison is between 10 and 25. The critical value of t is based on a span of 5 means and is equal to 4.23. The calculated value of t for this comparison is 6.12, so the comparison is significant.

All Pairwise Comparisons among Means: Newman-Keuls Procedure (3 of 4)

Since the comparison at span 5 is significant, the following two


comparisons are made at span 4 using the lower critical value of t of
3.96:

Mean 1 versus Mean 4


Mean 2 versus Mean 5
The t for Mean 1 versus Mean 4 is 3.27, which is not significant.
Therefore, no further comparisons within the span are performed and the following comparisons are deemed nonsignificant:
Mean 1 versus Mean 2
Mean 1 versus Mean 3
Mean 1 versus Mean 4
Mean 2 versus Mean 3
Mean 2 versus Mean 4
Mean 3 versus Mean 4
The t for Mean 2 versus Mean 5 is 5.71 which is higher than the
critical value for a span of four of 3.96. Since the comparison of
Means 2 and 5 is significant, comparisons of span 3 within this
span that were not previously ruled out are performed. Since the
comparison of Means 2 and 4 is ruled out by the failure of the
comparison of Means 1 and 4 to be significant, the only
comparison that is done is between Means 3 and 5. The value of t
is 4.49 which is higher than the critical value for span 3 of 3.58.

All Pairwise Comparisons among Means: Newman-Keuls Procedure (4 of 4)

Next section: Duncan's procedure

After the significant difference at span 3, differences not


previously ruled out at span 2 are tested. The comparison of Means
3 and 4 has been ruled out by the failure of the comparison of
Means 1 and 4 to be significant. Therefore, the only comparison
left to be performed is between Means 4 and 5. The t is 2.86 which
is less than the critical value for a span of 2 of 2.95, so the
difference is not significant. The Newman-Keuls procedure has the
advantage of being more powerful than the Tukey HSD. It is
better at controlling the EER than the Fisher's LSD. However,
there are patterns of population means that can
lead to an inflated EER. For instance, if six population means were: 10, 10, 100, 100, 1,000, and 1,000, then comparisons among sample means at span 2 would almost certainly be performed. The null hypothesis is true for three of these comparisons:

Mean 1 versus Mean 2
Mean 3 versus Mean 4
Mean 5 versus Mean 6

Since the PCER for these comparisons is 0.05, the EER is above 0.05.

All Pairwise Comparisons among Means: Duncan's Procedure (1 of 1)

Next section: Recapitulation and recommendations

Duncan's procedure is similar to the Newman-Keuls procedure. The difference is that the critical values are much more liberal. The goal of Duncan's procedure is to keep the EER at

1 - (1 - α)^(a-1)

where "a" is the number of means and α is the PCER. This level of EER can be unacceptably high, however. With six means and a per-comparison error rate of 0.05, the EER is 0.23. Duncan's procedure is only very slightly more conservative than Fisher's LSD. In practice, they almost always lead to the same conclusions.
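
The arithmetic behind that 0.23 is a one-liner in Python:

    # EER tolerated by Duncan's procedure: 1 - (1 - alpha)**(a - 1)
    alpha, a = 0.05, 6
    print(round(1 - (1 - alpha) ** (a - 1), 2))  # 0.23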

All Pairwise Comparisons among Means: Recapitulation and Recommendations (1 of 3)

The Fisher's LSD, Duncan's, Newman-Keuls, and Tukey HSD procedures all use the same formula for computing t. They differ in the critical value. The following figure shows the critical values for an experiment with 10 treatment conditions and four subjects per condition as a function of the number of means spanned in the comparison.

Note that the number of means spanned does not affect the
critical value for either Tukey's HSD or Fisher's LSD. The
former uses the most conservative (highest) critical value for all
comparisons whereas the latter uses the least conservative
(lowest) critical value for all comparisons. The Newman-Keuls
and the Duncan procedures increase the critical value as a
function of span, although Duncan's does not increase it much.

All Pairwise Comparisons among Means: Recapitulation and Recommendations (2 of 3)

There is no "correct" procedure to use; the various procedures


trade off power for control of the EER in different ways. There is
a consensus that the Duncan's and Fisher's LSD procedures result
in too high an EER and should not be used. The choice between
the Newman-Keuls and Tukey HSD is a close call. You have to
decide how important it is to control the EER completely. If you
want to be sure that you have controlled the EER,
then the Tukey HSD should be used. Most statisticians now
consider the Newman-Keuls unacceptably liberal because of
situations in which it allows an inflated EER. However, it is up to
the researcher to weigh the balance between power and controlling
the EER.

Be careful not to accept the null hypothesis for comparisons for


which the null hypothesis is not rejected. Otherwise you may be
left with contradictory conclusions. You may find, for example,
that M1 is significantly different from M3 but that M1 is not
significantly different from M2 and that M2 is not significantly
different from M3.
All Pairwise Comparisons among Means: Recapitulation and Recommendations (3 of 3)
Next section: Example calculations

If the null hypothesis were accepted for the two nonsignificant


comparisons, you would conclude that:

1. µ1 ≠ µ3
2. µ1 = µ2 and
3. µ2 = µ3

which is clearly not possible. The proper conclusion is that either µ1 ≠ µ2, or µ2 ≠ µ3, or neither µ1 nor µ3 is equal to µ2. More data are needed to decide among these three alternatives.

All Pairwise Comparisons among Means: Example Calculations (1 of 3)

The table shown below contains the data from a hypothetical experiment with four groups of four subjects each.

G1    G2     G3    G4
4.0   4.0    4.0   6.0
2.0   5.0    5.0   7.0
3.0   4.0    4.0   7.0
3.0   6.5    7.0   8.0

Means: 3.0  4.875  5.0  7.0

The MSE is 1.1823. The critical values as a function of the number of steps for the four methods (as determined from a studentized range table) are shown below.

                Number of steps
PROCEDURE       2      3      4
Fisher's LSD    3.08   3.08   3.08
Duncan          3.08   3.23   3.33
Newman-Keuls    3.08   3.77   4.20
Tukey HSD       4.20   4.20   4.20

All Pairwise Comparisons among Means: Example Calculations (2 of 3)

Substituting the values for nh and MSE into the formula

t = (Mhigher - Mlower)/√(MSE/nh)

gives the values shown below.

              lower    higher    t
G1 vs G2      3.000    4.875    3.45
G1 vs G3      3.000    5.000    3.68
G1 vs G4      3.000    7.000    7.36
G2 vs G3      4.875    5.000    0.23
G2 vs G4      4.875    7.000    3.91
G3 vs G4      5.000    7.000    3.68

The table shows the value of t for each of the six possible pairwise
comparisons. For Fisher's LSD, all values of t except the
comparison of Group 2 and Group 3 are above the critical value of
3.08 and are therefore significant.

Duncan's test begins with a comparison of G1 and G4 using a
critical value of 3.33. The t of 7.36 is significant, so G1 vs G3 and
G2 vs G4 are tested using a critical value of 3.23. Both are
significant, so all remaining comparisons are tested using a critical
value of 3.08.

All Pairwise Comparisons among Means: Example Calculations (3 of 3)
Next section: Comparing means with a control

              lower    higher    t
G1 vs G2      3.000    4.875    3.45
G1 vs G3      3.000    5.000    3.68
G1 vs G4      3.000    7.000    7.36
G2 vs G3      4.875    5.000    0.23
G2 vs G4      4.875    7.000    3.91
G3 vs G4      5.000    7.000    3.68

All are significant except for the comparison of G2 vs G3.
Therefore, the Duncan's test and the Fisher's LSD yield the same
results for these data. The Newman-Keuls test begins with a
comparison of G1 and G4 using a critical value of 4.20. The test is
significant so
the comparisons G1 vs G3 and G2 vs G4 are tested using a
critical value of 3.77. The comparison G1 vs G3 is not
significant. This means that the comparisons of G1 vs G2
and G2 vs G3 are deemed not significant as well. The comparison
of G2 vs G4 is significant, so the comparison G3 vs G4 is done
using the critical value of 3.08. The comparison is significant.
Two comparisons that were significant with the Fisher's LSD are
not significant with the Newman-Keuls test (G1 vs G3 and G1 vs
G2). The Tukey HSD uses a critical value of 4.20 for all
comparisons. The only significant difference is between Group 1
and Group 4.
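
The t values used by all four procedures above can be reproduced from the group means and the MSE. The following Python sketch assumes equal sample sizes (n = 4 per group, so nh = 4):

    from itertools import combinations
    from math import sqrt

    means = {"G1": 3.0, "G2": 4.875, "G3": 5.0, "G4": 7.0}
    MSE, n_h = 1.1823, 4

    # t = (higher mean - lower mean) / sqrt(MSE / nh)
    for g1, g2 in combinations(means, 2):
        lo, hi = sorted([means[g1], means[g2]])
        t = (hi - lo) / sqrt(MSE / n_h)
        print(g1, "vs", g2, round(t, 2))
    # Prints 3.45, 3.68, 7.36, 0.23, 3.91, and 3.68,
    # matching the table of six pairwise comparisons.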

Comparing Means with a Control (1 of 4)

The purpose of some experiments is to compare the mean of each
of several experimental groups with the mean of a control group.
For example, a researcher may wish to find out whether any of
four new methods of teaching arithmetic is better than the
traditional method. Subjects are taught with one of the four new methods
or with the traditional method (the control group). The
experimenter then wishes to see which experimental group means
(if any) are significantly different from the control group mean.
Naturally, one could use a procedure such as Tukey's HSD to
compare each mean with each other mean. The problem with using
the Tukey's HSD or any other method designed to
compare each mean with each other mean is that these methods
overcorrect for the number of comparisons made. If each mean is
compared to each other mean, then a total of (a)(a-1)/2
comparisons among "a" means would be tested. If there are "a"
means (including the control) then there are only (a-1)
comparisons between experimental means and the control mean to
be tested.

Comparing Means with a Control (2 of 4)

The procedure for comparing each experimental mean with the
control mean is called "Dunnett's test" after the statistician who
developed it. Dunnett's test controls the EER and is more
powerful than tests designed to compare each mean with each
other mean. Dunnett's test is conducted by computing a t-test
between each experimental group and the control group using the
formula:

td = (Mi - Mc)/√(2·MSE/nh)
where Mi is the mean of the ith experimental group, Mc is the
mean of the control group, MSE is the mean square error as
computed from the analysis of variance, and nh is the harmonic
mean of the sample sizes of the experimental group and the
control group. The
degrees of freedom (df) for the test are equal to N-a where N
is the total number of subjects in all groups and "a" is the
number of groups (including the control).

Comparing Means with a Control (3 of 4)

The critical value of td depends on the number of means in the
experiment. Critical values can be obtained from a table of Dunnett's
test.

In a hypothetical experiment described earlier, the effectiveness of
Aspirin, Tylenol, and a Placebo were measured. The data are
reproduced below.

Aspirin   Tylenol   Placebo
   3         2         2
   5         2         1
   3         4         3
   5         4         2

The analysis of variance summary table is shown below:

Source   df    SSq      MS      F
Groups    2    8.000   4.000   3.60
Error     9   10.000   1.111
Total    11   18.000   1.636

Comparing Means with a Control (4 of 4)


Next section: Specific comparisons among means --
Overview

The means of the Aspirin, Tylenol, and Placebo groups are 4,
3, and 2 respectively. Comparing the Aspirin group to the
Placebo control,

td = (4 - 2)/√((2)(1.1111)/4) = 2.68.

A table for Dunnett's test can be used to determine that the critical
value for the 0.05 level (two-tailed) is 2.61. Therefore, this comparison
is significant at the 0.05 level. Comparing the Tylenol group to the
Placebo control,

td = (3 - 2)/√((2)(1.1111)/4) = 1.34,

which is not significant.
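
As a check on this arithmetic, here is a minimal Python sketch of Dunnett's t for these data (variable names are illustrative):

    from math import sqrt

    MSE = 1.1111   # mean square error from the ANOVA table
    n_h = 4        # harmonic mean of the two group sizes (both are 4)
    M_c = 2        # mean of the Placebo (control) group

    # td = (Mi - Mc) / sqrt(2 * MSE / nh)
    for name, M_i in [("Aspirin", 4), ("Tylenol", 3)]:
        t_d = (M_i - M_c) / sqrt(2 * MSE / n_h)
        print(name, round(t_d, 2))   # 2.68 and 1.34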

Specific Comparisons among Means: Overview (1 of 3)

Frequently experimenters wish to make more complex
comparisons than simply comparisons between pairs of means. In
a hypothetical experiment described earlier, the researcher
wished to compare the average pain relief experienced by the
aspirin and Tylenol groups with the relief experienced by the placebo
group. The null hypothesis was:

H0: (µaspirin + µtylenol)/2 - µplacebo = 0

In other words, the null hypothesis was that the average amount of
relief experienced by subjects taking either of the two drugs was
the same as the relief experienced by subjects taking a placebo.
This null hypothesis can be specified in terms of a linear
combination of means:

H0: Σaiµi = 0 where a1 = 0.5, a2 = 0.5, and a3 = -1.

Planned versus Unplanned Comparisons


It makes a tremendous difference whether or not a comparison
among means is planned prior to viewing the data. If the
comparison is planned in advance, it can be tested using the
procedures spelled out in another section.

Specific Comparisons among Means: Overview (2 of 3)

A very different and much less powerful procedure must be used
if the comparison is unplanned. It may seem that it should not
matter whether or not a comparison is planned in advance. After
all, the same data are analyzed regardless of whether or not
the comparison is planned. It does matter, however, because if an
experimenter looks at the data and then chooses a comparison, he
or she will almost certainly choose to compare the means that
differ the most. This is tantamount to doing all possible
comparisons among means, a procedure that, by capitalizing on
chance, produces an inflated Type I error rate.
The Scheffé test is a procedure that allows one to test unplanned
comparisons among means without inflating the Type I error rate.
Scheffé's test gives the freedom to test any and all comparisons
that look interesting. However, this great flexibility has a cost:
Scheffé's test normally has very low power.

Specific Comparisons among Means: Overview (3 of 3)

Next section: Computing comparisons

Testing Multiple Comparisons


Even if the comparisons are planned in advance, testing more than
one comparison
results in an increase in the EER. One way to control the EER is
to lower the significance level for the PCER. Naturally, the more
comparisons that are conducted, the more the PCER has to be
lowered. A second way to control the EER is to use Scheffé's test.
Scheffé's test controls the EER no matter how many comparisons
are conducted.

Computing Tests of Comparisons (1 of 6)

Planned Comparisons
A method for computing planned comparisons among means is
described in another section. That method is generalized here
so that it can be used with unequal as well as
with equal sample sizes.

First compute

t = L/sL, where sL = √(MSE·Σ(ai²/ni)) and L = ΣMiai

where
ai is the coefficient applied to the ith mean
ni is the sample size of the ith group
Mi is the ith mean
MSE is from the analysis of variance. It equals SSE/dfe.
The t is based on dfe = N-a degrees of freedom where N is the
total number of subjects and "a" is the number of groups.

Computing Tests of Comparisons (2 of 6)

Consider a hypothetical experiment on the effect of background
music on reading comprehension. There were five groups of
subjects who read in the presence of either:

Group 1: classical music (soft)
Group 2: classical music (loud)
Group 3: rock music (soft)
Group 4: rock music (loud)
Group 5: no background music

The reading comprehension scores are shown below:

G1    G2    G3    G4    G5
94    92    86    85    9
91    93    76    86    9
92    85    85    89    9
90    88    91    75    8
89          92    87

Assume the experimenter had planned in advance to test whether
classical music and rock music have different effects. The
coefficients to test the difference between classical and rock music
(ignoring volume) are:

1, 1, -1, -1, 0.

These coefficients compare the first two groups with the second
two groups. The means of the five groups are 91.2, 89.5, 86.0,
84.4, and 94.0 respectively. To compute L, the coefficients are
multiplied by the means. Therefore,

L = (1)(91.2) + (1)(89.5) + (-1)(86.0) + (-1)(84.4) + (0)(94) = 10.3

Computing Tests of Comparisons (3 of 6)


As can be seen from the analysis of variance summary table shown
below, MSE = 21.722.

Source   df    SSq       MS       F      p
Groups    4   274.913   68.728   3.16   .039
Error    18   391.000   21.722
Total    22   665.913   30.269

Σ(ai²/ni) = (1)²/5 + (1)²/4 + (-1)²/5 + (-1)²/5 + (0)²/4 = 0.85

Therefore,

sL = √((21.722)(0.85)) = 4.297.

Finally,

t = L/sL = 10.3/4.297 = 2.397.

The degrees of freedom error = N - a = 23 - 5 = 18 (see the
summary table). From a t table it can be determined that the
probability value associated with a t of 2.397 with 18 df is
0.028.
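
The entire computation can be verified in a few lines of Python. This sketch assumes scipy is available for the t distribution; everything else follows the formulas above:

    from math import sqrt
    from scipy import stats

    means = [91.2, 89.5, 86.0, 84.4, 94.0]
    ns    = [5, 4, 5, 5, 4]
    coefs = [1, 1, -1, -1, 0]          # classical vs rock
    MSE, dfe = 21.722, 18

    L  = sum(a * m for a, m in zip(coefs, means))               # 10.3
    sL = sqrt(MSE * sum(a**2 / n for a, n in zip(coefs, ns)))   # 4.297
    t  = L / sL                                                 # 2.397
    p  = 2 * stats.t.sf(abs(t), dfe)                            # about 0.028
    print(round(L, 1), round(sL, 3), round(t, 3), round(p, 3))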
Computing Tests of Comparisons (4 of 6)

Comparisons among means can be tested with F as well as with
t. The formula for F is:

F = MSB/MSE

where MSB (the mean square between) is equal to the sum of
squares between (SSB) divided by the degrees of freedom
numerator (dfn): MSB = SSB/dfn. The formula for SSB for a
comparison is:

SSB = L²/Σ(ai²/ni)

where L, ai, and ni are as defined previously. The dfn for a
planned comparison is always one. Therefore, MSB = SSB.
MSE, as defined previously, is taken from the
analysis of variance. The degrees of freedom for
the F test are:
dfn = 1
dfe = N-a.

An F with one dfn is equal to t². Therefore, the F test of a
comparison is equal to the square of the t-test of the
comparison described previously. For the example data,

MSB = SSB = 10.3²/0.85 = 124.812.

Computing Tests of Comparisons (5 of 6)

From the ANOVA table, MSE = 21.722.

F = 124.812/21.722 = 5.746.

The F of 5.746 = t² = 2.397².
The probability value for F (as determined from an F table using
dfn = 1 and dfe = 18) is
0.028, which is the same as for t.

Unplanned Comparisons
Scheffé's test is used for unplanned comparisons. This test is the
same as the F test for planned comparisons just discussed
except that dfn = a-1. Therefore,

MSB = SSB/dfn = SSB/(a-1)

As in the section on planned comparisons, MSE is from the
analysis of variance. It equals SSE/dfe.

Computing Tests of Comparisons (6 of 6)

Next section: Multiple comparisons


The section on planned comparisons has a computational
example of a planned comparison. For that example, assume that
the comparison between the classical music and the rock music
was not planned in advance. This makes it necessary to do
Scheffé's test. In the example, the values SSB = 124.812 and
MSE = 21.722 have already been computed.

MSB = 124.812/dfn = 124.812/(5-1) = 31.203


F = 31.203/21.722 = 1.44

An F table shows that the probability of an F with 4 and 18 degrees
of freedom being 1.44 or larger is 0.26. Compare this value with the probability
value of 0.028 obtained when the comparison was assumed to
be planned in advance. Scheffé's test is not a powerful test. If at
all possible, you should plan your comparison(s) among means
in advance.
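
A sketch of the same Scheffé computation in Python (again assuming scipy for the F distribution):

    from scipy import stats

    SSB, MSE = 124.812, 21.722
    a, dfe = 5, 18                  # five groups, 18 df error

    MSB = SSB / (a - 1)             # 31.203
    F = MSB / MSE                   # about 1.44
    p = stats.f.sf(F, a - 1, dfe)   # about 0.26
    print(round(MSB, 3), round(F, 2), round(p, 2))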

Synonyms
Planned comparisons are sometimes called "a priori"
comparisons. Unplanned comparisons are sometimes called "post
hoc" comparisons and at other times are called "a posteriori"
comparisons.

Multiple Comparisons (1 of 3)
If more than one comparison among means is conducted at a
given PCER, the EER will be higher than the PCER. The
following inequality can be used to control the EER:

EER ≤ (c)(PCER)

where c is the number of comparisons performed. For example, if
six comparisons are performed at the 0.05 significance level
(PCER = 0.05), then the EER is less than or equal to (6)(0.05) =
0.30. If a researcher wishes to perform six comparisons and keep
the EER at the 0.05 level, the 0.05/6 = 0.0083 significance level
should be used for each comparison. That way, the EER would be
less than or equal to (6)(.0083) = .05.

In general, to keep the EER at or below 0.05, the PCER should be
PCER = 0.05/c, where c is the number of comparisons.
More generally, to ensure that the EER is less than or equal to α, use

PCER = α/c.

Adjusting the PCER in this manner is called either the "Bonferroni
adjustment" or "Dunn's procedure" after the statisticians who developed it.
Multiple Comparisons (2 of 3)

Since Scheffé's test can also be used to test multiple comparisons,
it is important to use the Scheffé test if it is more powerful than
the Bonferroni adjustment. Scheffé's test will be more powerful if
more than three comparisons are planned among three means or
more than seven comparisons are planned among four means.
In almost all realistic situations, the Bonferroni adjustment is
more powerful.

As in the case of pairwise comparisons, there is a tradeoff between
controlling the EER and power. The Bonferroni adjustment
provides control over the EER at a substantial cost in power. Some
statisticians argue that it is not always necessary to control the
EER. The "experiment," after all, is an arbitrary unit of analysis.
Why is it necessary to control the error rate in one experiment but
not in a whole series of experiments? For example, if a researcher
conducted a series of five related experiments, few statisticians
would recommend that the probability of a Type I error in any of
the comparisons in any of the experiments be controlled.
Nonetheless, the experiment is a convenient unit of analysis. The
general consensus is that steps should be taken to control the EER.
Multiple Comparisons (3 of 3)

Next section: Orthogonal comparisons

There is room for argument, however, about whether it should be
strictly controlled, or whether it should be allowed to rise slightly.

One consideration is the definition of a family of comparisons.
Let's say you conducted a study in which you were interested in
whether there was a difference between male and female babies in
the age at which they started crawling. After you finished
analyzing the data, a colleague of yours had a totally different
research question: Do babies who are born in the winter differ
from those born in the summer in the age they start crawling?
Should the EER be controlled or should it be allowed to be greater
than 0.05? A compelling argument can be made that there is no
reason you should be penalized (by lower power) just because
your colleague used the same data to address a different research
question.

Orthogonal Comparisons (1 of 5)
When comparisons among means provide independent
information, the comparisons are called "orthogonal." If an
experiment with four groups were conducted, then a comparison
of Groups 1 and 2 would be orthogonal to a comparison of
Groups 3 and 4.
There is nothing in the comparison of Groups 1 and 2 that provides
information about the comparison of Groups 3 and 4. These two
comparisons are orthogonal.

Now consider the following two comparisons: Group 1 with Group
2, and Group 1 with the mean of Groups 2 and 3. These two
comparisons are clearly not orthogonal: both involve a comparison
of Groups 1 and 2, although the second comparison also involves
Group 3. If Group 1 is larger than Group 2, then it is probably (but
not necessarily) larger than the mean of Groups 2 and 3. The
information conveyed by the two comparisons overlaps; the
comparisons are not independent.

Orthogonal Comparisons (2 of 5)

There is a simple rule for determining if two comparisons are
orthogonal: they are orthogonal if and only if Σaibi = 0

where ai is the ith coefficient of the first comparison and bi is the ith
coefficient of the second comparison. Again consider the
comparisons:
Group 1 with Group 2
Group 1 with the average of Groups 2 and 3
The coefficients for the first comparison are: 1, -1, 0, 0.
The coefficients for the second comparison are: 1, -.5, -.5, 0.
Σaibi = (1)(1) + (-1)(-.5) + 0 + 0 = 1.5 ≠ 0.
Therefore, these two comparisons are not orthogonal. For the
comparisons:
Group 1 with Group 2
Group 3 with Group 4
the coefficients are: 1, -1, 0, 0 and 0, 0, 1, -1. Therefore,
Σaibi = (1)(0) + (-1)(0) + (0)(1) + (0)(-1) = 0
and the comparisons are orthogonal.
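
The rule translates directly into code. A minimal Python sketch (the helper name is illustrative; the rule as stated assumes equal sample sizes):

    def orthogonal(a, b):
        # Two comparisons are orthogonal iff the sum of the products
        # of their corresponding coefficients is zero.
        return sum(x * y for x, y in zip(a, b)) == 0

    print(orthogonal([1, -1, 0, 0], [1, -0.5, -0.5, 0]))   # False (sum = 1.5)
    print(orthogonal([1, -1, 0, 0], [0, 0, 1, -1]))        # True  (sum = 0)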

Orthogonal Comparisons (3 of 5)

If there are "a" groups in an experiment, then it is possible to make up
a-1 mutually orthogonal comparisons among the means. There are
many ways to make up a set of a-1 orthogonal comparisons, but there
is no way to make up a set of more than a-1 mutually orthogonal
comparisons. Two sets of three mutually orthogonal comparisons
among four means are shown below.

             Set 1             Set 2
Mean 1     1   1   1         3   0   0
Mean 2     1  -1  -1        -1   2   0
Mean 3    -1   1  -1        -1  -1   1
Mean 4    -1  -1   1        -1  -1  -1

The value of Σaibi for any pair of comparisons in either set is zero.
For instance, for Comparisons 2 and 3 of Set 1,

Σaibi = (1)(1) + (-1)(-1) + (1)(-1) + (-1)(1) = 1 + 1 + (-1) + (-1) = 0.

Orthogonal Comparisons (4 of 5)
If a set of a-1 orthogonal comparisons is constructed and the sum of
squares for each comparison is computed, then the sum of the a-1 sums
of squares will be equal to the sum of squares between in the one factor
analysis of variance. For example, consider the data described in the
section on partitioning the sums of squares. The means for the three
groups are:

M1 = 4
M2 = 3
M3 = 2,
n = 4, and,
as previously computed, SSB = 8.

The SSB can also be computed by making up two orthogonal
comparisons among the three means and adding the sums of squares
associated with each. A set of orthogonal comparisons is:

           C1    C2
Mean 1      2     0
Mean 2     -1     1
Mean 3     -1    -1

The formula for the sum of squares for each comparison is:

SSB = L²/Σ(ai²/ni)

Orthogonal comparisons (5 of 5)

Next section: Trend analysis


For this example,

L1 = (2)(4)+ (-1)(3) + (-1)(2) = 3

SSB1 = 3²/1.5 = 6

L2 = (0)(4)+ (1)(3) + (-1)(2) = 1

SSB2 = 1²/.5 = 2
SSB = SSB1 + SSB2 = 6 + 2 = 8 which is the same value
obtained in the analysis of variance.
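
This partitioning can be checked numerically. A minimal Python sketch for the example (equal n = 4 per group):

    means, n = [4, 3, 2], 4
    comparisons = [[2, -1, -1], [0, 1, -1]]   # the orthogonal set above

    total = 0.0
    for coefs in comparisons:
        L = sum(a * m for a, m in zip(coefs, means))
        ss = L**2 / sum(a**2 / n for a in coefs)   # SSB for one comparison
        total += ss
        print(L, ss)   # L1 = 3, SSB1 = 6.0; L2 = 1, SSB2 = 2.0
    print(total)       # 8.0, the SSB from the analysis of variance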

It is important to know whether the comparisons you are
conducting are orthogonal or not. However, it is not critical that
you limit yourself to orthogonal comparisons. It is much more
important to test the comparisons that make sense in terms of
your experimental hypotheses.

Trend Analysis (1 of 5)

If an experiment contains a quantitative independent variable, then
the shape of the function relating the levels of this quantitative
independent variable to the dependent variable is often of interest.
For instance, consider a hypothetical experiment on the effect of
the magnitude of reward on the time it takes rats to run down an
alley. Running time as a function of the number of food pellets
provided as the reward is shown in the graph below. The graph
shows that running time decreased as reward size increased.
Trend Analysis (2 of 5)

Trend analysis can be used to test different aspects of the shape of
the function relating the independent variable (number of pellets
in this example) and the dependent variable (running time). Trend
analysis consists of testing one or more components of trend.
These components are tested using specific comparisons. The
linear component of trend is used to test whether there is an
overall increase (or decrease) in the dependent variable as the
independent variable increases. The graph below shows that
running time decreased as magnitude of reward increased.
A test of the linear component of trend is a test of whether this
decrease in running time is significant. If there were a perfectly
linear relationship between magnitude of reward and running
time, then no components of trend other than the linear would be
present. The figure shows, however, that the relationship is not
perfectly linear: The slope of the
function decreases as magnitude of reward increases. Running
time decreases four seconds as the number of pellets increases
from two to four, two seconds as the number of pellets increases
from four to six, and only one second as the number of pellets
increases form six to eight.

Trend Analysis (3 of 5)

The quadratic component of trend is used to test whether the slope
increases (or decreases) as the independent variable increases.
Components of trend beyond the quadratic are not usually
psychologically interpretable. The cubic component, for example, tests
whether the slope
changes twice (decreasing and then increasing or increasing and
then decreasing) as the independent variable increases.

Trend analysis is computed as a set of orthogonal comparisons
using a particular set of coefficients. Coefficients for the linear,
quadratic, and cubic components of trend are given below:

2 means
Linear     -1   1

3 means
Linear     -1   0   1
Quadratic   1  -2   1

4 means
Linear     -3  -1   1   3
Quadratic   1  -1  -1   1
Cubic      -1   3  -3   1

5 means
Linear     -2  -1   0   1   2
Quadratic   2  -1  -2  -1   2
Cubic      -1   2   0  -2   1

6 means
Linear     -5  -3  -1   1   3   5
Quadratic   5  -1  -4  -4  -1   5
Cubic      -5   7   4  -4  -7   5

7 means
Linear     -3  -2  -1   0   1   2   3
Quadratic   5   0  -3  -4  -3   0   5
Cubic      -1   1   1   0  -1  -1   1

8 means
Linear     -7  -5  -3  -1   1   3   5   7
Quadratic   7   1  -3  -5  -5  -3   1   7
Cubic      -7   5   7   3  -3  -7  -5   7

9 means
Linear     -4  -3  -2  -1   0   1   2   3   4
Quadratic  28   7  -8 -17 -20 -17  -8   7  28
Cubic     -14   7  13   9   0  -9 -13  -7  14

10 means
Linear     -9  -7  -5  -3  -1   1   3   5   7   9
Quadratic   6   2  -1  -3  -4  -4  -3  -1   2   6
Cubic     -42  14  35  31  12 -12 -31 -35 -14  42

Trend Analysis (4 of 5)

Assume that the MSE for the magnitude of reward experiment were
5 and that there were 10 subjects per group. The group means are:

Reward size   Time
     2         10
     4          6
     6          4
     8          3

The table on the previous page shows that the coefficients for the
linear component of trend with four groups are: -3, -1, 1, 3.

Applying the formulas for a comparison presented elsewhere:

L = ΣMiai = (10)(-3) + (6)(-1) + (4)(1) + (3)(3) = -23

sL = √(MSE·Σ(ai²/n)) = √((5)(20)/10) = 3.162.

t = -23/3.162 = -7.27.

Trend analysis (5 of 5)
Next section: Formal model

The degrees of freedom for the t is equal to the degrees of freedom
error in the analysis of variance, which is N - a = 40 - 4 = 36. A t table can be used to
find that the probability value (p) is less than .0001. Therefore,
the decrease in running time associated with the increase in
magnitude of reward is significant. Rats run faster if the reward
is larger.

The quadratic component of trend is used to test whether the
decreasing slope that shows up as a flattening out of the function
is significant. The table on a previous page shows that the
coefficients for the quadratic component of trend with four groups
are: 1, -1, -1, 1.

L = ΣMiai = (10)(1) + (6)(-1) + (4)(-1) +(3)(1) = 3

sL = 1.414

t = 3/1.414 = 2.12,
df = 36, and p = .041. The quadratic component of trend is
significant.
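
Both trend tests follow the same comparison formulas used throughout this chapter. A minimal Python sketch for the reward data (equal n = 10 per group):

    from math import sqrt

    means, MSE, n = [10, 6, 4, 3], 5, 10
    components = {"linear": [-3, -1, 1, 3], "quadratic": [1, -1, -1, 1]}

    for name, coefs in components.items():
        L  = sum(a * m for a, m in zip(coefs, means))
        sL = sqrt(MSE * sum(a**2 for a in coefs) / n)
        print(name, L, round(sL, 3), round(L / sL, 2))
    # linear:    L = -23, sL = 3.162, t = -7.27
    # quadratic: L =   3, sL = 1.414, t =  2.12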

Formal Model (1 of 2)

In an analysis of variance design, each score in the "a" populations can
be defined in terms of the following formal model:

yij = μg + αj + εij

where:

- yij is the score of the ith subject in the jth population,
- μg is the mean of all the population means,
- αj equals μj - μg, where μj is the mean of the jth population; αj
  represents the effect of being in the jth population, and Σαj = 0,
- εij is the sum of all other effects on the ith person in the jth
  population and is referred to as error.
The error within each population is assumed to be normally
distributed and have a mean of zero. The error variances within
the "a" populations are assumed to be equal.

Formal Model (2 of 2)
Next section: Expected mean squares

The formal model shows that each score is assumed to be the sum
of three components: the overall height or elevation of the scores
(μg), the elevation or depression resulting from being in the jth
population (αj), and error (εij). In terms of this model, the null
hypothesis tested by ANOVA is that all "a" values of αj = 0. That
is, each score is equal to the mean of the population means plus
error.

Mathematical statisticians use the formal model in many of their
derivations. An example is the derivation of the expected value of
MSB.

Expected Mean Squares for a One-factor between-subject ANOVA (1 of 2)

As stated in another section, both the numerator and the
denominator of the F ratio estimate the population variance when
the null hypothesis is true. Since they are both unbiased
estimates, the expected value of both MSB and MSE is σ².
Symbolically, E[MSB] = E[MSE] = σ². This section covers the
expected value of MSB and MSE when the null hypothesis is
false.

The MSB is based upon the sample means: the greater the
variance of the sample means, the greater the MSB. Therefore,
when the null hypothesis is false and the population means are not
equal, the expected value of MSB is greater than when the null
hypothesis is true. It stands to reason that the more the population
means differ from each other, the greater the expected value of
MSB. Mathematical statisticians have derived the following
formula for the expected value of MSB:

E[MSB] = σ² + n·Σ(µi - µ̄)²/(a - 1)

where σ² is the population variance, µi is the ith population
mean, µ̄ is the mean of the population means, n is the number of
subjects in each group, and "a" is the number of population
means.

Expected Mean Squares for a One-factor between-subject ANOVA (2 of 2)
Next section: Reporting results

For example, if σ² = 100, n = 10, a = 3, µ1 = 5, µ2 = 6, and µ3 = 10,
then µ̄ = 7 and

E[MSB] = 100 + (10)[(5-7)² + (6-7)² + (10-7)²]/2 = 100 + (10)(14)/2 = 170.

The expected value of MSB is σ² when the null hypothesis is true
but is equal to a larger value when the null hypothesis is false.
E[MSE] = σ² whether or not the null hypothesis is true.

For the example, E[MSE] = 100. Whenever the null hypothesis is
false, E[MSB] > E[MSE] and relatively large values of F =
MSB/MSE can be expected. Although you will rarely need to
know the expected value of MSB or MSE, it is important to see
that both expected values are the same when the null hypothesis is
true and that the expected value of MSB is larger when the null
hypothesis is false.
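
The expected values for the example can be computed directly from the formula; a minimal Python sketch:

    # E[MSB] = sigma^2 + n * sum((mu_i - mu_bar)^2) / (a - 1)
    sigma2, n = 100, 10
    mus = [5, 6, 10]

    mu_bar = sum(mus) / len(mus)   # 7
    e_msb = sigma2 + n * sum((m - mu_bar) ** 2 for m in mus) / (len(mus) - 1)
    print(e_msb)                   # 170.0, versus E[MSE] = 100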

Reporting Results (1 of 2)

Results should be described as simply and as free of statistical jargon as
possible. It is best to start by presenting a graph or table that portrays the
descriptive statistics. Then, describe the relevant findings in simple plain
English. Finally, state which effects were statistically significant. A
report of the hypothetical study comparing the effect of background
noise might read as follows:

Table 1
Condition         M      s
No Noise         8.90   2.47
Moderate Noise   5.60   2.50
Loud Noise       2.60   2.22

As can be seen in Table 1, background noise had a substantial effect on
performance: the louder the background noise, the lower the
performance. Box plots further illustrating the differences are shown in
Figure 1. An analysis of variance was conducted and the effect of noise
was significant, F(2,27) = 5.94, p = 0.007. The Tukey HSD procedure
revealed that all pairwise differences among means were significant,
p < 0.05.

Reporting Results (2 of 2)

Next chapter: Factorial between-subjects ANOVA

The first sentence sums up the findings. The reader is referred to
the table and the figure to see just how large the effect of noise
was. The rest of the report contains the results of the statistical
analyses. It shows that the differences were not due to chance. If
possible, the section on the statistical analyses should be written
so that it can be skipped without loss of continuity. The symbol
"F(2,27) = 5.94" means that the F was computed to be
5.94, that the degrees of freedom numerator were 2, and
the degrees of freedom denominator were 27.

In general, the analysis of variance summary table is not reported,
although it is sometimes included in an appendix. When planned
comparisons are reported, make sure to specify that they were
planned. If the rationale for a planned comparison is not obvious,
it should be spelled out. Otherwise, the reader will have difficulty
believing that it was really planned.


Chapter 12

1. What is the null hypothesis tested by analysis of variance?


2. What are the assumptions of between-subjects analysis of variance?

3. What is a between-subjects variable?

4. Why not just compute t-tests among all pairs of means instead of
computing an analysis of variance?

5. What is the difference between "N" and "n"?

6. How is it that estimates of variance can be used to test a hypothesis


about means?

7. Explain why the variance of the sample means has to be multiplied


by "n" in the computation of MSB.

8. What kind of skew does the F distribution have?

9. When do MSB and MSE estimate the same quantity?

10. If an experiment is conducted with 6 conditions and 5 subjects in


each condition, what are dfn and dfe?

11. How is the shape of the F distribution affected by the degrees of


freedom?
12. What are the two components of the total sum of squares?

13. How is the mean square computed from the sum of squares?

14. An experiment compared the ability of three groups of subjects to
remember briefly-presented chess positions. The data are shown
below. Compute an analysis of variance. Using the Tukey HSD
procedure, which groups are significantly different from each other at
the .05 level?

Group 1: Nonplayers
Group 2: Beginners
Group 3: Tournament players

1 22.1
1 22.3
1 26.2
1 29.6
1 31.7
1 33.5
1 38.9
1 39.7
1 43.2
1 43.2
2 32.5
2 37.1
2 39.1
2 40.5
2 45.5
2 51.3
2 52.6
2 55.7
2 55.9
2 57.7
3 40.1
3 45.6
3 51.2
3 56.4
3 58.1
3 71.1
3 74.9
3 75.9
3 80.3
3 85.3

15. If you plan to make 4 comparisons among means and keep the EER
at .05, what should the PCER be?
16. What difference does it make whether or not comparisons are
planned in advance?

17. What is the difference between the Tukey HSD test and the
Newman-Keuls test? Which is more conservative?

18. What procedure do you use to make unplanned comparisons?


Computationally, how does it differ from the procedure for planned
comparisons?

19. What are orthogonal comparisons? What is the maximum number of


orthogonal comparisons that can be made among 6 means?

20. Test to see if there is a significant linear trend of chess skill in the
dataset given in Problem
14. Test to see if there is a significant quadratic trend.

21. If there are three populations and µ1 = 6, µ2 = 8, and µ3 = 10, then
what are α1, α2, and α3? If σ² = 25 and n = 10, what is the expected
value of MSB?

22. What is the relationship between F and t?

23. When is the assumption of homogeneity of variance something to


worry about?
24. Below are shown data from a hypothetical experiment on the
effects of sleep deprivation on the time it takes to solve two anagram
problems. Subjects went without sleep for either 16, 24, 32, or 40
hours. Are increased levels of sleep loss significantly (α = .05)
associated with longer problem solving time? Do the test assuming
this comparison was planned in advance and assuming it was not. Use
two-tailed tests in both cases.

Hours without sleep
16   24   32   40
4 4 5 5
4 6 6 6
5 6 8 7
6 7 8 10
7 8 9 11
7 8 9 11

25. What coefficients could be used to test the null hypothesis:
(a) µ1 - (µ2 + µ3)/2 = 0 (b) µ1 - µ2 = µ3 - µ4 (c) There is no difference
between the mean of population means 1, 2, and 3 and the mean of
population means 4 and 5.

26. What test should an experimenter use if he or she wishes to
compare several means to a control mean? For the data in Problem
24, assume that 16 hours without sleep is the control condition and
compare each of the other conditions to this control condition.
27. Give the coefficients for a set of four orthogonal comparisons
among five means.

Selected Answers

14. F(2,27) = 18.3688, p < 0.001. All pairwise comparisons are
significant.
Factorial Design (1 of 2)

When an experimenter is interested in the effects of two or more
independent variables, it is usually more efficient to manipulate these
variables in one experiment than to run a separate experiment for each
variable. Moreover, only in experiments with more than one
independent variable is it possible to test for interactions among
variables. Consider a hypothetical experiment on the effects of a
stimulant drug on the ability to solve problems. There were three levels
of drug dosage: 0 mg, 100 mg, and 200 mg. A second variable, type of
task, was also manipulated. There were two types of tasks: a simple
well-learned task (naming colors) and a
more complex task (finding hidden figures in a complex display). The
mean time to complete the task for each condition in the experiment is
shown below:

Dose Simple Complex


0 32 80
100 25 91
200 21 96

As you can see, each level of dosage is paired with each level of type of
task. The number of conditions (six) is therefore the product of the
number of levels of dosage (three) and type of task (two).
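
Because a factorial design pairs every level of each variable with every level of every other variable, its conditions are simply the Cartesian product of the levels. A minimal Python sketch:

    from itertools import product

    dosage = [0, 100, 200]              # mg
    task = ["simple", "complex"]

    conditions = list(product(dosage, task))
    print(len(conditions))              # 6 = 3 levels x 2 levels
    for cond in conditions:
        print(cond)                     # (0, 'simple'), (0, 'complex'), ...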
Factorial Design (2 of 2)

Next section: Main effects

Experimental designs in which every level of every variable is paired
with every level of every other variable are called factorial designs. In
this example, every level of dosage (0 mg, 100 mg, and 200 mg) is
paired with every level of type of task (simple and complex). Also note
that this implies that every level of type of task is paired with every level
of dosage. This experiment on dosage and type of task is described as a
Dosage (3) x Type of task (2) between-subjects factorial design.

Factorial designs can also contain more than two variables. For example,
an Age (2) x Dosage
(3) x Type of task (2) factorial design would consist of the following 2 x
3 x 2 = 12 conditions.
Main Effect (1 of 2)

The main effect of an independent variable is the effect of the variable
averaging over all levels of other variables in the experiment. The
means from the hypothetical experiment described in the section on
factorial designs are reproduced below.

Dose Simple Complex


0 32 80
100 25 91
200 21 96

The main effect of type of task is assessed by computing the mean
for the two levels of type of task averaging across all three levels of
dosage. The mean for the simple task is (32 + 25 + 21)/3 = 26 and the
mean for the complex task is (80 + 91 + 96)/3 = 89. The main effect of
type of task, therefore, involves a comparison of the mean of the
simple task (26) with the mean of the complex task (89).
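
With equal cell sizes, the marginal means are just row and column averages of the table of cell means; a minimal sketch using numpy:

    import numpy as np

    # Rows: dosage (0, 100, 200 mg); columns: task (simple, complex).
    cell_means = np.array([[32, 80],
                           [25, 91],
                           [21, 96]])

    print(cell_means.mean(axis=0))   # task marginals: [26. 89.]
    print(cell_means.mean(axis=1))   # dosage marginals: [56. 58. 58.5]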

Analysis of variance provides a significance test for the main
effect of each variable in the design.
Main Effect (2 of 2)

Next section: Interaction

If the main effect of type of task were significant, then the null
hypothesis that there is no difference between the simple and complex
tasks would be rejected. Similarly, if the main effect of drug dosage
were significant, then the null hypothesis that there is no effect of drug
dosage would be rejected.

See also: simple effect



Interaction (1 of 8)

Two independent variables interact if the effect of one of the variables
differs depending on the level of the other variable. The means from the
hypothetical experiment described in the section on factorial designs
are reproduced below.

Dose Simple Complex


0 32 80
100 25 91
200 21 96

Notice that the effect of drug dosage differs depending on whether the
task is simple or complex. For the simple task, the higher the dosage, the
shorter the time to complete the task. For the complex task, the higher
the dosage, the longer the time to complete the task. Thus, there is an
interaction between dosage and task complexity. It is usually much
easier to interpret an interaction from a graph than from a table. A graph
of the means for the interaction between task complexity and drug
dosage is shown below.

The dependent variable (response time) is shown on the Y axis. The
levels of drug dosage are shown on the X axis. The two levels of task
complexity are graphed separately.

Interaction (2 of 8)
A look at the above graph shows that the effect of dosage differs as a
function of task complexity. It also shows that the effect of task
complexity differs as a function of drug dosage: The larger the drug
dosage, the greater the difference between the simple task and the
complex task. An interaction does not necessarily imply that the
direction of an effect is different at different levels of a variable. There
is interaction as long as the magnitude of an effect is greater at one level
of a variable than at another. In the example, the complex task always
takes longer than the simple task. There is an interaction because the
magnitude of the difference between the simple and complex tasks is
different at different levels of the variable drug dosage.

Interaction (3 of 8)

Two variables interact if a particular combination of variables
leads to results that would not be anticipated on the basis of the
main effects of those variables. For instance, it is known that both
drinking alcohol and smoking increase the chance of throat cancer.
However, people who both drink and smoke have a much higher
chance of getting cancer than would be predicted if one knew only
how much more likely smokers are than nonsmokers to get throat
cancer and how much more likely drinkers are than nondrinkers to
get throat cancer. The combination of smoking and drinking is
particularly dangerous: these drugs interact.
This definition of interaction in terms of a particular combination
of variables is consistent with the previously-given definition that
there is an interaction if the effect of one variable differs depending
on the level of another variable. In the tobacco and alcohol example,
the effect of smoking on the probability of getting cancer is greater
for people who drink than for people who do not drink: the effect of
smoking differs depending on whether drinkers or nondrinkers are
being considered. Similarly, the effect of drinking differs depending on
whether smokers or nonsmokers are being considered.
Interaction (4 of 8)

Interactions can be described in a variety of ways. Examples of
graphs of interactions and possible verbal descriptions of each
follow.

1. The difference between the treatment and control conditions
was greater for subjects performing Task 1 than for subjects
performing Task 2.

2. There was a greater difference between Task 1 and Task 2 for
subjects in the treatment condition than there was for subjects in
the control condition.
3. Tasks 1 and 2 were performed about equally well in the control
condition but Task 1 was performed considerably better than Task
2 in the treatment condition.

Interaction (5 of 8)
1. On the well-learned task, increased magnitude of reward was
associated with better performance whereas on the novel task,
increased magnitude of reward was associated with reduced
performance.

2. Under low magnitude of reward, the novel task was performed
better than the well-learned task. Under high magnitude of reward,
the well-learned task was performed better than the novel task.

3. The difference between the novel task and the well-learned task
changed from positive for low magnitude of reward to slightly
negative for medium magnitude of reward to very negative for
high magnitude of reward.

Interaction (6 of 8)
1. Overall, Condition B2 led to better performance than did either
B1 or B3. This effect was much more pronounced for subjects
performing Task 1 than for subjects performing Task 2.

2. The difference between Tasks 1 and 2 was greatest for subjects in
Condition B2.

3. The combination of Task 1 and Condition B2 led to especially
high performance.

In Example 4 there is no interaction. The effect of task is the same at all
three levels of B and the effect of B is the same for both tasks. Notice
that the two lines are parallel. When there is no interaction, the lines will
always be parallel.

Interaction (7 of 8)

Higher Order Interactions
So far, all the interactions that have been described are called
"two-way" interactions. They are two-way interactions because they
involve the interaction of two variables. A three-way interaction is an
interaction among three variables. There is a three-way interaction
whenever a two-way interaction differs depending on the level of a
third variable. Consider the two figures on the left side of this page.
The upper figure shows the interaction between task and condition (B)
for well-rested subjects; the lower figure shows the same interaction
for sleep-deprived subjects. The forms of these interactions are
different. For the well-rested subjects, the difference between Tasks 1
and 2 is largest under condition B2, whereas for the sleep-deprived
subjects the difference between Tasks 1 and 2 is smallest under
condition B2. The two-way interactions are therefore different for the
two levels of the variable "sleep deprivation." This means that there is
a three-way interaction among the variables sleep deprivation, task,
and condition.

Interaction (8 of 8)

Next section: Analysis of two-factor designs

Four-way interactions occur when three-way interactions differ as a
function of the level of a fourth variable. Four-way and higher
interactions are usually very difficult to interpret and are rarely
meaningful.


Analysis of Two-factor Designs (1 of 5)

A two-factor analysis of variance consists of three significance tests: a
test of each of the two main effects and a test of the interaction of the
variables. An analysis of variance summary table is a convenient way to
display the results of the significance tests. A summary table for the
hypothetical experiment described in the section on factorial designs and
a graph of the means for the experiment are shown below.

SOURCE   df   Sum of Squares   Mean Square     F        p
T         1     47125.3333      47125.3333   384.174   0.000
D         2        42.6667         21.3333     0.174   0.841
TD        2      1418.6667        709.3333     5.783   0.006
ERROR    42      5152.0000        122.6667
TOTAL    47     53738.6667

Sources of Variation
The summary table shows four sources of variation: (1) Task, (2) Drug
dosage, (3) the Task x
Drug dosage interaction, and (4) Error.
Analysis of Two-factor Designs (2 of 5)

Degrees of Freedom
The degrees of freedom total is always equal to the total number of
numbers in the analysis minus one. The experiment on task and drug
dosage had eight subjects in each of the six groups resulting in a total
of 48 subjects. Therefore, df total = 48 - 1 = 47.

The degrees of freedom for the main effect of a factor is always equal
to the number of levels of the factor minus one. Therefore, df task = 2 -
1 = 1 since there were two levels of task (simple and complex).
Similarly, df dosage = 3 - 1 = 2 since there were three levels of drug
dosage (0 mg, 100 mg, and 200 mg).

The degrees of freedom for an interaction is equal to the product of the
degrees of freedom of the variables in the interaction. Thus, the degrees
of freedom for the Task x Dosage interaction is the product of the
degrees of freedom for task (1) and the degrees of freedom for dosage
(2). Therefore, df Task x Dosage = 1 x 2 = 2.
The degrees of freedom error is equal to the degrees of freedom
total minus the degrees of freedom for all the effects. Therefore, df
error = 47 - 1 - 2 - 2 = 42.
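
These counting rules are easy to encode; a minimal Python sketch for the task x dosage experiment (equal cell sizes assumed):

    # Degrees of freedom for a two-factor between-subjects design.
    n_per_cell, levels_task, levels_dosage = 8, 2, 3

    N = n_per_cell * levels_task * levels_dosage          # 48 subjects
    df_total = N - 1                                      # 47
    df_task = levels_task - 1                             # 1
    df_dosage = levels_dosage - 1                         # 2
    df_interaction = df_task * df_dosage                  # 2
    df_error = df_total - df_task - df_dosage - df_interaction   # 42
    print(df_total, df_task, df_dosage, df_interaction, df_error)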

Analysis of Two-factor Designs (3 of 5)


Sums of Squares
Computational formulas for the sums of squares will not be given
since it is assumed that complex analyses will not be done by
hand.

Mean Squares
As in the case of a one-factor design, each mean square is equal to the
sum of squares divided by the degrees of freedom. For instance, Mean
square dosage = 42.6667/2 = 21.333 where the sum of squares dosage is
42.6667 and the degrees of freedom dosage is 2.

F Ratios
The F ratio for an effect is computed by dividing the mean square for
the effect by the mean square error. For example, the F ratio for the
Task x Dosage interaction is computed by dividing the mean square for
the interaction (709.3333) by the mean square error (122.6667). The
resulting F ratio is: F = 709.3333/122.6667 = 5.783.

Analysis of Two-factor Designs (4 of 5)

Probability Values
To compute a probability value for an F ratio, you must know the
degrees of freedom for the F ratio. The degrees of freedom numerator is
equal to the degrees of freedom for the effect. The degrees of freedom
denominator is equal to the degrees of freedom error. Therefore, the
degrees of freedom for the F ratio for the main effect of task are 1 and
42, the degrees of freedom for the F ratio for the main effect of drug
dosage are 2 and 42, and the degrees of freedom for the F for the Task x
Dosage interaction are 2 and 42.

An F distribution calculator can be used to find the probability values.
For the interaction, the probability value associated with an F of
5.783 with 2 and 42 df is 0.006.

Drawing Conclusions
When a main effect is significant, the null hypothesis that there is no
main effect in the population can be rejected. In this example, the effect
of task was significant. Therefore it can be concluded that, in the
population, the mean time to complete the complex task is greater than
the mean time to complete the simple task (hardly surprising). The
effect of dosage was not significant. Therefore, there is no convincing
evidence that the mean time to complete a task (in the population) is
different for the three dosage levels.
Analysis of Two-factor Designs (5 of 5)

Next section: Analysis of three-factor designs


The significant Task x Dosage interaction indicates that the effect of
dosage (in the population) differs depending on the level of task.
Specifically, increasing the dosage slows down performance on the
complex task and speeds up performance on the simple task. The effect
of increasing the dosage therefore depends on whether the task is
complex or simple.

There will always be some interaction in the sample data. The
significance test of the interaction lets you know whether you can infer
that there is an interaction in the population.

Analysis of Three-factor Designs (1 of 3)

A three-factor analysis of variance consists of seven significance tests: a
test for each of the three main effects, a test for each of the three
two-way interactions, and a test of the three-way interaction. Consider a
hypothetical experiment investigating how well children and adults
remember. There are three factors in the experiment:

1. Type of test: Subjects are either asked to recall or to recognize
the stimuli. In the recall test, subjects are asked to state the names
of as many of the stimuli as they can. On each recognition test
of as many of the stimuli as they can. On each recognition test
trial, subjects are asked to pick out the one stimulus that had been
presented from a set of four stimuli.

2. Age: 10-year old children and adults are compared.

3. Type of stimulus: The stimuli to be remembered are presented
either as pictures or as words.

The experiment can therefore be described as an Age (2) x Type of
stimulus (2) x Type of test (2) factorial design. There were five subjects
in each of the eight conditions. The data are shown on the next page.

Analysis of Three-factor Designs (2 of 3)

                    Recognition                         Recall
              Children        Adults           Children        Adults
            Pictures Words  Pictures Words   Pictures Words  Pictures Words
                9      70      90      87       66      62      88      85
                2      70      90      82       65      59      88      85
                9      71      92      81       67      58      83      82

The ANOVA summary table is shown below. "T" stands for Type of task
(recall or recognition), "A" stands for Age (children or adults), and "S"
stands for type of stimulus (pictures or words). All main effects and
interactions are significant, p < 0.01.

SOURCE   df   Sum of Squares   Mean Square     F        p
T         1      1000.00        1000.0000   273.038   0.000
A         1      1960.00        1960.0000   535.154   0.000
TA        1       518.40         518.4000   141.543   0.000
S         1       902.50         902.5000   246.416   0.000
TS        1       202.50         202.5000    55.290   0.000
AS        1       220.90         220.9000    60.314   0.000
TAS       1        28.90          28.9000     7.891   0.008
ERROR    32       117.20           3.6625
TOTAL    39      4950.40

Analysis of Three-factor Designs (3 of 3)

Next section: Supplementing main effects


The means are graphed above. The left graph is for the recognition task;
the right graph is for the recall task. The effects in the design can be
described as follows:

1. Task: The percent correct was higher for the recognition task than
for the recall task.

2. Age: Adults performed better than children.

3. Task x Age interaction: The effect of age was greater for the
recall task than for the recognition task.

4. Type of stimulus: Memory was better for pictures than for words.

5. Task x Type of stimulus interaction: The difference between
pictures and words was larger for the recognition task than for
the recall task.
6. Age x Type of stimulus interaction: The effect of age was larger
for the words than it was for the pictures.
7. Task x Age x Type of stimulus interaction: The Age x Type of
stimulus interaction was larger for the recognition task than for
the recall task. With the recall task, the difference between
children and adults was only slightly smaller for pictures than for
words. With the recognition task, the difference between children
and adults was much smaller for pictures than for words. There
was essentially no difference between children and adults for the
pictures whereas there was a large difference between children
and adults for the words.

Tests Supplementing Main Effects in Factorial Designs (1 of 3)

Hypothetical data from a two-factor design with three levels of
Factor A and two levels of Factor B are shown below:

            A1      A2      A3     Marginal
B1           5       9       7
             4       8       9
             6       7       9
             5       8       8      7.08
B2           4       8       8
             3       6       9
             6       8       7
             8       5       6      6.50
Marginal   5.125   7.375   7.875    6.79

The ANOVA summary table is shown below:

Source   df    SSq      MS      F      p
A         2   34.333   17.167   9.29   0.001
B         1    2.042    2.042   1.11   0.307
AB        2    2.333    1.167   0.63   0.543
Error    18   33.250    1.847
Total    23   71.958    3.129

Tests Supplementing Main Effects in Factorial Designs (2 of 3)

The marginal mean for A1 (5.125) is the average of the A1B1 and A1B2
cell means: (5.00 + 5.25)/2 = 5.125. Similarly, the marginal mean for A2
is the average of A2B1 and A2B2, etc. The significant main effect of A
means that the marginal means for A1, A2, and A3 are not all equal. It
does not reveal which population means are different from which others.
Procedures for following up a significant main effect such as this one
are analogous to the procedures used to follow up a significant effect
in a one-factor ANOVA. These procedures are:

1. All pairwise comparisons among means

2. Specific comparisons among means

3. Comparing all means with a control

The formulas used in these procedures all involve the sample size (n).
When applying these formulas to multi-factor experiments, n refers to
the number of scores in each of the means being compared.

Tests Supplementing Main Effects in Factorial Designs (3 of 3)

Next section: When no follow up is needed

For example, in the formula for the standard error of a comparison,
n is the sample size of each group. In the present example, there are
three groups, A1, A2, and A3, and there are eight subjects in each of
these groups. Therefore, n = 8. All other aspects of the formulas are
equivalent.

Supplementing Interaction: When no Follow-up is Needed (1 of 2)

A significant interaction indicates that the effect of one variable differs
depending on the level of another variable. Often, this is exactly the
information that a researcher needs, and no follow-up analyses are
information that a researcher needs, and no follow -up analyses are
necessary. For example, consider a hypothetical experiment designed to
test whether hyperactive and non-hyperactive children are affected
differently by the presence of distracting stimuli. The experiment has
two factors: child's classification (hyperactive or non-hyperactive) and
distraction (distraction present and distraction absent). A graph of the
means for the four conditions is shown below.
The graph shows that the effect of distraction is greater for the
hyperactive children than it is for the non-hyperactive children. If the
interaction were significant, then the researcher would be able to
conclude that distracting stimuli are more disruptive to the hyperactive
children than they are to the non-hyperactive children.

Supplementing Interaction: When no Follow-up is Needed (2 of 2)

Next section: Simple effects

No additional statistical tests would be needed to justify this
conclusion. Some researchers feel that the finding of a significant
interaction requires one to do supplemental analyses. No supplemental
analyses are required if the researcher wishes simply to conclude that
the effect of one variable differs depending on the level of another
variable since that is exactly what the interaction is testing.


Supplementing Interaction: Simple Effects (1 of 3)

The presence of interaction limits the generalizability of main effects.
This is because it is difficult to make a general statement about a
variable's effect when the size of the effect depends on the level of a
second variable. For example, consider the hypothetical data shown in
the figure below. The experiment is on the effect of condition (treatment
versus control) on performance
for each of two tasks. Since the effect of condition is different for the two
tasks, it is not possible
to make a general statement about the treatment's effectiveness.

Supplementing Interaction: Simple Effects (2 of 3)

A significant main effect of condition indicates that, on average, the
treatment condition leads to better performance than does the control
condition. Since the effect of condition is different for the two
condition. Since the effect of condition is different for the two
treatments, a significant main effect of condition does not necessarily
imply an effect of condition for both tasks. It might be that, in the
population, there is an effect of condition for Task 1 but not for Task 2.
If the researcher wished to know whether there were an effect of
condition for both tasks, he or she could not rely on the test of the main
effect. Instead, the researcher would test the significance of the effect of
condition separately for the two tasks. The effect of a variable at a
specific level of another variable is called a "simple effect" of the
variable. In this example there are two simple effects of condition: the
effect of condition for Task 1 and the effect of condition for Task 2.
The presence of interaction means that the main effect is not
representative of the simple effects. If you wish to know whether a
variable has an effect at each level of a second variable, you should test
the simple effects.

Supplementing Interaction: Simple Effects (3 of 3)

Next section: Components of interactions

Some researchers have the misconception that testing simple effects is
necessary for understanding the interaction. This is not true since the
interaction is interpretable in its own right. The rationale for testing
interaction is interpretable in its own right. The rationale for testing
simple effects, rather, is simply that sometimes it is important to know
at what levels of a second variable the variable in question has a
significant effect. Testing
simple effects is done following an interaction not to help understand
the interaction, but rather to see where the effect of the variable is
significantly different from zero.


Components of Interaction (1 of 3)

Consider the graph shown below.

Overall, there is an interaction since the effect of B depends on the
level of A. However, only one piece or component of the interaction is
present: The effect of B is greater at A3 than it is at A1 or A2. No other
part of the interaction is present since the effect of B is the same at A1
as it is at A2.

It is possible to test components of interaction for significance by using
the procedures described in the section on specific comparisons among
means. The test of whether the effect of B is significantly different at A3
than it is at the other two levels is done with the comparison among
means shown on the next page.

Components of Interaction (2 of 3)

In this comparison, the difference between B1 and B2 at A3 is
compared with the average of the difference between B1 and B2 at A1
and the difference between B1 and B2 at A2:

(A3B1 - A3B2) - [(A1B1 - A1B2) + (A2B1 - A2B2)]/2

The coefficients for the comparison can be determined by
distributing the negative sign and rearranging the expression. The
coefficients are:
A1B1: -0.5
A1B2: 0.5
A2B1: -0.5
A2B2: 0.5
A3B1: 1.0
A3B2: -1.0
Once the coefficients for a component of the interaction are determined,
the comparison is tested just like any other specific comparison among
means.

Components of Interaction (3 of 3)

Next section: Reporting results

As with other comparisons among means, different procedures are used


if the comparison is planned than if it is unplanned. There are as many
components to an interaction as the interaction has degrees of freedom.
In the present example, the interaction has two degrees of freedom.
The remaining component of the interaction is whether the
difference between B1 and B2 is different at A1 than it is at A2.
This can be expressed as:

(A1B1-A1B2) - (A2B1 - A2B2)

The coefficients for this comparison are:

A1B1: 1
A1B2: -1
A2B1: -1
A2B2: 1
A3B1: 0
A3B2: 0

For the example, this component of interaction is zero.

Reporting Results in Factorial Between-Subjects ANOVA (1 of 4)


Results should be described as simply and as free of statistical jargon as
possible. Begin with a presentation of descriptive statistics. The
descriptive statistics may be presented numerically, graphically, or both.
The results of the analysis of variance should be discussed with
reference to a graph of the group means. First note whether or not there
is an interaction. Describe the relevant outcomes and back up any
claims with the results of statistical tests. Do not let the statistical
analysis become the focus of the discussion. Instead, focus the
discussion on the graph
of the means and use the statistical analysis as a way to substantiate the
effects you point out in the graph. For example, consider the following
hypothetical experiment on age differences in memory for words and
pictures.
Sixteen 8-year-old children and 16 12-year-old children were shown a
set of stimuli and later given a test to see how well they could
recognize the stimuli that had been presented. Half the children at each
age level were presented with word stimuli; the other half were
presented with pictorial stimuli. The percentage correct on the test was
recorded for each child.

Reporting Results in Factorial Between-Subjects ANOVA (2 of 4)

The results of the experiment might be written up as follows:

Results
Measures of central tendency and variability for the four experimental
groups are shown in Table 1.

Table 1

          Eight-Year           Twelve-Year
          Words    Pictures    Words    Pictures
Mean      63.38    84.12       76.75    85.50
Trimean   61.2     84.5        76.5     86.2

In all four groups, the mean and trimean are essentially the
same. Therefore, only the means were considered in further
analyses.
Reporting Results in Factorial Between-Subjects ANOVA (3 of 4)

An Age (2) x Type of stimuli (2) analysis of variance was used to test
differences between means for significance. Figure 1 shows the mean
percent correct on the recognition test as a function of age and type of
stimuli. An inspection of Figure 1 reveals that the effect of age was much
larger for the word stimuli than it was for the pictorial stimuli. The
difference was reflected in a significant Age x Type of stimuli
interaction, F(1,28) = 6.37, p = 0.018. It is also evident from Figure 1
that, overall, 12-year olds performed better than 8-year olds, F(1,28) =
9.35, p < 0.01.
Figure 1. Mean Percent Correct as a function of type of stimulus and age.

Reporting Results in Factorial Between-Subjects ANOVA (4 of 4)

Next section: Next chapter: Within-Subjects ANOVA

An analysis of simple effects showed that this age effect was
significant for the word stimuli, F(1,28) = 15.56, p < 0.01, but not for
the pictorial stimuli, F(1,28) = 0.14, p = 0.71. Therefore, there is no
evidence that 12-year olds differ from 8-year olds in their ability to
recognize pictures. Finally, pictures were recognized better than
words, F(1,28) = 35.05, p < 0.01. An analysis of simple effects
showed that the advantage of pictures over words was significant for
both the 8-year olds, F(1,28) = 26.46, p < 0.01, and for the 12-year
olds, F(1,28) = 5.77, p = 0.023.

Chapter 13

1. An experimenter is interested in the effects of two independent
variables on self-esteem. What is better about conducting a factorial
experiment than conducting two separate experiments, one for each
independent variable?

2. An experiment is conducted on the effect of age and treatment
condition (experimental versus control) on reading speed. Which
statistical term (main effect, simple effect, interaction, specific
comparison) applies to each of the following descriptions of effects?

a. The effect of the treatment was larger for 15-year olds than it was
for 5- or 10-year olds.
b. Overall, subjects in the treatment condition performed faster than
subjects in the control condition.
c. The difference between the 10- and 15-year olds was significant under
the treatment condition.
d. The difference between the 15-year olds and the average of the 5- and
10-year olds was significant.
e. As they grow older, children read faster.

3. An A(2) x B(4) factorial design with 8 subjects in each group is


analyzed. Give the source and degrees of freedom columns of the
analysis of variance summary table.

4. The following data are from a hypothetical study on the effects of age
and time on scores on a test of reading comprehension. Compute the
analysis of variance summary table and test simple effects.

12-year olds 16-year olds


6 74
5 71
6 66
6 95
9 92
6 95

5. Define "Three-way interaction"

6. Define interaction in terms of simple effects.


7. Plot an interaction for an A(2) x B(2) design in which the effect of B
is greater at A1 than it is at A2. The dependent variable is "Number
correct." Make sure to label both axes.

8. Following are two graphs of population means for 2 x 3 designs.


For each graph, indicate which effect(s) (A, B, or A x B) are
nonzero.
9. The following data are from an A(2) x B(4) factorial design.

     B1   B2   B3   B4
A1   3    2    3    5
     41   42   24   68
A2   1    3    6    9
     2    2    7    9

a. Compute an analysis of variance.


b. Test differences among the four levels of B using the Tukey
hsd test at the .05 level.
c. Test the linear component of trend for the effect of B.
d. Plot the interaction.
e. Describe the interaction in words.

Selected Answers

4.
9a.
9b. All significant except 1 versus 2.

9c.
Within-Subjects Design

Next section: Advantages of within-subject designs

Within-subject designs are designs in which one or more of the
independent variables are within-subject variables. Within-subjects
designs are often called repeated-measures designs since
within-subjects variables always involve taking repeated
measurements from each subject. Within-subject designs are
extremely common in psychological and biomedical research.


Advantages of Within-Subject Designs (1 of 2)

Subjects inevitably differ from one another. In an experiment on
children's memory, some children will remember more than others; in an
experiment on depression, some subjects will be more depressed than
others; in an experiment on weight control, some subjects will be heavier
than others. It is simply a fact of life that subjects differ greatly. In
between-subject designs, these differences among subjects are
uncontrolled and are treated as error. In within-subject designs, the same
subjects are tested in each condition. Therefore, differences among
subjects can be measured and separated from error. For example,
consider the following data:

             Control     Experimental
Subject      condition   condition
----------------------------------
Subject 1       12           14
Subject 2       25           28
Subject 3       29           32
Subject 4       54           57

Every subject did better in the experimental condition than in the


control condition. Even though the advantage of the experimental
condition is small, it is very likely real since it is very consistent across
the four subjects.

Advantages of Within-Subject Designs (2 of 2)

Next section: The problem of carryover effects


The important point is that this small but consistent difference can be
detected in the face of large overall differences among the subjects.
Indeed, the difference between conditions is very small relative to the
differences among subjects. Because the conditions can be compared
within each of the subjects, the small difference is apparent.
Differences between subjects are taken into account and are therefore
not error.

Removing variance due to differences between subjects from the error


variance greatly increases
the power of significance tests. Therefore, within-subjects designs are
almost always more powerful than between-subject designs. Since
power is such an important consideration in the design of
experiments, within-subject designs are generally preferable to
between-subject designs.

The Problem of Carryover Effects (1 of 2)

Carryover effects can pose difficult problems for within-subject


designs. Sometimes the problems are so severe that a within-subjects
design is invalid and a between-subjects design must be used. For
example, consider the study of incidental learning. In incidental
learning, subjects are often presented with stimuli and asked to answer
questions about them. They might be asked to count the number of
letters in each word or to judge the pleasantness of each word.
Subsequently, a surprise memory test is given. An experimenter
wishing to test whether incidental learning was better when subjects
counted the letters or made pleasantness judgments would not be able
to use a within-subjects design. Once subjects had been given one
memory test, the second memory test would not be a surprise.

It is possible to use a within-subjects design even if carryover effects
are present, as long as the carryover effects are (a) not severe and (b)
symmetric.
The Problem of Carryover Effects (2 of 2)

Next section: Assumptions of within-subject designs

Consider an experiment comparing the time it takes to read a list of


color names with the time it takes to name colors. If a within-subjects
design is used, then all subjects are tested in both conditions. Carryover
effects are certainly possible in this design. The second task performed
may be performed better because of some kind of practice effect or
because subjects have become primed to say color names. Alternatively,
the second task performed may be performed worse because subjects
have become tired or bored. However, in either case, the carryover
effects would likely be symmetric.

Counterbalancing can be used to control for symmetric carryover


effects. In this experiment, this means simply that half of the subjects
would be given the color-name-reading task before the color-naming
task and the other half of the subjects would be given the color-naming
task first.

Assumptions of Within-subject Designs (1 of 2)


Within-subjects ANOVA assumes that the scores in all conditions are
normally distributed. It is also assumed that each subject is sampled
independently from each other subject. Naturally, it is
not assumed that the scores of a given subject are independent of
each other since the whole point of the analysis is that they are
dependent.

In addition to the assumption of normality, within-subject analysis of


variance is based on assumptions about the variances of the
measurements and the correlations among the measurements. Taken
together, these assumptions are called the assumption of sphericity.
Although a complete description of sphericity is beyond the scope of
this text, there is sphericity if (a) the population variances of the
repeated measurements are equal and (b) the population correlations
among all pairs of measures are equal. Other complex and unusual
patterns of variances and correlations can also produce sphericity.
Violation of the assumption of sphericity is serious: It results in an
increase in the Type I error rate.

Assumptions of Within-Subject Designs (2 of 2)

Next section: ANOVA with one within-subject variable

Since real data rarely meet the assumption of sphericity, this assumption
cannot be safely ignored. In 1954, a statistician named Box developed
an index of the degree to which the assumption of sphericity is violated.
The index, called epsilon (ε), ranges from 1.0, meaning no violation, to
1/df, where df is the degrees of freedom in the numerator of the F ratio.
In the Geisser-Greenhouse correction, the sample value of epsilon can
be used to correct the probability value for violations of sphericity by
multiplying both the numerator and the denominator degrees of freedom
by the sample value of ε. (Note that the corrected df are used only to
compute the p value, not to divide the sum of squares in order to find the
mean square.) The corrected probability value will always be higher
(less significant) than the uncorrected value except when the effect has
one degree of freedom, in which case ε will be 1.0 and the corrected and
uncorrected probability values will be the same. Statistical packages
often refer to this correction as the Geisser-Greenhouse correction
because of a 1957 article by Geisser and Greenhouse that discussed this
issue.

The "Geisser-Greenhouse correction" is known to be somewhat


conservative. An alternative correction developed by Huynh and Feldt
is less conservative and is often computed by standard statistical
packages.

Although the assumption of sphericity has been discussed for many


years, it is still often ignored in practice. This is unfortunate since an
uncorrected probability value from an analysis of variance with
within-subject variables is very rarely valid.
ANOVA with 1 Within-Subject Variable (1 of 4)
A one-factor within-subjects analysis of variance tests the null
hypothesis that all the population means are equal: H0: μ1 = μ2 = ... =
μa

Sources of Variation
In a between-subjects ANOVA, variance due to differences among
subjects goes into the error term. In within-subjects ANOVA,
differences among subjects can be separated from error. "Subjects" is
therefore a source of variation in within-subjects designs. The analysis
of variance summary table for the data given in the section on the
advantages of within-subjects designs is shown below. Notice the three
sources of variation: Subjects, Condition, and Error.

Source      df   Ssq        Ms        F        p
Subjects    3    1888.375   629.458
Condition   1    15.125     15.125    121.00   0.002
Error       3    0.375      0.125
Total       7    1903.875   271.982

ANOVA with One Within-subject Variable (2 of 4)


Source      df   Ssq        Ms        F        p
Subjects    3    1888.375   629.458
Condition   1    15.125     15.125    121.00   0.002
Error       3    0.375      0.125
Total       7    1903.875   271.982
Degrees of Freedom
The degrees of freedom for subjects is always N-1 where N is the
number of subjects. For this problem: N-1 = 4 - 1 = 3.

The degrees of freedom for the independent variable (Condition) is


always equal to the number
of levels of the independent variable minus 1. For this example, the
degrees of freedom is 2 - 1 =
1 since there are two conditions (experimental and control).

The degrees of freedom error is always the product of the degrees of


freedom for Subjects and the degrees of freedom for the independent
variable. For this example, the degrees of freedom is equal to 3 x 1 = 3.
The degrees of freedom total is equal to the total number of numbers in
the analysis minus 1. For this example, there are eight numbers so the
degrees of freedom is equal to
8 - 1 = 7. Notice that the degrees of freedom for Subjects, Conditions,
and Error add up to the degrees of freedom total 3 + 1 + 3 = 7.
ANOVA with One Within-subject Variable (3 of 4)

The formulas for degrees of freedom are summarized below:

df Subjects = N - 1 = 4 - 1 = 3
df Conditions = a - 1 = 2 - 1 = 1
df error = (N - 1)(a - 1) = 3
df total = aN - 1 = 2 x 4 - 1 = 8 - 1 = 7

a is the number of conditions; N is the total number of subjects.

Error as Interaction
The error term in within-subject ANOVA designs is equal to the
Subjects x Condition interaction. As such, it is a measure of the
degree to which the effect of conditions is different for different
subjects. Remember that when variables interact, the effect of one
variable differs depending on the level of another variable. The
error term is a measure of the degree to which the effect of the
variable "Condition" is different for different levels of the variable
"Subjects." (Each subject is a different level of the variable
"Subjects.") Low interaction (low error) means that the effect of
conditions is consistent across subjects. High interaction (high
error) means that the effect of conditions is not consistent across
"subjects."
ANOVA with One Within-subject Variable (4 of 4)

Next section: ANOVA with two within-subject variables


Sums of Squares
Computational formulas for the sums of squares will not be given
since it is assumed that complex analyses will not be done by
hand.

Mean Squares
As always, a mean square is computed by dividing the sum of squares
by the degrees of freedom. Generally speaking, no mean square is
computed for the variable "subjects" since it is assumed that subjects
differ from one another thus making a significance test of "subjects"
superfluous.

Relationship between F and t


The F from a within-subjects design with two levels of the independent
variable is equal to the t² from a test of differences between dependent
means. This relationship is the same as the
relationship between F and t in between-subject designs.

ANOVA with Two Within-Subject Variables (1 of 3)


Consider the following hypothetical experiment: A total of four rats are
run in a maze three trials a day for two days. The number of wrong turns
on each trial is shown below.

Sources of Variation
The sources of variation in this design are: Subjects, Days, Trials, the
Days x Trials interaction, and Error. The ANOVA summary table is
shown below. Notice that there are three error terms: one for the effect
of Days (MSE = 1.38), one for the effect of Trials (MSE = 0.61), and
one for the interaction (MSE = 0.17).

ANOVA with Two Within-Subject Variables (2 of 3)

Degrees of Freedom
The degrees of freedom for Subjects is equal to the number of subjects
minus one. In this example, this is N - 1 = 4 - 1 = 3 since there are 4
subjects.

The degrees of freedom for each main effect is equal to the number of
levels of the variable in question minus one. Therefore, since there are
two days, the effect of Days has 2 - 1 = 1 degrees of freedom. There are
three trials, so the effect of Trials has 3 - 1 = 2 degrees of freedom.

The degrees of freedom for the interaction of the two variables is equal
to the product of the degrees of freedom for the main effects of these
variables. Since Days has 1 df and Trials has 2 df, the Days x Trials
interaction has 1 x 2 = 2 df. The formulas are summarized below:
df Subjects = N - 1 = 4 - 1 = 3
df Days = d - 1 = 2 - 1 = 1
df error(Days) = (N - 1)(d - 1) = 3
df Trials = t - 1 = 3 - 1 = 2
df error(Trials) = (N - 1)(t - 1) = 6
df Days x Trials = (d - 1)(t - 1) = 2
df error(Days x Trials) = (N - 1)(d - 1)(t - 1) = 6
df total = dtN - 1 = (2)(3)(4) - 1 = 23

d is the number of days; t is the number of trials (each day); and
N is the total number of subjects.
ANOVA with Two Within-Subject Variables (3 of 3)

Next section: Tests supplementing ANOVA

Sums of Squares
Computational formulas for the sums of squares will not be given
since it is assumed that complex analyses will not be done by
hand.

Mean Squares
As always, a mean square is computed by dividing the sum of squares
by the degrees of freedom. Generally speaking, no mean square is
computed for the variable "subjects" since it is assumed that subjects
differ from one another, thus making a significance test of "subjects"
superfluous.

F Ratios
The F ratio for each effect is the mean square for the test divided by
the mean square error for the effect. Since there is a separate error term
for each effect, a different mean square error is used in each test.



Tests Supplementing Within-Subjects ANOVA

Next section: ANOVA with between- and within-subject variables

Tests supplementing a within-subjects ANOVA are analogous to tests
supplementing a between-subjects ANOVA. The only differences lie
in the computational details, which are taken care of for you by standard
statistical packages. One important computational difference is that
when specific comparisons among means are made on within-subjects
variables, an error term for each specific comparison is calculated. The
same error term is used for all comparisons when the variable is
between-subjects.

ANOVA with Between- and Within-Subject Variables (1 of 3)

It is common for designs to have a combination of between- and


within-subject variables. In this section designs with one between-
and one within-subject variable are discussed.

Consider an experiment in which four 8-year-old and four 12-year-old
subjects are given five trials on a motor learning task. There are two
variables: age and trials. Age is a between-subject variable since each
subject is in either one age group or the other. Trials is a within-subject
variable since each subject performs on all five trials.
ANOVA with Between- and Within-Subject Variables (2 of 3)

Sources of Variation
The sources of variation are: age, trials, the Age x Trials interaction,
and two error terms. One error term is used to test the effect of age
whereas a second error term is used to test the effects of trials and the
Age x Trials interaction.

Degrees of Freedom
The degrees of freedom for age is equal to the number of ages minus
one. That is: 2 - 1 = 1. The degrees of freedom for the error term for age
is equal to the total number of subjects minus the number of groups: 8 -
2 = 6. The degrees of freedom for trials is equal to the number of trials -
1:
5 - 1 = 4. The degrees of freedom for the Age x Trials interaction is
equal to the product of the degrees of freedom for age (1) and the
degrees of freedom for trials (4) = 1 x 4 = 4. Finally, the degrees of
freedom for the second error term is equal to the product of the degrees
of freedom of the first error term (6) and the degrees of freedom for
trials (4): 6 x 4 = 24.

ANOVA with Between- and Within-Subject Variables (3 of 3)


Next chapter: Prediction

The formulas for the degrees of freedom for each effect are summarized
below:

df Age = k - 1 = 2 - 1 = 1
df error(Age) = N - k = 8 - 2 = 6
df Trials = t - 1 = 5 - 1 = 4
df Age x Trials = df Age x df Trials = 1 x 4 = 4
df error(Trials and Age x Trials) = df error(Age) x df Trials = 6 x 4 = 24
df total = Nt - 1 = 8 x 5 - 1 = 39

k is the number of levels of the between-subject variable; t is the number
of trials; N is the total number of subjects.

Sums of Squares
Computational formulas for the sums of squares will not be given
since it is assumed that complex analyses will not be done by
hand.
Mean Squares
As always, a mean square is computed by dividing the sum of squares
by the degrees of freedom.

F Ratios
The F ratio for each effect is the mean square for the test divided by the
mean square error for the effect.
Chapter 14

1. Why are within-subjects designs usually more powerful than
between-subjects designs?

2. What source of variation is found in an ANOVA summary table for
a within-subjects design that is not in an ANOVA summary table for
a between-subjects design? What happens to this source of variation in
a between-subjects design?

3. Compute an ANOVA for the following data. Test all differences
between means using the Tukey test at the .01 level. The three scores
per subject are their scores on three trials of a memory task.

Trial 1   Trial 2   Trial 3
4         6         7
3         7         7
2         8         5
1         4         7
4         6         9

4. Test the linear and quadratic components of trend for the data in
problem 3.
5. Give the source and df columns of the ANOVA summary table for
the following experiments:
a. Twenty subjects are each tested on a simple reaction time task and on
a choice reaction time task.
b. Ten male and 10 female subjects are each tested under three levels of
drug dosage: 0 mg, 10 mg, and 20 mg.
c. Twelve subjects are tested on a motor learning task for three trials a
day for two days.
d. An experiment is conducted in which depressed people are either
assigned to a drug therapy group, a behavioral therapy group, or a
control group. Ten subjects are assigned to each group. The level of
depression is measured once a month for four months.

6. The dataset "Stroop" has the scores (times) for males and females on
each of three tasks:

1. Subjects read the names of colors printed in black ink.


2. Subjects name the colors of rectangles.
3. Subjects are given the names of colors written in ink of
different colors. For example, blue might be written in red ink.

a. Do an analysis of variance.
b. Test all pairwise comparisons among the three conditions using the
Tukey test at the .01 level.
7. Subjects threw darts at a target. In one condition, they used their
preferred hand; in the other condition, they used their other hand. All
subjects performed in both conditions (the order of conditions was
counterbalanced). Their scores are shown below. Test the difference
between conditions once using a t-test and once using one-way
analysis of variance. What is the relationship between the t and the F?

Preferred hand   Non-preferred hand
12               7
7                5
12               8
13               10

8. Assume the data in Problem 7 were collected using two different


groups of subjects: One group used their preferred hand and the other
group used their non-preferred hand. Analyze the data and compare the
results to those for problem 7.

9. When should a within-subjects design be avoided?

Selected Answers

Chapter 14

3a.

Source     df   SSQ      MS       F      p
Subjects   4    9.333    2.333
Trials     2    49.733   24.867   13.9   0.002
Error      8    14.267   1.783
Total      14   73.333

3b. Using the Tukey test at the .01 level, Mean 1 is significantly
different from Mean 2 and Mean 3. Means 2 and 3 are not significantly
different from each other.

4.
Linear: t(4) = 7.203, p = .002
Quadratic: t(4) = -1.44, p = .2233.

6a.

Source   df   SSQ         MS        F       p
G        1    83.3247     83.3247   1.994   0.165
ERROR    45   1880.5618   41.7903
C        2                          228.0   0.00
G x C    2

G: Gender
C: Condition

Comparison         Mean i   Mean j   t      Significant
Mean 1 vs Mean 3   16.39    36.5     30.2   yes
Mean 2 vs Mean 3   20.72    36.5     23.7   yes

7. F(1,3) = 29.4, p = 0.0123
t(3) = 5.422, p = 0.0123
F = t²

8. F(1,6) = 4.2, p = 0.0863
t(6) = 2.049, p = 0.0863
Introduction to Prediction (1 of 3)

When two variables are related, it is possible to predict a person's score


on one variable from their score on the second variable with better than
chance accuracy. This section describes how these predictions are made
and what can be learned about the relationship between the variables by
developing a prediction equation. It will be assumed that the
relationship between the two variables is linear. Although there are
methods for making predictions when the relationship is nonlinear,
these methods are beyond the scope of this text.

Given that the relationship is linear, the prediction problem becomes
one of finding the straight line that best fits the data. Since the terms
"regression" and "prediction" are synonymous, this line is called the
regression line.

The mathematical form of the regression line predicting Y from X is:

Y' = bX + A

where X is the variable represented on the abscissa (X-axis), b is the
slope of the line, A is the Y intercept, and Y' consists of the predicted
values of Y for the various values of X.
Introduction to Prediction (2 of 3)

The scatterplot below shows the relationship between the Identical
Blocks Test (a measure of spatial ability) and the Wonderlic Test (a
measure of general intelligence). As you can see, the relationship is
fairly strong: Pearson's correlation is 0.677.

The best fitting straight line is also shown. It has a slope of 0.481 and a
Y intercept of 15.8468. The regression line can be used for predicting.
From the graph, one can estimate that the score on the Wonderlic for
someone scoring 10 on Identical Blocks is about 20. From the formula
for Y', it can be calculated that the predicted score is
0.481 x 10 + 15.86 = 20.67.

Introduction to Prediction (3 of 3)

Next section: Standard error of the estimate

When scores are standardized, the regression slope (b) is equal to
Pearson's r and the Y intercept is 0. This means that the regression
equation for standardized variables is:

Y' = rX.

Besides its use for prediction, the regression line is useful for describing
the relationship between the two variables. In particular, the slope of the
line reveals how much of a change in the
criterion is associated with a change of one unit on the predictor
variable. For this example, a change of one point on the Identical
Blocks is associated with a change of just under 0.5 points on the
Wonderlic.

Standard Error of the Estimate (1 of 3)

The standard error of the estimate is a measure of the accuracy of


predictions made with a regression line. Consider the following
data.
The second column (Y) is predicted by the first column (X). The
slope and Y intercept of the regression line are 3.2716 and 7.1526
respectively. The third column, (Y'), contains the predictions and is
computed according to the formula:

Y' = 3.2716X + 7.1526.

The fourth column (Y-Y') is the error of prediction. It is simply


the difference between what a subject's actual score was (Y) and
what the predicted score is (Y').

The sum of the errors of prediction is zero. The last column, (Y-
Y')², contains the squared errors of prediction.

Standard Error of the Estimate (2 of 3)


The regression line seeks to minimize the sum of the squared errors
of prediction. The square root of the average squared error of
prediction is used as a measure of the accuracy of prediction. This
measure is called the standard error of the estimate and is
designated as σest. The formula for the standard error of the
estimate is:

σest = √[Σ(Y - Y')²/N]

where N is the number of pairs of (X,Y) points. For this example,
the sum of the squared errors of prediction (the numerator) is
70.77 and the number of pairs is 12. The standard error of the
estimate is therefore equal to:

σest = √(70.77/12) = 2.43.

An alternate formula for the standard error of the estimate is:

σest = σy√(1 - ρ²)

where σy is the population standard deviation of Y and ρ is the
population correlation between X and Y. For this example, this
formula yields the same value, 2.43.
Standard Error of the Estimate (3 of 3)

Next section: Partitioning the sums of squares

One typically does not know the population parameters and
therefore has to estimate them from a sample. The symbol sest is used
for the estimate of σest. The relevant formulas are:

sest = √[Σ(Y - Y')²/(N - 2)]

and

sest = Sy√(1 - r²)√[N/(N - 2)]

where r is the sample correlation and Sy is the sample standard deviation
of Y. (Note that Sy has a capital rather than a small "s" so it is computed
with N in the denominator.) The similarity between the standard error of
the estimate and the standard deviation should be noted: The standard
deviation is the square root of the average squared deviation from the
mean; the standard error of the estimate is the square root of the average
squared deviation from the regression line. Both statistics are
measures of unexplained variation.

Partitioning the Sums of Squares

Next section: Confidence intervals and significance tests

The sum of the squared deviations of Y from the mean of Y (YM) is


called the sum of squares total and is referred to as SSY. SSY can be
partitioned into the sum of squares predicted and the sum of squares
error. This is analogous to the partitioning of the sums of squares in the
analysis of variance.

The table below shows an example of this partitioning.


X     Y    Y-YM   (Y-YM)²   Y'     Y'-YM   (Y'-YM)²   Y-Y'   (Y-Y')²
2     5    -2.5   6.25      4.8    -2.7    7.29        0.2   0.04
3     6    -1.5   2.25      6.6    -0.9    0.81       -0.6   0.36
4     9     1.5   2.25      8.4     0.9    0.81        0.6   0.36
5     10    2.5   6.25      10.2    2.7    7.29       -0.2   0.04
Sum   30    0     17.00     30.0    0      16.20       0     0.80

The regression equation is: Y' = 1.8X + 1.2, and YM = 7.5.

Defining SSY, SSY', and SSE as:

SSY = Σ(Y - YM)²
SSY' = Σ(Y' - YM)²
SSE = Σ(Y - Y')²

You can see from the table that SSY = SSY' + SSE which means that
the sum of squares for Y is divided into the sum of squares explained
(predicted) and the sum of squares error. The ratio SSY'/SSY is the
proportion explained and is equal to r². For this example, r = .976,
r² = .976² = .95, and SSY'/SSY = 16.2/17 = .95.
Confidence Intervals and Significance Tests for Correlation and Regression (1 of 5)

Pearson's r

Confidence Interval
The formula for a confidence interval for Pearson's r is given elsewhere.

Significance Test
The significance test for Pearson's r is computed as follows:

t = r√(N - 2)/√(1 - r²)

with N - 2 degrees of freedom, where r is Pearson's correlation and N is
the number of pairs of scores that went into the computation of r. For
example, consider the data shown below:
X   Y
2   5
3   7
4   7
5   9
6   12
9   11
With r = .8614 and N = 6,

t = .8614√(6 - 2)/√(1 - .8614²) = 3.39

with 6 - 2 = 4 degrees of freedom. From a t table, it can be found that
the two-tailed probability value is p = 0.0275.

Confidence Intervals and Significance Tests for Correlation and Regression (2 of 5)

Regression Slope
Confidence Interval
A confidence interval on the slope of the regression line can be
computed based on the standard error of the slope (sb). The formula for
the standard error of b is:

sb = sest/√[Σ(X - XM)²]

where sb is the standard error of the slope, sest is the standard error of
the estimate, and XM is the mean of the predictor variable. The formula
for the standard error of the estimate is:

sest = Sy√(1 - r²)√[N/(N - 2)]

For the example data, for which Sy = 2.432,

sest = (2.432)√(1 - .8614²)√(6/4) = 1.513.

The only remaining term needed is:

Σ(X - XM)² = (2 - 4.833)² + (3 - 4.833)² + ... + (9 - 4.833)² = 30.833.

Therefore,

sb = 1.513/√30.833 = 0.272.
Confidence Intervals and Significance Tests for Correlation and Regression (3 of 5)

The slope itself (b) is:

b = (r)(Sy/Sx) = (0.8614)(2.432/2.267) = 0.924.

Using the general formula for a confidence interval, the confidence
interval on the slope is b ± (t)(sb), where t is based on N - 2 degrees of
freedom. For the example data, there are 4 degrees of freedom. The t
for the 95% confidence interval is 2.78 (t table). The 95% confidence
interval is:

0.924 ± (2.78)(0.272)
0.168 ≤ population slope ≤ 1.681

Significance Test
The significance test for the slope of the regression line is done using
the general formula for testing using the standard error. Specifically,

t = b/sb.

For the example data, with a null hypothesis that the population slope is
0, t = .924/.2725 = 3.39. The degrees of freedom are N - 2 = 4. Notice
that this value of t is exactly the same as obtained for the significance
test of r.

Confidence Intervals and Significance Tests for Correlation and Regression (4 of 5)
This will always be the case: The test of the null hypothesis that the
correlation is zero is the same as the test of the null hypothesis that
the regression slope is zero.

Assumptions

1. Normality of residuals: The errors of prediction (Y-Y') are


normally distributed.

2. Homoscedasticity

3. The relationship between X and Y is linear.

Summary of Computations: Significance test for r

1. Compute r

2. Compute t using the formula:

t = r√(N - 2)/√(1 - r²)

with N - 2 degrees of freedom, where N is the number of pairs of
scores.

Confidence Intervals and Significance Tests for Correlation and Regression (5 of 5)

Next section: Introduction to multiple regression

Confidence interval for b

1. Compute b = (r)(sy/sx), where r is Pearson's correlation, sx is the
standard deviation of X, and sy is the standard deviation of Y.

2. Compute sest = sy√(1 - r²)√[N/(N - 2)].

3. Compute sb = sest/√[Σ(X - XM)²].

4. Compute df = N - 2.

5. Find t for these df using a t table.

6. Lower limit = b - (t)(sb)

7. Upper limit = b + (t)(sb)

8. Lower limit ≤ population slope ≤ upper limit


Introduction to Multiple Regression (1 of 3)

In multiple regression, more than one variable is used to predict the
criterion. For example, a college admissions officer wishing to predict
the future grades of college applicants might use three variables (High
School GPA, SAT, and Quality of letters of recommendation) to
predict college GPA. The applicants with the highest predicted
college GPA would be admitted. The prediction method would be
developed based on students already attending college and then
used on subsequent classes. Predicted scores from multiple
regression are linear combinations of the predictor variables.
Therefore, the general form of a prediction equation from multiple
regression is:

Y' = b1X1 + b2X2 + ... + bkXk + A

where Y' is the predicted score, X1 is the score on the first
predictor variable, X2 is the score on the second, etc. The Y
intercept is A. The regression coefficients (b1, b2, etc.) are
analogous to the slope in simple regression.

Introduction to Multiple Regression (2 of 3)

The multiple correlation coefficient (R) is the Pearson correlation
between the predicted scores and the observed scores (Y' and Y). Just
as r² is the proportion of the sum of squares explained in one-variable
regression, R² is the proportion of the sum of squares explained in
multiple regression. These concepts can be understood more concretely
in terms of an example.

The dataset "Multiple regression example" contains hypothetical
data for the college admissions problem discussed on the previous
page. The regression equation for these data is:

Y' = 0.3764X1 + 0.0012X2 + 0.0227X3 - 0.1533

where Y' is predicted college GPA, X1 is high school GPA, X2 is SAT
(verbal + math), and X3 is a rating of the quality of the letters of
recommendation. The Y' scores from this analysis correlate 0.63 with
College GPA. Therefore, the multiple correlation (R) is .63. As shown
below, the R of 0.63 is higher than the simple correlation between
college GPA and any of the other variables.

Correlations with College GPA
High School GPA      0.55
SAT                  0.52
Quality of letters   0.35
Introduction to Multiple Regression (3 of 3)

Next section: Significance tests in multiple regression

Just as in the case of one-variable regression, the sum of squares
in a multiple regression analysis can be partitioned into the sum of
squares predicted and the sum of squares error.

Sum of squares total: 55.57
Sum of squares predicted: 22.21
Sum of squares error: 33.36

Again, as in one-variable regression, R² is the ratio of sum of squares
predicted to sum of squares total. In this example, R² = 22.21/55.57 =
0.40. Sometimes multiple regression analysis is performed on
standardized variables. When this is done, the regression coefficients
are referred to as beta (ß) weights. The Y intercept (A) is always zero
when standardized variables are used. Therefore, the regression
equation for standardized variables is:

Y' = ß1z1 + ß2z2 + ... + ßkzk

where Y' is the predicted standard score on the criterion, ß1 is the
standardized regression coefficient for the first predictor variable,
ß2 is the coefficient for the second predictor variable, z1 is the
standard score on the first predictor variable, z2 is the standard
score on the second predictor variable, etc.

Significance Tests in Multiple Regression (1 of 2)

The following formula is used to test whether an R² calculated in a
sample is significantly different from zero. (The null hypothesis is
that the population R² is zero.)

F = [R²/k]/[(1 - R²)/(N - k - 1)]

where N is the number of subjects, k is the number of predictor
variables, and R² is the squared multiple correlation coefficient.
The F is based on k and N - k - 1 degrees of freedom. In the example
of predicting college GPA, N = 100, k = 3, and R² = .3997.
Therefore,

F = (.3997/3)/[(1 - .3997)/96] = 21.31

with 3 and 96 degrees of freedom.

In addition to testing R² for significance, it is possible to test the


individual regression coefficients for significance. This is done
by dividing the regression coefficient by the standard error of
the regression coefficient.

Significance Tests in Multiple Regression (2 of 2)

Next section: Shrinkage

Following the general formula for significance testing using the standard
error of a statistic,

t = b/sb

where b is the regression coefficient, sb is the standard error of the
regression coefficient, N is the number of subjects, and k is the number
of predictor variables. The resulting t is on N - k - 1 degrees of freedom.

The formula for sb is beyond the scope of this text. However, all
standard statistical packages compute sb and significance tests for the
regression coefficients. Returning to the problem of predicting college
GPA, the regression coefficients and associated significance tests are
shown below:
          b        sb       t      p
HS GPA    0.3764   0.1143   3.29   0.0010
SAT       0.0012   0.0003   4.10   0.0001
Letters   0.0227   0.0510   0.44   0.6570

As can be seen, the regression coefficients for High School GPA
and SAT are both highly significant. The coefficient for Letters of
recommendation is not significant. This means that there is no
convincing evidence that the quality of the letters adds to the
predictability of college GPA once High School GPA and SAT are
known.

Shrinkage
Next section: Measuring a variable's importance

In multiple regression, the multiple correlation coefficient (R)


tends to overestimate the population value of R. Therefore, R is a
biased estimator of the population parameter. Since R has a
positive bias, the application of the regression equation derived
in one
sample to an independent sample almost always yields a lower
correlation between predicted and observed scores (which is what
R is) in the new sample than in the original sample. This reduction
in R is called shrinkage. The amount of shrinkage is affected by
the sample size and the number of predictor variables: The larger
the sample size and the fewer predictors, the lower the shrinkage.
The following formula can be used to estimate the value of R² in
the population:

estimated population R² = 1 - [(1 - R²)(N - 1)/(N - k - 1)]

Measuring the Importance of Variables in Multiple Regression (1 of 3)

When predictor variables are correlated, as they normally are,


determining the relative importance of the predictor variables is a
very complex process. A detailed discussion of this problem is
beyond the scope of this text. The purpose of this section is to
provide a few cautionary notes about the problem.
First and foremost, correlation does not mean causation. If the
regression coefficient for a predictor variable is 0.75, this means
that, holding everything else constant, a change of one unit on the
predictor variable is associated with a change of 0.75 on the
criterion. It does not necessarily mean that experimentally
manipulating the predictor variable one
unit would result in a change of 0.75 units on the criterion
variable. The predictor variable may be associated with the
criterion variable simply because both are associated with a third
variable, and it is this third variable that is causally related to the
criterion. In
other words, the predictor variable and this unmeasured third
variable are confounded.

Measuring the Importance of Variables in Multiple Regression (2 of 3)

Only when it can be assumed that all variables that are correlated
with any of the predictor variables and the criterion are included in the
analysis can one begin to consider making causal inferences. It is
doubtful that this can ever be assumed validly except in the case of
controlled experiments.

One measure of the importance of a variable in prediction is called


the "usefulness" of the variable. Usefulness is defined as the drop
in the R² that would result if the variable were not included in the
regression analysis. For example, consider the problem of
predicting college GPA. The multiple R² when College GPA is
predicted by High School GPA,
SAT, and Letters of recommendation is 0.3997. In a regression
analysis conducted predicting College GPA from just High
School GPA and SAT, R² = 0.3985. Therefore the usefulness of
Letters of Recommendation is only: 0.3997 - 0.3985 = 0.0012.

On the other hand, leaving out SAT and predicting College GPA
from High School GPA and Letters of Recommendation yields an
R² = 0.3319. Therefore, the usefulness of SAT is 0.3997 - 0.3319
= 0.068.

Measuring the Importance of Variables in Multiple Regression (3 of 3)

Next section: Regression toward the mean

It is important to bear in mind that the usefulness of a variable
refers specifically to the usefulness of the variable when the other
variables are included in the regression equation. For instance,
although the Letters of Recommendation add practically nothing
to the ability to predict College GPA once High School GPA and
SAT are known, they are somewhat predictive of College GPA
when taken alone: The Pearson correlation between Letters of
Recommendation and College GPA is 0.35.

In sum, the usefulness of a variable refers to how much it adds to the
predictability of the criterion over and above the other predictor
variables in the equation. If two predictor variables are highly
correlated, then neither will contribute much to the prediction
above and beyond the other.

Regression Toward the Mean (1 of 6)

A person who scored 750 out of a possible 800 on the quantitative


portion of the SAT takes the SAT again (a different form of the
test is used). Assuming the second test is the same difficulty as the
first and that there was no learning or practice effect, what score
would you expect the person to get on the second test? The
surprising answer is that the person is more likely to score below
750 than above 750; the best guess is that the person would score
about 725. If this surprises you, you are not alone. This
phenomenon, called regression to the mean, is counterintuitive
and confusing to many professionals as well as students.

The conclusion that the expected score on the second test is


below 750 depends on the assumption that scores on the test are,
at least in some small part, due to chance or luck. Assume that
there is a large number, say 1,000 parallel forms of a test and that
(a)
someone takes all 1,000 tests and (b) there are no learning,
practice, or fatigue effects. Differences in the scores on these
tests are therefore due to luck. This luck may be a function of
simple guessing or may be a function of knowing more of the
answers on some tests than on others.

Regression Toward the Mean (2 of 6)

Define the mean of these 1,000 scores as the person's "true" score.
On some tests, the person will have scored below the "true" score;
on these tests the person was less lucky than average. On some
tests the person will have scored above the true score; on these
tests the person was more lucky than average. Consider the ways in
which someone could score 750. There are three possibilities: (1)
their "true" score is 750 and they had exactly average luck, (2)
their "true" score is below 750 and they had better than average
luck,
and (3) their "true" score is above 750 and they had worse than
average luck. Consider which is more likely, possibility #2 or
possibility #3. There are very few people with "true" scores above
750 (roughly 6 in 1,000); there are many more people with true
scores between 700 and 750 (roughly 17 in 1,000). Since there are
so many more people with true scores between 700 and 750 than
between 750 and 800, it is more likely that someone scoring 750
is from the former group and was lucky than from the latter group
and was unlucky.

Regression Toward the Mean (3 of 6)

There are just not many people who can afford to be unlucky and still
score as high as 750. A person scoring 750 was, more likely than not,
luckier than average. Since, by definition, luck does not hold from one
administration of the test to another, a person scoring 750 on one test is
expected to score below 750 on a second test. This does not mean that
they necessarily will score less than 750, just that it is likely. The same
logic can be applied to someone scoring 250. Since there are more
people with "true" scores between 250 and 300 than between 200 and
250, a person scoring 250 is more likely to have a "true" score above
250 and be unlucky than a "true" score below 250 and be lucky. This
means that a person scoring 250 would be expected to score higher on
the second test. For both the person scoring 750 and the person scoring
250, their expected score on the second test is between the score they
received on the first test and the
mean.

This is the phenomenon called "regression toward the mean."


Regression toward the mean occurs any time people are chosen based
on observed scores that are determined in part or entirely by chance. On
any task that contains both luck and skill, people who score above the
mean are likely to have been luckier than people who score below the
mean. Since luck does not
hold from trial to trial, people who score above the mean can be
expected to do worse on a subsequent trial. This counterintuitive
phenomenon can be illustrated concretely by simulation.

In regression with standardized variables, the


regression equation is: Zy' = (r)Zx
where Zy' is the predicted standardized score, Zx is the standardized
score on the predictor, and r is Pearson's correlation. This means that
the predicted standardized score will be closer to the mean of zero
whenever the correlation is not perfect (not -1 or 1).

For example, if the SAT had a mean of 500 and a standard deviation of
100, then a score of 750 would have a standard score equivalent of 2.5
since 750 is two and a half standard deviations above the mean. If the
test-retest correlation were 0.90, then the predicted standard score for
someone with a standard score of 2.5 would be (0.90)(2.5) = 2.25.
Therefore, they would be predicted to be 2.25 standard deviations above
the mean on the retest which is 500 + (2.25)(100)
= 725.

Regression Toward the Mean (4 of 6)


Regression toward the mean will occur if one chooses the lowest-
scoring subjects in an experiment. Since the lowest-scoring
subjects can be expected to have been unlucky and therefore have
scored lower than their "true" scores, they will, on average,
improve on a retest. This can easily mislead the unwary researcher.
What if a researcher chose the first-grade children in a school
system who scored the worst on a reading test, administered a drug
that was supposed to improve reading, and retested the children on
a parallel form of the reading test? Because of regression toward
the mean, the mean reading score on the retest would almost
certainly be higher than the mean score on the first test. The
researcher would be mistaken to claim that the drug was
responsible for the improvement since it would be expected to
occur simply on the basis of regression toward the mean.

Consider an actual study that received considerable media
attention. This study sought to determine whether a drug that
reduces anxiety could raise SAT scores by reducing test anxiety.
A group of students whose SAT scores were surprisingly low
(given their grades) was chosen to be in the experiment.
Regression Toward the Mean (5 of 6)
These students, who presumably scored lower than expected on
the SAT because of test anxiety, were administered the anti-anxiety
drug before taking the SAT for the second time. The
results supported the hypothesis that the drug could improve SAT
scores by lowering anxiety: the SAT scores were higher the
second time than the first time. Since SAT scores normally go up
from the first to the second administration of the test, the
researchers compared the improvement of the students in the
experiment with nationwide data on how much students usually
improve. The students in the experiment improved significantly
more than the average improvement nationwide. The problem with
this
study is that by choosing students who scored lower than
expected on the SAT, the researchers inadvertently chose students
whose scores on the SAT were lower than their "true" scores. The
increase on the retest could just as easily be due to regression
toward the mean as to the effect of the drug. The degree to which
there is regression toward the mean depends on the relative role
of skill and luck in the test.

Regression Toward the Mean (6 of 6)


Next chapter: Chi Square

Consider the following situation in which luck alone determines


performance: A coin is flipped 100 times and each student in a
class asked to predict the outcome of each flip. The student who
did the best in the class was right 62 times out of 100. The test is
now given for a second time. What is the best prediction for the
number of times this student will predict correctly? Since this test
is all luck, the best prediction for any given student, including this
one, is 50 correct guesses. Therefore, the prediction is that the
score of 62 would regress all the way to the mean of 50. To take
the other extreme, if a test were entirely skill and involved no luck
at all, then there would be no regression toward the mean.

Chapter 15

1. What is the equation for a regression line? What does each term in
the line refer to?
2. The procedures described in this chapter assume that the
relationship between the predictor and the criterion is ________.

3. If a regression equation were Y' = 3X + 5, what would be the
predicted score for a person scoring 8 on X?

4. What criterion is used for deciding which regression line fits best?
5. What does the standard error of the estimate measure?

6. How is the size of Pearson's correlation related to the size of


the standard error of the estimate?

7. What "estimate" does the standard error of the estimate refer to?

8. How is the standard error of the estimate related to the standard


deviation?

9. In a regression analysis, the sum of squares for the predicted


scores is 80 and the sum of squares error is 40, what is R²?

10. For the following data, compute r and determine if it is
significantly different from zero. Compute the slope and test if it
differs significantly from zero. Compute the 95% confidence interval
for the slope.

X   Y
2   5
4   6
4   7
5   9
6   12
11. For the dataset "High School Sample," find the regression line
to predict scores on the Wonderlic Test (Item 3) from scores on
the Identical Blocks Test (Item 2). What is the predicted score
for a person who scored 25 on the Identical Blocks Test?

12. What would the regression equation be if the mean of X were 20,
the mean of Y were 50, and the correlation between X and Y were
0?

13. What is sb? How does the magnitude of sest affect sb?

14. R² = .44, N = 50, and k = 12. Give the adjusted R².

15. Using the dataset High School Sample, predict scores on the
Wonderlic Test from the Missing Cartoons Test and the Identical
Blocks Test. Plot the errors of prediction (residuals) as a function of
the predicted scores.

16. In multiple regression, what is the effect on the b coefficients of


multiplying the criterion variable by 10? What is the effect on the
standardized regression coefficients?

Selected Answers

Chapter 15

3. 29

9. 0.667
10a. r = 0.92, t(3) = 4.16, p = 0.025

10b. b = 1.73, t(3) = 4.16, p = 0.025.

0.41 ≤ Population Slope ≤ 3.05

11.
Y' = .481 X + 15.864 where:
Y' is the predicted Wonderlic score and X is the score on the Identical
Blocks Test.
The predicted score of a person with a score of 25 on the Identical
Blocks Test = 27.889.
Testing Differences between p and π (1 of 4)

Another section shows how to use a test based on the normal


distribution to see whether a sample proportion (p) differs significantly
from a population proportion (π) . This section shows how to conduct a
test of the same null hypothesis using a test based on the chi square
distribution.

The two tests always yield identical results. The advantage of the test
based on the chi square distribution is that it can be generalized to more
complex situations. In the other section, an example was given in which
a researcher wished to test whether a sample proportion of 62/100
differed significantly from an hypothesized population value of 0.5. The
test based on z resulted in a z of 2.3 and a probability value of 0.0107.
To compute the significance test using chi square, the following table is
formed:

Succeeded   62
            (50)
Failed      38
            (50)

The number of people falling in a specified category is listed as the first


line in each cell (62 succeeded, 38 failed). The second line in each cell
(in parentheses) contains the number expected to succeed if the null
hypothesis is true. Since the null hypothesis is that the proportion that
succeed is 0.5, (0.5)(100) = 50 are expected to succeed and (0.5)(100) =
50 are expected to fail.

Testing Differences between p and π (2 of 4)

The formula for chi square is:

χ² = Σ[(|O - E| - 0.5)²/E]

where χ² is the symbol for the chi square, E is an expected cell
frequency, and O is an observed cell frequency. The Σ symbol is
summation notation and means to sum up the quantity over both cells.
Therefore, the formula says that for each cell you

1. take the absolute value of the difference between the expected
cell frequency and the observed cell frequency,

2. subtract 0.5 (the correction for continuity),

3. square the result,

4. divide by the expected frequency,

5. and finally, sum up the values across the cells.

For the present data,

χ² = (|62 - 50| - 0.5)²/50 + (|38 - 50| - 0.5)²/50 = 2.645 + 2.645 = 5.29.
Testing Differences between p and π (3 of 4)

The degrees of freedom for chi square is equal to the number of


categories minus one. For this section in which there are always just
two categories (success and failure for the present example), the
degrees of freedom is always one. A chi square table can be used to
find that the two-tailed probability value for a chi square of 5.29 with
one degree of freedom is 0.0214.

At the beginning of this section it was stated that the chi square test for
proportions was equivalent to the one based on the normal distribution.
It turns out that chi square will always equal z². For the present
example, the value of z was 2.3 and the value of chi square was 5.29.
Note that 2.3² = 5.29. The probability values for a z of 2.3 and a chi
square of 5.29 are identical (p = 0.0214).

Reporting Results
These results could be reported as follows:
The proportion of subjects choosing the original formulation (0.62) was significantly greater than
0.50, χ²(1, N = 100) = 5.29, p = 0.021. Apparently at least some
people are able to distinguish between the original formulation and
the new formulation.
The symbol "(1, N = 100)" means that the chi square is based on 1 df
and that the total number of subjects is 100.

Testing Differences between p and π (4 of 4)

Next section: More than two categories

Assumptions

1. Observations are sampled randomly and independently.

2. For the chi square test to be accurate, π cannot be too close to 0 or to 1 and N cannot be too small. A conservative rule of thumb is that Nπ as well as N(1-π) should both be at least 10.
More than Two Categories (1 of 3)

A skeptical view of psychotherapy can be summed up as follows: one third better, one third worse, one third the same. In other words, one third of the patients improve as a result of psychotherapy, one third get worse, and one third stay the same.

Consider a hypothetical study of 100 patients who underwent psychotherapy. Assume that 45 were classified as having improved, 25 as having gotten worse, and 30 as having stayed the same. Is this outcome significantly different from that expected under the one third better, one third worse, one third the same view? The test is conducted by constructing a table similar to the one used to test the difference between p and π.

          Frequency
Better    45 (33.33)
Worse     25 (33.33)
Same      30 (33.33)
Total    100
The first column contains the three categories into which patients could be classified. The top line in each cell in the second column contains the number of patients in the category; the second line contains the number expected under the null hypothesis. These "expected" frequencies are computed by multiplying the total number of subjects (100) by the proportion expected to be in the category (0.33333).
More than Two Categories (2 of 3)

          Frequency
Better    45 (33.333)
Worse     25 (33.333)
Same      30 (33.333)
Total    100
The chi square statistic is then computed using the formula:

χ² = Σ (O - E)² / E

The degrees of freedom are equal to the number of categories minus one. For this example,

χ² = (45 - 33.333)²/33.333 + (25 - 33.333)²/33.333 + (30 - 33.333)²/33.333 = 6.50

Since there are three categories, the degrees of freedom equal 3 - 1 = 2. From a chi square table, it can be determined that the probability value is 0.039. Therefore, the null hypothesis of one third better, one third worse, one third the same can be rejected.
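As a rough illustration (not part of the original text), the same goodness-of-fit test can be run with SciPy's chisquare function, which takes the observed and expected frequencies. No correction for continuity is applied because there are more than two categories:

    from scipy.stats import chisquare

    observed = [45, 25, 30]
    expected = [100 / 3] * 3            # one third expected in each category
    statistic, p_value = chisquare(observed, expected)
    print(statistic, p_value)           # 6.50, 0.039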

More than Two Categories (3 of 3)

Next section: Introduction to test of independence

Reporting Results
The results of this experiment could be described as follows:
The proportions of subjects falling into the three outcome categories
(better, worse, the same)
were 0.45, 0.25, 0.30 respectively. These proportions differed
significantly from the expected
0.333, 0.333, 0.333 proportions, χ²(2, N = 100) = 6.50, p = 0.039.
As shown, the Chi Square statistic is reported with its degrees of freedom
and sample size.

Summary of Computations

1. Make a table of expected and observed frequencies.

2. Compute Chi Square using the formula: χ² = Σ (O - E)² / E

3. Degrees of freedom = number of categories minus one.

The correction for continuity is not used when there are more than two
categories. Some authors claim it should be used whenever an expected
cell frequency is below 5. Research in statistics has shown that this
practice is not advisable.

Assumptions
1. Observations are sampled randomly and independently.
2. The formula for chi square yields a statistic that is only approximately distributed as Chi Square. For the Chi Square approximation to be sufficiently accurate, the total number of subjects should be at least 20.
Introduction to the Chi Square Test of Independence (1 of 2)

Contingency tables are used to examine the relationship between subjects' scores on two qualitative or categorical variables. For example, consider the hypothetical experiment on the effectiveness of early childhood intervention programs described in another section. In the experimental group, 73 of 85 students graduated from high school. In the control group, only 43 of 82 students graduated. These data are depicted in the contingency table shown below.

               Graduated   Failed to Graduate   Total
Experimental       73               12             85
Control            43               39             82
Total             116               51            167
The cell entries are cell frequencies. The top left cell with a "73" in it means that 73 subjects in the experimental condition went on to graduate from high school; 12 subjects in the experimental condition did not. The table shows that subjects in the experimental condition were more likely to graduate than were subjects in the control condition. Thus, the column a subject is in (graduated or failed to graduate) is contingent upon (depends on) the row the subject is in (experimental or control condition).

Introduction to the Chi Square Test of Independence (2 of 2)

Next section: Calculations

If the columns are not contingent on the rows, then the rows and
column frequencies are independent. The test of whether the
columns are contingent on the rows is called the chi square test of
independence. The null hypothesis is that there is no relationship
between row and column frequencies.

Computing the Chi Square Test of Independence (1 of 5)

The first step in computing the chi square test of independence is to compute the expected frequency for each cell under the assumption that the null hypothesis is true. To calculate the expected frequency of the first cell in the example (experimental condition, graduated), first calculate the proportion of subjects that graduated without considering the condition they were in. The table below shows that of the 167 subjects in the experiment, 116 graduated.

               Graduated   Failed to Graduate   Total
Experimental       73               12             85
Control            43               39             82
Total             116               51            167

Therefore, 116/167 graduated. If the null hypothesis were true, the expected frequency for the first cell would equal the product of the number of people in the experimental condition (85) and the proportion of people graduating (116/167). This is equal to (85)(116)/167 = 59.042. Therefore, the expected frequency for this cell is 59.042. The general formula for expected cell frequencies is:

Eij = (Ti)(Tj)/N

where Eij is the expected frequency for the cell in the ith row and the jth column, Ti is the total number of subjects in the ith row, Tj is the total number of subjects in the jth column, and N is the total number of subjects in the whole table.

Computing the Chi Square Test of Independence (2 of 5)

The calculations are shown below:

E11 = (85)(116)/167 = 59.042
E12 = (85)(51)/167 = 25.958
E21 = (82)(116)/167 = 56.958
E22 = (82)(51)/167 = 25.042

Once the expected cell frequencies are computed, it is convenient to enter them into the original table as shown below. The expected frequencies are in parentheses.

               Graduated       Failed to Graduate   Total
Experimental   73 (59.042)        12 (25.958)         85
Control        43 (56.958)        39 (25.042)         82
Total         116                 51                 167

The formula for the chi square test for independence is:

χ² = Σ (O - E)² / E

For this example,

χ² = (73 - 59.042)²/59.042 + (12 - 25.958)²/25.958 + (43 - 56.958)²/56.958 + (39 - 25.042)²/25.042 = 22.01
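For readers who want to reproduce these calculations, the following sketch (not part of the original text) assumes SciPy is available and uses its chi2_contingency function, which returns the statistic, the probability value, the degrees of freedom, and the expected frequencies. Setting correction=False matches the convention, noted later, that the correction for continuity is not used in the test of independence:

    from scipy.stats import chi2_contingency

    table = [[73, 12],
             [43, 39]]
    statistic, p_value, df, expected = chi2_contingency(table, correction=False)
    print(statistic, df, p_value)   # 22.01 with 1 df, p < 0.0001
    print(expected)                 # e.g. 59.042 for the first cell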

Computing the Chi Square Test of Independence (3 of 5)
The degrees of freedom are equal to (R-1)(C-1) where R is the
number of rows and C is the number of columns. In this example,
R = 2 and C = 2, so df = (2-1)(2-1) = 1. A chi square table can be
used to determine that for df = 1, a chi square of 22.01 has a
probability value less than 0.0001.

In a table with two rows and two columns, the chi square test of
independence is equivalent to a test of the difference between
two sample proportions. In this example, the question is whether
the proportion graduating from high school differs as a function
of condition. Whenever the degrees of freedom equal one (as they
do when R = 2 and C =
2), chi square is equal to z². Note that the test of the difference
between proportions for these data results in a z of 4.69 which,
when squared, equals 22.01.


Computing the Chi Square Test of Independence (4 of 5)

The same procedures are used for analyses with more than two rows
and/or more than two columns. For example, consider the following
hypothetical experiment: A drug that decreases anxiety is given to one
group of subjects before they attempted to play a game of chess against
a computer. The control group was given a placebo. The contingency
table is shown below.

Condition       Win          Lose         Draw       Total
Drug         12 (14.29)   18 (14.29)   10 (11.43)      40
Placebo      13 (10.71)    7 (10.71)   10 (8.57)       30
Total        25           25           20              70

The expected frequencies are shown in parentheses. As in the previous example, each expected frequency is computed by multiplying the row total by the column total and dividing by the total number of subjects. For example, the expected frequency for the "Drug-Lose" condition is the product of the row total (40) and the column total (25) divided by the total number of subjects (70): (40)(25)/70 = 14.29.

The chi square is calculated using the formula:

χ² = Σ (O - E)² / E = 3.52

The df are (R-1)(C-1) = (2-1)(3-1) = 2. A chi square table shows that the probability of a chi square of 3.52 with 2 degrees of freedom is 0.172. Therefore, the effect of the drug is not significant.

Computing the Chi Square Test of Independence (5 of 5)

Next section: Assumptions

Summary of Computations

1. Create a table of cell frequencies.

2. Compute row and column totals.

3. Compute expected cell frequencies using the formula:

Eij = (Ti)(Tj)/N

where Eij is the expected frequency for the cell in the ith row and the jth column, Ti is the total number of subjects in the ith row, Tj is the total number of subjects in the jth column, and N is the total number of subjects in the whole table.

4. Compute Chi Square using the formula: χ² = Σ (O - E)² / E

5. Compute the degrees of freedom using the formula: df = (R-1)(C-1) where R is the number of rows and C is the number of columns.

6. Use a chi square table to look up the probability value.

Note that the correction for continuity is not used in the chi square
test of independence.
Assumptions of the Chi Square Test of Independence (1 of 2)
A key assumption of the chi square test of independence is that each
subject contributes data to only one cell. Therefore the sum of all cell
frequencies in the table must be the same as the number of subjects in
the experiment.

Consider an experiment in which each of 12 subjects threw a dart at a target once using his or her preferred hand and once using his or her non-preferred hand. The data are shown below:

                Hit   Missed
Preferred        9       3
Non-preferred    4       8

It would not be valid to use the chi square test of independence on these data since each subject contributed data to two cells: one cell based on their performance with their preferred hand and one cell based on their performance with their non-preferred hand. The total of the cell frequencies in the table is 24 but the total number of subjects is only 12.

Assumptions of the Chi Square Test of Independence (2 of 2)

Next section: Reporting results


The formula for chi square yields a statistic that is only approximately a
chi square distribution. In order for the approximation to be adequate,
the total number of subjects should be at least 20.

Some authors claim that the correction for continuity should be used
whenever an expected cell frequency is below 5. Research in statistics
has shown that this practice is not advisable. For example, see:
Bradley, Bradley, McGrath, & Cutcomb (1979).

Reporting Results from the Chi Square Test of Independence

Next chapter: Distribution-free tests

The results for the example on the effects of the early childhood
intervention example could be reported as follows:
The proportion of students from the early-intervention group who
graduated from high school was 0.86 whereas the proportion from the
control group who graduated was only 0.52. The difference in
proportions is significant, χ²(1, N = 167) = 22.01, p < 0.001.
The experiment on the effect of the anti-anxiety drug on chess playing
could be reported as:
The number of subjects winning, losing, and drawing as a function of
drug condition is shown in Figure 1. Although subjects receiving the
drug performed slightly worse than subjects not receiving the drug, the
difference was not significant, χ²(2, N = 70) = 3.52, p = 0.17.

Chapter 16

1. When should the correction for continuity be used?

2. A die is suspected of being biased. It is rolled 24 times with the following result.

Outcome Frequency
1 8
2 4
3 1
4 8
5 3
6 0
Conduct a significance test to see if the die is biased.

3. When can you use either a Chi Square or a z test and reach the same conclusion?

4. Ten subjects are each given two flavors of ice cream to taste and say whether they like them. One of the 10 liked the first flavor and seven out of 10 liked the second flavor. Can the Chi Square test be used to test whether this difference in proportions (.10 versus .70) is significant? Why or why not?

5. A recent experiment investigated the relationship between smoking and urinary incontinence. Of 322 subjects in the study who were incontinent, 113 were smokers, 51 were former smokers, and 158 had never smoked. Of 284 control subjects who were not incontinent, 68 were smokers, 23 were former smokers, and 193 had never smoked. Do a significance test to see if there is a relationship between smoking and incontinence.

Selected Answers

Chapter 16

2. Chi Square = 14.5, df = 5, p = 0.0127

5. Chi Square = 22.98, df = 2, p < .0001


Overview of Distribution-Free Tests (1 of 2)

Most inferential statistics assume normal distributions. Although these statistical tests work well even if the assumption of normality is violated, extreme deviations from normality can distort the results. Usually the effect of violating the assumption of normality is to decrease the Type I error rate. Although this may sound like a good thing, it often is accompanied by a substantial decrease in power. Moreover, in some situations the Type I error rate is increased.

There is a collection of tests called distribution-free tests that do not make any assumptions about the distribution from which the numbers were sampled. Thus the name, "distribution-free." The main advantage of distribution-free tests is that they provide more power than traditional tests when the samples are from highly-skewed distributions. Since alternative means of dealing with skewed distributions such as taking logarithms or square roots of the data are available, distribution-free tests have not achieved a high level of popularity. Distribution-free tests are nonetheless a worthwhile alternative. Moreover, computer analysis has made possible new and promising variations of these tests.

Overview of Distribution-Free Tests (2 of 2)
Next section: Randomization tests

Distribution-free tests are sometimes referred to as nonparametric tests because, strictly speaking, they do not test hypotheses about population parameters. Nonetheless, they do provide a basis for making inferences about populations and are therefore classified as inferential statistics.

Some researchers take the position that measurement on at least an interval scale is necessary for techniques such as the analysis of variance that result in inferences about means. These researchers use distribution-free tests as alternatives to the usual parametric tests. As discussed in another section, most researchers do not accept the argument that tests such as the analysis of variance can only be done with interval-level data.

Randomization Tests (1 of 6)

Most distribution-free tests are based on the principle of randomization. The best way to understand the principle of randomization is in terms of a specific example of a randomization test. Assume that four numbers are sampled from each of two populations. The numbers are shown below.

        Group 1   Group 2
           11        2
           14        9
            7        0
            8        5
Mean       10        4

The first step is to compute the difference between means. For these
data, the difference is six. The second step is to compute the number of
ways these eight numbers could be divided into two groups of four. The
general formula is:

W = N!/(n1! n2!)

where W is the number of ways, N is the total number of numbers (8 in this case), n1 is the size of the first group (4 in this case), and n2 is the size of the second group (4 in this case). Therefore, W = 8!/(4! 4!) = 70.

Randomization Tests (2 of 6)

This means that there are 70 ways in which eight numbers can be
divided into two groups of four.
The third step is to determine how many of these W ways of dividing
the data result in differences between the means as large or larger than
the difference obtained in the actual data. An examination of the data
shows that there are only two ways the eight numbers can be divided
into two groups of four with a larger difference between means than the
difference of six found in the actual arrangement. These two ways are
shown below.

Group 1: 14, 11, 9, 8    Group 2: 7, 5, 2, 0    (difference between means = 7)
Group 1: 14, 11, 9, 7    Group 2: 8, 5, 2, 0    (difference between means = 6.5)

Thus, including the original data, there are three ways in which
the eight numbers can be arranged so that the difference between
means is six or more.

Randomization Tests (3 of 6)
To compute the probability value for a one-tailed test of the difference between groups, divide this value of three by W, the number of ways of dividing the data into two groups of four. The probability value is therefore: p = 3/70 = 0.0429.

For a two-tailed test, the three cases in which Group 2 had a mean that was greater than Group 1 by six or more would be considered. This would make the two-tailed probability: p = 6/70 = 0.0857.

In summary, a randomization test proceeds from the data actually collected. It compares a computed statistic (the difference between means in this example) with the value of that statistic for other arrangements of the data. The probability value is simply the proportion of arrangements leading to a value of the statistic as large or larger than the value obtained from the actual data.
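Because the example is so small, the full randomization test can be carried out by brute force. The following is a minimal sketch in Python (not part of the original text) that enumerates all 70 divisions of the eight numbers:

    from itertools import combinations

    scores = [11, 14, 7, 8, 2, 9, 0, 5]   # Group 1 followed by Group 2
    observed = 10 - 4                      # observed difference between means

    as_large = 0
    for positions in combinations(range(8), 4):   # 70 ways to form Group 1
        g1 = [scores[i] for i in positions]
        g2 = [scores[i] for i in range(8) if i not in positions]
        if sum(g1) / 4 - sum(g2) / 4 >= observed:
            as_large += 1

    print(as_large, as_large / 70)         # 3 arrangements, p = 0.0429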

Randomization Tests (4 of 6)

Consider one more example of a randomization test. Suppose a researcher wished to know whether or not there were a relationship between two variables: X and Y. The most common way to test the relationship would be using Pearson's r.

X     Y
4     3
8     5
2     1
10    7
9     8

The test based on the principle of randomization would proceed as follows. First, Pearson's correlation would be computed for the data as they stand. The value is: r = 0.9556. Next, the number of ways the X and Y numbers could be paired is calculated. (Note that X's do not become Y's and Y's do not become X's. It is the pairings that change.) The formula for the number of ways that the numbers can be paired is simply:

W = N!

where N is the number of pairs of numbers. For this example, N = 5 and W = 120. This means there are 120 ways the numbers can be paired. Of these pairings, only one would produce a higher correlation than 0.9556. It is shown on the next page. Therefore, there are two ways of arranging the data that result in correlations of 0.9556 or higher.

Randomization Tests (5 of 6)

X     Y
4     3
8     5
2     1
10    8
9     7

r = .981

This makes the one-tailed probability equal to: p = 2/120 = 0.017. Naturally, the two-tailed probability is 0.034.
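A sketch of this randomization test for correlation (not part of the original text; statistics.correlation requires Python 3.10 or later):

    from itertools import permutations
    from statistics import correlation

    x = [4, 8, 2, 10, 9]
    y = [3, 5, 1, 7, 8]
    observed = correlation(x, y)                    # 0.9556

    # Count pairings with a correlation as large or larger than observed
    count = sum(1 for p in permutations(y)
                if correlation(x, list(p)) >= observed)
    print(count, count / 120)                       # 2, p = 0.017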

Randomization tests have only one major drawback: they are impractical to compute with moderate to large sample sizes. For example, the number of ways 45 scores can be equally divided among three groups is 45!/(15! 15! 15!), which is an astronomical number. Even high-speed computers are incapable of the necessary calculations.
Randomization Tests (6 of 6)

Next section: Rank randomization tests

Fortunately, there is an alternative computational method called "resampling." Instead of calculating all possible ways the data could be arranged, one can take numerous random samples from the set of possible arrangements. For the problem of dividing 45 numbers into three groups of 15 each, one could randomly select 10,000 of the possible arrangements and determine the proportion of these for which the effect size in the sample is exceeded. This proportion is then an accurate estimate of the probability level. This procedure is currently not in frequent use due, in part, to the general unavailability of programs to do this type of computation.
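A sketch of the resampling idea (not part of the original text), applied to the small two-group example from earlier in this section since the 45-score problem is too large to enumerate:

    import random

    scores = [11, 14, 7, 8, 2, 9, 0, 5]
    observed = 6.0                 # observed difference between means
    samples = 10000

    hits = 0
    for _ in range(samples):
        random.shuffle(scores)     # a random arrangement of the eight numbers
        if sum(scores[:4]) / 4 - sum(scores[4:]) / 4 >= observed:
            hits += 1

    print(hits / samples)          # approximately 3/70 = 0.043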

Rank Randomization Tests (1 of 4)


The major problem with randomization tests is that they are very
difficult to compute. Rank randomization tests are performed by first
converting the scores to ranks and then computing a randomization test.
The primary advantage of rank randomization tests is that there are
tables that can be used to determine significance. The disadvantage is
that some information is lost when the numbers are converted to ranks.
Therefore, rank randomization tests are generally less powerful than
randomization tests based on the original numbers. Consider the
following data:

Group 1   Group 2
   11         2
   14         9
    7         0
    8         5

that were used as an example in the section on randomization tests. A rank randomization test on these data begins by converting the numbers to ranks. The converted data are:

Group 1   Group 2
    7         2
    8         6
    4         1
    5         3

Rank Randomization Tests (2 of 4)

The sum of the ranks of the first group is 24 and the sum of the ranks of
the second group is 12. The difference is therefore 12. The problem is
to figure out the number of ways that data can be arranged so that the
difference between summed ranks is 12 or more. The bottom of this
page shows the four arrangements of the eight ranks that lead to a
difference of 12 or more. Since there are W = 8!/(4! 4!) = 70 ways of
arranging the data, the one-tailed probability is: 4/70 = 0.057.
Group 1: 8, 7, 6, 5    Group 2: 4, 3, 2, 1
Group 1: 8, 7, 6, 4    Group 2: 5, 3, 2, 1
Group 1: 8, 7, 5, 4    Group 2: 6, 3, 2, 1
Group 1: 8, 7, 6, 3    Group 2: 5, 4, 2, 1

Rank Randomization Tests (3 of 4)


The two-tailed probability is therefore: 8/70 = 0.114.

Notice that this probability value is slightly higher than the 0.0857
obtained with the randomization test for these same data. This rank
randomization test for differences between two groups has several
names. It is most often called the Mann-Whitney U test or the Wilcoxon
Rank Sum test. In practice, tables are available to look up the probability
value based on the sum of the ranks of the group with the lower mean
rank. For these data, the sum of the ranks is: 2 + 6 + 1 + 3 = 12. A table would show that for a one-tailed test to be significant at the 0.05 level, the sum of the ranks must be ≤ 11. Since 12 > 11, the test is not significant at this level. This agrees with the calculated probability value of 0.057. The purpose of this section is to present the concepts rather than the details of the computations. Therefore, if you wish to actually perform this test, you should find a textbook with appropriate tables.
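Assuming SciPy is available, the same result can also be obtained from its implementation of the Mann-Whitney U test (a sketch, not part of the original text):

    from scipy.stats import mannwhitneyu

    group1 = [11, 14, 7, 8]
    group2 = [2, 9, 0, 5]
    statistic, p_value = mannwhitneyu(group1, group2,
                                      alternative='greater', method='exact')
    print(p_value)                 # 4/70 = 0.057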

Rank Randomization Tests (4 of 4)

Next section: Sign test


Rank randomization tests are also available for correlation and for testing differences between more than two groups (analogous to one-way ANOVA). The rank randomization test for correlation is called Spearman's rho. The rank randomization test for differences among groups is called the Kruskal-Wallis test. In summary, rank randomization tests are based on the principle of randomization. They are much easier to compute than randomization tests. However, they are less powerful than randomization tests and the more typically performed parametric tests such as analysis of variance. For this reason, they are not as widely used.

Sign Test (1 of 2)

Suppose a researcher were interested in whether a drug helped people fall asleep. Subjects took the drug one night and a placebo another night, with the order counterbalanced. The number of minutes it took each of 8 subjects to fall asleep is shown below.

Drug   Placebo
 12      21
  9      16
 11       8
 21      36
 17      28
 22      20
 18      29
 11      22

If the drug has no effect, then the population proportion (π) of people falling asleep faster in the drug condition is 0.50. Therefore, the null hypothesis is that π = 0.50. Of the 8 subjects, 7 fell asleep faster with the drug.

Sign Test (2 of 2)

Next chapter: Measuring effect size

Using the formula for the binomial distribution, the probability of obtaining 7 or more "successes" with N = 8 and π = 0.50 can be calculated to be 0.0352 as follows:

P(x) = [N!/(x!(N - x)!)] π^x (1 - π)^(N-x)

P(7) = [8!/(7! 1!)](0.5)^7 (0.5)^1 = 0.03125
P(8) = [8!/(8! 0!)](0.5)^8 (0.5)^0 = 0.00391

Therefore, P(7) + P(8) = 0.03125 + 0.00391 = 0.03516. The one-tailed probability value is therefore 0.03516; the two-tailed probability value is (2)(0.03516) = 0.0703.
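A minimal sketch (not part of the original text) of the same binomial calculation:

    from math import comb

    n, pi = 8, 0.5
    # Probability of 7 or more successes under the null hypothesis
    p_one_tailed = sum(comb(n, k) * pi ** k * (1 - pi) ** (n - k)
                       for k in (7, 8))
    print(p_one_tailed, 2 * p_one_tailed)   # 0.0352 and 0.0703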

Technically speaking, the sign test is a test of whether the median is a particular value. In this example, the test is whether the median difference between conditions is zero. The sign test is generally less powerful than a t test of the difference between dependent means. For the example data, the t test would be: t(7) = -3.21. The two-tailed probability value for this t is 0.015.

Chapter 17

1. Explain the principle of randomization.

2. What is the reason that rank randomization tests rather than pure randomization tests are conducted?
3. Which is more powerful: randomization or rank randomization tests?
Why?

4. Conduct a rank randomization test, a randomization test, and a t test for the data shown below. Make all tests two-tailed.

Group 1 Group 2
9 5
7 2
12 8
16 4
14 1

Selected Answers

Chapter 17

4.

Rank randomization test: p = 0.063
Randomization test: p = 0.040
t test: t(8) = -2.675, p = 0.028
Introduction to Measuring Effect Size (1 of 2)

As explained in another section, a probability value is not a measure of the size of an effect; it is a measure of how unlikely the data or more extreme data are under the assumption that the null hypothesis is true. If the dependent variable is measured in units that are meaningful in their own right, then the measurement of the size of the effect can be as simple as the difference between means. For example, if an experimenter were interested in how long it takes people to figure out how to use various designs for an automated teller machine (ATM), then measurement of the size of the effect would not pose much of a problem. If, on average, people can figure out how to use one design for an ATM in 100 seconds and an alternate design in 145 seconds, then it is clear that the effect size is 45 seconds. Or, if one wished, one could say that the latter design takes 45% longer than the former for users to figure out. Unfortunately, most of the dependent variables used in behavioral research are more difficult to interpret.

Introduction to Measuring Effect Size (2 of 2)


Next section: Variance explained in ANOVA

Assuming job satisfaction is measured on a seven-point scale, how big an effect is one that is 1.5 units on the scale? Is it big enough to be
important either theoretically or practically? There is no clear answer to
this question. The most common approach is to interpret the size of the
mean difference relative to the differences occurring within each group.
This involves defining the size of the effect in terms of the degree of
overlap between the groups. Thus, an experiment finding a difference of
1.5 with very little overlap between groups would be said to have
revealed a larger effect than one finding the same difference of 1.5 but
with greater overlap between groups.

Measures of the size of an effect based on the degree of overlap between groups usually involve calculating the proportion of the
variance that can be explained by differences between groups. At one
extreme is the case in which the only differences are differences
between groups with all scores within a group being equal. At the other
extreme is the case in which the group means are equal. In the former
case, 100% of the variance is explained by differences between groups;
in the latter case, 0% of the variance is explained.
Variance Explained in ANOVA (1 of 2)

The simplest way to measure the proportion of variance explained in an analysis of variance is to divide the sum of squares between groups by the sum of squares total. This ratio represents the proportion of variance explained. It is called eta squared or η².
If there is no error variance, then the sum of squares between equals the
sum of squares total and η² = 1. If all the group means are equal, then
sum of squares between equals zero and η² = 0. In the former case,
100% of the variance is explained by the treatment; in the latter case,
0% of the variance is explained.

Researchers in statistics have determined that η² has a positive bias. An alternative and unbiased statistic called omega squared (ω²) has been developed. Its formula is:

ω² = [SSbetween - (k - 1)(MSerror)] / (SStotal + MSerror)

where k is the number of groups.

Variance Explained in ANOVA (2 of 2)

Next section: Variance explained in correlation

The summary table computed in the section on partitioning the sums of squares in ANOVA is reproduced below.

Source    df    Ssq      Ms      F      p
Groups     2     8.000   4.000   3.60   0.071
Error      9    10.000   1.111
Total     11    18.000   1.636

The values of η² and ω² can be computed as follows:

η² = 8.000/18.000 = 0.444
ω² = [8.000 - (2)(1.111)] / (18.000 + 1.111) = 0.302

As you can see, η² is larger than ω².
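A minimal sketch (not part of the original text) of these two computations from the summary table:

    # Values taken from the ANOVA summary table above; k is the number of groups
    ss_between, ss_total, ms_error, k = 8.0, 18.0, 1.111, 3

    eta_squared = ss_between / ss_total
    omega_squared = (ss_between - (k - 1) * ms_error) / (ss_total + ms_error)
    print(eta_squared, omega_squared)    # 0.444 and 0.302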

Variance Explained in Correlation (1 of 1)

Next section: Variance explained in contingency tables


The proportion of variance in variable Y explained by variable X is defined as the ratio of the explained variation in Y to the total variation. The section on partitioning the variance in prediction shows that:

r² = SSY'/SSY

where r is the correlation between the two variables, SSY' is the sum of squares explained, and SSY is the total sum of squares Y. Therefore, r² is a measure of the proportion of variance explained.

In multiple regression, the best measure of the proportion of variance explained by the predictor variables is the adjusted R² (corrected for shrinkage).

Variance Explained in 2 x 2 Contingency Tables (1 of 3)

The phi (φ) coefficient is used as an index of the strength of the relationship between variables in a 2 x 2 contingency table. Phi can be computed as the correlation between the two dichotomous variables.

Consider a hypothetical study investigating whether a desensitization program for people with snake phobias works better than a control treatment consisting of reading about how snakes help keep the ecological balance intact. Following the treatment, each subject is asked to pick up a harmless snake and place it in a bag. The results were as follows: Of the 10 subjects in the experimental condition, 8 were able to perform the task. Of the 10 subjects in the control condition, only 1 would perform the task. The section on the chi square test of independence shows how to test to see if the difference is significant. To compute the correlation (φ) between group (experimental versus control) and outcome (success or failure), give each subject two scores. The first score is a "1" if the subject was in the experimental group or a "0" if they were in the control group. The second score is a "1" if they succeeded at the task or a "0" if they did not.

Variance Explained in 2 x 2 Contingency Tables (2 of 3)

The data could therefore be coded as follows:


Condition   Outcome
    1           1
    1           1
    1           1
    1           1
    1           1
    1           1
    1           1
    1           1
    1           0
    1           0
    0           1
    0           0
    0           0
    0           0
    0           0
    0           0
    0           0
    0           0
    0           0
    0           0

Of the 10 subjects in Condition 1 (Experimental Condition), 8 have scores of 1 (success) and 2 have scores of 0 (failure). Of the 10 subjects in Condition 2 (Control Condition), 1 has a score of 1 (success) and 9 have scores of 0 (failure). Pearson's correlation between these two variables is 0.7035. Therefore, φ = 0.7035. Since the proportion of variance explained in correlation is the correlation coefficient squared, the proportion of variance explained is: φ² = 0.7035² = 0.495. Therefore, about 50% of the variance in outcome is explained by the treatment effect.
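A sketch (not part of the original text) that computes φ as the Pearson correlation between the two coded variables (statistics.correlation requires Python 3.10 or later):

    from statistics import correlation

    condition = [1] * 10 + [0] * 10               # experimental = 1, control = 0
    outcome = [1] * 8 + [0] * 2 + [1] + [0] * 9   # success = 1, failure = 0

    phi = correlation(condition, outcome)
    print(phi, phi ** 2)                          # 0.7035 and 0.495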

Variance Explained in 2 x 2 Contingency Tables (3 of 3)

Next section: Cautions about variance explained

The following formula can also be used to compute φ:

φ = √(χ²/N)

where χ² is computed using the formula from the chi square test of independence. For this example, χ² = 9.889 and therefore φ = √(9.889/20) = 0.7035, as shown before.

Cautions in Interpreting Variance Explained (1 of 3)

Measures of the proportion of variance explained can be easily misinterpreted. One mistake a researcher can make is to ignore the fact that the levels of an independent variable chosen for inclusion in the study can be a prime determinant of the proportion of variance explained. For example, consider an investigation seeking to determine the proportion of the variance in reading ability that can be accounted for by age. If the experimenter chooses ages 5 years, 10 years, and 15 years for inclusion in the study, then a large proportion of the variance will be explained by age. On the other hand, if ages 10 years, 10.2 years, and 10.4 years were included, only a very small proportion of the variance would be explained by age. This is not a fault with the measure of effect size per se; however, it represents one way in which the measure can be misinterpreted.

A second problem with measures of the proportion of variance explained is that there is no consensus on how big an effect has to be in order to be considered meaningful. In some cases, effects that appear trivial when measured in terms of proportion of variance explained can, in reality, be very important.

Cautions in Interpreting Variance Explained (2 of 3)

Rosenthal (1990) showed that although aspirin cut the risk of a heart
attack approximately in half, it explained only 0.0011 of the variance
(0.11%). Similarly, Abelson (1985) found that batting average
accounted for a trivial proportion of the variance in baseball game
outcomes. Therefore, measures of proportion of variance explained
do not always communicate the importance of an effect accurately.
A third problem, closely related to the second, is that the importance of
an effect cannot be determined apart from the context in which the
importance is to be assessed: a variable that may not be a major
determinant of another variable may still be important in certain
contexts. For example, Martell, Lane, and Willis (1996) demonstrated how a sex bias that accounts for only 1% of the variance in promotions could cause large differences between the proportion of males and females that reach the higher levels of management in a company. A further problem with measures of variance explained is that they measure the size of an effect relative to the variability of subjects rather than by some absolute standard.

Cautions in Interpreting Variance Explained (3 of 3)

Next section: Alternative approaches

In most cases, there is no absolute standard to go by, and there is nothing much one can do other than measure effect size relative to subject variability. However, it is important not to overlook instances in which effect size can be measured more directly. For example, Tullis (1981) found that improving the format in which the results of a computerized test of a phone line are displayed reduces the average response time by about two seconds. Although this effect does not explain much variance, it is important since computerized testing is done literally millions of times a year. The cumulative time savings of using the better display is gigantic.

Finally, the sampling distribution of measures of effect size is complex, and confidence intervals on these measures are rarely computed (although procedures are available; see Steiger, 2004). It is difficult to interpret an estimate of an effect size without knowing the range of plausible effect sizes. When small sample sizes are used, it is possible for the estimate of effect size to be very different from the actual effect size. Therefore, it is easy to make the mistake of thinking a small effect is actually a large effect.

Alternative Approaches to Interpreting Effect Size (1 of 2)

Cohen's d'
The number of standard deviations separating two group means can be used as a measure of effect size. This measure, called d', can be computed as follows:

d' = (M1 - M2)/σ

where M1 is the mean for Group 1, M2 is the mean for Group 2, and σ is an estimate of the standard deviation taken from the analysis of variance summary table. For example, if the mean of Group 1 were 20, the mean of Group 2 were 10, and the standard deviation estimate σ were 5, then d' would be 2. The means are two standard deviations apart. Although there are no generally accepted criteria for determining whether a given d' is large enough to be important, Cohen (1962) made the reasonable recommendation that a d' of 0.20 is a small effect, a d' of 0.50 is a medium sized effect, and a d' of 0.80 is a large effect.
Alternative Approaches to Interpreting Effect Size (2 of 2)

Keep in mind, however, that, just as with any measure of effect size, it
is not possible to specify how big the effect must be in order to be
important without considering the context in which the effect is going
to be used. Just as with the measures of variance explained, there are
cases for which an effect that has very small d' can have a substantial
practical effect.

Graphical Methods
It is obviously difficult to summarize the size of an association with a single number. A better approach is to use an appropriate graphical display to indicate the size of an effect. When differences in the central tendency of two or more groups are at issue, side-by-side box plots are an excellent way to portray the group differences. The differences between measures of central tendency and the amount of overlap among the groups are readily apparent. Scatterplots provide a useful way to portray the size of a relationship between two quantitative variables. Too often researchers rely solely on the magnitude of the correlation coefficient, without viewing a scatterplot, in their assessment of the size of a relationship.

Other Resources
There are many issues involving effect sizes that are beyond the scope of this book. The reader is referred to the book by Rosenthal, Rosnow, and Rubin for more information.
End of book!!!

Chapter 18

1. Why can't the probability value be used as a measure of effect size?

2. The following data are from an experiment comparing three kinds of cat food. Cats were randomly assigned to one of three groups. Group 1 was fed "Good and Fishy," Group 2 was fed "Cat's Delight," and Group 3 was fed "Lovely Livers." The time (in seconds) taken to eat a six-ounce can of food was recorded. Compute η² and ω².

Good and Fishy   Cat's Delight   Lovely Livers
      1                1               32
      2                8               31
      1                2               18

3. If the correlation between X and Y is 0.30, what proportion of the Y variance is explained by X?

4. The effectiveness of a drug is evaluated by comparing the proportion of patients showing improvement with the drug to the proportion showing improvement with a placebo. Of 50 patients taking the drug, 15 showed improvement; of 50 taking the placebo, 6 showed improvement. Compute φ².

5. How can the choice of levels of an independent variable affect η²?

Selected Answers

Chapter 18

2. η² = 0.724, ω² = 0.644
3. 0.09

4. 0.049
What is a Normal Distribution?

Next section: Standard normal

Normal distributions are a family of distributions that have the same general shape. They are symmetric, with scores more concentrated in the middle than in the tails. Normal distributions are sometimes described as bell shaped.

Examples of normal distributions are shown below. Notice that they differ in how spread out they are. The area under each curve is the same. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
Standard Normal Distribution (1 of 2)

The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. Normal distributions can be transformed to standard normal distributions by the formula:

z = (X - μ)/σ

where X is a score from the original normal distribution, μ is the mean of the original normal distribution, and σ is the standard deviation of the original normal distribution. The standard normal distribution is sometimes called the z distribution. A z score always reflects the number of standard deviations above or below the mean a particular score is. For instance, if a person scored a 70 on a test with a mean of 50 and a standard deviation of 10, then they scored 2 standard deviations above the mean. Converting the test scores to z scores, an X of 70 would be:

z = (70 - 50)/10 = 2

So, a z score of 2 means the original score was 2 standard deviations above the mean. Note that the z distribution will only be a normal distribution if the original distribution (X) is normal.

Standard normal distribution (2 of 2)

Next section: Why is it important?

Applying the formula

z = (X - μ)/σ

will always produce a transformed distribution with a mean of zero and a standard deviation of one. However, the shape of the distribution will not be affected by the transformation. If X is not normal then the transformed distribution will not be normal either. One important use of the standard normal distribution is for converting between scores from a normal distribution and percentile ranks.

Areas under portions of the standard normal distribution are shown below. About 0.68 (0.34 + 0.34) of the distribution is between -1 and 1, while about 0.96 of the distribution is between -2 and 2.

What's So Important about the Normal Distribution?


Next section: Converting to percentiles and back

One reason the normal distribution is important is that many psychological and educational variables are distributed approximately normally. Measures of reading ability, introversion, job satisfaction, and memory are among the many psychological variables approximately normally distributed. Although the distributions are only approximately normal, they are usually quite close.

A second reason the normal distribution is so important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. Almost all statistical tests discussed in this text assume normal distributions. Fortunately, these tests work very well even if the distribution is only approximately normally distributed. Some tests work well even with very wide deviations from normality.

Finally, if the mean and standard deviation of a normal distribution are known, it is easy to convert back and forth from raw scores to percentiles.

Converting to Percentiles and Back (1 of 4)


If the mean and standard deviation of a normal distribution are known,
it is relatively easy to figure out the percentile rank of a person
obtaining a specific score. To be more concrete, assume a test in
Introductory Psychology is normally distributed with a mean of 80 and a
standard deviation of 5. What is the percentile rank of a person who
received a score of 70 on the test?

Mathematical statisticians have developed ways of determining the proportion of a distribution that is below a given number of standard deviations from the mean. They have shown that only 2.3% of the population will be less than or equal to a score two standard deviations below the mean. In terms of the Introductory Psychology test example, this means that a person scoring 70 would be in the 2.3rd percentile. This graph shows the distribution of scores on the test. The shaded area is 2.3% of the total area. The proportion of the area below 70 is equal to the proportion of the scores below 70.

Converting to Percentiles and Back (2 of 4)

What about a person scoring 75 on the test? The proportion of the area below 75 is the same as the proportion of scores below 75.

A score of 75 is one standard deviation below the mean because the mean is 80 and the standard deviation is 5. Mathematical statisticians have determined that 15.9% of the scores in a normal distribution are lower than a score one standard deviation below the mean. Therefore, the proportion of the scores below 75 is 0.159 and a person scoring 75 would have a percentile rank score of 15.9.

The table below gives the proportion of the scores below various values of z; z is computed with the formula:

z = (X - μ)/σ

where z is the number of standard deviations (σ) X is above the mean (μ). Naturally, it is more convenient to use a z calculator.
Converting to Percentiles and Back (3 of 4)

When z is negative it means that X is below the mean. Thus, a z of -2 means that X is -2 standard deviations above the mean, which is the same thing as being 2 standard deviations below the mean.

To take another example, what is the percentile rank of a person receiving a score of 90 on the test? The graph shows that most people scored below 90. Since 90 is 2 standard deviations above the mean,

z = (90 - 80)/5 = 2

it can be determined from the table that a z score of 2 is equivalent to the 97.7th percentile. The proportion of people scoring below 90 is thus .977.

Converting to Percentiles and Back (4 of 4)


Next section: Area under portions of the curve

What score on the Introductory Psychology test would it have taken to be in the 75th percentile? (Remember, the test has a mean of 80 and a standard deviation of 5.) The answer is computed by reversing the steps in the previous problems.

First, determine how many standard deviations above the mean one would have to be to be in the 75th percentile. This can be found by using a z table and finding the z associated with 0.75. The value of z is 0.674. Thus, one must be 0.674 standard deviations above the mean to be in the 75th percentile.

Second, since the standard deviation is 5, one must be (5)(0.674) = 3.37 points above the mean. Since the mean is 80, a score of 80 + 3.37 = 83.37 is necessary. Rounding off, a score of 83 is needed to be in the 75th percentile. Since

z = (X - μ)/σ

a little algebra demonstrates that X = μ + zσ. For the present example, X = 80 + (0.674)(5) = 83.37, as just shown in the figure.
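Assuming SciPy is available, both directions of the conversion can be checked directly (a sketch, not part of the original text):

    from scipy.stats import norm

    # Percentile rank of a score of 70 on a test with mean 80 and sd 5
    print(norm.cdf(70, loc=80, scale=5))     # 0.023

    # Score needed to be in the 75th percentile
    print(norm.ppf(0.75, loc=80, scale=5))   # 83.37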
Area Under a Portion of the Normal Curve (1 of 4)

If a test is normally distributed with a mean of 60 and a standard deviation of 10, what proportion of the scores is above 85? This problem is very similar to figuring out the percentile rank of a person scoring 85. The first step is to figure out the proportion of scores less than or equal to 85. This is done by figuring out how many standard deviations above the mean 85 is. Since 85 is 85 - 60 = 25 points above the mean and since the standard deviation is 10, a score of 85 is 25/10 = 2.5 standard deviations above the mean. Or, in terms of the formula,

z = (85 - 60)/10 = 2.5

A z table can be used to calculate that 0.9938 of the scores are less than or equal to a score 2.5 standard deviations above the mean. It follows that only 1 - 0.9938 = 0.0062 of the scores are above a score 2.5 standard deviations above the mean. Therefore, only 0.0062 of the scores are above 85.

Area Under a Portion of the Normal Curve (2 of 4)


Suppose you wanted to know the proportion of students receiving scores between 70 and 80. The approach is to figure out the proportion of students scoring below 80 and the proportion below 70. The difference between the two proportions is the proportion scoring between 70 and 80. First, the calculation of the proportion below 80. Since 80 is 20 points above the mean and the standard deviation is 10, 80 is 2 standard deviations above the mean.

z = (80 - 60)/10 = 2

A z table can be used to determine that .9772 of the scores are below a score 2 standard deviations above the mean.
Area Under a Portion of the Normal Curve (3 of 4)

To calculate the proportion below 70,

z = (70 - 60)/10 = 1

A z table can be used to determine that the proportion of scores less than 1 standard deviation above the mean is 0.8413. So, if 0.1587 of the scores are above 70 and 0.0228 are above 80, then the proportion between 70 and 80 can be computed by subtracting 0.0228 from 0.1587:

0.1587 - 0.0228 = 0.1359


Area Under a Portion of the Normal Curve (4 of 4)

Next chapter: Sampling distributions

Assume a test is normally distributed with a mean of 100 and a standard deviation of 15. What proportion of the scores would be between 85 and 105? The solution to this problem is similar to the solution to the last one. The first step is to calculate the proportion of scores below 85. Next, calculate the proportion of scores below 105. Finally, subtract the first result from the second to find the proportion scoring between 85 and 105.

Begin by calculating the proportion below 85. You can calculate that 85 is one standard deviation below the mean:

z = (85 - 100)/15 = -1

Using a z table with the value of -1 for z, the area below -1 (or 85 in terms of the raw scores) is 0.1587.

Doing the same thing for 105,

z = (105 - 100)/15 = 0.333

A z table shows that the proportion scoring below 0.333 (105 in raw scores) is .6306. The difference is .6306 - .1587 = .4719. So, .472 of the scores are between 85 and 105.
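Assuming SciPy is available, the calculation can be verified as follows (a sketch, not part of the original text):

    from scipy.stats import norm

    # Proportion of scores between 85 and 105 when the mean is 100 and sd is 15
    proportion = (norm.cdf(105, loc=100, scale=15)
                  - norm.cdf(85, loc=100, scale=15))
    print(proportion)    # 0.472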
Chapter 5

1. If scores are normally distributed with a mean of 30 and a standard deviation of 5, what percent of the scores is: (a) greater than 30? (b) greater than 37? (c) between 28 and 34?

2. (a) What are the mean and standard deviation of the standard
normal distribution? (b) What would be the mean and standard
deviation of a distribution created by multiplying the standard
normal distribution by 10 and then adding 50?

3. The normal distribution is defined by two parameters. What are they?

4. (a) What proportion of a normal distribution is within one standard deviation of the mean? (b) What proportion is more than 1.8 standard deviations from the mean? (c) What proportion is between 1 and 1.5 standard deviations above the mean?
5. A test is normally distributed with a mean of 40 and a standard
deviation of 7. (a) What score would be needed to be in the 85th
percentile? (b) What score would be needed to be in the 22nd
percentile?

6. Assume a normal distribution with a mean of 90 and a standard deviation of 7. What limits would include the middle 65% of the cases?

7. For this problem, use the scores in the identical blocks test
(second in the dataset "High_School_Sample"). Compute the mean
and standard deviation. Then, compute what the 25th and 75th
percentile would be if the distribution were normal. Compare the
estimates to the actual 25th and 75th percentiles.

Selected Answers

1a. 50%
1b. 8.08%
1c. 44.35%
4a. 0.6826
4c. 0.0919

5a. 47.25
5b. 34.59

6. 83.46 and 96.54

7. Estimates: 13.141 and 23.872. Actual: 12.5 and 26.


Ruling out Chance as an Explanation (1 of 5)

When an independent variable appears to have an effect, it is very important to be able to state with confidence that the effect was really due to the variable and not just due to chance. For instance, consider a hypothetical experiment on a new antidepressant drug. Ten people suffering from depression were sampled and treated with the new drug (the experimental group); an additional 10 people were sampled from the same population and were treated only with a placebo (the control group). After 12 weeks, the level of depression in all subjects was measured and it was found that the mean level of depression (on a 10-point scale with higher numbers indicating more depression) was 4 for the experimental group and 6 for the control group. The most basic question that can be asked here is: "How can one be sure that the drug treatment rather than chance occurrences were responsible for the difference between the groups?" It could be that, by chance, the people who were randomly assigned to the treatment group were initially somewhat less depressed than those randomly assigned to the control group.

Ruling out Chance as an Explanation (2 of 5)


Or, it could be that, by chance, more pleasant things happened to the experimental group than to the control group over the 12 weeks of the experiment. The way this problem is approached statistically is to calculate how often one would get a difference as large or larger than the one obtained in the experiment if the experimental treatment really had no effect (and thus the differences were due to chance). If a difference as large or larger than the one obtained in the experiment could be expected to occur by chance relatively frequently, say, one out of every four times, then chance would remain a viable explanation of the effect. If such a difference would only occur by chance very rarely, then chance would not be a viable explanation.

Returning to the study on the effectiveness of the antidepressant, recall that the experimental group differed from the control group by 6 - 4 = 2 units on the depression scale. For the sake of argument, assume that if there were no true difference between means, the sampling distribution of the difference between means would be as shown on the next page.

Ruling out Chance as an Explanation (3 of 5)


The mean of the sampling distribution is 0, since the assumption is that there are no differences between population means and therefore the average difference between sample means would be zero. The graph of the sampling distribution shows that a difference between sample means of two or more is not an improbable occurrence even when there is no difference between population means. Twenty percent of the time the Drug group would be two or more points lower, and 20% of the time the Control group would be two or more points higher. How should the results of the experiment be interpreted in light of this analysis? Science requires a conservative approach to decision making: a conclusion is accepted by the scientific community only if the data supporting it are strong enough to convince a skeptic. No skeptic would be convinced that the antidepressant drug rather than chance caused the difference.

Ruling out Chance as an Explanation (4 of 5)

The skeptic would argue: "Since the differences could easily have been produced by chance factors, why should I believe in the effectiveness of this drug? Perhaps the drug is effective, perhaps not. I am not convinced." Now, for the sake of argument, assume that if there were no true difference between means, the sampling distribution of the difference between means would be as shown below instead of the one shown on the previous page.

Under this assumption, a difference of two or more on the depression scale would occur extremely infrequently, about once in a thousand times. The skeptic would be hard pressed to argue that the antidepressant had no effect. The skeptic's thinking might be as follows on the next page:

Ruling out Chance as an Explanation (5 of 5)

Next section: The null hypothesis


"Of course, things that happen extremely infrequently occasionally do
happen. This might be one of those times. However, being skeptical
does not mean being totally unwilling to accept new findings. The data
are quite strong so I am forced to conclude that the drug is more
effective than the placebo."
Naturally, there is a large degree of subjectivity in deciding how
unlikely results must be (given that only chance is operating) before it
should be concluded that chance is not responsible for the effect.
Traditionally, researchers have used either the 5% or the 1%
significance levels. When using the 5% significance level, one
concludes that the experimental treatment has a real effect if chance
alone would produce a difference as large or larger than the one
obtained only 5% of the time or less. A good portion of this text is
devoted to methods used for figuring out the probability of outcomes
when chance alone is operating.

Null Hypothesis (1 of 4)

The null hypothesis is an hypothesis about a population parameter. The purpose of hypothesis testing is to test the viability of the null hypothesis in the light of experimental data. Depending on the data, the null hypothesis either will or will not be rejected as a viable possibility. Consider a researcher interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis is that µ1 - µ2 = 0, where µ1 is the mean time to respond after consuming alcohol and µ2 is the mean time to respond otherwise. Thus, the null hypothesis concerns the parameter µ1 - µ2 and the null hypothesis is that the parameter equals zero.

The null hypothesis is often the reverse of what the experimenter actually believes; it is put forward to allow the data to contradict it. In
the experiment on the effect of alcohol, the experimenter probably
expects alcohol to have a harmful effect. If the experimental data show
a sufficiently large effect of alcohol, then the null hypothesis that
alcohol has no effect can be rejected.

Null Hypothesis (2 of 4)

It should be stressed that researchers very frequently put forward a null hypothesis in the hope that they can discredit it. For a second example,
consider an educational researcher who designed a new way to teach a
particular concept in science, and wanted to test experimentally whether
this new method worked better than the existing method. The researcher
would design an experiment comparing the two methods. Since the null
hypothesis would be that there is no difference between the two
methods, the researcher would be hoping to reject the null hypothesis
and conclude that the method he or she developed is the better of the
two.
The symbol H0 is used to indicate the null hypothesis. For the
example just given, the null hypothesis would be designated by the
following symbols:

H0: μ1 - μ2 = 0

or by

H0: μ1 = μ2.

The null hypothesis is typically a hypothesis of no difference, as in this example where it is the hypothesis of no difference between
population means. That is why the word "null" in "null hypothesis" is
used -- it is the hypothesis of no difference.

Null hypothesis (3 of 4)

Despite the "null" in "null hypothesis," there are occasions when the
parameter is not hypothesized to be 0. For instance, it is possible for the
null hypothesis to be that the difference between population means is a
particular value. Or, the null hypothesis could be that the mean SAT
score in some population is 600. The null hypothesis would then be
stated as H0: μ = 600. Although the null hypotheses discussed so far
have all involved the testing of hypotheses about one or more
population means, null hypotheses can involve any parameter. An
experiment investigating the correlation between job satisfaction and
performance on the job would test the null hypothesis that the
population correlation (ρ) is 0. Symbolically, H0: ρ = 0.
Some possible null hypotheses are given below:
H0: μ = 0
H0: μ = 10
H0: μ1 - μ2 = 0
H0: π = .5
H0: π1 - π2 = 0
H0: μ1 = μ2 = μ3
H0: ρ1 - ρ2 = 0

Null hypothesis (4 of 4)

Next section: Steps in hypothesis testing


When a one-tailed test is conducted, the null hypothesis includes the
direction of the effect. A one-tailed test of the differences between
means might test the null hypothesis that μ1 - μ2 ≥ 0. If M1 - M2 were much less than 0, then the null hypothesis would be rejected in favor of the alternative hypothesis: μ1 - μ2 < 0.

See also: significance test and significance level

Steps in Hypothesis Testing (1 of 5)

The basic logic of hypothesis testing has been presented somewhat informally in the sections on "Ruling out chance as an explanation" and
the "Null hypothesis." In this section the logic will be presented in more
detail and more formally.

1. The first step in hypothesis testing is to specify the null hypothesis (H0) and the alternative hypothesis (H1). If the research concerns whether one method of presenting pictorial stimuli leads to better recognition than another, the null hypothesis would most
likely be that there is no difference between methods (H0: μ1 - μ2
= 0). The alternative hypothesis would be H1: μ1 ≠ μ2. If the
research concerned the correlation between grades and SAT
scores, the null hypothesis would most likely be that there is no
correlation (H0: ρ = 0). The alternative hypothesis would be H1: ρ
≠ 0.
2. The next step is to select a significance level. Typically the 0.05
or the 0.01 level is used.

3. The third step is to calculate a statistic analogous to the parameter specified by the null hypothesis. If the null hypothesis were defined by the parameter μ1 - μ2, then the statistic M1 - M2
would be computed.

Steps in Hypothesis Testing (2 of 5)

4. The fourth step is to calculate the probability value (often called the p value). The p value is the probability of obtaining a statistic
as different or more different from the parameter specified in the
null hypothesis as the statistic computed from the data. The
calculations are made assuming that the null hypothesis is true.
(A concrete example is given later in the chapter, in the section "Tests of μ, σ known.")

5. The probability value computed in Step 4 is compared with the significance level chosen in Step 2. If the probability is less than or equal to the significance level, then the null hypothesis is rejected; if the probability is greater than the significance level, then the null hypothesis is not rejected. When the null hypothesis is rejected, the outcome is said to be "statistically significant"; when the null hypothesis is not rejected, the outcome is said to be "not statistically significant."

6. If the outcome is statistically significant, then the null hypothesis is rejected in favor of the alternative hypothesis. If the rejected null hypothesis were that μ1 - μ2 = 0, then the alternative hypothesis would be that μ1 ≠ μ2. If M1 were greater than M2, then the researcher would naturally conclude that μ1 > μ2.
Steps in Hypothesis Testing (3 of 5)

7. The final step is to describe the result and the statistical conclusion in an understandable way. Be sure to present the descriptive statistics as well as whether the effect was significant or not. For example, a significant difference between a group that received a drug and a control group might be described as follows:

Subjects in the drug group scored significantly higher (M = 23) than did subjects in the control group (M = 17), t(18) = 2.4, p = 0.027.

The statement that "t(18) = 2.4" has to do with how the probability value (p) was calculated. A small minority of researchers might object to two aspects of this wording. First, some believe that the significance level rather than the probability value should be reported. The argument for reporting the probability value is presented in another section. Second, since the alternative hypothesis was stated as µ1 ≠ µ2, some might argue that it can only be concluded that the population means differ and not that the population mean for the drug group is higher than the population mean for the control group.
Steps in Hypothesis Testing (4 of 5)

This argument is misguided. Intuitively, there are strong reasons for inferring that the direction of the difference in the population is the same as the difference in the sample. There is also a more formal argument. A non-significant effect might be described as follows:

Although subjects in the drug group scored higher (M = 23) than did subjects in the control group (M = 20), the difference between means was not significant, t(18) = 1.4, p = 0.179.

It would not have been correct to say that there was no difference between the performance of the two groups. There was a difference. It is just that the difference was not large enough to rule out chance as an explanation of the difference. It would also have been incorrect to imply that there is no difference in the population. Be sure not to accept the null hypothesis.
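The probability values reported in these two examples can be verified directly from the t statistics. A minimal check in Python with scipy (it assumes nothing beyond the t values and the 18 degrees of freedom reported above):

    from scipy import stats

    def two_tailed_p(t_value, df):
        # Probability of a t statistic this extreme or more extreme,
        # in either direction, when the null hypothesis is true.
        return 2 * stats.t.sf(abs(t_value), df)

    print(round(two_tailed_p(2.4, 18), 3))  # 0.027: statistically significant
    print(round(two_tailed_p(1.4, 18), 3))  # 0.179: not statistically significant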

Steps in Hypothesis Testing (5 of 5)

Next section: Why the null hypothesis is not accepted


At this point you may wish to see a concrete example of using these
seven steps in hypothesis testing. If so, jump to the section on "Tests
of μ, σ known."


Why the Null Hypothesis is Not Accepted (1 of 5)

A null hypothesis is not accepted just because it is not rejected. Data not
sufficient to show convincingly that a difference between means is not
zero do not prove that the difference is zero. Such data may even
suggest that the null hypothesis is false but not be strong enough to
make a convincing case that the null hypothesis is false. For example, if
the probability value were 0.15, then one would not be ready to present
one's case that the null hypothesis is false to the (properly) skeptical
scientific community. More convincing data would be needed to do that.
However, there would be no basis to conclude that the null hypothesis is
true. It may or may not be true; there just is not strong enough
to reject it. Not even in cases where there is no evidence that the null
hypothesis is false is it valid to conclude the null hypothesis is true. If
the null hypothesis is that µ1 - µ2 is zero then the hypothesis is that the
difference is exactly zero. No experiment can distinguish between the
case of no difference between means and an extremely small difference
between means. If data are consistent with the null hypothesis, they are
also consistent with other similar hypotheses.

Why the Null Hypothesis is Not Accepted (2 of 5)


Thus, if the data do not provide a basis for rejecting the null hypothesis that µ1 - µ2 = 0, then they almost certainly will not provide a basis for rejecting the hypothesis that µ1 - µ2 = 0.001. The data are consistent with both hypotheses. When the null
hypothesis is not rejected then it is legitimate to conclude that the
data are consistent with the null hypothesis. It is not legitimate to
conclude that the data support the acceptance of the null hypothesis
since the data are consistent with other hypotheses as well. In some
respects, rejecting the null hypothesis is comparable to a jury
finding a defendant guilty. In both cases, the evidence is convincing
beyond a reasonable doubt. Failing to reject the null hypothesis is
comparable to a finding of not guilty. The defendant is not declared
innocent. There is just not enough evidence to be convincing beyond
a reasonable doubt. In the judicial system, a decision has to be
made and the defendant is set free. In science, no decision has to be
made immediately. More experiments are conducted.
Why the Null Hypothesis is Not Accepted (3 of 5)

One experiment might provide data sufficient to reject the null hypothesis, although no experiment can demonstrate that the null
hypothesis is true. Where does this leave the researcher who wishes to
argue that a variable does not have an effect? If the null hypothesis
cannot be accepted, even in principle, then what type of statistical evidence can be used to support the hypothesis that a variable does not have an effect? The answer lies in relaxing the claim a little and arguing
not that a variable has no effect whatsoever but that it has, at most, a
negligible effect. This can be done by constructing a confidence
interval around the parameter value.
Consider a researcher interested in the possible effectiveness of a new
psychotherapeutic drug. The researcher conducted an experiment
comparing a drug-treatment group to a control group and found no
significant difference between them. Although the experimenter cannot
claim the drug has no effect, he or she can estimate the size of the effect using a confidence interval. If µ1 were the population mean for the drug group and µ2 were the population mean for the control group, then the confidence interval would be on the parameter µ1 - µ2.

Why the Null Hypothesis is Not Accepted (4 of 5)

Assume the experiment measured "well being" on a 50-point scale (with higher scores representing more well being) that has a standard deviation of 10. Further assume the 99% confidence interval computed from the experimental data was:

-0.5 ≤ µ1 - µ2 ≤ 1

This says that one can be confident that the mean "true" drug treatment
effect is somewhere between -0.5 and 1. If it were -0.5 then the drug
would, on average, be slightly detrimental; if it were 1 then the drug
would, on average, be slightly beneficial. But, how much benefit is an
average improvement of 1? Naturally that is a question that involves
characteristics of the measurement scale. But, since 1 is only 0.10 standard deviations, it can be presumed to be a small effect. Two distributions whose means differ by only 0.10 standard deviations overlap almost completely; although one is shifted slightly to the right of the other, the two are nearly indistinguishable.

Why the Null Hypothesis is Not Accepted (5 of 5)

Next section: The precise meaning of the p value

So, the finding that the maximum difference that can be expected (based
on a 99% confidence interval) is itself a very small difference would
allow the experimenter to conclude that the drug is not effective. The
claim would not be that it is totally ineffective, but, at most, its
effectiveness is very limited.
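The effect-size arithmetic behind this conclusion is easy to reproduce. A short sketch in Python (the overlap formula for two equal-variance normal distributions is a standard result, not a computation taken from the text):

    from scipy.stats import norm

    sd = 10.0           # standard deviation of the well-being scale
    upper_limit = 1.0   # upper end of the 99% confidence interval

    d = upper_limit / sd   # largest plausible effect: 0.10 standard deviations
    # Overlap coefficient of two equal-variance normal distributions
    # whose means differ by d standard deviations: 2 * Phi(-d / 2).
    overlap = 2 * norm.cdf(-d / 2)
    print(d, round(overlap, 3))   # 0.1 and ~0.96: nearly complete overlap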
See also: pages 2-3 of "Confidence intervals & hypothesis testing"

The Precise Meaning of the Probability Value (1 of 3)


There is often confusion about the precise meaning of the
probability computed in a significance test. As stated in Step 4 of
the steps in hypothesis testing, the null hypothesis (H0) is assumed
to be true. The difference between the statistic computed in the
sample and the parameter specified by H0 is computed and the
probability of obtaining a difference
this large or larger is calculated. This probability value is the
probability of obtaining data as extreme or more extreme than the
current data (assuming H0 is true). It is not the probability of the
null hypothesis itself. Thus, if the probability value is 0.005, this
does not mean that the probability that the null hypothesis is true is
.005. It means that the probability of obtaining data as different or
more different from the null hypothesis as those obtained in the
experiment is 0.005.

The inferential step to conclude that the null hypothesis is false goes as follows: The data (or data more extreme) are very unlikely
given that the null hypothesis is true. This means that: (1) a very
unlikely event occurred or (2) the null hypothesis is false. The
inference usually made is that the null hypothesis is false.

The Precise Meaning of the Probability Value (2 of 3)


To illustrate that the probability is not the probability of the hypothesis,
consider a test of a person who claims to be able to predict whether a
coin will come up heads or tails. One should take a rather skeptical
attitude toward this claim and require strong evidence to believe in its
validity. The null hypothesis is that the person can predict correctly half
the time (H0: π = 0.5). In the test, a coin is flipped 20 times and the
person is correct 11 times. If the person has no special ability (H0 is
true), then the probability of being correct 11 or more times out of 20 is
0.41.
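The 0.41 figure comes straight from the binomial distribution, and a one-line check in Python confirms it:

    from scipy.stats import binom

    # P(11 or more correct out of 20) when each prediction is a 50-50 guess.
    print(round(binom.sf(10, 20, 0.5), 2))   # 0.41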
Would someone who was originally skeptical now believe that there is only a 0.41 chance that the null hypothesis is true? They almost certainly would not, since they probably originally thought H0 had a very high probability of
being true (perhaps as high as 0.9999). There is no logical reason for
them to decrease their belief in the validity of the null hypothesis
since the outcome was perfectly consistent with the null hypothesis.

The Precise Meaning of the Probability Value (3 of 3)

Next section: At what level is H0 really rejected?


The proper interpretation of the test is as follows: A person made a
rather extraordinary claim and should be able to provide strong evidence
in support of the claim if the claim is to be believed. The test provided data
consistent with the null hypothesis that the person has no special ability
since a person with no special ability would be able to predict as well or
better more than 40% of the time. Therefore, there is no compelling
reason to believe the extraordinary claim. However, the test does not
prove the person cannot predict better than chance; it simply fails to
provide
evidence that he or she can. The probability that the null hypothesis is
true is not determined by the statistical analysis conducted as part of
hypothesis testing. Rather, the probability computed is the probability of obtaining data as different or more different from the null hypothesis (given that the null hypothesis is true) as the data actually obtained.

At What Level is H0 Really Rejected? (1 of 3)

According to one view of hypothesis testing, the significance level


should be specified before any statistical calculations are performed.
Then, when the probability (p) is computed from a significance test, it
is compared with the significance level. The null hypothesis is rejected
if p is at or below the significance level; it is not rejected if p is above
the significance level. The
degree to which p ends up being above or below the significance level
does not matter. The null hypothesis either is or is not rejected at the
previously stated significance level. Thus, if an experimenter originally
stated that he or she was using the 0.05 significance level and p was
subsequently calculated to be 0.042, then the person would reject the
null hypothesis at the 0.05 level. If p had been 0.0001 instead of 0.042
then the null hypothesis would still be rejected at the
0.05 level. The experimenter would not have any basis to be more
confident that the null hypothesis was false with a p of 0.0001 than with
a p of 0.042. Similarly, if the p had been 0.051 then the experimenter
would fail to reject the null hypothesis.

At What Level is H0 Really Rejected? (2 of 3)

He or she would have no more basis to doubt the validity of the null
hypothesis than if p had been 0.482. The conclusion would be that the
null hypothesis could not be rejected at the 0.05 level. In short, this
approach is to specify the significance level in advance and use p only
to determine whether or not the null hypothesis can be rejected at the
stated significance level.

Many statisticians and researchers find this approach to hypothesis testing not only too rigid, but basically illogical. Who in their right mind
would not have more confidence that the null hypothesis is false with a
p of 0.0001 than with a p of 0.042? The less likely the obtained results
(or more extreme results) under the null hypothesis, the more confident
one should be that the null hypothesis is false. The null hypothesis
should not be rejected once and for all. The possibility that it was
falsely rejected is always present, and, all else being equal, the lower
the p value, the lower this possibility.
At What Level is H0 Really Rejected? (3 of 3)

Next section: Statistical and practical significance

According to the first view described above, research reports should not contain the values of p, only whether or not the values were significant (at or below the significance level). It is much more reasonable to report the p values. That way each reader can decide just how convinced he or she is that the null hypothesis is false. For more discussion see Wilkinson et al. (1999).

Statistical and Practical Significance (1 of 4)

It is important not to confuse the confidence with which the null hypothesis can be rejected with the size of the effect. To make this point
concrete, consider a researcher assigned the task of determining whether
the video display used by travel agents for booking airline reservations
should be in color or in black and white. Market research had shown that
travel agencies were primarily concerned with the speed with which
reservations can be made. Therefore, the question was whether color
displays allow travel agents to book reservations faster. Market research
had also shown that in order to justify the higher price of color displays,
they must be faster by an average of at least 10 seconds per transaction.
Fifty subjects were tested with color displays and
50 subjects were tested with black and white displays. Subjects were
slightly faster at making reservations on a color display (M = 504.7 seconds) than on a black and white display (M = 508.2 seconds). Although the difference is small, it was statistically significant at the .05 significance level. Box plots of the data are shown below.
Statistical and Practical Significance (2 of 4)

The 95% confidence interval on the difference between means is:

-7.0 ≤ μcolor - μblack & white ≤ -0.1

which means that the experimenter can be confident that the color
display is between 7.0 seconds and 0.1 seconds faster. Clearly, the
difference is not big enough to justify the more expensive color displays.
Even the upper limit of the 95% confidence interval (seven seconds) is
below the minimum needed to justify the cost (10 seconds). Therefore,
the experimenter could feel
confident in his or her recommendation that the black and white displays
should be used. The fact that the color displays were significantly faster
does not mean that they were much faster. It just means that the
experimenter can reject the null hypothesis that there is no difference
between the displays.

Statistical and Practical Significance (3 of 4)

The experimenter presented this conclusion to management, but management did not accept it. The color displays were so dazzling that despite the statistical analysis, they could not believe that color did not
improve performance by at least 10 seconds. The experimenter
decided to do the experiment again, this time using 100 subjects for
each type of display. The results of the second experiment were very
similar to the first. Subjects were slightly faster at making reservations on a color display (M = 504.7 seconds) than on a black and white display (M = 508.1 seconds). This time the difference was significant at the 0.01
level rather than the 0.05 level found in the first experiment. Despite the
fact that the size of the difference between means was no larger, the
difference was "more significant" due to the larger sample size used. If
the population difference is zero, then a sample difference of 3.4 or
larger with a sample size of 100 is less likely than a sample difference
of 3.5 or larger with a sample size of 50.

Statistical and Practical Significance (4 of 4)

Next section: Type I and II errors

The 95% confidence interval on the difference between means is:

-5.8 ≤ μcolor - μblack & white ≤ -0.9

and the 99% interval is:

-6.6 ≤ μcolor - μblack & white ≤ -0.1

Therefore, despite the finding of a "more significant" difference between means, the experimenter can be even more certain that the
color displays are only slightly better than the black and white
displays. The second experiment shows conclusively that the
difference is less than 10 seconds.

This example was used to illustrate the following points: (1) an effect
that is statistically significant is not necessarily large enough to be of
practical significance and (2) the smaller of two effects can be "more
significant" than the larger. Be careful how you interpret findings
reported in the media. If you read that a particular diet lowered cholesterol significantly, this does not necessarily mean that the diet lowered cholesterol enough to be of any health value. It means that the effect on cholesterol in the population is greater than zero.
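The claim that the same small difference becomes "more significant" with more subjects can be illustrated with a quick calculation. In the sketch below (Python with scipy), the within-group standard deviation of 8.75 is an assumed value chosen so that the results fall at roughly the significance levels described above; the text does not report the actual variability:

    import numpy as np
    from scipy import stats

    sd = 8.75   # assumed within-group standard deviation (not from the text)
    for n, diff in [(50, 3.5), (100, 3.4)]:
        se = sd * np.sqrt(2 / n)              # standard error of the difference
        t = diff / se
        p = 2 * stats.t.sf(t, df=2 * n - 2)   # two-tailed probability value
        print(n, round(t, 2), round(p, 4))

    # With 50 subjects per group, p falls just below .05; with 100 per
    # group, a slightly smaller difference gives p below .01.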
Type I and II errors (1 of 2)

There are two kinds of errors that can be made in significance testing:
(1) a true null hypothesis can be incorrectly rejected and (2) a false null
hypothesis can fail to be rejected. The former error is called a Type I
error and the latter error is called a Type II error. These two types of
errors are defined in the table.

                         True State of the Null Hypothesis
Statistical Decision     H0 True          H0 False
Reject H0                Type I error     Correct
Do not reject H0         Correct          Type II error

The probability of a Type I error is designated by the Greek letter alpha (α) and is called the Type I error rate; the probability of a Type II error (the Type II error rate) is designated by the Greek letter beta (β). A Type II error is only an error in the sense that
an opportunity to reject the null hypothesis correctly was lost. It is
not an error in the sense that an incorrect conclusion was drawn
since no conclusion is drawn when the null hypothesis is not
rejected.
Type I and II errors (2 of 2)
Next section: One- and two-tailed tests

A Type I error, on the other hand, is an error in every sense of the word.
A conclusion is drawn that the null hypothesis is false when, in fact, it
is true. Therefore, Type I errors are generally considered more serious
than Type II errors. The probability of a Type I error (α) is called the
significance level and is set by the experimenter. There is a tradeoff
between Type I and Type II errors. The more an experimenter protects
himself or herself against Type I errors by choosing a low α level, the
greater the chance of a Type II error. Requiring very strong evidence to
reject the null hypothesis makes it very unlikely that a true null
hypothesis will be rejected. However, it increases the chance that a false
null hypothesis will not be rejected, thus lowering power. The Type I
error rate is almost always set at .05 or at .01, the latter being more
conservative since it requires stronger evidence to reject the null
hypothesis at the .01 level than at the .05 level.
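That α really is the Type I error rate can be demonstrated by simulation. In the sketch below (Python; the sample size of 20 per group and the use of a two-sample t test are illustrative choices), both samples always come from the same population, so every rejection is a Type I error:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n = 10_000, 20
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)   # the null hypothesis is true:
        b = rng.normal(0, 1, n)   # both samples share one population
        if stats.ttest_ind(a, b).pvalue <= 0.05:
            rejections += 1
    print(rejections / n_sims)    # ~0.05, the chosen significance level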


One- and Two-Tailed Tests (1 of 4)


In the section on "Steps in hypothesis testing" the fourth step
involves calculating the probability that a statistic would differ as
much or more from the parameter specified in the
null hypothesis as does the statistic obtained in the experiment. This
statement implies that a difference in either direction would be
counted. That is, if the null hypothesis were:

H0: μ1 - μ2 = 0

and the value of the statistic M1 - M2 were +5, then the probability of M1 - M2 differing from zero by five or more (in either direction) would be computed. In other words, the probability value would be the probability that either M1 - M2 ≥ 5 or M1 - M2 ≤ -5.

Assume that the figure shown below is the sampling distribution of M1 - M2.

The figure shows that the probability of a value of +5 or more is 0.036 and that the probability of a value of -5 or less is 0.036.
Therefore the probability of a value either greater than or
equal to +5 or less than or equal to -5 is 0.036 + 0.036 = 0.072.
One- and Two-Tailed Tests (2 of 4)
A probability computed considering differences in both directions is called a "two-tailed" probability. The name makes sense since both tails of the sampling distribution are considered.

There are situations in which an experimenter is concerned only with differences in one direction. For example, an experimenter may be concerned with whether or not μ1 - μ2 is greater than zero.
However, if μ1 - μ2 is not greater than zero, the experimenter may
not care whether it equals zero or is less than zero. For instance, if
a new drug treatment is developed, the main issue is whether or
not it is better than a placebo. If the treatment is not better than a
placebo, then it will not be used. It does not really matter whether
or not it is worse than the placebo.

When only one direction is of concern to an experimenter, then a "one-tailed" test can be performed. If an experimenter were only concerned with whether or not μ1 - μ2 is greater than zero, then
the one-tailed test would involve calculating the probability of
obtaining a statistic as great or greater than the one obtained in
the experiment.
One- and Two-Tailed Tests (3 of 4)
In the example, the one-tailed probability would be the probability of obtaining a value of M1 - M2 greater than or equal to five given that the difference between population means is zero. The shaded area in the figure corresponds to values greater than five; the figure shows that the one-tailed probability is 0.036.

It is easier to reject the null hypothesis with a one-tailed than with a two-tailed test as long as the effect is in the specified direction.
Therefore, one-tailed tests have lower Type II error rates and more
power than do two-tailed tests. In this example, the one-tailed
probability (0.036) is below the conventional significance level of 0.05
whereas the two-tailed probability (0.072) is not. Probability values for
one-tailed tests are one half the value for two-tailed tests as long as the
effect is in the specified direction.
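The relationship between the two probabilities in this example is easy to verify. In the sketch below (Python; the standard error of 2.78 is an assumed value chosen to reproduce the 0.036 figure from the sampling distribution described above):

    from scipy.stats import norm

    se = 2.78   # assumed standard error of M1 - M2 under the null hypothesis
    one_tailed = norm.sf(5, loc=0, scale=se)   # P(M1 - M2 >= 5)
    two_tailed = 2 * one_tailed                # counts both tails
    print(round(one_tailed, 3), round(two_tailed, 3))   # 0.036 0.072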

One- and Two-Tailed Tests (4 of 4)

Next section: Confidence intervals and hypothesis testing


One-tailed and two-tailed tests have the same Type I error rate. One-
tailed tests are sometimes used when the experimenter predicts the
direction of the effect in advance. This use of one-tailed tests is
questionable because the experimenter can only reject the null
hypothesis if the effect is
in the predicted direction. If the effect is in the other direction, then the
null hypothesis cannot be rejected no matter how strong the effect is. A
skeptic might question whether the experimenter would really fail to
reject the null hypothesis if the effect were strong enough in the wrong
direction. Frequently the most interesting aspect of an effect is that it
runs counter to expectations. Therefore, an experimenter who committed
himself or herself to ignoring effects in one direction may be forced to
choose between ignoring a potentially important finding and
using the techniques of statistical inference dishonestly. One-tailed tests
are not used frequently. Unless otherwise indicated, a test should be
assumed to be two-tailed.

Confidence Intervals & Hypothesis Testing (1 of 5)

There is an extremely close relationship between confidence intervals and hypothesis testing. When a 95% confidence interval is constructed, all values in the interval are considered plausible values for the
parameter being estimated. Values outside the interval are rejected as
relatively implausible. If the value of the parameter specified by the null
hypothesis is contained in the
95% interval then the null hypothesis cannot be rejected at the 0.05
level. If the value specified
by the null hypothesis is not in the interval then the null hypothesis can
be rejected at the 0.05 level. If a 99% confidence interval is
constructed, then values outside the interval are rejected at the 0.01
level.

Imagine a researcher wishing to test the null hypothesis that the mean
time to respond to an auditory signal is the same as the mean time to
respond to a visual signal. The null hypothesis therefore is:

μvisual - μauditory = 0.

Ten subjects were tested in the visual condition and their scores (in
milliseconds) were: 355, 421,
299, 460, 600, 580, 474, 511, 550, and 586.

Confidence Intervals & Hypothesis Testing (2 of 5)

Ten subjects were tested in the auditory condition and their scores were:
275, 320, 278, 360, 430,
520, 464, 311, 529, and 326.
The 95% confidence interval on the difference between means is:
9 ≤ μvisual - μauditory ≤ 196.
Therefore only values in the interval between 9 and 196 are retained as
plausible values for the
difference between population means. Since zero, the value specified by the null hypothesis, is not in the interval, the null hypothesis of no difference between auditory and visual presentation can be rejected at the 0.05 level. The probability value for this example is 0.034. Any time the parameter specified by a null hypothesis is not contained in the 95% confidence interval estimating that parameter, the null hypothesis can be rejected at the 0.05 level or less. Similarly, if the 99% interval does not contain the parameter, then the null hypothesis can be rejected at the 0.01 level. The null hypothesis is not rejected if the parameter value specified by the null hypothesis is in the interval, since the null hypothesis would still be plausible.
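These figures can be reproduced from the raw data given above. A sketch in Python (it uses the standard pooled-variance t procedures, which the text introduces in later chapters):

    import numpy as np
    from scipy import stats

    visual = np.array([355, 421, 299, 460, 600, 580, 474, 511, 550, 586])
    auditory = np.array([275, 320, 278, 360, 430, 520, 464, 311, 529, 326])

    diff = visual.mean() - auditory.mean()
    # Pooled-variance standard error of the difference between means.
    sp2 = (visual.var(ddof=1) + auditory.var(ddof=1)) / 2
    se = np.sqrt(sp2 * (1 / len(visual) + 1 / len(auditory)))
    df = len(visual) + len(auditory) - 2

    t_crit = stats.t.ppf(0.975, df)   # critical value for a 95% interval
    print(round(diff - t_crit * se, 1), round(diff + t_crit * se, 1))  # ~8.8, ~195.8
    print(round(2 * stats.t.sf(diff / se, df), 3))                     # ~0.034

The interval (about 9 to 196, after rounding) excludes zero, and the two-tailed probability value is 0.034, in agreement with the text.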

Confidence Intervals & Hypothesis Testing (3 of 5)

However, since the null hypothesis would be only one of an infinite number of values in the confidence interval, accepting the null hypothesis is not justified.
There are many arguments against accepting the null hypothesis when it
is not rejected. The null hypothesis is usually a hypothesis of no
difference. Thus null hypotheses such as:

μ1 - μ2 = 0
π1 - π2 = 0
in which the hypothesized value is zero are most common. When the
hypothesized value is zero then there is a simple relationship between
hypothesis testing and confidence intervals:
If the interval contains zero then the null hypothesis cannot be
rejected at the stated level of confidence. If the interval does not
contain zero then the null hypothesis can be rejected.

This is just a special case of the general rule stating that the null
hypothesis can be rejected if the interval does not contain the
hypothesized value of the parameter and cannot be rejected if the
interval contains the hypothesized value.
Confidence Intervals & Hypothesis Testing (4 of 5)

Consider a 95% confidence interval on μ1 - μ2 that contains zero. Since zero is contained in the interval, the null hypothesis that μ1 - μ2 = 0 cannot be rejected at the 0.05 level, since zero is one of the plausible values of μ1 - μ2. The interval contains both positive and negative numbers, and therefore μ1 may be either larger or smaller than μ2. None of the three possible relationships between μ1 and μ2:

μ1 - μ2 = 0,
μ1 - μ2 > 0, and
μ1 - μ2 < 0

can be ruled out. The data are very inconclusive. Whenever a significance test fails to reject the null hypothesis, the direction of the
effect (if there is one) is unknown.

Confidence Intervals & Hypothesis Testing (5 of 5)

Next section: Following a non-significant finding

Now, consider the 95% confidence interval:

6 ≤ μ1 - μ2 ≤ 15.
Since zero is not in the interval, the null hypothesis that μ1 - μ2 = 0 can
be rejected at the 0.05 level. Moreover, since all the values in the
interval are positive, the direction of the effect can be inferred: μ1 > μ2.

Whenever a significance test rejects the null hypothesis that a parameter is zero, the confidence interval on that parameter will not contain zero.
Therefore either all the values in the interval will be positive or all the
values in the interval will be negative. In either case, the direction of the
effect is known.

Following a Non-Significant Finding (1 of 3)


An experimenter wishes to test the hypothesis that sleep deprivation increases reaction time. An experiment is conducted comparing the reaction times of 10 people who have missed a night's sleep with 10 control subjects. Although the sleep-deprived subjects react more slowly, the difference is not significant, p = 0.10, two-tailed. Should the experimenter be more or less certain that sleep deprivation increases reaction time than he or she was before the experiment was conducted?
The naive approach is to argue that there was no significant difference
between the sleep-deprived and the control group so the experimenter
should now be less confident that sleep deprivation increases reaction
time. This argument implicitly assumes that the null hypothesis should
be accepted when it is not rejected. A more straightforward and more
correct approach is to consider that the experimenter expected the sleep-
deprived group to have slower reaction time, and they did. The
experimenter's prediction was correct. It is just that the difference was
not
large enough to rule out chance as an explanation.

Following a Non-Significant Finding (2 of 3)

The experimenter's belief that sleep deprivation increases reaction time should be strengthened. Nonetheless, the data are not strong enough to
convince a skeptic, so no attempt should be made to publish the results.
Instead, the experimenter repeated the experiment. Once again, the sleep-deprived group had slower reaction times. Based on the results of the first experiment, the experimenter conducted a one-tailed test in the second experiment. However, it was not significant, p = 0.08, one-tailed.

The naive interpretation of the two experiments is that the experimenter tried twice to find a significant result and failed both times. With each
failure, the strength of the experimenter's case that sleep deprivation
increases reaction time is weakened.

The correct interpretation is that in two out of two experiments the sleep-deprived subjects had the slower reaction times. The experimenter's case is strengthened by each experiment. Moreover, there are methods for combining the probability values across experiments. For these two experiments, the combined probability is 0.047.

Following a Non-Significant Finding (3 of 3)

Next chapter: Testing hypotheses with standard errors


Taken together, these two experiments provide relatively good evidence
(p<0.05) that sleep deprivation increases reaction time. Naturally, the
more times an outcome is replicated, the more believable the outcome.
Assume the experimenter did the experiment six times and the sleep-deprived group was slower each time. If the probability values were:
0.10, 0.08, 0.12, 0.07, 0.19,
and 0.13, the combined probability would be 0.009. Therefore, six non-
significant probabilities combine to produce one highly significant
probability. Compare this with the naive view that would state that the
experimenter's hypothesis is almost certainly incorrect since not one of
the six experiments found a significant difference between sleep-
deprived and control subjects. Such a view implicitly accepts the null
hypothesis, a serious error.
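The text does not name the combining method, but Fisher's method of combining independent probability values reproduces both of the figures given here. A sketch in Python:

    import numpy as np
    from scipy import stats

    def fisher_combined(p_values):
        # Fisher's method: when every null hypothesis is true, the quantity
        # -2 * sum(ln p) follows a chi-square distribution with 2k degrees
        # of freedom, where k is the number of independent p values.
        chi2 = -2 * np.sum(np.log(p_values))
        return stats.chi2.sf(chi2, df=2 * len(p_values))

    print(round(fisher_combined([0.10, 0.08]), 3))                          # 0.047
    print(round(fisher_combined([0.10, 0.08, 0.12, 0.07, 0.19, 0.13]), 3))  # 0.009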

In summary, a non-significant result means that the data are


inconclusive. Collecting additional data may be all that is needed to
reject the null hypothesis. If the null hypothesis is true, then additional
data will make clear that the effect is at most small. The additional data
can never prove that the effect is nonexistent.

Chapter 9

1. What is the purpose of hypothesis testing?

2. Why is the following explanation incorrect?

The probability value is the probability of obtaining a statistic as different from the parameter specified in the null hypothesis as the
statistic obtained in the experiment. The probability value is computed
assuming that the null hypothesis is true.

3. Why might an experimenter hypothesize something (the null hypothesis) that he or she does not believe is true?

4. State the null hypothesis for:

a. An experiment comparing the mean effectiveness of two methods of psychotherapy.
b. A correlational study on the relationship between exercise and cholesterol.
c. An investigation of whether a particular coin is a fair coin.
d. A study comparing a drug with a placebo on the amount of pain relief. (A one-tailed test was used.)

5. Assume the null hypothesis is that µ = 20 and that the graph shown below is the sampling distribution of the mean (M). Would a sample value of M = 24 be significant at the .05 level?

6. A researcher develops a new theory that predicts that introverts should differ from extraverts in their performance on a psychomotor task. An experiment is conducted and the difference between introverts and extraverts is not significant by conventional levels in that the probability value is 0.12. What should the experimenter conclude about the theory?
7. A researcher hypothesizes that the lowering in cholesterol associated
with weight loss is really due to exercise. To test this, the researcher
carefully controls for exercise while comparing the cholesterol levels of
a group of subjects who lose weight dieting with a control group that
does not diet. The difference between groups in cholesterol is not
significant. Can the researcher claim that weight loss has no effect?
What statistical analysis could the researcher use to make his or her case
more strongly?

8. A significance test is performed and p = .20. Why can't the experimenter claim that the probability that the null hypothesis is true is .20?

9. Why would it be wrong according to the "Classic Neyman-Pearson" view of hypothesis testing to write in a research report: "The effect was significant, p = .0082."

10. For a drug to be approved by the FDA, the drug must be shown to
be safe and effective. If the drug is significantly more effective than a
placebo, then the drug is deemed effective. What do you know about
the effectiveness of a drug once it has been approved by the FDA
(assuming that there has not been a Type I error)?

11. What Greek letters are used to represent the Type I and Type II error rates?
12. What levels are conventionally used for significance testing?

13. When is it valid to use a one-tailed test? What is the advantage of a one-tailed test? Give an example of a null hypothesis that would be tested by a one-tailed test.

14. If the probability value obtained in a significance test of the null hypothesis that µ1 - µ2 = 0 is .033, what do you know about the 95% confidence interval on µ1 - µ2?

15. Distinguish between probability value and significance level.

16. Suppose a study were conducted on the effectiveness of a class on "How to take tests." The SAT scores of an experimental group and a control group were compared. (There were 100 subjects in each group.) The mean score of the experimental group was 503 and the mean score of the control group was 499. The difference between means was found to be significant, p = .037. What do you conclude about the effectiveness of the class?

17. Is it more conservative to use an α level of .01 or an α level of .05? Would beta be higher for an alpha of .05 or for an alpha of .01?

18. Why is "H0: M1 = M2" not a proper null hypothesis?


19. An experimenter expects an effect to come out in a certain
direction. Is this sufficient basis for using a one-tailed test? Why or
why not?
20. Some people claim that no independent variable is ever so weak
as to have absolutely no effect and therefore the null hypothesis is
never true. For the sake of argument, assume this is true and comment
on the value of significance tests.

21. How do the Type I and Type II error rates of one-tailed and two-tailed tests differ?

22. A two-tailed probability is .03. What is the one-tailed probability if the effect were in the specified direction? What would it be if the effect were in the other direction?
