Professional Documents
Culture Documents
Chapter 9 - Correlation and Regression
Chapter 9 - Correlation and Regression
Elementary Statistics
Tenth Edition
by Mario F. Triola
Slide 1
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Chapter 10
Correlation and Regression
10-1 Overview
10-2 Correlation
10-3 Regression
10-4 Variation and Prediction Intervals
10-5 Multiple Regression
10-6 Modeling
Slide 2
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-1
Overview
Slide 3
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Overview
Slide 4
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-2
Correlation
Slide 5
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Key Concept
Slide 6
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Part 1: Basic Concepts of Correlation
Slide 7
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Slide 8
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Slide 9
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Exploring the Data
Slide 10
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Scatterplots of Paired Data
Slide 13
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Notation for the
Linear Correlation Coefficient
n represents the number of pairs of data present.
denotes the addition of the items indicated.
x denotes the sum of all x-values.
x2 indicates that each x-value should be squared and then those
squares added.
(x)2 indicates that the x-values should be added and the total then
squared.
xy indicates that each x-value should be first multiplied by its
corresponding y-value. After obtaining all such products, find their
sum.
r represents linear correlation coefficient for a sample.
represents linear correlation coefficient for a population.
Slide 14
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Formula
The linear correlation coefficient r measures the
strength of a linear relationship between the
paired values in a sample.
nxy – (x)(y)
r=
n(x2) – (x)2 n(y2) – (y)2
Formula 10-1
Slide 15
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Interpreting r
Slide 16
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Rounding the Linear
Correlation Coefficient r
Slide 17
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Calculating r
Using the simple random sample of data
below, find the value of r.
Data
x 3 1 3 5
y 5 8 6 4
Slide 18
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Calculating r - cont
Slide 19
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Calculating r - cont
Data
x 3 1 3 5
y 5 8 6 4
nxy – (x)(y)
r=
n(x2) – (x)2 n(y2) – (y)2
61 – (12)(23)
r=
4(44) – (12)2 4(141) – (23)2
-32
r= = -0.956
33.466
Slide 20
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Calculating r - cont
Slide 21
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Given the sample data in Table 10-1, find the value of
the linear correlation coefficient r, then refer to Table
A-6 to determine whether there is a significant linear
correlation between the duration and interval of the
eruption times.
Using the same procedure previously illustrated, we
find that r = 0.926.
Slide 23
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Interpreting r :
Explained Variation
Slide 24
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Using the duration/interval data in Table 10-1, we have found
that the value of the linear correlation coefficient r = 0.926.
What proportion of the variation of the interval after eruption
times can be explained by the variation in the duration times?
Slide 25
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Common Errors
Involving Correlation
1. Causation: It is wrong to conclude that
correlation implies causality.
Slide 26
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Part 2: Formal Hypothesis Test
Slide 27
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Formal Hypothesis Test
Slide 28
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Hypothesis Test
for a Linear
Correlation
Figure 10-3
Slide 29
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Method 1: Test Statistic is t
Test statistic:
r
t=
1–r2
n–2
Critical values:
Conclusion:
If the absolute value of t is > critical value from
Table A-3, reject H0 and conclude that there is a
linear correlation. If the absolute value of
t ≤ critical value, fail to reject H0; there is not
sufficient evidence to conclude that there is a
linear correlation.
Slide 31
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Method 2:
Test Statistic is r
Test statistic: r
Critical values: Refer to Table A-6
Conclusion:
If the absolute value of r is > critical value from
Table A-6, reject H0 and conclude that there is
a linear correlation. If the absolute value of
r ≤ critical value, fail to reject H0; there is
not sufficient evidence to conclude that there
is a linear correlation.
Slide 32
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Using the duration/interval data in Table 10-1, test the
claim that there is a linear correlation between the
duration of an eruption and the time interval after that
eruption. Use Method 1.
r
t=
1–r2
n–2
0.926
t= = 6.008
1 – 0.926 2
8–2
Slide 33
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Method 1: Test Statistic is t
(follows format of earlier chapters)
Slide 35
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Method 2: Test Statistic is r
Test statistic: r
Critical values: Refer to Table A-6
(8 degrees of freedom)
Figure 10-5
Slide 36
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Using the duration/interval data in Table 10-1, test the
claim that there is a linear correlation between the
duration of an eruption and the time interval after that
eruption.
Slide 38
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Recap
In this section, we have discussed:
Correlation.
The linear correlation coefficient r.
Requirements, notation and formula for r.
Interpreting r.
Formal hypothesis testing (two methods).
Rationale for calculating r.
Slide 39
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-3
Regression
Slide 40
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Key Concept
Slide 41
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Part 1: Basic Concepts of Regression
Slide 42
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Regression
Slide 43
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Requirements
Slide 44
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definitions
Regression Equation
Given a collection of paired data, the regression
equation
^
y = b0 + b1x
algebraically describes the relationship between
the two variables.
Regression Line
The graph of the regression equation is called
the regression line (or line of best fit, or least
squares line).
Slide 45
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Notation for
Regression Equation
Population Sample
Parameter Statistic
Slide 46
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Formulas for b0 and b1
Slide 47
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Special Property
Slide 48
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Rounding the y-intercept b0
and the Slope b1
Slide 49
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Calculating the
Regression Equation
Data
x 3 1 3 5
y 5 8 6 4
Slide 50
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Calculating the
Regression Equation - cont
Data
x 3 1 3 5
y 5 8 6 4
n=4 b0 = y – b1 x
x = 12
5.75 – (–1)(3) = 8.75
y = 23
x2 = 44
y2 = 141
xy = 61
Slide 52
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Calculating the
Regression Equation - cont
Data
x 3 1 3 5
y 5 8 6 4
y2 = 141
xy = 61
Slide 53
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
^
y = 34.8 + 0.234x
Slide 54
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
Given the sample data in Table 10-1,
find the regression equation.
Slide 55
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Part 2: Beyond the Basics of Regression
Slide 56
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Predictions
In predicting a value of y based on
some given value of x ...
1. If there is not a linear correlation,
the best predicted y-value is y.
Slide 57
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Procedure for
Predicting
Figure 10-7
Slide 58
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Slide 59
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Given the sample data in Table 10-1, we found that the
regression equation is ^
y = 34.8 + 0.234x. Assuming that
the current eruption has a duration of x = 180 sec, find
the best predicted value of y, the time interval after this
eruption.
Slide 60
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
Given the sample data in Table 10-1, we found that the
regression equation is ^
y = 34.8 + 0.234x. Assuming that
the current eruption has a duration of x = 180 sec, find
the best predicted value of y, the time interval after this
eruption.
^
y = 34.8 + 0.234x
34.8 + 0.234(180) = 76.9 min
Slide 61
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Guidelines for Using The
Regression Equation
1. If there is no linear correlation, don’t use the
regression equation to make predictions.
2. When using the regression equation for
predictions, stay within the scope of the
available sample data.
3. A regression equation based on old data is
not necessarily valid now.
4. Don’t make predictions about a population
that is different from the population from
which the sample data were drawn.
Slide 62
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definitions
Marginal Change
Slide 63
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Residual
The residual for a sample of paired (x, y) data, is the
difference (y - ^
y) between an observed sample y-value
and the value of y, which is the value of y that is
predicted by using the regression equation.
Slide 64
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definitions
Least-Squares Property
Residual Plot
A scatterplot of the (x, y) values after each of the y-coordinate values
have been replaced by the residual value y – ^ y. That is, a residual
^ y – y)
plot is a graph of the points (x,
Slide 65
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Residual Plot Analysis
Slide 66
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Residual Plots
Slide 67
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Residual Plots
Slide 68
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Residual Plots
Slide 69
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Recap
Slide 70
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-4
Variation and Prediction
Intervals
Slide 71
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Key Concept
Slide 72
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Total Deviation
The total deviation of (x, y) is the vertical distance y – y,
which is the distance between the point (x, y) and the
horizontal line passing through the sample mean y.
Slide 73
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Explained Deviation
The explained deviation is the vertical distance ^
y - y,
which is the distance between the predicted y-value
and the horizontal line passing through the sample
mean y.
Slide 74
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Unexplained Deviation
The unexplained deviation is the vertical distance
y ^- y, which is the vertical distance between the
point (x, y) and the regression line. (The distance^ y-
y is also called a residual, as defined in Section 10-3.)
Slide 75
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Unexplained, Explained, and Total Deviation
Figure 10-9
Slide 76
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Relationships
(total deviation) = (explained deviation) + (unexplained deviation)
(y - y) = ^
(y - y) + ^
(y - y)
2 2 ^ 2
(y - y) = (y - y) + (y - y) ^
Formula 10-4
Slide 77
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Coefficient of determination
is the amount of the variation in y that
is explained by the regression line.
explained variation.
r= 2
total variation
Slide 78
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Slide 79
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Standard Error of Estimate
se = (y ^
– y) 2
n–2
or
se = y 2
– b0 y – b1 xy
n–2 Formula 10-5
Slide 80
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Given the sample data in Table 10-1, find the standard
error of estimate se for the duration/interval data.
n=8
y2 = 60,204
y = 688
xy = 154,378 se = y2 - b0 y - b1 xy
b0 = 34.7698041
b1 = 0.2340614319 n-2
Slide 81
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
Given the sample data in Table 10-1, find the standard
error of estimate se for the duration/interval data.
n=8
y2 = 60,204
y = 688
xy = 154,378
b0 = 34.7698041 se = 4.973916052 = 4.97 (rounded)
b1 = 0.2340614319
Slide 82
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Prediction Interval for an
Individual y
^ ^
y-E< y < y +E
where
2
1 n(x0 – x)
1+ +
E = t 2 s e n 2
n(x ) – (x)
2
n(x0 – x)2
E = t2 se 1+1 +
n 2
n(x ) – (x)2
Slide 84
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
For the paired duration/interval after eruption times in Table 10-1,
we have found that for a duration of 180 sec, the best predicted
time interval after the eruption is 76.9 min. Construct a 95%
prediction interval for the time interval after the eruption, given
that the duration of the eruption is 180 sec (so that x = 180).
^ ^
y
–E< y < y+E
76.9 – 13.4 < y < 76.9 + 13.4
63.5 < y < 90.3
Slide 85
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Recap
Slide 86
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-5
Multiple Regression
Slide 87
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Key Concept
Slide 88
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Multiple Regression Equation
A linear relationship between a response
variable y and two or more predictor
variables (x1, x2, x3 . . . , xk)
^y = b + b x + b x + . . . + b x .
0 1 1 2 2 k k
Slide 89
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Notation
^
y = b0 + b1 x1+ b2 x2+ b3 x3 +. . .+ bk xk
(General form of the estimated multiple regression equation)
n = sample size
k = number of predictor variables
y^ = predicted value of y
x1, x2, x3 . . . , xk are the predictor
variables
Slide 90
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Notation - cont
Slide 91
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Technology
Use a statistical software package such as
STATDISK
Minitab
Excel
Slide 92
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful
Slide 93
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
Slide 94
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Example: Old Faithful - cont
Slide 95
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Definition
Slide 96
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
2
Adjusted R
Adjusted R = 1 –
2 (n – 1) 2
(1– R )
[n – (k + 1)]
Formula 10-6
Slide 97
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Finding the Best Multiple
Regression Equation
1. Use common sense and practical considerations to include or
exclude variables.
2. Consider the P-value. Select an equation having overall significance,
as determined by the P-value found in the computer display.
3. Consider equations with high values of adjusted R2 and try to include
only a few variables.
If an additional predictor variable is included, the value of adjusted R2
does not increase by a substantial amount.
For a given number of predictor (x) variables, select the equation with the
largest value of adjusted R2.
In weeding out predictor (x) variables that don’t have much of an effect
on the response (y) variable, it might be helpful to find the linear
correlation coefficient r for each of the paired variables being considered.
Slide 98
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Dummy Variables
Slide 99
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Logistic Regression
Slide 100
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Recap
Slide 101
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Section 10-6
Modeling
Slide 102
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Key Concept
Slide 103
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
TI-83/84 Plus Generic Models
Linear: y = a + bx
Quadratic: y = ax2 + bx + c
Logarithmic: y = a + b ln x
Exponential: y = abx
Power: y = axb
Slide 104
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
The slides that follow illustrate the graphs
of some common models displayed on a
TI-83/84 Plus Calculator
Slide 105
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Slide 106
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Slide 107
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Slide 108
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Slide 109
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Slide 110
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Development of a Good
Mathematical Model
Look for a Pattern in the Graph: Examine the
graph of the plotted points and compare the basic
pattern to the known generic graphs of a linear
function.
Find and Compare Values of R2: Select functions
that result in larger values of R2, because such
larger values correspond to functions that better
fit the observed points.
Think: Use common sense. Don’t use a model
that leads to predicted values known to be totally
unrealistic.
Slide 111
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.
Recap
Slide 112
Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley.