Applied Statistics II Chapter 7 The Relationship Between Two Variables
Jian Zou
1 / 73
Displaying and Summarizing Association in Bivariate Quantitative Data
Correlation
2 / 73
One Variable
3 / 73
Two or more Variables
4 / 73
Two or more Variables
Data sets that consist of more than one variable are called
multivariate; those that consist of two variables are called
bivariate. For now, and for most of this course, we will stick to
relations between only two variables.
5 / 73
Summarizing Association in Bivariate Quantitative Data
6 / 73
Graphical Summaries (sec 7.1)
The basic plot for showing the relation between two quantitative
variables is the scatterplot, which plots the points (x, y).
7 / 73
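As a sketch (not part of the original slides), a scatterplot of the tool-wear data used in EX. 2 later in the chapter can be drawn with matplotlib; the variable names and output file name are ours:

```python
import matplotlib
matplotlib.use("Agg")            # render without a display
import matplotlib.pyplot as plt

# Tool-wear data (EX. 2 later in the chapter): X = time in use, Y = wear
time = [130, 120, 100, 75, 70, 55, 42, 30, 19]
wear = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]

fig, ax = plt.subplots()
ax.scatter(time, wear)           # plot the points (x, y)
ax.set_xlabel("Time in use")
ax.set_ylabel("Wear")
ax.set_title("Scatterplot of wear versus time")
fig.savefig("scatter.png")
```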
Scatterplots
8 / 73
Scatterplots
9 / 73
Scatterplots
4. Strength of the association
Is the association strong (a clear pattern) or weak (a fuzzy pattern)?
10 / 73
Independent and Dependent Variables
11 / 73
Numerical Summaries (sec 7.2)
12 / 73
Summary Measures
Suppose there are n pairs of random measurements (Xi, Yi),
i = 1, 2, 3, · · · , n.
e.g., the height and arm span of n individuals.
The Pearson correlation coefficient is
r = (1/(n − 1)) Σⁿᵢ₌₁ X′i Y′i,
where
X′i = (Xi − X̄)/SX and Y′i = (Yi − Ȳ)/SY
are the standardized data.
14 / 73
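Using the tool-wear data from EX. 2 (later in the chapter) as an illustration, the standardization and the resulting correlation can be computed with only the Python standard library (a sketch; the variable names are ours):

```python
from statistics import mean, stdev

# Tool-wear data from EX. 2: X = time in use, Y = wear
X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
SX, SY = stdev(X), stdev(Y)                  # sample standard deviations

# Standardized data X'_i and Y'_i
Xp = [(x - Xbar) / SX for x in X]
Yp = [(y - Ybar) / SY for y in Y]

# Pearson correlation: average product of the standardized pairs
r = sum(xp * yp for xp, yp in zip(Xp, Yp)) / (n - 1)
print(round(r, 4))                           # 0.9797, as quoted in EX. 2
```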
Pearson Correlation Coefficient
The following plots illustrate what the Pearson correlation measures.
15 / 73
Pearson Correlation Coefficient
16 / 73
Inference using Pearson correlation
17 / 73
Hypothesis testing for correlation
1. Hypotheses
H0: ρ = ρ0  versus  Ha+: ρ > ρ0,  Ha−: ρ < ρ0,  or  Ha±: ρ ≠ ρ0
2. Test Statistic
t* = (r − ρ0) √( (n − 2) / ((1 − r²)(1 − ρ0²)) ).
18 / 73
Hypothesis testing for correlation
3. P-value
• For the test of H0 versus Ha+, p-value = p⁺ = P(T > t*),
• For the test of H0 versus Ha−, p-value = p⁻ = P(T < t*),
• For the test of H0 versus Ha±, p-value = 2 min(p⁻, p⁺),
where T ∼ t(n−2).
19 / 73
Example
EX. 2. Consider the following data relating wear on a certain kind of
carbide-steel cutting tool to its duration of use.
Y (wear): 0.09 0.09 0.07 0.07 0.06 0.05 0.04 0.03 0.02
X (time): 130 120 100 75 70 55 42 30 19
Given that the Pearson correlation coefficient of the data is 0.9797,
perform a hypothesis test of whether the population correlation
exceeds 0.9.
20 / 73
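The test can be carried out numerically. This sketch computes t* for H0: ρ = 0.9 against Ha+: ρ > 0.9 and compares it with the upper 5% point of t₇ (1.895, a standard t-table value):

```python
from math import sqrt
from statistics import mean, stdev

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]                  # time
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]   # wear
n = len(X)

# Pearson correlation r
Xbar, Ybar = mean(X), mean(Y)
r = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / ((n - 1) * stdev(X) * stdev(Y))

# Test H0: rho = 0.9 against Ha+: rho > 0.9
rho0 = 0.9
t_star = (r - rho0) * sqrt((n - 2) / ((1 - r ** 2) * (1 - rho0 ** 2)))
print(round(t_star, 2))          # about 2.41
# t* exceeds t_{7, 0.95} = 1.895, so reject H0 at the 5% level.
```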
Example
21 / 73
Simple linear regression (SLR) model (sec 7.3)
Y = β0 + β1 X(Z) + ε
22 / 73
SLR Model
Y = β0 + β1 X + ε
• Simple: there is a single explanatory variable X.
• Linear: the mean of Y is a linear function of the parameters β0 and β1.
23 / 73
Transformation
Consider an example of equivalence ratio data. Here is a plot of
fuel consumption versus equivalence ratio (left) and versus the fifth
power of equivalence ratio (right).
[Figure: two scatterplots of fuel consumption (F_CONS, vertical axis,
roughly 100 to 130) against equivalence ratio (E_RATIO, left) and
against its fifth power (E5, right).]
24 / 73
Transformation
Y = β0 + β1 X(Z) + ε
  = β0 + β1 Z⁵ + ε
25 / 73
SLR Model
β̂1 = Σⁿᵢ₌₁ (Xi − X̄)(Yi − Ȳ) / Σⁿᵢ₌₁ (Xi − X̄)²
and
β̂0 = Ȳ − β̂1 X̄
26 / 73
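The least-squares formulas above translate directly into code. A sketch using the EX. 2 tool-wear data (variable names are ours):

```python
from statistics import mean

# Tool-wear data from EX. 2
X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]

Xbar, Ybar = mean(X), mean(Y)
Sxy = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y))
Sxx = sum((x - Xbar) ** 2 for x in X)

beta1 = Sxy / Sxx               # least-squares slope
beta0 = Ybar - beta1 * Xbar     # least-squares intercept
print(round(beta1, 6), round(beta0, 4))   # 0.000624 0.0133
```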
Example
EX. 3. Consider the data in EX. 2; fit a simple linear regression
model to the data.
X̄ =
Ȳ =
Σⁿᵢ₌₁ (Xi − X̄)(Yi − Ȳ) =
Σⁿᵢ₌₁ (Xi − X̄)² =
27 / 73
Example
β̂1 =
β̂0 =
28 / 73
Residuals, Predicted and Fitted Values
• The predicted value of Y at X is
Ŷ = β̂0 + β̂1 X .
29 / 73
Example
EX. 3. Consider the data in EX. 2,
30 / 73
Estimators of the εi
Recall that we have assumed the responses are produced from the
model
Yi = β0 + β1 Xi + εi
The fitted model is
Yi = β̂0 + β̂1 Xi + ei
β̂0 is the best guess of β0, β̂1 is the best guess of β1, and therefore
ei is the best guess of εi.
31 / 73
Assumptions in SLR
32 / 73
Estimating the error Variance
S² = MSE = (1/(n − 2)) Σⁿᵢ₌₁ ei²
33 / 73
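Continuing the EX. 2 sketch, the residuals and the error-variance estimate S² = MSE follow directly from the fitted line:

```python
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

# Fit the least-squares line first
Xbar, Ybar = mean(X), mean(Y)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / sum((x - Xbar) ** 2 for x in X)
beta0 = Ybar - beta1 * Xbar

# Residuals e_i = Y_i - (beta0 + beta1 * X_i), then S^2 = MSE = SSE / (n - 2)
e = [y - (beta0 + beta1 * x) for x, y in zip(X, Y)]
SSE = sum(ei ** 2 for ei in e)
MSE = SSE / (n - 2)
print(MSE)      # roughly 2.8e-05
```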
Tools to Assess the Quality of the Fit
Residual plots
If the model fits well, the residuals (the ei's) should behave as the
random errors (the εi's) would:
34 / 73
Residual plots
35 / 73
Tools to Assess the Quality of the Fit
Coefficient of Determination. Define
SSTO = Σⁿᵢ₌₁ (Yi − Ȳ)²,  SSE = Σⁿᵢ₌₁ ei²,  SSR = SSTO − SSE;
these are the total sum of squares, error sum of squares, and regression
sum of squares. The coefficient of determination is
r² = SSR/SSTO = 1 − SSE/SSTO,
the proportion of the variation in Y explained by the regression.
36 / 73
Example
EX. 5. Consider the data and the calculated values in EX. 4. Find
an estimate for the error variance. Calculate the coefficient of
determination.
SSE = Σⁿᵢ₌₁ ei² =
S² =
37 / 73
Example
SSTO = Σⁿᵢ₌₁ (Yi − Ȳ)² =
r² =
38 / 73
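As a sketch with the EX. 2 tool-wear data, SSE, SSTO, and r² can be computed and checked against the correlation quoted earlier (r = 0.9797, so r² ≈ 0.9598):

```python
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / sum((x - Xbar) ** 2 for x in X)
beta0 = Ybar - beta1 * Xbar

# Error and total sums of squares
e = [y - (beta0 + beta1 * x) for x, y in zip(X, Y)]
SSE = sum(ei ** 2 for ei in e)
SSTO = sum((y - Ybar) ** 2 for y in Y)

r2 = 1 - SSE / SSTO
print(round(r2, 4))      # 0.9598, the square of r = 0.9797
```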
Inferences of Slope and Intercept
1. Confidence Intervals
Level L confidence intervals for β0 and β1 are
(β̂0 − σ̂(β̂0) t_{n−2, (1+L)/2}, β̂0 + σ̂(β̂0) t_{n−2, (1+L)/2})
and
(β̂1 − σ̂(β̂1) t_{n−2, (1+L)/2}, β̂1 + σ̂(β̂1) t_{n−2, (1+L)/2}),
respectively, where
σ̂(β̂0) = √( MSE [ 1/n + X̄² / Σⁿᵢ₌₁ (Xi − X̄)² ] )
and
σ̂(β̂1) = √( MSE / Σⁿᵢ₌₁ (Xi − X̄)² ).
39 / 73
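These standard errors and intervals can be sketched in code with the EX. 2 data; the critical value 2.365 is t_{7, 0.975} from a standard t table (L = 0.95):

```python
from math import sqrt
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
Sxx = sum((x - Xbar) ** 2 for x in X)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / Sxx
beta0 = Ybar - beta1 * Xbar
MSE = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(X, Y)) / (n - 2)

# Standard errors of the estimates
se_b0 = sqrt(MSE * (1 / n + Xbar ** 2 / Sxx))
se_b1 = sqrt(MSE / Sxx)

t = 2.365                       # t_{7, 0.975} from a t table
ci_b0 = (beta0 - t * se_b0, beta0 + t * se_b0)
ci_b1 = (beta1 - t * se_b1, beta1 + t * se_b1)
print(tuple(round(v, 6) for v in ci_b1))
```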
Example
EX. 5. Consider the data in EX. 4. Find 95% confidence intervals for
β0 and β1.
β̂0 =        β̂1 =
σ̂(β̂0) = √( MSE [ 1/n + X̄² / Σⁿᵢ₌₁ (Xi − X̄)² ] ) =
σ̂(β̂1) = √( MSE / Σⁿᵢ₌₁ (Xi − X̄)² ) =
40 / 73
Example
t_{n−2, (1+L)/2} =
41 / 73
Note:
42 / 73
Inferences of Slope and Intercept
H0: βi = βi0  vs  Ha+: βi > βi0,  Ha−: βi < βi0,  Ha±: βi ≠ βi0
43 / 73
Inferences of Slope and Intercept
• Test Statistic:
Let t* be the observed value of the test statistic. Then
t* = (β̂i − βi0) / σ̂(β̂i).
• p-value
• For the test of H0 versus Ha+, p-value = p⁺ = P(T > t*),
• For the test of H0 versus Ha−, p-value = p⁻ = P(T < t*),
• For the test of H0 versus Ha±, p-value = 2 min(p⁻, p⁺),
where T ∼ t(n−2).
44 / 73
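As a sketch of the slope test with the EX. 2 data, the statistic t* = β̂1/σ̂(β̂1) for H0: β1 = 0 can be computed and compared with t_{7, 0.975} = 2.365 (a standard t-table value):

```python
from math import sqrt
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
Sxx = sum((x - Xbar) ** 2 for x in X)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / Sxx
beta0 = Ybar - beta1 * Xbar
MSE = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(X, Y)) / (n - 2)

# Test H0: beta1 = 0 against Ha±: beta1 ≠ 0
se_b1 = sqrt(MSE / Sxx)
t_star = (beta1 - 0) / se_b1
print(round(t_star, 2))      # about 12.9
# |t*| is far above t_{7, 0.975} = 2.365, so the slope is clearly nonzero.
```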
Example
EX. 6. For the data in EX. 3, the parameter estimates are
β̂0 =
β̂1 =
σ̂(β̂0 ) =
σ̂(β̂1 ) =
45 / 73
Example
a) Perform a hypothesis test to test whether β0 = 0 or not.
46 / 73
Example
b) Perform a hypothesis test to test whether β1 = 0 or not.
47 / 73
Estimation of The Mean Response
A level L confidence interval for the mean response at x0 is
(Ŷ0 − σ̂(Ŷ0) t_{n−2, (1+L)/2}, Ŷ0 + σ̂(Ŷ0) t_{n−2, (1+L)/2}),
where
σ̂(Ŷ0) = √( MSE [ 1/n + (x0 − X̄)² / Σⁿᵢ₌₁ (Xi − X̄)² ] ).
48 / 73
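A sketch of this interval for the EX. 7 setting (mean wear at x0 = 100 minutes, EX. 2 data, t_{7, 0.975} = 2.365 from a t table):

```python
from math import sqrt
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
Sxx = sum((x - Xbar) ** 2 for x in X)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / Sxx
beta0 = Ybar - beta1 * Xbar
MSE = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(X, Y)) / (n - 2)

x0 = 100
yhat0 = beta0 + beta1 * x0                       # point estimate of the mean wear
se_mean = sqrt(MSE * (1 / n + (x0 - Xbar) ** 2 / Sxx))
t = 2.365                                        # t_{7, 0.975}
ci = (yhat0 - t * se_mean, yhat0 + t * se_mean)
print(round(yhat0, 4), tuple(round(v, 4) for v in ci))
```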
Example
EX. 7. Consider the data in EX. 3. Compute a 95% confidence
interval for the mean wear of tools at 100 minutes.
Point estimate =
t-value =
49 / 73
Prediction of a Future Observation
(Ŷnew − σ̂(Ynew − Ŷnew) t_{n−2, (1+L)/2}, Ŷnew + σ̂(Ynew − Ŷnew) t_{n−2, (1+L)/2}),
where
σ̂(Ynew − Ŷnew) = √( MSE [ 1 + 1/n + (x0 − X̄)² / Σⁿᵢ₌₁ (Xi − X̄)² ] ).
50 / 73
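A sketch of the prediction interval for the EX. 8 setting (x0 = 110 minutes, EX. 2 data); note the extra "1 +" term that makes it wider than the mean-response interval:

```python
from math import sqrt
from statistics import mean

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
Sxx = sum((x - Xbar) ** 2 for x in X)
beta1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / Sxx
beta0 = Ybar - beta1 * Xbar
MSE = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(X, Y)) / (n - 2)

x0 = 110
yhat_new = beta0 + beta1 * x0                                # point predictor
se_pred = sqrt(MSE * (1 + 1 / n + (x0 - Xbar) ** 2 / Sxx))   # wider than the mean-response SE
t = 2.365                                                    # t_{7, 0.975}
pi = (yhat_new - t * se_pred, yhat_new + t * se_pred)
print(tuple(round(v, 4) for v in pi))
```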
Example
EX. 8. Consider the data in EX. 3. Compute a 95% prediction
interval for the wear of a tool at 110 minutes.
Point predictor = ŷnew =
t-value =
51 / 73
The Relation Between Correlation and Regression
Let
Y′i = (Yi − Ȳ)/SY and X′i = (Xi − X̄)/SX.
Then the regression equation fitted by least squares can be written
as
Ŷ′ = r X′,
where X′ is any value of the regressor variable standardized as
described above, and r is the Pearson correlation between X and Y.
52 / 73
The Relation Between Correlation and Regression
Since the standardized responses and regressors are
Ŷ′i = (Ŷi − Ȳ)/SY and X′i = (Xi − X̄)/SX,
we get
(Ŷi − Ȳ)/SY = r (Xi − X̄)/SX.
This gives
Ŷ = [ Ȳ − r (SY/SX) X̄ ] + r (SY/SX) X.
This implies
β̂1 = r SY/SX,   β̂0 = Ȳ − r (SY/SX) X̄ = Ȳ − β̂1 X̄.
53 / 73
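The identity β̂1 = r SY/SX can be checked numerically on the EX. 2 data (a sketch; the two computations are algebraically identical, so they agree to floating-point precision):

```python
from statistics import mean, stdev

X = [130, 120, 100, 75, 70, 55, 42, 30, 19]
Y = [0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
n = len(X)

Xbar, Ybar = mean(X), mean(Y)
SX, SY = stdev(X), stdev(Y)
r = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / ((n - 1) * SX * SY)

# Slope two ways: directly by least squares, and via beta1 = r * SY / SX
beta1_ls = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / sum((x - Xbar) ** 2 for x in X)
beta1_r = r * SY / SX
print(abs(beta1_ls - beta1_r) < 1e-12)   # True: the two formulas agree
```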
Example
EX. 9. For the data in EX.4,
X̄ = , SX = ,
Ȳ = , SY = ,
r= ,
so
β̂1 =
β̂0 =
54 / 73
The Relationship Between Two Categorical Variables (sec 7.4)
55 / 73
Example
56 / 73
Example
Here is a mosaic plot of the 1993 results:
57 / 73
Two-Way Tables
Cell entries: Frequency / Percent / Row Pct. / Col Pct.

Gender     Agree   Disagree   Strongly Disagree    Total
Male          31         71                  30      132
           17.32      39.66               16.76    73.74
           23.48      53.79               22.73
           65.96      79.78               69.77
Female        16         18                  13       47
            8.94      10.06                7.26    26.26
           34.04      38.30               27.66
           34.04      20.22               30.23
Total         47         89                  43      179
           26.26      49.72               24.02   100.00
58 / 73
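The four cell entries in the table above can be reproduced from the raw counts (a sketch; the `pct`/`row_pct`/`col_pct` names are ours):

```python
# Gender-by-response counts from the table above (rows: Male, Female)
counts = [[31, 71, 30],
          [16, 18, 13]]

n = sum(sum(row) for row in counts)                   # 179
row_totals = [sum(row) for row in counts]             # [132, 47]
col_totals = [sum(col) for col in zip(*counts)]       # [47, 89, 43]

pct     = [[100 * c / n for c in row] for row in counts]                       # overall percent
row_pct = [[100 * c / rt for c in row] for row, rt in zip(counts, row_totals)]
col_pct = [[100 * c / ct for c, ct in zip(row, col_totals)] for row in counts]

print(round(pct[0][0], 2), round(row_pct[0][0], 2), round(col_pct[0][0], 2))
# 17.32 23.48 65.96 -- matching the Male/Agree cell of the table
```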
Inference for Categorical Data with Two Categories
59 / 73
Example 3, Continued
60 / 73
Inference for Categorical Data with More Than Two
Categories: One-Way Tables
61 / 73
Inference for Categorical Data with More Than Two
Categories: One-Way Tables
62 / 73
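The slide content for the one-way case is not reproduced here; as a sketch under the usual setup, the chi-square goodness-of-fit statistic X² = Σ (Oi − Ei)²/Ei with k − 1 degrees of freedom can be computed as follows (the observed counts are hypothetical, and 7.815 is the standard χ²₃ upper 5% table value):

```python
# Hypothetical one-way table: 80 subjects over k = 4 categories,
# testing H0: all four category probabilities equal 1/4.
observed = [18, 25, 16, 21]
n = sum(observed)
expected = [n / 4] * 4                      # E_i = n * p_i0 = 20 each

X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(X2, 2))      # 2.3
# df = k - 1 = 3; since 2.3 < chi-square_{3, 0.95} = 7.815, do not reject H0.
```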
Example 4
63 / 73
Inference for Categorical Data with More Than Two
Categories: Two-Way Tables
64 / 73
Inference for Two-Way Tables
Cell probabilities pij, with row marginals pi· and column marginals p·j:

row \ column      1      2     ...     c     Marginals
1                p11    p12    ...    p1c    p1·
2                p21    p22    ...    p2c    p2·
...              ...    ...           ...    ...
r                pr1    pr2    ...    prc    pr·
Marginals        p·1    p·2    ...    p·c    1
65 / 73
Inference for Two-Way Tables
66 / 73
Inference for Two-Way Tables
Cell counts Yij, with row totals Yi· and column totals Y·j:

row \ column      1      2     ...     c     Totals
1                Y11    Y12    ...    Y1c    Y1·
2                Y21    Y22    ...    Y2c    Y2·
...              ...    ...           ...    ...
r                Yr1    Yr2    ...    Yrc    Yr·
Totals           Y·1    Y·2    ...    Y·c    n
67 / 73
Inference for Two-Way Tables
68 / 73
Inference for Two-Way Tables
69 / 73
Example
70 / 73
Example
Cell entries: Frequency / (Expected) / (Residual)

Age Group   Watch TV      Read   Listen to Music   Surf the Web   Play Computer Game   Totals
18-25             21         3                 9             10                   19       62
             (19.13)    (9.22)           (11.52)        (11.75)              (10.37)
             (+0.43)   (−2.05)           (−0.74)        (−0.51)              (+2.68)
26-35             17         5                 6              8                   13       49
             (15.12)    (7.29)            (9.11)         (9.29)               (8.20)
              (0.48)   (−0.85)           (−1.03)        (−0.42)               (1.68)
36-50             14         8                 8             12                    9       51
             (15.74)    (7.58)            (9.48)         (9.67)               (8.53)
             (−0.44)    (0.15)           (−0.48)         (0.75)               (0.16)
51-65             18        10                11             10                    3       52
             (16.04)    (7.73)            (9.67)         (9.86)               (8.70)
              (0.49)    (0.82)            (0.43)         (0.04)              (−1.93)
Over 65           13        14                16             11                    1       55
             (16.97)    (8.18)           (10.22)        (10.43)               (9.20)
             (−0.96)    (2.04)            (1.81)         (0.18)              (−2.70)
Totals            83        40                50             51                   45      269
71 / 73
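The expected counts and residuals in the table above can be reproduced, and the chi-square statistic for independence assembled, with a short sketch (26.30 is the standard χ²₁₆ upper 5% table value):

```python
from math import sqrt

# Observed counts from the age-by-activity table above
obs = [[21,  3,  9, 10, 19],
       [17,  5,  6,  8, 13],
       [14,  8,  8, 12,  9],
       [18, 10, 11, 10,  3],
       [13, 14, 16, 11,  1]]

n = sum(sum(row) for row in obs)                # 269
row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]

# Expected counts under independence: E_ij = (row total)(column total)/n
exp = [[rt * ct / n for ct in col_tot] for rt in row_tot]

# Pearson residuals (O - E)/sqrt(E), shown in parentheses in the table
res = [[(o - e) / sqrt(e) for o, e in zip(orow, erow)]
       for orow, erow in zip(obs, exp)]

X2 = sum(r ** 2 for row in res for r in row)
print(round(exp[0][0], 2), round(res[0][0], 2), round(X2, 1))
# 19.13 0.43 38.9 -- df = (5-1)(5-1) = 16; X2 > chi-square_{16, 0.95} = 26.30,
# so age group and preferred activity are associated.
```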
Example
72 / 73
Association is NOT Cause and Effect
73 / 73