
Applied Statistics II

Chapter 7: The Relationship Between Two Variables

Jian Zou

WPI Mathematical Sciences

1 / 73
Displaying and Summarizing Association in Bivariate Quantitative
Data

Correlation

Simple Linear Regression

The Relationship Between Two Categorical Variables

2 / 73
One Variable

In the first 6 chapters, we dealt with just one variable at a time. This
variable, when measured on many different subjects or objects,
took the form of a list of numbers.

e.g.: Height of a student in the class (variable).

data set: 64”, 62”, 69”, 72”, 73”, 65”, 64”, 67”.

3 / 73
Two or more Variables

Often, when collecting or analyzing data, there are two or more
variables of interest. In many cases it is important to know
whether two or more variables are related, and if they are, what
the nature of that relationship is.

e.g.: What is the relationship between height (variable 1) and
weight (variable 2) of a student in the class?
data set:
height 64” 62” 69” 72” 73” 65” 64” 67”
weight 148 152 170 200 175 169 155 162

4 / 73
Two or more Variables

Data sets which consist of more than one variable are called
multivariate; those which consist of two variables are called
bivariate. For now, and most of this course, we will stick to
relations between only two variables.

5 / 73
Summarizing Association in Bivariate Quantitative Data

To parallel our discussion of the one-variable case, we discuss
Graphical Summaries and Numerical Summaries.

6 / 73
Graphical Summaries (sec 7.1)

The basic plot for showing the relation between two quantitative
variables is the scatterplot, which plots the points (x, y).

It summarizes the relationship in a bivariate (or
X-Y) plot that plots one variable against the other.

7 / 73
Scatterplots

Here are the things to look for:

1. Is there any relationship?
Here we look for the presence of association (linear or nonlinear
(quadratic or something else)).

2. Type of the association
Is the relationship linear, quadratic, or something else?
(After this point we discuss only linear associations.)

3. Direction of the association
Is the association positive (Y increases when X is increased) or
negative (Y decreases when X is increased).

8 / 73
Scatterplots

9 / 73
Scatterplots
4. Strength of the association
Is the association strong (clear pattern) or weak (fuzzy pattern).

10 / 73
Independent and Dependent Variables

When dealing with a relation, the X and Y variables have
particular roles to play. The X (horizontal axis) variable represents
the variable that explains what we see in Y (the vertical axis).

Often, X is a quantity which we can change or have control over.
Or at least, we want to see how Y reacts to different values of X.

So X is usually called the independent or explanatory variable,
and Y the dependent or response variable.

11 / 73
Numerical Summaries (sec 7.2)

Here we focus on measuring linear association in bivariate


quantitative data. The Pearson correlation coefficient is a measure
of the strength of linear association between two quantitative
variables.

12 / 73
Summary Measures
Suppose there are n pairs of random measurements (Xi, Yi),
i = 1, 2, 3, · · · , n.
e.g.: the height and arm span of n individuals.

We can summarize the location (or center) of each variable by the


sample means:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

We also can summarize the spread of each variable by the sample
standard deviations:

$$S_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2} \quad \text{and} \quad S_Y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

(however, none of these summary measures says anything about


the relationship between the two variables).
13 / 73
Pearson Correlation Coefficient

Then the Pearson correlation between X and Y computed from
these data is

$$r = \frac{1}{n-1}\sum_{i=1}^{n} X_i' Y_i',$$

where
$$X_i' = \frac{X_i - \bar{X}}{S_X} \quad \text{and} \quad Y_i' = \frac{Y_i - \bar{Y}}{S_Y}$$
are the standardized data.

14 / 73
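Note (added illustration, not part of the original slides): a minimal Python/NumPy sketch of this computation, standardizing each variable and averaging the products; it uses the tool-wear data that appear later in EX. 2.

```python
import numpy as np

# Tool-wear data from EX. 2: X = duration of use, Y = wear
x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

# Standardize each variable with its sample mean and sample SD (ddof=1 gives the n-1 divisor)
x_std = (x - x.mean()) / x.std(ddof=1)
y_std = (y - y.mean()) / y.std(ddof=1)

# Pearson correlation: sum of products of the standardized pairs, divided by n - 1
r = np.sum(x_std * y_std) / (n - 1)
print(r)  # should agree with np.corrcoef(x, y)[0, 1]
```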
Pearson Correlation Coefficient
The following illustrates what the Pearson correlation measures.

15 / 73
Pearson Correlation Coefficient

Things to know about Pearson correlation


• Pearson correlation is always between -1 and 1. Values near 1
signify strong positive linear association. Values near -1 signify
strong negative linear association. Values near 0 signify weak linear
association.
• Correlation between X and Y is the same as the correlation
between Y and X.
• If SX = 0 and/or SY = 0, we define r to be 0. This is because
  – the formula doesn’t work in this case (division of 0 by 0);
  – the standard deviation equals 0 if and only if all data values are
    equal, so
  – there can be no association since there is no variation.

16 / 73
Inference using Pearson correlation

If the data are a random sample from a large population, we can


consider inference about the population correlation, ρ, which would
be the Pearson correlation computed for all units in the population.

If n is the sample size,

$$t = (r - \rho)\sqrt{\frac{n-2}{(1-r^2)(1-\rho^2)}}$$

has approximately a $t_{n-2}$ distribution. We can use this fact to
obtain a confidence interval for ρ.

17 / 73
Hypothesis testing for correlation

1. Null and Alternative Hypotheses:

$$H_0: \rho = \rho_0 \quad \text{vs} \quad \begin{cases} H_a^{+}: \rho > \rho_0 \\ H_a^{-}: \rho < \rho_0 \\ H_a^{\pm}: \rho \neq \rho_0 \end{cases}$$

2. Test Statistic

$$t^* = (r - \rho_0)\sqrt{\frac{n-2}{(1-r^2)(1-\rho_0^2)}}.$$

Here t ∗ is the observed value of t.

18 / 73
Hypothesis testing for correlation

3. P-value
• For the test of H0 versus Ha+

p-value = p + = P(T > t ∗ ), where T ∼ tn−2 .

• For the test of H0 versus Ha−

p-value = p − = P(T < t ∗ ), where T ∼ tn−2 .

• For the test of H0 versus Ha±

p-value = 2 min (p − , p + ).

19 / 73
Example
EX. 2. Consider the following data relating wear on a certain kind of
carbide-steel cutting tool to the duration of use.
Y (wear): 0.09 0.09 0.07 0.07 0.06 0.05 0.04 0.03 0.02
X (time): 130 120 100 75 70 55 42 30 19
If the Pearson correlation coefficient of the data is 0.9797, perform a
hypothesis test to test whether the population correlation exceeds
0.9.

20 / 73
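Note (added sketch, not the original worked solution): the test for EX. 2 can be carried out numerically as below, assuming SciPy is available for the t distribution.

```python
import numpy as np
from scipy import stats

r, rho0, n = 0.9797, 0.9, 9   # sample correlation, null value rho_0, sample size from EX. 2

# t* = (r - rho_0) * sqrt( (n - 2) / ((1 - r^2)(1 - rho_0^2)) )
t_star = (r - rho0) * np.sqrt((n - 2) / ((1 - r**2) * (1 - rho0**2)))

# Ha+: rho > 0.9, so the p-value is P(T > t*) with T ~ t_{n-2}
p_value = stats.t.sf(t_star, df=n - 2)
print(t_star, p_value)
```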
Example

21 / 73
Simple linear regression (SLR) model (sec 7.3)

The SLR model attempts to quantify the relationship between a
single predictor variable Z and a response variable Y. This
reasonably flexible yet simple model has the form

Y = β0 + β1 X(Z) + ε

where ε is a random error term having mean 0 and variance σ²
(often assumed to have a N(0, σ²) distribution) and X(Z) is a
function of Z, such as Z, Z², or ln(Z).

22 / 73
SLR Model

The function X is called the regressor. Often, we omit specifying
the dependence of the regressor X on the predictor Z, and just
write the model as

Y = β0 + β1 X + ε

The terms “simple” and “linear” mean:

• Simple:

• Linear:

23 / 73
Transformation
Consider an example of equivalence ratio data. Here is a plot of
fuel consumption versus equivalence ratio (left) and versus the fifth
power of equivalence ratio (right).

[Figure: two scatterplots of fuel consumption (F_CONS); left panel versus equivalence ratio (E_RATIO), right panel versus the fifth power of equivalence ratio (E5).]

24 / 73
Transformation

The plot on the right seems to show a nearly linear association. In


terms of the SLR model, we have

$$Y = \beta_0 + \beta_1 X(Z) + \epsilon = \beta_0 + \beta_1 Z^5 + \epsilon$$

where the response Y is fuel consumption, the predictor Z is
equivalence ratio, and the regressor X(Z) is Z⁵, the fifth power of
equivalence ratio.

25 / 73
SLR Model

When fitting regression models, we’ll use least squares. Using


calculus, we find the least squares estimators of β0 and β1 to be

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

and

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

26 / 73
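Note (added sketch): the least squares formulas can be evaluated directly with NumPy; here they are applied to the EX. 2 tool-wear data used in the examples that follow.

```python
import numpy as np

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)   # time
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])  # wear

sxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum of cross-products about the means
sxx = np.sum((x - x.mean()) ** 2)              # sum of squares of X about its mean

beta1_hat = sxy / sxx                          # slope estimate
beta0_hat = y.mean() - beta1_hat * x.mean()    # intercept estimate
print(beta0_hat, beta1_hat)
```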
Example
EX. 3. Consider the data in EX. 2, fit a simple linear regression
model for data.
X̄ =

Ȳ =

$\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ =

$\sum_{i=1}^{n}(X_i - \bar{X})^2$ =

27 / 73
Example

β̂1 =

β̂0 =

28 / 73
Residuals, Predicted and Fitted Values
• The predicted value of Y at X is

Ŷ = β̂0 + β̂1 X .

• For X = Xi , one of the values in the data set, the predicted


value is called a fitted value and is written

Ŷi = β̂0 + β̂1 Xi .

• The residuals, ei , i = 1, 2, · · · , n are the differences between the


observed and fitted values for each data value:

ei = Yi − Ŷi = Yi − (β̂0 + β̂1 Xi ).

29 / 73
Example
EX. 3. Consider the data in EX. 2,

Yi (wear): 0.09 0.09 0.07 0.07 0.06


Xi (time): 130 120 100 75 70
Ŷi (fitted): 0.0944 0.0882 0.0757 0.0601 0.0570
ei (res): -0.0044 0.0018 -0.0057 0.0099 0.0030

Yi (wear): 0.05 0.04 0.03 0.02


Xi (time): 55 42 30 19
Ŷi (fitted): 0.0477 0.0395 0.0321 0.0252
ei (res): 0.0023 0.0005 -0.0021 -0.0052

30 / 73
Estimator of εi

Recall that we have assumed the responses are produced from the
model

Yi = β0 + β1 Xi + εi

After fitting the model, we have

Yi = β̂0 + β̂1 Xi + ei

β̂0 is the best guess of β0, β̂1 is the best guess of β1, and therefore
ei is the best guess of εi.

31 / 73
Assumptions in SLR

In the simple linear regression model we assume the random errors are
independent and identically distributed from a normal distribution
with mean 0 and variance σ², where σ > 0 (εi ∼ iid N(0, σ²)).

32 / 73
Estimating the error Variance

Here, to estimate the variance of the random error (σ²), we use the
analogous sample quantity, the mean square error of the residuals:

$$S^2 = \text{MSE} = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2$$

33 / 73
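Note (added sketch, assuming the EX. 2 data and the least squares fit above): the residuals and the estimate S² = MSE can be computed as follows.

```python
import numpy as np

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)            # residuals e_i = Y_i - (b0 + b1*X_i)
mse = np.sum(e ** 2) / (n - 2)   # S^2 = MSE, the estimate of sigma^2
print(e.round(4), mse)
```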
Tools to Assess the Quality of the Fit

Residual plots

If the model fits well, the residuals (the ei's) should behave as the
random errors (the εi's) would:

• Residuals should exhibit no patterns when plotted versus the Xi or
the Ŷi. If the model does not fit well, a pattern should show up in the
distribution of the residuals.

• Studentized residuals should be plotted on a normal quantile
plot, or even better, a quantile plot for the $t_{n-3}$ distribution.

34 / 73
Residual plots

35 / 73
Tools to Assess the Quality of the Fit
Coefficient of Determination.

The coefficient of determination, r², a numerical measure of the
quality of the fit, is a measure of the proportion of variation in the
response ”explained” by the regression. The coefficient of
determination is computed as

$$r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$

where

$$SSTO = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} e_i^2, \qquad SSR = SSTO - SSE$$

are the total sum of squares, error sum of squares, and regression
sum of squares.

36 / 73
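Note (added sketch, not the original example solution): the sums of squares and r² for the EX. 2 tool-wear data.

```python
import numpy as np

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

sse = np.sum(e ** 2)                 # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)   # total sum of squares
ssr = ssto - sse                     # regression sum of squares
r2 = 1 - sse / ssto                  # coefficient of determination
print(sse, ssto, ssr, r2)
```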
Example
EX. 5 Consider the data and the calculated values in EX. 4. Find
an estimate for the error variance. Calculate the coefficient of
determination.
SSE = $\sum_{i=1}^{n} e_i^2$ =

S² =

37 / 73
Example
SSTO = $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$ =

r² =

38 / 73
Inferences of Slope and Intercept
1. Confidence Intervals
Level L confidence intervals for β0 and β1 are

$$\left(\hat{\beta}_0 - \hat{\sigma}(\hat{\beta}_0)\, t_{n-2,\frac{1+L}{2}},\ \ \hat{\beta}_0 + \hat{\sigma}(\hat{\beta}_0)\, t_{n-2,\frac{1+L}{2}}\right),$$

and

$$\left(\hat{\beta}_1 - \hat{\sigma}(\hat{\beta}_1)\, t_{n-2,\frac{1+L}{2}},\ \ \hat{\beta}_1 + \hat{\sigma}(\hat{\beta}_1)\, t_{n-2,\frac{1+L}{2}}\right),$$

respectively, where

$$\hat{\sigma}(\hat{\beta}_0) = \sqrt{\text{MSE}\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]},$$

and

$$\hat{\sigma}(\hat{\beta}_1) = \sqrt{\text{MSE}\Big/\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

39 / 73
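Note (added sketch): 95% confidence intervals for β0 and β1 computed from the EX. 2 data, assuming SciPy for the t quantile; this is an illustration, not the slides' worked answer.

```python
import numpy as np
from scipy import stats

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

se_b0 = np.sqrt(mse * (1 / n + x.mean() ** 2 / sxx))  # sigma_hat(beta0_hat)
se_b1 = np.sqrt(mse / sxx)                            # sigma_hat(beta1_hat)

L = 0.95
t_crit = stats.t.ppf((1 + L) / 2, df=n - 2)           # t_{n-2,(1+L)/2}
print((b0 - t_crit * se_b0, b0 + t_crit * se_b0))     # CI for beta0
print((b1 - t_crit * se_b1, b1 + t_crit * se_b1))     # CI for beta1
```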
Example
EX. 5 Consider the data in Ex 4. Find 95% confidence interval for
β0 and β1 .
β̂0 =          β̂1 =

$\hat{\sigma}(\hat{\beta}_0) = \sqrt{\text{MSE}\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$ =

$\hat{\sigma}(\hat{\beta}_1) = \sqrt{\text{MSE}\Big/\sum_{i=1}^{n}(X_i - \bar{X})^2}$ =

40 / 73
Example
$t_{n-2,\frac{1+L}{2}}$ =

41 / 73
Note:

Whether the interval for β1 contains 0 is of particular interest. If it
does, it means that we cannot statistically distinguish β1 from 0.
This means we must consider as plausible the model for which
β1 = 0:
Y = β0 + ε
This model implies that there is no linear association between Y
and X.

42 / 73
Inferences of Slope and Intercept

2. Hypothesis tests for Slope and Intercept


• Hypotheses:
For i = 0 (intercept) and i = 1 (slope):

$$H_0: \beta_i = \beta_{i0} \quad \text{vs} \quad \begin{cases} H_a^{+}: \beta_i > \beta_{i0} \\ H_a^{-}: \beta_i < \beta_{i0} \\ H_a^{\pm}: \beta_i \neq \beta_{i0} \end{cases}$$

43 / 73
Inferences of Slope and Intercept

• Test Statistic:
Let t* be the observed value of the test statistic. Then

$$t^* = \frac{\hat{\beta}_i - \beta_{i0}}{\hat{\sigma}(\hat{\beta}_i)}.$$

• p-value
• For the test of H0 versus Ha+, p-value = p+ = P(T > t*),
• For the test of H0 versus Ha−, p-value = p− = P(T < t*),
• For the test of H0 versus Ha±, p-value = p± = 2 min(p−, p+),

where T ∼ tn−2 .

44 / 73
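Note (added sketch, in the setting of EX. 6 below): the test of H0: β1 = 0 with a two-sided alternative, on the EX. 2 data, assuming SciPy.

```python
import numpy as np
from scipy import stats

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
se_b1 = np.sqrt(mse / sxx)

t_star = (b1 - 0) / se_b1                    # test statistic for H0: beta1 = 0
p_plus = stats.t.sf(t_star, df=n - 2)        # P(T > t*)
p_minus = stats.t.cdf(t_star, df=n - 2)      # P(T < t*)
p_two_sided = 2 * min(p_plus, p_minus)       # two-sided p-value
print(t_star, p_two_sided)
```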
Example
EX. 6. For the data in Ex. 3., parameter estimates are

β̂0 =

β̂1 =

, and their estimated standard errors are

σ̂(β̂0 ) =

σ̂(β̂1 ) =

45 / 73
Example
a) Perform a hypothesis test to test whether β0 = 0 or not.

46 / 73
Example
b) Perform a hypothesis test to test whether β1 = 0 or not.

47 / 73
Estimation of The Mean Response

• The mean response at X = x0 is µ0 = β0 + β1 x0 .


• The point estimator of µ0 is Ŷ0 = β̂0 + β̂1 x0 .
• A level L confidence interval for µ0 is

$$\left(\hat{Y}_0 - \hat{\sigma}(\hat{Y}_0)\, t_{n-2,\frac{1+L}{2}},\ \ \hat{Y}_0 + \hat{\sigma}(\hat{Y}_0)\, t_{n-2,\frac{1+L}{2}}\right),$$

where

$$\hat{\sigma}(\hat{Y}_0) = \sqrt{\text{MSE}\left[\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$$

48 / 73
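Note (added sketch, in the setting of EX. 7 on the next slide, x0 = 100): a confidence interval for the mean response, again on the EX. 2 data with SciPy assumed.

```python
import numpy as np
from scipy import stats

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

x0 = 100
y0_hat = b0 + b1 * x0                                        # point estimate of mu_0
se_y0 = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / sxx))  # sigma_hat(Y0_hat)
t_crit = stats.t.ppf(0.975, df=n - 2)                        # 95% level
print((y0_hat - t_crit * se_y0, y0_hat + t_crit * se_y0))
```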
Example
EX. 7. Consider the data in EX. 3. Compute a 95% confidence
interval for the mean wear of tools at 100 minutes.
Point estimate =

Estimated standard error=

t- value=

49 / 73
Prediction of a Future Observation

A level L prediction interval for a future observation at X = x0 is

$$\left(\hat{Y}_{new} - \hat{\sigma}(Y_{new} - \hat{Y}_{new})\, t_{n-2,\frac{1+L}{2}},\ \ \hat{Y}_{new} + \hat{\sigma}(Y_{new} - \hat{Y}_{new})\, t_{n-2,\frac{1+L}{2}}\right)$$

where

$$\hat{\sigma}(Y_{new} - \hat{Y}_{new}) = \sqrt{\text{MSE}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$$

50 / 73
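Note (added sketch, in the setting of EX. 8 on the next slide, x0 = 110): a prediction interval for a single future observation; the only change from the mean-response interval is the extra "1 +" inside the square root.

```python
import numpy as np
from scipy import stats

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

x0 = 110
y_new_hat = b0 + b1 * x0                                           # point predictor
se_pred = np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))  # sigma_hat(Y_new - Y_new_hat)
t_crit = stats.t.ppf(0.975, df=n - 2)
print((y_new_hat - t_crit * se_pred, y_new_hat + t_crit * se_pred))
```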
Example
EX. 8. Consider the data in EX. 3. Compute a 95% prediction
interval for the wear of a tool used for 110 minutes.
Point predictor = ŷnew =

Estimated standard error of prediction is

t- value=

51 / 73
The Relation Between Correlation and Regression

If the standardized responses and regressors are

$$Y_i' = \frac{Y_i - \bar{Y}}{S_Y} \quad \text{and} \quad X_i' = \frac{X_i - \bar{X}}{S_X},$$

then the regression equation fitted by least squares can be written
as
$$\hat{Y}' = r\, X',$$
where X' is any value of a regressor variable standardized as
described above, and r is the Pearson correlation between X and Y.

52 / 73
The Relation Between Correlation and Regression
Since the standardized responses and regressors are

$$\hat{Y}_i' = \frac{\hat{Y}_i - \bar{Y}}{S_Y} \quad \text{and} \quad X_i' = \frac{X_i - \bar{X}}{S_X},$$

we get
$$\frac{\hat{Y}_i - \bar{Y}}{S_Y} = r\, \frac{X_i - \bar{X}}{S_X}.$$

This gives
$$\hat{Y} = \left(\bar{Y} - r\frac{S_Y}{S_X}\bar{X}\right) + r\frac{S_Y}{S_X}\, X.$$

This implies
$$\hat{\beta}_1 = r\frac{S_Y}{S_X}, \qquad \hat{\beta}_0 = \bar{Y} - r\frac{S_Y}{S_X}\bar{X} = \bar{Y} - \hat{\beta}_1\bar{X}.$$

53 / 73
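Note (added check, not part of the slides): the identity β̂1 = r·SY/SX is easy to verify numerically on the EX. 2 data.

```python
import numpy as np

x = np.array([130, 120, 100, 75, 70, 55, 42, 30, 19], dtype=float)
y = np.array([0.09, 0.09, 0.07, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])

r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)   # beta1_hat = r * S_Y / S_X
b1_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(b1_from_r, b1_direct)                     # the two values should agree
```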
Example
EX. 9. For the data in EX.4,
X̄ = , SX = ,
Ȳ = , SY = ,
r= ,

so
β̂1 =

β̂0 =

54 / 73
The Relationship Between Two Categorical Variables (sec
7.4)

Recall that categorical variables define categories such as religion


or gender.

Analysis of categorical data is based on counts, proportions or


percentages of data that fall into the various categories defined by
the variables.

We will consider two tools used to analyze bivariate categorical
data:
1. Mosaic Plots 2. Two-Way Tables

55 / 73
Example

e.g.: A survey on academic dishonesty was conducted among WPI
students in 1993 and again in 1996. One question asked students
to respond to the statement ”Under some circumstances academic
dishonesty is justified.” Possible responses were ”Agree”,
”Disagree” and ”Strongly disagree”.

56 / 73
Example
Here is a mosaic plot of the 1993 results:

57 / 73
Two-Way Tables

Here is a two-way table of the same data:


1993 Survey
Dishonesty Sometimes Justified

Frequency
Percent
Row Pct. Strongly
Col Pct. Agree Disagree Disagree Total

Male 31 71 30 132
17.32 39.66 16.76 73.74
23.48 53.79 22.73
65.96 79.78 69.77
Gender

Female 16 18 13 47
8.94 10.06 7.26 26.26
34.04 38.30 27.66
34.04 20.22 30.23

Total 47 89 43 179
26.26 49.72 24.02 100.00

58 / 73
Inference for Categorical Data with Two Categories

Methods for comparing two proportions can be used (estimation


from Chapter 5 and hypothesis tests from Chapter 6).

59 / 73
Example 3, Continued

Suppose we want to compare pm, the population proportion of all
male college students who disagree or strongly disagree with the
statement “Under some circumstances academic dishonesty is
justified,” with pf, the proportion of all female college students
who do.

Assuming the WPI sample is a random sample of all students (a
big if), we can do so using a confidence interval for pf − pm
(Chapter 5) or a hypothesis test of H0: pf − pm = 0 versus
Ha: pf − pm ≠ 0 (Chapter 6).

60 / 73
Inference for Categorical Data with More Than Two
Categories: One-Way Tables

Suppose the categorical variable has c categories, and that the
population proportion in category i is pi. To test

$$H_0: p_i = p_i^{(0)},\ i = 1, 2, \ldots, c$$
$$H_a: p_i \neq p_i^{(0)} \ \text{for at least one } i$$

for pre-specified values $p_i^{(0)}$, i = 1, 2, . . . , c, use the Pearson χ² statistic

$$X^2 = \sum_{i=1}^{c} \frac{(Y_i - n p_i^{(0)})^2}{n p_i^{(0)}},$$

where Yi is the observed frequency in category i, and n is the total
number of observations.

61 / 73
Inference for Categorical Data with More Than Two
Categories: One-Way Tables

Note that for each category the Pearson statistic computes
(observed − expected)²/expected (noting that we assume H0 is true,
and under this assumption the expected number in category i is
$n p_i^{(0)}$), and sums over all categories.

If there are sufficient data (guideline: the expected number in each
category is at least 5), then under H0, $X^2 \sim \chi^2_{c-1}$. Therefore, if
x²* is the observed value of X², the p-value of the test is
$P(\chi^2_{c-1} \geq x^{2*})$.

62 / 73
Example 4

Defects in parts produced by an injection molding process are


monitored and recorded for quality control purposes. Historically,
parts have been rejected for five reasons. To investigate if recent
production mirrors the historical pattern, 200 rejected parts are
randomly sampled.
Reason for Rejection

                                   Black Spots  Scratches  Splay  Sinks  Flow Lines  Total
Historical proportion p_i^(0)             0.50       0.20   0.15   0.10        0.05   1.00
Expected (n p_i^(0))                       100         40     30     20          10   n = 200
Observed (y_i)                              88         47     31     29           5   n = 200
(y_i − n p_i^(0))² / (n p_i^(0))         1.440      1.225  0.033  4.050       2.500   x²* = 9.248
Pearson residual (y_i − n p_i^(0))/√(n p_i^(0))   −1.200   1.107   0.182   2.012   −1.581

The p-value is P(χ²₄ ≥ 9.248) = 0.0552.

63 / 73
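Note (added sketch, not part of the original example): SciPy's goodness-of-fit routine reproduces this Pearson χ² computation from the observed counts and the historical proportions.

```python
from scipy import stats

observed = [88, 47, 31, 29, 5]               # observed rejects by reason (Example 4)
historical = [0.50, 0.20, 0.15, 0.10, 0.05]  # historical proportions p_i^(0)
n = sum(observed)
expected = [n * p for p in historical]       # expected counts n * p_i^(0)

x2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(x2, p_value)   # should reproduce X^2 = 9.248 and p-value ~ 0.055
```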
Inference for Categorical Data with More Than Two
Categories: Two-Way Tables

Suppose a population is partitioned into r × c categories,


determined by r levels of variable 1 and c levels of variable 2.

The population proportion for level i of variable 1 and level j of


variable 2 is pij . This information can be displayed in the following
r × c table:

64 / 73
Inference for Two-Way Tables

                              Column
row          1      2     ...     c      Marginals
1           p11    p12    ...    p1c       p1·
2           p21    p22    ...    p2c       p2·
...         ...    ...    ...    ...       ...
r           pr1    pr2    ...    prc       pr·
Marginals   p·1    p·2    ...    p·c        1

65 / 73
Inference for Two-Way Tables

Suppose we want to test

H0 : row and column variables are independent


Ha : row and column variables are not independent

To do so, we select a random sample of size n from the population.

66 / 73
Inference for Two-Way Tables

Suppose the table of observed frequencies is

                              Column
row          1      2     ...     c      Totals
1           Y11    Y12    ...    Y1c       Y1·
2           Y21    Y22    ...    Y2c       Y2·
...         ...    ...    ...    ...       ...
r           Yr1    Yr2    ...    Yrc       Yr·
Totals      Y·1    Y·2    ...    Y·c        n

67 / 73
Inference for Two-Way Tables

Under H0 the expected cell frequency for the ij cell is given by

$$E_{ij} = \frac{\text{row } i \text{ total} \times \text{column } j \text{ total}}{\text{sample size}} = \frac{Y_{i\cdot}\, Y_{\cdot j}}{n} = n\,\hat{p}_{i\cdot}\,\hat{p}_{\cdot j}$$

where $\hat{p}_{i\cdot} = Y_{i\cdot}/n$ and $\hat{p}_{\cdot j} = Y_{\cdot j}/n$.

68 / 73
Inference for Two-Way Tables

To measure the deviations of the observed frequencies from the
expected frequencies under the assumption of independence, we
construct the Pearson χ² statistic

$$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(Y_{ij} - E_{ij})^2}{E_{ij}}$$

If H0 is true, X² has (approximately) a $\chi^2_{(r-1)(c-1)}$ distribution.

69 / 73
Example

EX. 10. A polling firm surveyed 269 American adults concerning


how leisure time is spent in the home. One question asked them to
select which of five leisure activities they were most likely to
partake in on a weeknight. The results are broken down by age
group in the following table:

70 / 73
Example

Frequency
Expected
Residual
Activity
Age Group     Watch TV   Read   Listen to Music   Surf the Web   Play Computer Game   Totals
18-25 21 3 9 10 19 62
(19.13) (9.22) (11.52) (11.75) (10.37)
(+0.43) (−2.05) (−0.74) (−0.51) (+2.68)
26-35 17 5 6 8 13 49
(15.12) (7.29) (9.11) (9.29) (8.20)
(0.48) (−0.85) (−1.03) (−0.42) (1.68)
36-50 14 8 8 12 9 51
(15.74) (7.58) (9.48) (9.67) (8.53)
(−0.44) (0.15) (−0.48) (0.75) (0.16)
51-65 18 10 11 10 3 52
(16.04) (7.73) (9.67) (9.86) (8.70)
(0.49) (0.82) (0.43) (0.04) (−1.93)
Over 65 13 14 16 11 1 55
(16.97) (8.18) (10.22) (10.43) (9.20)
(−0.96) (2.04) (1.81) (0.18) (−2.70)
Totals 83 40 50 51 45 269

71 / 73
Example

As an example of the computation of table entries, consider the


entries in the (1,2) cell (age: 18-25, activity: Read), in which the
observed frequency is 3. The marginal number in the 18-25 bracket
is 62, while the marginal number in the Read bracket is 40, so
p̂1· = [ ], while p̂·2 = [ ], so the
expected number in the (1,2) cell is [ ].
The Pearson residual is [ ].

The value of the χ² statistic is 38.91, which is computed as the
sum of the squares of the Pearson residuals. Comparing this with
the χ²₁₆ distribution, we get a p-value of 0.0011.

72 / 73
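Note (added sketch, not part of the original example): the same expected counts and the χ² statistic for the EX. 10 table can be obtained with SciPy's chi2_contingency.

```python
import numpy as np
from scipy import stats

# Observed frequencies from EX. 10 (rows: age groups, columns: activities)
table = np.array([[21,  3,  9, 10, 19],
                  [17,  5,  6,  8, 13],
                  [14,  8,  8, 12,  9],
                  [18, 10, 11, 10,  3],
                  [13, 14, 16, 11,  1]])

x2, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print(x2, df, p_value)     # should be close to 38.91 with 16 df and p ~ 0.0011
print(expected.round(2))   # expected counts E_ij = (row total)(column total)/n
```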
Association is NOT Cause and Effect

Two variables may be associated due to a number of reasons, such


as:
1. X could cause Y .
2. Y could cause X.
3. X and Y could cause each other.
4. X and Y could be caused by a third (lurking) variable Z.
5. X and Y could be related by chance.

73 / 73
