LECTURE 2: Straight-line Regression Analysis

1 Preview

• Regression analysis: a statistical tool for exploring the relationship of one or more independent variables $X_1, X_2, \ldots, X_k$ to a single, continuous dependent variable $Y$.

• You seek a quantitative formula or equation to describe (e.g., predict) the dependent variable $Y$ as a function of the independent variables $X_1, X_2, \ldots, X_k$. For example, a quantitative formula may be desired for a study of the effect of the dosage of a blood-pressure-reducing treatment ($X_1$) on blood pressure change ($Y$), controlling for age ($X_2$) and weight ($X_3$).

For the present, we assume that our goal is to predict a future response variable. We will consider the case in which we observe a single response and one or more covariates. (For most of the course, we assume that the covariates are fixed and known.)


2 Basic Questions to Be Answered

1. What is the most appropriate mathematical model to use: a straight line, a parabola, a log function, or something else?

2. Given a specific model, what do we mean by, and how do we determine, the best-fitting model for the data? In other words, if our model is a straight line, how do we find the best-fitting line?

3 General Strategy

General strategy used to study the relationship:

• forward method

• backward method

• using a model suggested from experience or theory

A step-by-step description of the forward strategy:

1. Assume that a straight-line model is appropriate. Later we will investigate the validity of this assumption.

2. Find the best-fitting straight line.

3. Assess whether the fitted line helps to describe Y.

4. Determine whether the assumed model is appropriate.



4 Statistical Assumptions for a Straight-line Model

Assumption 1: Existence - For any fixed value of the variable X, Y is a random variable with a certain probability distribution having finite mean and variance. That is, we observe values of random variables with finite variance. In practice, considering a finite number of subjects ensures that this assumption holds. The (population) mean is denoted as $\mu_{Y|X}$ and the (population) variance as $\sigma^2_{Y|X}$. The notation "$Y|X$" indicates that the mean and the variance of the random variable Y depend on the value of X.

Assumption 2: Independence - The Y-values are statistically independent of one another.

Assumption 3: Linearity - The mean value of Y, $\mu_{Y|X}$, is a straight-line function of X:

$$\mu_{Y|X} = \beta_0 + \beta_1 X$$

where $\beta_0$ and $\beta_1$ are the intercept and the slope of this (population) straight line, respectively. Equivalently,

$$Y = \beta_0 + \beta_1 X + E$$


where E denotes a random variable having mean 0 at each fixed X. E is commonly referred to as the error component in the model. Mathematically, E is given by the formula

$$E = Y - (\beta_0 + \beta_1 X)$$

or equivalently by

$$E = Y - \mu_{Y|X}$$

Assumption 4: Homoscedasticity - The variance of Y is the same for any X:

$$\sigma^2_{Y|X} \equiv \sigma^2$$

Assumption 5: Normality - For any fixed value of X, Y has a normal distribution.
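
To see Assumptions 1-5 at work together, here is a minimal simulation sketch of my own (not from the lecture; the parameter values beta0 = 100, beta1 = 1, and sigma = 15 are made up for illustration) that generates data satisfying the straight-line model:

/*Simulate data under the straight-line model (illustration only)*/
data sim;
call streaminit(2022);                /* fix the random seed for reproducibility */
beta0 = 100; beta1 = 1; sigma = 15;   /* assumed (made-up) population parameters */
do i = 1 to 30;
x = 15 + 50 * rand('uniform');        /* covariate values, treated as fixed and known */
e = rand('normal', 0, sigma);         /* independent, homoscedastic, normal errors */
y = beta0 + beta1 * x + e;            /* linearity: the mean of Y at X is beta0 + beta1*X */
output;
end;
keep x y;
run;

Regressing y on x in this simulated dataset should recover estimates close to the assumed parameter values.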

Remarks:

• Y: a random variable; an observation of it yields a particular value or "realization".

• X: assumed to be measured without error.

• $\beta_0$, $\beta_1$: unknown parameters


• E: a random, unobservable variable

• Using some estimation procedure (e.g., least squares), we obtain point estimates $\hat\beta_0$ and $\hat\beta_1$ of $\beta_0$ and $\beta_1$. Then a point estimate of E at the value X can be calculated:

$$\hat{E} = Y - \hat{Y} = Y - (\hat\beta_0 + \hat\beta_1 X)$$

The estimated error $\hat{E}$ is typically called a residual.

5 Determining the Best-fitting Straight Line

5.1 The Least-squares Method

Let $\hat{Y}_i$ denote the estimated response at $X_i$ based on the fitted regression line: $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$, where $\hat\beta_0$ and $\hat\beta_1$ are the intercept and the slope of the fitted line. The sum of squares of the distances between the observed points and the corresponding points on the fitted line is

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$

The minimum sum of squares corresponding to the least-squares estimates $\hat\beta_0$ and $\hat\beta_1$ is usually called the sum of squares about the regression line, the residual sum of squares, or the sum of squares due to error (SSE). Mathematically, if $\beta_0^*$ and $\beta_1^*$ denote any other possible estimators of $\beta_0$ and $\beta_1$, we must have

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 \le \sum_{i=1}^{n} (Y_i - \beta_0^* - \beta_1^* X_i)^2$$

5.2 Solution to the Best-fit Problem

$\bar{Y}$: the sample mean of the observations on Y
$\bar{X}$: the sample mean of the values of X

The best-fitting straight line is determined by the formulas

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$

The least-squares line can be represented by

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X = \bar{Y} + \hat\beta_1 (X - \bar{X})$$
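
As a sanity check on these formulas, the slope and intercept can be computed by hand in a DATA step and compared with the PROC REG results below. This is a minimal sketch of my own (not part of the original lecture code), assuming the example dataset stat.sbp with sbp as Y and age as X:

/*Step 1: sample means of X and Y*/
proc means data=stat.sbp noprint;
var age sbp;
output out=mns mean=xbar ybar;
run;

/*Step 2: accumulate cross-products and compute the estimates*/
data _null_;
if _n_ = 1 then set mns;             /* xbar and ybar are retained across rows */
set stat.sbp end=last;
sxy + (age - xbar) * (sbp - ybar);   /* sum of (Xi - Xbar)(Yi - Ybar) */
sxx + (age - xbar)**2;               /* sum of (Xi - Xbar)^2 */
if last then do;
b1 = sxy / sxx;                      /* slope estimate */
b0 = ybar - b1 * xbar;               /* intercept estimate */
put 'Slope b1 = ' b1 '  Intercept b0 = ' b0;
end;
run;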

5.3 Example

We use the same example as described in the Review lecture.

/*Descriptive statistics*/
proc means data=stat.sbp;
var sbp age;
run;


proc means data=stat.sbp n nmiss mean std min max maxdec=3;
class group;
var sbp age;
run;

/*Independent T-test*/
proc ttest data=stat.sbp;
class group;
var sbp;
run;

/*Scatterplot*/
goptions reset=all htext=2;
symbol1 color=black value=dot w=3 ;
axis1 order=(10 to 70 by 10) minor=none;
axis2 order=(100 to 240 by 20) minor=none ;
proc gplot data=stat.sbp;
plot sbp*age / haxis=axis1 vaxis=axis2;
run;
quit;


/*Simple linear regression*/
proc reg data=stat.sbp;
model SBP = age;
run;
quit;

The REG Procedure

Descriptive Statistics

                                      Uncorrected                  Standard
Variable          Sum         Mean            SS     Variance     Deviation
Intercept    30.00000      1.00000      30.00000            0             0
age        1354.00000     45.13333         67894    233.91264      15.29420
SBP        4276.00000    142.53333        624260    509.91264      22.58125

The REG Procedure
Model: MODEL1
Dependent Variable: SBP

Number of Observations Read    30
Number of Observations Used    30


Analysis of Variance

                                Sum of          Mean
Source             DF          Squares        Square    F Value    Pr > F
Model               1       6394.02269    6394.02269      21.33    <.0001
Error              28       8393.44398     299.76586
Corrected Total    29            14787

Root MSE           17.31375    R-Square    0.4324
Dependent Mean    142.53333    Adj R-Sq    0.4121
Coeff Var          12.14716

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1       98.71472      10.00047       9.87      <.0001
age            1        0.97087       0.21022       4.62      <.0001

5.4 Measure of the Quality of the Straight-line Fit and Estimate of $\sigma^2$

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

If SSE = 0, the straight line fits the data perfectly. The worse the fit, the larger SSE becomes. There are two possible factors behind an inflated SSE. First, there might be a lot of variation in the data; that is, $\sigma^2$ might be large. Second, the assumption of a straight-line model may not be appropriate for the data. Here, we will assume that the second factor is not the issue.


The estimate of $\sigma^2$ is

$$S^2_{Y|X} = \frac{1}{n-2}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{\mathrm{SSE}}{n-2}$$

If a straight-line model is appropriate, the population mean response $\mu_{Y|X}$ changes with X. Therefore, instead of subtracting $\bar{Y}$ from each $Y_i$ when estimating $\sigma^2$, we should subtract $\hat{Y}_i$ from $Y_i$, since $\hat{Y}_i$ is the estimate of $\mu_{Y|X}$. Furthermore, we subtract 2 from n because the determination of $\hat{Y}_i$ requires the estimation of two parameters, $\beta_0$ and $\beta_1$.
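
For the SBP example, the PROC REG output above gives SSE = 8393.44398 with n = 30, so $S^2_{Y|X} = 8393.44398/28 = 299.76586$, the Mean Square for Error in the ANOVA table, and $S_{Y|X} = \sqrt{299.76586} = 17.31375$, which SAS reports as the Root MSE.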

We summarize the formulas for inference-making procedures in the following.

• Note: $\mu_{Y|X} = \beta_0 + \beta_1 X$ is the assumed true regression model.

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X = \bar{Y} + \hat\beta_1 (X - \bar{X})$$

$$S^2_{Y|X} = \frac{1}{n-2}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

$$S^2_Y = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

$$S^2_X = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

$$S_{\hat\beta_1} = \frac{S_{Y|X}}{S_X\sqrt{n-1}}$$

$$S_{\hat\beta_0} = S_{Y|X}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{(n-1)S_X^2}}$$

$$S_{\hat{Y}_{X_0}} = S_{Y|X}\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$$
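
As a check against the SAS output: with $S_{Y|X} = 17.31375$, $S_X = 15.29420$, and $n = 30$, we get $S_{\hat\beta_1} = 17.31375/(15.29420\sqrt{29}) = 0.21022$, the standard error reported for age in the Parameter Estimates table.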

• $t_{n-2,1-\alpha/2}$ is the $100(1-\alpha/2)\%$ point of the t distribution with $n-2$ degrees of freedom. The $100(1-\alpha)\%$ confidence interval for each parameter:

$$\beta_1:\quad \hat\beta_1 \pm t_{n-2,1-\alpha/2}\, S_{\hat\beta_1}$$

$$\beta_0:\quad \hat\beta_0 \pm t_{n-2,1-\alpha/2}\, S_{\hat\beta_0}$$

$$\mu_{Y|X_0}:\quad \bar{Y} + \hat\beta_1(X_0 - \bar{X}) \pm t_{n-2,1-\alpha/2}\, S_{\hat{Y}_{X_0}}$$

$Y|X_0$ (prediction interval for a new individual's Y when $X = X_0$):

$$\bar{Y} + \hat\beta_1(X_0 - \bar{X}) \pm t_{n-2,1-\alpha/2}\, S_{Y|X}\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$$

• Tests of hypotheses:

– $H_0: \beta_1 = \beta_1^{(0)}$; test statistic $T = \dfrac{\hat\beta_1 - \beta_1^{(0)}}{S_{\hat\beta_1}}$

– $H_0: \beta_0 = \beta_0^{(0)}$; test statistic $T = \dfrac{\hat\beta_0 - \beta_0^{(0)}}{S_{\hat\beta_0}}$

– $H_0: \mu_{Y|X_0} = \mu_{Y|X_0}^{(0)}$; test statistic $T = \dfrac{\bar{Y} + \hat\beta_1(X_0 - \bar{X}) - \mu_{Y|X_0}^{(0)}}{S_{\hat{Y}_{X_0}}}$
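
Worked example from the SBP output: $\hat\beta_1 = 0.97087$ and $S_{\hat\beta_1} = 0.21022$. For $H_0: \beta_1 = 0$, $T = 0.97087/0.21022 = 4.62$, the t Value printed for age. With $t_{28,0.975} = 2.048$, the 95% confidence interval for $\beta_1$ is $0.97087 \pm 2.048 \times 0.21022 \approx (0.540, 1.401)$.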

5.5 Interpretation of Tests for Slope and Intercept

• Test for Zero Slope:

1. If $H_0: \beta_1 = 0$ is not rejected, one of the following is true:

– For a true underlying straight-line model, X provides little or no help in predicting Y; in other words, we can simply use $\bar{Y}$ to predict Y.

– The true relationship between Y and X is NOT linear.

Thus, if we cannot reject $H_0$, we conclude that a straight-line model in X is not the best model to use and does not provide much help for predicting Y.

2. If $H_0: \beta_1 = 0$ is rejected: we conclude that a straight-line model in X is better than a model that does not include X at all.

• Test for Zero Intercept: this hypothesis is rarely of interest in most studies.

5.6 Inferences About the Regression Line

For a given $X = X_0$, we may want to find a confidence interval for $\mu_{Y|X_0}$ (the mean value of Y at $X_0$) or to test the hypothesis $H_0: \mu_{Y|X_0} = \mu_{Y|X_0}^{(0)}$. The test statistic is

$$T = \frac{\hat{Y}_{X_0} - \mu_{Y|X_0}^{(0)}}{S_{\hat{Y}_{X_0}}}$$

where $\hat{Y}_{X_0} = \hat\beta_0 + \hat\beta_1 X_0 = \bar{Y} + \hat\beta_1(X_0 - \bar{X})$ is the predicted value of Y at $X_0$ and

$$S_{\hat{Y}_{X_0}} = S_{Y|X}\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$$


T has the t distribution with $n-2$ degrees of freedom when $H_0$ is true. The corresponding confidence interval for $\mu_{Y|X_0}$ at $X = X_0$ is

$$\hat{Y}_{X_0} \pm t_{n-2,1-\alpha/2}\, S_{\hat{Y}_{X_0}}$$

Output showing confidence and prediction intervals:

/*Simple linear regression showing more detailed results*/
goptions reset=all htext=2;
symbol1 color=black value=dot w=2 ;
symbol2 color=black w=2 ;
symbol3 color=red;
symbol4 color=red;
symbol5 color=blue;
symbol6 color=blue;
axis1 order=(10 to 70 by 10) minor=none;
axis2 order=(100 to 240 by 20) minor=none ;
proc reg data=stat.sbp simple;
model SBP = age / p CLM CLI;  /* p: predicted values; CLM: 95% CI for the mean; CLI: 95% prediction interval */
plot sbp*age / conf pred;     /* overlay confidence and prediction bands on the scatterplot */
title 'Results of Regression Analysis';
output out=results
predicted=pred
residual=resid
L95M=lowmean
U95M=highmean
L95=lowpred
U95=highpred;
run;
quit;
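
The OUTPUT statement saves the fitted values, residuals, and interval limits in the dataset results. A quick way to inspect them (a minimal sketch of my own, not from the lecture) is:

proc print data=results noobs;
var sbp age pred resid lowmean highmean lowpred highpred;
run;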

Output Statistics

Obs   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean (lower, upper)   95% CL Predict (lower, upper)   Residual
1 144.0000 136.5787 3.4139 129.5857 143.5717 100.4302 172.7271 7.4213
2 220.0000 144.3456 3.1853 137.8208 150.8704 108.2848 180.4064 75.6544
3 138.0000 142.4039 3.1612 135.9285 148.8792 106.3520 178.4558 -4.4039
4 145.0000 144.3456 3.1853 137.8208 150.8704 108.2848 180.4064 0.6544
5 162.0000 161.8213 5.2377 151.0923 172.5502 124.7684 198.8742 0.1787
6 142.0000 143.3748 3.1663 136.8889 149.8606 107.3210 179.4285 -1.3748
7 170.0000 163.7630 5.5787 152.3356 175.1905 126.5018 201.0242 6.2370
8 124.0000 139.4913 3.2289 132.8771 146.1055 103.4142 175.5684 -15.4913
9 158.0000 163.7630 5.5787 152.3356 175.1905 126.5018 201.0242 -5.7630
10 154.0000 153.0835 3.9001 145.0946 161.0724 116.7292 189.4377 0.9165
11 162.0000 160.8504 5.0717 150.4616 171.2393 123.8945 197.8063 1.1496
12 150.0000 153.0835 3.9001 145.0946 161.0724 116.7292 189.4377 -3.0835
13 140.0000 155.9961 4.2999 147.1881 164.8041 119.4531 192.5391 -15.9961
14 110.0000 131.7243 3.9332 123.6676 139.7810 95.3551 168.0935 -21.7243
15 128.0000 139.4913 3.2289 132.8771 146.1055 103.4142 175.5684 -11.4913
16 130.0000 145.3165 3.2180 138.7248 151.9082 109.2435 181.3895 -15.3165
17 135.0000 142.4039 3.1612 135.9285 148.8792 106.3520 178.4558 -7.4039
18 114.0000 115.2195 6.7058 101.4832 128.9558 77.1867 153.2523 -1.2195
19 116.0000 118.1321 6.1568 105.5204 130.7439 80.4909 155.7734 -2.1321
20 124.0000 117.1613 6.3382 104.1781 130.1444 79.3939 154.9286 6.8387
21 136.0000 133.6661 3.6984 126.0901 141.2420 97.4003 169.9318 2.3339
22 142.0000 147.2582 3.3225 140.4525 154.0640 111.1455 183.3709 -5.2582
23 120.0000 136.5787 3.4139 129.5857 143.5717 100.4302 172.7271 -16.5787
24 120.0000 119.1030 5.9774 106.8588 131.3472 81.5833 156.6227 0.8970
25 160.0000 141.4330 3.1700 134.9395 147.9265 105.3779 177.4882 18.5670
26 158.0000 150.1708 3.5675 142.8632 157.4785 113.9602 186.3815 7.8292
27 144.0000 159.8796 4.9090 149.8238 169.9353 123.0159 196.7432 -15.8796
28 130.0000 126.8700 4.6362 117.3731 136.3668 90.1549 163.5851 3.1300
29 125.0000 122.9865 5.2825 112.1657 133.8072 85.9069 160.0661 2.0135
30 175.0000 165.7048 5.9299 153.5579 177.8517 128.2167 203.1929 9.2952
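
As a check, the interval for observation 1 follows from the earlier formula: $\hat{Y}_{X_0} = 136.5787$ with $S_{\hat{Y}_{X_0}} = 3.4139$, so the 95% confidence interval for the mean is $136.5787 \pm 2.048 \times 3.4139 \approx (129.59, 143.57)$, matching the 95% CL Mean columns.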


6 Prediction of a New Value of Y at X0

The prediction band is

$$\hat{Y}_{X_0} \pm t_{n-2,1-\alpha/2}\, S_{Y|X}\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$$

In predicting an actual observed Y for a given individual, two sources of error operate: the individual error as measured by $\sigma^2$, and the error in estimating $\mu_{Y|X_0}$ using $\hat{Y}_{X_0}$. We express this with the following equation:

$$Y - \hat{Y}_{X_0} = (Y - \mu_{Y|X_0}) + (\mu_{Y|X_0} - \hat{Y}_{X_0})$$

Left-hand side of the equation: error in predicting an individual's Y at $X_0$.
First term on the right-hand side: deviation of the individual's Y from the true mean at $X_0$.
Second term on the right-hand side: deviation of $\hat{Y}_{X_0}$ from the true mean at $X_0$.

We can then write the variance of an individual's predicted response at $X_0$ from the above expression:

$$\mathrm{Var}(Y) + \mathrm{Var}(\hat{Y}_{X_0}) = \sigma^2\left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)S_X^2}\right)$$
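
Numerically, for observation 1 in the output above: the standard error for prediction is $\sqrt{S^2_{Y|X} + S^2_{\hat{Y}_{X_0}}} = \sqrt{299.766 + 3.4139^2} = 17.647$, so the 95% prediction interval is $136.5787 \pm 2.048 \times 17.647 \approx (100.43, 172.73)$, matching the 95% CL Predict columns and much wider than the corresponding confidence interval for the mean.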
