
18-1 4/9/2005 11:38 AM

Principles of Biostatistics
Simple Linear Regression

PPT based on Dr. Chuanhua Yu and Wikipedia
Department of Epidemiology and Health Statistics, Tongji Medical College http://statdtedm.6to23 (Dr. Chuanhua Yu)

Terminology
Moments, skewness, kurtosis
Analysis of variance (ANOVA)
Response (dependent) variable
Explanatory (independent) variable
Linear regression model
Method of least squares
Normal equations
Sum of squares, error (SSE)
Sum of squares, regression (SSR)
Sum of squares, total (SST)
Coefficient of determination (R²)
F value, P value, t-test, F-test
Homoscedasticity
Heteroscedasticity

Contents
• 18.0 Normal distribution and terms
• 18.1 An Example
• 18.2 The Simple Linear Regression Model
• 18.3 Estimation: The Method of Least Squares
• 18.4 Error Variance and the Standard Errors of Regression Estimators
• 18.5 Confidence Intervals for the Regression Parameters
• 18.6 Hypothesis Tests about the Regression Relationship
• 18.7 How Good is the Regression?
• 18.8 Analysis of Variance Table and an F Test of the Regression Model
• 18.9 Residual Analysis
• 18.10 Prediction Interval and Confidence Interval


Normal Distribution

The continuous probability density function of the normal distribution is the Gaussian function

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

where σ > 0 is the standard deviation, the real parameter μ is the expected value, and

φ(z) = (1 / √(2π)) · exp(−z² / 2)

is the density function of the "standard" normal distribution: i.e., the normal distribution with μ = 0 and σ = 1.
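As a quick numeric check, the density can be evaluated directly. A minimal sketch (the function name is mine), cross-checked against Python's built-in `statistics.NormalDist`:

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Standard normal density at 0 is 1/sqrt(2*pi)
print(round(normal_pdf(0.0), 4))  # 0.3989

# Agrees with the standard-library implementation
print(math.isclose(normal_pdf(1.5, 2.0, 0.5), NormalDist(2.0, 0.5).pdf(1.5)))  # True
```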


Normal Distribution

[Figure: normal density curves for several values of μ and σ (slide contains only a figure).]


Moment About The Mean

The kth moment about the mean (or kth central moment) of a real-valued random variable X is the quantity

μk = E[(X − E[X])^k],

where E is the expectation operator. For a continuous univariate probability distribution with probability density function f(x), the kth moment about the mean μ is

μk = ∫ (x − μ)^k f(x) dx.

The first moment about zero, if it exists, is the expectation of X, i.e. the mean of the probability distribution of X, designated μ. In higher orders, the central moments are more interesting than the moments about zero.

μ1 is 0.
μ2 is the variance, the positive square root of which is the standard deviation, σ.
μ3/σ³ is the skewness, often written γ1.
μ4/σ⁴ − 3 is the (excess) kurtosis.

Skewness
Consider the distribution in the figure. The bars on the right side of the distribution taper differently than the bars on the left side. These tapering sides are called tails, and they provide a visual means for determining which of the two kinds of skewness a distribution has:
1. Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed.
2. Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed.


Skewness

Skewness, the third standardized moment, is written γ1 and defined as

γ1 = μ3 / σ³

where μ3 is the third moment about the mean and σ is the standard deviation.

For a sample of n values, the sample skewness is

g1 = [ (1/n) Σ (xi − x̄)³ ] / s³

where x̄ is the sample mean and s is the sample standard deviation.


Kurtosis
Kurtosis is the degree of peakedness of a distribution. A normal distribution is a
mesokurtic distribution. A pure leptokurtic distribution has a higher peak than the
normal distribution and has heavier tails. A pure platykurtic distribution has a lower
peak than a normal distribution and lighter tails.


Kurtosis
The fourth standardized moment is defined as

β2 = μ4 / σ⁴

where μ4 is the fourth moment about the mean and σ is the standard deviation.

For a sample of n values, the sample kurtosis is

b2 = [ (1/n) Σ (xi − x̄)⁴ ] / s⁴
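The central-moment definitions above translate directly into code. A minimal sketch using population (1/n) moments; the helper names are mine:

```python
def central_moment(xs, k):
    """k-th central moment: (1/n) * sum((x - mean)^k)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """g1 = mu3 / sigma^3, with sigma = sqrt(mu2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """mu4 / sigma^4 - 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(round(skewness(data), 3))       # 0.656 (right-skewed sample)
print(skewness([1.0, 2.0, 3.0]))      # 0.0 (symmetric sample)
```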


18.1 An Example

Table 18.1  IL-6 levels in serum and brain (pg/ml) of 10 patients with subarachnoid hemorrhage

Patient i   Serum IL-6 (x)   Brain IL-6 (y)
 1              22.4             134.0
 2              51.6             167.0
 3              58.1             132.3
 4              25.1              80.2
 5              65.9             100.0
 6              79.7             139.1
 7              75.3             187.2
 8              32.4              97.2
 9              96.4             192.3
10              85.7             199.4


Scatterplot
This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that:

Larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6.

The scatter of points tends to be distributed around a positively sloped straight line.

The pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.


Examples of Other Scatterplots

[Figure: six scatterplots of Y against X illustrating different possible relationships (positive linear, negative linear, curvilinear, and no relationship).]


Model Building
The inexact nature of the relationship between serum and brain suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

Statistical model = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.


18.2 The Simple Linear Regression Model

The population simple linear regression model:

y = α + βx + ε    or    μ(y|x) = α + βx

where α + βx is the nonrandom or systematic component and ε is the random component.

Here y is the dependent (response) variable, the variable we wish to explain or predict; x is the independent (explanatory) variable, also called the predictor variable; and ε is the error term, the only random component in the model, and thus the only source of randomness in y.

μ(y|x) is the mean of y when x is specified, also called the conditional mean of Y.

α is the intercept of the systematic component of the regression relationship.
β is the slope of the systematic component.


Picturing the Simple Linear Regression Model

The simple linear regression model posits an exact linear relationship between the expected (average) value of the dependent variable Y and the independent or predictor variable X:

μ(y|x) = α + βx

Actual observed values of Y (y) differ from the expected value (μ(y|x)) by an unexplained or random error (ε):

y = μ(y|x) + ε = α + βx + ε

[Figure: regression plot of the line μ(y|x) = α + βx, with intercept α, slope β, and the error ε marked for one observed point.]

Assumptions of the Simple Linear Regression Model

LINE assumptions of the simple linear regression model:
• The relationship between X and Y is a straight-Line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.
• The errors ε are uncorrelated (i.e., Independent) in successive observations.
• The errors ε are Normally distributed with mean 0 and variance σ² (Equal variance). That is: ε ~ N(0, σ²).

[Figure: identical normal distributions of errors, all centered on the regression line μ(y|x) = α + βx; for each x, Y ~ N(μ(y|x), σ²).]

18.3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

y = a + bx + e

where a estimates the intercept of the population regression line, α; b estimates the slope of the population regression line, β; and e stands for the observed errors, the residuals from fitting the estimated regression line a + bx to a set of n points.

The estimated regression line:

ŷ = a + bx

where ŷ (y-hat) is the value of Y lying on the fitted regression line for a given value of X.

Fitting a Regression Line

[Figure: four panels — the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]


Errors in Regression

ŷ = a + bx is the fitted regression line, and ŷi is the predicted value of Y for xi. The error for observation i is ei = yi − ŷi.

[Figure: scatter of points around the fitted line ŷ = a + bx, with the error ei = yi − ŷi marked for one point (xi, yi).]


Least Squares Regression

The sum of squared errors in regression is:

SSE = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)²    (SSE: sum of squared errors)

The least squares regression line is the one that minimizes the SSE with respect to the estimates a and b.

[Figure: SSE as a parabolic (bowl-shaped) function of a and b, minimized at the least squares values of a and b.]


Normal Equation
S is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

∂S/∂βj = 2 Σi ri (∂ri/∂βj)    (j = 1, …, m)

Since ri = yi − Σk Xik βk, the derivatives are

∂ri/∂βj = −Xij

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

Σi Σk Xij Xik β̂k = Σi Xij yi    (j = 1, …, m)

Upon rearrangement, the normal equations are obtained. The normal equations are written in matrix notation as

(XᵀX) β̂ = Xᵀy

The solution of the normal equations yields the vector of the optimal parameter values.


Normal Equation

Y = XB + U,   U ~ N(0, σ²I),   Ŷ = XB̂,   E = Y − Ŷ = Y − XB̂

Q = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)² = e′e = (Y − XB̂)′(Y − XB̂)

Q = (Y′ − B̂′X′)(Y − XB̂)
  = Y′Y − Y′XB̂ − B̂′X′Y + B̂′X′XB̂    (note Y′XB̂ = B̂′X′Y, a scalar)
  = Y′Y − 2B̂′X′Y + B̂′X′XB̂

∂Q/∂B̂ = 0  ⟹  −X′Y + X′XB̂ = 0

B̂ = (X′X)⁻¹ X′Y,    σ̂² = e′e / (n − k − 1)
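For simple regression the normal equations (X′X)B̂ = X′Y reduce to a 2×2 system. A sketch solving that system directly for the Table 18.1 data, with no external libraries (the helper name is mine):

```python
def fit_normal_equations(xs, ys):
    """Solve (X'X) b = X'y for X = [[1, x_i]] by direct 2x2 elimination."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sx], [sx, sxx]],  X'y = [sy, sxy]
    det = n * sxx - sx * sx
    a = (sxx * sy - sx * sxy) / det   # intercept
    b = (n * sxy - sx * sy) / det     # slope
    return a, b

serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
a, b = fit_normal_equations(serum, brain)
print(round(a, 2), round(b, 2))  # 72.96 1.18
```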


Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

lxx = Σ(x − x̄)² = Σx² − (Σx)²/n
lyy = Σ(y − ȳ)² = Σy² − (Σy)²/n
lxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

b = lxy / lxx
a = ȳ − b·x̄
ŷ = a + bx


Example 18-1

Patient      x        y         x²          y²          x·y
 1          22.4    134.0      501.76     17956.0     3001.60
 4          25.1     80.2      630.01      6432.0     2013.02
 8          32.4     97.2     1049.76      9447.8     3149.28
 2          51.6    167.0     2662.56     27889.0     8617.20
 3          58.1    132.3     3375.61     17503.3     7686.63
 5          65.9    100.0     4342.81     10000.0     6590.00
 7          75.3    187.2     5670.09     35043.8    14096.16
 6          79.7    139.1     6352.09     19348.8    11086.27
10          85.7    199.4     7344.49     39760.4    17088.58
 9          96.4    192.3     9292.96     36979.3    18537.72
Total      592.6   1428.7    41222.14    220360.5    91866.46

lxx = Σx² − (Σx)²/n = 41222.14 − 592.6²/10 = 6104.66
lyy = Σy² − (Σy)²/n = 220360.47 − 1428.70²/10 = 16242.10
lxy = Σxy − (Σx)(Σy)/n = 91866.46 − (592.6)(1428.70)/10 = 7201.70

b = lxy / lxx = 7201.70 / 6104.66 = 1.18
a = ȳ − b·x̄ = 1428.7/10 − (1.18)(592.6/10) = 72.96

Regression equation:  ŷ = 72.96 + 1.18x
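The computing-table arithmetic above can be reproduced in a few lines. A sketch using the document's l-notation (variable names are mine):

```python
serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
n = len(serum)

# Corrected sums of squares and cross products
lxx = sum(x * x for x in serum) - sum(serum) ** 2 / n
lyy = sum(y * y for y in brain) - sum(brain) ** 2 / n
lxy = sum(x * y for x, y in zip(serum, brain)) - sum(serum) * sum(brain) / n

b = lxy / lxx                             # slope
a = sum(brain) / n - b * sum(serum) / n   # intercept

print(round(lxx, 2), round(lyy, 2), round(lxy, 2))  # 6104.66 16242.1 7201.7
print(round(a, 2), round(b, 2))                     # 72.96 1.18
```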


New Normal Distributions

B̂ = (X′X)⁻¹ X′Y

• Since each coefficient estimator is a linear combination of Y (normal random variables), each β̂j (j = 0, 1, …, k) is normally distributed.

• Notation: β̂j ~ N(βj, σ²cjj), where cjj is the jth-row, jth-column (diagonal) element of (X′X)⁻¹.

In the two-dimensional (simple regression) special case, for the slope, cjj = 1/lxx.

For j = 0 (the intercept) in the two-dimensional special case:

β̂0 = a ~ N(α, σ²(Σx²)/(n·lxx))

Total Variance and Error Variance

Total variance:  Σ(y − ȳ)² / (n − 1)  — what you see when looking at the total variation of Y.

Error variance:  Σ(y − ŷ)² / (n − 2)  — what you see when looking along the regression line at the error variance of Y.

18.4 Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression:

df = n − 2   (n total observations, less one degree of freedom for each parameter estimated, a and b)

Square and sum all regression errors to find SSE:

SSE = Σ(y − ŷ)² = lyy − lxy²/lxx = lyy − b·lxy

Example 18-1:
SSE = lyy − b·lxy = 16242.10 − (1.18)(7201.70) = 7746.23

An unbiased estimator of σ², denoted by s²:

Error variance:  MSE = SSE/(n − 2) = Σ(y − ŷ)²/(n − 2);  MSE = 7746.23/8 = 968.28

Standard error:  s = √MSE = √968.28 = 31.12
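The chain SSE → MSE → s can be sketched as follows, continuing from the l-values of Example 18-1 (the constants below are the slide's figures; the slope is recomputed to full precision rather than rounded to 1.18):

```python
import math

lyy, lxy, lxx = 16242.10, 7201.70, 6104.66
n = 10
b = lxy / lxx                # unrounded slope estimate

sse = lyy - b * lxy          # SSE = lyy - b*lxy
mse = sse / (n - 2)          # error variance, df = n - 2
s = math.sqrt(mse)           # standard error of estimate

print(round(sse, 1), round(mse, 1), round(s, 2))  # 7746.2 968.3 31.12
```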


Standard Errors of Estimates in Regression

The standard error of a (intercept):

sa = s · √( Σx² / (n·lxx) ),  where s = √MSE

The standard error of b (slope):

sb = s / √lxx

Example 18-1:

sa = 31.12 × √( 41222.14 / (10 × 6104.66) ) = 25.570

sb = 31.12 / √6104.66 = 0.398


T distribution

Student's t-distribution arises when the population standard deviation is unknown and has to be estimated from the data.


18.5 Confidence Intervals for the Regression Parameters

A 100(1 − α)% confidence interval for α:  a ± t(α/2, n−2) · sa
A 100(1 − α)% confidence interval for β:  b ± t(α/2, n−2) · sb

Example 18-1, 95% confidence intervals:

a ± t(0.05/2, 10−2) · sa = 72.961 ± (2.306)(25.570) = 72.961 ± 58.964 → [13.996, 131.925]

b ± t(0.05/2, 10−2) · sb = 1.180 ± (2.306)(0.398) = 1.180 ± 0.918 → [0.261, 2.098]
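A sketch of the interval arithmetic for Example 18-1. The critical value t(0.025, 8) = 2.306 is taken from a t table rather than computed, since the Python standard library has no inverse-t function; the inputs are the slide's rounded estimates:

```python
t_crit = 2.306              # t(0.05/2, df = 8), from a t table
a, s_a = 72.961, 25.570     # intercept estimate and its standard error
b, s_b = 1.180, 0.398       # slope estimate and its standard error

ci_a = (a - t_crit * s_a, a + t_crit * s_a)
ci_b = (b - t_crit * s_b, b + t_crit * s_b)

print([round(v, 3) for v in ci_a])   # [13.997, 131.925]
print([round(v, 3) for v in ci_b])   # [0.262, 2.098]
```

The third decimals differ slightly from the slide's [13.996, …] and [0.261, …] because the slide carries more digits internally.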


18.6 Hypothesis Tests about the Regression Relationship

[Figure: three scatterplots in which H0: β = 0 holds — constant Y, unsystematic variation, and a nonlinear relationship.]

A hypothesis test for the existence of a linear relationship between X and Y:

H0: β = 0
H1: β ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

tb = b / sb

where b is the least-squares estimate of the regression slope and sb is the standard error of b. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.


T-test

A t-test is a test of the null hypothesis that the means of two normally distributed populations are equal. Given two data sets, each characterized by its mean, standard deviation, and number of data points, we can use some kind of t test to determine whether the means are distinct, provided that the underlying distributions can be assumed to be normal. All such tests are usually called Student's t tests.

For a regression coefficient:

t = β̂j / s(β̂j) ~ t(n − k)


T-test

Example 18-1:

H0: β = 0,  H1: β ≠ 0,  α = 0.05

tb = b / sb = 1.180 / 0.398 = 2.962

p value = 0.018   (df = 10 − 2 = 8)

t(0.05/2, 8) = 2.306 < 2.962

H0 is rejected at the 5% level and we may conclude that there is a relationship between serum IL-6 and brain IL-6.
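The slope test for Example 18-1 in code, carrying the unrounded estimates through the whole chain; 2.306 is again the tabulated critical value:

```python
import math

lxx, lyy, lxy, n = 6104.66, 16242.10, 7201.70, 10
b = lxy / lxx
s = math.sqrt((lyy - b * lxy) / (n - 2))   # standard error of estimate
s_b = s / math.sqrt(lxx)                   # standard error of the slope
t_b = b / s_b

print(round(t_b, 3))    # 2.962
t_crit = 2.306          # t(0.05/2, 8) from a t table
print(t_b > t_crit)     # True -> reject H0: beta = 0 at the 5% level
```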


T test Table

[Table: critical values of Student's t distribution (figure omitted).]


18.7 How Good is the Regression?

The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all points:

Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
SST = SSE + SSR

R² = SSR/SST = 1 − SSE/SST : the percentage of total variation explained by the regression.

[Figure: one data point decomposed into total, unexplained, and explained deviations about ȳ and the fitted line.]


The Coefficient of Determination

[Figure: three scatterplots illustrating R² = 0, R² = 0.50, and R² = 0.90, with SST partitioned into SSE and SSR.]

Example 18-1:

R² = SSR/SST = b·lxy / lyy = (1.180 × 7201.70) / 16242.10 = 0.5231 = 52.31%

Another Test

• Earlier in this section you saw how to perform a t-test to compare a sample mean to an accepted value, or to compare two sample means. In this section, you will see how to use the F-test to compare two variances or standard deviations.

• When using the F-test, you again require a hypothesis, but this time it is to compare standard deviations. That is, you will test the null hypothesis H0: σ1² = σ2² against an appropriate alternate hypothesis.


F-test

The t-test is applied to each single parameter separately. To verify a combination of all the parameters jointly, we can use the F-test.

H0: β2 = β3 = … = βk = 0
H1: at least one of β2, β3, …, βk is non-zero

The formula for an F-test in multiple-comparison ANOVA problems is:

F = (between-group variability) / (within-group variability)

F = [SSR/(k − 1)] / [SSE/(n − k)] ~ F(k − 1, n − k)


F test table

[Table: critical values of the F distribution (figure omitted).]


18.8 Analysis of Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST

Example 18-1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p Value
Regression             8495.87         1                    8495.87       8.77      0.0181
Error                  7746.23         8                     968.28
Total                 16242.10         9
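The ANOVA-table entries for Example 18-1 follow mechanically from the l-values; note that F here equals the square of the slope's t statistic (t = 2.962, t² ≈ 8.77). A sketch:

```python
lyy, lxy, lxx, n = 16242.10, 7201.70, 6104.66, 10
b = lxy / lxx
ssr = b * lxy          # regression sum of squares
sse = lyy - ssr        # error sum of squares
msr = ssr / 1          # df_regression = 1
mse = sse / (n - 2)    # df_error = n - 2
f = msr / mse

print(round(ssr, 2), round(sse, 2))   # ~8495.88 and ~7746.22
print(round(f, 2))                    # 8.77
```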


F-test, T-test and R

1. In the 2D case, the F-test and the t-test are the same; it can be proved that F = t². So in the 2D case either the F or the t test is enough. This is not true for more variables.

2. The F-test and R² have the same purpose: to measure the whole regression. They are related as

F = [(n − k)/(k − 1)] · [R² / (1 − R²)]

3. The F-test is better than R² because it has a better metric, with known distributions for hypothesis testing.

Approach:
1. First the F-test. If passed, continue.
2. A t-test for every parameter; if some parameter cannot pass, we can delete it and re-evaluate the regression.
3. Note that we delete only one parameter at a time (the one with the least effect on the regression), until all remaining parameters have a strong effect.


18.9 Residual Analysis

[Figure: four residual plots, residuals plotted against x or ŷ (and against time).]

• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes.
• Residuals exhibit a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.


Example 18-1: Using Computer (Excel)

Residual analysis: the plot shows a curved relationship between the residuals and the X values (serum IL-6).

[Figure: residual plot — serum IL-6 on the horizontal axis (0 to 120), residuals on the vertical axis (−60 to 40).]


Prediction Interval

• Suppose we have a sample from a normally distributed population. The mean and standard deviation of the population are unknown except insofar as they can be estimated based on the sample. It is desired to predict the next observation.

• Let n be the sample size; let μ and σ be respectively the unobservable mean and standard deviation of the population. Let X1, …, Xn be the sample; let Xn+1 be the future observation to be predicted. Let

X̄ = (X1 + ⋯ + Xn)/n

• and

Sn² = Σ(i=1..n) (Xi − X̄)² / (n − 1),  so that  X̄ ~ N(μ, σ²/n)


Prediction Interval

• Then it is fairly routine to show that

(Xn+1 − X̄) / (Sn·√(1 + 1/n))

has a Student's t-distribution with n − 1 degrees of freedom. Consequently we have

P( X̄ − Ta·Sn·√(1 + 1/n) ≤ Xn+1 ≤ X̄ + Ta·Sn·√(1 + 1/n) ) = p

• where Ta is the 100((1 + p)/2)th percentile of Student's t-distribution with n − 1 degrees of freedom. Therefore the numbers

X̄ ± Ta·Sn·√(1 + 1/n)

• are the endpoints of a 100p% prediction interval for Xn+1.


18.10 Prediction Interval and Confidence Interval

• Point prediction
  – A single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.

• Prediction interval
  – For a value of Y given a value of X
    • Variation in regression line estimate
    • Variation of points around regression line

• Confidence interval
  – For an average value of Y given a value of X
    • Variation in regression line estimate

Confidence interval of an average value of Y given a value of X

Ŷ0 ~ N( β0 + β1X0,  σ²( 1/n + (X0 − X̄)² / Σ(Xi − X̄)² ) )

t = (Ŷ0 − (β0 + β1X0)) / S(Ŷ0) ~ t(n − 2)

Ŷ − t(n−2, α/2)·S(Ŷ) ≤ E(Y) ≤ Ŷ + t(n−2, α/2)·S(Ŷ)

where

S(Ŷ) = S · √( 1/n + (X0 − X̄)² / Σ(i=1..n)(Xi − X̄)² )

Confidence Interval for the Average Value of Y

A 100(1 − α)% confidence interval for the mean value of Y:

ŷ(x0) ± t(α/2, n−2) · s · √( 1/n + (x0 − x̄)²/lxx )

Example 18-1 (x0 = 75.3):

ŷ(x0) = a + b·x0 = 72.96 + 1.18 × 75.3 = 161.79

161.79 ± 2.306 × 31.12 × √( 1/10 + (75.3 − 59.26)²/6104.66 )

161.79 ± 27.06 → [134.74, 188.85]


Prediction Interval for a Value of Y Given a Value of X

Y0 ~ N(β0 + β1X0, σ²)

Ŷ0 − Y0 ~ N( 0,  σ²( 1 + 1/n + (X0 − X̄)² / Σ(Xi − X̄)² ) )

t = (Ŷ0 − Y0) / S(Y−Ŷ) ~ t(n − 2)

Ŷ − t(n−2, α/2)·S(Y−Ŷ) ≤ YP ≤ Ŷ + t(n−2, α/2)·S(Y−Ŷ)

where

S(Y−Ŷ) = S · √( 1 + 1/n + (X0 − X̄)² / Σ(i=1..n)(Xi − X̄)² )


Prediction Interval for a Value of Y

A 100(1 − α)% prediction interval for Y:

ŷ(x0) ± t(α/2, n−2) · s · √( 1 + 1/n + (x0 − x̄)²/lxx )

Example 18-1 (x0 = 75.3):

ŷ(x0) = a + b·x0 = 72.96 + 1.18 × 75.3 = 161.79

161.79 ± 2.306 × 31.12 × √( 1 + 1/10 + (75.3 − 59.26)²/6104.66 )

161.79 ± 76.69 → [85.11, 238.48]
238.48]
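The two interval formulas differ only by the leading 1 under the square root. A combined sketch for x0 = 75.3, recomputing everything from the raw data (2.306 is the tabulated t(0.025, 8); the half-widths land close to the slide's 27.06 and 76.69, which use rounded intermediates):

```python
import math

serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
n = len(serum)
x_bar, y_bar = sum(serum) / n, sum(brain) / n

lxx = sum((x - x_bar) ** 2 for x in serum)
lxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(serum, brain))
lyy = sum((y - y_bar) ** 2 for y in brain)
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt((lyy - b * lxy) / (n - 2))   # standard error of estimate
t_crit = 2.306                             # t(0.05/2, 8) from a t table

x0 = 75.3
y_hat = a + b * x0
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / lxx)      # mean response
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / lxx)  # new observation

print(round(y_hat, 2))   # 161.79
print(round(half_ci, 2), round(half_pi, 2))   # roughly 27.05 and 76.69
```

The prediction half-width is far larger than the confidence half-width, which is exactly the widening of the outer band in the figure that follows.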


Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y

[Figure: scatterplot of serum IL-6 (20–100) vs. brain IL-6 (0–300) showing the actual observations, the upper and lower 95% confidence limits for the mean, and the upper and lower 95% prediction limits for individual values; the prediction band is wider than the confidence band.]

Summary
1. Regression analysis is applied for prediction while controlling for the effect of the independent variable X.
2. The principle of least squares in the solution of regression parameters is to minimize the residual sum of squares.
3. The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the other for individual prediction values.
5. Residual analysis is used to check goodness of fit for models.
