
18-1 4/9/2005 11:38 AM

Principles of Biostatistics
Simple Linear Regression

PPT based on Dr. Chuanhua Yu and Wikipedia
Department of Epidemiology and Health Statistics, Tongji Medical College http://statdtedm.6to23 (Dr. Chuanhua Yu)

Terminology
Moments, skewness, kurtosis
Analysis of variance (ANOVA)
Response (dependent) variable
Explanatory (independent) variable
Linear regression model
Method of least squares
Normal equations
Sum of squares, error (SSE)
Sum of squares, regression (SSR)
Sum of squares, total (SST)
Coefficient of determination (R²)
F value, P value, t-test, F-test
Homoscedasticity
Heteroscedasticity

Contents
• 18.0 Normal distribution and terms
• 18.1 An Example
• 18.2 The Simple Linear Regression Model
• 18.3 Estimation: The Method of Least Squares
• 18.4 Error Variance and the Standard Errors of Regression Estimators
• 18.5 Confidence Intervals for the Regression Parameters
• 18.6 Hypothesis Tests about the Regression Relationship
• 18.7 How Good is the Regression?
• 18.8 Analysis of Variance Table and an F Test of the Regression Model
• 18.9 Residual Analysis
• 18.10 Prediction Interval and Confidence Interval


Normal Distribution

The continuous probability density function of the normal distribution is the Gaussian function

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

where σ > 0 is the standard deviation, the real parameter μ is the expected value, and

φ(z) = (1 / √(2π)) · exp(−z² / 2)

is the density function of the "standard" normal distribution: i.e., the normal distribution with μ = 0 and σ = 1.
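As a quick numeric check, the density can be evaluated directly. A minimal sketch (the function name is mine), cross-checked against Python's built-in `statistics.NormalDist`:

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Standard normal density at 0 is 1/sqrt(2*pi)
print(round(normal_pdf(0.0), 4))  # 0.3989

# Agrees with the standard-library implementation
print(math.isclose(normal_pdf(1.5, 2.0, 0.5), NormalDist(2.0, 0.5).pdf(1.5)))  # True
```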


Normal Distribution

[Figure: normal density curves for several values of μ and σ (slide contains only a figure).]


Moment About The Mean

The kth moment about the mean (or kth central moment) of a real-valued random variable X is the quantity

μk = E[(X − E[X])^k],

where E is the expectation operator. For a continuous univariate probability distribution with probability density function f(x), the kth moment about the mean μ is

μk = ∫ (x − μ)^k f(x) dx.

The first moment about zero, if it exists, is the expectation of X, i.e. the mean of the probability distribution of X, designated μ. In higher orders, the central moments are more interesting than the moments about zero.

μ1 is 0.
μ2 is the variance, the positive square root of which is the standard deviation, σ.
μ3/σ³ is the skewness, often written γ1.
μ4/σ⁴ − 3 is the (excess) kurtosis.

Skewness
Consider the distribution in the figure. The bars on the right side of the distribution taper differently than the bars on the left side. These tapering sides are called tails, and they provide a visual means for determining which of the two kinds of skewness a distribution has:
1. Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed.
2. Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed.


Skewness

Skewness, the third standardized moment, is written γ1 and defined as

γ1 = μ3 / σ³

where μ3 is the third moment about the mean and σ is the standard deviation.

For a sample of n values, the sample skewness is

g1 = [ (1/n) Σ (xi − x̄)³ ] / s³

where x̄ is the sample mean and s is the sample standard deviation.


Kurtosis
Kurtosis is the degree of peakedness of a distribution. A normal distribution is a
mesokurtic distribution. A pure leptokurtic distribution has a higher peak than the
normal distribution and has heavier tails. A pure platykurtic distribution has a lower
peak than a normal distribution and lighter tails.


Kurtosis
The fourth standardized moment is defined as

β2 = μ4 / σ⁴

where μ4 is the fourth moment about the mean and σ is the standard deviation.

For a sample of n values, the sample kurtosis is

b2 = [ (1/n) Σ (xi − x̄)⁴ ] / s⁴
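The central-moment definitions above translate directly into code. A minimal sketch using population (1/n) moments; the helper names are mine:

```python
def central_moment(xs, k):
    """k-th central moment: (1/n) * sum((x - mean)^k)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """g1 = mu3 / sigma^3, with sigma = sqrt(mu2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """mu4 / sigma^4 - 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(round(skewness(data), 3))       # 0.656 (right-skewed sample)
print(skewness([1.0, 2.0, 3.0]))      # 0.0 (symmetric sample)
```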


18.1 An Example

Table 18.1  IL-6 levels in serum and brain (pg/ml) of 10 patients with subarachnoid hemorrhage

Patient i   Serum IL-6 (x)   Brain IL-6 (y)
 1              22.4             134.0
 2              51.6             167.0
 3              58.1             132.3
 4              25.1              80.2
 5              65.9             100.0
 6              79.7             139.1
 7              75.3             187.2
 8              32.4              97.2
 9              96.4             192.3
10              85.7             199.4


Scatterplot
This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that:

Larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6.

The scatter of points tends to be distributed around a positively sloped straight line.

The pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.


Examples of Other Scatterplots

[Figure: six scatterplots of Y against X illustrating different possible relationships (positive linear, negative linear, curvilinear, and no relationship).]


Model Building
The inexact nature of the relationship between serum and brain suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

Statistical model = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.


18.2 The Simple Linear Regression Model

The population simple linear regression model:

y = α + βx + ε    or    μ(y|x) = α + βx

where α + βx is the nonrandom or systematic component and ε is the random component.

Here y is the dependent (response) variable, the variable we wish to explain or predict; x is the independent (explanatory) variable, also called the predictor variable; and ε is the error term, the only random component in the model, and thus the only source of randomness in y.

μ(y|x) is the mean of y when x is specified, also called the conditional mean of Y.

α is the intercept of the systematic component of the regression relationship.
β is the slope of the systematic component.


Picturing the Simple Linear Regression Model

The simple linear regression model posits an exact linear relationship between the expected (average) value of the dependent variable Y and the independent or predictor variable X:

μ(y|x) = α + βx

Actual observed values of Y (y) differ from the expected value (μ(y|x)) by an unexplained or random error (ε):

y = μ(y|x) + ε = α + βx + ε

[Figure: regression plot of the line μ(y|x) = α + βx, with intercept α, slope β, and the error ε marked for one observed point.]

Assumptions of the Simple Linear Regression Model

LINE assumptions of the simple linear regression model:
• The relationship between X and Y is a straight-Line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.
• The errors ε are uncorrelated (i.e., Independent) in successive observations.
• The errors ε are Normally distributed with mean 0 and variance σ² (Equal variance). That is: ε ~ N(0, σ²).

[Figure: identical normal distributions of errors, all centered on the regression line μ(y|x) = α + βx; for each x, Y ~ N(μ(y|x), σ²).]

18.3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

y = a + bx + e

where a estimates the intercept of the population regression line, α; b estimates the slope of the population regression line, β; and e stands for the observed errors, the residuals from fitting the estimated regression line a + bx to a set of n points.

The estimated regression line:

ŷ = a + bx

where ŷ (y-hat) is the value of Y lying on the fitted regression line for a given value of X.

Fitting a Regression Line

[Figure: four panels — the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]


Errors in Regression

ŷ = a + bx is the fitted regression line, and ŷi is the predicted value of Y for xi. The error for observation i is ei = yi − ŷi.

[Figure: scatter of points around the fitted line ŷ = a + bx, with the error ei = yi − ŷi marked for one point (xi, yi).]


Least Squares Regression

The sum of squared errors in regression is:

SSE = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)²    (SSE: sum of squared errors)

The least squares regression line is the one that minimizes the SSE with respect to the estimates a and b.

[Figure: SSE as a parabolic (bowl-shaped) function of a and b, minimized at the least squares values of a and b.]


Normal Equation
S is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

∂S/∂βj = 2 Σi ri (∂ri/∂βj)    (j = 1, …, m)

Since ri = yi − Σk Xik βk, the derivatives are

∂ri/∂βj = −Xij

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

Σi Σk Xij Xik β̂k = Σi Xij yi    (j = 1, …, m)

Upon rearrangement, the normal equations are obtained. The normal equations are written in matrix notation as

(XᵀX) β̂ = Xᵀy

The solution of the normal equations yields the vector of the optimal parameter values.


Normal Equation

Y = XB + U,   U ~ N(0, σ²I),   Ŷ = XB̂,   E = Y − Ŷ = Y − XB̂

Q = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)² = e′e = (Y − XB̂)′(Y − XB̂)

Q = (Y′ − B̂′X′)(Y − XB̂)
  = Y′Y − Y′XB̂ − B̂′X′Y + B̂′X′XB̂    (note Y′XB̂ = B̂′X′Y, a scalar)
  = Y′Y − 2B̂′X′Y + B̂′X′XB̂

∂Q/∂B̂ = 0  ⟹  −X′Y + X′XB̂ = 0

B̂ = (X′X)⁻¹ X′Y,    σ̂² = e′e / (n − k − 1)
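For simple regression the normal equations (X′X)B̂ = X′Y reduce to a 2×2 system. A sketch solving that system directly for the Table 18.1 data, with no external libraries (the helper name is mine):

```python
def fit_normal_equations(xs, ys):
    """Solve (X'X) b = X'y for X = [[1, x_i]] by direct 2x2 elimination."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sx], [sx, sxx]],  X'y = [sy, sxy]
    det = n * sxx - sx * sx
    a = (sxx * sy - sx * sxy) / det   # intercept
    b = (n * sxy - sx * sy) / det     # slope
    return a, b

serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
a, b = fit_normal_equations(serum, brain)
print(round(a, 2), round(b, 2))  # 72.96 1.18
```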


Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

lxx = Σ(x − x̄)² = Σx² − (Σx)²/n
lyy = Σ(y − ȳ)² = Σy² − (Σy)²/n
lxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

b = lxy / lxx
a = ȳ − b·x̄
ŷ = a + bx


Example 18-1

Patient      x        y         x²          y²          x·y
 1          22.4    134.0      501.76     17956.0     3001.60
 4          25.1     80.2      630.01      6432.0     2013.02
 8          32.4     97.2     1049.76      9447.8     3149.28
 2          51.6    167.0     2662.56     27889.0     8617.20
 3          58.1    132.3     3375.61     17503.3     7686.63
 5          65.9    100.0     4342.81     10000.0     6590.00
 7          75.3    187.2     5670.09     35043.8    14096.16
 6          79.7    139.1     6352.09     19348.8    11086.27
10          85.7    199.4     7344.49     39760.4    17088.58
 9          96.4    192.3     9292.96     36979.3    18537.72
Total      592.6   1428.7    41222.14    220360.5    91866.46

lxx = Σx² − (Σx)²/n = 41222.14 − 592.6²/10 = 6104.66
lyy = Σy² − (Σy)²/n = 220360.47 − 1428.70²/10 = 16242.10
lxy = Σxy − (Σx)(Σy)/n = 91866.46 − (592.6)(1428.70)/10 = 7201.70

b = lxy / lxx = 7201.70 / 6104.66 = 1.18
a = ȳ − b·x̄ = 1428.7/10 − (1.18)(592.6/10) = 72.96

Regression equation:  ŷ = 72.96 + 1.18x
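The computing-table arithmetic above can be reproduced in a few lines. A sketch using the document's l-notation (variable names are mine):

```python
serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
n = len(serum)

# Corrected sums of squares and cross products
lxx = sum(x * x for x in serum) - sum(serum) ** 2 / n
lyy = sum(y * y for y in brain) - sum(brain) ** 2 / n
lxy = sum(x * y for x, y in zip(serum, brain)) - sum(serum) * sum(brain) / n

b = lxy / lxx                             # slope
a = sum(brain) / n - b * sum(serum) / n   # intercept

print(round(lxx, 2), round(lyy, 2), round(lxy, 2))  # 6104.66 16242.1 7201.7
print(round(a, 2), round(b, 2))                     # 72.96 1.18
```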


New Normal Distributions

B̂ = (X′X)⁻¹ X′Y

• Since each coefficient estimator is a linear combination of Y (normal random variables), each β̂j (j = 0, 1, …, k) is normally distributed.

• Notation: β̂j ~ N(βj, σ²cjj), where cjj is the jth-row, jth-column (diagonal) element of (X′X)⁻¹.

In the two-dimensional (simple regression) special case, for the slope, cjj = 1/lxx.

For j = 0 (the intercept) in the two-dimensional special case:

β̂0 = a ~ N(α, σ²(Σx²)/(n·lxx))

Total Variance and Error Variance

Total variance:  Σ(y − ȳ)² / (n − 1)  — what you see when looking at the total variation of Y.

Error variance:  Σ(y − ŷ)² / (n − 2)  — what you see when looking along the regression line at the error variance of Y.

18.4 Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression:

df = n − 2   (n total observations, less one degree of freedom for each parameter estimated, a and b)

Square and sum all regression errors to find SSE:

SSE = Σ(y − ŷ)² = lyy − lxy²/lxx = lyy − b·lxy

Example 18-1:
SSE = lyy − b·lxy = 16242.10 − (1.18)(7201.70) = 7746.23

An unbiased estimator of σ², denoted by s²:

Error variance:  MSE = SSE/(n − 2) = Σ(y − ŷ)²/(n − 2);  MSE = 7746.23/8 = 968.28

Standard error:  s = √MSE = √968.28 = 31.12
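The chain SSE → MSE → s can be sketched as follows, continuing from the l-values of Example 18-1 (the constants below are the slide's figures; the slope is recomputed to full precision rather than rounded to 1.18):

```python
import math

lyy, lxy, lxx = 16242.10, 7201.70, 6104.66
n = 10
b = lxy / lxx                # unrounded slope estimate

sse = lyy - b * lxy          # SSE = lyy - b*lxy
mse = sse / (n - 2)          # error variance, df = n - 2
s = math.sqrt(mse)           # standard error of estimate

print(round(sse, 1), round(mse, 1), round(s, 2))  # 7746.2 968.3 31.12
```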


Standard Errors of Estimates in Regression

The standard error of a (intercept):

sa = s · √( Σx² / (n·lxx) ),  where s = √MSE

The standard error of b (slope):

sb = s / √lxx

Example 18-1:

sa = 31.12 × √( 41222.14 / (10 × 6104.66) ) = 25.570

sb = 31.12 / √6104.66 = 0.398


T distribution

Student's t-distribution arises when the population standard deviation is unknown and has to be estimated from the data.


18.5 Confidence Intervals for the Regression Parameters

A 100(1 − α)% confidence interval for α:  a ± t(α/2, n−2) · sa
A 100(1 − α)% confidence interval for β:  b ± t(α/2, n−2) · sb

Example 18-1, 95% confidence intervals:

a ± t(0.05/2, 10−2) · sa = 72.961 ± (2.306)(25.570) = 72.961 ± 58.964 → [13.996, 131.925]

b ± t(0.05/2, 10−2) · sb = 1.180 ± (2.306)(0.398) = 1.180 ± 0.918 → [0.261, 2.098]
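A sketch of the interval arithmetic for Example 18-1. The critical value t(0.025, 8) = 2.306 is taken from a t table rather than computed, since the Python standard library has no inverse-t function; the inputs are the slide's rounded estimates:

```python
t_crit = 2.306              # t(0.05/2, df = 8), from a t table
a, s_a = 72.961, 25.570     # intercept estimate and its standard error
b, s_b = 1.180, 0.398       # slope estimate and its standard error

ci_a = (a - t_crit * s_a, a + t_crit * s_a)
ci_b = (b - t_crit * s_b, b + t_crit * s_b)

print([round(v, 3) for v in ci_a])   # [13.997, 131.925]
print([round(v, 3) for v in ci_b])   # [0.262, 2.098]
```

The third decimals differ slightly from the slide's [13.996, …] and [0.261, …] because the slide carries more digits internally.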


18.6 Hypothesis Tests about the Regression Relationship

[Figure: three scatterplots in which H0: β = 0 holds — constant Y, unsystematic variation, and a nonlinear relationship.]

A hypothesis test for the existence of a linear relationship between X and Y:

H0: β = 0
H1: β ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

tb = b / sb

where b is the least-squares estimate of the regression slope and sb is the standard error of b. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.


T-test

A t-test is a test of the null hypothesis that the means of two normally distributed populations are equal. Given two data sets, each characterized by its mean, standard deviation, and number of data points, we can use some kind of t test to determine whether the means are distinct, provided that the underlying distributions can be assumed to be normal. All such tests are usually called Student's t tests.

For a regression coefficient:

t = β̂j / s(β̂j) ~ t(n − k)


T-test

Example 18-1:

H0: β = 0,  H1: β ≠ 0,  α = 0.05

tb = b / sb = 1.180 / 0.398 = 2.962

p value = 0.018   (df = 10 − 2 = 8)

t(0.05/2, 8) = 2.306 < 2.962

H0 is rejected at the 5% level and we may conclude that there is a relationship between serum IL-6 and brain IL-6.
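The slope test for Example 18-1 in code, carrying the unrounded estimates through the whole chain; 2.306 is again the tabulated critical value:

```python
import math

lxx, lyy, lxy, n = 6104.66, 16242.10, 7201.70, 10
b = lxy / lxx
s = math.sqrt((lyy - b * lxy) / (n - 2))   # standard error of estimate
s_b = s / math.sqrt(lxx)                   # standard error of the slope
t_b = b / s_b

print(round(t_b, 3))    # 2.962
t_crit = 2.306          # t(0.05/2, 8) from a t table
print(t_b > t_crit)     # True -> reject H0: beta = 0 at the 5% level
```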


T test Table

[Table: critical values of Student's t distribution (figure omitted).]


18.7 How Good is the Regression?

The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all points:

Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
SST = SSE + SSR

R² = SSR/SST = 1 − SSE/SST : the percentage of total variation explained by the regression.

[Figure: one data point decomposed into total, unexplained, and explained deviations about ȳ and the fitted line.]


The Coefficient of Determination

[Figure: three scatterplots illustrating R² = 0, R² = 0.50, and R² = 0.90, with SST partitioned into SSE and SSR.]

Example 18-1:

R² = SSR/SST = b·lxy / lyy = (1.180 × 7201.70) / 16242.10 = 0.5231 = 52.31%

Another Test

• Earlier in this section you saw how to perform a t-test to compare a sample mean to an accepted value, or to compare two sample means. In this section, you will see how to use the F-test to compare two variances or standard deviations.

• When using the F-test, you again require a hypothesis, but this time it is to compare standard deviations. That is, you will test the null hypothesis H0: σ1² = σ2² against an appropriate alternate hypothesis.


F-test

The t-test is applied to each single parameter separately. To verify a combination of all the parameters jointly, we can use the F-test.

H0: β2 = β3 = … = βk = 0
H1: at least one of β2, β3, …, βk is non-zero

The formula for an F-test in multiple-comparison ANOVA problems is:

F = (between-group variability) / (within-group variability)

F = [SSR/(k − 1)] / [SSE/(n − k)] ~ F(k − 1, n − k)


F test table

[Table: critical values of the F distribution (figure omitted).]


18.8 Analysis of Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST

Example 18-1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p Value
Regression             8495.87         1                    8495.87       8.77      0.0181
Error                  7746.23         8                     968.28
Total                 16242.10         9
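The ANOVA-table entries for Example 18-1 follow mechanically from the l-values; note that F here equals the square of the slope's t statistic (t = 2.962, t² ≈ 8.77). A sketch:

```python
lyy, lxy, lxx, n = 16242.10, 7201.70, 6104.66, 10
b = lxy / lxx
ssr = b * lxy          # regression sum of squares
sse = lyy - ssr        # error sum of squares
msr = ssr / 1          # df_regression = 1
mse = sse / (n - 2)    # df_error = n - 2
f = msr / mse

print(round(ssr, 2), round(sse, 2))   # ~8495.88 and ~7746.22
print(round(f, 2))                    # 8.77
```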


F-test, T-test and R

1. In the 2D case, the F-test and the t-test are the same; it can be proved that F = t². So in the 2D case either the F or the t test is enough. This is not true for more variables.

2. The F-test and R² have the same purpose: to measure the whole regression. They are related as

F = [(n − k)/(k − 1)] · [R² / (1 − R²)]

3. The F-test is better than R² because it has a better metric, with known distributions for hypothesis testing.

Approach:
1. First the F-test. If passed, continue.
2. A t-test for every parameter; if some parameter cannot pass, we can delete it and re-evaluate the regression.
3. Note that we delete only one parameter at a time (the one with the least effect on the regression), until all remaining parameters have a strong effect.


18.9 Residual Analysis

[Figure: four residual plots, residuals plotted against x or ŷ (and against time).]

• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes.
• Residuals exhibit a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.


Example 18-1: Using Computer (Excel)

Residual analysis: the plot shows a curved relationship between the residuals and the X values (serum IL-6).

[Figure: residual plot — serum IL-6 on the horizontal axis (0 to 120), residuals on the vertical axis (−60 to 40).]


Prediction Interval

• Suppose we have a sample from a normally distributed population. The mean and standard deviation of the population are unknown except insofar as they can be estimated based on the sample. It is desired to predict the next observation.

• Let n be the sample size; let μ and σ be respectively the unobservable mean and standard deviation of the population. Let X1, …, Xn be the sample; let Xn+1 be the future observation to be predicted. Let

X̄ = (X1 + ⋯ + Xn)/n

• and

Sn² = Σ(i=1..n) (Xi − X̄)² / (n − 1),  so that  X̄ ~ N(μ, σ²/n)


Prediction Interval

• Then it is fairly routine to show that

(Xn+1 − X̄) / (Sn·√(1 + 1/n))

has a Student's t-distribution with n − 1 degrees of freedom. Consequently we have

P( X̄ − Ta·Sn·√(1 + 1/n) ≤ Xn+1 ≤ X̄ + Ta·Sn·√(1 + 1/n) ) = p

• where Ta is the 100((1 + p)/2)th percentile of Student's t-distribution with n − 1 degrees of freedom. Therefore the numbers

X̄ ± Ta·Sn·√(1 + 1/n)

• are the endpoints of a 100p% prediction interval for Xn+1.


18.10 Prediction Interval and Confidence Interval

• Point prediction
  – A single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.

• Prediction interval
  – For a value of Y given a value of X
    • Variation in regression line estimate
    • Variation of points around regression line

• Confidence interval
  – For an average value of Y given a value of X
    • Variation in regression line estimate

Confidence interval of an average value of Y given a value of X

Ŷ0 ~ N( β0 + β1X0,  σ²( 1/n + (X0 − X̄)² / Σ(Xi − X̄)² ) )

t = (Ŷ0 − (β0 + β1X0)) / S(Ŷ0) ~ t(n − 2)

Ŷ − t(n−2, α/2)·S(Ŷ) ≤ E(Y) ≤ Ŷ + t(n−2, α/2)·S(Ŷ)

where

S(Ŷ) = S · √( 1/n + (X0 − X̄)² / Σ(i=1..n)(Xi − X̄)² )

Confidence Interval for the Average Value of Y

A 100(1 − α)% confidence interval for the mean value of Y:

ŷ(x0) ± t(α/2, n−2) · s · √( 1/n + (x0 − x̄)²/lxx )

Example 18-1 (x0 = 75.3):

ŷ(x0) = a + b·x0 = 72.96 + 1.18 × 75.3 = 161.79

161.79 ± 2.306 × 31.12 × √( 1/10 + (75.3 − 59.26)²/6104.66 )

161.79 ± 27.06 → [134.74, 188.85]


Prediction Interval for a Value of Y Given a Value of X

Y0 ~ N(β0 + β1X0, σ²)

Ŷ0 − Y0 ~ N( 0,  σ²( 1 + 1/n + (X0 − X̄)² / Σ(Xi − X̄)² ) )

t = (Ŷ0 − Y0) / S(Y−Ŷ) ~ t(n − 2)

Ŷ − t(n−2, α/2)·S(Y−Ŷ) ≤ YP ≤ Ŷ + t(n−2, α/2)·S(Y−Ŷ)

where

S(Y−Ŷ) = S · √( 1 + 1/n + (X0 − X̄)² / Σ(i=1..n)(Xi − X̄)² )


Prediction Interval for a Value of Y

A 100(1 − α)% prediction interval for Y:

ŷ(x0) ± t(α/2, n−2) · s · √( 1 + 1/n + (x0 − x̄)²/lxx )

Example 18-1 (x0 = 75.3):

ŷ(x0) = a + b·x0 = 72.96 + 1.18 × 75.3 = 161.79

161.79 ± 2.306 × 31.12 × √( 1 + 1/10 + (75.3 − 59.26)²/6104.66 )

161.79 ± 76.69 → [85.11, 238.48]
238.48]
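The two interval formulas differ only by the leading 1 under the square root. A combined sketch for x0 = 75.3, recomputing everything from the raw data (2.306 is the tabulated t(0.025, 8); the half-widths land close to the slide's 27.06 and 76.69, which use rounded intermediates):

```python
import math

serum = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
brain = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]
n = len(serum)
x_bar, y_bar = sum(serum) / n, sum(brain) / n

lxx = sum((x - x_bar) ** 2 for x in serum)
lxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(serum, brain))
lyy = sum((y - y_bar) ** 2 for y in brain)
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt((lyy - b * lxy) / (n - 2))   # standard error of estimate
t_crit = 2.306                             # t(0.05/2, 8) from a t table

x0 = 75.3
y_hat = a + b * x0
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / lxx)      # mean response
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / lxx)  # new observation

print(round(y_hat, 2))   # 161.79
print(round(half_ci, 2), round(half_pi, 2))   # roughly 27.05 and 76.69
```

The prediction half-width is far larger than the confidence half-width, which is exactly the widening of the outer band in the figure that follows.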


Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y

[Figure: scatterplot of serum IL-6 (20–100) vs. brain IL-6 (0–300) showing the actual observations, the upper and lower 95% confidence limits for the mean, and the upper and lower 95% prediction limits for individual values; the prediction band is wider than the confidence band.]

Summary
1. Regression analysis is applied for prediction while controlling for the effect of the independent variable X.
2. The principle of least squares in the solution of regression parameters is to minimize the residual sum of squares.
3. The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the other for individual prediction values.
5. Residual analysis is used to check goodness of fit for models.
