
MED-0323: Introduction to Biostatistics

Class 13: Linear regression

November 8, 2019
Iván Sisa, MD, MPH, MS
isisa@usfq.edu.ec
Class topics
— Fundamentals of linear regression
— Using a dummy variable
— Adjusting for variables
— Interpretation
— Diagnostic tests
Class objectives
— By the end of this class, students are expected to be able to:
◦ Correctly interpret regression coefficients
◦ Assess whether there is a significant linear relationship
between two quantitative variables
Summary
Linear regression

Farzad Noubary, PhD


Big picture
• Linear regression is the most commonly used
statistical technique. It allows dichotomous,
categorical, and continuous predictors to be related
to a continuous outcome.
• Extensions of linear regression allow
– Dichotomous outcomes: logistic regression
– Repeated measures
– Survival analysis: Cox proportional hazards regression
• Amazingly, many of the analyses we have learned
can be completed using linear regression
Example
• We were investigating the association between age
and BPF using a correlation coefficient
• Can we fit a line to this data?

[Scatter plot: BPF (0.75 to 0.95) vs. Age (20 to 60)]
Quick math review
• The basic equation of a line is given by y = mx + b,
where m is the slope and b is the y-intercept
• One definition of m is that for every one-unit increase
in x, there is an m-unit increase in y
• One definition of b is the value of y when x is equal
to zero

[Plot of the line y = 1.5x + 4, for x from 0 to 12]
Picture
• Look at the data in this picture
• Does there seem to be a correlation (linear
relationship) in the data?
• Is the data perfectly linear?
• Could we fit a line to this data?

[Scatter plot: y (0 to 25) vs. x (0 to 12)]
How do we find the best line?
• Let’s look at three candidate lines
• Which do you think is the best?
• What is a way to determine the best line to use?
What is linear regression?
• Linear regression tries to find the best line (curve)
to fit the data
• The method of finding the best line (curve) is least
squares, which minimizes the sum of the squared vertical
distances from the line to each of the points

[Scatter plot with the fitted line y = 1.5x + 4]
Residuals
• The actual observations, yi, may be slightly off the
population line because of variability in the population.
The equation is yi = b0 + b1xi + ei, where ei is the
deviation from the population line; in the picture, ei is
the vertical distance from the line for patient i
• This deviation is called the residual
Least squares
• The method employed to find the best line is called
least squares. This method finds the values of b that
minimize the squared vertical distance from the line
to each of the points. This is the same as minimizing
the sum of the ei²:

  Σ(i=1 to n) ei² = Σ(i=1 to n) (yi - (b0 + b1xi))²
Estimates of regression coefficients
• Once we have solved the least squares equation, we
obtain estimates for the b’s, which we refer to
as b̂0, b̂1:

  b̂1 = Σ(i=1 to n) (xi - x̄)(yi - ȳ) / Σ(i=1 to n) (xi - x̄)²

  b̂0 = ȳ - b̂1·x̄

• The final least squares equation is

  ŷ = b̂0 + b̂1·x

where ŷ (y hat) is the mean value of y for a value of x
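The closed-form estimates above can be computed directly. Below is a minimal sketch in Python with NumPy; the age and BPF values are made up for illustration (the lecture's actual dataset is not reproduced here):

```python
import numpy as np

# Hypothetical age and BPF values, for illustration only;
# these are not the lecture's actual data.
age = np.array([25.0, 32.0, 41.0, 47.0, 55.0])
bpf = np.array([0.93, 0.90, 0.86, 0.84, 0.80])

# Least-squares estimates from the formulas above:
#   b1hat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
#   b0hat = ybar - b1hat * xbar
x_bar, y_bar = age.mean(), bpf.mean()
b1_hat = np.sum((age - x_bar) * (bpf - y_bar)) / np.sum((age - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# Fitted values: the estimated mean BPF at each observed age
bpf_hat = b0_hat + b1_hat * age
```

As a cross-check, `np.polyfit(age, bpf, 1)` returns the same slope and intercept.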
Example
• Here is a regression equation for the comparison of
age and BPF:

  E(BPF | age) = b0 + b1·age
  BPFi = b0 + b1·agei + ei

[Scatter plot of the observed data points plus the residuals:
BPF (0.75 to 0.95) vs. Age (20 to 60)]
Results
• The estimated regression equation:

  BPFˆ = 0.957 - 0.0029 * age

[Scatter plot of BPF vs. Age (20 to 60) with the fitted line
(predicted values)]
. regress bpf age

      Source |       SS       df       MS           Number of obs =      29
-------------+------------------------------        F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034        Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041        R-squared     =  0.3330
-------------+------------------------------        Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934        Root MSE      =  .04061

         bpf |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001   -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000     .885553    1.029333

(The age coefficient is the estimated slope; the _cons coefficient
is the estimated intercept.)
Interpretation of regression coefficients
• The final regression equation is

  BPFˆ = 0.957 - 0.0029 * age

• The coefficients mean
– the estimate of the mean BPF for a patient with an age of 0
is 0.957 (b0hat)
– an increase of one year in age leads to an estimated
decrease of 0.0029 in mean BPF (b1hat)
Unanswered questions
• Is the estimate of b1 (b1hat) significantly
different than zero? In other words, is there a
significant relationship between the predictor
and the outcome?
• Have the assumptions of regression been
met?
Estimate of variance for the b̂’s
• In order to determine if there is a significant
association, we need an estimate of the variance of
b0hat and b1hat:

  sê(b̂0) = sy|x · √(1/n + x̄² / Σ(i=1 to n)(xi - x̄)²)

  sê(b̂1) = sy|x / √(Σ(i=1 to n)(xi - x̄)²)

• sy|x is the residual standard deviation of y after
accounting for x (the standard deviation from regression,
or root mean square error)
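These formulas can be sketched directly in Python with NumPy. The data are again made up for illustration, not the lecture's dataset:

```python
import numpy as np

# Illustrative data (not the lecture's dataset)
age = np.array([25.0, 32.0, 41.0, 47.0, 55.0])
bpf = np.array([0.93, 0.90, 0.86, 0.84, 0.80])
n = len(age)

# Least-squares fit
x_bar, y_bar = age.mean(), bpf.mean()
sxx = np.sum((age - x_bar) ** 2)
b1_hat = np.sum((age - x_bar) * (bpf - y_bar)) / sxx
b0_hat = y_bar - b1_hat * x_bar

# s_{y|x}: residual standard deviation (root mean square error),
# estimated with n - 2 degrees of freedom
resid = bpf - (b0_hat + b1_hat * age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard errors from the slides' formulas
se_b0 = s_yx * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_b1 = s_yx / np.sqrt(sxx)

# t-statistic for H0: b1 = 0, with n - 2 degrees of freedom
t_b1 = b1_hat / se_b1
```

These standard errors agree with the general matrix form Var(b̂) = s²(XᵀX)⁻¹, which makes a convenient sanity check.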
Test statistic
• For both regression coefficients, we use a t-statistic
to test any specific hypothesis
– Each has n - 2 degrees of freedom (the sample size minus
the number of parameters estimated)
• What is the usual null hypothesis for b1?
Hypothesis test
1) H0: b1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = -3.67 (27 dof)
5) p-value = 0.001
6) Since the p-value is less than 0.05, we reject the
null hypothesis
7) We conclude that there is a significant association
between age and BPF
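The p-value in step 5 can be recovered from the t-distribution. A small sketch using SciPy (assuming `scipy` is available), plugging in the slides' reported test statistic and degrees of freedom:

```python
from scipy import stats

# Two-sided p-value for the slides' reported test statistic
t_stat = -3.67
df = 27
p_value = 2 * stats.t.sf(abs(t_stat), df)
# p_value is roughly 0.001, below 0.05, so H0: b1 = 0 is rejected
```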
. regress bpf age

(Same Stata output as above; the p-value for the slope is in the
age row: t = -3.67, P>|t| = 0.001.)
Comparison to correlation
• In this example, we found a relationship between
age and BPF. We also investigated this relationship
using correlation
• We get the same p-value!
• Our conclusion is exactly the same!

  Method              p-value
  Correlation         0.001
  Linear regression   0.001
Assumptions of linear regression
• Independence
– All of the data points are independent
– Correlated data points can be taken into account using
multivariate and longitudinal data methods
• Linearity
– Linear relationship between outcome and predictors
• Homoscedasticity of the residuals
– The residuals, ei, have the same variance
• Normality of the residuals
– The residuals, ei, are normally distributed
Linearity assumption
• One of the assumptions of linear regression is that
the relationship between the predictors and the
outcome is linear
• We call this the population regression line:

  E(Y | X = x) = µy|x = b0 + b1x

• This equation says that the mean of y given a specific
value of x is defined by the b coefficients
Normality and homoscedasticity assumption
• Two other assumptions of linear regression are
related to the ei’s
– Homoscedasticity: the variance of y given x is the
same for all values of x
– Normality: the distribution of the residuals is normal

[Figure: the distribution of y-values at each value of x is
normal, with the same variance]
Confidence interval for b1
• As we have done previously, we can construct a
confidence interval for the regression coefficients
• Since we are using a t-distribution, we do not
automatically use 1.96. Rather, we use the cut-off
from the t-distribution
• The interpretation of the confidence interval is the
same as we have seen previously
Hypothesis test
1) H0: b1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Estimated effect = -0.0029
5) 95% CI: (-0.0045, -0.0013)
6) Since the 95% confidence interval fails to include
the null value, we reject the null hypothesis
7) We conclude that there is a significant association
between age and BPF
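The interval in step 5 is b̂1 ± t(0.025, 27) × sê(b̂1). A sketch with SciPy (assuming `scipy` is available), plugging in the slope and standard error reported in the regression output:

```python
from scipy import stats

b1_hat = -0.0028799   # estimated slope from the output
se_b1 = 0.0007845     # its standard error
df = 27               # n - 2 degrees of freedom

# t cut-off: about 2.05 for 27 df, not the normal-based 1.96
t_crit = stats.t.ppf(0.975, df)
ci_lower = b1_hat - t_crit * se_b1
ci_upper = b1_hat + t_crit * se_b1
# (ci_lower, ci_upper) is approximately (-0.0045, -0.0013);
# zero lies outside the interval, so H0: b1 = 0 is rejected
```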
. regress bpf age

(Same Stata output as above; the 95% confidence intervals are in
the last two columns: (-.0044895, -.0012704) for the slope and
(.885553, 1.029333) for the intercept.)
R²
• Although we have found a relationship between age
and BPF, linear regression also allows us to assess
how well our model fits the data
• R² = coefficient of determination = proportion of
variance in the outcome explained by the model
– When we have only one predictor, it is the proportion of
the variance in y explained by x
r vs. R²
• R² = (Pearson’s correlation coefficient)² = r²
• Since r is between -1 and 1, R² is always less than
or equal to |r|
– r = 0.1, R² = 0.01
– r = 0.5, R² = 0.25

  Method   Estimate
  r        -0.577
  R²        0.333
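The single-predictor identity R² = r² is easy to verify numerically. A sketch with NumPy, using made-up data (not the lecture's dataset):

```python
import numpy as np

# Illustrative data (not the lecture's dataset)
age = np.array([25.0, 32.0, 41.0, 47.0, 55.0])
bpf = np.array([0.93, 0.90, 0.86, 0.84, 0.80])

# Pearson's correlation coefficient
r = np.corrcoef(age, bpf)[0, 1]

# R^2 from the regression: model sum of squares over total sum of squares
b1, b0 = np.polyfit(age, bpf, 1)
fitted = b0 + b1 * age
ss_model = np.sum((fitted - bpf.mean()) ** 2)
ss_total = np.sum((bpf - bpf.mean()) ** 2)
r_squared = ss_model / ss_total
# r_squared equals r ** 2, and since |r| <= 1, r_squared <= |r|
```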
Prediction
• Beyond determining if there is a significant
association, linear regression can also be used to
make predictions
• Using the regression equation, we can predict the
BPF for patients with specific age values
– Ex. a patient with age = 40:

  BPFˆ = 0.957 - 0.0029 * 40 = 0.841

• The expected BPF for a patient of age 40, based on
our experiment, is 0.841
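The prediction step is just plugging an age into the fitted equation; a one-function sketch using the slides' estimated coefficients:

```python
def predicted_bpf(age):
    """Estimated mean BPF from the slides' fitted equation."""
    return 0.957 - 0.0029 * age

# For a patient aged 40: 0.957 - 0.0029 * 40 = 0.841
bpf_at_40 = predicted_bpf(40)
```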
Extrapolation
• Can we predict the BPF for a patient with age
80? What assumption would we be making?

[Scatter plot of BPF (0.75 to 0.95) vs. Age (20 to 60) with
predicted values]
Confidence interval for mean value
• We can place a confidence interval around our
predicted mean value
• This corresponds to the plausible values for the mean
BPF at a specific age
• To calculate a confidence interval for the predicted
mean value, we need an estimate of variability in the
predicted mean (wider for more uncertainty):

  sê(ŷ) = sy|x · √(1/n + (x - x̄)² / Σ(i=1 to n)(xi - x̄)²)
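The sê(ŷ) formula can be sketched directly as well; the data below are again made up for illustration:

```python
import numpy as np

# Illustrative data (not the lecture's dataset)
age = np.array([25.0, 32.0, 41.0, 47.0, 55.0])
bpf = np.array([0.93, 0.90, 0.86, 0.84, 0.80])
n = len(age)

# Fit and residual standard deviation, as in the earlier slides
b1, b0 = np.polyfit(age, bpf, 1)
resid = bpf - (b0 + b1 * age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
x_bar = age.mean()
sxx = np.sum((age - x_bar) ** 2)

def se_mean(x_new):
    """Standard error of the predicted mean at x_new (slides' formula)."""
    return s_yx * np.sqrt(1.0 / n + (x_new - x_bar) ** 2 / sxx)
```

Note that se_mean is smallest at x_new = x̄ and grows as x_new moves away from the mean, which is why a plotted confidence band fans out at the extremes of the age range.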
Confidence interval
• Note that the standard error equation has a different
magnitude based on the x value. In particular, the
magnitude is smallest when x equals the mean of x
• Since the test statistic is based on the t-distribution,
our confidence interval is

  (ŷ - t(α/2, df)·sê(ŷ), ŷ + t(α/2, df)·sê(ŷ))

• This confidence interval is rarely used for hypothesis
testing

[Plot of the fitted line with its confidence band:
BPF (0.75 to 0.95) vs. Age (20 to 60)]
What we have learned
— Assessing the relationship between two continuous variables
— Linear regression
— Assumptions of linear regression
— Correctly interpreting linear regression coefficients
— Hypothesis testing for beta
