
Linear regression
Alessio Farcomeni
University of Rome “Tor Vergata”
alessio.farcomeni@uniroma2.it
B.D. in Data Science for Management, Course in Advanced Statistics
Statistical inference


Parameters of interest come in pairs: one at the sample level and one at the population level. The first can be computed, and used to infer on its population counterpart.
We have a sample mean x̄ and a population mean µ.
The sample should be a fair image of the population (no bias!).
Random sampling is like stirring a soup.
How?


A fair, random, and large enough sample guarantees that the sample parameter is close to the population parameter.
Point estimation: my guess for µ is x̄.
Conceptual difference: the sample mean is exactly x̄; the population mean is estimated (guessed) as x̄.
Properties: consistency


The law of large numbers guarantees consistency.
If (i) X1, . . . , Xn is an i.i.d. sample, (ii) E[X] = µ, and (iii) V[X] < ∞, then
Pr(lim_{n→∞} x̄ = µ) = 1 (strong, or almost sure, convergence).
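As a quick illustration, here is a minimal Python sketch (assuming numpy is available; the data are simulated) showing the running sample mean settling around µ as n grows:

# Law of large numbers: the running mean of i.i.d. draws approaches mu.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, size=100_000)               # i.i.d. sample
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[9, 99, 9_999, 99_999]])           # drifts towards mu = 2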
Confidence Intervals


Point estimation is overly optimistic.
Confidence intervals acknowledge uncertainty and are more reliable.
Usually, people report intervals with 90%, 95%, or 99% coverage.
Coverage?


95%, 90%, 99%... your choice (well, actually a habit of your research area).
Before seeing the data, you have 95% probability that your confidence interval will cover the true parameter.
After seeing the data, you either got it, or not.
95%: you compute 20 intervals, on average you fail once.
This applies also to huge values of n. As n increases you will simply get shorter and shorter intervals.
Underlying ideas


Inference is based on two ideas:
1 I must accept a slight possibility of failing in order to infer something meaningful.
2 I am lucky.
How?


CIs are based on the sampling distribution of the point estimator.
This in turn is derived from the sampling distribution of Xi.
Simply put: if Xi is Gaussian, x̄ is Gaussian and you know how it will behave.
Namely.


If Xi ∼ N(µ, σ²), then

x̄ ∼ N(µ, σ²/n)

Hence, a 95% CI is given by

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)
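A minimal Python sketch of this interval (assuming numpy; σ is treated as known and the data are illustrative):

# 95% confidence interval for mu with known sigma.
import numpy as np

x = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.5, 4.9])
sigma = 0.7                                    # assumed known here
n, xbar = x.size, x.mean()
half_width = 1.96 * sigma / np.sqrt(n)
print((xbar - half_width, xbar + half_width))  # (x_bar - 1.96*sigma/sqrt(n), x_bar + 1.96*sigma/sqrt(n))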
What if Xi is not Gaussian?


It really depends.
In general, if you are working with a mean, Xi is not Gaussian but x̄ approximately is.
Central limit theorem: under the same assumptions as the LLN,
x̄ ∼ N(µ, σ²/n)
approximately, better and better as n grows.
Standard Error


The standard deviation of our estimates (in repeated sampling).
For x̄: σ/√n, which is consistently estimated by s/√n.
The standard error is a quantification of the uncertainty in generalizing sample quantities to population quantities.
Discrete (binary) case: √(p̂(1 − p̂)/n)
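A small Python sketch of both formulas (assuming numpy; toy data):

# Standard error of a mean and of a proportion.
import numpy as np

x = np.array([2.3, 1.9, 2.8, 2.4, 2.1, 2.6])
se_mean = x.std(ddof=1) / np.sqrt(x.size)        # s / sqrt(n)

y = np.array([1, 0, 1, 1, 0, 1, 1, 0])           # binary outcomes
p_hat = y.mean()
se_prop = np.sqrt(p_hat * (1 - p_hat) / y.size)  # sqrt(p_hat*(1 - p_hat)/n)
print(se_mean, se_prop)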
Effects at population level


The most important part of a statistical analysis is the evaluation of some effect (a group difference, for instance) at population level.
Note: a non-zero effect will always be observed at sample level. This may be due to random error, which must be excluded.
A quick reminder: why do I see a non-zero effect?


1 Bias
2 Confounding
3 Sampling error
4 True effect
I must rule out the first three. You have an idea about how to
do it for the first two. Hypothesis testing is used for the third.
Ruling out random error


Hypothesis testing
Hypothesis: null effect at population level
We (usually) want to reject the (null) hypothesis.
We do it when the evidence in the data against a null effect is strong.
In practice

A different test (and summary, or test, statistic) is used for each case.
A p-value is computed from it. A p-value lies in the (0, 1) interval. It is the probability, under the null hypothesis, of observing a discrepancy between data and null hypothesis at least as large as the one observed.
A low p-value allows us to reject the null.
Conventionally, a threshold α is fixed. This is often α = 0.05. So if p < 0.05, we reject the null.
Important note: if p > α we cannot say that we accept the hypothesis of no effect. We say that we cannot conclude anything, as the study sample size is not large enough.
Illustration

Say your test statistic is T (e.g., √n |x̄|/s).
p = Pr(T > t_obs | H0) (e.g., H0 : µ = 0)
This is simply computed based on the distribution of T, which is derived from the sampling distribution of the point estimate, which in turn is derived from the distribution of Xi (or from the CLT).
E.g., T = √n |x̄|/s follows a folded T distribution!
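A minimal Python sketch of this computation (assuming numpy and scipy; illustrative data, H0: µ = 0):

# p-value for H0: mu = 0 using T = sqrt(n)|x_bar|/s and its folded T distribution.
import numpy as np
from scipy import stats

x = np.array([0.8, -0.2, 1.1, 0.5, 0.9, 0.3, 1.4, 0.7])
n = x.size
t_obs = np.sqrt(n) * abs(x.mean()) / x.std(ddof=1)
p_value = 2 * stats.t.sf(t_obs, df=n - 1)        # two-sided (folded) tail probability
print(t_obs, p_value)
print(stats.ttest_1samp(x, popmean=0.0))         # same p-value via the built-in test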
Equivalently

The rejection region p < α corresponds, by definition, to a quantile of the distribution of T:

Pr(T > t_obs | H0) < α

if and only if

t_obs > q_T(α),

where q_T(α) is such that Pr(T > q_T(α)) = α.
Namely, for a Z-test where T is folded Gaussian, p < 0.05 only if T > 1.96.
Practical vs Statistical Significance


p < 0.05 means: the sample effect can be generalized to population level.
This is not sufficient. A non-zero effect may be so tiny as to be irrelevant. You also need practical significance, that is, the effect to be large enough.
Note: practical significance without statistical significance is not sufficient either.
Likelihood inference

Most of statistical inference can be explained with likelihood theory.
Let f(Y | θ) denote a general density or probability mass function for a random variable Y, depending on a parameter vector θ.
E.g., for Y ∈ R, f(Y | θ) = exp(−(Y − µ)²/(2σ²)) / (σ√(2π)), with θ = (µ, σ²).
Given a sample y1, . . . , yn, the likelihood is

L(θ) = Pr(y1, . . . , yn | θ) = ∏_{i=1}^n f(yi | θ)

The last equality holds under independence.
Important thing to realize: the data are fixed, the parameters are not.
Likelihood principle


To be coherent, all inferential procedures must depend only on the likelihood.
Note: p-values violate the likelihood principle!
On computers, for reasons linked to numerical underflow, we equivalently work with the log-likelihood
l(θ) = log(L(θ)) = Σ_{i=1}^n log(f(yi | θ))
Maximum Likelihood Estimator

You want to estimate θ? Use the most likely parameter: maximize l(θ) to obtain the MLE θ̂.
The MLE is consistent (as n increases, θ̂ gets closer and closer to θ).
The MLE is asymptotically Gaussian, hence for the MLE you can always compute confidence intervals and tests using Gaussian theory.
The MLE for the Gaussian model is (ȳ, s²), as expected.
What is different from basic inference?


Nothing. Let us compute the MLE for a binary problem.
Likelihood: ∏_i p^{yi} (1 − p)^{1−yi}
Log-likelihood: Σ_i [yi log(p) + (1 − yi) log(1 − p)]
MLE: p̂ = Σ_i yi / n.
Estimate of the standard error: √(p̂(1 − p̂)/n).
As n grows, p̂ ∼ N(p, p(1 − p)/n).
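A small Python check (assuming numpy and scipy; toy data) that the numerical maximizer of the log-likelihood agrees with the closed-form p̂:

# Bernoulli MLE: closed form vs numerical maximization.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

def neg_loglik(p):
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p_hat = y.mean()                                  # sum(y_i)/n
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
se = np.sqrt(p_hat * (1 - p_hat) / y.size)
print(p_hat, res.x, se)                           # the two estimates agree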
In more complex scenarios...


In more complex scenarios the MLE is not available in closed form, let alone its standard error.
Its theoretical properties still hold, but it must be computed by maximizing the log-likelihood numerically.
The standard error is a bit more complex and involves the Hessian of the log-likelihood.
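A sketch of this numerical route in Python (assuming numpy and scipy; illustrative Gaussian data, with the optimizer's approximate inverse Hessian used for the standard errors):

# Numerical MLE for (mu, log sigma) and Hessian-based standard errors.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = np.array([3.2, 2.8, 4.1, 3.6, 2.9, 3.8, 3.3, 4.0])

def neg_loglik(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
print(res.x)                                 # MLE of (mu, log sigma)
print(np.sqrt(np.diag(res.hess_inv)))        # approximate standard errors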
Asymmetric modeling


Variables in many cases are not created equal to each other.
Endpoint/outcome/endogenous/target/response variable Y.
Predictors/covariates X1, . . . , Xp, collected in a matrix X.
We want to see the effects of X on Y.
Regression: how a numeric outcome depends on a predictor

First instance: Sir Francis Galton showed that the height of men could be predicted from the height of their fathers.
On the other hand, in that special case, very tall fathers tended to have slightly less tall sons. This was called “regression to the mean”.
Note: conditional model


The model is conditional on the predictors X.
These are therefore fixed.
More formally, the density is not f(Y | θ) but f(Y | θ, X).
Likelihood: ∏_{i=1}^n f(yi | θ, Xi)
Purposes


Understanding the interrelationships among predictors and outcome.
One predictor might be more important (e.g., treatment) and the others might be confounders that we adjust for.
Prediction of the outcome in future settings (when it has not been measured, only the predictors).
Multivariate linear regression


Gaussian distribution: Yi ∼ N(g(Xi), σ²)
g(Xi) usually linear: β0 + β1 Xi1 + · · · + βp Xip
Equivalently: E[Yi | Xi] = g(Xi)
Equivalently, residuals ei = Yi − g(Xi) ∼ N(0, σ²).
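A minimal fit in Python (assuming numpy and statsmodels; the data are simulated for illustration):

# Linear regression Y = beta0 + beta1*X1 + beta2*X2 + error, fitted by OLS/ML.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()   # add_constant supplies the intercept beta0
print(fit.params)                           # estimates of beta0, beta1, beta2
print(fit.summary())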
Assumptions


i.i.d. replicates Y1, . . . , Yn from a Gaussian population
Linearity of the model
Homoscedasticity (that is, σ² does not depend on X)
The first must be checked before and after model estimation, the other two only after.
Linearity


True relationships are never actually linear.
A Taylor series says they are approximately linear:

f(x) ≈ f(x0) + (x − x0) f′(x0)
Gaussian assumption


Check that Y1, . . . , Yn approximately follows a Gaussian distribution:
Histogram
Testing (not recommended), e.g., Kolmogorov-Smirnov
What if Y is not Gaussian?

Several economic indicators (e.g., income) are severely skewed.
The raw measurements cannot be used as endpoints.
One could transform them to normality, being careful about interpretation (see below).
Right-skewed variables: take the log, square root, cubic root, etc.
Left-skewed variables: take the exponential, square, cube, etc.
Box-Cox: (Y^λ − 1)/λ for some λ ≠ 0
No transformation is needed for X!
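A short Python sketch of these transformations (assuming numpy and scipy; simulated right-skewed data):

# Log and Box-Cox transforms of a right-skewed outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=1, size=500)   # severely right-skewed
log_income = np.log(income)                          # simple log transform
bc_income, lam = stats.boxcox(income)                # (Y^lambda - 1)/lambda, lambda chosen by ML
print(lam, stats.skew(log_income), stats.skew(bc_income))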
Likelihood inference


The MLE for a problem with no predictors leads to (β̂0, σ̂²) = (ȳ, s²).
In the presence of covariates the MLE is still available, and so are its standard error and theoretical properties.
Interpretation

β̂0 corresponds to an estimate of E[Y | X = 0]. It makes sense when the situation X = 0 makes sense.
Predictors might be mean centered beforehand, for instance.
Single predictor: β̂1 is the expected increase in Y for each unit increase in X1.
Multiple predictors: β̂1 is the expected increase in Y for each unit increase in X1 when all other predictors are held fixed.
This takes care of observed confounding!
Interpretation after transformation


Suppose your outcome is Ỹ = g(Y).
β̂ is the unit change on the g(·) scale.
g⁻¹(β̂) should be evaluated on a case by case basis; note that the effect is not linear anymore.
After a log transformation: exp(β̂) is the fold change per unit of X (linear → multiplicative effect).
E.g., if exp(β̂) = 1.2, there is an increase of 20% in Y per unit of X.
Prediction


Obviously, for new predictors x = (x1, . . . , xp), the prediction is ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp.
If we predict the observed data, the observed residual êi = yi − ŷi is the observed error.
One should predict only for x in the range of the observed data.
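A small Python sketch (assuming numpy and statsmodels; simulated data, with the new x chosen inside the observed range):

# Prediction at new covariate values and observed residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 2))
y = 2.0 + 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=100)
fit = sm.OLS(y, sm.add_constant(X)).fit()

x_new = np.array([[1.0, 4.0, 6.0]])         # intercept, x1, x2 (within observed range)
print(fit.predict(x_new))                   # y_hat = beta0 + beta1*x1 + beta2*x2
resid = y - fit.fittedvalues                # observed residuals e_i = y_i - y_hat_i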
Categorical predictors: dummy variables

Categorical predictors should be recoded to binary (dummy) variables. One (arbitrarily chosen) category is the reference (all dummies equal to zero).
If there are two categories, the non-reference category takes unit values.
If there are k > 2 categories, k − 1 dummy predictors are created, each being a dummy for one non-reference category.
β̂ is the expected mean difference in Y comparing the non-reference category to the reference one.
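A minimal Python sketch of the recoding (assuming pandas and statsmodels; toy data with three categories):

# Dummy coding of a categorical predictor; the formula interface does it automatically.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":      [10.2, 11.5, 9.8, 13.1, 12.4, 10.9],
    "region": ["North", "South", "North", "Centre", "Centre", "South"],
})
print(pd.get_dummies(df["region"], drop_first=True))  # k - 1 dummies, reference dropped

fit = smf.ols("y ~ C(region)", data=df).fit()
print(fit.params)   # each coefficient: mean difference vs the reference category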
Confidence intervals and testing


Since β̂ is the MLE, it is asymptotically Gaussian.
Hence one can compute (Wald) 95% confidence intervals

(β̂ − 1.96 SEβ̂, β̂ + 1.96 SEβ̂)

Analogously, a test statistic for H0 : β = 0 is given by

β̂ / SEβ̂,

which is a standard Gaussian under the null hypothesis.
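In Python these quantities come directly from the fitted model (assuming numpy and statsmodels; simulated data):

# Wald confidence intervals and tests for the regression coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 2))
y = 0.5 + 1.2 * X[:, 0] + rng.normal(size=150)   # the second predictor has no true effect
fit = sm.OLS(y, sm.add_constant(X)).fit()

print(fit.conf_int(alpha=0.05))   # beta_hat +/- quantile*SE (t-based in finite samples)
print(fit.params / fit.bse)       # test statistics beta_hat / SE
print(fit.pvalues)                # p-values for H0: beta = 0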


Model choice


Definitely: p ≪ n.
Stepwise approaches are possible. These are often based on the Akaike Information Criterion:

AIC = −2 l(θ̂) + 2q,

where q is the number of parameters. Small AIC is better.
My recommendation: use the automatically selected model as a starting point, and improve it manually.
Forward selection


With p predictors, there are 2^p candidate models.
Exhaustive search is often impossible.
Forward selection (a sketch in code is given below):
1 Estimate AIC0 for the empty model.
2 At stage j: keep the model with j − 1 predictors fixed, and compute the AIC of each model obtained by adding one of the remaining predictors, AIC_{j,1}, AIC_{j,2}, . . .
3 If min_u AIC_{j,u} < AIC_{j−1}, set AIC_j = min_u AIC_{j,u} and add arg min_u AIC_{j,u} to the set of selected predictors.
4 Otherwise, stop with the optimal model with j − 1 predictors.
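A compact Python sketch of this greedy search (assuming numpy and statsmodels; simulated data, OLS models compared by AIC):

# Forward selection by AIC: add, at each stage, the predictor that lowers AIC most.
import numpy as np
import statsmodels.api as sm

def forward_aic(y, X):
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic      # empty (intercept-only) model
    while remaining:
        trials = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                  for j in remaining]
        aic_new, j_best = min(trials)
        if aic_new >= best_aic:                               # no candidate improves AIC: stop
            break
        best_aic, selected = aic_new, selected + [j_best]
        remaining.remove(j_best)
    return selected, best_aic

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
print(forward_aic(y, X))           # typically selects columns 0 and 3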
Screening

For ultra-high dimensional problems (say, p/n > 40) an
independence screening is needed. For our purposes, also
for high dimensional problems.
At the first step, univariate associations with the outcome
are examined (see next) and only a subset of the variables
is retained.
At the second step, variable selection, or (regularized, see
next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g.,
polynomial augmentation) between the first and second step.
Sure Independence Screening


Choose γ ∈ (0, 1).
Choose a measure of univariate association between Y and Xj, e.g., the absolute value of the univariate regression coefficient.
Select the γn ≪ p variables with the largest values of the measure above (a sketch in code is given below).
These are then used for model selection or estimation.
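A minimal Python sketch (assuming numpy; the absolute correlation with Y is used as the association measure, and γ is a tuning choice):

# Sure independence screening: keep the floor(gamma*n) most associated predictors.
import numpy as np

def sis(y, X, gamma=0.1):
    n, p = X.shape
    keep = max(1, int(np.floor(gamma * n)))
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    return np.argsort(scores)[::-1][:keep]       # indices of the screened variables

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2000))                 # p >> n
y = 3 * X[:, 7] - 2 * X[:, 42] + rng.normal(size=100)
print(sis(y, X, gamma=0.1))                      # should include columns 7 and 42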
Why model selection?


Occam’s razor
Bias: expected difference between the true (population-level) target quantity and the estimate
Variance: variability of the estimate under repeated sampling
Mean Squared Error: Bias² + Variance
Bias-variance trade-off


Underparameterized models are biased


Overparameterized models exhibit high variance
Internal validity of selected model


Is everything significant?
Is everything interpretable?
External validity of selected model


Are there non-linear relationships? E.g., a linear relationship between Y and Xj².
Are there interactions? E.g., the product X1 X3 is significant.
Note: for interpretability reasons, interacting factors are in general included hierarchically (X1 and X3 must be in the model if X1 X3 is), and (mean or median) centered if continuous.
Goodness of fit


Goodness of fit: how well the observed data are approximated by the predictions.
R² = cor(Y, β̂0 + β̂1 X1 + · · · + β̂p Xp)², the squared correlation between the outcome and the fitted values.
It is monotone (non-decreasing) in the number of variables!
An honest assessment of goodness of fit


Adjusted R²:

R̄² = 1 − (1 − R²)(n − 1)/(n − g − 1),

where g is the number of predictors in the model.
It penalizes for the number of predictors.
It is a more honest assessment of the expected performance on new data.
Collinearity


When predictors are uncorrelated, including or excluding each one does not modify β̂ for the remaining ones.
In general they are correlated, and different models give different scenarios (due to confounding).
Large correlation is a problem: instability of the estimates, huge standard errors.
One should then check that the predictors in the model are not too collinear.
Variance Inflation Factors


VIF: by how much is the variance of β̂j increased because of collinearity?
Computation (a sketch in code is given below):
1 Use Xj as the outcome, regressing it on all the other predictors: Xj = β0 + β1 X1 + · · · + βp Xp (Xj excluded from the right-hand side).
2 VIFj = 1/(1 − Rj²), Rj² being the R² of the model above.
VIFj > 10 (or 5): too large, remove Xj.
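A short Python sketch (assuming numpy and statsmodels; simulated data with two nearly collinear predictors):

# Variance inflation factors; the constant column is skipped.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)        # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

print([variance_inflation_factor(X, j) for j in range(1, X.shape[1])])  # x1, x2 exceed 10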
The process for your reference

1 Check that Y is Gaussian; transform it if necessary.
2 Stepwise model selection for an initial candidate model; SIS if needed.
3 Refine the automatically selected model by excluding (one at a time) non-significant predictors, including some that were left out, and including transformations (e.g., X3²) and interactions.
4 Final model: everything is significant, everything is interpretable, collinearity is under control. If the target is (also) prediction: the adjusted R² is large. If anything fails, go back to step 3.
