
Linear regression
Alessio Farcomeni
University of Rome “Tor Vergata”
alessio.farcomeni@uniroma2.it
B.D. in Data Science for Management, Course in Advanced Statistics
Statistical inference


Parameters of interest come in pairs: one at the sample level and one at the population level. The first can be computed, and used to infer on its population counterpart.
We have a sample mean x̄ and a population mean µ.
The sample should be a fair image of the population (no bias!).
Random sampling is like stirring a soup.
How?


A fair, random, and large enough sample guarantees that the sample parameter is close to the population parameter.
Point estimation: my guess for µ is x̄.
Conceptual difference: the sample mean is exactly x̄; the population mean is estimated (guessed) as x̄.
Properties: consistency


The law of large numbers guarantees consistency.
If (i) X1, . . . , Xn is an i.i.d. sample, (ii) E[X] = µ, and (iii) V[X] < ∞, then
Pr(lim_{n→∞} x̄ = µ) = 1 (strong, or almost sure, convergence).
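As a quick illustration, here is a minimal Python sketch (assuming numpy is available; the data are simulated) showing the running sample mean settling around µ as n grows:

# Law of large numbers: the running mean of i.i.d. draws approaches mu.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, size=100_000)               # i.i.d. sample
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[9, 99, 9_999, 99_999]])           # drifts towards mu = 2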
Confidence Intervals


Point estimation is overly optimistic.
Confidence intervals acknowledge uncertainty and are more reliable.
Usually, people report intervals with 90%, 95%, or 99% coverage.
Coverage?


95%, 90%, 99%... your choice (well, actually a habit of your research area).
Before seeing the data, you have 95% probability that your confidence interval will cover the true parameter.
After seeing the data, you either got it, or not.
95%: you compute 20 intervals, on average you fail once.
This applies also to huge values of n. As n increases you will simply get shorter and shorter intervals.
Underlying ideas


Inference is based on two ideas:
1 I must accept a slight possibility of failing in order to infer something meaningful.
2 I am lucky.
How?


CIs are based on the sampling distribution of the point estimator.
This in turn is derived from the sampling distribution of Xi.
Simply put: if Xi is Gaussian, x̄ is Gaussian and you know how it will behave.
Namely.


If Xi ∼ N(µ, σ²), then

x̄ ∼ N(µ, σ²/n)

Hence, a 95% CI is given by

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)
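A minimal Python sketch of this interval (assuming numpy; σ is treated as known and the data are illustrative):

# 95% confidence interval for mu with known sigma.
import numpy as np

x = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.5, 4.9])
sigma = 0.7                                    # assumed known here
n, xbar = x.size, x.mean()
half_width = 1.96 * sigma / np.sqrt(n)
print((xbar - half_width, xbar + half_width))  # (x_bar - 1.96*sigma/sqrt(n), x_bar + 1.96*sigma/sqrt(n))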
What if Xi is not Gaussian?


It really depends.
In general, if you are working with a mean, Xi is not Gaussian but x̄ approximately is.
Central limit theorem: under the same assumptions as the LLN,
x̄ ∼ N(µ, σ²/n)
approximately, better and better as n grows.
Standard Error


The standard deviation of our estimates (in repeated sampling).
For x̄: σ/√n, which is consistently estimated by s/√n.
The standard error is a quantification of the uncertainty in generalizing sample quantities to population quantities.
Discrete (binary) case: √(p̂(1 − p̂)/n)
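A small Python sketch of both formulas (assuming numpy; toy data):

# Standard error of a mean and of a proportion.
import numpy as np

x = np.array([2.3, 1.9, 2.8, 2.4, 2.1, 2.6])
se_mean = x.std(ddof=1) / np.sqrt(x.size)        # s / sqrt(n)

y = np.array([1, 0, 1, 1, 0, 1, 1, 0])           # binary outcomes
p_hat = y.mean()
se_prop = np.sqrt(p_hat * (1 - p_hat) / y.size)  # sqrt(p_hat*(1 - p_hat)/n)
print(se_mean, se_prop)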
Effects at population level


The most important part of a statistical analysis is the evaluation of some effect (a group difference, for instance) at population level.
Note: a non-zero effect will always be observed at sample level. This may be due to random error, which must be excluded.
A quick reminder: why do I see a non-zero effect?


1 Bias
2 Confounding
3 Sampling error
4 True effect
I must rule out the first three. You have an idea about how to
do it for the first two. Hypothesis testing is used for the third.
Ruling out random error


Hypothesis testing
Hypothesis: null effect at population level
We (usually) want to reject the (null) hypothesis.
We do it when the evidence in the data against a null effect is strong.
In practice

A different test (and summary, or test, statistic) is used for each case.
A p-value is computed from it. A p-value lies in the (0, 1) interval. It is the probability, under the null hypothesis, of observing a discrepancy between data and null hypothesis at least as large as the one observed.
A low p-value allows us to reject the null.
Conventionally, a threshold α is fixed. This is often α = 0.05. So if p < 0.05, we reject the null.
Important note: if p > α we cannot say that we accept the hypothesis of no effect. We say that we cannot conclude anything, as the study sample size is not large enough.
Illustration

Say your test statistic is T (e.g., √n |x̄|/s).
p = Pr(T > t_obs | H0) (e.g., H0 : µ = 0)
This is simply computed based on the distribution of T, which is derived from the sampling distribution of the point estimate, which in turn is derived from the distribution of Xi (or from the CLT).
E.g., T = √n |x̄|/s follows a folded T distribution!
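A minimal Python sketch of this computation (assuming numpy and scipy; illustrative data, H0: µ = 0):

# p-value for H0: mu = 0 using T = sqrt(n)|x_bar|/s and its folded T distribution.
import numpy as np
from scipy import stats

x = np.array([0.8, -0.2, 1.1, 0.5, 0.9, 0.3, 1.4, 0.7])
n = x.size
t_obs = np.sqrt(n) * abs(x.mean()) / x.std(ddof=1)
p_value = 2 * stats.t.sf(t_obs, df=n - 1)        # two-sided (folded) tail probability
print(t_obs, p_value)
print(stats.ttest_1samp(x, popmean=0.0))         # same p-value via the built-in test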
Equivalently

The rejection region p < α corresponds, by definition, to a quantile of the distribution of T:

Pr(T > t_obs | H0) < α

if and only if

t_obs > q_T(α),

where q_T(α) is such that Pr(T > q_T(α)) = α.
Namely, for a Z-test where T is folded Gaussian, p < 0.05 only if T > 1.96.
Practical vs Statistical Significance


p < 0.05 means: the sample effect can be generalized to population level.
This is not sufficient. A non-zero effect may be so tiny as to be irrelevant. You also need practical significance, that is, the effect to be large enough.
Note: practical significance without statistical significance is not sufficient either.
Likelihood inference

Most of statistical inference can be explained with likelihood theory.
Let f(Y | θ) denote a general density or probability mass function for a random variable Y, depending on a parameter vector θ.
E.g., for Y ∈ R, f(Y | θ) = exp(−(Y − µ)²/(2σ²)) / (σ√(2π)), with θ = (µ, σ²).
Given a sample y1, . . . , yn, the likelihood is

L(θ) = Pr(y1, . . . , yn | θ) = ∏_{i=1}^n f(yi | θ)

The last equality holds under independence.
Important thing to realize: the data are fixed, the parameters are not.
Likelihood principle


To be coherent, all inferential procedures must depend only on the likelihood.
Note: p-values violate the likelihood principle!
On computers, for reasons linked to numerical underflow, we equivalently work with the log-likelihood
l(θ) = log(L(θ)) = Σ_{i=1}^n log(f(yi | θ))
Maximum Likelihood Estimator

You want to estimate θ? Use the most likely parameter: maximize l(θ) to obtain the MLE θ̂.
The MLE is consistent (as n increases, θ̂ gets closer and closer to θ).
The MLE is asymptotically Gaussian, hence for the MLE you can always compute confidence intervals and tests using Gaussian theory.
The MLE for the Gaussian model is (ȳ, s²), as expected.
What is different from basic inference?


Nothing. Let us compute the MLE for a binary problem.
Likelihood: ∏_i p^{yi} (1 − p)^{1−yi}
Log-likelihood: Σ_i [yi log(p) + (1 − yi) log(1 − p)]
MLE: p̂ = Σ_i yi / n.
Estimate of the standard error: √(p̂(1 − p̂)/n).
As n grows, p̂ ∼ N(p, p(1 − p)/n).
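A small Python check (assuming numpy and scipy; toy data) that the numerical maximizer of the log-likelihood agrees with the closed-form p̂:

# Bernoulli MLE: closed form vs numerical maximization.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

def neg_loglik(p):
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p_hat = y.mean()                                  # sum(y_i)/n
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
se = np.sqrt(p_hat * (1 - p_hat) / y.size)
print(p_hat, res.x, se)                           # the two estimates agree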
In more complex scenarios...


In more complex scenarios the MLE is not available in closed form, let alone its standard error.
Its theoretical properties still hold, but it must be computed by maximizing the log-likelihood numerically.
The standard error is a bit more complex and involves the Hessian of the log-likelihood.
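A sketch of this numerical route in Python (assuming numpy and scipy; illustrative Gaussian data, with the optimizer's approximate inverse Hessian used for the standard errors):

# Numerical MLE for (mu, log sigma) and Hessian-based standard errors.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = np.array([3.2, 2.8, 4.1, 3.6, 2.9, 3.8, 3.3, 4.0])

def neg_loglik(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
print(res.x)                                 # MLE of (mu, log sigma)
print(np.sqrt(np.diag(res.hess_inv)))        # approximate standard errors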
Asymmetric modeling


Variables in many cases are not created equal to each other.
Endpoint/outcome/endogenous/target/response variable Y.
Predictors/covariates X1, . . . , Xp, collected in a matrix X.
We want to see the effects of X on Y.
Regression: how a numeric outcome depends on a predictor

First instance: Sir Francis Galton showed that the height of men could be predicted from the height of their fathers.
On the other hand, in that special case, very tall fathers tended to have slightly less tall sons. This was called “regression to the mean”.
Note: conditional model


The model is conditional on the predictors X.
These are therefore fixed.
More formally, the density is not f(Y | θ) but f(Y | θ, X).
Likelihood: ∏_{i=1}^n f(yi | θ, Xi)
Purposes


Understanding the interrelationships among predictors and outcome.
One predictor might be more important (e.g., treatment) and the others might be confounders that we adjust for.
Prediction of the outcome in future settings (when it has not been measured, only the predictors).
Multivariate linear regression


Gaussian distribution: Yi ∼ N(g(Xi), σ²)
g(Xi) usually linear: β0 + β1 Xi1 + · · · + βp Xip
Equivalently: E[Yi | Xi] = g(Xi)
Equivalently, residuals ei = Yi − g(Xi) ∼ N(0, σ²).
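A minimal fit in Python (assuming numpy and statsmodels; the data are simulated for illustration):

# Linear regression Y = beta0 + beta1*X1 + beta2*X2 + error, fitted by OLS/ML.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()   # add_constant supplies the intercept beta0
print(fit.params)                           # estimates of beta0, beta1, beta2
print(fit.summary())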
Assumptions


i.i.d. replicates Y1, . . . , Yn from a Gaussian population
Linearity of the model
Homoscedasticity (that is, σ² does not depend on X)
The first must be checked before and after model estimation, the other two only after.
Linearity


True relationships are never actually linear.
A Taylor series says they are approximately linear:

f(x) ≈ f(x0) + (x − x0) f′(x0)
Gaussian assumption


Check that Y1, . . . , Yn approximately follows a Gaussian distribution:
Histogram
Testing (not recommended), e.g., Kolmogorov-Smirnov
What if Y is not Gaussian?

Several economic indicators (e.g., income) are severely skewed.
The raw measurements cannot be used as endpoints.
One could transform them to normality, being careful about interpretation (see below).
Right-skewed variables: take the log, square root, cubic root, etc.
Left-skewed variables: take the exponential, square, cube, etc.
Box-Cox: (Y^λ − 1)/λ for some λ ≠ 0
No transformation is needed for X!
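A short Python sketch of these transformations (assuming numpy and scipy; simulated right-skewed data):

# Log and Box-Cox transforms of a right-skewed outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=1, size=500)   # severely right-skewed
log_income = np.log(income)                          # simple log transform
bc_income, lam = stats.boxcox(income)                # (Y^lambda - 1)/lambda, lambda chosen by ML
print(lam, stats.skew(log_income), stats.skew(bc_income))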
Likelihood inference


The MLE for a problem with no predictors leads to (β̂0, σ̂²) = (ȳ, s²).
In the presence of covariates the MLE is still available, and so are its standard error and theoretical properties.
Interpretation

β̂0 corresponds to an estimate of E[Y | X = 0]. It makes sense when the situation X = 0 makes sense.
Predictors might be mean centered beforehand, for instance.
Single predictor: β̂1 is the expected increase in Y for each unit increase in X1.
Multiple predictors: β̂1 is the expected increase in Y for each unit increase in X1 when all other predictors are held fixed.
This takes care of observed confounding!
Interpretation after transformation


Suppose your outcome is Ỹ = g(Y).
β̂ is the unit change on the g(·) scale.
g⁻¹(β̂) should be evaluated on a case by case basis; note that the effect is not linear anymore.
After a log transformation: exp(β̂) is the fold change per unit of X (linear → multiplicative effect).
E.g., if exp(β̂) = 1.2, there is an increase of 20% in Y per unit of X.
Prediction


Obviously, for new predictors x = (x1, . . . , xp), the prediction is ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp.
If we predict the observed data, the observed residual êi = yi − ŷi is the observed error.
One should predict only for x in the range of the observed data.
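A small Python sketch (assuming numpy and statsmodels; simulated data, with the new x chosen inside the observed range):

# Prediction at new covariate values and observed residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 2))
y = 2.0 + 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=100)
fit = sm.OLS(y, sm.add_constant(X)).fit()

x_new = np.array([[1.0, 4.0, 6.0]])         # intercept, x1, x2 (within observed range)
print(fit.predict(x_new))                   # y_hat = beta0 + beta1*x1 + beta2*x2
resid = y - fit.fittedvalues                # observed residuals e_i = y_i - y_hat_i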
Categorical predictors: dummy variables

Categorical predictors should be recoded to binary (dummy) variables. One (arbitrarily chosen) category is the reference (all dummies equal to zero).
If there are two categories, the non-reference category takes unit values.
If there are k > 2 categories, k − 1 dummy predictors are created, each being a dummy for one non-reference category.
β̂ is the expected mean difference in Y comparing the non-reference category to the reference one.
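A minimal Python sketch of the recoding (assuming pandas and statsmodels; toy data with three categories):

# Dummy coding of a categorical predictor; the formula interface does it automatically.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":      [10.2, 11.5, 9.8, 13.1, 12.4, 10.9],
    "region": ["North", "South", "North", "Centre", "Centre", "South"],
})
print(pd.get_dummies(df["region"], drop_first=True))  # k - 1 dummies, reference dropped

fit = smf.ols("y ~ C(region)", data=df).fit()
print(fit.params)   # each coefficient: mean difference vs the reference category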
Confidence intervals and testing


Since β̂ is the MLE, it is asymptotically Gaussian.
Hence one can compute (Wald) 95% confidence intervals

(β̂ − 1.96 SEβ̂, β̂ + 1.96 SEβ̂)

Analogously, a test statistic for H0 : β = 0 is given by

β̂ / SEβ̂,

which is a standard Gaussian under the null hypothesis.
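In Python these quantities come directly from the fitted model (assuming numpy and statsmodels; simulated data):

# Wald confidence intervals and tests for the regression coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 2))
y = 0.5 + 1.2 * X[:, 0] + rng.normal(size=150)   # the second predictor has no true effect
fit = sm.OLS(y, sm.add_constant(X)).fit()

print(fit.conf_int(alpha=0.05))   # beta_hat +/- quantile*SE (t-based in finite samples)
print(fit.params / fit.bse)       # test statistics beta_hat / SE
print(fit.pvalues)                # p-values for H0: beta = 0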


Model choice


Definitely: p ≪ n.
Stepwise approaches are possible. These are often based on the Akaike Information Criterion:

AIC = −2 l(θ̂) + 2q,

where q is the number of parameters. Small AIC is better.
My recommendation: use the automatically selected model as a starting point, and improve it manually.
Forward selection


With p predictors, there are 2^p candidate models.
Exhaustive search is often impossible.
Forward selection (a sketch in code is given below):
1 Estimate AIC0 for the empty model.
2 At stage j: keep the model with j − 1 predictors fixed, and compute the AIC of each model obtained by adding one of the remaining predictors, AIC_{j,1}, AIC_{j,2}, . . .
3 If min_u AIC_{j,u} < AIC_{j−1}, set AIC_j = min_u AIC_{j,u} and add arg min_u AIC_{j,u} to the set of selected predictors.
4 Otherwise, stop with the optimal model with j − 1 predictors.
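A compact Python sketch of this greedy search (assuming numpy and statsmodels; simulated data, OLS models compared by AIC):

# Forward selection by AIC: add, at each stage, the predictor that lowers AIC most.
import numpy as np
import statsmodels.api as sm

def forward_aic(y, X):
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic      # empty (intercept-only) model
    while remaining:
        trials = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                  for j in remaining]
        aic_new, j_best = min(trials)
        if aic_new >= best_aic:                               # no candidate improves AIC: stop
            break
        best_aic, selected = aic_new, selected + [j_best]
        remaining.remove(j_best)
    return selected, best_aic

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
print(forward_aic(y, X))           # typically selects columns 0 and 3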
Screening

For ultra-high dimensional problems (say, p/n > 40) an
independence screening is needed. For our purposes, also
for high dimensional problems.
At the first step, univariate associations with the outcome
are examined (see next) and only a subset of the variables
is retained.
At the second step, variable selection, or (regularized, see
next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g.,
polynomial augmentation) between the first and second step.
Sure Independence Screening


Choose γ ∈ (0, 1).
Choose a measure of univariate association between Y and Xj, e.g., the absolute value of the univariate regression coefficient.
Select the γn ≪ p variables with the largest values of the measure above (a sketch in code is given below).
These are then used for model selection or estimation.
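A minimal Python sketch (assuming numpy; the absolute correlation with Y is used as the association measure, and γ is a tuning choice):

# Sure independence screening: keep the floor(gamma*n) most associated predictors.
import numpy as np

def sis(y, X, gamma=0.1):
    n, p = X.shape
    keep = max(1, int(np.floor(gamma * n)))
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    return np.argsort(scores)[::-1][:keep]       # indices of the screened variables

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2000))                 # p >> n
y = 3 * X[:, 7] - 2 * X[:, 42] + rng.normal(size=100)
print(sis(y, X, gamma=0.1))                      # should include columns 7 and 42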
Why model selection?


Occam’s razor
Bias: expected difference between the true (population-level) target quantity and the estimate
Variance: variability of the estimate under repeated sampling
Mean Squared Error: Bias² + Variance
Bias-variance trade-off


Underparameterized models are biased


Overparameterized models exhibit high variance
Internal validity of selected model


Is everything significant?
Is everything interpretable?
External validity of selected model


Are there non-linear relationships? E.g., a linear relationship between Y and Xj².
Are there interactions? E.g., the product X1 X3 is significant.
Note: for interpretability reasons, interacting factors are in general included hierarchically (X1 and X3 must be in the model if X1 X3 is), and (mean or median) centered if continuous.
Goodness of fit


Goodness of fit: how well the observed data are approximated by the predictions.
R² = cor(Y, β̂0 + β̂1 X1 + · · · + β̂p Xp)², the squared correlation between the outcome and the fitted values.
It is monotone (non-decreasing) in the number of variables!
An honest assessment of goodness of fit


Adjusted R²:

R̄² = 1 − (1 − R²)(n − 1)/(n − g − 1),

where g is the number of predictors in the model.
It penalizes for the number of predictors.
It is a more honest assessment of the expected performance on new data.
Collinearity


When predictors are uncorrelated, including or excluding each one does not modify β̂ for the remaining ones.
In general they are correlated, and different models give different scenarios (due to confounding).
Large correlation is a problem: instability of the estimates, huge standard errors.
One should then check that the predictors in the model are not too collinear.
Variance Inflation Factors


VIF: by how much is the variance of β̂j increased because of collinearity?
Computation (a sketch in code is given below):
1 Use Xj as the outcome, regressing it on all the other predictors: Xj = β0 + β1 X1 + · · · + βp Xp (Xj excluded from the right-hand side).
2 VIFj = 1/(1 − Rj²), Rj² being the R² of the model above.
VIFj > 10 (or 5): too large, remove Xj.
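A short Python sketch (assuming numpy and statsmodels; simulated data with two nearly collinear predictors):

# Variance inflation factors; the constant column is skipped.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)        # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

print([variance_inflation_factor(X, j) for j in range(1, X.shape[1])])  # x1, x2 exceed 10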
The process for your reference

1 Check that Y is Gaussian; transform it if necessary.
2 Stepwise model selection for an initial candidate model; SIS if needed.
3 Refine the automatically selected model by excluding (one at a time) non-significant predictors, including some that were left out, and including transformations (e.g., X3²) and interactions.
4 Final model: everything is significant, everything is interpretable, collinearity is under control. If the target is (also) prediction: the adjusted R² is large. If anything fails, go back to step 3.
