Linear regression
B.D. in Data Science for Management
Course in Advanced Statistics
Alessio Farcomeni
University of Rome “Tor Vergata”
alessio.farcomeni@uniroma2.it
Statistical inference
If Xᵢ ∼ N(µ, σ²), then x̄ ∼ N(µ, σ²/n)
It really depends. In general, if you are working with a mean, Xᵢ is not Gaussian but x̄ approximately is.
Central limit theorem: under the same assumptions as the LLN,
x̄ ∼ N(µ, σ²/n)
approximately, with the approximation improving as n grows.
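A minimal numpy sketch of this point, under illustrative assumptions (an exponential population, arbitrary sample sizes and seed): the distribution of x̄ concentrates around µ with standard deviation σ/√n even though Xᵢ itself is far from Gaussian.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, n_reps = 1.0, 10_000  # Exp(1) has mean 1 and variance 1

for n in (5, 30, 200):
    # n_reps independent samples of size n from a skewed distribution
    means = rng.exponential(scale=mu, size=(n_reps, n)).mean(axis=1)
    # CLT: x̄ should be approximately N(mu, sigma^2 / n)
    print(f"n={n:3d}  mean of x̄={means.mean():.3f}  "
          f"sd of x̄={means.std(ddof=1):.3f}  (CLT predicts {np.sqrt(1 / n):.3f})")
```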
Standard Error
The standard error is the standard deviation of the sampling distribution of an estimator: for x̄, SE = σ/√n, estimated by s/√n.
1. Bias
2. Confounding
3. Sampling error
4. True effect
I must rule out the first three. You have an idea of how to do this for the first two. Hypothesis testing is used for the third.
Ruling out random error
Hypothesis testing
Hypothesis: null effect at the population level
We (usually) want to reject the (null) hypothesis.
We do so when the evidence in the data against a null effect is strong.
In practice
A different test (and summary or test statistic) is used for each case.
A p-value is computed from it. A p-value lies in the interval (0, 1). It is the probability, under the null hypothesis, of observing a discrepancy between data and null at least as large as the one actually observed.
A low p-value allows us to reject the null.
Conventionally, a threshold α is fixed, often α = 0.05. So if p < 0.05, we reject the null.
Important note: if p > α we cannot say we accept the hypothesis of no effect. We can only say that nothing can be concluded, as the study sample size is not large enough.
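A sketch of this decision rule in Python, assuming scipy is available; the simulated data, the true effect of 0.3, and α = 0.05 are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)  # toy sample; true mean 0.3

# H0: mu = 0 against a two-sided alternative
t_obs, p = stats.ttest_1samp(x, popmean=0.0)

alpha = 0.05
if p < alpha:
    print(f"p = {p:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p:.4f} >= {alpha}: cannot reject H0 (not: accept it)")
```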
Illustration
Say your test statistic is T (e.g., T = √n · |x̄|/s)
p = Pr(T > t_obs | H0) (e.g., H0: µ = 0)
This is simply computed from the distribution of T, which is derived from the sampling distribution of the point estimate, which in turn follows from the distribution of Xᵢ (or the CLT).
E.g., T = √n · |x̄|/s follows a folded t distribution!
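A worked computation of this p-value with scipy, mirroring the folded-t logic above (data and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, size=30)
n = len(x)

# T = sqrt(n) * |x̄| / s, the absolute value of the usual t statistic
t_obs = np.sqrt(n) * abs(x.mean()) / x.std(ddof=1)

# Folded t: Pr(|T_{n-1}| > t_obs | H0) = 2 * Pr(T_{n-1} > t_obs)
p = 2 * stats.t.sf(t_obs, df=n - 1)
print(f"t_obs = {t_obs:.3f}, p = {p:.4f}")
```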
Equivalently
The rejection region p < α corresponds, by definition, to a quantile of the distribution of T.
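A short check of this equivalence, with a hypothetical observed statistic and arbitrary degrees of freedom:

```python
from scipy import stats

alpha, df = 0.05, 29
# Critical value: the 1 - alpha/2 quantile of T_{n-1} (two-sided test)
t_crit = stats.t.ppf(1 - alpha / 2, df=df)

t_obs = 2.3  # hypothetical observed statistic
p = 2 * stats.t.sf(t_obs, df=df)
# The two conditions agree by construction
print(p < alpha, t_obs > t_crit)
```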
Most of statistical inference can be explained with likelihood theory.
Let f(Y|θ) denote a general density or probability mass function for a random variable Y, depending on a parameter vector θ.
E.g., for Y ∈ ℝ,
f(Y|θ) = exp(−(Y − µ)²/(2σ²)) / (σ√(2π)),
with θ = (µ, σ²).
Given a sample y₁, …, yₙ, the likelihood is
L(θ) = Pr(y₁, …, yₙ | θ) = ∏ᵢ₌₁ⁿ f(yᵢ|θ)
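A small Python sketch of evaluating this Gaussian (log-)likelihood; working on the log scale avoids numerical underflow of the product. Data and parameter values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=2.0, size=100)

def log_likelihood(theta, y):
    """Gaussian log-likelihood l(theta) = sum_i log f(y_i | theta)."""
    mu, sigma2 = theta
    return np.sum(stats.norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)))

# The product of densities underflows quickly, so one works on the log scale
print(log_likelihood((5.0, 4.0), y))  # near the truth: higher value
print(log_likelihood((0.0, 4.0), y))  # far from the truth: much lower
```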
You want to estimate θ? Use the most likely parameter: maximize the log-likelihood l(θ) = log L(θ) to obtain the MLE θ̂.
The MLE is consistent (as n increases, θ̂ gets closer and closer to θ).
The MLE is asymptotically Gaussian, hence for the MLE you can always compute confidence intervals and tests using Gaussian theory.
The MLE for the Gaussian model is (ȳ, s²), as expected.
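A sketch of obtaining the MLE numerically with scipy and checking it against the closed form (ȳ, s²); the log-variance parametrization is a convenience assumption to keep σ² positive during optimization.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)
y = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_lik(theta, y):
    mu, log_sigma2 = theta            # log-parametrization keeps sigma2 > 0
    sigma = np.exp(0.5 * log_sigma2)
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_lik, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

# Closed form: the MLE is (ȳ, s²) with the 1/n variance estimator
print(mu_hat, y.mean())
print(sigma2_hat, y.var(ddof=0))
```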
What is different from basic inference?
First-order Taylor expansion: f(x) ≈ f(x₀) + (x − x₀)f′(x₀)
Gaussian assumption
Several economic indicators (e.g., income) are severely skewed.
The raw measurements cannot be used directly as endpoints.
One could transform them to (approximate) normality, being careful about interpretation (see below).
Right-skewed variables: take the log, square root, cube root, etc.
Left-skewed variables: take the exponential, square, cube, etc.
Box-Cox: (Y^λ − 1)/λ for some λ ≠ 0.
No transformation is needed for X!
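An illustration with scipy's Box-Cox implementation, which estimates λ by maximum likelihood; the simulated lognormal "income" variable is a stand-in assumption for real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
income = rng.lognormal(mean=10, sigma=1, size=500)  # right-skewed, positive

# scipy estimates lambda by maximum likelihood; boxcox requires y > 0
y_bc, lam = stats.boxcox(income)
print(f"estimated lambda = {lam:.3f}")
print(f"skewness before: {stats.skew(income):.2f}, after: {stats.skew(y_bc):.2f}")

# A simple alternative for right-skewed data is the log transform
y_log = np.log(income)
```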
Likelihood inference
β̂₀ corresponds to an estimate of E[Y|X = 0]. It makes sense when the situation X = 0 makes sense.
Predictors might be mean-centered beforehand, for instance, so that the intercept is interpretable.
Single predictor: β̂₁ is the expected increase in Y for each unit increase in X₁.
Multiple predictors: β̂₁ is the expected increase in Y for each unit increase in X₁ when all other predictors are held fixed.
This takes care of observed confounding!
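A sketch of these interpretations with statsmodels on simulated data (the coefficients, sample size, and seed are arbitrary): the fitted parameters recover the intercept E[Y|X = 0] and the per-unit effects holding the other predictor fixed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)              # predictor of interest
x2 = rng.normal(size=n)              # potential observed confounder
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
fit = sm.OLS(y, X).fit()

# params[0] estimates E[Y | X = 0]; params[1] is the expected change in Y
# per unit increase in x1, holding x2 fixed
print(fit.params)   # close to (1.0, 2.0, -0.5)
```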
Interpretation after transformation
Categorical predictors should be recoded as binary (dummy) variables. One (arbitrarily chosen) category is the reference (all dummies equal to zero).
If there are two categories, the non-reference category takes unit values.
If there are k > 2 categories, k − 1 dummy predictors are created, one for each non-reference category.
Each β̂ is the expected mean difference in Y comparing the non-reference category to the reference one.
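A minimal pandas sketch of this dummy coding; the category names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "Center", "South", "North"]})

# k = 3 categories -> k - 1 = 2 dummies; the dropped category is the reference
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
# Each fitted coefficient is then the expected difference in Y between that
# category and the reference category
```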
Confidence intervals and testing
Tests and confidence intervals for each coefficient are based on β̂/SE(β̂).
Definitely: the number of predictors must satisfy p ≪ n.
Stepwise approaches are possible. These are often based on the Akaike Information Criterion:
AIC = −2l(θ̂) + 2q,
where q is the number of estimated parameters (lower AIC is better).
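A rough sketch of forward stepwise selection driven by AIC, using statsmodels' built-in aic attribute (essentially −2l(θ̂) + 2q); the greedy rule and the simulated design are illustrative assumptions, not the only possible stepwise scheme.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)  # only X0, X2 matter

def aic_of(cols):
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

# Greedy forward selection: add the predictor that lowers AIC the most
selected, candidates = [], list(range(p))
while candidates:
    best = min(candidates, key=lambda j: aic_of(selected + [j]))
    if aic_of(selected + [best]) >= aic_of(selected):
        break
    selected.append(best)
    candidates.remove(best)
print("selected predictors:", selected)  # typically [0, 2]
```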
For ultra-high dimensional problems (say, p/n > 40) an independence screening is needed; for our purposes, also for high dimensional problems.
At the first step, univariate associations with the outcome are examined (see next) and only a subset of the variables is retained.
At the second step, variable selection or (regularized, see next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g., polynomial augmentation) between the first and the second step.
Sure Independence Screening
Choose γ ∈ (0, 1).
Choose a measure of univariate association between Y and Xⱼ, e.g., the absolute value of the univariate regression coefficient.
Select the ⌊γn⌋ ≪ p variables with the largest values of the measure above.
These are then used for model selection or estimation.
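A minimal numpy sketch of this procedure, assuming absolute correlation as the association measure and ⌊γn⌋ as the cut; the function name sis and all dimensions are hypothetical choices.

```python
import numpy as np

def sis(X, y, gamma=0.5):
    """Sure Independence Screening: keep the floor(gamma * n) predictors
    with the largest absolute univariate association with y."""
    n, p = X.shape
    d = max(1, int(gamma * n))
    # Absolute correlation as the univariate association measure
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    score = np.abs(Xc.T @ yc) / n
    return np.argsort(score)[::-1][:d]

rng = np.random.default_rng(7)
n, p = 100, 5000                      # ultra-high dimensional: p >> n
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 10] - 2.0 * X[:, 20] + rng.normal(size=n)
kept = sis(X, y, gamma=0.3)
print(10 in kept, 20 in kept)         # the true signals usually survive
```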
Why model selection?
Occam’s razor.
Bias: the expected difference between the true (population-level) target quantity and the estimate.
Variance: the variability of the estimate under repeated sampling.
Mean Squared Error: Bias² + Variance.
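A small simulation illustrating the decomposition: the 1/n (MLE) variance estimator is biased but can have lower MSE than the unbiased 1/(n − 1) version. Population values and replication counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma2, n, reps = 0.0, 4.0, 20, 20_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# Compare the MLE of sigma^2 (1/n, biased) with the unbiased 1/(n-1) version
for label, ddof in (("MLE (1/n)", 0), ("unbiased (1/(n-1))", 1)):
    est = samples.var(axis=1, ddof=ddof)
    bias = est.mean() - sigma2
    var = est.var()
    print(f"{label:20s} bias^2 = {bias**2:.4f}  variance = {var:.4f}  "
          f"MSE = {bias**2 + var:.4f}")
```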
Bias-variance trade-off
Is everything significant?
Is everything interpretable?
External validity of selected model
Adjusted R²
R̄² = 1 − (1 − R²) · (n − 1)/(n − g − 1),
where g is the number of predictors in the model
It penalizes for the number of predictors
It is a more honest assessment of the expected
performance on new data
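A sketch of the formula on pure-noise predictors, cross-checked against statsmodels; the dimensions are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, g = 100, 10
X = rng.normal(size=(n, g))           # pure noise predictors
y = rng.normal(size=n)                # unrelated outcome

fit = sm.OLS(y, sm.add_constant(X)).fit()
r2, n_obs = fit.rsquared, int(fit.nobs)
r2_adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - g - 1)

# Plain R^2 is inflated by the 10 useless predictors; the adjusted version
# stays near zero (statsmodels agrees: fit.rsquared_adj)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```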
Collinearity
1. Check that Y is Gaussian; transform if necessary.
2. Stepwise model selection for an initial candidate model; SIS if needed.
3. Refine the automatically selected model by excluding (one at a time) non-significant predictors, including some that were left out, and including transformations (e.g., X₃²) and interactions.
4. Final model: everything is significant, everything is interpretable, and collinearity is under control. If the target is (also) prediction: the adjusted R² is large. If anything fails, go back to step 3.
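The last step asks that collinearity be under control. One common diagnostic (an assumption here, since the slides' collinearity material is not shown) is the variance inflation factor, sketched with statsmodels; the VIF > 10 cutoff is a common rule of thumb rather than a fixed rule.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
# VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses X_j on the other predictors
for j in range(1, X.shape[1]):            # skip the intercept column
    print(f"VIF(x{j}) = {variance_inflation_factor(X, j):.1f}")
# A common rule of thumb flags VIF > 10 as problematic collinearity
```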