Statistical Models in R
Some Examples
Steven Buechler
Department of Mathematics
276B Hurley Hall; 1-6233
Fall, 2007
Statistical Models
Outline
Statistical Models
Structure of models in R
Model Assessment (Part IA)
Anova in R
Statistical Models
First Principles
General Problem
addressed by modelling
Model Formulas
Which variables are involved?
Common Features
of model formulas
Approximate Y
Measure of Residuals
A good model should have predictive value in other data sets and
contain only as many explanatory variables as needed for a
reasonable fit.
To minimize RSS we can set ŷi = yi, for 1 ≤ i ≤ n. However, this
“model” may not generalize at all to another data set. It is fitted
entirely to this sample: in the usual terminology it has low bias
but high variance.
We could set ŷi = ȳ = (y1 + · · · + yn)/n, the sample mean, for all
i. This has low variance in that other samples will yield about the same
mean. However, it may fit poorly, giving a large RSS (high bias).
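The two extremes can be compared numerically. The following is a minimal sketch on simulated data; the data, the helper `rss()` and the variable names are assumptions made for illustration and do not come from the lecture. It computes the RSS of each choice of ŷ on the sample used to build it and on a second sample drawn the same way, with a one-variable linear model in between.

```r
# Simulated data: an assumption for illustration, not the lecture's data.
set.seed(1)
n <- 50
x <- runif(n)
y_train <- 2 + 3 * x + rnorm(n)   # sample used to build each "model"
y_new   <- 2 + 3 * x + rnorm(n)   # second sample at the same x values

# Residual sum of squares for responses y and predictions y_hat.
rss <- function(y, y_hat) sum((y - y_hat)^2)

fit_exact <- y_train                  # extreme 1: y_hat_i = y_i
fit_mean  <- rep(mean(y_train), n)    # extreme 2: y_hat_i = sample mean
fit_lm    <- fitted(lm(y_train ~ x))  # in between: one explanatory variable

results <- data.frame(
  model     = c("y_hat = y", "y_hat = mean(y)", "lm(y ~ x)"),
  rss_train = c(rss(y_train, fit_exact), rss(y_train, fit_mean),
                rss(y_train, fit_lm)),
  rss_new   = c(rss(y_new, fit_exact), rss(y_new, fit_mean),
                rss(y_new, fit_lm))
)
print(results)
```

The first row has a training RSS of exactly zero, which says nothing about how it behaves on the second sample; the `rss_new` column is the one to compare across the three rows.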
Bias-Variance Trade-off
Selecting an optimal model, both in the form of the model and the
parameters, is a complicated compromise between minimizing bias
and variance. This is a deep and evolving subject, although it is
certainly settled in linear regression and other simple models.
For these lectures, take as the goal minimizing RSS while keeping the
model as simple as possible.
Just as in hypothesis testing, there is a statistic calculable from the
data and model that we use to measure part of this trade-off.
Continuous ∼ Factors
Sums of Squares
measure the impact of level means
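Here y_{ij} is observation j at level i (K levels, n observations per level), ȳ_i is the mean at level i, and ȳ̄ is the grand mean.

$$\mathrm{SSY} = \sum_{i=1}^{K} \sum_{j=1}^{n} \left(y_{ij} - \bar{\bar{y}}\right)^2$$

$$\mathrm{SSA} = n \sum_{i=1}^{K} \left(\bar{y}_i - \bar{\bar{y}}\right)^2$$

$$\mathrm{RSS} = \mathrm{SSE} = \sum_{i=1}^{K} \sum_{j=1}^{n} \left(y_{ij} - \bar{y}_i\right)^2$$

The factor n in SSA counts each level mean once per observation, so that SSY = SSA + SSE.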
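The sums of squares can be computed directly in R. The sketch below uses simulated data with K = 3 levels and n = 20 observations per level; the data and the level means used to simulate it are assumptions for illustration, and the names `Y` and `LVS` are borrowed from the `aov(Y ~ LVS)` call that appears later in these slides. SSA is weighted by the n observations per level so that the decomposition SSY = SSA + SSE holds.

```r
# Simulated balanced one-way layout: an assumption, not the lecture's data.
set.seed(42)
K <- 3
n <- 20
LVS <- factor(rep(c("A", "B", "C"), each = n))
Y <- rnorm(K * n, mean = rep(c(0, 1, 1.5), each = n))

grand_mean  <- mean(Y)              # mean over all K * n observations
level_means <- tapply(Y, LVS, mean) # one mean per level

SSY <- sum((Y - grand_mean)^2)                # total sum of squares
SSA <- n * sum((level_means - grand_mean)^2)  # between-level sum of squares
SSE <- sum((Y - ave(Y, LVS))^2)               # within-level (residual) sum of squares

print(c(SSY = SSY, SSA = SSA, SSE = SSE))
print(all.equal(SSY, SSA + SSE))
```

`ave(Y, LVS)` returns, for each observation, the mean of its own level, which is the fitted value in this model.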
The statistic used to assess the model is calculated from SSA and
SSE by adjusting for degrees of freedom. Let

$$\mathrm{MSA} = \frac{\mathrm{SSA}}{K - 1}, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{K(n - 1)}.$$

Define:

$$F = \frac{\mathrm{MSA}}{\mathrm{MSE}}.$$

Under all of the assumptions on the data, and under the null
hypothesis that all of the level means are the same, F follows an
F distribution with K − 1 and K(n − 1) degrees of freedom.
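As a check on the definitions, F and its p-value can be computed by hand and compared with R's own ANOVA table. This is a sketch on simulated data (an assumption for illustration, with K = 3 levels and n = 20 observations per level); SSA here carries the factor n for the number of observations per level.

```r
# Simulated balanced one-way layout: an assumption, not the lecture's data.
set.seed(42)
K <- 3
n <- 20
LVS <- factor(rep(c("A", "B", "C"), each = n))
Y <- rnorm(K * n, mean = rep(c(0, 1, 1.5), each = n))

SSA <- n * sum((tapply(Y, LVS, mean) - mean(Y))^2)
SSE <- sum((Y - ave(Y, LVS))^2)

MSA <- SSA / (K - 1)        # mean square for the factor
MSE <- SSE / (K * (n - 1))  # mean square error
F_stat <- MSA / MSE

# Upper-tail probability under the F distribution with (K - 1, K(n - 1)) df.
p_value <- pf(F_stat, df1 = K - 1, df2 = K * (n - 1), lower.tail = FALSE)

# The same quantities from R's ANOVA table for the fitted model.
tab <- anova(lm(Y ~ LVS))
print(c(F_by_hand = F_stat, F_from_R = tab$`F value`[1]))
print(c(p_by_hand = p_value, p_from_R = tab$`Pr(>F)`[1]))
```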
Anova in R
[Figure: plot of the data by level (A, B, C); vertical axis from −3 to 3.]
[Figure: Residuals vs Fitted plot for aov(Y ~ LVS), with residuals on the vertical axis and fitted values on the horizontal axis; observations 52, 35 and 6 are labelled.]
Residuals:
    Min      1Q  Median      3Q     Max
-2.6095 -0.6876  0.0309  0.6773  2.3485

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.314      0.222   -1.41     0.16
LVSB           1.421      0.314    4.52  3.1e-05 ***
LVSC           1.618      0.314    5.15  3.4e-06 ***
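Output of this shape can be produced along the following lines. The data below are simulated, so the numbers will not match the table above; the level means and sample size are assumptions for illustration, and only the call `aov(Y ~ LVS)` is taken from the slides. With R's default treatment contrasts, the intercept is the mean of the first level (A), and `LVSB` and `LVSC` are the differences of the B and C means from the A mean.

```r
# Simulated data: an assumption, not the lecture's data.
set.seed(7)
n <- 20
LVS <- factor(rep(c("A", "B", "C"), each = n))
Y <- rnorm(3 * n, mean = rep(c(-0.3, 1.1, 1.3), each = n))

fit <- aov(Y ~ LVS)

print(summary(fit))     # ANOVA table with the F statistic
print(summary.lm(fit))  # residual summary and coefficient table

# Coefficients under the default treatment contrasts.
coefs <- coef(fit)
mean_A <- mean(Y[LVS == "A"])
mean_B <- mean(Y[LVS == "B"])
mean_C <- mean(Y[LVS == "C"])
print(rbind(from_fit   = coefs,
            from_means = c(mean_A, mean_B - mean_A, mean_C - mean_A)))

# Diagnostic plot: residuals against fitted values.
plot(fit, which = 1)
```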
Kruskal-Wallis Test