
Sample variance of Y
Var(y) = s_y^2 = S_yy/(n-1) = Σ(y_i - ȳ)^2/(n-1)
Var(x) = s_x^2 = S_xx/(n-1) = Σ(x_i - x̄)^2/(n-1)

CLT: sampling from a population, as n → ∞, the sampling distribution of the mean tends to N(μ_x, σ^2/n).

Overall F test (tutorial 2 Q1(c))
The overall F test:
H0: all slope coefficients β1 = β2 = β3 = β4 = 0 (implies Y = β0 + ε)
H1: at least one βj ≠ 0 (j = 1, 2, ..., k)
In SLR this becomes H0: β1 = 0 and H1: β1 ≠ 0.
Same hypotheses as the t-test on the slope coefficient. It is true that the overall F test is equivalent to this test about mean coefficients, but really the F test is a test about variance components (which is why it appears in the ANOVA table).
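As a quick numerical check of the variance identities above, here is a short sketch; the notes work in R, so this Python version (with invented data) is just an illustrative stand-in:

```python
import numpy as np

# Invented data, purely to illustrate the S_xx / S_yy identities above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)   # sum of squares of x about its mean
Syy = np.sum((y - y.mean()) ** 2)   # sum of squares of y about its mean

var_x = Sxx / (n - 1)               # sample variance of x
var_y = Syy / (n - 1)               # sample variance of y
```

np.var(..., ddof=1) returns the same values, confirming the S_xx/(n-1) form.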
Standard error and coefficients (minimised sum of squared errors)
Population: Σ ε_i^2 = Σ (y - ŷ)^2
Sample: Σ e_i^2 = Σ (y_i - β̂0 - β̂1 x_i)^2

T-test on the slope coefficient (individual hypothesis) SA1 Q1(d), A1 Q1(f)
(1) H0: β1 = 0 and H1: β1 ≠ 0, or H0: β1 = 0.02 and H1: β1 ≠ 0.02
(2) t = (b1 - 0)/se(b1), where se(b1) = σ̂/√S_xx; σ is unknown, so σ̂ = RSE and σ̂^2 = s^2 = MSE; se(b1) comes from the summary output.
(3) α = 0.05: reject H0 if t < t_{n-2}(0.025) or t > t_{n-2}(0.975).
(4) t > t_{n-2}(0.975), so reject H0 and conclude βj ≠ 0: there is a significant linear relation between y and x.

Summary for individual hypotheses (β0, β1) (summary table)
β̂1 = S_xy/S_xx (the expected increase in y when x increases by 1)
β̂0 = ȳ - β̂1 x̄ (the expected value of y when x = 0)
S_xy = Σ(x - x̄)(y - ȳ) and S_xx = Σ(x - x̄)^2 = s_x^2 * (n-1)

             Estimate           SE       t (TS)             Pr(>|t|)
Intercept    b0 = ȳ - b1*x̄     SE(b0)   (b0 - 0)/SE(b0)    p(b0)
Slope (x)    b1 = S_xy/S_xx     SE(b1)   (b1 - 0)/SE(b1)    p(b1) = p(F) in SLR
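The slope t-test steps (1)-(4) can be reproduced end-to-end; a sketch with invented data, checked against scipy's linregress (the notes themselves read these numbers off R's summary output):

```python
import numpy as np
from scipy import stats

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx                        # slope estimate, S_xy / S_xx
b0 = y.mean() - b1 * x.mean()         # intercept estimate, y-bar - b1 * x-bar

e = y - (b0 + b1 * x)                 # residuals
mse = np.sum(e ** 2) / (n - 2)        # sigma-hat^2 = s^2 = MSE
se_b1 = np.sqrt(mse / Sxx)            # se(b1) = sigma-hat / sqrt(S_xx)

t = (b1 - 0) / se_b1                  # test statistic for H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value on n-2 df

res = stats.linregress(x, y)          # same numbers as the summary table
```

The hand-computed estimate, standard error and p-value match linregress exactly.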
T-test on the intercept coefficient SA1 Q1(d)
(1) H0: β0 = 0 and H1: β0 ≠ 0
(2) t = (b0 - 0)/se(b0), where se(b0) = σ̂ √(1/n + x̄^2/S_xx) (from the summary output)
Errors are independent and identically (normally) distributed with mean 0 and constant variance σ^2.

*When x = 0 is within the range of the x values, we care whether β0 = 0; allowing a non-zero intercept gives maximum flexibility in how the model fits the data. We should always fit a model with an intercept to observable data. We tend to fit lower-order terms as part of the model regardless of their significance. On the log-log scale we do have values around the intercept, but it is difficult to interpret: lpsa = 0 implies PSA = 1, and lcavol = -0.508 implies a small cancer tumour. If the intercept is distant from the data, don't interpret the value of the intercept.

ANOVA table (overall hypothesis)
Source              df       SS                                      MS                                       F (TS)
Regression (year)   k        SSreg = SStotal - SSerror = Σ(ŷ_i - ȳ)^2   SSreg/k                             MSreg/MSerror = t^2 in SLR
Residual (error)    n-k-1    SSerror = Σ e_i^2 = Σ(y_i - ŷ_i)^2      SSerror/(n-k-1) = s^2 (residual se squared)
Total               n-1      SStotal = Σ(y_i - ȳ)^2                   SStotal/(n-1) = Var(y) = s_y^2
(k = number of x variables, n = number of observations, p = k + 1)

F test for the variance model SA1 Q1(c)
(1) H0: σ^2_{y|x}/σ^2 = 1 and H1: σ^2_{y|x}/σ^2 > 1, where σ^2 reflects the error variance.
F = MSregression/MSresidual
(2) α = 0.05: reject H0 if F > F_{1,n-2}(0.95) in SLR, or F > F_{k,n-k-1}(0.95) in MLR.
So reject H0 and conclude σ^2_{y|x}/σ^2 > 1: the model involving the x is superior to the null model; the proportion of the variance in Y explained by the larger model (involving x) is significantly larger than the error variance.

MSregression = Σ(ŷ_i - ȳ)^2/1 = b1^2 * S_xx, so
F = MSregression/MSerror = b1^2 * S_xx / s^2 = (b1/se(b1))^2 = t^2 (this is only the case in SLR).
*It is not a coincidence that the p-values from the F, t and correlation tests are the same.

No normal distribution: an indicator variable taking values 0 and 1; a categorical score on discrete values in the range 2 to 10; a percentage cannot be <0 or >100 and here is close to both bounds.

(Coefficient interpretation in MR) Note the intercept coefficient is within machine rounding error of zero, and the slope coefficient is the partial regression coefficient for Capacity, i.e. the "slope" associated with Capacity, having already accounted for the effects of Weight.

Type errors             H0 valid                           H0 not valid
Does not reject H0      YES                                False negative (Type 2 error)
Reject H0               False positive (Type 1 error)      YES
P(Type 1 error) = α (significance level); 1 - α is the confidence level.
P(Type 2 error) = β; 1 - β is the power.
A powerful test is one in which we are more likely to correctly reject a false null hypothesis.
*By increasing the sample size, we can reduce both α and β. *A one-tailed test is more powerful than the corresponding two-tailed test.

Correlation
r = cov(x,y)/√(var(x) * var(y)) = [1/(n-1) Σ(x - x̄)(y - ȳ)] / √[(1/(n-1)) Σ(x - x̄)^2 * (1/(n-1)) Σ(y - ȳ)^2] = S_xy/√(S_xx * S_yy)
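The claim above that F = t^2 in SLR (with matching p-values) can be verified directly; a sketch with invented data:

```python
import numpy as np
from scipy import stats

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

msreg = b1 ** 2 * Sxx                  # MSregression = b1^2 * S_xx on 1 df
mserr = np.sum(e ** 2) / (n - 2)       # MSerror = s^2
F = msreg / mserr                      # overall F statistic

t = b1 / np.sqrt(mserr / Sxx)          # slope t statistic

p_F = stats.f.sf(F, 1, n - 2)          # overall F test p-value
p_t = 2 * stats.t.sf(abs(t), n - 2)    # two-sided slope t test p-value
```

The equality F = t^2 (and p_F = p_t) holds only with a single predictor, exactly as the note says.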

Hypothesis test on β1 (one-sided)
(1) H0: β1 = 0 and H1: β1 > 0
(2) t = (β̂1 - E[β1|H0])/se(β̂1), where se(β̂1) = σ̂/√S_xx (under H0, E[β1|H0] = 0)
(3) Reject H0 at α = 0.05 if the observed test statistic > t_{n-2 = 136}(0.95) = 1.66
(4) t = (β̂1 - 0)/se(β̂1) = 20.48
(5) Reject H0 in favour of H1 and conclude β1 > 0: the slope of the regression model relating average temperature anomalies to year is significantly positive and temperatures have been increasing over time.

T-test on the correlation SA1 Q1(a)
H0: ρ_{x,y} = 0 and H1: ρ_{x,y} ≠ 0
t = (r - 0)/se(r) = r √(n-2) / √(1 - r^2)
*The three tests (F and t for coefficients, and t for the correlation) all address the question of whether X and Y are (linearly) associated.
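The correlation t statistic above is numerically identical to the slope t statistic; a sketch (invented data again):

```python
import numpy as np

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

r = Sxy / np.sqrt(Sxx * Syy)                    # sample correlation
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)  # correlation t statistic

# slope t statistic, for comparison
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
t_b = b1 / np.sqrt((np.sum(e ** 2) / (n - 2)) / Sxx)
```

t_r and t_b agree to machine precision, which is why the t and correlation tests share one p-value.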
Modelling process
1. Is the model appropriate? (assumptions)
2. Is the model adequate? (does the model have significant explanatory power) (overall F test using the ANOVA table)
3. Look at the estimated model coefficients (summary)
4. Finally, use predict(model)
5. Overall assessment (is it sensible to use for prediction?)

Assess the results
*1. We still need to check the assumptions underlying the model (plots).
*2. Linear models just address association; they do not necessarily reflect causation.
*3. This analysis does not necessarily rule out alternative explanations.

R^2 = SSregression/SStotal = 75.5%

Assumptions SA1 Q1(d)
iid = independent (no pattern) & identically distributed with constant variance (homoscedasticity); N = normally distributed errors.

Leverage value, hatvalues() ST1 Q1(b)
h_ii = 1/n + (x_i - x̄)^2/S_xx
A point of large leverage (high influence) is one that has an x value a long way from x̄; its h_ii will be large and close to 1.
The sum of the h_ii = p = 2 in SLR. Large leverage is often defined as > 2 * Σh_ii / n, i.e. more than twice the average leverage.
*It is possible for points to have high leverage without being highly influential in the fit of the model.

Residuals vs fitted values plot (besides residuals vs fitted values and the QQ plot, Cook's distance is considered):
1. "Curvature" - a definite pattern indicating dependence in the errors, so the errors are not independent. There are definitely aspects of the underlying relationship that are not included in the model, suggesting it is not appropriate. Apply a transformation to address the non-linearity, or try including a quadratic term in the model.
2. "Heteroscedasticity".
3. "Potential vertical outliers" - the only observation that stands out is the one towards the left of the main residual plot, showing high leverage and a large Cook's distance, but it is not extreme; it may still follow the general trend of the other observations and has not been influential in the overall fit of the model.

Variance of the error, β0 and β1
Var(ε) = σ^2
Var(β̂0) = σ^2 (1/n + x̄^2/S_xx)
Var(β̂1) = σ^2 / S_xx

Theories of causality
Mechanisms that make x cause y; X must precede Y; X and Y always result in some association; rule out spurious correlation (x, y, z).

Coefficient of Determination (R^2) (T2 Q1(d))
The proportion of the variation in Y that can be explained by the model involving the x(s).
R^2 = SSreg/SStotal = 1 - SSerror/SStotal; r is the coefficient of correlation (R^2 = r^2 in SLR).
Adj R^2 = 1 - MSerror/MSTotal = R^2 - (1 - R^2)(df_regression/df_error), the second term being the adjustment factor.
The F test is similar to Adj R^2 in that it adjusts for the df involved, and F can be compared against a known reference distribution.
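The leverage formula and the R^2 = r^2 identity in SLR can be checked directly; a sketch with invented data (R's hatvalues() returns the same h_ii):

```python
import numpy as np

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
h = 1.0 / n + (x - x.mean()) ** 2 / Sxx   # leverage of each observation

# sum of leverages = p = 2 in SLR; "large" is often h_ii > 2 * p / n
large = h > 2 * h.sum() / n

# R^2 and adjusted R^2 from the fitted line
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)    # SSerror
sst = np.sum((y - y.mean()) ** 2)         # SStotal
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - 2)) / (sst / (n - 1))
```

Note adj_r2 is always below r2 once a slope is in the model, reflecting the df adjustment.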

95% confidence intervals for β0 and β1 (T2 Q1(e)) SA1 Q1(e)
β̂1 ± t_{error df}(0.975) * se(β̂1); β̂0 ± t_{error df}(0.975) * se(β̂0) (useful?)
A lot of observations lie outside the confidence intervals, indicating there is a lot of variability around this increasing relationship, so PSA is not necessarily a reliable indicator of tumour size. This may not be a good model to use to try and predict cancer size for an individual with a particular PSA.
However, we would have to extrapolate a long way "off the plot" (i.e. away from the data) to show the intercept, and it doesn't really make sense to talk about the fuel efficiency of a car with zero weight. So this interval is not useful.
Y = c x^d, so ln(y) = ln(c x^d) = ln(c) + d ln(x): the log model is linear, but the original model is not. *High leverage and vertical outliers (influential observations) show up on the Cook's distance plot. General process: examine a plot of the data; transform the data; fit a linear regression model; test the significance and make predictions. Use plots to examine the assumptions; transform the data again; re-fit the regression; re-test and re-predict.

95% confidence interval for E(Y|X=x*) (95% chance the average value of Y lies in the CI) SA1 Q1(e)
ŷ ± t_{error df}(0.975) * se(ŷ), where se(ŷ) = σ̂ √(1/n + (x* - x̄)^2/S_xx)
If x* = 0, the se becomes se(β̂0); if x* = x̄, the se becomes se(ȳ) = σ̂/√n.

95% prediction interval for Y|X=x* (95% chance a specific value of Y lies in the PI)
ŷ ± t_{error df}(0.975) * se(pred), where se(pred) = σ̂ √(1 + 1/n + (x* - x̄)^2/S_xx)
Prediction intervals are wider than the confidence intervals, since they incorporate the extra variability of a single random event, along with the uncertainty of the expected value. Also, both bands have a quadratic shape, indicating that even if we firmly believe that our linear model holds, it is difficult to predict accurately as we move away from the centre of the data.

**********************************Q(6)***********************************
(a) From the table of partial regression coefficients, the intercept 0.3458097 is r and the slope -0.0004088 is equal to -r/K, so K = -0.3458097/-0.0004088 ≈ 845.91. As the name suggests, r is the growth rate from year to year, i.e. in most years the population increases by a multiplicative factor of e^r = exp(0.3458097) ≈ 1.41; however, the intercept term is not significantly different from 0 (t = 1.4, p = 0.17). Equation (3) suggests that when the population grows to exceed K, the carrying capacity, then r, the growth rate, becomes negative and the population will decline.

(b) Observation 17 is the only observation with an nt value greater than 7,000, and this observation is for the year 1976. The 1976 population of 7,227 does represent a substantial increase on the previous year, 1975, but the population for that year, 1,819, was already in excess of the estimated carrying capacity (K). Even though the population did collapse after 1976, to 852 in 1977, the reduced population was still greater than the estimated K.
This observation has definitely been highly influential in the fit of the model, though it only appears to have a relatively small raw (unstandardised) residual and will probably not be classed as a vertical outlier. We should check the usual residual and Cook's D plots and various influence diagnostics. I would expect observation 17 to have high leverage; large DFFITS, DFBETAS and COVRATIO values; and possibly a large externally studentised residual.

(c) The residuals vs fitted plot does not really show a departure from the assumption of independence, and the variance looks reasonably constant; however, the large horizontal gap suggests that the model is predicting distinctly different fitted values for a group of 4 observations. This group includes observation 16, which refers to 1975, which had the equal second largest population after the now removed 1976. This observation appears to have been promoted to potential problem status, now that we have removed observation 17. The other members of this group are likely to be the other years with big populations (1972, 1981 and 1983). The normal quantile plot shows only minor departures from the assumption of normality, given the relatively small sample size. However, the problem with observation 16 is again apparent as a potential vertical outlier in the upper tail of the residual distribution. The internally studentised residual for observation 2 is not less than -2, suggesting less of a problem in the lower tail. The Cook's distance plot suggests problems with observation 16 and with the point with the second most negative internally studentised residual, observation 23, which is almost certainly the point with the large negative residual at the bottom of the group of 4 observations on the left side of the main residual plot. Following the removal of 1976, observation 23 now refers to 1983, which was another year with a relatively big population. Both of these points also probably have relatively high leverage, if we check some of the influence diagnostics. (Describe the plots.)

(d) I would be reluctant to remove observation 16 (as well as observation 17), as we already have a small sample size and I suspect removal of any observation would just promote another observation to potential problem status (observation 23). The data do appear to divide into two distinct groups (usual years and big population years) and it may not be reasonable to fit the same model to both groups. We could try modelling the two groups separately, by including an indicator variable in the model, but we have only a relatively small number of observations in the big population group, depending on how we define "big".

(f) While the model is arguably an appropriate model for most of the data, there is definitely an issue with the group of observations with big populations identified in the discussion in part (d). This issue means the model is far from being an adequate description of the data, as evidenced by the highly imprecise prediction in part (e). It is very hard to believe that just one model of this type can actually cover both of the groups in the data (and 1976, which is arguably yet another group of size 1). The usual years may well follow a Ricker model of regular growth, followed by a collapse once the population exceeds some optimal carrying capacity. However, there is definitely something different going on in the big population years, when the population appears to explode well beyond the supposed carrying capacity threshold. Including an indicator variable for the type of year (big population or usual) may be enough to allow for different models for the two groups, but there may be some environmental covariates we could measure (e.g. temperature, rainfall) that might suggest why the carrying capacity seems to suddenly increase in certain years.
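The CI and PI formulas above differ only by the extra "1 +" under the square root; a sketch evaluating both at a chosen x* (invented data):

```python
import numpy as np
from scipy import stats

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))    # residual standard error (sigma-hat)

xstar = 4.0                              # point at which to predict
yhat = b0 + b1 * xstar
tc = stats.t.ppf(0.975, n - 2)           # t critical value on error df

se_fit = s * np.sqrt(1 / n + (xstar - x.mean()) ** 2 / Sxx)       # CI for E(Y|x*)
se_pred = s * np.sqrt(1 + 1 / n + (xstar - x.mean()) ** 2 / Sxx)  # PI for Y|x*

ci = (yhat - tc * se_fit, yhat + tc * se_fit)
pi = (yhat - tc * se_pred, yhat + tc * se_pred)
```

At x* = x̄ the CI standard error collapses to s/√n, and the PI always strictly contains the CI, matching the "PIs are wider" note above.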

Q1 Practice question 1
1. Errors are not independent: see "Curvature".
2. ANOVA table: see the "ANOVA table" section.
3. Prediction interval: see the "Prediction interval" section.
Notes: The observed Y values fall within these intervals, but are towards the lower end of the range for the mean number of weeks (and the predicted value is a lot larger than either observed value). The observed Y values also fall towards the upper end of the range for both the minimum and maximum number of weeks (and the predicted values are a lot smaller than the observed values). So, even if we do not go the extra step of plotting the residuals, these systematic departures of the model from the observed data suggest there are major problems with the fit of this model.
4. Why should you still include the other terms from the initial model as well? Is this additional term a significant addition to the initial model? The ANOVA table is again not shown for this second model, but what would be the F statistic and degrees of freedom associated with this additional term?
When fitting higher order terms (quadratic or interaction terms), we should always include all lower order terms to allow maximum flexibility in how the model fits the data. So, unless there is a good reason to assume a model of a particular form, a quadratic model involving Weeks.sqd such as potatoes.lm2 should include the linear term in Weeks and the constant (Intercept) term. Judging by the t test on the coefficient of Weeks.sqd in the summary output (t11 = 8.04, p = 0.00000623), this addition to the model is significant at α = 0.05. As this test is for the last term added to the model, we can square this test statistic to find the equivalent sequential F test in the ANOVA table (F1,11 = 64.64, p = 0.00000623).

Q2 Practice question 1
(b) It has a large externally studentised residual value, which is probably just outside t27(0.025) = -2.0518 (note the degrees of freedom are 1 less than the residual df for trees.lm). There is also a relatively large vertical gap between this residual and the other residuals, so it is probably also a point of high leverage (we would need to calculate the hatvalues() to confirm this), which is probably highly influential in the fit of the model. This is why the observation stands out from the other observations on the plot of Cook's distances. This observation would probably require some sort of treatment (exclusion), if it continues to be a problem once we are satisfied we have the most appropriate model.
(c) The two models are not nested, as one model is not a subset of the other, so we cannot use a nested model F test to decide between the two models (log vs non-log). The two models do share the same response variable, so we can directly compare summary measures based on the residuals, such as the residual standard error (smaller is better in terms of RSE). Even if they were not on the same scale, we could arguably still compare properly scaled summary measures such as the R^2, the adj R^2 and the overall F-statistic for the two models (larger is better for these other summary measures). The second model trees.lm2 appears to be slightly better on all of the summary measures, but the main reason I would choose trees.lm2 over trees.lm is that it appears to be a better fit to all of the data (including observation 31, which no longer features prominently on the Cook's distance plot). Note the largest Cook's distance is much smaller for trees.lm2 than for trees.lm.
(d) ln(V̂) = β̂0 + β̂1 ln(r) + β̂2 ln(h), so V̂ = exp[β̂0 + β̂1 ln(r) + β̂2 ln(h)] = e^β̂0 * r^β̂1 * h^β̂2.
For a perfect cylinder V = π r^2 h (c = π ≈ 3.1416) and for a perfect cone V = (π/3) r^2 h (c = π/3 ≈ 1.0472). Here β̂1 ≈ 2 and β̂2 ≈ 1, so the above relationship is close to V̂ = c r^2 h^1, where c = e^β̂0 = exp(-0.33065) = 0.7185.
So the fitted model is suggesting a fitted Volume of useable timber that is somewhat less than assuming that black cherry trees are either perfect cylinders or cones, although of those two alternatives, they are closer to fitting a conical tree model. There could also be other problems with one of the many other assumptions made above; for instance, the height may have been measured to the top of the tree rather than the top of the trunk.

Q3 Practice question 1
(a) Multicollinearity: strong relationships between the predictors. Judging by the scatterplot matrix and the variance inflation factors, the main culprit is the strong association between Electoral_Roll and Attendance, which are two closely related measures of engagement with the Church of England. Again judging by the scatterplots, either of these measures is potentially a better predictor of Annual_giving than Employment, but Electoral_Roll and Attendance are so closely related that they will cause problems if both of them are included in the same model. As the ANOVA tables are fitted sequentially, the significance of Employment depends on which other variables have been included.
(c) Forward selection:
1. Start with a null model (involving just the intercept term).
2. Add the potentially most significant predictor (the one with the largest sum of squares), so add Electoral_Roll with a SS of 589.65, as in model church.lm2.
3. Add whichever remaining predictor is the most significant addition to the model. Comparing models church.lm2 and church.lm2a, the next most significant addition would be Employment with a SS of 251.91, rather than Attendance with a SS of 64.90.
4. Judging by the last sequential F-test in church.lm2b, the addition of Employment to a model that already includes Electoral_Roll is significant (F1,17 = 5.3906, p = 0.032923). Finally, it is apparent from model church.lm2a that the addition of the last remaining candidate predictor, Attendance, is not a significant addition (F1,16 = 0.0519, p = 0.822615), so we stop and choose model church.lm2b, a model which does not have a multicollinearity issue.

****************coefficients interpretation**************************
(d) The coefficient of Electoral_Roll suggests Annual_giving declines by around ₤4.01 as Electoral_Roll increases by 1%, and this decrease is significant (t11 = -4.08, p = 0.000779), holding the other variables constant. This suggests communities with higher engagement tend to have lower Annual_giving than less engaged communities. However, it may be that more engaged communities are the ones that are likely to seek assistance from the church.
Similarly, the coefficient of Employment suggests that on average Annual_giving increases by around ₤1.34 as the Employment rate increases by 1%, and this increase is significant (t11 = 2.322, p = 0.032923). Communities with higher levels of employment can give more to the church.
A negative intercept suggests that for the poorest communities the church gives out more in charity than it receives in donations. However, the intercept is not significant (t11 = -1.086, p = 0.292531), and the intercept is outside the range of both predictors, as Electoral_Roll ranges from 1.9% to 8.7% and Employment ranges from 82.6% to 92.8%.
(e) Residual standard error = √MSerror, on 17 df. Multiple R-squared = SSreg/SStotal. Adj R^2 = 1 - MSerror/MSTotal = R^2 - (1 - R^2)(df_regression/df_error). F = ((SSaddition1 + SSaddition2)/2)/MSerror (test statistic). The above F-statistic is on 2 and 17 degrees of freedom, so it can be compared with F2,17(0.95) = 3.592, and we can conclude that at least one of the terms in the model is significant: either the term involving Electoral_Roll or the term involving Employment, or, in this instance, as shown in part (d), both terms are significant.

Q4 Practice question 1
(a) The first model sexab.lm1 includes an interaction term which allows for different slope coefficients for the two different categories of abused, so plot B is the one that shows sexab.lm1. Plot A, which has parallel lines (different intercepts, but the same slope), is showing the sexab.lm2 model, which does not include an interaction term.
(b) The two models are nested in that model sexab.lm1 includes an interaction term which is additional to the basic additive relationship in model sexab.lm2, so we can apply a sequential F test to the cpa:abused interaction term in the ANOVA table for model sexab.lm1. This F test (F1,72 = 0.726, p = 0.397) suggests that the interaction term is not a significant addition to the model.
So the appropriate relationship is the parallel lines sexab.lm2 model shown in plot A, rather than the more complicated sexab.lm1 model shown in plot B. The coefficients of the sexab.lm2 model are given on page 12 of the R output, from which we can derive the following fitted model equations for the two levels of abused (Y = ptsd, X = cpa):
For abused = 0: ŷ = 3.97 + 0.55x + 6.27(0)
For abused = 1: ŷ = 3.97 + 0.55x + 6.27(1)
(d) The coefficient of abused suggests that (regardless of cpa) we expect women who suffered abuse (abused = 1) to be around 6.27 points higher on the standardised ptsd scale than women who have not been abused (abused = 0), and this difference is significant (t73 = 7.623, p = 0.0000).
The coefficient of cpa suggests that on average ptsd increases by around 0.55 of a point on the standardised scale for each unit increase in cpa, and this increase is significant (t73 = 3.209, p = 0.00198). We have the same expected increase in ptsd as cpa increases, in both of the abused categories.
The intercept is within the range of the data and is also significant (t73 = 6.317, p = 0.0000), but we would need to know more about the standardised scales used in order to interpret the value of 3.9753 on the ptsd scale for the abused = 0 category. I doubt that it implies that even women who have not been abused have a significantly non-zero level of ptsd. (Coefficient interpretation.)
(e) The sexab.lm2 model implies both cpa and csa have effects on the expected levels of ptsd, but that these effects are additive. The inclusion of both variables in the same multiple model allows us to control for the effects of one predictor in examining the effects of the other predictor. The two-sample t-test is equivalent to just the simple linear regression model of ptsd regressed on csa (abused); it does not control for the significant effects due to cpa. However, as the effects of cpa and abused are additive, the two-sample t-test on the categories of csa will probably give very similar results to the t-test on the coefficient of abused.

*********************************** Q5 ******************************
(1) A few of the islands (Caldwell, Enderby and Onslow) have no non-endemic species (i.e. Species = Endemics), so the value of our chosen measure of Diversity is 0. log(0) is not defined, which means these islands would be treated as missing and not included in the estimation of the regression model. To avoid losing these observations, we need to add a small positive constant before applying the log transformation.
(3) No obvious departure from the assumption of independence on the main residuals vs fitted plot, though there is a suggestion of decreasing (non-constant) variance as the fitted values increase. As usual, 3 observations have been labelled (by default). There is some space in the vertical direction between observation 7 (Daphne Minor) and the other observations, so it might be a possible outlier, but this is definitely not the case for the other two labelled observations (3 and 13).
Similarly, there is no obvious departure from the assumption of normality on the normal quantile plot. Only observation number 7 out of 30 observations has an internally studentised residual outside (-2, 2), and it is only just outside this range.
The Cook's distance for observation 7 appears large relative to the other observations; however, the vertical scale on this plot only goes to just over 0.2, which is relatively small for Cook's distances, even with a relatively small sample size.
(d) I suspect the apparent decreasing variance on the main plot and the status of observation 7 as a possible outlier are linked. The log transformation, applied to the response and some of the explanatory variables, appears to have been too strong, as it has over-corrected the observations with larger fitted values, which leaves some of the observations with smaller fitted values looking like potential outliers. We could experiment with weaker transformations, such as a square root transformation, which might be appropriate, given the nature of variables such as Area.
(f) The researchers collected the other explanatory variables as they had prior assumptions that factors such as the nature of the terrain (Elevation), the proximity (Nearest) and size (Adjacent) of the nearest island, and how far the island is from the main centre of human population (Scruz) were all likely to play a role. The researchers will probably not be happy to find that none of these variables makes a difference and that the only significant factor in explaining species diversity is the size of the island. There are a number of other variables that might affect diversity that have not been measured and included in the study. With the variables that have been measured, the lack of significance could be the result of measurement issues, in that the collected variables may only be poor proxy measures for the factors that the researchers are really interested in. For example, is the suggested response variable the best way to measure species diversity? Another possibility is that we have not used the best approach to incorporate the information in these variables into the model. Added variable plots suggest different scales for some of the explanatory variables.
(g) Added variable plots are used to assess whether or not an additional variable (a candidate predictor) should be added to a multiple regression model. They might even suggest a certain functional form (e.g. a particular transformation or a quadratic term) to model the relationship between the unexplained part of the response variable and a candidate predictor (also adjusted for the effects of the other variables). However, none of the added variable plots on page 13 suggest any linear or other relationships, so they do not appear to help in this instance. The closest to suggesting some relationship is the added variable plot for Scruz, but the apparent reasonably strong negative slope is heavily influenced by just a couple of observations (probably Wolf and Darwin, the two remote islands located more than 250 kms from Santa Cruz).

Sequential F test for nested models
It is just the special case of a nested test where we are adding a single additional term.

100(1-α)% confidence interval for E[Y|x]
ŷ ± t_{error df}(0.975) * s * √(x0ᵀ(XᵀX)⁻¹x0); in SLR, ŷ ± t_{error df}(0.975) * s * √(1/n + (x - x̄)^2/S_xx), with S_xx = s_x^2 * (n-1).

100(1-α)% prediction interval for Y|x
ŷ ± t_{error df}(0.975) * s * √(1 + x0ᵀ(XᵀX)⁻¹x0)

Problem of multiple comparisons
With m comparisons, P(all "tests" accepted) = 1 - P(at least one test is rejected) ≥ 1 - Σ P(each test is rejected) = 1 - mα. With m = 3 comparisons, each at α = 0.05, the overall confidence is only 1 - 3(0.05) = 0.85. If the target overall error is mα = 0.05, then each α = 1.6%, i.e. each interval becomes a 98.3% confidence interval, which makes the overall (1 - mα) confidence level correct.

Internally studentised residuals
r_i = e_i / (s √(1 - h_ii)), where s = √MSerror.

Deletion residual (PRESS)
Also called the PRESS ("prediction sum of squares") residual: e_{i,-i} = Y_i - Ŷ_{i,-i} = e_i / (1 - h_ii).

Externally studentised residuals (rstudent)
t_i = e_i / (s_{-i} √(1 - h_ii)), where s_{-i} = √MSerror,-i (the residual scale estimated with observation i deleted).
The null hypothesis for each test is H0: Δi = 0, where Δi is the mean shift in the model caused by the i-th observation.

Added variable plots
We can check whether the relationship on the AVP is linear by adding a SLR line. We want to maximise the variance explained by the model (involving the x variables) and minimise the unexplained variance. However, if the residuals show non-random structure, the explained variance is relatively small and the overall F-test is not significant, then adding another x is unlikely to help; here the added variable plot now shows signs of non-linearity.

DFFITSi (deletion measure of the change in fitted value)
The approximate number of standard deviations by which the predicted value will change if the i-th observation is deleted. If |DFFITSi| > 2√(p/n), the observation is regarded as potentially influential.

DFBETASi (deletion measure of the change in a parameter estimate)
If |DFBETAS| > 2/√n, the observation is regarded as potentially influential.
*Cook's distance can be seen as measuring the overall effect of the i-th observation on the parameters, whereas the DFBETAS measure the effects on each of the individual parameters.

COVRATIOi (covariance ratio)
How the i-th observation is affecting the estimate of the residual scale (s^2 = MSerror), which is the key estimate in all of the statistical inference using the model. It should not be outside (1 - 3p/n, 1 + 3p/n).

Variance Inflation Factor
VIF = 1/(1 - R^2), where R^2 is from regressing that predictor on the other predictors.

Multiple regression model introduction
p = k + 1. β1, β2, β3, ... are called partial regression coefficients: the expected change in y as the corresponding x changes by 1, with all the other x variables held constant.
Transformations: Y = c x1^d x2^h, so ln(y) = ln(c x1^d x2^h) = ln(c) + d ln(x1) + h ln(x2). Y = β0/(1 + β1x1 + β2x2), so 1/Y = 1/β0 + (β1/β0)x1 + (β2/β0)x2 = r0 + r1x1 + r2x2.

Polynomial of degree
A polynomial of degree (n-1) (the total df) will be a perfect fit for n observations, but there are no degrees of freedom left to estimate the variance of the error.
Why do the SSs in the ANOVA depend on the order you fit the model, but the fitted model is the same?
SSR = SSR(β1, β2, β3 | β0) = SSR(β1 | β0) + SSR(β2 | β1, β0) + ...
SSR(β2 | β1, β0) is the amount of unexplained variability from a simple linear regression on x1 which is subsequently explained by x2, representing the increase in SSR over the model with x1.

Partial F test for nested models
(1) Y = β0 + ε vs (2) Y = β0 + β1x1 + ε; or (1) Y = β0 + β1x1 + ε vs (2) Y = β0 + β1x1 + β2x2 + ε.
Model (1) is "nested" inside model (2), so the overall F test for a SLR model is a special case of a partial or sequential F test, which is a test for part of a model: the addition of extra terms.
H0: σ^2_addition/σ^2_error = 1 and H1: σ^2_addition/σ^2_error > 1; equivalently, H0: β2 = 0 vs H1: β2 ≠ 0, or H0: β4 = β5 = 0 vs H1: not both of β4 = β5 = 0.
F = ((SSaddition1 + SSaddition2)/q)/MSerror (test statistic), where q is the number of added terms. Decision rule: α = 0.05, reject H0 if F > F_{addition df, error df}(0.95); draw a picture. Reject H0 in favour of H1 and conclude at least one of the two additional terms is a significant addition to the base model.
(Backward elimination) Include all candidate predictors. Re-order the model so the least promising one appears at the end. Select the variable with the smallest sequential F (not the selection the literature suggests), remove it, and repeat; the remaining variables are significant.

Model selection
Of two similar models, we will choose the simpler one if there is no significant difference between them.
1. Less unexplained variation: a smaller MSE, with an F test indicating whether the apparent drop in s^2 is significant; but this cannot compare models with different y scales.
2. Larger R^2, but there is no obvious point of comparison, and R^2 does not protect against over-fitting, as an additional x will always increase the R^2.
3. Larger adjusted R^2 (1 - MSerror/MSTotal), which does adjust for the df involved (equivalent to preferring models with a more significant overall F test, F = MSreg/MSerror).
4. Smaller internally studentised residual sum of squares; problems with outliers if PRESSp >> s^2.
5. Mallow's Cp (mis-specifying the model creates bias in the estimate of σ^2, and over-fitting will inflate the variances for predictions). Prefer models where Cp ≈ p (bias term is 0), with smaller values.
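The PRESS identity in the notes, e_{i,-i} = e_i/(1 - h_ii), says the leave-one-out residual can be computed without refitting; a sketch (invented data) that verifies it against a genuine refit:

```python
import numpy as np

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 5.9, 6.2, 7.8, 8.1])
n = len(x)

X = np.column_stack([np.ones(n), x])       # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_ii (sum to p = 2 in SLR)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                           # ordinary residuals
press = e / (1 - h)                        # deletion (PRESS) residuals

# verify the identity against an actual leave-one-out refit for observation 0
mask = np.arange(n) != 0
beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
e_loo = y[0] - X[0] @ beta_loo             # y_0 minus its deletion prediction
```

The shortcut and the refit agree exactly, which is what makes PRESS cheap to compute for model selection.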
