STAT 440. Homework 2.

Juan Pablo Madrigal Cianci


September 22, 2014
Chapter 2. Exercise 2: Mortgage Indicators
For this exercise we are required to fit a simple linear regression (SLR) model to examine the claim
that prices are generally falling while overdue loan payments are piling up. We start by reading and
organizing the data:
d=read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/indicators.txt"
,header=TRUE)
p=d$PriceChange
lpo=d$LoanPaymentsOverdue
length(lpo)
## [1] 18
(a) Fitting a linear regression model (LRM) to the data set:
names(d) <- c('city', 'p', 'lpo')
#Applies the Simple linear regression model yi=b0+b1xi
slr <- lm(p~lpo,data=d)
We get:
summary(slr)
##
## Call:
## lm(formula = p ~ lpo, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.654 -3.342 -0.694 2.529 6.916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.514 3.324 1.36 0.193
## lpo -2.249 0.903 -2.49 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.95 on 16 degrees of freedom
## Multiple R-squared: 0.279,Adjusted R-squared: 0.234
## F-statistic: 6.2 on 1 and 16 DF, p-value: 0.0242
Plotting:
par(mfrow=c(1,1))
#creates Scatterplot
plot(lpo,p,
main='Percentage change vs LPO (In Dollars)',
ylab='Percentage Change',
xlab='LPO')
#Creates regression line
abline(b=coef(slr)[2],a=coef(slr)[1],col=2)
#finds fits
fits <- fitted(slr)
attach(d)
## The following objects are masked _by_ .GlobalEnv:
##
## lpo, p
predict(slr,level=.95);
## 1 2 3 4 5 6 7 8 9
## -5.7163 -2.9281 -2.2086 -5.0642 -3.4902 -6.0760 -6.5033 -2.3435 -8.1447
## 10 11 12 13 14 15 16 17 18
## -2.2536 -2.8831 -2.8157 0.1749 -3.2429 -0.6346 0.8044 -5.8287 -2.5459
#finds boundaries
lb <- predict(slr,d,interval="predict",level=.9)[,2]
ub <- predict(slr,d,interval="predict",level=.9)[,3]
lm_coef <- round(coef(slr), 3) # extract coefficients
mtext(bquote(y == .(lm_coef[2])*x + .(lm_coef[1])),
adj=1, padj=0) # display equation
#draws the 90% prediction-interval bounds (lb, ub)
points(lpo[order(lpo)],lb[order(lpo)],
type='l',col=4,lty=2)
points(lpo[order(lpo)],ub[order(lpo)],
type='l',col=4,lty=2)
[Figure: scatterplot "Percentage change vs LPO (In Dollars)"; x-axis: LPO, y-axis: Percentage Change; fitted line y = -2.249x + 4.514 with dashed interval bounds.]
As we can see, the data are not very linear; however, there seems to be an inverse relation between
the percentage price change and the LPO. To find a 95% confidence interval for the slope, we need the
t value associated with α = 0.05 and df = n - 2 = 16, which is given by:
t<-qt(1 - 0.05/2, df = length(lpo) - 2)
t
## [1] 2.12
Then, the 95% confidence interval is given by -2.249 ± 1.914274. That is, I am 95% confident that
the slope lies between -4.163274 and -0.334726. Since the whole interval is negative, there is
a somewhat strong indication that there is a negative relation.
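The interval arithmetic can be checked directly from the numbers printed by summary(slr). This is only a minimal sketch: the estimate and standard error are typed in by hand from the output above, and the variable names are chosen here for illustration. With the fitted object available, confint(slr) should give the same interval without any manual steps.

```r
# Values copied from the summary(slr) output above (not recomputed from the data)
est <- -2.249   # slope estimate for lpo
se  <- 0.903    # standard error of the slope
n   <- 18       # number of observations

# Critical value with n - 2 degrees of freedom (two estimated coefficients)
t_crit <- qt(1 - 0.05 / 2, df = n - 2)

# 95% confidence interval: estimate -/+ t * se
ci <- est + c(-1, 1) * t_crit * se
ci
```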
(b) To estimate a value for E(Y |X = 4) we use the following code:
newdata<-data.frame(lpo=4)
estpoint<-predict(slr,newdata,interval="predict",level=0.95)
estpoint
## fit lwr upr
## 1 -4.48 -13.14 4.179
To which we answer that, even though the model does not fit perfectly, 0% is indeed a feasible value
for the percentage price change, since it lies inside the interval above. Note, however, that this
interval is a prediction interval (interval="predict"), which is wider than a confidence interval for
E(Y|X = 4), and that the fit of the model is not perfect, so some reserve is warranted when saying this.
It would also be good to apply a transformation to these data so that they are easier to analyze.
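To illustrate the difference between the two kinds of interval, here is a small sketch on synthetic data. The data are simulated and are NOT the indicators data set; the coefficients, noise level and seed are arbitrary assumptions. It only shows that, for the same fitted value, interval = "confidence" (for the mean response) is narrower than interval = "prediction" (for a single new observation).

```r
# Synthetic data, loosely shaped like the exercise (hypothetical values)
set.seed(1)
x <- runif(18, 1, 5)
y <- 4.5 - 2.2 * x + rnorm(18, sd = 4)
fit <- lm(y ~ x)

new <- data.frame(x = 4)

# Interval for the mean response E(Y | X = 4)
ci_mean <- predict(fit, new, interval = "confidence", level = 0.95)

# Interval for one new observation at X = 4
pi_new <- predict(fit, new, interval = "prediction", level = 0.95)

rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ])
```

Both rows share the same fit; only the bounds differ, so whether 0% is "feasible" can depend on which interval is used.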
Analyzing the diagnostics:
library("MASS")
residuals <- resid(slr)
fitted <- fitted(slr)
student <- studres(slr)
leverages <- hatvalues(slr)
cooks <- cooks.distance(slr)
par(mfrow=c(2,2))
hist(residuals)
qqnorm(residuals)
qqline(residuals)
plot(lpo,residuals)
plot(fitted,residuals)
[Figure: 2x2 diagnostic panels: histogram of residuals; normal Q-Q plot of residuals; residuals vs lpo; residuals vs fitted.]
par(mfrow=c(3,2))
hist(student)
qqnorm(student)
qqline(student)
plot(lpo,student)
plot(fitted,student)
plot(leverages,cooks)
[Figure: diagnostic panels for the studentized residuals: histogram of student; normal Q-Q plot of student; student vs lpo; student vs fitted; cooks vs leverages.]
We can see that the model is not exactly linear and that the residuals are not quite normally
distributed (they are a little bit skewed to the right). There does not seem to be any pattern in
the residuals, so the model seems like a good fit. Also, it could be said that the variance is constant.
There do not seem to be any big outliers, and the Cook's distances seem good. I consider the linear
model valid; however, there is a lot of variation, which should also be expected, since the
percentage price change does not depend only on the LPO.
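The visual checks can be complemented with numeric flags. The sketch below uses synthetic data (not the indicators data set) and common rules of thumb that are assumptions here rather than part of the exercise: leverage above 4/n, absolute standardized residual above 2, and Cook's distance above 4/(n - 2).

```r
# Synthetic data (hypothetical values, arbitrary seed)
set.seed(2)
n <- 18
x <- runif(n, 1, 5)
y <- 4.5 - 2.2 * x + rnorm(n, sd = 4)
fit <- lm(y ~ x)

# Collect the usual case diagnostics in one table
diag_tab <- data.frame(
  leverage = hatvalues(fit),
  std_res  = rstandard(fit),
  cooks    = cooks.distance(fit)
)

# Rule-of-thumb flags (thresholds are conventions, not hard limits)
diag_tab$high_leverage <- diag_tab$leverage > 4 / n
diag_tab$outlier       <- abs(diag_tab$std_res) > 2
diag_tab$influential   <- diag_tab$cooks > 4 / (n - 2)

# Show only the flagged cases, if any
diag_tab[diag_tab$high_leverage | diag_tab$outlier | diag_tab$influential, ]
```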
Chapter 3, Exercise 1: Airfare Analysis
(a) The model does in fact fit pretty well for the given values of distance; however, it would not be
a good idea to extrapolate, i.e., to try to predict for distances much bigger than 2000. Also, even
though the fitted model looks fine graphically, I think it is important to consider that
airfare is also influenced by how popular the destination is and by the time of the year; it is
probably cheaper to fly in mid-September than on December 31st. Also, an R^2 of 0.994 does not
necessarily mean that the fitted regression is a valid model: R^2 can be very high even when the
residuals show a systematic pattern, especially when the predictor covers a wide range of values
(which tends to inflate R^2).
(b) When we analyze the plot of the residuals vs distance we see that there is some kind of
pattern, which bears some resemblance to a quadratic curve. As a first impression, I would say that this
model behaves linearly over the given interval; however, for bigger values, the model MIGHT behave
quadratically. Naturally, this may cause some trouble with the conclusion in (a). Moreover, there seem to
be two outliers. It would also be good to fit different models for different dates, because, as
I mentioned before, prices are not only a function of distance, and the people to whom those prices
concern (either consumers or airline managers) might also need these data for further planning.
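The point that a very high R^2 can coexist with a curved residual pattern can be reproduced on simulated data. This is a sketch under stated assumptions: the data below are synthetic (not the airfare data), and the coefficients, noise level and seed were picked only to make the effect visible.

```r
# Synthetic "fare vs distance" data with a mild quadratic component
set.seed(3)
x  <- seq(100, 2000, length.out = 30)
x2 <- (x - mean(x))^2                      # centred quadratic term
y  <- 50 + 0.2 * x + 3e-5 * x2 + rnorm(30, sd = 2)

# Straight-line fit: R^2 is very high
lin    <- lm(y ~ x)
r2_lin <- summary(lin)$r.squared

# Adding the quadratic term: its p-value shows the curvature is real
quad   <- lm(y ~ x + x2)
p_quad <- summary(quad)$coefficients["x2", "Pr(>|t|)"]

c(r2_linear = r2_lin, p_value_quadratic = p_quad)

# plot(x, resid(lin)) would show the U-shaped pattern described above
```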
Chapter 3, Exercise 3: Advertisement Revenue
We start by reading and organizing the data:
ad<-read.csv('http://www.stat.tamu.edu/~sheather/book/docs/datasets/AdRevenue.csv')
names(ad) <- c('name', 'company', 'rev','circ')
We then find the fitted model:
slr2=lm(rev~circ,data=ad)
summary(slr2)
##
## Call:
## lm(formula = rev ~ circ, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.69 -22.94 -7.84 13.81 131.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.810 5.855 17.1 <2e-16 ***
## circ 22.853 0.952 24.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.2 on 68 degrees of freedom
## Multiple R-squared: 0.894,Adjusted R-squared: 0.893
## F-statistic: 577 on 1 and 68 DF, p-value: <2e-16
attach(ad)
#creates Scatterplot
plot(circ,rev,
main='Ad. revenue vs Circulation',
ylab='Ad. revenue',
xlab='Circulation')
#Creates regression line
abline(b=coef(slr2)[2],a=coef(slr2)[1],col=2)
lm_coef <- round(coef(slr2), 3) # extract coefficients
mtext(bquote(y == .(lm_coef[2])*x + .(lm_coef[1])),
adj=1, padj=0) # display equation
[Figure: scatterplot "Ad. revenue vs Circulation"; x-axis: Circulation, y-axis: Ad. revenue; fitted line y = 22.85x + 99.81.]
As we can see, the model does not seem to fit the data we have very well. I consider that it is
better to take the transformation Log(X + 1) for the circulation (the +1 is so we do not get negative
values of the logarithm; i.e., it just shifts the plot) and Log(Y) for the revenue:
lr=log(rev)
lc=log(circ+1)
attach(ad)
## The following objects are masked from ad (position 3):
##
## circ, company, name, rev
slr3=lm(lr~lc,data=ad)
summary(slr3)
##
## Call:
## lm(formula = lr ~ lc, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4142 -0.1262 -0.0087 0.1089 0.4898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1647 0.0442 94.2 <2e-16 ***
## lc 0.7402 0.0346 21.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.184 on 68 degrees of freedom
## Multiple R-squared: 0.871,Adjusted R-squared: 0.869
## F-statistic: 458 on 1 and 68 DF, p-value: <2e-16
#creates Scatterplot
plot(lc,lr,
main='Ad. revenue vs Circulation, Logarithmic scale',
ylab='Ad. revenue, Logarithmic (log)',
xlab='Circulation(log)')
#Creates regression line
abline(b=coef(slr3)[2],a=coef(slr3)[1],col=2)
lm_coef <- round(coef(slr3), 3) # extract coefficients
mtext(bquote(y == .(lm_coef[2])*x + .(lm_coef[1])),
adj=1, padj=0) # display equation
[Figure: scatterplot "Ad. revenue vs Circulation, Logarithmic scale"; x-axis: Circulation(log), y-axis: Ad. revenue, Logarithmic (log); fitted line y = 0.74x + 4.165.]
This seems like a much better fit than before, so I will use this transformation. Finding the 95%
confidence interval for the slope (with df = n - 2 = 68), we get that:
t<-qt(1 - 0.05/2, df = length(lc) - 2)
t
## [1] 1.995
Then, the 95% confidence interval is given by 0.7402 ± 0.069027. That is, I am 95% confident
that the slope lies between 0.671173 and 0.809227 (note that this is the slope for the logarithmic
transformation).
a=log(0.5+1)
b=log(20+1)
newdata<-data.frame(lc=a)
estpoint<-predict(slr3,newdata,interval="predict",level=0.95)
#for the 0.5 million circulation
estpoint
## fit lwr upr
## 1 4.465 4.091 4.838
#in dollars:
exp(estpoint)
## fit lwr upr
## 1 86.9 59.82 126.3
#For the 20 million:
newdata<-data.frame(lc=b)
estpoint<-predict(slr3,newdata,interval="predict",level=0.95)
estpoint
## fit lwr upr
## 1 6.418 6.025 6.812
#In dollars:
exp(estpoint)
## fit lwr upr
## 1 612.9 413.4 908.7
Then we can see that the prediction intervals for 0.5 million and 20 million are, respectively,
(59.82, 126.3) and (413.4, 908.7), with respective fitted values of 86.9 and 612.9.
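The fitted values can be reproduced by hand from the coefficients printed by summary(slr3), which makes the back-transformation explicit. This is a minimal sketch: the coefficients are typed in from the output above (so the result is only accurate to rounding), and the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(slr3) output above
b0 <- 4.1647   # intercept on the log scale
b1 <- 0.7402   # slope for log(circ + 1)

# Circulations of interest, in millions
circ_new <- c(0.5, 20)

# Prediction on the log scale, then back to the revenue scale
log_fit <- b0 + b1 * log(circ_new + 1)
rev_fit <- exp(log_fit)

round(rev_fit, 1)
```

The same exp() step applies to the interval endpoints, because exp is monotone; the back-transformed intervals are therefore not symmetric around the fitted value.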
(c) The model seems good; however, let's take a look at the model diagnostics:
library("MASS")
residuals <- resid(slr3)
fitted <- fitted(slr3)
student <- studres(slr3)
leverages <- hatvalues(slr3)
cooks <- cooks.distance(slr3)
par(mfrow=c(2,2))
hist(residuals)
qqnorm(residuals)
qqline(residuals)
plot(lr,residuals)
plot(fitted,residuals)
[Figure: 2x2 diagnostic panels for slr3: histogram of residuals; normal Q-Q plot of residuals; residuals vs lr; residuals vs fitted.]
par(mfrow=c(3,2))
hist(student)
qqnorm(student)
qqline(student)
plot(lr,student)
plot(fitted,student)
plot(leverages,cooks)
[Figure: diagnostic panels for the studentized residuals of slr3: histogram of student; normal Q-Q plot of student; student vs lr; student vs fitted; cooks vs leverages.]
As we can see, the residuals are not exactly normal (but almost) and there seem to be a couple of outliers. The
fit of the model seems good in the sense that there is no apparent pattern in the residuals; however, they
tend to cluster at the smaller values of X. I consider that the model would improve if the outliers
were removed. In any case, the diagnostics seem to be relatively good, which implies that
this model was a good choice for the regression.
Chapter 3 Exercise 4: Cargo Load Time
NOTE: The plots for this exercise are already made by the book and are attached at the back.
First Model: Linear
(a) From the plots in the book, we are asked to analyze whether the linear model seems like a good fit.
The linear model seems OK, although I would take out the 2 extreme values, because these may cause
some trouble; we have a bunch of clustered points for the small values and 2 big leaps that could be
affecting the regression line. We can also see that the square root of the standardized residuals has
an obvious pattern, which indicates that the model might not be linear, but more like a polynomial
or exponential. If it were not for some of the outliers, this model might give a normal distribution for
the residuals, which is nice.
(b) I do not think the interval will be all right. In fact, I think the interval
would be too long, because we have 2 extreme values that could be increasing the standard error
and therefore the C.I. Moreover, we lack points close to 10,000, which can also be risky. If we eliminate
the possible outliers (or leverage points) we could be in some other kind of trouble: we would be
extrapolating in order to predict a value for 10,000.
Second Model: Non linear
(a) We see that the model is of the form $\log(y) = \beta_1 x^{0.25} + \beta_0 + e$. We can see that when
the transformation Log(time) is applied to time, the model seems to adjust to a linear regression very
nicely. Moreover, from the residuals (both transformed and not) it is clear that there is no evident
pattern. The distribution also bears some resemblance to a normal. Overall, I consider that this model is
a better fit, and that it is better at predicting time than the first model.
(b) I would say that one of the biggest issues (not entirely a flaw, but something to consider) is that
the Y axis is on a logarithmic scale, so the model will not predict time but Log(time), which
should be taken into account before making predictions. I am not thrilled by the fact that the P value for
this model is bigger than the P value for the first one. Just as with the first model, there seems to be
an outlier in the standardized residual plot (there is an extreme value).
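A model of this form can still be used to predict time itself by exponentiating the log-scale prediction. The sketch below assumes synthetic data: the variable names tonnage and load_time, the coefficients, the seed and the new value 5000 are all hypothetical and are not taken from the book's data set.

```r
# Synthetic cargo data following log(load_time) = b0 + b1 * tonnage^0.25 + e
set.seed(4)
tonnage   <- runif(30, 100, 8000)
load_time <- exp(1 + 0.3 * tonnage^0.25 + rnorm(30, sd = 0.2))

# Fit the transformed model; I() protects the power inside the formula
fit <- lm(log(load_time) ~ I(tonnage^0.25))

# Prediction interval on the log scale for a hypothetical new tonnage
new      <- data.frame(tonnage = 5000)
log_pred <- predict(fit, new, interval = "prediction", level = 0.95)

# Back-transform fit and bounds to the original time scale
time_pred <- exp(log_pred)
time_pred
```

Exponentiating keeps the ordering lwr < fit < upr, but the resulting interval is asymmetric, which is worth keeping in mind when reporting it.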