STAT 4540 Homework 1 Solution

1 ISLR 2.4.1
(a) We expect a flexible statistical learning method to perform better. A more flexible approach fits
the data more closely, and with a large sample size it can do so without overfitting, so it would
obtain a better fit than an inflexible approach.
(b) We expect a flexible statistical learning method to perform worse. With many predictors and only
a small number of observations, a flexible method would overfit.
(c) We expect a flexible statistical learning method to perform better. With more degrees of freedom,
a flexible model can capture a highly non-linear relationship that an inflexible model would miss.
(d) We expect a flexible statistical learning method to perform worse. A flexible method would fit
the noise in the error terms, which increases its variance.
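
The reasoning in (a) through (d) can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the exercise: a single predictor, a degree-1 polynomial standing in for the inflexible method, a degree-8 polynomial standing in for the flexible one, and the hypothetical helper `avg_test_mse` together with the two made-up truth functions `wavy` and `straight`.

```r
set.seed(1)

# Average test MSE of a degree-`degree` polynomial regression over `reps` simulated data sets.
# f is the (assumed) true regression function, sigma the standard deviation of the error term.
avg_test_mse <- function(n, degree, f, sigma, reps = 100) {
  mean(replicate(reps, {
    train <- data.frame(x = runif(n, -2, 2))
    train$y <- f(train$x) + rnorm(n, sd = sigma)
    test <- data.frame(x = runif(1000, -2, 2))
    test$y <- f(test$x) + rnorm(1000, sd = sigma)
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  }))
}

wavy     <- function(x) sin(2 * x)  # strongly non-linear truth (assumption)
straight <- function(x) x           # linear truth (assumption)

# Large n with a non-linear truth, in the spirit of (a) and (c)
large_rigid    <- avg_test_mse(2000, 1, wavy, sigma = 0.3)
large_flexible <- avg_test_mse(2000, 8, wavy, sigma = 0.3)

# Small n with noisy data, in the spirit of (b) and (d)
small_rigid    <- avg_test_mse(15, 1, straight, sigma = 1)
small_flexible <- avg_test_mse(15, 8, straight, sigma = 1)

round(c(large_rigid = large_rigid, large_flexible = large_flexible,
        small_rigid = small_rigid, small_flexible = small_flexible), 2)
```

Under these assumptions the flexible fit should have the lower test error in the large-sample setting and the higher test error in the small-sample setting. The small-sample case only loosely mirrors (b), since it uses one predictor rather than many.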

2 ISLR 2.4.4
(a) Classification:

• Response variable: health status (ill/healthy); predictors: age, blood pressure, gender, etc. The
goal is prediction.
• Response variable: outcome of a test (fail/pass); predictors: difficulty of the test, preparation
time, etc. The goal is prediction.
• Response variable: poll result (approve/disapprove); predictors: socioeconomic status, education
level, age, etc. The goal is both inference and prediction.
(b) Regression:
• Response variable: stock market price; predictors: previous prices. The goal is prediction.
• Response variable: income; predictors: age, education level, gender, etc. The goal is both prediction
and inference.
• Response variable: lifetime of a light bulb (in hours); predictors: brand, price, type, etc. The
goal is prediction.
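(c) Cluster analysis:

• Marketing survey: group respondents into customer segments.
• Movie ratings: group movies or viewers with similar rating patterns.
• Symptoms of diseases: group patients with similar symptoms.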

3 ISLR 2.4.7
(a)

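\begin{align*}
d(x_1, x_0) &= \sqrt{3^2} = 3 \\
d(x_2, x_0) &= \sqrt{2^2} = 2 \\
d(x_3, x_0) &= \sqrt{1^2 + 3^2} \approx 3.2 \\
d(x_4, x_0) &= \sqrt{1^2 + 2^2} \approx 2.2 \\
d(x_5, x_0) &= \sqrt{1^2 + 1^2} \approx 1.4 \\
d(x_6, x_0) &= \sqrt{1^2 + 1^2 + 1^2} \approx 1.7.
\end{align*}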

(b) Our prediction is green, since the single nearest neighbor is observation 5, with Y = green.
(c) Our prediction is red, since the three nearest neighbors are observations 5, 6, and 2, with
Y = green, red, and red respectively, so red wins the majority vote.
(d) Small. A small K gives a flexible classifier that can follow a highly non-linear decision boundary,
whereas a large K averages over more points and produces a smoother, more nearly linear boundary.
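
The distances and the two predictions above can be reproduced with a few lines of R. This is a minimal sketch: the coordinates and labels are assumed to be those of the table in ISLR Exercise 2.4.7 (the solution above reports only the resulting distances), the test point is taken to be the origin, and `knn_predict` is a hypothetical helper written for this illustration.

```r
# Training observations, assumed to match the table in ISLR Exercise 2.4.7
X <- rbind(c(0, 3, 0), c(2, 0, 0), c(0, 1, 3),
           c(0, 1, 2), c(-1, 0, 1), c(1, 1, 1))
Y <- c("red", "red", "red", "green", "green", "red")
x0 <- c(0, 0, 0)  # test point

# Euclidean distance from each observation (row of X) to x0
d <- sqrt(rowSums(sweep(X, 2, x0)^2))
round(d, 1)

# Majority vote among the k nearest neighbors (hypothetical helper)
knn_predict <- function(k) {
  nearest <- order(d)[seq_len(k)]
  votes <- table(Y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(1)  # K = 1
knn_predict(3)  # K = 3
```

With these inputs, `round(d, 1)` matches the six distances in (a), and the two calls return the predictions given in (b) and (c).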

4 ISLR 2.4.8
R code and output:
##(a)
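# stringsAsFactors = TRUE keeps Private as a factor (the default became FALSE in R 4.0.0);
# the summary() output and the boxplots below rely on it.
college <- read.csv("College.csv", header = TRUE, stringsAsFactors = TRUE)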

##(b)
rownames(college) = college[,1]
fix(college)

college=college[,-1]
fix(college)

##(c)
#(i)
summary(college)
# Private  Apps           Accept         Enroll        Top10perc      Top25perc      F.Undergra
# No :212  Min.   :   81  Min.   :   72  Min.   :  35  Min.   : 1.00  Min.   :  9.0  Min.   : 1
# Yes:565  1st Qu.:  776  1st Qu.:  604  1st Qu.: 242  1st Qu.:15.00  1st Qu.: 41.0  1st Qu.: 9
#          Median : 1558  Median : 1110  Median : 434  Median :23.00  Median : 54.0  Median : 1707
#          Mean   : 3002  Mean   : 2019  Mean   : 780  Mean   :27.56  Mean   : 55.8  Mean   : 3700
#          3rd Qu.: 3624  3rd Qu.: 2424  3rd Qu.: 902  3rd Qu.:35.00  3rd Qu.: 69.0  3rd Qu.: 4005
#          Max.   :48094  Max.   :26330  Max.   :6392  Max.   :96.00  Max.   :100.0  Max.   :31643
#
# P.Undergrad      Outstate       Room.Board    Books           Personal      PhD
# Min.   :    1.0  Min.   : 2340  Min.   :1780  Min.   :  96.0  Min.   : 250  Min.   :  8.00
# 1st Qu.:   95.0  1st Qu.: 7320  1st Qu.:3597  1st Qu.: 470.0  1st Qu.: 850  1st Qu.: 62.00
# Median :  353.0  Median : 9990  Median :4200  Median : 500.0  Median :1200  Median : 75.00
# Mean   :  855.3  Mean   :10441  Mean   :4358  Mean   : 549.4  Mean   :1341  Mean   : 72.66
# 3rd Qu.:  967.0  3rd Qu.:12925  3rd Qu.:5050  3rd Qu.: 600.0  3rd Qu.:1700  3rd Qu.: 85.00
# Max.   :21836.0  Max.   :21700  Max.   :8124  Max.   :2340.0  Max.   :6800  Max.   :103.00
#
# Terminal       S.F.Ratio      perc.alumni    Expend         Grad.Rate
# Min.   : 24.0  Min.   : 2.50  Min.   : 0.00  Min.   : 3186  Min.   : 10.00
# 1st Qu.: 71.0  1st Qu.:11.50  1st Qu.:13.00  1st Qu.: 6751  1st Qu.: 53.00
# Median : 82.0  Median :13.60  Median :21.00  Median : 8377  Median : 65.00
# Mean   : 79.7  Mean   :14.09  Mean   :22.74  Mean   : 9660  Mean   : 65.46
# 3rd Qu.: 92.0  3rd Qu.:16.50  3rd Qu.:31.00  3rd Qu.:10830  3rd Qu.: 78.00
# Max.   :100.0  Max.   :39.80  Max.   :64.00  Max.   :56233  Max.   :118.00

#(ii)
pairs(college[,1:10])

#(iii)
plot(college$Private, college$Outstate)

#(iv)
Elite = rep("No", nrow(college))
Elite[college$Top10perc>50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)

summary(college$Elite)
# No Yes
# 699 78
plot(college$Elite, college$Outstate)

# (v)
par(mfrow=c(2,2))
hist(college$Apps)
hist(college$perc.alumni, col=2)
hist(college$S.F.Ratio, col=3, breaks=10)
hist(college$Expend, breaks=100)

# (vi)
par(mfrow=c(1,2))
plot(college$Outstate, college$Grad.Rate)
# Higher out-of-state tuition tends to go with a higher graduation rate.
plot(college$Top10perc, college$Grad.Rate)
# Colleges with the largest share of students from the top 10% of their high school
# class don't necessarily have the highest graduation rate.
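
The three-statement construction of `Elite` in part (iv) can also be written in one step. The sketch below uses a hypothetical toy vector `top10` in place of `college$Top10perc`, so it runs without `College.csv`; the values are made up for illustration.

```r
# Hypothetical Top10perc values for five colleges (not taken from College.csv)
top10 <- c(12, 55, 50, 78, 30)

# One-step equivalent of the rep() / as.factor() construction in part (iv):
# "Yes" when more than 50% of new students come from the top 10% of their class
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Fixing `levels = c("No", "Yes")` keeps the factor levels in the same order as in the original code, so `summary()` and the boxplot labels would appear in the same order.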

5
R code:
# Ratings file: tab-separated user id, movie id, rating, and Unix timestamp
dat <- read.table("u.data", sep = "\t")
colnames(dat) <- c("usrid", "movid", "rating", "timestamp")

# Convert the Unix timestamp (seconds since 1970-01-01) to a date-time
dat$time <- as.POSIXct(dat$timestamp, origin="1970-01-01", tz="UTC")

# Keep user id, movie id, rating, and the converted time
rating <- dat[ c(1:3, 5)]

# Movie file: read one line per movie, then split each line on "|" into 24 fields
dat <- scan("u.item", what = rep("character", 24), sep = "\n", encoding = "UTF-8")
movdf <- matrix(NA_character_, length(dat), 24)
for (ii in seq_along(dat)) {
  movdf[ii, ] <- strsplit(dat[ii], split = "\\|")[[1]]
}

colnames(movdf) <- c("movid", "title", "reldate", "vidreldate", "URL",
                     "unknown", "Action", "Adventure", "Animation",
                     "Children", "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                     "FilmNoir", "Horror", "Musical", "Mystery", "Romance",
                     "SciFi","Thriller", "War", "Western")

# Numeric matrix holding the movie id and the 19 genre indicator columns
movie <- matrix(as.numeric(movdf[ , c(1, 6:24)]), nrow = nrow(movdf), ncol = length(c(1, 6:24)))
colnames(movie) <- colnames(movdf)[c(1, 6:24)]
head(movie)

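# Collapse the 18 named genres into four broad groups by counting indicator flags per movie
action <- rowSums(movie[,c("Action", "Adventure", "Fantasy", "Horror", "SciFi", "Thriller")])
children <- rowSums(movie[,c("Animation", "Children")])
comedy <- rowSums(movie[,c("Comedy"), drop=FALSE])
drama <- rowSums(movie[,c("Crime", "Documentary", "Drama", "FilmNoir",
                          "Musical", "Mystery", "Romance", "War", "Western")])

genre <- cbind(action, children, comedy, drama)

# One row of genre counts per rating; this assumes row i of genre belongs to movie id i
genre <- genre[rating$movid, , drop=FALSE]

logit <- function(p) log(p / (1 - p))

# Popularity of a movie: logit of the share of its ratings that are above 3
pop1 <- aggregate(rating$rating > 3, by = list(rating$movid), sum)  # ratings above 3 per movie
pop2 <- aggregate(rating$rating > 0, by = list(rating$movid), sum)  # all ratings per movie
pop <- logit((pop1[ , 2] ) / (pop2[ , 2]))
head(pop, n=5)
# 0.8962438 -0.4502010 -0.4989912 0.3381129 -0.1865860

# Popularity of the rated movie, one value per rating
popular <- pop[rating$movid]

# Design matrix (intercept column, genre counts, popularity) and response
x <- cbind(1, genre, popular)
y <- rating$rating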

head(x, n=5)
#        action children comedy drama    popular
# [1,] 1      0        2      1     0  1.1564319
# [2,] 1      3        0      0     0  1.4160205
# [3,] 1      1        0      0     0 -2.4849066
# [4,] 1      1        0      1     1  0.2231436
# [5,] 1      1        0      0     2  0.4519851

head(y, n=5)
# 3 3 1 2 1
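
The popularity score used above is the logit of the share of a movie's ratings that exceed 3. The sketch below shows that calculation on a hypothetical toy data frame `toy` (three made-up movies, not taken from `u.data`), using `tapply` rather than the two `aggregate` calls; the arithmetic is the same.

```r
logit <- function(p) log(p / (1 - p))

# Hypothetical toy ratings on a 1-5 scale for three movies
toy <- data.frame(
  movid  = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  rating = c(5, 4, 2, 4, 1, 2, 5, 4, 5)
)

# Share of ratings above 3 for each movie, then its logit
share_liked <- tapply(toy$rating > 3, toy$movid, mean)
pop_toy <- as.numeric(logit(share_liked))
pop_toy
```

In this toy data, movie 3 has only ratings above 3, so its share is 1 and its logit is infinite. If that situation can occur in the real ratings, the corresponding entries of `popular` would be infinite as well, which may be worth checking before fitting a model with `x`.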
