
TIM209 HW1 Solutions

Problem 1

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

Obs   X1   X2   X3   Y       Distance from test point
1      0    3    0   Red     3
2      2    0    0   Red     2
3      0    1    3   Red     3.16
4      0    1    2   Green   2.24
5     -1    0    1   Green   1.41
6      1    1    1   Red     1.73
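The distances above can be checked with a short R sketch; the data frame simply restates the six observations from the table (the name `obs` is illustrative):

```r
# Restate the six observations from part (a)
obs <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Euclidean distance from the test point (0, 0, 0)
obs$dist <- sqrt(obs$X1^2 + obs$X2^2 + obs$X3^2)
round(obs$dist, 2)  # 3.00 2.00 3.16 2.24 1.41 1.73
```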

(b) What is our prediction with K = 1? Why?

Our prediction with K = 1 is Green, because we classify the test point by its single nearest neighbor: the closest observation is (-1, 0, 1) at distance 1.41, and its class is Green.

(c) What is our prediction with K = 3? Why?

Our prediction with K = 3 is Red, because we classify the test point by a majority vote among its three nearest neighbors. These are the observations at distances 1.41 (Green), 1.73 (Red), and 2 (Red), so Red wins the vote 2 to 1.
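The K = 1 and K = 3 answers can be verified with a small majority-vote sketch; the data frame and the `knn_vote` helper are illustrative, restated here so the snippet runs on its own:

```r
# Restate the six observations and their distances from part (a)
obs <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
obs$dist <- sqrt(obs$X1^2 + obs$X2^2 + obs$X3^2)

# Majority vote among the k observations closest to the test point
knn_vote <- function(k) {
  nearest <- obs[order(obs$dist), ][seq_len(k), ]
  names(which.max(table(nearest$Y)))
}
knn_vote(1)  # "Green"
knn_vote(3)  # "Red"
```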

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for
K to be large or small? Why?

As K becomes larger, the decision boundary becomes smoother and less flexible. Therefore, if the Bayes decision boundary is highly nonlinear, we would expect the best value of K to be small, since a small K lets the boundary bend to follow the nonlinear shape.

Problem 3

(vi) #This depends on what question you want to answer with this dataset. If you want to know which
#universities have the highest percentage of faculty with PhDs, we can start to dig into that.

summary(college$PhD)
#The percentage of faculty with PhDs ranges from 8 to 103, with a median of 75. The value of 103%
#is suspicious, since a percentage cannot exceed 100; it looks like a data-entry error (outlier).

#Let us find the number of colleges reporting 103%

subset1 <- college[college$PhD == 103, ]
nrow(subset1)

#There is only one such university, so this is clearly an outlier. We can either correct this value
#to 100% in our final analysis, or drop this record altogether and move on with the rest of the data.
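If we choose the capping route, a minimal sketch of the fix (using a toy vector in place of `college$PhD`, so it runs standalone):

```r
# Toy stand-in for college$PhD, including the suspect 103 value
phd <- c(8, 62, 75, 92, 103)
phd_capped <- pmin(phd, 100)  # cap percentages at 100
summary(phd_capped)
```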

#To find out which university that row belongs to,

row.names(subset1)
