Notes on Introductory Statistical Learning
Su Wang
Winter 2016

Contents
1 Fundamentals of Statistical Learning 4
1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Methods and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Special Topic: Bayes Classifier . . . . . . . . . . . . . . . . . . . 6
1.4 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Linear Regression 8
2.1 Simple LR: Univariate . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Multiple LR: Multivariate . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Non-linear Fit / Polynomial Regression . . . . . . . . . . . . . . 15
2.5 Issues in Linear Regression . . . . . . . . . . . . . . . . . . . . . 16
2.6 K-Nearest Neighbor Regression: a Nonparametric Model . . . . . 18
2.7 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Classification 22
3.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . 24
3.3 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Resampling 31
4.1 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Linear Model Selection & Regularization 37


5.1 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 High Dimensional Data . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Preliminaries on Nonlinear Methods 53


6.1 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Step Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Basis Functions: A Generalization . . . . . . . . . . . . . . . . . 55
6.4 Regression Splines . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4.1 Piecewise Polynomials & Basics of Splines . . . . . . . . . 55
6.4.2 Tuning Spline . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4.3 Spline vs. Polynomial Regression . . . . . . . . . . . . . . 57
6.4.4 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . 57
6.5 Local Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.6 Generalized Additive Model . . . . . . . . . . . . . . . . . . . . . 60
6.7 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

7 Tree-Based Models 65
7.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.1 Model of DT . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.2 DT: Pros & Cons . . . . . . . . . . . . . . . . . . . . . . . 67
7.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.5 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

8 Support Vector Machine 74


8.1 Maximal Margin Classifier . . . . . . . . . . . . . . . . . . . . . . 74
8.2 Support Vector Classifiers . . . . . . . . . . . . . . . . . . . . . . 75
8.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 77
8.4 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

9 Unsupervised Learning 82
9.1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . 82
9.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.2 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . 85
9.3 Lab Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

1 Fundamentals of Statistical Learning
1.1 Basic Idea
Statistical Learning essentially handles the following type of task: given a set of independent variables/features X, make a prediction with regard to a (set of) dependent variable(s)/response(s) Y, which is a function f of X plus an error term ε, such that:

Y = f(X) + ε    (1.1)
In Fig 1.1, X is “years of education”, Y is “income”, f is the curve in the right
pane.

Figure 1.1: Statistical Learning Example: Regression

We use SL to make predictions and inferences. Using 1.1 to make predictions, we face reducible error and irreducible error, which appear as the first and second terms in 1.2:

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]²
          = [f(X) − f̂(X)]² + Var(ε)    (1.2)

In an inference problem, we study how Y changes as a function of Xj (e.g.


which predictors are associated with the response).

1.2 Methods and Evaluation


Broadly speaking, there are two SL methods: parametric SL and nonpara-
metric SL:

• Parametric SL: Reduce the problem of estimating f down to one of estimating a set of (prior) parameters (e.g. Y = β₀ + β₁x₁ + β₂x₂).
• Nonparametric SL: Operationally similar to the parametric method, but does not assume a fixed set of prior parameters (e.g. K-nearest neighbor regression, covered in section 2.6). It usually takes a large amount of data to reach accurate results.

Typically, there is a trade-off between prediction accuracy and model inter-


pretability. Across models, a general scenario is depicted in the following graph:

Figure 1.2: Trade-off Between Accuracy and Interpretability

In evaluating a model, we have the following methods:

• Measuring the Quality of Fit


– Mean Squared Error (MSE): MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²

• Trade-off Between Bias and Variance


– Minimizing bias and variance simultaneously:
  E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. In general, more flexible models tend to have high variance and low bias. Typically, as the flexibility of a model increases, its bias initially decreases faster than its variance increases, so its MSE decreases. At some point, the decrease in bias is outpaced by the increase in variance, and the MSE rises again. That point is therefore the best balance point for the trade-off between bias and variance.

In a classification task, we use the following quantity, called the error rate estimate¹, which estimates the average error rate in place of the MSE:

(1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ)    (1.3)

¹ I is an indicator function.

1.3 Special Topic: Bayes Classifier


As an example, we look at a baseline classification method: the Bayes Classifier. A Bayes classifier puts a new piece of data in the class j (i.e. Y = j) whose conditional probability given the predictors (i.e. Pr(Y = j|X = x₀)) is the largest; in the two-class case this means the probability exceeds 50%. In the case where the conditional distribution is unknown, we use the K-Nearest Neighbor (KNN) method², where we select the K data points nearest to the new data point (using N₀ to represent this set) and estimate the conditional probability as the fraction of points in N₀ that belong to class j. To summarize,

• Conditional Probability Known: Pr(Y = j|X = x₀)

• Conditional Probability Unknown: Pr(Y = j|X = x₀) = (1/K) Σ_{i∈N₀} I(yᵢ = j) (a short R sketch follows below)

² The flexibility of KNN increases when K decreases.
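A minimal R sketch of the KNN estimate in the second bullet, using a toy two-class training set (all data and names here are made up for illustration):

# KNN estimate of Pr(Y = j | X = x0): the fraction of the K nearest
# training points that belong to class j (toy data, hypothetical names)
train.x = matrix(c(1,2, 2,1, 3,3, 5,4, 6,5), ncol=2, byrow=TRUE)
train.y = c("A", "A", "A", "B", "B")
x0 = c(4, 4)
K = 3

dists = sqrt(rowSums((train.x - matrix(x0, nrow(train.x), 2, byrow=TRUE))^2))
N0 = order(dists)[1:K]     # indices of the K nearest neighbors
table(train.y[N0]) / K     # estimated conditional probabilities per class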

The following is a graphical example of a Bayes classifier task (Fig 1.3), where the dashed curve is called the Bayes Decision Boundary, along which the conditional probability of a data point being in either class is 50%.

Figure 1.3: Bayes Classifier

1.4 Lab Code


For more code: http://www-bcf.usc.edu/~gareth/ISL/Chapter%202%20Lab.txt.

# matrix
matrix(data=c(1,2,3,4), nrow=2, ncol=2)
matrix(c(1,2,3,4), 2, 2, byrow=TRUE)

# generate the same sample sets
set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"
set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"

# generate sequences
seq(-1,1,by=.1)       # from -1 to 1 in steps of .1
seq(-1,1,length=10)   # 10 evenly spaced points from -1 to 1

# contour
x = seq(-1,1,by=.1)
y = x
f = outer(x,y,function(x,y)cos(y)/(1+x^2))
contour(x,y,f)
image(x,y,f)                   # heatmap version of contour
persp(x,y,f,theta=30,phi=20)   # theta, phi configure horizontal
                               # and vertical view angles

# load data
Auto = read.table("Auto.data", header=T, na.strings="?")
# header: data has a header row
# na.strings: empty cell indicator
fix(Auto)   # view the data in a spreadsheet window

2 Linear Regression
2.1 Simple LR: Univariate
Linear Regression is an approach for predicting a quantitative response Y on
the basis of a single predictor variable X. It assumes that there is approximately
a linear relationship between X and Y . Mathematically, the relationship is
represented as follows:

Y ≈ β₀ + β₁X    (2.1)

The corresponding fitted model is:

Ŷ = β̂₀ + β̂₁X    (2.2)


LR aims to best model the relationship between an independent and a depen-
dent variable such that the estimates of the dependent variable it produces (i.e.
Ŷ ) are as close as possible to the actual values of it. In other words, the dis-
tance/error between the estimate and the actual should be minimized. This
distance/error is commonly measured by residual sum of squares (RSS),
which is defined as follows for a data set (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ):

RSS = e₁² + e₂² + ... + eₙ²
    = (y₁ − β̂₀ − β̂₁x₁)² + ... + (yₙ − β̂₀ − β̂₁xₙ)²    (2.3)

With some calculus, it is readily proved that:

β̂₀ = ȳ − β̂₁x̄

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²    (2.4)
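A minimal R sketch of equation 2.4 on simulated data, checked against lm() (the data and coefficient values below are made up for illustration):

# least-squares coefficients by hand (eq. 2.4), vs. lm()
set.seed(1)
x = rnorm(100)
y = 2 + 3*x + rnorm(100)   # simulated data with beta0 = 2, beta1 = 3

beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat = mean(y) - beta1.hat * mean(x)

c(beta0.hat, beta1.hat)
coef(lm(y ~ x))            # should match the manual estimates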

So far we have been assuming the access to all data in a population. In prac-
tice, however, usually we train our model using samples from a population. The
regression lines computed from the samples obviously do not necessarily equal the regression line computed from the entire population. In the right pane of the following graph, the population regression line is the blue solid line, and the rest of the regression lines are generated from 10 samples from the population.

Figure 2.1: Regression Lines

For a particular sample S and a population P from which S is drawn, the sample mean µ̂ is most likely different from the population mean µ. However, this difference is not systematic, in the sense that, in the long run (with more samples drawn), the samples overestimating the population mean and those underestimating it will "even out". The estimate µ̂ of µ, therefore, is unbiased, and the average of a set of sample means will approach µ as the cardinality of the set approaches infinity.

Not all sets of sample means are the same. Some may vary greatly, some may vary to a lesser extent. In estimating µ, it is apparently better for the variance to be as low as possible. This variance, whose square root is called the Standard Error (SE), in fact serves as an important indicator of the quality of the samples, and is computed as follows, where σ is the standard deviation of each of the realizations yᵢ of Y, and n is the cardinality of the sample set:

Var(µ̂) = SE(µ̂)² = σ²/n    (2.5)
Roughly speaking, the SE tells us the average amount that an estimate µ̂ differs
from the actual value of µ.

Besides the SE of Y, we may also compute the SEs for the intercept β̂₀ and the slope β̂₁, where σ² = Var(ε):

SE(β̂₀)² = σ²[1/n + x̄²/Σᵢ₌₁ⁿ(xᵢ − x̄)²]
SE(β̂₁)² = σ²/Σᵢ₌₁ⁿ(xᵢ − x̄)²    (2.6)

Finally, the SE can also be computed for the residual:

RSE = √(RSS/(n − 2))    (2.7)
Using the SE for a parameter (e.g. β1 ), we may compute a Confidence In-
terval such that the true value of the parameter has a chosen probability to
fall within the interval. The following exemplifies an approximation of the 95%
confidence interval for the parameter β1 . The same computation applies to all
parameters.

[β̂₁ − 2·SE(β̂₁), β̂₁ + 2·SE(β̂₁)]    (2.8)
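A sketch of equations 2.6–2.8 on simulated data, compared against confint(); everything here is illustrative rather than part of the original notes:

# SE(beta1.hat) and the approximate 95% interval (eqs. 2.6-2.8), vs. confint()
set.seed(1)
x = rnorm(100)
y = 2 + 3*x + rnorm(100)
fit = lm(y ~ x)

rss = sum(residuals(fit)^2)
sigma2.hat = rss / (length(x) - 2)                    # estimate of Var(eps)
se.beta1 = sqrt(sigma2.hat / sum((x - mean(x))^2))    # eq. 2.6
beta1.hat = coef(fit)["x"]
c(beta1.hat - 2*se.beta1, beta1.hat + 2*se.beta1)     # eq. 2.8
confint(fit)["x", ]                                   # close to the manual interval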


SE can also be used to perform Hypothesis Tests, which takes the following
form:

H₀: β₁ = 0 (i.e. there is no relationship between X and Y)
H₁: β₁ ≠ 0 (i.e. there is some relationship between X and Y)    (2.9)

The hypotheses are evaluated using a t-test, which measures the number of standard deviations that β̂₁ is away from 0:

t = (β̂₁ − 0)/SE(β̂₁)    (2.10)
The fraction of the corresponding t-distribution lying beyond the observed t value is the probability of observing an estimate at least as extreme as β̂₁ in the absence of a real association between the predictor and the response. This probability is called the p-value.

Assuming that we have rejected the null hypothesis and believe there is a rela-
tion between an independent variable and a dependent variable, the next thing
that comes to focus is to evaluate the extent to which our model fits the data.
The quality of fit is commonly evaluated using the following indicators:
• Lack of Fit (LOF)
  – LOF = RSE = √(RSS/(n − 2)) = √((1/(n − 2)) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²)

• R² Statistic
  – R² = (TSS − RSS)/TSS = 1 − RSS/TSS, where TSS = Σᵢ (yᵢ − ȳ)²

In terms of interpretation, the RSE is the average deviation of the estimated


value of a model for a variable from the true value of the variable. R2 Statistic
is the proportion of variance explained by the model. Besides LOF and R2
Statistic, we may also evaluate the predictive power of a model (i.e. the related
independent variable) with F -Statistic or simply the correlation between the
independent and the dependent variables. This will not be elaborated here.

2.2 Multiple LR: Multivariate
If we have multiple independent variables in hand, fitting a regression line using
the simple LR for each of the variables is not entirely satisfactory. For one thing,
it is not clear how to make a single prediction of the dependent variable given
levels of the p independent variables, since each is associated with a separate
regression line. Further, by fitting separately for each variable, we are implicitly
assuming that the effects of the independent variables on the dependent variable
are additive, which is usually not the case, as there may be correlations among
the independent variables, which causes their effects to overlap. In light of the
discussion, we now extend our models in (2.1) and (2.2) to be:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + ... + β̂ₚXₚ    (2.11)

Correspondingly, the RSS is computed as follows:

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
    = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ₁ − ... − β̂ₚxᵢₚ)²    (2.12)

The regression line, in the multivariate context, generalizes to a p-dimensional hyperplane, where p is the number of independent variables. The following graph illustrates a linear regression model with 2 independent variables, where the regression line becomes a plane (i.e. a 2D hyperplane).

Figure 2.2: Regression with 2 Variables

As a special note, oftentimes an independent variable will be shown to be significant in effecting changes to the dependent variable when it is the only predictor, and at the same time be shown to be insignificant when placed together with other predictors. In such cases, the independent variable is in fact insignificant, and its "effect" as the single predictor comes from its correlation with another predictor. This can be confirmed using a variable-variable correlation matrix like the one in the following:

Figure 2.3: Correlation Matrix: TV, radio and newspaper as indep. var.; sales as dep. var.

In the related example, newspaper is shown to have a significant effect on sales


as the only predictor, although it is insignificant when placed along with TV
and radio. Observing the table, we learn that its “predictive power” in the
univariate regression actually comes from its correlation with radio, which is
significant in both univariate and multivariate model.

How do we know if an independent variable and a dependent variable are re-


lated? In univariate LR, as has been shown, we perform a t-test with the null hypothesis β₁ = 0. Generalizing this to multivariate LR, our hypothesis becomes β₁ = β₂ = ... = βₚ = 0. The hypothesis is now tested with an F-test, which is defined as follows:

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]    (2.13)
In the equation, the denominator indicates the average of the inherent variance
in the data set (size = n), the numerator measures the average amount of
variance (noninherent variance) explained by the predictors. If the predictors
have 0 influence on the response (i.e. β1 = β2 = ... = βp = 0), the two values
are expected to be equal:

E{RSS/(n − p − 1)} = E{(TSS − RSS)/p} = Var(ε) = σ²    (2.14)

Therefore, the greater the F -statistic is, the more significant the predictors are,
although this does not necessarily mean that each predictor is individually sig-
nificant.
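A sketch of equation 2.13 on a simulated multivariate example, checked against the F-statistic reported by summary() (the data are simulated for illustration):

# F-statistic by hand (eq. 2.13), vs. summary(lm(...))
set.seed(2)
n = 100; p = 3
x1 = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
y = 1 + 2*x1 - x2 + rnorm(n)      # x3 is irrelevant by construction

fit = lm(y ~ x1 + x2 + x3)
rss = sum(residuals(fit)^2)
tss = sum((y - mean(y))^2)
((tss - rss)/p) / (rss/(n - p - 1))
summary(fit)$fstatistic[1]        # should match the value above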

Finally, if we are only interested in a subset of the p predictors, say, the last q
predictors (i.e. H0 : βp−q+1 = ... = βp = 0), the F -statistic is now computed
with:

F = [(RSS₀ − RSS)/q] / [RSS/(n − p − 1)]    (2.15)
RSS0 here is the RSS in a model that includes all the predictors except for
the last q. The difference RSS0 − RSS, therefore, is the portion of variance
accounted for by the last q variables, which, when divided by q, indicates the
average amount of variance explained by the last q predictors3 .

Note that a significant F-statistic only tells us that at least one of the p predictors involved is significant, not which of the p predictors are significant. Testing the predictors individually using t-tests is not a viable approach to find the effective predictors, because some predictors are often found significant in a t-test by chance even when the F-statistic is not significant. For instance, at a 5% significance level, about 5% of the predictors are expected to appear significant in t-tests purely by chance. Tightening the significance level may remedy the problem to some extent, but it fails when p is large.

The most common approaches are the following:

• Forward Selection:
– Start with a null-model (0 predictors) to set up a baseline for RSS.
– Add 1 predictor to the model, and try the p predictors one-by-one.
Select the predictor that produces the highest RSS reduction.
– Continue running the previous step until a stopping rule is satisfied
(e.g. RSS threshold).
• Backward Selection:

– Start with the all-predictor model.


– Remove 1 predictor with the largest p-value from the model.
– Continue running the previous step until a stopping rule is satisfied.
• Mixed Selection:

– Run forward & backward selection interchangeably, until a stopping


rule is satisfied.
³ Essentially, RSS₀ − RSS indicates how much more of the variation we are able to explain by including the last q predictors.

Finally, note that while forward selection applies when p > n, backward selection does not. As forward selection suffers from the problem of including redundant predictors, mixed selection is most recommended in practice.

As far as model evaluation goes, the same statistics are used in the multivariate case:

• RSE
  – RSE = √(RSS/(n − p − 1))

• R²
  – R² = 1 − RSS/TSS

To use the statistics, we set up a threshold for model improvement to decide


whether the cost of adding (a) new predictor(s)4 is “worth” the corresponding
reduction in RSE (or increase in R2 ).

2.3 Interaction
So far, we have been assuming that the relationship between the predictors and the response is additive. Specifically, the effect of a predictor X₁ on the response Y is constant, such that a one-unit change in X₁ corresponds to a constant rate of change in Y, regardless of changes in another predictor X₂. An additive linear model as such can be formulated as follows in mathematical terms:

Y = β₀ + β₁X₁ + β₂X₂ + ε    (2.16)
In the equation, it is clear that any changes in the impact of X1 on Y is subject
only to the changes in the coefficient β1 , independent of changes associated with
X2 . The additive assumption, however, is most likely false. For instance, take
X1 and X2 to be two predictors TV and radio advertising expenditure, and Y
to be the response sales. If investing in radio advertising positively affects the
investment efficacy in TV on sales, i.e. two predictors are correlated, then the
additive assumption will be false. To reformulate 2.16 to include this correlation
between the two predictors, we may put down the following:

Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 +  (2.17)
How, then, is the interaction (i.e. the fact that changes in X₂ affect the impact of X₁ on Y, and vice versa) included in 2.17? This will be made clearer in 2.18 with some algebraic manipulation:

Y = β₀ + (β₁ + β₃X₂)X₁ + β₂X₂ + ε
  = β₀ + β̃₁X₁ + β₂X₂ + ε    (2.18)
⁴ The larger the number of predictors included, the more constraints we put on a model, and the more difficult it is for us to come up with an ideal fit.

In 2.18, β̃₁ = β₁ + β₃X₂. Now it is obvious that the coefficient of X₁ changes with X₂. Specifically, β₃ indicates the increase in the effectiveness of X₁ on Y for a one-unit increase in X₂. To evaluate the "worth" of including the interaction, we may compute (R²_with interaction − R²_without interaction) / (100 − R²_without interaction), with R² measured in percent, which produces the percentage of the residual variance "left behind" by the additive model that is accounted for by the inclusion of the interaction term.

Note that we should always include the main effects (i.e. effect from the additive
model) if the interaction is included, even if the main effects are not significant.
This is referred to as the hierarchical principle. The rationale for the principle
is i) X₁ × X₂ is related to the response, whether or not X₁ or X₂ have zero coefficients; ii) leaving out X₁ and X₂ changes the meaning of the interaction, since they are typically correlated with X₁ × X₂.

2.4 Non-linear Fit / Polynomial Regression


Just briefly, we discuss the cases where the response cannot be effectively pre-
dicted with a regression line/hyperplane (i.e. linear regression).

Figure 2.4: Non-linear Fit (Polynomial Regression)

Observing Fig 2.4, we see that horsepower does not effectively predict mpg in a
linear fit. If we were to include a power-2 term of horsepower, however, the pre-
diction becomes much more accurate. However, we should also be cautious in us-
ing a nonlinear fit to avoid overfitting. For instance, adding a power-5 horsepower term may lead to an even better fit than a power-2 horsepower term. The regression model fitted using the current data set, however, may perform poorly in generalization to new data sets, as it is overly "customized" to one training data set.

Importantly, note that this "tweaking" of our original model still gives us a linear model, because the model is still linear in its coefficients!

2.5 Issues in Linear Regression


Several underlying issues may arise in working with linear regression models.
These are listed in the following:

• Nonlinearity of the predictor-response relationships.


• Correlation of error terms.
• Non-constant variance of error terms.

• Outliers
• High-leverage points.
• Collinearity

We now tackle the issues one by one. In Fig 2.5, the left pane is a plot of the residuals against the predictions of a one-predictor linear model. It is immediately noticed that the data points systematically deviate from the fit to form a convex curve. This means that the relationship between the predictor and the
response is nonlinear. Based on our observation, we adjust the model by in-
cluding a power-2 predictor term. With this addition, the data points are now
shown (right pane) to be randomly dispersed around the fitting, which indicates
that the error in the model is random, and we therefore know the polynomial
regression gives a better fitting.

For every estimate ŷi of a response value y, correspondingly we have an error


margin which is the residual. In some cases (e.g. time series data), the residuals
tend to be correlated. In the case of a time series, the residuals present some
pattern along the dimension of time (e.g. temporally close data points tend to
have similar measurements). Whenever the residuals are suspected to be corre-
lated somehow, we may prove the correlation by plotting them against certain
dimensions and overcome the problem by collecting data more carefully.

Figure 2.5: Nonlinearity

The variance of the error term ε may sometimes change with the level of the response (non-constant variance, or heteroscedasticity). In the left pane of Fig 2.6, it is readily observed that the spread of the error terms increases with the response. To eliminate this pattern in the error terms, we may transform the response by some concave function. In the right pane of Fig 2.6, a log(Y) conversion is deployed, and the error terms now scatter randomly around the fitting curve, as desired.

Figure 2.6: Non-constant Variance of Error

Outliers may distort the fitting curve, when their number is relatively large,
which indicates the poor quality of the data. Nevertheless, in most cases the
number of outliers is small, and we may detect them by plotting the fitted val-
ues against the residuals. As it is difficult to set a magnitude threshold by which to decide whether to treat a data point as an outlier, in practice we often plot with studentized residuals instead and eliminate data points as outliers by a threshold defined in terms of standard deviations. Fig 2.7 gives an example of outlier detection.

Figure 2.7: Outlier Detection

Similar to outliers, data points with high leverage also distort the fitting curve, although the distortion can be much more severe in this case. A data point is said to have high leverage when it has an unusually large predictor value x. The detection of high leverage points can be done using the leverage statistic, although plotting X against the studentized residuals also applies.

hᵢ = 1/n + (xᵢ − x̄)² / Σᵢ′₌₁ⁿ (xᵢ′ − x̄)²    (2.19)

A large leverage statistic indicates the presence of a high leverage point. In multivariate settings, the average leverage is (p + 1)/n, and this is used as the reference threshold for high leverage points.
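A small sketch of the leverage diagnostic (eq. 2.19 and the (p + 1)/n reference level) on simulated data; the variable names are illustrative, and the factor of two in the cutoff below is an arbitrary illustrative choice:

# leverage check via hatvalues()
set.seed(2)
d = data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y = 1 + d$x1 - d$x2 + rnorm(100)
fit = lm(y ~ x1 + x2, data=d)

h = hatvalues(fit)            # leverage statistics h_i
p = length(coef(fit)) - 1     # number of predictors
n = length(h)
mean(h)                       # equals (p + 1)/n for a least-squares fit
which(h > 2 * (p + 1)/n)      # illustrative cutoff: twice the average leverage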

Finally, collinearity refers to the situation where two or more of the predictors
correlate. Collinearity undermines the power of hypothesis testing, because the
significance of a predictor masks the impact of predictors that are correlated
with it. To detect collinearity, the VIF statistic is used, which is defined as follows:

VIF(β̂ⱼ) = 1 / (1 − R²_{Xⱼ|X₋ⱼ})    (2.20)

R²_{Xⱼ|X₋ⱼ} is the R² from a regression of Xⱼ onto all the other predictors. The value of VIF falls between 1 and infinity, where 1 indicates the complete absence of collinearity. When collinearity is detected, we either retain only one of the set of correlated predictors, or we may combine them into one single predictor.
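A sketch of equation 2.20 computed directly and compared with car::vif(), on simulated predictors where x2 is deliberately made collinear with x1 (names and data are illustrative):

# VIF by hand (eq. 2.20), vs. car::vif()
library(car)
set.seed(3)
x1 = rnorm(100)
x2 = 0.9*x1 + 0.1*rnorm(100)    # x2 strongly correlated with x1
x3 = rnorm(100)
y = 1 + x1 + x3 + rnorm(100)
fit = lm(y ~ x1 + x2 + x3)

r2 = summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2)                    # VIF for x1, computed directly
vif(fit)                        # VIFs for all predictors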

2.6 K-Nearest Neighbor Regression: a Nonparametric Model


Linear regression by definition assumes linear relationships between the predic-
tors and the response, which often leads to poor fit to the data, and therefore
poor prediction. K-Nearest Neighbor Regression (KNN Regression)

presents a nonparametric alternative which provides a more flexible fit.

The procedure of KNN regression roughly goes as follows: Given a value for K
and a prediction point x0 , KNN first identifies the K training observations that
are closest to x0 , represented by N0 . The following equation is then used to
compute the average of all the training responses in N0 as an estimate of the
response corresponding to x0 :
f̂(x₀) = (1/K) Σ_{xᵢ∈N₀} yᵢ    (2.21)

As is discussed in chapter 1, the larger K is, the less flexible the model will be.
In general, the optimal value for K will depend on the bias-variance trade-off.
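A minimal sketch of equation 2.21 for a single prediction point, written directly rather than with a dedicated package (the data are simulated for illustration):

# KNN regression estimate at x0 (eq. 2.21)
set.seed(3)
x = runif(50)
y = sin(2*pi*x) + rnorm(50, sd=0.2)   # simulated training data
x0 = 0.5
K = 5

N0 = order(abs(x - x0))[1:K]   # indices of the K nearest training observations
mean(y[N0])                    # f.hat(x0): average response over N0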

Comparing nonparametric and parametric methods, a general rule of thumb is that the parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f. With increasing nonlinearity in the data, KNN tends to give better performance. However, note that when the number of predictors increases, it grows increasingly difficult for KNN to find reasonably near neighbors, which thus undermines its performance. This is demonstrated in Fig 2.8, where the dashed line indicates the mean squared error of linear regression (a parametric model), and the green dotted line represents that of KNN (a nonparametric model). It is clear that roughly when p > 3 (the number of predictors), KNN is outperformed by the parametric model.

Figure 2.8: Comparing Parametric and Nonparametric Models by p

In summary, in deciding between a parametric and a nonparametric approach,


take into consideration i) the linearity between the predictors and the response;
and ii) the number of observations per predictor.

2.7 Lab Code

# facilities
library(MASS)
library(ISLR)
library(car)

# linear regression (univariate)
lm.fit = lm(y ~ x, data='<data>')
confint(lm.fit)   # confidence intervals of coefs.

# prediction with confidence intervals
predict(lm.fit, data.frame(x=c(...)),
        interval='confidence')   # confidence interval for the mean response
predict(lm.fit, data.frame(x=c(...)),
        interval='prediction')   # prediction interval for y

# show regression line
plot(x, y)
abline(lm.fit)

# plot linear regression diagnostics
par(mfrow=c(2,2))
plot(lm.fit)
# alternatively ...
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))

# leverage statistic
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))   # find the most deviant observation

# linear regression (multivariate)
lm.fit = lm(y ~ x1 + x2, data='<data>')
# or, to use all predictors
lm.fit = lm(y ~ ., data='<data>')
# or, to use all but one predictor
lm.fit = lm(y ~ . - x, data='<data>')
# update the all-predictor model
lm.fit1 = update(lm.fit, ~ . - x)

# collinearity evaluation
vif(lm.fit)

# linear regression (interaction)
lm.fit = lm(y ~ x1 * x2, data='<data>')

# polynomial regression
lm.fit = lm(y ~ x1 + I(x1^2), data='<data>')   # power-2
# alternatively ...
lm.fit = lm(y ~ poly(x, 2), data='<data>')

# fit with transformed predictor
lm.fit = lm(y ~ log(x), data='<data>')

# model comparison
# H0: the models fit equally well
# H1: the model with more predictors is significantly superior
lm.fit1 = lm(y ~ x1, data='<data>')
lm.fit2 = lm(y ~ x1 + I(x1^2), data='<data>')
anova(lm.fit1, lm.fit2)
3 Classification
A simple example for the classification task is the following: Given medical
symptoms x1 , x2 , and x3 , asking if it is reasonable (or highly probable) to diag-
nose a patient as diabetic. The most commonly used methods in classification
are listed in the following:

• Logistic Regression
• Linear Discriminant Analysis
• K-Nearest Neighbors

There are very good reasons why we do not simply apply linear regression to classification. For one thing, with a categorical variable taking 3 or more values, linear regression falsely assumes an equal-distance ordering among the coded values. For instance, if we coded three medical symptoms as integer values 1, 2, and 3, the order in which the symptoms are labeled may affect the outcome of the linear regression model, which is intuitively unreasonable. Even if we were to
convert the three values into three binary predictors (i.e. whether a patient
has a symptom), the linear regression will still suffer from the interpretation
problem: When the fitted value comes beyond the [0,1] interval, it is difficult to
interpret the result as a probability (e.g. how likely is a patient diabetic).

For this reason, dedicated models for the classification task are necessary. The logistic regression model, for instance, handles the out-of-[0,1]-bounds issue gracefully, as is illustrated in Fig 3.1⁵:

Figure 3.1: Left: Linear Regression; Right: Logistic Regression

⁵ The task is predicting whether an individual will default based on his/her bank balance.

3.1 Logistic Regression
The simplest form of a logistic regression model (univariate) is formulated as
follows, where the fitted value for the response is interpreted as a probability:

Y = p(X) = e^(β₀+β₁X) / (1 + e^(β₀+β₁X))    (3.1)
As β0 + β1 X can be greater than 1 or less than 0, we use a sigmoid function6
to map it into the [0,1] interval.

To further convince you that 3.1 is a reasonable formulation for the probability
interpretation, note that the equation can be algebraically manipulated to the
3.2:

e^(β₀+β₁X) = p(X) / (1 − p(X))    (3.2)
The right hand side of the equation is the odds at which an event takes place. Observing 3.1, we know that the probability prediction p(X) increases as e^(β₀+β₁X) increases. The odds in 3.2 also positively covary with e^(β₀+β₁X), which implies a positive correlation between the intuitive odds and the logistic model.

We can also show that 3.1 is indeed a linear model by 3.3⁷. Specifically, the predictor is linearly related to the response in that changes in the predictor correspond to changes in the log odds of the response.

log(p(X)/(1 − p(X))) = β₀ + β₁X    (3.3)
In logistic regression, we seek coefficient estimates under which the model assigns high probability to the observed outcomes. Mathematically, we search for the predictor coefficients (and the intercept) that maximize the value of the following likelihood function:

L(β₀, β₁) = ∏_{i: yᵢ=1} p(xᵢ) · ∏_{i′: yᵢ′=0} (1 − p(xᵢ′))    (3.4)

The output of logistic regression is similar to that of linear regression, as demonstrated in Fig 3.2, which tells us that for each unit of increase in the predictor balance, the log odds of an individual defaulting increases by 0.0055 units. The
z-statistic plays the same role as the t-statistic in linear regression. For instance,
the z-statistic associated with the coefficient β1 is evaluated as β̂1 /SE(β̂1 ). The
null hypothesis involved is H0 : β1 = 0, i.e. the balance as a predictor does not
influence whether an individual would default.
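As a quick illustration of how a fitted model like the one in Fig 3.2 is used, the sketch below plugs the balance coefficient 0.0055 quoted above, together with a hypothetical intercept (the intercept value is made up for illustration), into equation 3.1:

# predicted default probability from eq. 3.1
beta0 = -10.65       # hypothetical intercept, for illustration only
beta1 = 0.0055       # balance coefficient quoted in the text
balance = 2000
exp(beta0 + beta1*balance) / (1 + exp(beta0 + beta1*balance))
# probability that an individual with this balance defaults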

⁶ It is also variously referred to as the inverse logit function, or the logistic function.
⁷ log(p(X)/(1 − p(X))) is called the logit.
Figure 3.2: Example: Logistic Regression Output

Generalizing 3.1 to the multivariate case, we have the model in 3.5. The other aspects of the model are the same as in the univariate case.

Y = p(X) = e^(β₀+β₁X₁+...+βₚXₚ) / (1 + e^(β₀+β₁X₁+...+βₚXₚ))    (3.5)
For a multi-class task, we may convert each class into a binary response and handle each individually. However, this is rarely done in practice. The reason is simple: the model coming up next provides a better alternative!

3.2 Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is superior to logistic regression in multi-class tasks, and it also edges out logistic regression i) in stability when the classes are well-separated; and ii) when the training size n is small and the distribution of the predictors X is approximately normal in each of the classes.

In LDA, as in logistic regression, we seek to compute the probability of an


observation being classified in class j, where j = 1, ..., k, k > 2, i.e. response
Y = j, given the predictor X = x. That is, P r(Y = j|X = x). Applying Bayes
Theorem, we have the following model:

Pr(Y = j|X = x) = Pr(X = x|Y = j)·Pr(Y = j) / Σᵢ₌₁ᵏ Pr(X = x|Y = i)·Pr(Y = i)    (3.6)

In 3.6, P r(X = x|Y = j) is the likelihood of X being equal to the value x


given that it is from class Y = j. Pr(Y = j) is the prior probability that an observation falls into class j⁸. Clearly, it is easy to find the prior probabil-
ity: We simply compute the fraction of the observations that fall into class j in
the training. Estimating the likelihood is more involved, and it depends on the
number of predictors.

We first consider the univariate case, where p = 1. Assuming no prior knowledge


as to how data are distributed in each of the classes, the best bet for us is to
take them to be all normally distributed with the same dispersion/variability,
which is reflected in equal standard deviation:
Pr(X = x|Y = j) = (1/(√(2π)σⱼ)) exp(−(x − µⱼ)²/(2σⱼ²))    (3.7)

⁸ The prior probability is usually set based on some prior knowledge of a probability distribution. For instance, knowing that a coin is fair, we would set Pr(Y = Head) = .5.
The equal-variability assumption states that σ1 2 = ... = σk 2 . The task of our
Bayes classifier, then, is to find the class j for which P r(Y = j|X = x) is largest.
Knowing that the denominator/normalizer in 3.6 is the same for all classes, and
under the assumption of equal-variability, we now put down the probability we
are trying to maximize as follows:
Pr(Y = j|X = x) ∝ Pr(Y = j) · exp(−(x − µⱼ)²/(2σⱼ²))    (3.8)
To further simplify the maximization task, we first take log of 3.8, and then
drop all the terms that are not related to µs, because in comparing a P r(Y =
i|X = x) and a P r(Y = j|X = x), the x value is the same for two probability
evaluations, and σs are assumed to be equal: x and σ are taken as constants.
If we also assume the same prior probabilities across classes, the task is even
further simplified⁹. The computation proceeds as follows, where δⱼ(x) is called
the Discriminant Function:

δⱼ(x) = log(Pr(Y = j|X = x))
      = log(Pr(Y = j)) + log(exp(−(x − µⱼ)²/(2σⱼ²)))
      = x·µⱼ/σ² − µⱼ²/(2σ²)    (3.9)
Now, in deciding whether an observation should go to class i or class j, we
put it in i when δi (x) > δj (x), and in j otherwise. If we further simplify the
inequality, it is readily discovered that the decision boundary corresponds to
the point where

x = (µᵢ² − µⱼ²)/(2(µᵢ − µⱼ)) = (µᵢ + µⱼ)/2    (3.10)
In concise terms, an LDA classifier estimates from a data set i) the prior probabilities of the classes (i.e. π̂ⱼ = Pr(Y = j)¹⁰); ii) the mean of the xs in each class (i.e. µ̂ⱼ); and iii) a weighted average of the variance across classes and observations (i.e. σ̂²). These are defined as follows:

⁹ Mind that the equal-prior assumption is put down here only to simplify the algebra for a clearer intuition. It is rarely true in practice.
¹⁰ This is for convenience of reference.
µ̂ⱼ = (1/nⱼ) Σ_{i: yᵢ=j} xᵢ    (3.11)

σ̂² = (1/(n − K)) Σⱼ₌₁ᴷ Σ_{i: yᵢ=j} (xᵢ − µ̂ⱼ)²    (3.12)

π̂ⱼ = nⱼ/n, where nⱼ is the number of class-j observations.    (3.13)

The general form of the discriminant function, therefore, is

δⱼ(x) = log(π̂ⱼ) + x·µ̂ⱼ/σ̂² − µ̂ⱼ²/(2σ̂²)    (3.14)
The classifier finally computes the decision boundary using 3.10. The results are illustrated in Fig 3.3, where the black dashed line is the Bayes decision boundary, and the solid line is the LDA decision boundary. The two only differ in that the LDA decision boundary includes the log(π̂ⱼ) term.

Figure 3.3: Univariate LDA
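A sketch of the univariate LDA recipe in equations 3.11–3.14, estimating each piece from a toy two-class sample and evaluating the discriminant functions at a new point (the data are simulated for illustration):

# univariate LDA by hand (eqs. 3.11-3.14) on a toy two-class sample
set.seed(4)
x = c(rnorm(30, mean=-1), rnorm(20, mean=1))
y = c(rep(1, 30), rep(2, 20))
n = length(x); K = 2

pi.hat = table(y) / n                           # eq. 3.13
mu.hat = tapply(x, y, mean)                     # eq. 3.11
sigma2.hat = sum((x - mu.hat[y])^2) / (n - K)   # eq. 3.12

delta = function(x0, j)                         # eq. 3.14
  log(pi.hat[j]) + x0 * mu.hat[j] / sigma2.hat - mu.hat[j]^2 / (2 * sigma2.hat)

x0 = 0.2
which.max(c(delta(x0, 1), delta(x0, 2)))        # predicted class for x0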

Generalizing to the multivariate case, where there is more than one predictor (i.e. p > 1), we assume that the predictors are drawn from a multivariate normal distribution, which is defined as follows:

f(x) = (1/((2π)^(p/2)|Σ|^(1/2))) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))    (3.15)
Note that the x and µ here are both p-dimensional vectors, and Σ is a p × p co-
variance matrix. In a multivariate normal distribution, each individual variable
is normally distributed, and their correlations are captured by the covariance
matrix.

With all aspects virtually the same to the univariate case, the discriminant
function of the multivariate LDA is defined as follows:

δ̂ⱼ(x) = log(π̂ⱼ) + xᵀΣ⁻¹µ̂ⱼ − (1/2)µ̂ⱼᵀΣ⁻¹µ̂ⱼ    (3.16)
Fig 3.4 is an example of a multivariate data set with p = 2. The decision
boundaries are computed for each pair of classes.

Figure 3.4: Multivariate LDA (p = 2)

Dropping the equal-variability assumption, each class will have a separate variability measure (i.e. σⱼ in the univariate case, Σⱼ in the multivariate case). The discriminant function, therefore, becomes the following:

δ̂ⱼ(x) = −(1/2)(x − µ̂ⱼ)ᵀΣⱼ⁻¹(x − µ̂ⱼ) + log(π̂ⱼ)
      = −(1/2)xᵀΣⱼ⁻¹x + xᵀΣⱼ⁻¹µ̂ⱼ − (1/2)µ̂ⱼᵀΣⱼ⁻¹µ̂ⱼ + log(π̂ⱼ)    (3.17)
This alternative is called the Quadratic Discriminant Analysis (QDA). It
is so named because in 3.17, x appears as a quadratic function.

Adding class-specific Σⱼ, QDA has many more parameters and is thus much more flexible than LDA in fitting, which can potentially cause an overfitting (high variance) problem; LDA, on the opposite end, is more prone to an underfitting (high bias) problem. To help you put your finger on the magnitude of QDA's parameter proliferation, consider this: in LDA, all classes share one variability measure (i.e. the covariance matrix), the estimation of which involves p(p + 1)/2 parameters. In QDA, this number is Kp(p + 1)/2, because each of the K classes gets a separate covariance matrix. A general rule of thumb in choosing between LDA and QDA is to take LDA when the data-to-variable ratio is low, because it is crucial to contain variance in these cases, and to choose QDA otherwise.

Comparing the classification methods we have covered so far (i.e. logistic regression, LDA, QDA, and KNN), it is clear that logistic regression and LDA are very similar, and only differ in that i) LDA assumes that observations within classes are normally distributed; and ii) the fitting procedures differ. In fact, in terms of log odds, a two-class LDA has the following formulation:

log(p₁(x)/(1 − p₁(x))) = log(p₁(x)/p₂(x)) = c₀ + c₁x    (3.18)

where c₀ and c₁ are functions of µ₁, µ₂, and σ². This corresponds to the log odds of logistic regression (cf. 3.3).

In practice, logistic regression fares better when the normal-distribution as-


sumption of LDA does not hold, otherwise LDA wins out.

As for the comparison between logistic regression & LDA as parametric methods
and KNN as a nonparametric method, see the discussion at the end of section
2.6.

QDA serves as a compromise between the nonparametric KNN and the para-
metric logistic & LDA methods.

3.3 Lab Code

# logistic regression

# fitting
glm.fit = glm(y ~ x1 + x2 + ... + xp, data='<data>', family=binomial)

# examining coding of factor
# if 'response=value a' is coded 1, a negative coef. indicates
# negative correlation of the predictor with value a
contrasts(factor)   # 0 or 1

# demo evaluation on training (using Smarket dataset from ISLR)
attach(Smarket)
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
              data=Smarket, family=binomial)
glm.probs = predict(glm.fit, type="response")
# predict(model, data, type=?):
# leaving data blank, predict on training.
glm.pred = rep('Down', 1250)
glm.pred[glm.probs>.5] = 'Up'
table(glm.pred, Direction)

# precision, recall, f
precision = 145/(145+457)   # .241
recall = 145/(145+141)      # .507
f = 2*(precision*recall/(precision+recall))   # .327

# create 'testing' (actually held-out) dataset by partitioning
train = (Year<2005)
Smarket.2005 = Smarket[!train,]
Direction.2005 = Direction[!train]

# train on the pre-2005 subset
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
              data=Smarket, family=binomial, subset=train)
# subset takes train as a vector of boolean values,
# TRUE for pre-2005 rows, FALSE otherwise.
glm.probs = predict(glm.fit, Smarket.2005, type='response')
... same as before ... create glm.pred ... print table ...

# linear discriminant analysis

# fitting
library(MASS)
lda.fit = lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
# interpretation
# prior: Pr(class), in this case Pr(Up) and Pr(Down)
# group means: means of predictors when class=...
# coefficients: form a linear combination of the predictors;
# this gives a decision rule:
# if the lin. comb. is large, positive prediction; negative otherwise

# evaluation
lda.pred = predict(lda.fit, Smarket.2005)
lda.class = lda.pred$class
table(lda.class, Direction.2005)
lda.pred$posterior
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
...
# on the day with index 999, .49 chance Down, .51 chance Up

# quadratic discriminant analysis

# fitting
qda.fit = qda(Direction~Lag1+Lag2, data=Smarket, subset=train)
# other operations same as lda(..)
# output doesn't include a 'coef', because this is a quadratic function

# k-nearest neighbors

# create arguments for fitting
train.X = cbind(Lag1,Lag2)[train,]    # training data (predictors)
test.X = cbind(Lag1,Lag2)[!train,]    # testing data (predictors)
train.Direction = Direction[train]    # training labels (response)

# fitting
library(class)
set.seed(1)   # for R to break ties in computation
knn.pred = knn(train.X, test.X, train.Direction, k=1)

# evaluation
table(knn.pred, Direction.2005)

# data management

# scaling/normalization
vector.norm = scale(vector)

4 Resampling
In statistical learning, we use sampling to maximize the value of a dataset. This
is particularly important when acquiring new data is difficult. For instance, for
a dataset D (assuming D to be the only data accessible under constraint on
resource), we have discovered in an initial modeling that including a quadratic
term for a predictor x significantly improves the fit, and we would like to make a reasonable estimate of how likely it is that this result is not due to chance. One way to do so is to sample m times from D and take each sample as
a training set, and the rest of the data a validation set. For each sample we
build the model using it and test the model on the corresponding validation set.
If we get similar results in most or all of the sampling/partitioning, we may be
assured that the initial result does have some validity, on which it is reasonable
for us to set up some working hypothesis.

In this chapter, we look at two resampling methods:


• Cross-Validation
• Bootstrapping

4.1 Cross-Validation
The simplest validation resampling method is to partition a dataset, by random sampling¹¹, into two halves: a training set and a validation set, and to build a model on the training set in each partitioning. We may then repeat the partitioning-training-estimating process m times, and then observe the results and validate the underlying patterns we suspected initially. However, the approach has two crucial drawbacks: i) test errors may be highly variable across partitioning pairs; ii) the error rate (e.g. MSE) is overestimated because we train on a smaller amount of data.

To address these issues, we introduce a refined method: Leave-One-Out-


Cross-Validation (LOOCV), where, instead of cutting the dataset in two
comparable subsets, we only leave 1 observation as the “validation set”. The
procedure of LOOCV goes as follows:
• Step 1: Given a dataset of size n, partition it into a training set (n − 1 observations) and a validation set (1 observation).
• Step 2: Build a model with the training set, and compute the MSE of the prediction on the held-out observation.
• Step 3: Repeat Steps 1 and 2 until each of the n observations has served as the validation set, and estimate the error rate by taking the average of all n MSEs: CV(n) = (1/n) Σᵢ₌₁ⁿ MSEᵢ.
¹¹ For instance, out of 100 observations in total, randomly draw 50 of them by their indices.
Fig 4.1 and 4.2 compare the performance of the half-and-half cross-validation
and LOOCV.

Figure 4.1: Half-and-Half Cross-Validation

Figure 4.2: Leave-One-Out Cross-Validation

This way of computing LOOCV can be overwhelmingly expensive when n is


large. Fortunately a clever shortcut will save us the trouble of building n models:
CV(n) = (1/n) Σᵢ₌₁ⁿ ((yᵢ − ŷᵢ)/(1 − hᵢ))², where hᵢ = 1/n + (xᵢ − x̄)²/Σᵢ′₌₁ⁿ(xᵢ′ − x̄)²    (4.1)
Here we build only 1 model with all n observations. hi is the leverage of the ob-
servation xi , indicating how far xi deviates from other observations (cf. section
2.5, equation 2.19). For a highly deviant data point, the corresponding leverage will be high, and the contribution it makes to CV(n) is also relatively high.
The error estimate is therefore adaptive to the underlying pattern in the dataset.
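A sketch of the shortcut in equation 4.1, using a single least-squares fit on the Auto data (from ISLR) and comparing against the LOOCV estimate from cv.glm():

# LOOCV via the leverage shortcut (eq. 4.1), vs. cv.glm
library(ISLR); library(boot)
fit = glm(mpg ~ horsepower, data=Auto)

h = hatvalues(fit)
mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)   # shortcut estimate
cv.glm(Auto, fit)$delta[1]                     # should agree for a least-squares fit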

Alternatively, we may also set the held-out set to be of 1/k size of the original
dataset. The error estimate is thus as follows.

CV(k) = (1/k) Σᵢ₌₁ᵏ MSEᵢ    (4.2)
This is called a k-fold cross-validation, which is illustrated in Fig 4.3:

Figure 4.3: k-Fold Cross-Validation

Now, a question naturally follows: Given that the high computational cost of
LOOCV can be avoided by 4.1, why k-fold CV should even be considered? The
question is answered from a bias-variance trade-off point of view. In terms of
bias, LOOCV is obviously superior to the k-fold method, because the training
sets it uses are almost identical to the original dataset, and thus has a very low
bias (i.e. it is a better approximation to the model built from the original set
than k-fold). However, because these training sets overlap heavily with each other, they are highly correlated, which induces high variance! That is, the models generalize poorly to new datasets (cf. section 1.2, the definitions of bias and variance). k-fold CV, therefore, serves as a good compromise between the
half-and-half CV and LOOCV by keeping the bias and the variance reasonably
balanced.

So far, our discussion and formulation of CV has focused on the quantitative setting. The same methods, however, apply well to qualitative data too, where the error is computed using the misclassifications:

CV(n) = (1/n) Σᵢ₌₁ⁿ Errᵢ = (1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ)    (4.3)

4.2 Bootstrapping
Relating to the context of resampling, the Central Limit Theorem (CLT) states that the means of a sufficiently large number of samples will be approximately normally distributed, and the "grand mean" of the samples approaches the true mean of the population from which the samples are drawn as the number of samples approaches infinity. In real-world problems, most likely we
will not be able to sample repeatedly from a target population, due to practical
constraint (e.g. the cost in an experimental study of cancer patients). In these
cases, we therefore sample instead from the original dataset. This resampling
method is referred to as Bootstrapping.

As an example, consider a scenario of an allocation of investment to two com-


panies, whose returns are A and B respectively. Our task is to decide the fraction of resources, call it α, to put into the company with return A (and thus 1 − α into the other company) so as to minimize the variance of the total return, i.e. Var(αA + (1 − α)B). With some calculus¹², we can show the optimal α to be as follows¹³:

α = (σ_B² − σ_AB) / (σ_A² + σ_B² − 2σ_AB)    (4.4)

¹² Differentiate Var(αA + (1 − α)B) with respect to α, set it equal to 0, and solve the equation.
¹³ Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y).
In practice, it is impossible for us to invest, say 1000 times, into the two com-
panies and analyze the returns as the dataset. All we have is a dataset which is
the past investment allocations and returns. Using bootstrapping, however, we
may then sample 1000 times from the dataset we have to get a decent estimate
of the result from actually investing 1000 times (i.e. the target population). Our
estimate of the true α, then, is the mean of all αs computed from the samples:
ᾱ = (1/1000) Σᵣ₌₁¹⁰⁰⁰ α̂ᵣ    (4.5)
To evaluate the extent to which the estimate is accurate, we compute the standard deviation of the α̂s: SE₁₀₀₀(α̂) = √((1/(1000 − 1)) Σᵣ₌₁¹⁰⁰⁰ (α̂ᵣ − ᾱ)²). According to the CLT, SE(α̂) approaches 0 and ᾱ approaches the true α as the number of samples approaches infinity.

Thanks to the amazing computational power of modern computers, running bootstrapping simulations like this can produce highly accurate estimates, which allow us to make good predictions.

4.3 Lab Code

# half-and-half CV (Auto from ISLR)

# select the first half-set
library(ISLR)
set.seed(1)               # so that results can be reproduced later
train = sample(392,196)   # 392 observations in total

# fit and evaluate on the validation set using MSE
attach(Auto)
lm.fit = lm(mpg~horsepower, data=Auto, subset=train)
mean( (mpg-predict(lm.fit,Auto))[-train]^2 )
# the evaluation can be compared with a power-2 fit
# lm.fit2 = lm(mpg~poly(horsepower,2), data=Auto, subset=train)

# choose a different half-set
set.seed(2)
train = sample(392,196)
# fitting is the same
# repeat: choose half-set -> evaluate
# the mean of all MSE evaluations is the final evaluation

# LOOCV

# fit
library(boot)
glm.fit = glm(mpg~horsepower, data=Auto)   # same as lm(..) here

# compute MSE
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta
# outputs two values
# value 1: MSE
# value 2: adjusted MSE (k-fold case: adjustment for not using LOOCV)

# LOOCV for power-i, i=1,2,3,4,5
cv.errs = rep(0,5)   # initialize with 0
for (i in 1:5) {
  glm.fit = glm(mpg~poly(horsepower,i), data=Auto)
  cv.errs[i] = cv.glm(Auto, glm.fit)$delta[1]
}
plot(cv.errs, type='b')   # plot results to find the "elbow"

# k-fold CV

# 10-fold demo
set.seed(17)
cv.errs = rep(0,10)   # can run more repetitions with lower comp. cost
for (i in 1:10) {
  glm.fit = glm(mpg~poly(horsepower,i), data=Auto)
  cv.errs[i] = cv.glm(Auto, glm.fit, K=10)$delta[1]
}
plot(cv.errs, type='b')

# bootstrapping (Portfolio from ISLR)

# example: compute the best investment ratio alpha

# step 1: define a function that computes the statistic of interest
attach(Portfolio)
alpha = function(data,index) {
  X = data$X[index]
  Y = data$Y[index]
  return( (var(Y)-cov(X,Y)) / (var(X)+var(Y)-2*cov(X,Y)) )
}

# step 2 (option 1: manual)
set.seed(1)
alphas = rep(0,1000)
for (i in 1:1000) {
  alphas[i] = alpha(Portfolio, sample(100,100,replace=T))
}
mean(alphas)
sd(alphas)

# step 2 (option 2: automatic)
boot(Portfolio, alpha, R=1000)
# this outputs: the original statistic, bias, standard error

# bootstrapping (evaluate quality of linear regression)

# create a function that computes the regression coefficients
coefs = function(data,index) {
  return( coef(lm(mpg~horsepower, data=data, subset=index)) )
}

# manual resampling
set.seed(1)
coefs(Auto, sample(392,392,replace=T))
# run this 1000 times, store values, compute mean & sd

# automatic resampling
boot(Auto, coefs, R=1000)

5 Linear Model Selection & Regularization
Up to now, we have been using least squares to find the fitting in linear regres-
sion, which, as we have shown in chapter 2, makes an effective and intuitively
simple approach. The prediction accuracy and the model interpretabil-
ity of linear models, however, can be further improved by supplementing least squares with some additional procedures. For least squares to function properly, the following conditions have to more or less hold:

• Linear relationship between the predictors and the response. Otherwise, the model will have high bias.

• The number of observations significantly greater than the number of predictors (i.e. n ≫ p). Otherwise, the model will have high variance. In particular, if p > n, the linear fit will not be unique¹⁴.

• The predictors must be relevant to the response. Otherwise, the linear model will produce "fake fits" such that it either does not generalize well, or it gives misleading predictions based on accidental correlations.

In this chapter, we consider three extensions to least squares, which aim to solve the issues that arise when these conditions do not hold:

• Subset Selection: The approach effectively selects relevant predictors.

• Shrinkage: The approach reduces the magnitude of the coefficients of the predictors in order to reduce variance.

• Dimension Reduction: The approach reduces the dimension of linear models, and discovers underlying predictive dimensions.

5.1 Subset Selection


The first subsetting method we consider is best subset selection. This is a brute-force search method, where we consider all combinations of the p predictors and find the best. "Best" is defined in two senses: among predictor combinations with different numbers of predictors, the evaluation criteria are usually¹⁵ i) the Akaike Information Criterion (AIC); ii) the Bayesian Information Criterion (BIC); and iii) Adjusted R². If two combinations have the same number of predictors, we simply use the residual sum of squares (RSS)¹⁶ to select the winner. The selection algorithm is as follows:

• Set M₀ as the null model, which does not contain any predictors.
• For k = 1, 2, ..., p:
  – Build all (p choose k) models that contain exactly k predictors.
  – Select the model that has the lowest RSS or the largest R², and set it as Mₖ.
• Compare M₀, ..., Mₚ, and select the model with the lowest cross-validation error, AIC, BIC, or adjusted R² (a short code sketch of this procedure follows below).

¹⁴ In fact, there will be an infinite number of linear solutions.
¹⁵ These will be discussed later on.
¹⁶ For logistic regression, use the deviance, which is defined as −2L(β₁, ..., βₚ) (cf. equation 3.3).
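A minimal sketch of this procedure using regsubsets() from the leaps package on the ISLR Hitters data (assuming leaps is installed; the details here are illustrative rather than part of the original notes):

# best subset selection with regsubsets()
library(ISLR); library(leaps)
Hitters = na.omit(Hitters)                  # drop rows with missing Salary

regfit = regsubsets(Salary ~ ., data=Hitters, nvmax=19)
reg.summary = summary(regfit)

which.max(reg.summary$adjr2)                # best size by adjusted R^2
which.min(reg.summary$bic)                  # best size by BIC
coef(regfit, which.min(reg.summary$bic))    # coefficients of that model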

It should be noted that RSS and R-squared oftentimes trade off with AIC and BIC: the former improve monotonically¹⁷ as the number of predictors increases, while the latter usually deteriorate beyond some point. One way to handle the issue is by first screening the models with different numbers of predictors to find the number that comes with the highest "cost performance"¹⁸. Consider Fig 5.1:

Figure 5.1: RSS & R-squared against the Number of Predictors

It is clear in the plots that the performance improvement to the model by in-
creasing the number of predictors comes to a near halt at p = 3. This can
be taken as an indicator that it is not worth the cost to increase p further. A
point like this is commonly called the elbow point^{19}. After having found the
elbow point of p, we then drop the models with more predictors from further
evaluation by AIC, BIC, etc.

In practice, we routinely encounter regression problems with very large p, in
which cases the best subset selection method quickly becomes computationally
overwhelming as p increases. Therefore, instead of searching through
^{17} This means a decrease in RSS and an increase in R-squared.
^{18} The cost here refers to the increase in the number of predictors, which adds to the complexity of a model.
^{19} Note that it is not always easy to find an elbow point, in which case we need to consider alternative means of evaluation. The Cp, AIC, and BIC statistics introduced later handle this issue.

all combinations of predictors, we find the combination with stepwise selec-
tion.

First consider forward stepwise selection. While identical to best
subset selection in the first and the third step, the forward stepwise
selection algorithm adds predictors to the model one by one, always picking
the predictor that best improves the model (in terms of RSS and R-squared).
Specifically, in each iteration, we select the predictor such that adding it to the
current model induces the highest reduction in RSS or increase in R-squared.
For instance, starting from the null model M0, we iterate through all p predic-
tors, find the best one, and set the current model to M1, which has this best
predictor as its only predictor. We then search through the remaining p−1 predictors,
find the best one, set the current model to contain this predictor and the predictor
in M1, call it M2, and loop again, until adding a predictor no longer
improves the model by more than some subjectively set threshold^{20}.

Doing the count, it is readily checked that, instead of having to inspect 2^p mod-
els, we now only have 1 + p(p + 1)/2 models to process.
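As a quick numerical illustration: with p = 20 predictors, best subset selection would have to examine 2^20 ≈ 1,048,576 models, whereas forward stepwise selection fits only 1 + (20 · 21)/2 = 211.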

Alternatively, we may also start with the full model (i.e. one which includes all
available predictors), and remove predictors one by one: at each step we remove the
predictor whose removal leads to the smallest reduction in R-squared or the smallest increase
in RSS, and stop when the cost of removal surpasses a threshold. This is called
backward stepwise selection.

Although the stepwise methods are computationally much less costly than the
best subset selection method, they do not guarantee finding the best model
overall. In practice, however, they still select models that predict reasonably
well. As a compromise between the stepwise methods and the best subset selec-
tion method, one may run the forward and the backward stepwise selection in
alternation: once the forward selection reaches its “improvement threshold”,
switch to the backward selection, run it until its threshold is reached, and switch
back. After a few such turns, we usually find a better model than the
ones found by using the forward or backward selection alone. This method is
called hybrid stepwise selection.

As has been pointed out, RSS and R-squared cannot serve as the sole crite-
rion for evaluating a model, because the model that includes all predictors always has the
lowest RSS and the highest R-squared (cf. Fig 5.1). Further, the error esti-
mate they give is a training estimate, which does not guarantee generalization
to new datasets. To adjust the training estimate, we therefore need
to somehow penalize the additional cost of adding more predictors and the resulting
increase in variance.

^{20} This could be a threshold value of RSS or R-squared.

The following adjustment methods are commonly used:

Cp = (1/n) (RSS + 2 d σ̂²) (5.1)
AIC = (1/(n σ̂²)) (RSS + 2 d σ̂²) (5.2)
BIC = (1/n) (RSS + log(n) d σ̂²) (5.3)

Note that σ̂² is the variance of the errors for all the responses (i.e. the variance
of yi − ŷi). d is the number of predictors selected for the current model. The
smaller these values are, the better a model is evaluated. Without digressing
too much by getting into the details of these measurements, one readily notices
that they share the following properties:

• Inversely related to the number of observations n.


• Positively related to the number of predictors and variance of errors.

In practice, the evaluation methods tend to select models with lower test error.
Fig 5.2 is an illustration of the methods in model selection:

Figure 5.2: Cp , AIC, BIC in Model Selection

The methods above are particularly necessary when “elbow points” cannot be
categorically decided. In addition, we can also penalize the addition of predictors
in R-squared, which is 1 − RSS/TSS, to calculate the adjusted R-squared:

Adjusted R² = 1 − (RSS/(n − d − 1)) / (TSS/(n − 1)) (5.4)
Adjusted R-squared takes into account the “noise” generated by introducing
more predictors, which may be interpreted as additional variance.

Finally, to mitigate the problem of low training error yet high test error, we can al-
ways use the cross-validation procedures introduced in chapter 4.

5.2 Shrinkage
As mentioned earlier, shrinkage^{21} techniques aim to reduce the variance
of a model by shrinking the coefficients of its predictors. The most commonly
used shrinkage techniques are ridge regression and the lasso.

In chapter 1, we said that the optimization goal of linear regression is to choose
the coefficients of the predictors of a model (i.e. βj, j = 1, ..., p) such that the
RSS of the model, which is defined as Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)², is mini-
mized. In ridge regression, we add an extra term, called a shrinkage
penalty, to the optimization goal, such that the coefficients found when the op-
timization process converges are shrunk toward zero (larger coefficients being
penalized more heavily). The new optimization goal is as follows^{22}:
Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)² + λ Σ_{j=1}^p βj² = RSS + λ Σ_{j=1}^p βj² (5.5)

The mechanism of ridge regression is intuitively clear: with the addition of the
penalty term, large coefficients now carry an explicit cost, so the optimizer trades a
small increase in RSS for a reduction in the magnitude of the coefficients, which
shrinks each coefficient toward zero. The λ coefficient of the penalty
term is called a tuning parameter (or regularization parameter). It is greater than
or equal to 0. It is clear that the larger the tuning parameter is, the greater
the extent to which the coefficients shrink.
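To make the effect of λ concrete, here is a minimal sketch (not from the text, and not the glmnet() routine used in the lab code below) that solves the ridge problem in Eq. 5.5 directly for centered, standardized synthetic data, where the solution is β̂ = (XᵀX + λI)⁻¹ Xᵀy:

# hedged sketch: ridge coefficients for a few values of lambda on synthetic data
set.seed(1)
n = 50; p = 5
X = scale(matrix(rnorm(n*p), n, p))          # standardized predictors
y = X %*% c(3, -2, 1, 0, 0) + rnorm(n)
y = y - mean(y)                              # centered response, so no intercept is needed
ridge.beta = function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(ridge.beta(0), ridge.beta(10), ridge.beta(1000))
# each column is a coefficient vector; the entries shrink toward zero as lambda grows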

An important note is that the scale of the predictors may affect the coefficients
in the fitting. Therefore it is advised to standardize the predictors before regu-
larization:
x̃ij = xij / sqrt( (1/n) Σ_{i=1}^n (xij − x̄j)² ) (5.6)
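In R, a comparable standardization can be obtained with scale() before fitting (a small sketch; note that glmnet(), used in the lab code below, standardizes the predictors internally by default, and that scale() divides by the sample standard deviation with an n − 1 denominator rather than the n in Eq. 5.6):

x.std = scale(x)       # assumes a numeric predictor matrix x, as constructed in the lab code below
apply(x.std, 2, sd)    # every column now has (sample) standard deviation 1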

The choice of the tuning parameter is crucial to the performance of the fit.
Specifically, while regularization lowers the variance of a model, it at the same
time increases its bias by reducing the flexibility of the fit: the predictors
make less “contribution” to the prediction under regularization, so their abil-
ity to “steer the fit towards the observations” weakens, which increases bias.
The balance point of the bias-variance trade-off can be found by plotting the
^{21} Shrinkage is also popularly known as regularization in the statistics and machine learning communities.
^{22} Note that by convention we do not shrink the intercept β0, which is simply the mean value of the response when all the predictors are “disabled” (i.e. x1, ..., xp = 0).

tuning parameter λ against the MSE (cf. section 1.2), as is demonstrated in
Fig 5.3:

Figure 5.3: Ridge Regression in the Bias-Variance Trade-off (Green: variance; Black: bias)

In practice, ridge regression works best when the least squares estimate has
high variance (e.g. when the p/n ratio is high, or when the predictor-response
relationship is close to linear), and it generally does a better job than subset
selection by preserving all predictors while minimizing the concomitant “harm”
of high variance.

While the predictor preservation in ridge regression is generally beneficial, it
does compromise model interpretability by never allowing the coefficients of
potentially useless “predictors” to reach 0. To enable the neutralization of these “im-
poster” predictors, we may instead use the lasso, which does force the
coefficients of these predictors to 0 when the tuning parameter is sufficiently
large:
Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)² + λ Σ_{j=1}^p |βj| = RSS + λ Σ_{j=1}^p |βj| (5.7)

Fig 5.4 illustrates why the lasso, but not ridge regression, allows some coefficients
to be shrunk to exactly 0. The red ellipses in the graphs are contours of RSS, and
the green areas are the constraint regions for the lasso and ridge regression (left
and right graph respectively)^{23}. The first contact point between an RSS contour and
a constraint region is where the optimal βs are found.
^{23} Note that, when the magnitude of regularization increases (i.e. λ increases), the constraint area shrinks.

Obviously, in the case of ridge regression, the first contact will generally not take place
where β1 = 0, whereas this happens readily in the case of the lasso.

Figure 5.4: Lasso vs. Ridge Regression

Finally, as a comparison of ridge regression and the lasso: the former shrinks every coeffi-
cient by the same proportion, whereas the latter shrinks every coefficient toward zero
by a roughly constant amount, setting those that reach zero exactly to zero (soft thresholding).
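A simple special case makes this concrete (a standard illustration, assuming n = p, a design matrix equal to the identity, and no intercept): least squares gives β̂j = yj; ridge regression gives β̂j = yj / (1 + λ), shrinking every coefficient by the same proportion; and the lasso gives β̂j = yj − λ/2 if yj > λ/2, β̂j = yj + λ/2 if yj < −λ/2, and β̂j = 0 if |yj| ≤ λ/2, moving every coefficient toward zero by the same amount and truncating at zero.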

5.3 Dimension Reduction


In a typical classification task in natural language processing (NLP), one would
usually have a large number of predictors. In such cases, usually many of the
predictors are not effective. To complicate things, oftentimes it is the underlying
patterns, which come in the form of the linear combinations of the predictors,
rather than the predictors themselves, that actually make better predictions. In
Fig 5.5, for instance, the observations are better described by the green and the
blue axes, rather than the original dimensions population and ad spending.
The sheer number of predictors can also be an issue in itself: it reduces the inter-
pretability of the model, and masks hidden “true” predictive factors.

Instead of modeling the data on the original predictors, therefore, we aim to reduce
the number of predictors (i.e. p) by lumping them together to create M new
predictors which are linear combinations of the original predictors,
such that M ≪ p. Mathematically, the procedure is as follows, where the Zm are
the new predictors^{24}:
^{24} In the context of the current example in Fig 5.5, the new predictors correspond to the dimensions along the green and blue lines.

Figure 5.5: Underlying Dimensions

Zm = Σ_{j=1}^p φjm Xj (5.8)
yi = θ0 + Σ_{m=1}^M θm zim + εi ,   i = 1, ..., n (5.9)

The coefficients of the original predictors (i.e. βs) can be reconstructed using
the coefficients of the new predictors (i.e. θs) and the linear coefficients φs as
follows:

Σ_{m=1}^M θm zim = Σ_{m=1}^M θm Σ_{j=1}^p φjm xij = Σ_{j=1}^p Σ_{m=1}^M θm φjm xij = Σ_{j=1}^p βj xij (5.10)
βj = Σ_{m=1}^M θm φjm (5.11)
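To make Eqs. 5.8–5.11 concrete, here is a minimal sketch (not from the text; the lab code below uses pcr() from the pls package instead) that builds the Zm with prcomp() on synthetic data and regresses the response on them:

# hedged sketch of principal components regression by hand
set.seed(1)
n = 100; p = 10; M = 3
X = matrix(rnorm(n*p), n, p)
y = X[,1] - 2*X[,2] + rnorm(n)
pc = prcomp(X, scale.=TRUE)            # columns of pc$rotation hold the phi_jm
Z = pc$x[, 1:M]                        # Z_m = sum_j phi_jm X_j   (Eq. 5.8)
pcr.fit = lm(y ~ Z)                    # y_i = theta_0 + sum_m theta_m z_im   (Eq. 5.9)
beta = pc$rotation[, 1:M] %*% coef(pcr.fit)[-1]   # recover the beta_j   (Eq. 5.11)
# note: prcomp() standardizes X here, so these betas are on the standardized scale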

In the following, we introduce two such dimension reduction techniques: prin-
cipal component analysis (PCA) and partial least squares (PLS).

We start with principal component analysis. Continuing with the example in Fig 5.5,
PCA aims to find the new dimensions corresponding to the green and
the blue axes. The green axis is aligned with the direction in which the obser-
vations vary the most, and the blue axis accounts for the rest of the variance,
which is not accounted for by the green axis^{25}. The green and the blue axes are
^{25} Here, we are assuming that the observations only vary in a two-dimensional space for simplicity.

called the 1st principal component and the 2nd principal component.

The PCA algorithm first finds the 1st PC, and then the 2nd PC, which is ge-
ometrically orthogonal to the 1st PC^{26}. Therefore, the 1st PC can be seen as
an “anchor” for finding the rest of the principal components. In addition to
lying along the direction in which the observations vary most, the 1st PC has the
following properties:

• The perpendicular distances between the 1st PC and the observations are
minimized.
• The 0 point on the 1st PC is the intersection of the average lines of the
observations along the original dimensions^{27}.

The properties are illustrated in Fig 5.6:

Figure 5.6: First Principal Component

It should be pointed out that the number of principal components is best de-
cided based on results from cross-validation.

PCA suffers from a major drawback: it is entirely unsupervised, in that the “di-
rections” of the PCs (i.e. the linear combinations of the predictors) are not guided
by the response. There is no guarantee that the directions that best explain the
predictors will also be the best directions to use for predicting the response. To
^{26} It is easy to see why this makes the 2nd PC account for the variance that is not accounted for by the 1st PC: the variation of the observations along the 2nd PC is not related to their values on the 1st PC.
^{27} This implies that if an observation has a 1st PC value greater than 0, then it has above-average values on the original dimensions.

handle this issue when quality training data is available, we use instead its close
cousin partial least squares (PLS), which is in large part identical to PCA, and
only differs in that it guides the model building with the correlation between
the predictors and the response.

Similar to PCA, where the PCs are linear combinations of the original dimen-
sions (i.e. predictors, cf. equation 5.8), PLS also has such “PCs”, only the PCs
of PLS have different coefficients. Specifically, φj1 (i.e. the coefficient of the
j-th predictor in the 1st PC, Z1) is set equal to the coefficient from the simple linear
regression Y ∼ Xj. Therefore the coefficients, although they do not always point the
directions of the PCs toward explaining the maximal amount of variance, do link them
up with the response in a supervised way. Fig 5.7 gives a demonstration,
comparing PCA’s 1st PC side-by-side with PLS’s.

Figure 5.7: Directions: PCA vs. PLS

To find the second PC, we first adjust each of the variables for Z1, by regressing
each variable on Z1 (i.e. Xj ∼ Z1) and taking the residuals^{28}. These residuals can
be interpreted as the remaining information that has not been explained by
Z1. We then compute Z2 using this orthogonalized data in exactly the same
fashion as Z1 was computed based on the original data^{29}.
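The construction just described can be sketched in a few lines of R (a minimal illustration on synthetic data, not the plsr() routine used in the lab code below):

# hedged sketch: first two PLS directions, following the description above
set.seed(1)
n = 100; p = 5
X = scale(matrix(rnorm(n*p), n, p))    # standardized predictors (synthetic)
y = rnorm(n)
phi1 = apply(X, 2, function(xj) coef(lm(y ~ xj))[2])   # simple-regression slopes
Z1 = X %*% phi1                                         # first PLS component
Xres = apply(X, 2, function(xj) resid(lm(xj ~ Z1)))     # adjust each predictor for Z1
phi2 = apply(Xres, 2, function(xj) coef(lm(y ~ xj))[2])
Z2 = Xres %*% phi2                                      # second PLS component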

In terms of choosing the number of dimensions, the same operations (i.e. cross-validation
evaluation, dimension-MSE plotting, etc.) used in PCA also apply to PLS.
While the supervised dimension reduction of PLS can reduce bias, it also has
the potential to increase variance, so that the overall benefit of PLS relative to
PCA is context-dependent.
^{28} These serve as the new predictor values.
^{29} The implementation details for PLS need further elaboration.

Figure 5.8: Overfitting in High Dimensional Spaces

5.4 High Dimensional Data


So far, we have been dealing with data where the number of observations n
is much greater than (or at least close to) the number of dimensions/predictors p.
However, this is not always the case in practice. Consider, for instance, a study with a
sample of 200 people and about half a million single nucleotide polymorphisms
(SNPs) as the predictors, in which we would like to predict the group’s blood pressure.
Here n ≈ 200, and p ≈ 500,000.

The major problem with using a least squares fit for such data is that the least
squares regression line is too flexible and hence will always overfit the data, even
when a sizable number of irrelevant predictors are involved. Fig 5.8 demonstrates
the resulting changes in R², training MSE and test MSE.
It is particularly worth noting that many previously mentioned evaluation pa-
rameters will not be appropriate in a high-dimensional setting. These include
squared errors, p-values, Cp, AIC, BIC and (adjusted) R². With high-
dimensional data, evaluation based on these parameters invariably paints
a rosy picture, even when the predictors may be utterly useless for
making predictions on new data.
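A tiny simulation illustrates the point (a sketch, not from the text): with p only slightly smaller than n and a response that is pure noise, least squares still reports a near-perfect training fit.

# hedged sketch: irrelevant predictors, yet training R-squared is essentially 1
set.seed(1)
n = 20; p = 19
X = matrix(rnorm(n*p), n, p)
y = rnorm(n)                              # response unrelated to X
fit = lm(y ~ X)
summary(fit)$r.squared                    # ~1: pure overfitting
X.new = matrix(rnorm(n*p), n, p)          # fresh data from the same (null) model
y.new = rnorm(n)
mean((y.new - cbind(1, X.new) %*% coef(fit))^2)   # large test MSE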

To combat overfitting, less flexible alternatives to least squares can be adopted,
such as forward stepwise selection, ridge regression, the lasso, and PCA. However,
note that fine-tuning the model still plays a significant role in striking the
balance in the bias-variance tradeoff. That is, heavy regularization^{30}, for instance,
does not always lead to a better model.

5.5 Lab Code

% Subset Selection I (Basics)

^{30} In the case of predictor subsetting and dimension reduction, the tuning will be on the number of predictors.

% Example: Hitters data set from ISLR
% use various predictors to predict the salary of players .

% identify & eliminate NA cells


sum(is.na(Hitters$Salary))
Hitters = na.omit(Hitters)

% subsetting
library(leaps)
regfit . full = regsubsets(Salary˜.,Hitters)
% summary on this outputs best 1 to 8 predictor models
% the selected predictors are marked with ‘∗’s.
regfit . full = regsubsets(Salary˜.,data=Hitters,nvmax=19)
% the number 8 is by default
% to tweak it, included argument nvmax.

% access evaluation parameters in subsetting


reg.summary = summary(regfit.full)
names(reg.summary)
% we have the following parameters
% rsq (R-squared); rss ( residual sum of square)
% adjr2 (adjusted R-squared); cp; bic
plot(reg.summary$rss,xlab=”# of vars”,ylab=”RSS”,type=’l’)
% plotting number of vars against an evaluation parameter.
which.max(reg.summary$adjr2) % returns 11
points(11,reg.summary$adjr2[11],col=’red’,cex=2,pch=20)
% find and mark the number of vars that maximizes
% an evaluation parameter.
plot( regfit . full , scale=’adjr2’)
% ranking models with an evaluation parameter.

% Subset Selection II (Fore/Backward Stepwise)

% forward & backward subsetting


regfit .fwd=regsubsets(Salary˜.,data=Hitters,nvmax=19,method=’forward’)
regfit .bwd=regsubsets(Salary˜.,data=Hitters,nvmax=19,method=’backward’)
% the summary output is the same as before.

% Model Selection

% splitting training & testing


set .seed(1)
train = sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE)
test = (!train)
% to see how many TRUE and FALSE are there
% in ‘ train ’ , do table(train)

% build model with training
regfit .best = regsubsets(Salary˜.,data=Hitters[train ,], nvmax=19)

% find the k-predictor model that fits test best


test .mat = model.matrix(Salary˜.,data=Hitters[test,])
% X matrix for test
val . errors = rep(NA,19)
for ( i in 1:19) {
coefi = coef(regfit .best, id=i)
pred = test.mat[,names(coefi)]%∗%coefi
% matrix multipilication
% X coefs = \hat{Y}
val . errors [ i ] = mean((Hitters$Salary[test]-pred)ˆ2)
% compute MSE
}
% the process can be encapsulated in a function
% as follows (usage shown later):
predict.regsubsets=function (object,newdata,id,...){
form=as.formula(object$call[[2]])
mat=model.matrix(form,newdata)
coefi =coef(object,id=id)
xvars=names(coefi)
mat[,xvars ]%∗% coefi
}
which.min(val.errors) % returns k = 10.
coef( regfit .best,10) % show best predictors.

% cross-validation
k = 10
set.seed(1)
folds = sample(1:k,nrow(Hitters),replace=TRUE)
cv. errors = matrix(NA,k,19,dimnames=list(NULL,paste(1:19)))
% the above allocates 263 observations in the data set
% randomly into 10 folds.
for(j in 1:k){
best. fit =regsubsets(Salary˜., data=Hitters[folds!=j ,], nvmax=19)
for(i in 1:19) {
pred=predict(best.fit,Hitters [ folds ==j,],id=i)
cv. errors [ j , i]=mean((Hitters$Salary[folds ==j]-pred)ˆ2)
}
}
% the above does the following:
% for each fold (as a test set), build model
% on the rest folds , and test on the left -out fold .
% in each testing round, try i-predictor model

% where i = 1,...,19.
% result: a 10∗19 matrix where
% rows = test folds
% columns = i-predictor models
mean.cv.errors = apply(cv.errors,2,mean)
% this computes column means
% each result is the mean MSE of an i-predictor model
% across all k CV-testings.
which.min(mean.cv.errors)
% this gives the best model’s # of predictors
% 11 in this case.

% Regularization

% load package
library (glmnet)
% glmnet()’s alpha argument decides type of regularization
% alpha = 0 for ridge regression
% alpha = 1 for lasso

% prepare data (into matrix)


x = model.matrix(Salary˜.,Hitters)[,-1]
y = Hitters$Salary

% ridge regression I ( fit )


grid = 10ˆseq(10,-2,length=100)
% customized generation of lambdas: from 10ˆ10 down to 10ˆ-2.
% glmnet() automatically selects a range of lambdas
% glmnet() automatically selects a range of lambdas
ridge .mod = glmnet(x,y,alpha=0,lambda=grid)
% glmnet() automatically normalizes variables to same scale
% to turn it off , do standardize=FALSE.
dim(coef(ridge.mod))
% coefficient matrix: 20∗100 (one column per lambda)
% predictors: 19+1 (intercept)
% lambdas: 100 (descending)
ridge .mod$lambda[50] % lambda = 11498
coef(ridge.mod)[,50]
% return the coefs of predictors
% with lambda = 11498
% these should be smaller relative to
% coefs in the unregularized fit .
predict(ridge.mod,s=50,type=’coefficients’ )[1:20,]
% use a different value for lambda (i.e. 50)
% this returns coefficients with the regularization .

% ridge regression II (cross - validation )
set.seed(1)
train = sample(1:nrow(x),nrow(x)/2)
% split data into training and test
% half-half split :
% each row of the data matrix corresponds to 1 observation
% we randomly select 131 (i.e. nrow(x)/2) players for training.
test = (-train)
% the rest half is put in the test set.
y. test = y[test ]
% actual response values in the test set.
ridge .mod = glmnet(x[train,],y[train ,], alpha=0,lambda=grid,thresh=1e-12)
% regularized fit .
ridge .pred = predict(ridge.mod,s=4,newx=x[test,])
% predict on test set, with lambda = 4.
mean((ridge.pred-y.test)ˆ2)
% MSE (101037 in this case).

% ridge regression III (find best lambda)


set.seed(1)
cv.out = cv.glmnet(x[train ,], y[ train ], alpha=0)
% k=10 cross-validation by default
% can be changed using folds=..
bestlam = cv.out$lambda.min
% find best lambda (in terms of min cv error).
out = glmnet(x,y,alpha=0)
predict(out,type=’coefficients ’ , s=bestlam)[1:20,]
% fit model with all predictors
% then find the coefs with best lambda.

% lasso
lasso .mod = glmnet(x[train,],y[train ], alpha=1,lambda=grid)
% fit model on training.
set.seed(1)
cv.out = cv.glmnet(x[train ,], y[ train ], alpha=1)
% cross-validation .
bestlam = cv.out$lambda.min
% find best lambda.
lasso .pred = predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)ˆ2)
% MSE with best lambda.
out = glmnet(x,y,alpha=1,lambda=grid)
lasso .coef=predict(out,type=’coefficients ’ , s=bestlam)[1:20,]
% coefficients with best lambda
% note that some of the coefs = 0
% this shows that lasso , unlike ridge regression

% does do predictor selection .

% PCR & PLS

% PCR

% utilities
library(pls)

% fit
set.seed(2)
pcr. fit = pcr(Salary˜.,data=Hitters,scale=TRUE,validation=’CV’)
% scale=TRUE standardizes each predictor.
% validation=’CV’ does 10-fold cross- validation .
% pcr() reports Root MSE, which = sqrt(MSE).

% evaluation
validationplot (pcr. fit , val .type=’MSEP’)
% plotting # of vars against MSE.

% training vs. test


set.seed(1)
pcr. fit = pcr(Salary˜.,data=Hitters,subset=train,scale=TRUE,validation=’CV’)
validationplot (pcr. fit , val .type=’MSEP’)
% evaluating on training set.
pcr.pred = predict(pcr.fit,x[ test ,], ncomp=7)
mean((pcr.pred-y.test)ˆ2)
% evaluating on test set.

% PLS

% fit
pls . fit = plsr(Salary˜.,data=Hitters,subset=train,scale=TRUE,validation=’CV’)

% evaluation
validationplot (pls . fit , val .type=’MSEP’)
% evaluating on training set.
pls .pred = predict(pls.fit ,x[ test ,], ncomp=2)
mean((pls.pred-y.test)ˆ2)
% evaluating on test set.

6 Preliminaries on Nonlinear Methods
When the actual relationship between the predictors and the response is not
linear, linear models are limited in their power to balance the bias-
variance tradeoff while making decent predictions on new data. Although the
envelope can be pushed slightly further by using regularization and dimension
reduction, the performance of such models does not improve by much.

In this chapter, we consider a set of extensions of the linear models covered pre-
viously. The extensions are engineered such that they model nonlinear behaviors
and interactions among variables/predictors.

6.1 Polynomial Regression


Polynomial regression models the relationship between the predictor and the re-
sponse with polynomial terms of degree 1 up to d. Specifically, it is formulated as follows:

yi = β0 + β1 xi + β2 xi² + ... + βd xi^d + εi (6.1)

Technically there is no upper bound on the power d. In practice, however, it is rarely
greater than 3 or 4, because increasing d makes the model more flexible. An
overly flexible polynomial curve can take on some very strange shapes which we
do not have any reasonable information in the data to back up.

Suppose we have an age variable (xi) with which we would like to predict the
wage (f̂(xi)) of the corresponding persons. The following is the degree-4 polynomial
regression model for the task. The regression curve is the solid line in the left
pane of Fig. 6.1.

f̂(xi) = β̂0 + β̂1 xi + ... + β̂4 xi⁴ (6.2)
If we say that $250,000 is the threshold that divides the observations/persons
into a high income and a low income group, we may further build a polynomial
logistic regression model for the classification task:
P̂r(f(xi) > 250k | xi) = exp(β̂0 + β̂1 xi + ... + β̂4 xi⁴) / (1 + exp(β̂0 + β̂1 xi + ... + β̂4 xi⁴)) (6.3)

Now if we can compute the variance of f̂(xi), we will be able to compute its standard
deviation^{31}, and thus plot a confidence interval around the regression curve, as
depicted with dashed lines in Fig. 6.1. The dashed lines are generated by
plotting twice the standard deviation on either side of the curve (i.e. an approximate 95%
confidence interval, f̂(xi) ± 2σ).
^{31} The computation is as follows: Var[f̂(xi)] = liᵀ Ĉ li, where Ĉ is the 5 × 5 covariance matrix of the β̂j, and liᵀ = (1, xi, xi², xi³, xi⁴). The standard deviation is the square root of Var[f̂(xi)].

53
Figure 6.1: Polynomial Regression (left); Polynomial Logistic Regression (right)

6.2 Step Functions


Polynomial functions of the predictors impose a “global” structure on the non-linear
function of X. Specifically, a predictor xi, which is featurized as itself at sev-
eral power levels (i.e. xi, ..., xi^d), in some sense has its “powerized” components
covary with each other, and this limits the flexibility of the model, in that the
estimated value of the response (i.e. f̂(xi)) can only change at a
limited rate^{32}.

To break from the restraining force of the global structure, we may instead use
a step function. Basically, we partition the range of the predictor X^{33} into
bins with cut points c1, c2, ..., cK, and fit a different constant in each bin. This
converts a continuous variable into an ordered categorical variable. Depending on
which bin the value of X falls into, we have K + 1 new variables^{34}:

C0(X) = I(X < c1),
C1(X) = I(c1 ≤ X < c2),
...
CK−1(X) = I(cK−1 ≤ X < cK),
CK(X) = I(cK ≤ X)

Under the partition, the linear model uses C1(X), ..., CK(X) as predictors in-
stead, and takes the following form for a particular observation xi:

yi = β0 + β1 C1(xi) + ... + βK CK(xi) + εi (6.4)


^{32} Thinking of this graphically, the regression curve’s room for “swerving” is limited.
^{33} A matrix of the values of the predictors corresponding to the values of the response.
^{34} The I(..) are indicator functions.

Figure 6.2: Linear Model with Step Function

The logistic-regression-version of the step function model will be as follows,


where ξ is a certain threshold which effectively serves as the decision boundary:

Pr(yi > ξ | xi) = exp(β0 + β1 C1(xi) + ... + βK CK(xi)) / (1 + exp(β0 + β1 C1(xi) + ... + βK CK(xi))) (6.5)
Graphically, the regression curve of a linear model with a step function will look
like Fig. 6.2.

6.3 Basis Functions: A Generalization


Basis functions are a family of approaches where, instead of fitting the
predictors X directly, we fit functions of them: b1(X), ..., bK(X). The model
takes the following form:

yi = β0 + β1 b1(xi) + ... + βK bK(xi) + εi (6.6)

Polynomial regression and the step function discussed in the two previous sections
are essentially special cases of basis functions. Specifically, for polynomial re-
gression, the basis functions are bj(xi) = xi^j. For the step function, the basis
functions are bj(xi) = I(cj ≤ xi < cj+1).
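As a small illustration of Eq. 6.6 (a sketch using the Wage data from the ISLR package, which the lab code below also uses), both of the following fits are ordinary linear models in a set of basis functions of age:

library(ISLR)
fit.poly = lm(wage ~ poly(age, 4), data = Wage)   # b_j(x): degree-j (orthogonalized) polynomial basis
fit.step = lm(wage ~ cut(age, 4), data = Wage)    # b_j(x): indicator of the j-th age bin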

In the following, we consider two alternative basis functions: regression splines


and smoothing splines.

6.4 Regression Splines


6.4.1 Piecewise Polynomials & Basics of Splines
Essentially, instead of fitting a high-degree polynomial over the entire range of
X, piecewise polynomial regression involves fitting separate low-degree polyno-

Figure 6.3: Unconstrained Piecewise & Spline Constraints

mials over different regions of X. For instance, the piecewise version of the
following cubic polynomial would have different sets of coefficient values,
i.e. β0, β1, β2, and β3, in different regions:

yi = β0 + β1 xi + β2 xi² + β3 xi³ + εi (6.7)

There could be, for instance, a point c such that β0 = β01, β1 = β11, β2 = β21,
β3 = β31 if xi < c, and β0 = β02, β1 = β12, β2 = β22, β3 = β32 if xi ≥ c.
The points where the coefficients change are called knots.

The top-left pane in Fig 6.3 presents a simple demo of how piecewise poly-
nomial regression works. Basically, the data is cut into two halves for two fits
which share the same predictors.
The discontinuity at age = 50 is apparently somewhat unnatural. To handle
this, we may impose a constraint on the model which requires the fitted curve
to be continuous; this results in the fit in the top-right pane of Fig 6.3. The
continuity constraint can be further extended to continuity of the k-th deriva-
tive^{35}, such that the fitted curve becomes even smoother, as is shown in the
bottom-left pane. Finally, rather than fitting a polynomial curve in the interval
^{35} Note that k ≤ d, where d is the highest degree of polynomial in the model.

of each partition, we may fit a straight line too. With the continuity constraint,
we have the fitting in the bottom-right pane. This is called a linear spline.

Here is how the continuity constraint is imposed. We may use a basis model to
represent our regression spline. Eq. 6.7 is then written in the form of a basis
model:

yi = β0 + β1 b1(xi) + β2 b2(xi) + β3 b3(xi) + εi (6.8)

Alternatively, we may add a term to the model in Eq. 6.7: a truncated power
basis function per knot, where ξ is the knot:

h(x, ξ) = (x − ξ)+³ = (x − ξ)³ if x > ξ, and 0 otherwise (6.9)

Specifically, adding a term of the form β4 h(x, ξ) to the cubic polynomial in Eq. 6.7
leads to a discontinuity in only the third derivative at ξ; the function itself and its
first and second derivatives remain continuous at each of the knots.
Finally, as splines are unconstrained in the border intervals, we may further add a
boundary constraint, which requires the function to be linear at the boundary,
for more stable estimates there. This is called a natural spline.

6.4.2 Tuning Spline


How do we place knots to improve the performance of the model? One way is to
place more knots in places where we expect the function may vary most rapidly,
and fewer where it seems more stable. Essentially, adding knots increases the
flexibility of the fitting. In practice, however, it is common to place knots in a
uniform fashion.

While we cannot be sure at the outset which approach will fare better (or how
many partitions to use, in the case of evenly-spaced knot placement), we may always
resort to cross-validation to find the best knot placement. With this method,
we fit a spline on the training set, and make predictions with it on the held-out
set. The performance of the spline is then evaluated by computing the overall
cross-validated RSS.

6.4.3 Spline vs. Polynomial Regression


Generally, with the same degrees of freedom, splines produce more stable fits,
whereas polynomial regression tends to falter at the boundaries of the range of
X. This is demonstrated in Fig 6.4, where 15 degrees of freedom are given to a
natural spline and a polynomial regression.

6.4.4 Smoothing Splines


Smoothing Spline is a particular type of spline where there is a knot at each
data point. We know that the optimization in splines is least-squares, which

Figure 6.4: Natural Splines vs. Polynomial Regression (df = 15)

essentially minimizes RSS = Σ_{i=1}^n (yi − g(xi))² for some function g(x)
which produces an estimate for the corresponding y. From Ch 5, we
learned that when the number of predictors is high (and especially when the
observation-to-predictor ratio is low), regression models tend to overfit the
training set. Here we have the same problem with smoothing splines, because
each data point effectively contributes a “predictor”. Graphically, an overfitted regression
curve has high overall slope variation, i.e. the first derivative of the curve
changes frequently and by large amounts. To moderate
this, we add a penalty to our optimization goal (i.e. the RSS) that also minimizes the
slope variation, which is done by penalizing the second derivative:
Σ_{i=1}^n (yi − g(xi))² + λ ∫ g″(t)² dt (6.10)

λ here is the tuning parameter. g becomes perfectly smooth (i.e. becomes a
straight line) as λ → ∞. The term ∫ g″(t)² dt is a measure of the overall slope
variation. Although it seems that we would still have an overwhelmingly large
df when the data points are numerous, λ serves to reduce the effective df, which
is a measure of the flexibility of the smoothing spline. The effective df of a
smoothing spline, dfλ, is defined as follows:
ĝλ = Sλ y (6.11)
dfλ = Σ_{i=1}^n {Sλ}ii (6.12)

ĝλ here is an n-vector containing the fitted values of the smoothing spline at the
training points x1, ..., xn. According to Eq. 6.11, this vector can be written as the
Figure 6.5: Local Regression

product of an n × n matrix Sλ and the response vector y. In other words, there


exists a linear transformation from y to its estimate ĝλ. The transformation matrix Sλ
is then used to compute dfλ.

To choose λ, we may use LOOCV by computing the following RSScv(λ) and
selecting the λ for which the value is minimized. The computation makes use of
the transformation matrix Sλ, and is remarkably efficient:
RSScv(λ) = Σ_{i=1}^n (yi − ĝλ^{(−i)}(xi))² = Σ_{i=1}^n [ (yi − ĝλ(xi)) / (1 − {Sλ}ii) ]² (6.13)

In general, we would like to have a model with fewer degrees of freedom (i.e. fewer
free parameters, and hence a simpler model).

6.5 Local Regression


An alternative to regression splines is Local Regression. The idea of local re-
gression goes as follows: for each target point x0, find a set of points xi from a defined
vicinity of x0. The xi are weighted (indicating their “relative importance”) by
their distance from x0. The estimate ŷ0 is then obtained from the regression fitted to
this weighted set of points. This is illustrated graphically in Fig 6.5.

The procedure of local regression is described in the following algorithm:


• Gather the fraction s = k/n of training points whose xi are closest to x0.
• Assign a weight Ki0 = K(xi, x0) to each point in this neighborhood, so
that the point furthest from x0 has weight zero, and the closest has the
highest weight. All but these k nearest neighbors get weight zero.
• Fit a weighted least squares regression of the yi on the xi using the afore-
mentioned weights, by finding β̂0 and β̂1 that minimize
Σ_{i=1}^n Ki0 (yi − β0 − β1 xi)² (6.14)

• The fitted value at x0 is given by f̂(x0) = β̂0 + β̂1 x0.
Finally, note that local regression suffers from the same “neighbor sparsity”
problem as the K-nearest neighbors approach in high dimensions. Recall that, in
high dimensions, it is very difficult to find a set of neighbors close to the target
data point.

6.6 Generalized Additive Model


The Generalized Additive Model (GAM) is a compromise between linear models and
non-parametric models. Specifically, its formulation allows individual predictors
to be associated with the response non-linearly, but at the same time imposes a
global additive structure on the model. Instead of giving each predictor a coefficient, as
a linear model does (Eq. 6.15), a GAM replaces each term in the linear model
with a non-linear function: βj xij → fj(xij) (Eq. 6.16):

yi = β0 + β1 xi1 + ... + βp xip + εi (6.15)
yi = β0 + f1(xi1) + ... + fp(xip) + εi (6.16)

The pros and cons of GAM are listed as follows:


• Pros of GAM
– Incorporate non-linear relationships between the predictors and the response
which linear models miss.
– Potentially more accurate fit.
– The relationships between individual predictors and the response can
be studied while holding other predictors fixed.
• Cons of GAM

– Additive restriction. Important interactions among variables can be


missed, if there are any.
The extension of GAM to qualitative settings is simple. This is demonstrated
with Eq. 6.17 & Eq. 6.18:
 
log( p(X) / (1 − p(X)) ) = β0 + β1 X1 + ... + βp Xp (6.17)
log( p(X) / (1 − p(X)) ) = β0 + f1(X1) + ... + fp(Xp) (6.18)

6.7 Lab Code

% Polynomial Regression I (linear)

library(ISLR)
attach(Wage)

fit = lm(wage˜poly(age,4), data=Wage)


% orthogonal polynomials∗
coef(summary(fit)) % print out

fit2 = lm(wage˜poly(age,4,raw=T), data=Wage)


fit2a = lm(wage˜age+I(ageˆ2)+I(ageˆ3)+I(ageˆ4), data=Wage)
fit2b = lm(wage˜cbind(age,ageˆ2,ageˆ3,ageˆ4), data=Wage)
% original/raw polynomial
% fitting the same, coefs change

agelims = range(age)
age.grid = seq(from=agelims[1], to=agelims[2])
% grid: 18,19,...,90
preds = predict(fit,newdata=list(age=age.grid),se=TRUE)
% make prediction
se.bands = cbind(preds$fit+2∗preds$se.fit, preds$fit-2∗preds$se.fit)
% show standard error band at 2se

par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0))


% 1 row 2 col grid
% margin: (bottom,left,top,right)
% oma: outer margin
plot(age, wage, xlim=agelims, cex=.5, col=’darkgrey’)
title ( ’D-4 Poly’, outer=T)
lines(age.grid, preds$fit , lwd=2, col=’blue’)
% add fit curve
matlines(age.grid, se.bands, lwd=1, col=’blue’, lty=3)
% add standard error band

fit .1 = lm(wage˜age, data=Wage)


fit .2 = lm(wage˜poly(age,2), data=Wage)
fit .3 = lm(wage˜poly(age,3), data=Wage)
fit .4 = lm(wage˜poly(age,4), data=Wage)
fit .5 = lm(wage˜poly(age,5), data=Wage)
anova(fit.1, fit .2, fit .3, fit .4, fit .5)
% model comparison for choosing degree of polynomial
% cutting point is where value is insignificant

% Polynomial Regression II ( logistic )

fit = glm(I(wage>250)˜poly(age,4), data=Wage, family=binomial)


% create fit
preds = predict(fit, newdata=list(age=age.grid), se=T)

% make prediction
% alternative:
% preds = predict(fit, newdata=list(age=age.grid),
% type=’response’, se=T)
pfit = exp(preds$fit) / (1+exp(preds$fit))
% convert logit to estimate
se.bands.logit = cbind(preds$fit+2∗preds$se.fit, preds$fit-2∗preds$se.fit)
se.bands = exp(se.bands.logit) / (1+exp(se.bands.logit))
% show standard error band at 2se

plot(age, I(wage>250), xlim=agelims, type=’n’, ylim=c(0,.2))


points(jitter(age), I((wage>250)/5), cex=.5, pch=’|’, col=’darkgrey’)
% jitter(): ‘rug plot’ that makes values non-overlap
lines (age.grid , pfit , lwd=2, col=’blue’)
matlines(age.grid , se .bands, lwd=1, col=’blue’, lty=3)
% plot i) fit , ii ) 2se bands

table(cut(age,4))
% counts of observations in
% 4 ‘age-buckets’
fit = lm(wage˜cut(age,4), data=Wage)
% partitioned fit
coef(summary(fit))

% Splines I ( regression splines )

library(splines)
fit = lm(wage˜bs(age,knots=c(25,40,60)), data=Wage)
% fit
% bs(): generate matrix of basis functions for specified knots
pred = predict(fit, newdata=list(age=age.grid), se=T)
% make prediction
plot(age, wage, col=’gray’)
lines(age.grid, pred$fit , lwd=2)
lines(age.grid, pred$fit+2∗pred$se, lty=’dashed’)
lines(age.grid, pred$fit -2∗pred$se, lty=’dashed’)

dim(bs(age, knots=c(25,40,60)))
% two ways to check df
attr(bs(age,df=6), ’knots’)
% show quantile percentages

% Splines II (natural splines )

fit2 = lm(wage˜ns(age,df=4), data=Wage)


pred2 = predict(fit2, newdata=list(age=age.grid), se=T)

lines(age.grid, pred2$fit , col=’red’, lwd=2)

% Splines III (smoothing splines)

fit = smooth.spline(age, wage, df=16)


fit2 = smooth.spline(age, wage, cv=T)
fit2$df % df selected by cross-validation = 6.8
plot(age, wage, xlim=agelims, cex=.5, col=’darkgrey’)
lines( fit , col=’red’, lwd=2)
lines( fit2 , col=’blue’, lwd=2)

% Local Regression

fit = loess(wage˜age, span=.2, data=Wage)


% span=.2: neighborhood consists of 20% of the observations
fit2 = loess(wage˜age, span=.5, data=Wage)
plot(age, wage, xlim=agelims, cex=.5, col=’darkgrey’)
lines(age.grid, predict(fit , data.frame(age=age.grid)), col=’red’, lwd=2)
lines(age.grid, predict(fit2,data.frame(age=age.grid)), col=’blue’, lwd=2)

% GAM

gam1 = lm(wage˜ns(year,4)+ns(age,5)+education, data=Wage)


% ns(data, df, ...) for year & age
% regular qualitative for education

library(gam)
gam.m3 = gam(wage˜s(year,4)+s(age,5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=T, col=’blue’)
% 3 plots for 3 predictors
% each shows respective predictor’s fit to response

gam.m1 = gam(wage˜s(age,5)+education, data=Wage)


gam.m2 = gam(wage˜year+s(age,5)+education, data=Wage)
anova(gam.m1, gam.m2, gam.m3, test=’F’)
% model comparison

gam.lo = gam(wage˜s(year,df=4)+lo(age,span=.7)+education, data=Wage)


gam.lo.i = gam(wage˜lo(year,age,span=.5)+education, data=Wage)
% make use of local regression

gam.lr = gam(I(wage>250)˜year+s(age,df=5)+education, family=binomial, data=Wage)


par(mfrow=c(1,3))
plot(gam.lr, se=T, col=’green’)
% logistic GAM

table(education, I(wage>250))
% show predictions

Figure 7.1: Decision Tree Demo

7 Tree-Based Models
7.1 Decision Trees
7.1.1 Model of DT
In a typical Decision Tree (DT) task, we have n observations x1, ..., xn and p predic-
tors/parameters, and we would like to compute an estimate ŷi for each response
yi. Graphically, the example in Fig 7.1 illustrates how the predictors year and
hits are used to predict a baseball player’s salary^{36}.

In the example, each of the two predictors is split into two regions at an artificial dividing
point which minimizes the RSS (formalized below). The tree can also be represented
with a graph of decision regions, as in Fig 7.2:

Having the basic setup of a decision tree task in mind, we now formulate the
prediction rule and the optimization goal of a decision tree.

• Prediction
– Given a set of possible values of observations X1 , ..., Xp character-
ized by p predictors, partition the values into J distinct and non-
overlapping regions R1 , ..., RJ .
– For every observation xi in region Rj , the prediction/estimate for its
corresponding ŷi is the mean of the response values yi which are in
Rj .
^{36} Represented as log salary to obtain a more bell-shaped distribution.

Figure 7.2: Decision Tree as Decision Regions

• Optimization Goal
Σ_{j=1}^J Σ_{i∈Rj} (yi − ŷ_{Rj})² (7.1)

Essentially, in constructing a decision tree, we make two decisions:


• The cutting points s1 , ..., sk for each predictor Xj , by which each predictor
gives two decision regions:
R1 (j, sj ) = {X|Xj < sj } and R2 (j, sj ) = {X|Xj ≥ sj } (7.2)

• The sequence of predictors X1 , ..., Xk , where k ≤ p, by which the parti-


tioning of the decision space is done. The sequence should minimize the
combined RSS of all decision regions.
Σ_{j∈J} Σ_{i: xi ∈ Rj(j,sj)} (yi − ŷ_{Rj})² (7.3)
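For a single predictor, the search for the best cutting point in Eqs. 7.2–7.3 can be sketched directly (a minimal illustration on synthetic data, not the tree() routine used in the lab code below):

# hedged sketch: exhaustive search for the best single split on one predictor,
# minimizing the combined RSS of the two resulting regions
best.split = function(x, y) {
  cuts = sort(unique(x))[-1]             # candidate cut points (skip the minimum)
  rss = sapply(cuts, function(s) {
    left = y[x < s]; right = y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(rss)]
}
set.seed(1)
x = runif(200); y = ifelse(x < 0.4, 1, 3) + rnorm(200, sd=0.3)
best.split(x, y)                          # should be close to 0.4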

In practice, it is clearly inefficient to scan through all possible sequences (i.e.
all possible tree structures). Further, for the simplicity of the model, we would like
to have as few predictors (and thus decision regions) involved as possible, at a reasonable cost
in RSS^{37}. Therefore, the optimization goal in Eq. 7.1 is modified to include a
penalty term that also minimizes the number of terminal nodes in the tree (Eq. 7.4,
where |T| is the number of terminal nodes and m is the index over decision regions).
Σ_{m=1}^{|T|} Σ_{xi∈Rm} (yi − ŷ_{Rm})² + α|T| (7.4)

^{37} It is clear that the more predictors we use, the lower the RSS will be on the training set. This, however, risks overfitting our model.

The sequence selection can be carried out with some variation of forward/back-
ward/hybrid selection procedure (cf. Ch 5.1), which will not be elaborated here.
To guard against overfitting, each tree is also subject to a cross-validation where
the M SE is computed to evaluate a particular tree’s performance.

A tree used for a classification task differs from a regression tree in both the way in
which the prediction is made and the optimization goal.
• Prediction
Each observation goes to the most commonly occurring class of training
observations in a decision region.

• Optimization Goals
– Classification Error Rate38

E = 1 − max_k (p̂mk) (7.5)

– Gini Index 39
G = Σ_{k=1}^K p̂mk (1 − p̂mk) (7.6)

– Cross-Entropy 40
D = − Σ_{k=1}^K p̂mk log p̂mk (7.7)
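As a quick numerical illustration: in a two-class region with p̂m1 = 0.9 and p̂m2 = 0.1, we get E = 1 − 0.9 = 0.1, G = 0.9·0.1 + 0.1·0.9 = 0.18, and D = −(0.9 log 0.9 + 0.1 log 0.1) ≈ 0.33; all three measures shrink toward 0 as the node becomes purer.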

In building a classification tree, Gini or Cross-Entropy is used to evaluate the


quality of a particular split. While Gini and Cross-Entropy are effective with
tree pruning, Classification Error Rate is preferable if the objective is making
predictions. Finally, note that node purity is important because it reduces the
uncertainty in a decision when information is incomplete.

7.1.2 DT: Pros & Cons


Many tasks can be approached with either a DT or a linear model, therefore we
need to decide which one is more ideal for a particular data set and task. A gen-
eral rule of thumb is as follows: A linear model works better if the relationship
between the predictors and the response is close to linear. On the other hand,
if this relationship is highly non-linear and complex, then DT makes a better bet.

More generally, the pros and cons of DT are listed as follows:


^{38} p̂mk represents the proportion of training observations in the m-th region that are from the k-th class.
^{39} The Gini index is a measure of node purity, in the sense that a small Gini index indicates that a node contains predominantly observations from a single class.
^{40} Cross-entropy also measures node purity.

• Pros of DT
– Highly interpretable, even for non-statisticians.
– Cognitively (presumably) closer to the process of human decision-
making.
– Efficiently handle qualitative predictors w/o the need to create dummy
variables.
• Cons of DT
– Relatively low predictive accuracy, compared with other regression
and classification models.

7.2 Bagging
Bagging is a model-building method that aims at reducing variance. Basically,
we would like to build n models with n training sets, and obtain prediction as the
average of the predictions from the n models. If each model has variance σ², the
variance of the averaged prediction will be σ²/n ≤ σ². In practice, obtaining n separate
training sets is oftentimes impractical, so we use bootstrapping to generate B different
training sets by sampling from the one training set we have. The prediction is
thus made as follows:
f̂_bag(x) = (1/B) Σ_{b=1}^B f̂*b(x) (7.8)

For a classification task, we let the B models “vote” with a classification decision,
and take the majority vote as our final classification. One nice property of bag-
ging is that increasing its parameter B does not lead to an increase in variance. In
practice we use a value of B sufficiently large that the error has settled down.

The bagging counterpart of cross-validation is the Out-Of-Bag
(OOB) error estimate: each observation is predicted using only the models whose
bootstrap samples did not include that observation, and the accuracy of these
predictions is recorded. The overall prediction accuracy is obtained by averaging
over all the recorded prediction accuracies.

Performing bagging on a DT comes at the price of reduced interpretability.
Specifically, with bagging, it is not clear which predictor(s) are more important
in the process of decision-making. The following methods remedy this problem:

• RSS-reduction: Record the decrease in RSS at each split over a given
predictor, and average over all B trees.
• Gini-reduction: Same as RSS-reduction, only this time we record the
decrease in the Gini index at each split for a predictor.

7.3 Random Forest
A serious problem that prevents bagging from effectively reducing the variance
is that the bagged trees are highly correlated when there are a few strong predic-
tors in the set of all predictors: the trees built from the B bootstrapped training
sets then end up very similar to each other. Random Forest overcomes this problem by
decorrelating the bagged trees with a clever splitting technique: at each split,
m predictors are sampled from all p predictors (empirically m ≈ √p gives
good performance), and the split is allowed to use only one of these m predictors.

Procedurally, RF is otherwise very similar to bagging, with a moderate increase


in variance reduction.

7.4 Boosting
There are two main differences between bagging and Boosting:
• Fitting Target: Boosting fits to the residual rather than the response per
se.
• Fitting Procedure: Boosting builds trees sequentially rather than by si-
multaneous sampling.
The following algorithm describes the boosting method:
• Set fˆ(x) = 0 and ri = yi for all i in the training set.
• For b = 1, 2, ..., B, repeat:
– Fit a tree f̂b with d splits (d + 1 terminal nodes) to the training data
(X, r).
– Update fˆ by adding in a shrunken version of the new tree:
fˆ(x) ← fˆ(x) + λfˆb (x) (7.9)
– Update the residuals,
ri ← ri − λfˆb (xi ) (7.10)
• Output the boosted model,
f̂(x) = Σ_{b=1}^B λ f̂b(x) (7.11)
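The loop above can be sketched in a few lines (a minimal illustration using depth-1 trees from the rpart package; the lab code below uses gbm() instead):

# hedged sketch of the boosting algorithm with stumps; X is a data.frame of
# predictors and y a numeric response
library(rpart)
boost.fit = function(X, y, B=100, lambda=0.01) {
  r = y
  trees = vector("list", B)
  for (b in 1:B) {
    trees[[b]] = rpart(r ~ ., data=data.frame(X, r=r),
                       control=rpart.control(maxdepth=1, cp=0, minsplit=2))
    r = r - lambda * predict(trees[[b]], X)      # update the residuals (Eq. 7.10)
  }
  trees
}
boost.predict = function(trees, X, lambda=0.01) {
  Reduce(`+`, lapply(trees, function(t) lambda * predict(t, X)))   # Eq. 7.11
}
# usage, e.g.: fit = boost.fit(X, y); preds = boost.predict(fit, X)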

The idea behind boosting is to learn slowly and improve f̂ in the areas where it
does not perform well, which are in a way highlighted by the residuals. The shrinkage
parameter λ (typically 0.001 ≤ λ ≤ 0.01) tunes the process by slowing it
down further, which allows a more fine-tuned attack on the residuals with differently
shaped trees. d here is a parameter which controls the depth (usually d = 1
works well, where each subtree is a stump) of the subtree added in each iteration
(at step 2 of the boosting algorithm).

7.5 Lab Code

% Decision Tree

library(tree) % R version 3.2.3


library(ISLR)
attach(Carseats)
High = ifelse(Sales<=8, ‘No’, ‘Yes’)
Carseats = data.frame(Carseats, High)
% create the binary response ‘High’ from ‘Sales ’
% using 8 as the threshold

tree . carseats = tree(High˜.-Sales, Carseats)


% build a DT for ‘High’ using all predictors (leave ‘Sale’ out)

summary(tree.carseats)
% see report of # of terminal nodes,
% residual mean deviance, misclassification error rate ,
% and predictors by importance

plot(tree . carseats )
text(tree . carseats , pretty=0)
% plot tree and print node labels

set .seed(2)
train = sample(1:nrow(Carseats), 200)
Carseats. test = Carseats[-train ,]
% half-half train - test split
High.test = High[-train]
tree . carseats = tree(High˜.-Sales, Carseats, subset=train)
tree .pred = predict(tree . carseats , Carseats. test , type=‘class’ )
table(tree.pred, High.test)
(86+57)/200 ... % = 0.715
% (No˜No + Yes˜Yes) / Total

set.seed(3)
cv. carseats = cv.tree(tree . carseats , FUN=prune.misclass)
% 10-block CV by default
% report 10 trees of different sizes , and
% corresponding misclass rate ($dev)
% 9-node tree has the best

par(mfrow=c(1,2))
plot(cv.carseats$size , cv. carseats$dev, type=‘b’)
plot(cv. carseats$k, cv. carseats$dev, type=‘b’)

% plot size of nodes against error rate
% plot # of CV-folds against error rate

prune.carseats = prune.misclass(tree. carseats , best=9)


plot(prune.carseats)
text(prune.carseats, pretty=0)
% plot the best tree

tree .pred = predict(prune.carseats, Carseats.test, type=‘class’)


table(tree .pred, High.test)
(94+60)/200 ... % = 0.77
% pruned tree performs better!

% Regression Tree

library (MASS)
set .seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree .boston = tree(medv˜.,Boston, subset=train)
summary(tree.boston)
% outputs # of terminal nodes,
% residual mean deviance, distribution of residuals

plot(tree .boston)
text(tree .boston, pretty=0)
% plot tree

cv.boston = cv.tree(tree .boston)


plot(cv.boston$size, cv.boston$dev, type=‘b’)

prune.boston = prune.tree(tree.boston, best=5)


plot(prune.boston)
text(prune.boston, pretty=0)
% prune tree with the best # of nodes

yhat = predict(tree.boston, newdata=Boston[-train,])


boston.test = Boston[-train ,‘ medv’]
plot(yhat, boston.test)
abline (0,1)
mean((yhat-boston.test)ˆ2)
% evaluation

% Bagging, Random Forest

library (randomForest)
set .seed(1)

bag.boston = randomForest(medv˜., data=Boston, subset=train,
mtry=13, importance=TRUE)
% mtry=13: all 13 predictors should be considered
% i.e. bagging (m=p random forest)
% importance: assess importance of predictors

yhat.bag = predict(bag.boston, newdata=Boston[-train,])


plot(yhat.bag, boston.test)
abline (0,1)
mean((yhat.bag-boston.test)ˆ2)
% evaluation
% MSE greatly reduced comparing with DT (reg DT)
% ntree=500 by default

bag.boston = randomForest(medv˜., data=Boston, subset=train,


mtry=13, ntree=25)
mean((yhat.bag-boston.test)ˆ2)
% now MSE increase by a bit
% with much lower computation cost

set .seed(1)
rf .boston = randomForest(medv˜., data=Boston, subset=train,
mtry=6, importance=TRUE)
yhat. rf = predict(rf .boston, newdata=Boston[-train,])
mean((yhat.rf-boston.test)ˆ2)
% MSE lower than bagging
importance(rf.boston)
% outputs MSE impact & impurity reduction
% for each node/predictor
varImpPlot(rf.boston)
% plot importance of nodes/predictors

% Boosting

library (gbm)
set .seed(1)
boost.boston = gbm(medv˜., data=Boston[train,], distribution=‘gaussian’,
n. trees=5000, interaction.depth=4)
summary(boost.boston)
% outputs relative influence statistics

par(mfrow=c(1,2))
plot(boost.boston, i=‘rm’)
plot(boost.boston, i=‘lstat ’ )
% plot important predictors against response

yhat.boost = predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)ˆ2)
% evaluation
% MSE similar to random forest

boost.boston = gbm(medv˜., data=Boston[train,], distribution=‘gaussian’,


n. trees=5000, interaction.depth=4,
shrinkage=0.2, verbose=F)
% boost with slower pace
% shrinkage: lambda tuning
yhat.boost = predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)ˆ2)
% MSE reduces slightly

8 Support Vector Machine
8.1 Maximal Margin Classifier
The essential idea of a Maximal Margin Classifier (MMC) can be explained
using a simplified example: let X1, ..., Xn be a set of p-dimensional data points
(i.e. they are characterized by p predictors/variables), and y1, ..., yn their
corresponding responses. Assuming the data points belong to one of
two classes, if we were able to find a separating hyperplane, which is defined as a flat
affine subspace of dimension p − 1 (for a p-predictor case^{41}), then we would be
able to classify a data point by which side of the hyperplane it lies on.
More concretely, the (p − 1)-dimensional hyperplane for a set of p-dimensional
observations has the following general form:

β0 + β1 X1 + ... + βp Xp = 0 (8.1)

The classification will thus be as follows: Xi belongs to one of the two classes if
β0 + β1 X1 + ... + βp Xp > 0; it belongs to the other class if β0 + β1 X1 + ... + βp Xp <
0. However, note that the MMC requires the hyperplane (i.e. the decision bound-
ary) to be linear. We will consider the generalization of the MMC that handles
non-linear classification tasks later on.

How, then, do we find the best hyperplane out of the infinite number
of possible hyperplanes which can accomplish this division? For any observa-
tion, its margin to a hyperplane is defined as the minimal (i.e. perpendicular)
distance from the observation to the hyperplane. In the MMC, the hyperplane of choice
is naturally defined as the one for which the observations have the largest margin^{42}.

Interestingly, instead of being decided by all observations in the training set, the MMC
hyperplane depends directly on a small set of observations which are equidistant from
it^{43}. These are called support vectors: they are vectors in p-dimensional space, and
they “support” the hyperplane.
More specifically, only the movement of support vectors would change the hy-
perplane, while the movement of other observations would not, unless they are
moved across the “border”. This is demonstrated in Fig. 8.1, where the support
vectors are the two blue and one red observations.

^{41} For instance, a 2-dimensional space can be partitioned into two halves by a 1-dimensional hyperplane, which in this case is a line.
^{42} Note that although the MMC often works well, it can lead to overfitting when p is large.
^{43} Support vectors need not be the closest observations to the hyperplane.

Figure 8.1: Support Vectors

The optimization goals for finding the MMC hyperplane are the following:

maximize_{β0, β1, ..., βp} M (8.2)
subject to Σ_{j=1}^p βj² = 1 (8.3)
yi (β0 + β1 xi1 + ... + βp xip) ≥ M   ∀ i = 1, ..., n (8.4)

While the optimization procedure is beyond the scope of our discussion, we can
describe informally how it works. Under the constraint in Eq. 8.3, the quantity
yi(β0 + β1 xi1 + ... + βp xip) gives the perpendicular distance from the i-th
observation to the hyperplane, positive when the observation is on the correct side.
Requiring this quantity only to be positive would already guarantee correct
classification of the training data; Eq. 8.4 instead requires each observation to be
at least M units away from the hyperplane, leaving a cushion, and maximizing M
(Eq. 8.2) yields the hyperplane with the largest possible margin.
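To connect this formulation with software, the sketch below (not from the text) fits a linear support vector classifier on (nearly) separable toy data with the e1071 package and recovers an estimate of the hyperplane coefficients from the fitted object. The slots fit$coefs, fit$SV and fit$rho are assumptions about e1071's internal representation, in which the decision value of an observation x is x·β − rho; note also that the returned coefficients are not normalized to satisfy Eq. 8.3, since the underlying solver uses an equivalent formulation.

# recover the separating hyperplane from a fitted linear SVM (a sketch)
library(e1071)
set.seed(1)
x = matrix(rnorm(20*2), ncol=2)
y = c(rep(-1,10), rep(1,10))
x[y==1,] = x[y==1,] + 3                         # shift one class so the data separate
dat = data.frame(x=x, y=as.factor(y))
fit = svm(y~., data=dat, kernel='linear', cost=1e5, scale=FALSE)
beta = t(fit$coefs) %*% fit$SV                  # direction of the hyperplane
beta0 = -fit$rho                                # intercept
table(sign(beta0 + x %*% t(beta)), fit$fitted)  # signs match the fitted classes (up to label coding)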

8.2 Support Vector Classifiers


Note that not every data set is as nice as the one in Fig. 8.1, where a separating
(linear) MMC hyperplane exists. When data points of different classes
"mingle" together to some extent, MMC cannot be applied. To solve this problem,
we extend MMC with a so-called soft margin, seeking a hyperplane that
almost separates the classes.

Figure 8.2: MMC Sensitivity

Another problem with MMC is that it is sometimes very sensitive to changes in
individual observations, as shown in Fig. 8.2.

Therefore, we need to relax our model to have


• Greater robustness to individual observations, and
• Better classification of most of the training observations.
In order to achieve this, we introduce so-called slack variables ε1, ..., εn, which allow
some observations to be on the wrong side of the margin or of the hyperplane (hence
soft margin), and modify the optimization problem as follows:

maximize over β0, β1, ..., βp, ε1, ..., εn:   M                                 (8.5)

subject to   Σ_{j=1}^{p} βj² = 1                                                 (8.6)

             yi(β0 + β1 xi1 + ... + βp xip) ≥ M(1 − εi)                          (8.7)

             εi ≥ 0,   Σ_{i=1}^{n} εi ≤ C                                        (8.8)

If the slack variable εi of an observation xi is greater than 0, the observation is on
the wrong side of the margin (and if εi > 1, on the wrong side of the hyperplane).
C is the slack tuning parameter, a budget that determines the number and severity
of margin violations we are willing to tolerate (MMC is the case where C = 0).
C controls the bias-variance trade-off: a small C means a closer fit to the training
data and hence higher variance, while a larger C tolerates more violations and gives
higher bias but lower variance. C is usually chosen through cross-validation.
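In practice (e.g. in the e1071 lab below), software usually exposes a cost argument that penalizes margin violations rather than the budget C above, so a large cost corresponds to a small budget C and vice versa. The sketch below (not part of the lab) simply counts the support vectors at a few arbitrary cost values on simulated, overlapping data to illustrate the trade-off:

# number of support vectors as the cost parameter varies (a sketch)
library(e1071)
set.seed(1)
x = matrix(rnorm(40*2), ncol=2)
y = c(rep(-1,20), rep(1,20))
x[y==1,] = x[y==1,] + 1.5                       # the two classes overlap somewhat
dat = data.frame(x=x, y=as.factor(y))
for (cst in c(0.01, 1, 100)) {
  fit = svm(y~., data=dat, kernel='linear', cost=cst, scale=FALSE)
  cat('cost =', cst, '-> number of support vectors:', length(fit$index), '\n')
}
# small cost: wide margin, many support vectors (large budget C)
# large cost: narrow margin, few support vectors (small budget C)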

Finally, note that because the hyperplane depends only on a handful of support
vectors, the support vector classifier is quite robust to the behavior of observations
far from the hyperplane: changes in those observations do not affect it. This is in
contrast to LDA, where the classification rule depends on the mean of all of the
observations within each class, and on the within-class covariance matrix computed
using all of the observations.

8.3 Support Vector Machine
As flexible as the support vector classifier is, we sometimes still want a
non-linear decision boundary rather than a linear one44.

To allow non-linear decision boundaries, we may add quadratic or cubic (or
higher-order) terms to the predictor set, e.g. X1, X1², ..., Xp, Xp². The
optimization problem then becomes:

maximize over β0, β11, β12, ..., βp1, βp2, ε1, ..., εn:   M                      (8.9)

subject to   yi(β0 + Σ_{j=1}^{p} βj1 xij + Σ_{j=1}^{p} βj2 xij²) ≥ M(1 − εi)     (8.10)

             Σ_{i=1}^{n} εi ≤ C,   εi ≥ 0,   Σ_{j=1}^{p} Σ_{k=1}^{2} βjk² = 1    (8.11)

The problem with this approach is that the enlarged predictor set can quickly become
computationally unmanageable. Without getting into the technical details, we give an
informal but intuitive account of the solution, the Support Vector Machine (SVM),
which is an extension of the support vector classifier in which kernels are used.
The "base" form of the linear support vector classifier can be represented as follows,
where x is a new observation:

f(x) = β0 + Σ_{i=1}^{n} αi ⟨x, xi⟩                                              (8.12)

To estimate β0 and α1, ..., αn, we only need the inner products ⟨xi, xi′⟩ of all pairs
of training observations (i.e. n(n − 1)/2 computations). This is significantly more
efficient than adding a large set of variables to the predictor set. Conveniently, it
turns out that αi = 0 for every training observation that is not a support vector.
Therefore, if S denotes the collection of indices of the support vectors, Eq. 8.12
reduces to the following:

f(x) = β0 + Σ_{i∈S} αi ⟨x, xi⟩                                                  (8.13)

The inner product here can be read as a measure of the similarity between two
observations. Instead of computing this "raw" similarity, we compute a generalized
version of it, the kernel:

K(xi, xi′)                                                                      (8.14)

Essentially, the kernel allows us to conduct non-linear fitting without the burden of
overpopulating the predictor set with new variables. The two most popular kernels
are the polynomial kernel (Eq. 8.15) and the radial kernel (Eq. 8.16)45; the fits they
produce are illustrated in Fig. 8.3.

44 E.g. when class A observations are surrounded by class B observations.

Figure 8.3: Left: Polynomial Kernel; Right: Radial Kernel
K(xi, xi′) = (1 + Σ_{j=1}^{p} xij xi′j)^d                                       (8.15)

K(xi, xi′) = exp(−γ Σ_{j=1}^{p} (xij − xi′j)²)                                  (8.16)

The polynomial kernel (of degree d) is what enables non-linear fitting. The radial
kernel adds a further feature: the more distant (dissimilar) a new observation is from
a training observation, the less effect that training observation has on the prediction
for the new one.

45 γ is a positive number; the greater γ is, the stronger this "long-distance muffling" effect.
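As a quick numerical check (not from the text), the kernels in Eq. 8.15 and 8.16 can be evaluated directly for two small vectors; d = 2 and γ = 1 are arbitrary choices:

# evaluate the polynomial and radial kernels of Eq. 8.15 / 8.16 by hand
poly.kernel   = function(xi, xi2, d)     (1 + sum(xi * xi2))^d
radial.kernel = function(xi, xi2, gamma) exp(-gamma * sum((xi - xi2)^2))
xi  = c(1, 2, 3)
xi2 = c(2, 0, 1)
poly.kernel(xi, xi2, d = 2)          # (1 + 1*2 + 2*0 + 3*1)^2 = 36
radial.kernel(xi, xi2, gamma = 1)    # exp(-((1-2)^2 + (2-0)^2 + (3-1)^2)) = exp(-9)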

In a K-class setting, we have two options for using SVM:

• One-vs-One: Fit a classifier for each of the K(K − 1)/2 pairs of classes, classify
the observation with each of them, and finally take a tally and assign the
observation to the class to which it is most frequently assigned (a small sketch of
this voting step follows the list).
• One-vs-All: Fit K classifiers, each time setting one class against the remaining
K − 1 classes treated as a single class. Finally assign the observation to the class
whose classifier gives the largest value of f(x) (i.e. the classification function).
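A minimal sketch of the one-vs-one tally, using made-up pairwise votes rather than real model output:

# suppose K = 3 classes, so K(K-1)/2 = 3 pairwise classifiers voted as follows
votes = c('A', 'A', 'C')               # hypothetical winners of the pairs (A,B), (A,C), (B,C)
names(which.max(table(votes)))         # majority vote assigns the observation to 'A'

For reference, the svm() function in the e1071 package used in the lab below handles multi-class problems with the one-vs-one scheme.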

8.4 Lab Code

% Support Vector Classifier I (linearly inseparable)

set.seed(1)
x = matrix(rnorm(20*2), ncol=2)
% create 20 2-dimensional data points
y = c(rep(-1,10), rep(1,10))
% labels for the 2 classes
x[y==1,] = x[y==1,] + 1
% shift the y==1 class for some separation
plot(x, col=(3-y))
% check linear separability (the classes are not separable)
dat = data.frame(x=x, y=as.factor(y))
% recode y as a factor

library(e1071)
svmfit = svm(y~., data=dat, kernel='linear',
             cost=10, scale=FALSE)
% kernel='linear': support vector classifier
% cost: "penalty" for margin violations
% scale=FALSE: do not standardize to N(0,1)
plot(svmfit, dat)
% 'o' regular observation
% 'x' support vector
% coloring for class distinction

svmfit = svm(y~., data=dat, kernel='linear',
             cost=0.1, scale=FALSE)
% smaller penalty for margin violations this time
% the margin widens and the number of support vectors increases

set.seed(1)
tune.out = tune(svm, y~., data=dat, kernel='linear',
                ranges=list(cost=c(.001,.01,.1,1,5,10,100)))
% 10-fold CV over a grid of cost values
summary(tune.out)
% reports the best cost parameter
bestmod = tune.out$best.model
summary(bestmod)

xtest = matrix(rnorm(20*2), ncol=2)
ytest = sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1,] = xtest[ytest==1,] + 1
testdat = data.frame(x=xtest, y=as.factor(ytest))
% create a test set in the same fashion as before

ypred = predict(bestmod, testdat)
table(predict=ypred, truth=testdat$y)
% evaluate using bestmod
% only 1 misclassification

% Support Vector Classifier II (linearly separable)

x[y==1,] = x[y==1,] + 0.5
plot(x, col=(y+5)/2, pch=19)
dat = data.frame(x=x, y=as.factor(y))
% shift the classes further apart so they become barely separable

svmfit = svm(y~., data=dat, kernel='linear', cost=1e5)
% very large cost to eliminate training misclassifications,
% at the risk of overfitting
% reducing cost to 1 gives a wider margin and likely a better model

% Support Vector Machine

set.seed(1)
x = matrix(rnorm(200*2), ncol=2)
x[1:100,] = x[1:100,] + 2
x[101:150,] = x[101:150,] - 2
y = c(rep(1,150), rep(2,50))
dat = data.frame(x=x, y=as.factor(y))
% create a data set with a non-linear class boundary
plot(x, col=y)
% plot to check

train = sample(200, 100)
svmfit = svm(y~., data=dat[train,], kernel='radial',
             gamma=1, cost=1)
plot(svmfit, dat[train,])
% gamma: controls the "long-distance muffling" (positive)
% cost: penalty on margin violations

set.seed(1)
tune.out = tune(svm, y~., data=dat[train,], kernel='radial',
                ranges=list(cost=c(.1,1,10,100,1000),
                            gamma=c(.5,1,2,3,4)))
summary(tune.out)
% 10-fold CV for the best cost & gamma

table(true=dat[-train,'y'], pred=predict(tune.out$best.model,
                                         newdata=dat[-train,]))
% evaluate the tuned model on the held-out half

% SVM with Multiple Classes

set.seed(1)
x = rbind(x, matrix(rnorm(50*2), ncol=2))
y = c(y, rep(0,50))
x[y==0,2] = x[y==0,2] + 2
dat = data.frame(x=x, y=as.factor(y))
% add a third class to the data
par(mfrow=c(1,1))
plot(x, col=(y+1))

svmfit = svm(y~., data=dat, kernel='radial',
             cost=10, gamma=1)
plot(svmfit, dat)

9 Unsupervised Learning
We have been mainly focusing on supervised learning, where we have i) a set
of predictors X1, ..., Xp measured on n observations x1, ..., xn, and ii) corresponding
responses y1, ..., yn. The responses (the "correct answers") guide the construction
of our model. When we only have the observations x1, ..., xn on the p predictors,
this guidance is no longer available; unsupervised learning is motivated by this
setting.

9.1 Principal Component Analysis (PCA)


Given p dimensions/predictors of a data set X1 , ..., Xp , the Principal Compo-
nents (PC) are linear combinations of the predictors:
Z = φ1 X1 + ... + φp Xp,   where Σ_{i=1}^{p} φi² = 1                            (9.1)

Each φ is called the loading of its corresponding predictor/variable46. The constraint
that the squared loadings sum to 1 keeps the coefficients from becoming arbitrarily
large, which would otherwise allow the variance of Z to be made arbitrarily large.

In Ch 5.3, we learned that the 1st PC is the dimension that explains the most
variance in a data set47. Combining this with Eq. 9.1, the 1st PC is found by solving
the maximization problem in Eq. 9.2 (assuming each variable has been centered to
have mean zero). This optimization problem is solved using an eigen decomposition.

maximize over φ11, ..., φp1:   (1/n) Σ_{i=1}^{n} ( Σ_{j=1}^{p} φj1 xij )²,   subject to Σ_{j=1}^{p} φj1² = 1     (9.2)

As a linear combination of the original p dimensions, a PC may help in the discovery
of hidden patterns in a data set. For instance, if a PC is mainly loaded by a few of
the dimensions, then the combination of these dimensions may have some meaningful
interpretation. Another way to look at the PCs is as the low-dimensional linear
surface that is closest to the observations.

Further, note that scaling (e.g. standardizing each variable to mean 0 and variance 1)
affects the output of PCA. Scaling is desirable when the predictors are measured in
different units of measurement; it is unnecessary, and potentially damaging, when the
predictors are measured in the same units.
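To make the eigen-decomposition remark concrete, the sketch below (not part of the lab in 9.3) recovers the loadings of a scaled PCA directly from the correlation matrix and compares them with prcomp; the signs may differ, since a PC is only defined up to sign.

# PCA on standardized variables = eigen decomposition of the correlation matrix
eig = eigen(cor(USArrests))                   # USArrests ships with base R
eig$vectors                                   # loadings (columns), up to sign
prcomp(USArrests, scale=TRUE)$rotation        # same loadings from prcomp
eig$values / sum(eig$values)                  # proportion of variance explained by each PC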

PCA is, after all, an approximation to the actual data. To evaluate its performance,
it is therefore informative to know the proportion of variance explained (PVE) for the
data. The PVE of the m-th PC is defined as follows, where the numerator is the
variance explained by that PC and the denominator is the total amount of variance in
the data set:

PVE_m = Σ_{i=1}^{n} ( Σ_{j=1}^{p} φjm xij )²  /  Σ_{i=1}^{n} Σ_{j=1}^{p} xij²     (9.3)

46 The loading of a predictor can be thought of as its "contribution" to the value of an observation on the dimension of a PC (i.e. Z).
47 The direction in which the data vary the most.

Figure 9.1: Scree Plot
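As a sanity check (not part of the lab), Eq. 9.3 can be verified numerically from the PC scores, since the m-th column of prcomp's x component contains Σ_j φjm xij for each observation:

# verify Eq. 9.3 by hand on the standardized USArrests data
pr = prcomp(USArrests, scale=TRUE)
scores = pr$x                                     # n x p matrix of PC scores
colSums(scores^2) / sum(scale(USArrests)^2)       # PVE computed as in Eq. 9.3
pr$sdev^2 / sum(pr$sdev^2)                        # the same numbers via the PC variances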

In order to decide how many PCs to include in a PCA model, we plot the
number of PCs against the (cumulative) PVE for a scree plot (as in Fig. 9.1).
A rule of thumb here is the elbow method: Take the number of PCs where the
scree plot takes a major change of direction. In this example, PC = 2 would
be a good number to pick by the rule. However, keep in mind that the actual
decision should be based on many other factors (e.g. number of PCs preserved
in order to keep a highly faithful approximation while not paying too much
computational cost).

9.2 Clustering
Succinctly stated, clustering methods look to find homogeneous subgroups among
the observations.

9.2.1 K-Means
As its name suggests, K-Means clustering seeks to partition a data set into
K subgroups/classes, where K is chosen empirically; its choice depends on two main
factors:
• Clustering Pattern: How the data cloud naturally partitions into subgroups.
• Granularity Needed: How much granularity a particular task requires.
Fig. 9.2 presents the grouping results on a data set using K = 2, 3, 4.

A K-Means clustering satisfies two properties:

Figure 9.2: K-Means Demo

• C1 ∪ ... ∪ CK = {1, ..., n}. I.e. each observation belongs to at least one of
the K clusters.
• Ci ∩ Cj = ∅ for all i ≠ j. I.e. no observation belongs to more than one
cluster.
An ideal K-Means clustering minimizes the total (i.e. all K classes) within-class
squared Euclidean distance:
 
minimize over C1, ..., CK:   Σ_{k=1}^{K} W(Ck),   where
W(Ck) = (1/|Ck|) Σ_{i,i′ ∈ Ck} Σ_{j=1}^{p} (xij − xi′j)²                        (9.4)

The optimization problem in Eq. 9.4 is solved (approximately) with the following
algorithm; a minimal sketch of these steps in R follows the list.
• Randomly pick K observations as initial cluster representatives, and assign each
observation to the cluster whose representative is closest in Euclidean distance.
• Iterate the following steps until the assignments stop changing:

– Compute the centroid of each cluster, and let the centroid represent the
cluster.
– Reassign each observation to the cluster whose centroid is closest in
Euclidean distance.
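The sketch below is a bare-bones implementation of these two steps on toy data; it is illustrative only (the kmeans() function used in the lab is the practical choice), and it assumes no cluster becomes empty during the iterations.

# a minimal K-means loop: assign to the nearest centroid, then recompute centroids
set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25,] = x[1:25,] + 3                             # two loose groups
K = 2
centroids = x[sample(nrow(x), K), ]                 # random initializers
for (iter in 1:10) {                                # a handful of iterations suffices here
  d = sapply(1:K, function(k) colSums((t(x) - centroids[k,])^2))   # squared distances
  cluster = apply(d, 1, which.min)                  # assignment step
  centroids = apply(x, 2, function(v) tapply(v, cluster, mean))    # centroid update step
}
table(cluster)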

Although the K-Means algorithm is guaranteed to decrease the objective in Eq. 9.4
at every step, it may get trapped in a local optimum, and the clustering result
depends heavily on the selection of the random initializers. It is therefore
recommended to run the procedure several times with different random initializers
and to keep the solution with the smallest total within-cluster sum of squares
(Eq. 9.4).

Figure 9.3: Hierarchical Clustering Demo

9.2.2 Hierarchical Clustering


Hierarchical Clustering (HC) is an alternative to K-Means which is designed
to overcome the latter’s shortcoming that K has to be pre-specified, which is
somewhat subjective.

The basic idea of HC is that the pairs of clusters that are closest to each other are
iteratively fused (how "closest" is measured is discussed below), until all observations
belong to one single cluster. The process of HC is represented with a dendrogram,
illustrated in Fig. 9.3. The number of clusters is chosen by drawing a horizontal cut
at some height on the dendrogram; the height is decided by the same factors as in
the case of K-Means clustering. One attractive property of HC relative to K-Means is
that a single dendrogram can be cut to obtain any number of clusters, without
re-running the algorithm.

Given n observations in a data set, the core (agglomerative) algorithm for HC goes
as follows:
• Compute all pairwise dissimilarities (e.g. Euclidean distance; there are
n(n − 1)/2 pairs), and treat each observation as its own cluster.
• Fuse the two clusters that are least dissimilar.
• Compute new pairwise dissimilarities among the remaining clusters using one of
the following linkage methods, and repeat until a single cluster remains:
– Complete Linkage: The between-cluster distance is the dissimilarity between
the furthest-apart members of the two clusters (maximal intercluster
dissimilarity).
– Single Linkage: Minimal intercluster dissimilarity.
– Average Linkage: Average intercluster dissimilarity.
– Centroid: Clusters are represented by their centroids when measuring
dissimilarity.

Complete and average linkage are generally preferred over single linkage, as they tend
to produce more balanced dendrograms. Centroid linkage also produces balanced
dendrograms, but it suffers from the inversion problem, in which two clusters are
fused at a height below either of the individual clusters in the dendrogram.
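The dissimilarity measure itself is also a modeling choice. Besides Euclidean distance, a correlation-based dissimilarity is common when the shape of an observation's profile matters more than its magnitude. A small sketch (not part of the lab below), using 1 − correlation between observations as the dissimilarity:

# hierarchical clustering with a correlation-based dissimilarity
set.seed(1)
x = matrix(rnorm(30*5), ncol=5)              # 30 observations on 5 made-up variables
dd = as.dist(1 - cor(t(x)))                  # 1 - correlation between observations
plot(hclust(dd, method='complete'),
     main='Complete Linkage, Correlation-Based Distance')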

9.3 Lab Code

% PCA

library(ISLR)
states = row.names(USArrests)
% USArrests: arrest stats in the US across states
apply(USArrests, 2, mean)
% apply(data, margin (1=rows, 2=columns), function)
% compute column means

pr.out = prcomp(USArrests, scale=TRUE)
% pr.out$center, pr.out$scale: means & standard deviations used for scaling
% pr.out$rotation: rotation matrix R of loadings
% X %*% R = coordinates of the observations in the space spanned by the PCs
biplot(pr.out, scale=0)
% show PC1 and PC2's "degree of alignment" with all variables
pr.var = pr.out$sdev^2
% variance explained by each PC
pve = pr.var / sum(pr.var)
% fraction of variance explained by each PC
plot(pve, xlab='PC', ylab='Prop of Var Explained', ylim=c(0,1), type='b')
% scree plot: PC index against PVE
plot(cumsum(pve), xlab='PC', ylab='Cumulative Prop of Var Explained', ylim=c(0,1), type='b')
% cumulative version using cumsum(..)
% cumsum(c(1,2,3)) = c(1,3,6)

% K-means Clustering

set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25,1] = x[1:25,1]+3
x[1:25,2] = x[1:25,2]-4
% create data with two obvious groups

km.out = kmeans(x, 2, nstart=20)
% k=2, 20 sets of random initializers
% km.out$tot.withinss: total within-cluster sum of squares (Euclidean)
% nstart=1 -> 104.33; nstart=20 -> 97.98
km.out$cluster
% class assignments for all observations

plot(x, col=(km.out$cluster+1), pch=20, cex=2)
% plot observations colored by cluster
% if there are more than 2 variables, plot the first two principal components instead

% Hierarchical Clustering

hc.complete = hclust(dist(x), method='complete')
% x: the simulated data generated for k-means above
% if x needs to be scaled, apply scale(x) first
% dist: Euclidean distance matrix
% as.dist(1-cor(t(x))): alternative, correlation-based distance
% method: can also be average/single/...
hc.average = hclust(dist(x), method='average')
hc.single = hclust(dist(x), method='single')

par(mfrow=c(1,3))
plot(hc.complete, cex=.9)
% plots for the other linkages are analogous

cutree(hc.complete, 2)
% cut tree (horizontal) at a height such that
% observations fall into 2 classes

