Introductory Statistical Learning
Su Wang
Winter 2016
Contents

1 Fundamentals of Statistical Learning
   1.1 Basic Idea
   1.2 Methods and Evaluation
   1.3 Special Topic: Bayes Classifier
   1.4 Lab Code

2 Linear Regression
   2.1 Simple LR: Univariate
   2.2 Multiple LR: Multivariate
   2.3 Interaction
   2.4 Non-linear Fit / Polynomial Regression
   2.5 Issues in Linear Regression
   2.6 K-Nearest Neighbor Regression: a Nonparametric Model
   2.7 Lab Code

3 Classification
   3.1 Logistic Regression
   3.2 Linear Discriminant Analysis
   3.3 Lab Code

4 Resampling
   4.1 Cross-Validation
   4.2 Bootstrapping
   4.3 Lab Code

7 Tree-Based Models
   7.1 Decision Trees
      7.1.1 Model of DT
      7.1.2 DT: Pros & Cons
   7.2 Bagging
   7.3 Random Forest
   7.4 Boosting
   7.5 Lab Code

9 Unsupervised Learning
   9.1 Principal Component Analysis (PCA)
   9.2 Clustering
      9.2.1 K-Means
      9.2.2 Hierarchical Clustering
   9.3 Lab Code
1 Fundamentals of Statistical Learning
1.1 Basic Idea
Statistical Learning essentially handles the following type of task: given a set of independent variables/features X, make a prediction with regard to a (set of) dependent variable(s)/response(s) Y, which is a function f of X plus an error term ε, such that:

Y = f(X) + ε    (1.1)
In Fig 1.1, X is “years of education”, Y is “income”, f is the curve in the right
pane.
E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)    (1.2)

The first term is the reducible error, which shrinks as f̂ approximates f more closely; Var(ε) is the irreducible error, which sets a lower bound on the expected prediction error.
• Parametric SL: reduce the problem of estimating f to one of estimating a set of (prior) parameters (e.g. Y = β0 + β1x1 + β2x2).
• Nonparametric SL: operationally similar to the parametric approach, but does not assume a fixed set of prior parameters. It usually takes a large amount of data to arrive at accurate estimates (e.g. K-nearest neighbor regression, covered in section 2.6).
Once the decrease in bias is outpaced by the increase in variance, the MSE rises again. That point is therefore the best balance point for the trade-off between bias and variance.
The following is a graphical example of a Bayes classifier task, where the dashed curve is called the Bayes Decision Boundary: along it, the conditional probability of a data point belonging to class j is 50%.
Figure 1.3: Bayes Classifier
# generate sequences
seq(-1, 1, by = .1)        # from -1 to 1 in steps of .1
seq(-1, 1, length = 10)    # 10 evenly spaced points from -1 to 1

# contour
x = seq(-1, 1, by = .1)
y = x
f = outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
image(x, y, f)                        # heatmap version of contour
persp(x, y, f, theta = 30, phi = 20)  # theta, phi set the horizontal and vertical view angles

# load data
Auto = read.table("Auto.data", header = T, na.strings = "?")
# header: the data has a header row
# na.strings: marker used for empty cells
fix(Auto)   # view the data in a spreadsheet editor
2 Linear Regression
2.1 Simple LR: Univariate
Linear Regression is an approach for predicting a quantitative response Y on
the basis of a single predictor variable X. It assumes that there is approximately
a linear relationship between X and Y . Mathematically, the relationship is
represented as follows:
Y ≈ β0 + β1 X (2.1)
Minimizing the residual sum of squares (least squares) yields the following coefficient estimates:

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²,    β̂0 = ȳ − β̂1 x̄    (2.4)
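As a quick sanity check, the estimates in (2.4) can be computed directly and compared against R's lm(); this is a minimal sketch assuming the Auto data from the ISLR package (horsepower as x, mpg as y):

# least-squares estimates of eq 2.4 computed by hand, then via lm()
library(ISLR)
x <- Auto$horsepower; y <- Auto$mpg
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(mpg ~ horsepower, data = Auto))   # should agree with b0, b1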
So far we have been assuming access to all the data in a population. In practice, however, we usually train our model on samples from the population. The regression lines computed from the samples obviously do not necessarily equal the regression line computed from the entire population. In the right pane of the following graph, the population regression line is the blue solid line, and the remaining regression lines are generated from 10 samples drawn from the population.
Figure 2.1: Regression Lines

For a particular sample S and a population P from which S is drawn, the sample mean µ̂ is most likely different from the population mean µ. However, this difference is not systematic, in the sense that, in the long run (with more samples drawn), the samples overestimating the population mean and those that underestimate it will "even out". The estimate µ̂ of µ, therefore, is unbiased, and the average of a set of sample means will approach µ as the cardinality of the set approaches infinity.
Not all sets of sample means are the same: some vary greatly, some vary to a lesser extent. In estimating µ, it is clearly better for the variance to be as low as possible. This variance, whose square root is called the Standard Error (SE), serves as an important indicator of the quality of the samples and is computed as follows, where σ is the standard deviation of each realization yi of Y, and n is the sample size:

Var(µ̂) = SE(µ̂)² = σ²/n    (2.5)
Roughly speaking, the SE tells us the average amount that an estimate µ̂ differs
from the actual value of µ.
Besides the SE of µ̂, we may also compute the SEs of the intercept β0 and the slope β1, where σ² = Var(ε):

SE(β̂0)² = σ² [ 1/n + x̄² / Σ_{i=1}^{n} (xi − x̄)² ],    SE(β̂1)² = σ² / Σ_{i=1}^{n} (xi − x̄)²    (2.6)
In practice σ is not known, so it is estimated from the data by the residual standard error:

RSE = √( RSS/(n − 2) )    (2.7)

Using the SE of a parameter (e.g. β1), we may compute a Confidence Interval such that the true value of the parameter has a chosen probability of falling within the interval. For example, an approximate 95% confidence interval for β1 is β̂1 ± 2·SE(β̂1); the same computation applies to all parameters.

The SEs are also used for hypothesis testing. To test the null hypothesis that there is no relationship between X and Y (H0: β1 = 0), we compute a t-statistic:

t = (β̂1 − 0) / SE(β̂1)    (2.10)

The fraction of the corresponding t-distribution lying beyond the observed |t| is the probability of seeing an estimate this far from zero in the absence of a real association between predictor and response. This probability is called the p-value; a small p-value leads us to reject the null hypothesis.
Assuming that we have rejected the null hypothesis and believe there is a rela-
tion between an independent variable and a dependent variable, the next thing
that comes to focus is to evaluate the extent to which our model fits the data.
The quality of fit is commonly evaluated using the following indicators:
• Lack of Fit (LOF)

  – LOF = RSE = √( RSS/(n − 2) ) = √( (1/(n − 2)) Σ_{i=1}^{n} (yi − ŷi)² )

• R² Statistic

  – R² = (TSS − RSS)/TSS = 1 − RSS/TSS, where TSS = Σ_i (yi − ȳ)²
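Both quantities can be read directly off a fitted model in R; a hedged sketch, again assuming the Auto data from the ISLR package:

# RSE, R^2 and confidence intervals for a simple linear fit
library(ISLR)
lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)$sigma      # residual standard error (RSE)
summary(lm.fit)$r.squared  # R^2 statistic
confint(lm.fit)            # 95% confidence intervals for beta0 and beta1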
2.2 Multiple LR: Multivariate
If we have multiple independent variables in hand, fitting a regression line using
the simple LR for each of the variables is not entirely satisfactory. For one thing,
it is not clear how to make a single prediction of the dependent variable given
levels of the p independent variables, since each is associated with a separate
regression line. Further, by fitting separately for each variable, we are implicitly
assuming that the effects of the independent variables on the dependent variable
are additive, which is usually not the case, as there may be correlations among the independent variables that cause their effects to overlap. In light of this discussion, we now extend the simple model in (2.1) to:
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂pXp    (2.11)
RSS = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − β̂0 − β̂1xi1 − ... − β̂pxip)²    (2.12)
Figure 2.2: Regression with 2 Variables
Figure 2.3: Correlation Matrix: TV, radio and newspaper as indep. var.; sales
as dep. var.
F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ]    (2.13)

In the equation, the denominator estimates the inherent variance in the data set (of size n), while the numerator measures the average amount of additional variance explained by the predictors. If the predictors have no influence on the response (i.e. β1 = β2 = ... = βp = 0), the two values are expected to be roughly equal, and the F-statistic takes a value close to 1.
Therefore, the greater the F -statistic is, the more significant the predictors are,
although this does not necessarily mean that each predictor is individually sig-
nificant.
Finally, if we are only interested in a subset of the p predictors, say, the last q
predictors (i.e. H0 : βp−q+1 = ... = βp = 0), the F -statistic is now computed
with:
F = [ (RSS0 − RSS)/q ] / [ RSS/(n − p − 1) ]    (2.15)
RSS0 here is the RSS in a model that includes all the predictors except for
the last q. The difference RSS0 − RSS, therefore, is the portion of variance
accounted for by the last q variables, which, when divided by q, indicates the
average amount of variance explained by the last q predictors.
Note that a significant F-statistic only tells us that at least one of the p predictors involved is related to the response; it does not tell us which ones. Testing the predictors individually with t-tests is not a reliable way to find the effective predictors, because some predictors will appear significant in a t-test purely by chance: at a 5% significance level, we expect roughly 5% of the predictors to produce small p-values even when none is truly associated with the response. Raising the significance threshold may remedy the problem to some extent, but it fails when p is large. Instead, we use variable selection procedures such as the following.
• Forward Selection:
– Start with a null-model (0 predictors) to set up a baseline for RSS.
– Add 1 predictor to the model, and try the p predictors one-by-one.
Select the predictor that produces the highest RSS reduction.
– Continue running the previous step until a stopping rule is satisfied
(e.g. RSS threshold).
• Backward Selection:
  – Start with the full model containing all p predictors.
  – Remove the least useful predictor (the one whose removal costs the least in RSS or R²), one at a time, until a stopping rule is satisfied.
Finally, note that while forward selection can be applied when p > n, backward selection cannot (the full model cannot be fit in that case). As forward selection suffers from the problem of including redundant predictors, mixed selection is most recommended in practice.
As far as model evaluation goes, the same statistics are used in the multivariate case:

• RSE
  – RSE = √( RSS/(n − p − 1) )
• R²
  – R² = 1 − RSS/TSS
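To make the multivariate case concrete, here is a hedged sketch; the data frame name adv is a stand-in for the advertising data (TV, radio, newspaper, sales) of Fig 2.3, assuming it has been read in under that name:

# multiple linear regression; summary() reports the F-statistic, RSE and R^2 discussed above
lm.fit <- lm(sales ~ TV + radio + newspaper, data = adv)
summary(lm.fit)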
2.3 Interaction
So far, we have been assuming that the relationship between the predictor and
the response is additive. Specifically, the effect of a predictor X1 on the response
Y is constant, such that a one-unit change in X1 corresponds to a constant change in Y, regardless of changes in another predictor X2. An additive linear model as such can be formulated as follows in mathematical terms:

Y = β0 + β1X1 + β2X2 + ε    (2.16)
In the equation, it is clear that any changes in the impact of X1 on Y is subject
only to the changes in the coefficient β1 , independent of changes associated with
X2. The additive assumption, however, is often violated in practice. For instance, take X1 and X2 to be the two predictors TV and radio advertising expenditure, and Y to be the response sales. If investing in radio advertising increases the efficacy of TV advertising on sales, i.e. the two predictors interact, then the additive assumption is false. To reformulate 2.16 to capture this interaction between the two predictors, we may put down the following:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε    (2.17)
How, then, is the interaction (i.e. the fact that changes in X2 affect the impact of X1 on Y, and vice versa) captured in 2.17? This is made clearer in 2.18 with some algebraic manipulation:
Y = β0 + (β1 + β3X2)X1 + β2X2 + ε = β0 + β̃1X1 + β2X2 + ε    (2.18)
In 2.18, β̃1 = β1 + β3X2. Now it is obvious that the coefficient of X1 changes with X2; specifically, β3 indicates the increase in the effectiveness of X1 on Y for a one-unit increase in X2. To evaluate the "worth" of including the interaction, we may compute (R²_with interaction − R²_without interaction) / (100 − R²_without interaction), which gives the percentage of the residual variance "left behind" by the additive model that is accounted for by the interaction term.
Note that we should always include the main effects (i.e. effect from the additive
model) if the interaction is included, even if the main effects are not significant.
This is referred to as the hierarchical principle. The rationale for the principle
is that i) if X1 × X2 is related to the response, then whether or not X1 or X2 have zero coefficients is of little interest; ii) leaving out X1 and X2 changes the meaning of the interaction term, since X1 × X2 is typically correlated with X1 and X2.
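A hedged sketch of fitting the interaction model of 2.17, again using the stand-in data frame adv for the advertising data:

# additive model (eq 2.16) vs. interaction model (eq 2.17)
lm.add <- lm(sales ~ TV + radio, data = adv)
lm.int <- lm(sales ~ TV * radio, data = adv)   # expands to TV + radio + TV:radio
summary(lm.int)
# proportion of the residual variance left by the additive model that is
# explained by the interaction term (R's r.squared is a proportion, not a percentage)
(summary(lm.int)$r.squared - summary(lm.add)$r.squared) / (1 - summary(lm.add)$r.squared)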
Observing Fig 2.4, we see that horsepower does not effectively predict mpg in a
linear fit. If we were to include a power-2 term of horsepower, however, the pre-
diction becomes much more accurate. However, we should also be cautious in us-
ing nonlinear fit to avoid overfitting. For instance, adding a power-5 horsepower
term may lead to even better in-sample prediction than a power-2 term. The resulting model, however, may generalize poorly to new data sets, as it is overly "customized" to the one training data set.
Importantly, note that this "tweaking" of our original model still gives us a linear model, because the model remains linear in its coefficients, even though the fitted relationship between the predictor and the response is no longer a straight line.
Beyond the choice of functional form, several common issues arise when fitting linear regression models:

• Non-linearity of the predictor-response relationship
• Patterns in the error terms
• Outliers
• High-leverage points
• Collinearity
We now tackle these issues one by one. In Fig 2.5, the left pane is a plot of the residuals against the fitted values of the one-predictor linear model. It is immediately noticeable that the data points systematically deviate from the fit, forming a convex curve. This means that the relationship between the predictor and the response is nonlinear. Based on this observation, we adjust the model by including a power-2 predictor term. With this addition, the data points are now shown (right pane) to be randomly dispersed around the fit, which indicates that the error in the model is random; we therefore know the polynomial regression gives a better fit.
Figure 2.5: Nonlinearity

Sometimes the error terms show a pattern related to the response: in the left pane of Fig 2.6, the spread of the error terms clearly grows with the response (non-constant variance). To eliminate this pattern, we may transform the response by a concave function. In the right pane of Fig 2.6, a log(Y) transformation is applied, and the error terms now scatter randomly around the fitted curve, as desired.
Outliers may distort the fitted curve when their number is relatively large, which indicates poor data quality. Nevertheless, in most cases the number of outliers is small, and we may detect them by plotting the fitted values against the residuals. As it is difficult to choose a magnitude threshold by which to decide whether to flag a data point as an outlier, in practice we often plot studentized residuals instead and remove as outliers the data points beyond a threshold defined in terms of standard deviations. Fig 2.7 gives an example of
outlier detection.
Similar to outliers, data points with high leverage also distort the fitted curve, although the distortion can be much more severe in this case. A data point is said to have high leverage when it has an unusually large predictor value x. The detection of high-leverage points can be done using a leverage statistic, although plotting X against the studentized residuals also applies.
hi = 1/n + (xi − x̄)² / Σ_{i′=1}^{n} (x_{i′} − x̄)²    (2.19)
A large leverage statistic indicates the presence of a high-leverage point. In multivariate settings, the reference value for the leverage statistic is (p + 1)/n.
Finally, collinearity refers to the situation where two or more of the predictors are correlated. Collinearity undermines the power of hypothesis testing, because it inflates the standard errors of the coefficients of correlated predictors and thus masks their individual contributions. To detect collinearity, the VIF statistic (variance inflation factor) is used, defined as follows:

VIF(β̂j) = 1 / (1 − R²_{Xj|X−j})    (2.20)

R²_{Xj|X−j} is the R² from regressing Xj onto all the other predictors. The value of VIF falls between 1 and infinity, where 1 indicates the complete absence of collinearity. When collinearity is detected, we either retain only one of the set of correlated predictors, or combine them into a single predictor.
K-Nearest Neighbor (KNN) regression presents a nonparametric alternative which provides a more flexible fit.
The procedure of KNN regression roughly goes as follows: Given a value for K
and a prediction point x0 , KNN first identifies the K training observations that
are closest to x0 , represented by N0 . The following equation is then used to
compute the average of all the training responses in N0 as an estimate of the
response corresponding to x0 :
f̂(x0) = (1/K) Σ_{xi ∈ N0} yi    (2.21)
As is discussed in chapter 1, the larger K is, the less flexible the model will be.
In general, the optimal value for K will depend on the bias-variance trade-off.
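Eq 2.21 is simple enough to implement directly; a minimal sketch on simulated data (no package assumed):

# KNN regression (eq 2.21): average the responses of the K nearest training points
knn.reg.predict <- function(x, y, x0, K) {
  N0 <- order(abs(x - x0))[1:K]   # indices of the K training points closest to x0
  mean(y[N0])                     # estimate f(x0) by their mean response
}
set.seed(1)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
knn.reg.predict(x, y, x0 = 0.5, K = 5)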
# facilities
library(MASS)
library(ISLR)
library(car)

# leverage statistic
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))   # index of the observation with the largest leverage

# collinearity evaluation
vif(lm.fit)

# polynomial regression
lm.fit = lm(y ~ x1 + I(x1^2), data = <data>)   # power-2 term
# alternatively ...
lm.fit = lm(y ~ poly(x1, 2), data = <data>)

# model comparison
# H0: the two models fit equally well
# H1: the model with more predictors is significantly superior
lm.fit1 = lm(y ~ x1, data = <data>)
lm.fit2 = lm(y ~ x1 + I(x1^2), data = <data>)
anova(lm.fit1, lm.fit2)
3 Classification
A simple example for the classification task is the following: Given medical
symptoms x1 , x2 , and x3 , asking if it is reasonable (or highly probable) to diag-
nose a patient as diabetic. The most commonly used methods in classification
are listed in the following:
• Logistic Regression
• Linear Discriminant Analysis
• K-Nearest Neighbors
There are very good reasons why we do not simply apply linear regression to classification. For one thing, with a categorical response taking 3 or more values, linear regression falsely imposes an ordering with equal spacing among the categories. For instance, if we coded three medical conditions as the integer values 1, 2, and 3, the order in which the conditions are labeled may affect the outcome of the linear regression model, which is intuitively unreasonable. Even if we were to convert the three values into binary indicator variables (i.e. whether a patient has a condition), linear regression would still suffer from an interpretation problem: when the fitted value falls outside the [0,1] interval, it is difficult to interpret the result as a probability (e.g. how likely the patient is to be diabetic).
For this reason, dedicated models for the classification task are necessary. The logistic regression model, for instance, handles the out-of-[0,1]-bound issue gracefully, as illustrated in Fig 3.1.
5. The task is predicting whether an individual will default based on his/her bank balance.
3.1 Logistic Regression
The simplest form of a logistic regression model (univariate) is formulated as
follows, where the fitted value for the response is interpreted as a probability:
Y = p(X) = e^{β0 + β1X} / (1 + e^{β0 + β1X})    (3.1)
As β0 + β1 X can be greater than 1 or less than 0, we use a sigmoid function6
to map it into the [0,1] interval.
To further convince you that 3.1 is a reasonable formulation for the probability
interpretation, note that the equation can be algebraically manipulated to the
3.2:
e^{β0 + β1X} = p(X) / (1 − p(X))    (3.2)
The right hand side of the equation is the odds at which an event takes place.
Observing 3.1, we know that the probability prediction p(X) increases as e^{β0 + β1X} increases. The odds in 3.2 also covary positively with e^{β0 + β1X}, so the probability produced by the logistic model moves in step with the intuitive odds.
We can also show, via 3.3, that 3.1 is indeed a linear model (footnote 7): the predictor is linearly related to the response in that changes in the predictor correspond to changes in the log odds of the response.
log( p(X) / (1 − p(X)) ) = β0 + β1X    (3.3)
In logistic regression, we seek to maximize the likelihood that the predictor(s)
gives correct prediction. Mathematically, we search for the predictor coefficients
(and the intercept) that maximize the value of the following likelihood function.
L(β0, β1) = Π_{i: yi=1} p(xi) · Π_{i′: yi′=0} (1 − p(xi′))    (3.4)
6. It is also variously referred to as the inverse logit function, or the logistic function.
7. log( p(X) / (1 − p(X)) ) is called the logit.
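To see eq 3.1 in action, here is a hedged sketch using the Default data from the ISLR package (the default-vs-balance task of Fig 3.1):

# fit the univariate logistic model and evaluate eq 3.1 by hand at balance = 1500
library(ISLR)
glm.fit <- glm(default ~ balance, data = Default, family = binomial)
b <- coef(glm.fit)
exp(b[1] + b[2] * 1500) / (1 + exp(b[1] + b[2] * 1500))           # eq 3.1
predict(glm.fit, data.frame(balance = 1500), type = "response")   # should agree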
Figure 3.2: Example: Logistic Regression Output
Generalizing 3.1 to the multivariate case, we have the model in 3.5; the other aspects of the model are the same as in the univariate case.

p(X) = e^{β0 + β1X1 + ... + βpXp} / (1 + e^{β0 + β1X1 + ... + βpXp})    (3.5)
Pr(X = x | Y = j) = (1 / (√(2π)·σj)) · exp( −(1/(2σj²)) (x − µj)² )    (3.7)
The equal-variability assumption states that σ1² = ... = σK². The task of our
Bayes classifier, then, is to find the class j for which P r(Y = j|X = x) is largest.
Knowing that the denominator/normalizer in 3.6 is the same for all classes, and
under the assumption of equal-variability, we now put down the probability we
are trying to maximize as follows:
Pr(Y = j | X = x) ∝ Pr(Y = j) · exp( −(1/(2σ²)) (x − µj)² )    (3.8)
To further simplify the maximization task, we first take the log of 3.8 and then drop all the terms that are unrelated to the µs, because in comparing Pr(Y = i|X = x) and Pr(Y = j|X = x), the x value is the same for the two probability evaluations and the σs are assumed equal: x and σ are treated as constants. If we also assume the same prior probabilities across classes, the task is simplified even further (footnote 9). The computation proceeds as follows, where δ(x) is called the Discriminant Function:

δj(x) = x · µj/σ² − µj²/(2σ²)    (3.9)

Setting δi(x) = δj(x), the decision boundary between two classes i and j lies at

x = (µi² − µj²) / (2(µi − µj)) = (µi + µj)/2    (3.10)
In concise terms, an LDA classifier estimates from a data set i) the prior probabilities of the classes (π̂j = Pr(Y = j); footnote 10); ii) the mean of the xs in each class (µ̂j); iii) a weighted average of the variance across classes and observations (σ̂²). These are defined as follows:
9. Mind that the equal-prior assumption is made here only to simplify the algebra for a clearer intuition. It is rarely true in practice.
10. This is for the convenience of reference.
µ̂j = (1/nj) Σ_{i: yi=j} xi    (3.11)

σ̂² = (1/(n − K)) Σ_{j=1}^{K} Σ_{i: yi=j} (xi − µ̂j)²    (3.12)

π̂j = nj / n    (3.13)

δ̂j(x) = log(π̂j) + x · µ̂j/σ̂² − µ̂j²/(2σ̂²)    (3.14)
The classifier finally computes the decision boundary using 3.10. The results are illustrated in Fig 3.3, where the black dashed line is the Bayes decision boundary and the solid line is the LDA decision boundary. The two differ only in that the LDA decision boundary includes the log(π̂j) term.
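The univariate LDA recipe (3.11-3.14) can be coded in a few lines; a minimal sketch on a simulated two-class data set (no real data assumed):

# estimate priors, class means and pooled variance, then evaluate the discriminant
set.seed(1)
d <- data.frame(x = c(rnorm(50, 0), rnorm(50, 2)), y = rep(c("A", "B"), each = 50))
n <- nrow(d); K <- length(unique(d$y))
pi.hat <- table(d$y) / n                         # eq 3.13
mu.hat <- tapply(d$x, d$y, mean)                 # eq 3.11
s2.hat <- sum((d$x - mu.hat[d$y])^2) / (n - K)   # eq 3.12
delta <- function(x0, j)                         # eq 3.14
  log(pi.hat[j]) + x0 * mu.hat[j] / s2.hat - mu.hat[j]^2 / (2 * s2.hat)
sapply(names(mu.hat), function(j) delta(1.0, j)) # classify x0 = 1.0 into the class with the larger delta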
Generalizing to the multivariate case, where there is more than one predictor (i.e. p > 1), we assume that the predictors are drawn from a multivariate normal distribution, which is defined as follows:
f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )    (3.15)
Note that the x and µ here are both p-dimensional vectors, and Σ is a p × p co-
variance matrix. In a multivariate normal distribution, each individual variable
is normally distributed, and their correlations are captured by the covariance
matrix.
With all aspects virtually the same to the univariate case, the discriminant
function of the multivariate LDA is defined as follows:
δ̂j(x) = log(π̂j) + xᵀ Σ⁻¹ µ̂j − (1/2) µ̂jᵀ Σ⁻¹ µ̂j    (3.16)
Fig 3.4 is an example of a multivariate data set with p = 2. The decision
boundaries are computed for each pair of classes.
Dropping the equal-variability assumption, each class gets a separate variability measure (i.e. σj in the univariate case, Σj in the multivariate case). The discriminant function therefore becomes the following:
δ̂j(x) = −(1/2) (x − µ̂j)ᵀ Σj⁻¹ (x − µ̂j) + log(π̂j)
       = −(1/2) xᵀ Σj⁻¹ x + xᵀ Σj⁻¹ µ̂j − (1/2) µ̂jᵀ Σj⁻¹ µ̂j + log(π̂j)    (3.17)
This alternative is called the Quadratic Discriminant Analysis (QDA). It
is so named because in 3.17, x appears as a quadratic function.
With a class-specific Σj, QDA has many more parameters and is therefore much more flexible in fitting, which can potentially cause an overfitting (high variance) problem; LDA, by contrast, is less flexible and is, on the opposite end, prone to underfitting (high bias). To put your finger on the magnitude of QDA's parameter proliferation, consider this: in LDA, all classes share one variability measure (the covariance matrix), the estimation of which involves p(p + 1)/2 parameters; in QDA, this number is Kp(p + 1)/2, because each of the K classes gets a separate covariance matrix.
A general rule of thumb in choosing between LDA and QDA is taking LDA
when the data-to-variable ratio is low, because it is crucial to contain variance
in these cases, and choosing QDA otherwise.
Comparing the classification methods we have covered so far (i.e. logistic re-
gression, LDA, QDA, and KNN), it is clear that logistic regression and LDA are
very similar, and only differ in i) LDA assumes that observations within classes
are normally distributed; ii) fitting procedures. In fact, in terms of log odds,
LDA has the following formulation:
log( p1(x) / (1 − p1(x)) ) = log( p1(x) / p2(x) ) = c0 + c1·x    (3.18)
where c0 and c1 are functions of µ1 , µ2 , σ 2 . This corresponds to the log odds
of logistic regression (cf. 3.3).
As for the comparison between logistic regression & LDA as parametric methods
and KNN as a nonparametric method, see the discussion at the end of section
2.6.
QDA serves as a compromise between the nonparametric KNN and the para-
metric logistic & LDA methods.
# logistic regression
# fitting
glm.fit = glm(y ~ x1 + x2 + ... + xp, data = <data>, family = binomial)

# precision, recall, F-score
precision = 145/(145+457)   # .241
recall = 145/(145+141)   # .507
f = 2 * (precision * recall / (precision + recall))   # .327

# linear discriminant analysis
# fitting
library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit

Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Group means:
             Lag1        Lag2
Down   0.04279022  0.03389409
Up    -0.03954635 -0.03132544

# evaluation
lda.pred = predict(lda.fit, Smarket.2005)
lda.class = lda.pred$class
table(lda.class, Direction.2005)

lda.pred$posterior
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
...
# on the day with index 999: .49 chance Down, .51 chance Up

# quadratic discriminant analysis
# fitting
qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
# other operations same as lda(..)
# the output doesn't include coefficients, because the discriminant is a quadratic function

# k-nearest neighbors
# fitting
library(class)
set.seed(1)   # so that R breaks ties in knn() reproducibly
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
# evaluation
table(knn.pred, Direction.2005)

# data management
# scaling/normalization
vector.norm = scale(vector)
4 Resampling
In statistical learning, we use resampling to maximize the value of a dataset. This is particularly important when acquiring new data is difficult. For instance, suppose that for a dataset D (assuming D to be the only data accessible under resource constraints) we have discovered in an initial modeling exercise that including a quadratic term for a predictor x significantly improves the fit, and we would like a reasonable estimate of how likely it is that this result is not due to chance. One way to do so is to sample m times from D, take each sample as a training set and the rest of the data as a validation set, build the model on each training sample, and test it on the corresponding validation set. If we get similar results in most or all of the sampling/partitioning rounds, we may be assured that the initial result has some validity, on which it is reasonable to set up a working hypothesis.
4.1 Cross-Validation
The simplest validation resampling method is to partition the dataset, by random sampling, into two halves, a training set and a validation set, and to build a model on the training half of each partition. We may then repeat the partitioning-training-estimating process m times, observe the results, and validate the underlying patterns we suspected initially. However, the approach has two crucial drawbacks: i) test errors may be highly variable across partitioning pairs; ii) the error rate (e.g. MSE) tends to be overestimated, because the model is trained on only half of the data.
An alternative is leave-one-out cross-validation (LOOCV): each observation is held out in turn as a one-point validation set, the model is trained on the remaining n − 1 observations, and the n resulting errors are averaged. Fig 4.1 and 4.2 compare the performance of the half-and-half cross-validation and LOOCV.
Alternatively, we may also set the held-out set to be of 1/k size of the original
dataset. The error estimate is thus as follows.
CV(k) = (1/k) Σ_{i=1}^{k} MSEi    (4.2)
This is called a k-fold cross-validation, which is illustrated in Fig 4.3:
Now, a question naturally follows: given that the high computational cost of LOOCV can be avoided by 4.1, why should k-fold CV even be considered? The question is answered from a bias-variance trade-off point of view. In terms of
bias, LOOCV is obviously superior to the k-fold method: the training sets it uses are almost identical to the original dataset, so it has very low bias (i.e. it approximates the model built from the full data set better than k-fold CV does).
each other, they are highly correlated, which induces high variance! That is, the
models generalize poorly on new datasets (cf. section 1.3, the definition of bias
and variance). k-fold CV, therefore, serves as a good compromise between the
half-and-half CV and LOOCV by keeping the bias and the variance reasonably
balanced.
In the classification setting, the CV error is computed from the misclassification rate rather than the MSE:

CV(n) = (1/n) Σ_{i=1}^{n} Erri = (1/n) Σ_{i=1}^{n} I(yi ≠ ŷi)    (4.3)
4.2 Bootstrapping
Relating to the context of resampling, the Central Limit Theorem (CLT)
states that the mean of a sufficiently large number of samples will be approxi-
mately normally distributed, and the “grand mean” of the samples approaches
the true mean of the population from which the samples are drawn when the
number of samples approaches infinity. In real-world problems, most likely we
will not be able to sample repeatedly from a target population, due to practical
constraint (e.g. the cost in an experimental study of cancer patients). In these
cases, we therefore sample instead from the original dataset. This resampling
method is referred to as Bootstrapping.
In the classic portfolio example, an investment is split between two assets A and B, and we seek the allocation fraction α that minimizes the variance of the combined return; the optimal α depends only on the variances and the covariance of the two assets' returns:

α = (σA² − σAB) / (σA² + σB² − 2σAB)    (4.4)
In practice, it is impossible for us to invest, say 1000 times, into the two com-
panies and analyze the returns as the dataset. All we have is a dataset which is
the past investment allocations and returns. Using bootstrapping, however, we
may then sample 1000 times from the dataset we have to get a decent estimate
of the result from actually investing 1000 times (i.e. the target population). Our
estimate of the true α, then, is the mean of all αs computed from the samples:
ᾱ = (1/1000) Σ_{r=1}^{1000} α̂r    (4.5)
To evaluate the extent to which the estimate is accurate, we compute the standard deviation of the α̂s:

SE_1000(α̂) = √( (1/(1000 − 1)) Σ_{r=1}^{1000} (α̂r − ᾱ)² )

According to the CLT, SE(α̂) approaches 0 and ᾱ approaches the true α as the number of samples approaches infinity.
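A hedged sketch of this procedure with the boot package, assuming a data frame named returns with columns A and B holding past returns of the two assets:

# bootstrap estimate of alpha (eq 4.4) and its standard error
library(boot)
alpha.fn <- function(data, index) {
  a <- data$A[index]; b <- data$B[index]
  (var(a) - cov(a, b)) / (var(a) + var(b) - 2 * cov(a, b))   # matches eq 4.4
}
set.seed(1)
boot(returns, alpha.fn, R = 1000)   # 1000 bootstrap replicates: reports the estimate and its SE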
# select the first half-set
library(ISLR)
set.seed(1)   # so that results can be reproduced later
train = sample(392, 196)   # 392 observations in total

# LOOCV
# fit
library(boot)
glm.fit = glm(mpg ~ horsepower, data = Auto)   # same as lm(..) here
# compute MSE
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta
# output: two values
# value 1: MSE
# value 2: adjusted MSE (k-fold case: adjustment for not using LOOCV)

# k-fold CV
# 10-fold demo
set.seed(17)
cv.errs = rep(0, 10)   # the lower computational cost allows more repetitions
for (i in 1:10) {
  glm.fit = glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.errs[i] = cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
plot(cv.errs, type = 'b')

# manual resampling
# the statistic: coefficients of the fit on a bootstrap sample
boot.fn = function(data, index) coef(lm(mpg ~ horsepower, data = data, subset = index))
set.seed(1)
boot.fn(Auto, sample(392, 392, replace = T))
# run this 1000 times, store values, compute mean & sd
# automatic resampling
boot(Auto, boot.fn, 1000)
5 Linear Model Selection & Regularization
Up to now, we have been using least squares to find the fit in linear regression, which, as we showed in chapter 2, makes an effective and intuitively simple approach. The prediction accuracy and the model interpretability of linear models, however, can be further improved by supplementing least squares with some additional procedures, particularly when the conditions required for least squares to function properly (most importantly, having substantially more observations than predictors, footnote 14) do not hold.

In this chapter, we consider three classes of methods (footnote 15) that address the issues arising when these conditions fail: subset selection, shrinkage, and dimension reduction. We begin with best subset selection, which proceeds as follows:
• Set M0 as the null model, which does not contain any predictors.
• For k = 1, 2, ..., p:
14. In fact, there will then be an infinite number of least-squares solutions.
15. These will be discussed later on.
16. For logistic regression, use the deviance, defined as −2 times the maximized log-likelihood, in place of RSS (cf. equation 3.4).
– Fit all (p choose k) models that contain exactly k predictors.
– Select the model that has the lowest RSS or the largest R² (cf. footnote 16), and set it as Mk.
• Compare M0 , ..., Mp , select the model with the lowest cross-validation
error, AIC, BIC, or adjusted R2 .
It should be noted that RSS and R-squared often conflict with AIC and BIC: the former improve monotonically as the number of predictors increases, while the latter typically stop improving, and eventually worsen, beyond some model size. One way to handle the issue is to first screen models with different numbers of predictors to find the number that offers the best "cost performance". Consider Fig 5.1: it is clear in the plots that the improvement to the model from increasing the number of predictors comes to a near halt at p = 3. This can be taken as an indicator that it is not worth the cost to increase p further. A point like this is commonly called the elbow point. After having found the elbow point of p, we drop the models with more predictors from the further evaluation by AIC, BIC, etc.
Rather than examining all combinations of predictors, we can find a good combination with stepwise selection.
First consider forward stepwise selection. While identical to the best subset selection method in the first and the third steps, the forward stepwise selection algorithm adds predictors to the model one by one, always picking the predictor that best improves the model (in terms of RSS or R-squared). Specifically, in each iteration we select the predictor whose addition to the current model produces the largest reduction in RSS or the largest increase in R-squared. For instance, starting from the null model M0, we iterate through all p predictors, find the best one, and set the current model to M1, which has this best predictor as its only predictor. We then search through the remaining p − 1 predictors, find the best one, set the current model to include this predictor together with the predictor in M1, call it M2, and repeat the loop until adding a predictor no longer improves the model by some subjectively set threshold (footnote 20).
Doing the count, it is readily checked that, instead of having to inspect 2^p models, we now only have 1 + p(p + 1)/2 models to process.
Alternatively, we may also start with a full model (i.e. one which includes all
predictors available), and remove predictors one-by-one: the predictor the re-
moval of which leads to the least reduction in R-squared or the least increase
in RSS, and stop when the cost of removal surpasses a threshold. This is called
the backward stepwise selection.
Although the stepwise methods are computationally much less costly than the
best subset selection method, they do not guarantee the finding of the best model
overall. However, in practice, they still select models that predict reasonably
well. As a compromise between the stepwise methods and the best subset selec-
tion method, one may run the forward and the backward stepwise selection in
alternation: once the forward selection reaches its "improvement threshold", switch to the backward selection, run it until its threshold is reached, and switch back. Operating this way for a few rounds, we usually find a better model than the ones found by using forward or backward selection alone. The method is called hybrid stepwise selection.
As has been pointed out, RSS and R-squared cannot serve as the sole criteria for selecting a model, because the model that includes all predictors always has the lowest RSS and the highest R-squared (cf. Fig 5.1). Further, the error estimate they give is a training estimate, which does not guarantee generalization to new datasets. To adjust the training estimate, we therefore need to penalize the additional cost of adding more predictors and the accompanying increase in variance.
20. This could be a threshold value of RSS or R-squared.
The following adjustment methods are commonly used:
Cp = (1/n) (RSS + 2dσ̂²)    (5.1)

AIC = (1/(nσ̂²)) (RSS + 2dσ̂²)    (5.2)

BIC = (1/n) (RSS + log(n)·dσ̂²)    (5.3)
Note that σ̂ 2 is the variance of the errors for all the responses (i.e. the variance
of yi − yˆi ). d is the number of predictors selected for the current model. The
smaller these values are, the better the model is evaluated. Without digressing into the details of these measures, one readily notices that they all add to the RSS a penalty that grows with the number of predictors d, so that larger models are favored only if their reduction in RSS outweighs the penalty.
In practice, the evaluation methods tend to select models with lower test error.
Fig 5.2 is an illustration of the methods in model selection:
The methods above are particularly necessary when an "elbow point" cannot be categorically decided. In addition, we can also penalize the addition of predictors in the R-squared statistic (which is 1 − RSS/TSS), yielding the adjusted R-squared:
Adjusted R² = 1 − ( RSS/(n − d − 1) ) / ( TSS/(n − 1) )    (5.4)
Adjusted R-squared takes into account the “noise” generated by introducing
more predictors, which may be interpreted as additional variance.
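A hedged sketch of comparing model sizes with these criteria, assuming the Hitters data from the ISLR package (the same data used in the lab code below):

# Cp, BIC and adjusted R^2 for the best model of each size found by regsubsets()
library(ISLR); library(leaps)
Hitters <- na.omit(Hitters)
reg.summary <- summary(regsubsets(Salary ~ ., data = Hitters, nvmax = 19))
which.min(reg.summary$cp)      # model size minimizing Cp
which.min(reg.summary$bic)     # model size minimizing BIC
which.max(reg.summary$adjr2)   # model size maximizing adjusted R^2
plot(reg.summary$bic, type = "b", xlab = "# predictors", ylab = "BIC")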
Finally, to mitigate the problem of low-training-error-yet-high-test-error, we al-
ways have the cross-validation procedures introduced in chapter 4.
5.2 Shrinkage
As mentioned earlier, the shrinkage (footnote 21) techniques aim to reduce the variance of a model by shrinking the coefficients of its predictors. The most commonly used shrinkage techniques are ridge regression and the lasso.

The mechanism of ridge regression is intuitively clear: the least squares objective RSS is augmented with a penalty term λ Σj βj² (footnote 22), so every coefficient now carries an extra cost, and the fitted coefficients are pulled (shrunk) towards zero. The λ coefficient of the penalty term is called a tuning parameter (or regularization parameter); it is greater than or equal to 0. It is clear that the larger the tuning parameter is, the greater the extent to which the coefficients shrink.
An important note is that the scale of the predictors may affect the coefficients
in the fitting. Therefore it is advised to standardize the predictors before regu-
larization:
x̃ij = xij / √( (1/n) Σ_{i=1}^{n} (xij − x̄j)² )    (5.6)
The choice of the tuning parameter is crucial to the performance of the fit. Specifically, while regularization reduces the variance of a model, it meanwhile increases its bias by reducing the flexibility of the fit: the predictors make less "contribution" to the prediction under regularization, so their ability to "steer the fit towards the observations" weakens, which increases bias. The balance point of the bias-variance trade-off can be found by plotting the
21. Shrinkage is also popularly known as regularization in the statistics and machine learning community.
22. Note that by convention we do not shrink the intercept β0, which is simply the mean value of the response when all the predictors are "disabled" (i.e. x1 = ... = xp = 0).
tuning parameter λ against the MSE (cf. section 1.2), as is demonstrated in Fig 5.3:
Figure 5.3: Ridge Regression in the Bias-Variance Trade-off (green: variance; black: bias)
In practice, ridge regression works best when the least squares estimates have high variance (e.g. when the p/n ratio is high, or when the predictor-response relationship is close to linear), and it preserves all predictors while containing the concomitant "harm" of high variance, at a much lower computational cost than subset selection.
Fig 5.4 illustrates why lasso, but not ridge regression, allows some coefficients
to be shrunk to 0. The red ellipses in the graphs are contours of RSS, and
the green areas are the constraint functions for lasso and ridge regression (left
and right graph respectively)23 . The first contact point of an RSS contour and
a regularization constraint function is where the optimization βs are decided.
23. Note that, when the magnitude of regularization increases (i.e. λ increases), the constraint area shrinks.
Obviously, in the case of ridge regression, the first contact can never take place
when β1 = 0, whereas this is possible in the case of lasso.
Finally, as a comparison of ridge regression and the lasso: ridge regression shrinks every coefficient by roughly the same proportion, whereas the lasso shrinks each coefficient towards zero by a roughly constant amount, so that coefficients smaller than that amount are set exactly to zero.
Figure 5.5: Underlying Dimensions
Dimension reduction methods work with M < p new predictors Z1, ..., ZM, each a linear combination of the original predictors:

Zm = Σ_{j=1}^{p} φjm Xj    (5.8)

yi = θ0 + Σ_{m=1}^{M} θm zim + εi,   i = 1, ..., n    (5.9)
The coefficients of the original predictors (i.e. βs) can be reconstructed using
the coefficients of the new predictors (i.e. θs) and the linear coefficients φs as
follows:
Σ_{m=1}^{M} θm zim = Σ_{m=1}^{M} θm Σ_{j=1}^{p} φjm xij = Σ_{j=1}^{p} ( Σ_{m=1}^{M} θm φjm ) xij = Σ_{j=1}^{p} βj xij    (5.10)

βj = Σ_{m=1}^{M} θm φjm    (5.11)
We start with the principal component analysis. Continuing using Fig 5.5 for
example, PCA aims to find the new dimensions corresponding to the green and
the blue axes. The green axis is aligned in the direction in which the obser-
vations vary the most, and the blue axis accounts for the rest of the variance
which is not accounted for by the green axis25 . The green and the blue axes are
25. Here, we are assuming for simplicity that the observations only vary in a two-dimensional space.
called the 1st principal component and the 2nd principal component.

The PCA algorithm first finds the 1st PC, and then the 2nd PC, which is geometrically orthogonal to the 1st PC (footnote 26). Therefore, the 1st PC can be seen as an "anchor" for finding the rest of the principal components. In addition to lying along the most-varied dimension of the observations, the 1st PC has the following properties:
• The perpendicular distances between the 1st PC and the observations are
minimized.
• The 0 point on the 1st PC is the intersection of the average lines of the
observations along the original dimensions27 .
It should be pointed out that the number of principal components is best de-
cided based on results from cross-validation.
PCA suffers from a major drawback: It is entirely unsupervised, in that the “di-
rections” of PCs (i.e. linear combination of predictors vectors) are not guided
by the response. There is no guarantee that the directions that best explain the
predictors will also be the best directions to use for predicting the response. To
26. It is easy to see why this makes the 2nd PC account for the variance not accounted for by the 1st PC: the variation of the observations along the 2nd PC is unrelated to their values on the 1st PC.
27. This implies that if an observation has a 1st-PC value greater than 0, then it has above-average values on the original dimensions.
handle this issue when quality training data is available, we use instead its close cousin, partial least squares (PLS), which is in large part identical to PCA and differs only in that it guides the model building with the correlation between the predictors and the response.

Similar to PCA, where the PCs are linear combinations of the original dimensions (i.e. predictors, cf. equation 5.8), PLS also has these "PCs", only the PCs of PLS have different coefficients. Specifically, φj1 (the coefficient of the j-th predictor in the 1st PC) is set equal to the coefficient from the simple linear regression Y ∼ Xj. Therefore the coefficients, although they do not always point the PCs in the directions that explain the maximal amount of variance, do link them to the response in a supervised way. Fig 5.7 gives a demonstration, comparing PCA's 1st PC side-by-side with PLS's.
To find the second PC, we first adjust each of the variables for Z1, by regressing each variable on Z1 (i.e. Xj ∼ Z1) and taking the residuals. These residuals can
be interpreted as the remaining information that has not been explained by the
Z1 . We then compute Z2 using this orthogonalized data in exactly the same
fashion as Z1 was computed based on the original data29 .
Figure 5.8: Overfitting in High Dimensional Spaces
The major problem with using a least squares fit for such data is that the least squares regression line is too flexible and will overfit the data, even when a sizable number of irrelevant predictors are involved. Fig 5.8 demonstrates the resulting changes in R², training MSE and test MSE.
It is particularly worth noting that many of the previously mentioned evaluation measures are not appropriate in a high-dimensional setting. These include the training squared errors, p-values, Cp, AIC, BIC and (adjusted) R². With high-dimensional data, evaluation based on these measures invariably paints a rosy picture even when the predictors may be utterly useless for making predictions on new data.
To combat overfitting, less flexible alternatives to plain least squares can be adopted, such as forward stepwise selection, ridge regression, the lasso, and principal components regression. However, note that the fine-tuning of the model still plays a significant role in striking the balance in the bias-variance tradeoff. That is, heavy regularization (footnote 30), for instance, does not always lead to a better model.
30. In the case of predictor subsetting and dimension reduction, the tuning is on the number of predictors.
# Example: Hitters data set from ISLR
# use various predictors to predict the salary of players

# subsetting
library(leaps)
regfit.full = regsubsets(Salary ~ ., Hitters)
# summary() on this outputs the best 1- to 8-predictor models;
# the selected predictors are marked with '*'s
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
# the limit of 8 is the default;
# to change it, include the argument nvmax

# Model Selection
# build the model on the training set
regfit.best = regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19)

# cross-validation
k = 10
set.seed(1)
folds = sample(1:k, nrow(Hitters), replace = TRUE)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
# the above allocates the 263 observations in the data set
# randomly into 10 folds
for (j in 1:k) {
  best.fit = regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19)
  for (i in 1:19) {
    # note: the ISLR lab defines a predict() method for regsubsets objects
    pred = predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] = mean((Hitters$Salary[folds == j] - pred)^2)
  }
}
# the above does the following:
# for each fold (as a test set), build the model
# on the remaining folds, and test on the left-out fold;
# in each testing round, try the i-predictor model
# where i = 1,...,19.
# result: a 10x19 matrix where
#   rows = test folds
#   columns = i-predictor models

mean.cv.errors = apply(cv.errors, 2, mean)
# this computes column means;
# each result is the mean MSE of an i-predictor model
# across all k CV testings
which.min(mean.cv.errors)
# this gives the best model's # of predictors
# (11 in this case)

# Regularization
# load package
library(glmnet)
# glmnet()'s alpha argument decides the type of regularization:
#   alpha = 0 for ridge regression
#   alpha = 1 for the lasso
# x and y below are the model matrix and response, e.g.
#   x = model.matrix(Salary ~ ., Hitters)[, -1]; y = Hitters$Salary
# grid is a grid of candidate lambda values, e.g. grid = 10^seq(10, -2, length = 100)
# ridge regression II (cross-validation)
set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
# split the data into training and test halves:
# each row of the data matrix corresponds to 1 observation;
# we randomly select 131 (i.e. nrow(x)/2) players for training
test = (-train)
# the other half is put in the test set
y.test = y[test]
# actual response values in the test set
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
# regularized fit
ridge.pred = predict(ridge.mod, s = 4, newx = x[test, ])
# predict on the test set, with lambda = 4
mean((ridge.pred - y.test)^2)
# MSE (101037 in this case)

# lasso
lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
# fit the model on the training set
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)
# cross-validation
bestlam = cv.out$lambda.min
# find the best lambda
lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)
# MSE with the best lambda
out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = 'coefficients', s = bestlam)[1:20, ]
# coefficients with the best lambda;
# note that some of the coefs = 0:
# this shows that the lasso, unlike ridge regression,
# does do predictor selection

# PCR
# utilities
library(pls)
# fit
set.seed(2)
pcr.fit = pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = 'CV')
# scale=TRUE standardizes each predictor
# validation='CV' does 10-fold cross-validation
# pcr() reports the root MSE, which = sqrt(MSE)
# evaluation
validationplot(pcr.fit, val.type = 'MSEP')
# plots the # of components against the MSE

# PLS
# fit
pls.fit = plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
# evaluation
validationplot(pls.fit, val.type = 'MSEP')
# evaluating on the training set
pls.pred = predict(pls.fit, x[test, ], ncomp = 2)
mean((pls.pred - y.test)^2)
# evaluating on the test set
6 Preliminaries on Nonlinear Methods
When the actual relationship between the predictors and the response is not
one of linear nature, linear models are limited in their power to balance bias-
variance tradeoff while making decent predictions on new data. Although the
envelope can be pushed slightly further by using regularization and dimension
reduction, the performance of the models does not improve by much.
In this chapter, we consider a set of extensions of the linear models covered pre-
viously. The extensions are engineered such that they model nonlinear behaviors
and interactions among variables/predictors.
yi = β0 + β1xi + β2xi² + ... + βdxi^d + εi    (6.1)
Suppose we have an age variable (xi) with which we would like to predict the wage (f̂(xi)) of the corresponding persons. The following is the degree-4 polynomial regression model for the task. The regression curve is the solid line in the left pane of Fig. 6.1.
f̂(xi) = β̂0 + β̂1xi + β̂2xi² + β̂3xi³ + β̂4xi⁴    (6.2)
If we say that $250,000 is the threshold that divides the observations/persons
into a high income and a low income group, we may further build a polynomial
logistic regression model for the classification task:
P̂r(f(xi) > 250,000 | xi) = e^{β̂0 + β̂1xi + ... + β̂4xi⁴} / (1 + e^{β̂0 + β̂1xi + ... + β̂4xi⁴})    (6.3)
Now if we can compute variance of fˆ(xi ), we will be able to compute its standard
deviation31 , and thus plot a confidence interval around the regression curve as
described with dashed lines in Fig. 6.1. The dashed lines are generated by
plotting twice the standard deviation on either side of the curve (i.e. 95%
confidence interval, fˆ(xi ) ± 2σ).
31. The computation is as follows: Var[f̂(xi)] = liᵀ Ĉ li, where Ĉ is the 5 × 5 covariance matrix of the β̂j and li = (1, xi, xi², xi³, xi⁴). The standard deviation is the square root of this variance.
Figure 6.1: Polynomial Regression (left); Polynomial Logistic Regression (right)
To break free of the restraining force of the global structure, we may instead use a step function. Basically, we partition the range of the predictor X into bins with cut points c1, c2, ..., cK, and fit a different constant in each bin. This converts a continuous variable into an ordered categorical variable: depending on which bin X's value falls into, we obtain K + 1 new indicator variables C0(X), C1(X), ..., CK(X). Under this partition, the linear model uses C1(X), ..., CK(X) as predictors instead, and takes the following form for a particular observation xi:
yi = β0 + β1C1(xi) + β2C2(xi) + ... + βKCK(xi) + εi
Figure 6.2: Linear Model with Step Function
Figure 6.3: Unconstrained Piecewise & Spline Constraints
The top-left pane in Fig 6.3 presents a simple demo of how the piecewise poly-
nomial regression works. Basically, the data is cut in two halves for two fittings
which share the same predictors.
Apparently the discontinuity at age = 50 is somewhat unnatural. To handle this, we may impose a constraint on the model requiring the fitted curve to be continuous; this results in the fit in the top-right pane of Fig 6.3. The continuity constraint can be further extended to continuity of the k-th derivative (footnote 35), so that the fitted curve becomes even smoother, as shown in the bottom-left pane. Finally, rather than fitting a polynomial curve in the interval
35. Note that k ≤ d, where d is the highest degree of the polynomial in the model.
of each partition, we may fit a straight line too. With the continuity constraint,
we have the fitting in the bottom-right pane. This is called a linear spline.
Here is how the continuity constraint is imposed. We may use a basis model to
represent our regression spline. Eq. 6.7 is then written in the form of a basis
model:
yi = β0 + β1b1(xi) + β2b2(xi) + β3b3(xi) + εi    (6.8)
Alternatively, we may add to the model in Eq. 6.7 a truncated power basis function for each knot ξ:
h(x, ξ) = (x − ξ)³₊ = (x − ξ)³ if x > ξ, and 0 otherwise    (6.9)
Specifically, adding a term of the form β4h(x, ξ) to Eq. 6.7 for a cubic polynomial leads to a discontinuity in only the third derivative at ξ; the function itself, as well as its first and second derivatives, remains continuous at each of the knots.
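A minimal sketch of building the truncated power basis of eq 6.9 by hand, assuming a single knot at 50 (e.g. for the age variable):

# design matrix of a cubic spline with one knot: 1, x, x^2, x^3 and h(x, 50)
h <- function(x, xi) pmax(x - xi, 0)^3   # (x - xi)^3 for x > xi, 0 otherwise
x <- seq(20, 80, by = 1)
X <- cbind(1, x, x^2, x^3, h(x, 50))
dim(X)   # n x 5; regressing y on these columns fits a cubic spline with a knot at 50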
Finally, as splines are relatively unconstrained in the boundary intervals, we may further add a boundary constraint, which requires the function to be linear at the boundary, for more stable estimates at the boundaries. This is called a natural spline.
While we cannot be sure at the outset which approach will fare better (or how many partitions to use, in the case of evenly-spaced knot placement), we may always resort to cross-validation to find the best knot placement. With this method, we fit a spline on the training set and make predictions with it on the held-out set. The performance of the spline is then evaluated by computing the overall cross-validated RSS.
Figure 6.4: Natural Splines vs. Polynomial Regression (df = 15)
A smoothing spline essentially minimizes RSS = Σ_{i=1}^{n} (yi − g(xi))² over functions g(x) that produce an estimate for the corresponding y. From chapter 5, we learned that when the number of predictors is high (and especially when the observation-to-predictor ratio is low), regression models tend to overfit the training set. Here we have the same problem with smoothing splines, because each data point is effectively a "predictor". Graphically, an overfitted regression curve has high overall slope variation, i.e. its first derivative changes frequently and by large amounts, which is reflected in a large second derivative. To moderate this, we add a penalty to our optimization goal (i.e. RSS) that also penalizes slope variation, via the second derivative:
Σ_{i=1}^{n} (yi − g(xi))² + λ ∫ g″(t)² dt    (6.10)
ĝ here is an n-vector containing the fitted values of the smoothing spline at the training points x1, ..., xn. According to Eq. 6.11, this vector can be written as ĝλ = Sλ y, where Sλ is an n × n transformation ("smoother") matrix determined by λ and the xi.
Figure 6.5: Local Regression
To choose λ, we may use LOOCV: compute the following RSScv(λ) and select the λ for which the value is minimized. The computation makes use of the transformation matrix {Sλ} and is remarkably efficient.
RSScv(λ) = Σ_{i=1}^{n} ( yi − ĝλ^{(−i)}(xi) )² = Σ_{i=1}^{n} [ ( yi − ĝλ(xi) ) / ( 1 − {Sλ}ii ) ]²    (6.13)
In general, we would like to have a model with fewer degrees of freedom (i.e. fewer free parameters, and hence a simpler model).
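In R, smooth.spline() fits a smoothing spline and, with cv = TRUE, selects λ by the LOOCV criterion of eq 6.13; a hedged sketch on the Wage data from the ISLR package (used again in the lab code below):

# smoothing spline with lambda chosen by LOOCV
library(ISLR)
fit.ss <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
fit.ss$df       # effective degrees of freedom selected
fit.ss$lambda   # the chosen smoothing parameter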
• The fitted value at x0 is given by fˆ(x0 ) = β̂0 + β̂1 x0
Finally, note that local regression suffers from the same "neighbor sparsity" problem as the K-nearest neighbor approach in high dimensions. Recall that, in high dimensions, it is very difficult to find a set of neighbors close to the target data point.
library(ISLR)
attach(Wage)
fit = lm(wage ~ poly(age, 4), data = Wage)   # degree-4 polynomial fit (cf. eq 6.2)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
# grid: 18,19,...,90
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
# make predictions
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# standard error band at 2 se
# make prediction (on the logit scale, for the fit defined above)
# alternative:
# preds = predict(fit, newdata=list(age=age.grid),
#                 type='response', se=T)
pfit = exp(preds$fit) / (1+exp(preds$fit))
# convert from the logit scale to probability estimates
se.bands.logit = cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
se.bands = exp(se.bands.logit) / (1+exp(se.bands.logit))
# standard error band at 2 se, also on the probability scale
table(cut(age,4))
# counts in 4 age buckets (the cut points for a step function)
fit = lm(wage~cut(age,4), data=Wage)
# piecewise-constant (partitioned) fit
coef(summary(fit))
library(splines)
fit = lm(wage~bs(age,knots=c(25,40,60)), data=Wage)
# fit a cubic regression spline
# bs(): generate the matrix of basis functions for the specified knots
pred = predict(fit, newdata=list(age=age.grid), se=T)
# make prediction
plot(age, wage, col='gray')
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred$fit+2*pred$se, lty='dashed')
lines(age.grid, pred$fit-2*pred$se, lty='dashed')
dim(bs(age, knots=c(25,40,60)))
# two ways to check df
attr(bs(age,df=6), 'knots')
# knots placed at uniform quantiles of age
lines(age.grid, pred2$fit, col='red', lwd=2)
# (pred2: predictions from a fit defined above)

# Local Regression

# GAM
library(gam)
gam.m3 = gam(wage~s(year,4)+s(age,5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=T, col='blue')
# 3 plots for 3 predictors
# each shows the respective predictor's fit to the response
table(education, I(wage>250))
# cross-tabulate education with high earners (wage > 250)
Figure 7.1: Decision Tree Demo
7 Tree-Based Models
7.1 Decision Trees
7.1.1 Model of DT
In a typical Decision Tree task, we have n observations x1, ..., xn and p predictors/
parameters, and we would like to compute an estimate ŷi for each response yi.
Graphically, the following example illustrates how the predictors years and hits are
used to predict a baseball player's salary (represented as log salary to obtain a more
bell-shaped distribution).
In the example, each of the two predictors is split into two regions at a dividing
point chosen to minimize the RSS (coming up soon). The tree can also be represented
with a graph of decision regions, as in Fig 7.2.
Having the basic setup of a decision tree task in mind, we now formulate the
prediction-making and optimization goal of a decision tree.
• Prediction
– Given a set of possible values of observations X1 , ..., Xp character-
ized by p predictors, partition the values into J distinct and non-
overlapping regions R1 , ..., RJ .
– For every observation xi in region Rj , the prediction/estimate for its
corresponding ŷi is the mean of the response values yi which are in
Rj .
Figure 7.2: Decision Tree as Decision Regions
• Optimization Goal
$$ \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \qquad (7.1) $$
(Note: it is clear that the more predictors we use, the lower the RSS will be on the
training set; this, however, risks overfitting our model. A small sketch of the
RSS-minimizing split search is given below.)
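The following is a minimal sketch (with hypothetical simulated data) of the split
search implied by Eq. 7.1: for a single predictor, scan candidate cutpoints and keep
the one that minimizes the two-region RSS.

best.split <- function(x, y) {
  cuts <- sort(unique(x))[-1]                       # candidate cutpoints (skip the minimum)
  rss <- sapply(cuts, function(s) {
    left <- y[x < s]; right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(rss)]                              # cutpoint minimizing the two-region RSS
}
set.seed(1)
x <- runif(100); y <- ifelse(x < 0.4, 1, 3) + rnorm(100, sd = 0.2)
best.split(x, y)                                    # recovers a cutpoint near 0.4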
The sequence selection can be carried out with some variation of the forward/backward/
hybrid selection procedures (cf. Ch 5.1), which will not be elaborated here. To guard
against overfitting, each tree is also subject to cross-validation, where the MSE is
computed to evaluate a particular tree's performance.
A tree used in a classification task differs from a regression tree in both the way in
which predictions are made and the optimization goal.
• Prediction
Each observation goes to the most commonly occurring class of training
observations in a decision region.
• Optimization Goals (where p̂mk is the proportion of training observations in region m
  that belong to class k; a small numerical sketch follows this list)
  – Classification Error Rate
    $$ E = 1 - \max_k(\hat{p}_{mk}) \qquad (7.5) $$
  – Gini Index
    $$ G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \qquad (7.6) $$
  – Cross-Entropy
    $$ D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \qquad (7.7) $$
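A minimal numerical sketch of Eqs. 7.5 to 7.7, for a hypothetical vector of class
proportions in one region:

impurity <- function(p.hat) {                          # p.hat: class proportions in region m
  c(error   = 1 - max(p.hat),                          # Eq. 7.5
    gini    = sum(p.hat * (1 - p.hat)),                # Eq. 7.6
    entropy = -sum(p.hat * log(p.hat), na.rm = TRUE))  # Eq. 7.7 (0 * log 0 treated as 0)
}
impurity(c(0.7, 0.2, 0.1))
impurity(c(0.5, 0.5))                                  # maximal impurity for two classes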
7.1.2 DT: Pros & Cons
• Pros of DT
– Highly interpretable, even for non-statisticians.
– Cognitively (presumably) closer to the process of human decision-
making.
– Efficiently handle qualitative predictors w/o the need to create dummy
variables.
• Cons of DT
– Relatively low predictive accuracy, compared with other regression
and classification models.
7.2 Bagging
Bagging is a model-building method that aims at reducing variance. Basically, we would
like to build n models on n training sets and obtain the prediction as the average of
the predictions from the n models. If each model's prediction has variance σ², the
variance of the averaged prediction will be σ²/n ≤ σ² (assuming the models' errors are
roughly independent). In practice, obtaining n independent training sets is often
impractical, so we instead use bootstrapping to generate B different training sets by
sampling from the one training set we have. The prediction is then made as follows
(a minimal sketch is given below):
$$ \hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \qquad (7.8) $$
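A minimal sketch of Eq. 7.8 by hand on the Boston data (the lab code below obtains the
same effect with randomForest(..., mtry = p)); B = 100 is an arbitrary choice:

library(MASS)    # Boston data
library(tree)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
B <- 100                                              # number of bootstrapped trees
preds <- sapply(1:B, function(b) {
  boot.idx <- sample(train, replace = TRUE)           # bootstrap sample of the training set
  fit <- tree(medv ~ ., data = Boston[boot.idx, ])    # tree on the b-th bootstrap sample
  predict(fit, newdata = Boston[-train, ])            # its predictions on the test set
})
yhat.bag <- rowMeans(preds)                           # Eq. 7.8: average over the B trees
mean((yhat.bag - Boston[-train, "medv"])^2)           # test MSE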
7.3 Random Forest
A serious problem that prevents bagging from effectively reducing the variance is that
the bagged trees are highly correlated when there are a few strong predictors in the
set of all predictors — the trees built from the B bootstrapped training sets then end
up very similar to each other. Random Forest overcomes this problem by decorrelating
the bagged trees with a clever splitting technique: at each split, m predictors are
sampled from all p predictors (empirically m ≈ √p gives good performance), and the
split is allowed to use only one of these m predictors.
7.4 Boosting
There are two main differences between bagging and Boosting:
• Fitting Target: Boosting fits to the residual rather than the response per
se.
• Fitting Procedure: Boosting builds trees sequentially rather than by si-
multaneous sampling.
The following algorithm describes the boosting method:
• Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all i in the training set.
• For b = 1, 2, ..., B, repeat:
  – Fit a tree $\hat{f}^b$ with d splits (d + 1 terminal nodes) to the training data (X, r).
  – Update $\hat{f}$ by adding in a shrunken version of the new tree:
    $$ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x) \qquad (7.9) $$
  – Update the residuals:
    $$ r_i \leftarrow r_i - \lambda \hat{f}^b(x_i) \qquad (7.10) $$
• Output the boosted model:
  $$ \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x) \qquad (7.11) $$
The idea behind boosting is to learn slowly and improve $\hat{f}$ in the areas where it
does not perform well, which the residuals highlight. The shrinkage parameter λ
(typically 0.001 ≤ λ ≤ 0.01) tunes the process by slowing it down further, allowing a
more finely tuned attack on the residuals with differently shaped trees. d is a depth
parameter which controls the depth of the subtree added at each iteration (in step 2
of the boosting algorithm); usually d = 1, where each subtree is a stump, works well.
A minimal sketch of the loop is given below.
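A minimal sketch of the algorithm above, using rpart stumps (d = 1) fit to the
residuals with shrinkage λ; gbm() in the lab code below is the standard implementation,
and B and λ here are arbitrary illustrative values.

library(MASS)     # Boston data
library(rpart)    # stumps via maxdepth = 1
set.seed(1)
B <- 1000; lambda <- 0.01
r <- Boston$medv                          # residuals start as the raw response
yhat <- rep(0, nrow(Boston))              # f.hat(x) = 0
for (b in 1:B) {
  fit.b <- rpart(r ~ . - medv, data = Boston,
                 control = rpart.control(maxdepth = 1))  # stump fit to current residuals
  yhat <- yhat + lambda * predict(fit.b)                  # Eq. 7.9: shrunken update of f.hat
  r    <- r - lambda * predict(fit.b)                     # Eq. 7.10: update residuals
}
mean((yhat - Boston$medv)^2)              # training MSE of the boosted model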
7.5 Lab Code
# Decision Tree (classification, Carseats data)
library(ISLR)
library(tree)
# (assumed setup: binary response High and full-data tree, as in the standard ISLR lab)
High = as.factor(ifelse(Carseats$Sales <= 8, 'No', 'Yes'))
Carseats = data.frame(Carseats, High)
tree.carseats = tree(High~.-Sales, Carseats)
summary(tree.carseats)
# see report of # of terminal nodes,
# residual mean deviance, misclassification error rate,
# and predictors by importance
plot(tree.carseats)
text(tree.carseats, pretty=0)
# plot tree and print node labels
set.seed(2)
train = sample(1:nrow(Carseats), 200)
Carseats.test = Carseats[-train,]
# half-half train-test split
High.test = High[-train]
tree.carseats = tree(High~.-Sales, Carseats, subset=train)
tree.pred = predict(tree.carseats, Carseats.test, type='class')
table(tree.pred, High.test)
(86+57)/200  # = 0.715
# accuracy = (No~No + Yes~Yes) / Total
set.seed(3)
cv.carseats = cv.tree(tree.carseats, FUN=prune.misclass)
# 10-fold CV by default
# reports trees of different sizes and
# the corresponding CV misclassification counts ($dev)
# the 9-node tree has the best (lowest) CV error
par(mfrow=c(1,2))
plot(cv.carseats$size, cv.carseats$dev, type='b')
plot(cv.carseats$k, cv.carseats$dev, type='b')
# plot tree size against CV error
# plot cost-complexity parameter k against CV error

# Regression Tree
library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston = tree(medv~., Boston, subset=train)
summary(tree.boston)
# outputs # of terminal nodes,
# residual mean deviance, distribution of residuals
plot(tree.boston)
text(tree.boston, pretty=0)
# plot tree

# Bagging & Random Forest
library(randomForest)
set.seed(1)
bag.boston = randomForest(medv~., data=Boston, subset=train,
                          mtry=13, importance=TRUE)
# mtry=13: all 13 predictors considered at each split
# i.e. bagging (an m = p random forest)
# importance: assess importance of predictors
boston.test = Boston[-train, 'medv']
# test-set responses (assumed definition, used below)
set.seed(1)
rf.boston = randomForest(medv~., data=Boston, subset=train,
                         mtry=6, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
# test MSE, lower than bagging
importance(rf.boston)
# outputs MSE impact & impurity reduction
# for each predictor
varImpPlot(rf.boston)
# plot importance of predictors
# Boosting
library(gbm)
set.seed(1)
boost.boston = gbm(medv~., data=Boston[train,], distribution='gaussian',
                   n.trees=5000, interaction.depth=4)
summary(boost.boston)
# outputs relative influence statistics
par(mfrow=c(1,2))
plot(boost.boston, i='rm')
plot(boost.boston, i='lstat')
# partial dependence plots of important predictors against the response
yhat.boost = predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)^2)
# evaluation: test MSE similar to random forest
8 Support Vector Machine
8.1 Maximal Margin Classifier
The essential idea of a Maximal Margin Classifier (MMC) can be explained
using a simplified example: Let X1 , ..., Xn be a set of p-dimensional data points
(i.e. they are characterized by p predictors/variables), and y1 , ..., yn are their
corresponding responses. Assuming the data points belong to either one of
two classes, if we were able to find a hyperplane, which is defined as a flat affine
subspace of dimension p − 1 for the p-predictor case (for instance, a 2-dimensional
space can be partitioned into two halves by a 1-dimensional hyperplane, i.e. a line),
then we will be able to classify a data point by which side of the hyperplane it is
located on. More concretely, the (p − 1)-dimensional hyperplane for a set of
p-dimensional observations has the following general form:
$$ \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0 \qquad (8.1) $$
The classification is thus as follows: Xi belongs to one of the two classes if
β0 + β1X1 + ... + βpXp > 0; it belongs to the other class if β0 + β1X1 + ... + βpXp < 0.
However, note that MMC requires the hyperplane (i.e. the decision boundary) to be
linear. We will consider generalizations of MMC that handle non-linear classification
tasks later on. A minimal sketch of this decision rule is given below.
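A minimal sketch of the decision rule, with hypothetical coefficients and data points:

beta0 <- -1; beta <- c(2, 3)                      # hypothetical hyperplane coefficients
X <- matrix(c(1, 0,
              0, 1,
              -1, -1), ncol = 2, byrow = TRUE)    # three hypothetical observations
score <- beta0 + X %*% beta                       # beta0 + beta1*X1 + beta2*X2
ifelse(score > 0, "class 1", "class 2")           # the side of the hyperplane gives the class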
How, then, do we find such a hyperplane, the best out of the infinite number of
possible hyperplanes that can accomplish this division? For any observation, its margin
to a hyperplane is defined as the minimal (i.e. perpendicular) distance from it to the
hyperplane. In MMC, the hyperplane of choice is naturally defined as the one for which
the observations have the largest margin. (Note that although MMC often works well, it
can lead to overfitting when p is large. Support vectors need not be the closest
observations to the hyperplane.)
Figure 8.1: Support Vectors
The optimization goals for finding the MMC hyperplane are the following:
$$ \underset{\beta_0, \beta_1, ..., \beta_p}{\text{maximize}} \; M \qquad (8.2) $$
$$ \text{subject to} \; \sum_{j=1}^{p} \beta_j^2 = 1 \qquad (8.3) $$
$$ y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \quad \text{for all } i = 1, ..., n \qquad (8.4) $$
While the optimization procedure is outside the scope of our discussion, we explain
how it works informally here. Under the constraint in Eq. 8.3, the distance from an
observation to the hyperplane is $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$.
Eq. 8.4 requires the hyperplane to be at least M (distance) units away from every
observation. Technically, M = 0 (i.e. merely being on the correct side) would do the
trick; requiring a distance of at least M leaves some cushion room. With Eqs. 8.3 and
8.4, the hyperplane found will naturally be the one that maximizes the margin.
Figure 8.2: MMC Sensitivity
8.2 Support Vector Classifier
As Fig. 8.2 suggests, MMC can be overly sensitive to individual observations. The
support vector classifier relaxes MMC by allowing some observations to violate the
margin via slack variables ε_i, with the total amount of violation controlled by a
budget C:
$$ \underset{\beta_0, \beta_1, ..., \beta_p, \epsilon_1, ..., \epsilon_n}{\text{maximize}} \; M \qquad (8.5) $$
$$ \text{subject to} \; \sum_{j=1}^{p} \beta_j^2 = 1 \qquad (8.6) $$
$$ y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i) \qquad (8.7) $$
$$ \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C \qquad (8.8) $$
8.3 Support Vector Machine
As flexible as the support vector classifier is, we sometimes would still like a
non-linear decision boundary rather than a linear one. One way to achieve this is to
enlarge the feature space, e.g. with quadratic terms of the predictors:
$$ \underset{\beta_0, \beta_{11}, \beta_{12}, ..., \beta_{p1}, \beta_{p2}, \epsilon_1, ..., \epsilon_n}{\text{maximize}} \; M \qquad (8.9) $$
$$ \text{subject to} \; y_i\Big(\beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2\Big) \ge M(1 - \epsilon_i) \qquad (8.10) $$
$$ \sum_{i=1}^{n} \epsilon_i \le C, \quad \epsilon_i \ge 0, \quad \sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1 \qquad (8.11) $$
The problem with this technique is that the predictor set may become computationally
unmanageably large. Without getting into the technical details of the computation, we
give a somewhat informal but intuitive explanation of the solution to this problem —
the Support Vector Machine (SVM), an extension of the support vector classifier in
which kernels are used. The “base” form of a linear support vector classifier is
represented as follows, where x is a new observation:
$$ f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (8.12) $$
The inner product here is intended to represent the similarity between two vectors.
However, instead of computing such a “raw” similarity value, we compute a generalized
version of it: the kernel.
Figure 8.3: Left: Polynomial Kernel; Right: Radial Kernel
Two of the most popular forms of kernel are the polynomial kernel (Eq. 8.15) and the
radial kernel (Eq. 8.16). The fitting with these kernels is illustrated in Fig. 8.3.
$$ K(x_i, x_{i'}) = \Big(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Big)^d \qquad (8.15) $$
$$ K(x_i, x_{i'}) = \exp\Big(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Big) \qquad (8.16) $$
Basically, the polynomial kernel enables non-linear fitting. The radial kernel adds a
locality effect: the more distant/dissimilar a new observation is from a training
observation, the less effect that training observation has on the prediction for the
new one. (γ is a positive number; the greater γ is, the stronger this “long-distance
muffling” effect.) A minimal kernel-fitting sketch follows.
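A minimal sketch (with hypothetical simulated data) of fitting SVMs with the polynomial
and radial kernels of Eqs. 8.15 and 8.16 via e1071::svm(); degree and gamma correspond
to d and γ, and the values chosen here are arbitrary:

library(e1071)
set.seed(42)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular class boundary
dat <- data.frame(x = x, y = y)
svm.poly   <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)  # Eq. 8.15
svm.radial <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)       # Eq. 8.16
plot(svm.radial, dat)    # non-linear decision boundary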
# Support Vector Classifier
set.seed(1)
x = matrix(rnorm(20*2), ncol=2)
# create 20 2-dimensional data points
y = c(rep(-1,10), rep(1,10))
# labels for 2 classes
x[y==1,] = x[y==1,] + 1
# shift the y==1 class by 1 for distinction
plot(x, col=(3-y))
# check linear separability
dat = data.frame(x=x, y=as.factor(y))
# recode y into a factor
library(e1071)
svmfit = svm(y~., data=dat, kernel='linear',
             cost=10, scale=FALSE)
# kernel='linear': support vector classifier
# cost: penalty for margin violations
# scale=FALSE: do not do N(0,1) scaling
plot(svmfit, dat)
# 'o' regular observation
# 'x' support vector
# coloring for class distinction
set.seed(1)
tune.out = tune(svm, y~., data=dat, kernel='linear',
                ranges=list(cost=c(.001,.01,.1,1,5,10,100)))
# 10-fold CV
summary(tune.out)
# output the best cost parameter
bestmod = tune.out$best.model
summary(bestmod)
# Support Vector Machine (radial kernel, non-linear data)
set.seed(1)
x = matrix(rnorm(200*2), ncol=2)
x[1:100,] = x[1:100,] + 2
x[101:150,] = x[101:150,] - 2
y = c(rep(1,150), rep(2,50))
dat = data.frame(x=x, y=as.factor(y))
# create a non-linearly separable data set
plot(x, col=y)
# plot to check
train = sample(200, 100)
# train-test split (assumed; needed for dat[train,] below)
set.seed(1)
tune.out = tune(svm, y~., data=dat[train,], kernel='radial',
                ranges=list(cost=c(.1,1,10,100,1000)))
summary(tune.out)
# 10-fold CV for the best cost parameter
table(true=dat[-train,'y'], pred=predict(tune.out$best.model,
                                         newdata=dat[-train,]))
# model testing on the test set

# SVM with multiple classes
set.seed(1)
x = rbind(x, matrix(rnorm(50*2), ncol=2))
y = c(y, rep(0,50))
x[y==0,2] = x[y==0,2] + 2
dat = data.frame(x=x, y=as.factor(y))
# add a third class to the data
par(mfrow=c(1,1))
plot(x, col=(y+1))
9 Unsupervised Learning
We have been mainly focusing on supervised learning, where we have i) a set of
predictors X1, ..., Xp measured on n observations x1, ..., xn, and ii) corresponding
responses y1, ..., yn. The responses (or “correct answers”) guide the construction of
our model. However, in cases where we only have the observations x1, ..., xn and the p
predictors, this guidance is no longer available. Unsupervised learning is thus
motivated.

9.1 Principal Component Analysis (PCA)
In Ch 5.3, we learned that the 1st PC is the dimension that maximizes the variance
explained in a data set. The maximization goal with which we find the 1st PC (defined
in Eq. 9.1) is then Eq. 9.2. The optimization problem is solved using an eigen
decomposition.
$$ \underset{\phi_{11},...,\phi_{p1}}{\text{maximize}} \; \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p} \phi_{j1} x_{ij}\Big)^2, \quad \text{subject to} \; \sum_{j=1}^{p} \phi_{j1}^2 = 1 \qquad (9.2) $$
Further, note that scaling (e.g. N(0,1) standardization) affects the output of PCA.
Scaling is desirable when the predictors are measured in different units; it is
unnecessary, and potentially damaging, when the predictors are measured in the same
units.
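A minimal sketch (using the USArrests data that appears in the lab code below) of the
effect of scaling on the first PC's loadings:

pr.unscaled <- prcomp(USArrests)                 # no scaling
pr.scaled   <- prcomp(USArrests, scale. = TRUE)  # N(0,1) scaling
pr.unscaled$rotation[, 1]   # 1st PC loadings dominated by the highest-variance variable
pr.scaled$rotation[, 1]     # loadings are far more balanced after scaling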
Figure 9.1: Scree Plot
The proportion of variance explained (PVE) by the m-th PC measures how much of the
total variance it accounts for in the data. PVE is defined as follows, where the
numerator is the variance explained by the m-th PC and the denominator is the total
amount of variance in the data set:
$$ PVE_m = \frac{\sum_{i=1}^{n}\big(\sum_{j=1}^{p} \phi_{jm} x_{ij}\big)^2}{\sum_{i=1}^{n}\sum_{j=1}^{p} x_{ij}^2} \qquad (9.3) $$
In order to decide how many PCs to include in a PCA model, we plot the
number of PCs against the (cumulative) PVE for a scree plot (as in Fig. 9.1).
A rule of thumb here is the elbow method: take the number of PCs at which the
scree plot shows a major change of direction. In this example, PC = 2 would
be a good number to pick by the rule. However, keep in mind that the actual
decision should be based on many other factors (e.g. number of PCs preserved
in order to keep a highly faithful approximation while not paying too much
computational cost).
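A minimal sketch (again with USArrests) of computing the PVE of Eq. 9.3 from prcomp()
and drawing the scree plots used for the elbow rule:

pr.out <- prcomp(USArrests, scale. = TRUE)
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)        # proportion of variance explained per PC
plot(pve, type = 'b', xlab = 'Principal Component', ylab = 'PVE')
plot(cumsum(pve), type = 'b', xlab = 'Principal Component', ylab = 'Cumulative PVE')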
9.2 Clustering
Succinctly stated, clustering methods look to find homogeneous subgroups among
the observations.
9.2.1 K-Means
As its name makes clear, K-Means clustering seeks to partition a data set into K
subgroups/classes, where K is empirical and its choice depends on two main factors:
• Clustering Pattern: how the data cloud naturally partitions and gathers into
subgroups.
• Granularity Needed: how much granularity is required for the particular task.
Fig. 9.2 presents the grouping results on a data set using K = 2, 3, 4.
Figure 9.2: K-Means Demo
Formally, the K clusters C1, ..., CK satisfy two properties:
• C1 ∪ ... ∪ CK = {1, ..., n}, i.e. each observation belongs to at least one of the K
clusters.
• Ci ∩ Cj = ∅ for all i ≠ j, i.e. each observation can belong to only one cluster.
An ideal K-Means clustering minimizes the total (i.e. over all K classes) within-class
squared Euclidean distance:
$$ \underset{C_1,...,C_K}{\text{minimize}} \sum_{k=1}^{K} W(C_k) = \underset{C_1,...,C_K}{\text{minimize}} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \qquad (9.4) $$
The optimization problem in Eq. 9.4 is solved using the following algorithm (a minimal
kmeans() sketch is given after the list):
• Randomly pick K observations to serve as initial cluster representatives, and assign
each observation to the cluster whose representative is closest in Euclidean distance.
• Iterate the following steps until the assignments stop changing:
  – Compute the centroid of each cluster, and represent the cluster by its centroid.
  – Reassign each observation to the cluster whose centroid is closest in Euclidean
    distance.
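A minimal sketch with kmeans() on hypothetical simulated data; nstart runs the
algorithm from several random initializations, since each single run only reaches a
local optimum of Eq. 9.4:

set.seed(4)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3                     # two well-separated groups
km.out <- kmeans(x, centers = 2, nstart = 20)    # 20 random initializations
km.out$cluster                                   # cluster assignments
km.out$tot.withinss                              # total within-cluster sum of squares (Eq. 9.4)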
Figure 9.3: Hierarchical Clustering Demo
9.2.2 Hierarchical Clustering
The basic idea of HC is that pairs of observations (or clusters) that are closest to
each other are successively fused into single representations (how this is done is
discussed below), until we obtain one single representation of the whole data set.
The process of HC is represented with a dendrogram, as illustrated in Fig. 9.3. The
number of clusters is chosen by drawing a horizontal cut at some vertical point on
the dendrogram; the point is decided by the same factors as in the case of K-Means
clustering. A practical advantage of HC over K-Means is that a single dendrogram can
be used to obtain any number of clusters, without re-running the procedure for each K.
Given n observations in a data set, the core algorithm for HC goes as follows:
• Compute all pairwise dissimilarities (e.g. Euclidean distance; there are
$\binom{n}{2} = n(n-1)/2$ pairs), and treat each observation as its own cluster.
• Fuse clusters based on distance defined by dissimilarity.
• Compute new pairwise dissimilarities among new clusters using one of the
following methods:
– Complete Linkage: Between-cluster distance is the dissimilarity be-
tween the furthest apart representative members from each cluster
(Maximal intercluster dissimilarity).
– Single Linkage: Minimal intercluster dissimilarity.
– Average Linkage: Average intercluster dissimilarity.
– Centroid : Clusters are represented by their centroids when measuring
dissimilarity.
Complete and average linkage are preferred over single linkage, for they usually
produce more balanced dendrograms. Centroid linkage also produces balanced dendrograms;
however, it suffers from the inversion problem, where two clusters are fused at a
height below either of the individual clusters in the dendrogram.
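A minimal sketch with hclust() on hypothetical simulated data, trying the three main
linkages and cutting the dendrogram into two clusters:

set.seed(4)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3                          # two well-separated groups
hc.complete <- hclust(dist(x), method = 'complete')   # maximal intercluster dissimilarity
hc.average  <- hclust(dist(x), method = 'average')
hc.single   <- hclust(dist(x), method = 'single')
plot(hc.complete, cex = 0.9)                          # dendrogram
cutree(hc.complete, 2)                                # labels from a 2-cluster horizontal cut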
# PCA
library(ISLR)
states = row.names(USArrests)
# USArrests: arrest statistics across US states
apply(USArrests, 2, mean)
# apply(data, margin (1 = rows, 2 = columns), function)
# compute column means
# K-means Clustering
set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25,1] = x[1:25,1]+3
x[1:25,2] = x[1:25,2]-4
# create data with two obvious groups
km.out = kmeans(x, 2, nstart=20)
# assumed call: K = 2 with 20 random starts
# km.out$tot.withinss: within-class sum of squares (Euclidean)
# nstart=1 -> 104.33; nstart=20 -> 97.98
km.out$cluster
# output class assignments for all observations
# Hierarchical Clustering
hc.complete = hclust(dist(x), method='complete')
hc.average = hclust(dist(x), method='average')
hc.single = hclust(dist(x), method='single')
# assumed definitions, used for the plots below
par(mfrow=c(1,3))
plot(hc.complete, cex=.9)
# the rest (average, single) are plotted the same way
cutree(hc.complete, 2)
# cut the tree (horizontally) at a height such that
# the observations fall into 2 classes