CART Algorithm


An introduction to CART

Sudheesh K Kattumannil

Indian Statistical Institute, Chennai, India.

ISI Chennai, 30 January, 2024



Examples

I Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
  Data: demographic, diet and clinical measurements.
I Predict the amount of glucose in the blood of a diabetic.
  Data: infrared absorption spectrum of a blood sample.



I Outcomes are predicted on the basis of a set of features (e.g. diet or clinical measurements).
I The prediction model is called a learner.



Example

I Data: 4601 email messages, each labelled email (+) or spam (-).
I Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
I Prediction goal: label future messages email (+) or spam (-).
I Supervised learning problem on categorical data: a classification problem.



Tree versus Linear models



Decision trees

I A graphical representation of possible solutions to a decision, based on certain conditions.
I A decision tree defines a tree-structured hierarchy of rules.
I Node types: root node, decision/internal node, leaf/terminal node.
I The root and internal nodes contain the rules.
I The leaf nodes define the predictions.
I Decision tree learning is about learning such a tree from the labelled training data.



An example: Decision trees



Classification and Regression Tree: CART

I Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known.
I Classification tree: the dependent variable is categorical.
I Regression tree: the dependent variable is continuous.



Classification Tree: Binary classification
Assume training data with each input having two features (x1, x2).




Classification Tree: Binary classification

Is x1 greater than 3?



Classification Tree: Binary classification

Given x1 > 3, is feature 2 (x2) greater than 3?



Classification Tree: Binary classification

Given x1 ≤ 3, is feature 2 (x2) greater than 1?
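Taken together, the three questions define a complete classifier for this toy example. Below is a small illustrative sketch in R; the class labels "A" and "B" are hypothetical stand-ins for the two classes in the figure, not labels given in the slides.

# Toy classifier encoding the three splits above (labels "A"/"B" are made up).
classify <- function(x1, x2) {
  if (x1 > 3) {
    if (x2 > 3) "A" else "B"   # right branch: split on x2 at 3
  } else {
    if (x2 > 1) "A" else "B"   # left branch: split on x2 at 1
  }
}

classify(4, 5)    # x1 > 3 and x2 > 3
classify(2, 0.5)  # x1 <= 3 and x2 <= 1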




Decision Trees: Predicting Baseball Players’ Salaries

I We need to predict a baseball player’s Salary on the basis of Years and Hits.
I Overall, the tree stratifies or segments the players into three regions of predictor space:
I players who have played for fewer than 4.5 years,
I players who have played for 4.5 or more years and who made fewer than 117.5 hits last year, and
I players who have played for 4.5 or more years and who made at least 117.5 hits last year.
I The predicted salary for the players in a region is given by the mean response value for the players in that part of the partitioned data set.
I R1 = {X | Years < 4.5},
I R2 = {X | Years ≥ 4.5, Hits < 117.5} and R3 = {X | Years ≥ 4.5, Hits ≥ 117.5}.
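For a concrete illustration, a small tree of this kind can be fit in R. This is only a sketch: the slides do not prescribe any software, so the rpart package and the Hitters data from the ISLR package are assumptions here.

library(rpart)
library(ISLR)                                   # provides the Hitters data set

dat <- na.omit(Hitters)                         # Salary contains missing values
fit <- rpart(log(Salary) ~ Years + Hits, data = dat,
             control = rpart.control(maxdepth = 2))   # keep the tree small

print(fit)                  # splits should resemble Years < 4.5 and Hits < 117.5
plot(fit); text(fit, use.n = TRUE)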



Predicting Baseball Players’ Salaries: Players with Years < 4.5

I We use the Hitters data set to predict a baseball player’s Salary on the basis of Years and Hits.
I Consider the players having less than 4.5 years of experience.
I For such players, the mean log salary is 5.107.
I Hence the predicted salary of these players is 1,000 × e^5.107 ≈ 165,174 dollars.
I The predicted salaries for the other two groups are 1,000 × e^5.999 ≈ 402,834 and 1,000 × e^6.740 ≈ 845,346 dollars, respectively.



Building a regression tree
I Mainly there are two steps.
I Step 1: We divide the predictor space, that is, the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
I Step 2: For every observation that falls into the region Rj ,
we make the same prediction, which is simply the mean of
the response values for the training observations in Rj .
I Suppose that in Step 1 we obtain two regions, R1 and R2 .
I Assume that the response mean of the training
observations in the first region is 10, while the response
mean of the training observations in the second region is
20.
I Then for a given observation X = x, if x ∈ R1 we will
predict a value of 10, and if x ∈ R2 we will predict a value
of 20.
Construction of the regions

I Divide the predictor space into high-dimensional rectangles, or boxes.
I Find boxes R1, ..., RJ that minimize the RSS, given by

RSS = Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²,

where ŷ_{R_j} is the mean response for the training observations within the jth box.



Construction of the regions
I In binary splitting, we first select the predictor Xj and the cut point s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in RSS.
I That is, we consider all predictors X1, ..., Xp and all possible values of the cut point s for each of the predictors, and then choose the predictor and cut point such that the resulting tree has the lowest RSS.
I That is, for any j and s, we define the pair of half-planes R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}.
I We seek the values of j and s that minimize

Σ_{i: x_i ∈ R_1(j,s)} (y_i − ŷ_{R_1})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ŷ_{R_2})²,

where ŷ_{R_1} and ŷ_{R_2} are the mean responses for the training observations in R1(j, s) and R2(j, s), respectively. A sketch of this greedy search is given below.
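The greedy search over j and s described above can be written down directly. The sketch below is illustrative only; the function name best_split and the simulated data are not from the lecture.

# Greedy search for the best single split (predictor index j and cut point s).
best_split <- function(X, y) {
  best <- list(j = NA, s = NA, rss = Inf)
  for (j in seq_along(X)) {
    for (s in sort(unique(X[[j]]))) {        # candidate cut points
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(j = j, s = s, rss = rss)
    }
  }
  best
}

# Toy example: the response depends mainly on whether x1 exceeds 5
set.seed(1)
X <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 10))
y <- ifelse(X$x1 > 5, 20, 10) + rnorm(50)
best_split(X, y)   # should return j = 1 with a cut point near 5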
Tree with five regions





Overfitting versus underfitting

I Underfitting occurs when a model is not able to make accurate predictions on the training data; accordingly, it may not have the capacity to generalize to new data.
I Machine learning models that underfit tend to perform poorly on both the training and the test sets.
I Underfitting models usually have high bias and low variance.
I A model is considered overfitting when it does extremely well on the training data but fails to perform at the same level on the validation/test data.
I Models that overfit usually have low bias and high variance.





Tree Pruning

I The process described above (recursive binary splitting) may produce good predictions on the training set, but is likely to overfit the data, leading to poor test-set performance.
I This is because the resulting tree might be too complex.
I A smaller tree with fewer splits (that is, fewer regions R1, . . . , RJ) might lead to lower variance and better interpretation at the cost of a little bias.
I A strategy is to grow a very large tree T0 and then prune it back in order to obtain a subtree.
I How do we determine the best way to prune the tree?
I Our goal is to select a subtree that leads to the lowest test error rate.



Tree Pruning
I Cost complexity pruning—also known as weakest link pruning—gives us a way to do just this.
I Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.
I For each value of α there corresponds a subtree T ⊂ T0 that minimizes

Σ_{m=1}^{|T|} Σ_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|,

where |T| is the number of terminal nodes of the tree T.



Tree Pruning: Algorithm

I Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
I Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
I Step 3: Use K-fold cross-validation to choose α (see next slide).
I Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α (a sketch in R follows).
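A minimal sketch of Steps 1-4 in R, assuming the rpart package and the Hitters data from ISLR; rpart's complexity parameter cp plays the role of a rescaled α, and its cptable already stores the cross-validated error for each value. None of this is prescribed by the slides.

library(rpart)
library(ISLR)

dat <- na.omit(Hitters)

# Step 1: grow a large tree (cp = 0 turns off early cost-complexity stopping)
big_tree <- rpart(log(Salary) ~ Years + Hits, data = dat, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# Steps 2-3: cptable lists the nested subtrees with their 10-fold CV error
printcp(big_tree)
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]

# Step 4: return the subtree corresponding to the chosen complexity value
pruned <- prune(big_tree, cp = best_cp)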



How to choose Alpha

I Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
I Repeat Steps 1 and 2 of the algorithm on all but the kth fold of the training data.
I Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
I Average the results for each value of α, and pick α to minimize the average error:

CV = (1/K) Σ_{k=1}^{K} MSE_k



Classification Trees
I A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
I We predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
I A natural alternative to RSS is the classification error rate.
I The Gini index is defined by

G = Σ_{k=1}^{K} p̂_{mk}(1 − p̂_{mk}),

where p̂_{mk} represents the proportion of training observations in the mth region that are from the kth class; it is a measure of total variance across the K classes.
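As a small illustration (not from the slides), the Gini index of a node can be computed directly from its vector of class labels:

gini_index <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p_mk within the node
  sum(p * (1 - p))
}

gini_index(c("spam", "spam", "email", "email", "email"))  # 2 * 0.4 * 0.6 = 0.48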
Advantages and Disadvantages of Trees

I Trees are very easy to explain to people.
I Decision trees more closely mirror human decision-making than do the regression and classification approaches seen earlier.
I Trees can be displayed graphically, and are easily interpreted even by a non-expert.
I Trees can easily handle qualitative predictors without the need to create dummy variables.
I Trees generally do not have the same level of predictive accuracy as other regression and classification approaches such as GLMs.
I Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.




Bootstrap aggregation or bagging

I Given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n.
I A natural way to increase the prediction accuracy of a statistical learning method is therefore to take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.





Bootstrap aggregation or bagging

I We could estimate f(x) by f̂^1(x), f̂^2(x), ..., f̂^B(x) using B separate training sets and average them in order to obtain a single low-variance statistical learning model:

f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

I In practice we do not have access to multiple training sets, so we generate B different bootstrapped training data sets, train the model on the bth set to obtain f̂^{*b}(x), and average the predictions:

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x)





Bagging: In regression tree

I To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets.
I Average the resulting predictions.
I Averaging these B trees reduces the variance; a sketch follows below.
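A minimal bagging sketch in R, continuing the earlier Hitters example; the helper names are illustrative and the rpart/ISLR packages are assumptions, not part of the slides.

library(rpart)
library(ISLR)

dat <- na.omit(Hitters)
B   <- 100
set.seed(1)

# Fit one unpruned regression tree per bootstrapped training set
bagged_trees <- lapply(seq_len(B), function(b) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  rpart(log(Salary) ~ Years + Hits, data = boot,
        control = rpart.control(cp = 0, minsplit = 5))
})

# The bagged prediction is the average of the B individual tree predictions
predict_bagged <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}

head(predict_bagged(bagged_trees, dat))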





Bagging: Classification

I For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote:
I the overall prediction is the most commonly occurring class among the B predictions.





Random Forests
I While building the decision trees in a bagged ensemble, each time we consider a split in a tree:
I Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
I In bagged trees, most of the trees will then use this strong predictor in the top split.
I Hence the predictions from the bagged trees will be highly correlated, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities.
I Random forests overcome this problem by forcing each split to consider only a subset of the predictors.
I A random sample of m predictors is chosen as split candidates from the full set of p predictors (typically m ≈ √p in classification and m ≈ p/3 in regression); a sketch follows below.
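A minimal random forest sketch in R using the randomForest package (the slides' ntree and mtry names match this package's arguments, but the package and the data used here are assumptions, not prescribed by the lecture).

library(randomForest)
library(ISLR)

dat <- na.omit(Hitters)
dat$logSalary <- log(dat$Salary)
dat$Salary    <- NULL
p <- ncol(dat) - 1                       # number of predictors

# Regression forest: roughly p/3 predictors tried as candidates at each split
set.seed(1)
rf_fit <- randomForest(logSalary ~ ., data = dat,
                       ntree = 500, mtry = max(floor(p / 3), 1),
                       importance = TRUE)
rf_fit                                   # prints the OOB estimate of the MSE
varImpPlot(rf_fit)                       # variable importance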


Tuning parameters in the Random Forest algorithm

I The two main tuning parameters are the number of trees grown in the forest (ntree) and the number of predictors considered as split candidates at each split (mtry).
I A tree with a low error rate is a strong classifier.
I As mtry decreases, both the correlation between trees and the strength of the individual trees decrease.
I We need to find an optimal mtry.





Tuning parameters in the Random Forest algorithm

I Set mtry to a default value.
I Try different ntree values.
I Record the OOB error rate for each.
I Find the ntree at which the OOB error rate stabilizes and reaches its minimum; a sketch follows below.
I See the link http://www.r2d3.us/visual-intro-to-machine-learning-part-1/ for a visual introduction to decision trees.
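A sketch of this OOB-based tuning of ntree, continuing the previous randomForest example (illustrative only):

set.seed(1)
rf_big <- randomForest(logSalary ~ ., data = dat, ntree = 1000)

# For a regression forest, rf_big$mse holds the OOB mean squared error after
# 1, 2, ..., ntree trees, so we can see where the error curve flattens out.
plot(rf_big$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")
which.min(rf_big$mse)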



I Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
I Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
I James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
