CART Algorithm


An introduction to CART

Sudheesh K Kattumannil

Indian Statistical Institute, Chennai, India.

ISI Chennai, 30 January, 2024



Examples

I Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
  Data: demographic, diet and clinical measurements.
I Predict the amount of glucose in the blood of a diabetic.
  Data: infrared absorption spectrum of a blood sample.



I Outcomes are predicted on the basis of a set of features (e.g. diet or clinical measurements).
I The prediction model is called a learner.



Example

I Data: 4601 email messages, each labelled email (+) or spam (-).
I Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
I Prediction goal: label future messages email (+) or spam (-).
I Supervised learning problem on categorical data: a classification problem.



Tree versus Linear models



Decision trees

I A graphical representation of possible solutions to a decision, based on certain conditions.
I A decision tree defines a tree-structured hierarchy of rules.
I Node types: root node, decision/internal node, leaf/terminal node.
I The root and internal nodes contain the rules.
I The leaf nodes define the predictions.
I Decision tree learning is about learning such a tree from the labelled training data.



An example: Decision trees



Classification and Regression Tree: CART

I Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known.
I Classification tree: the dependent variable is categorical.
I Regression tree: the dependent variable is continuous.



Classification Tree: Binary classification
Assume training data with each input having two features (x1, x2).




Classification Tree: Binary classification

Is x1 greater than 3?



Classification Tree: Binary classification

Given x1 > 3, is feature 2 (x2) greater than 3?



Classification Tree: Binary classification

Given x1 ≤ 3, is feature 2 (x2) greater than 1?
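Taken together, the three questions define a complete classifier for this toy example. Below is a small illustrative sketch in R; the class labels "A" and "B" are hypothetical stand-ins for the two classes in the figure, not labels given in the slides.

# Toy classifier encoding the three splits above (labels "A"/"B" are made up).
classify <- function(x1, x2) {
  if (x1 > 3) {
    if (x2 > 3) "A" else "B"   # right branch: split on x2 at 3
  } else {
    if (x2 > 1) "A" else "B"   # left branch: split on x2 at 1
  }
}

classify(4, 5)    # x1 > 3 and x2 > 3
classify(2, 0.5)  # x1 <= 3 and x2 <= 1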




Decision Trees: Predicting Baseball Players’ Salaries

I We need to predict a baseball player’s Salary on the basis of Years and Hits.
I Overall, the tree stratifies or segments the players into three regions of predictor space:
I players who have played for fewer than 4.5 years,
I players who have played for 4.5 or more years and who made fewer than 117.5 hits last year, and
I players who have played for 4.5 or more years and who made at least 117.5 hits last year.
I The predicted salary for the players in a region is given by the mean response value for the players in that part of the partitioned data set.
I R1 = {X | Years < 4.5},
I R2 = {X | Years ≥ 4.5, Hits < 117.5} and R3 = {X | Years ≥ 4.5, Hits ≥ 117.5}.
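For a concrete illustration, a small tree of this kind can be fit in R. This is only a sketch: the slides do not prescribe any software, so the rpart package and the Hitters data from the ISLR package are assumptions here.

library(rpart)
library(ISLR)                                   # provides the Hitters data set

dat <- na.omit(Hitters)                         # Salary contains missing values
fit <- rpart(log(Salary) ~ Years + Hits, data = dat,
             control = rpart.control(maxdepth = 2))   # keep the tree small

print(fit)                  # splits should resemble Years < 4.5 and Hits < 117.5
plot(fit); text(fit, use.n = TRUE)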



Predicting Baseball Players’ Salaries: Players with Years < 4.5

I We use the Hitters data set to predict a baseball player’s Salary on the basis of Years and Hits.
I Consider the players having less than 4.5 years of experience.
I For such players, the mean log salary is 5.107.
I Hence the predicted salary of these players is 1,000 × e^5.107 ≈ 165,174 dollars.
I The predicted salaries for the other two groups are 1,000 × e^5.999 ≈ 402,834 and 1,000 × e^6.740 ≈ 845,346 dollars, respectively.



Building a regression tree
I Mainly there are two steps.
I Step 1: We divide the predictor space, that is, the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
I Step 2: For every observation that falls into the region Rj ,
we make the same prediction, which is simply the mean of
the response values for the training observations in Rj .
I Suppose that in Step 1 we obtain two regions, R1 and R2 .
I Assume that the response mean of the training
observations in the first region is 10, while the response
mean of the training observations in the second region is
20.
I Then for a given observation X = x, if x ∈ R1 we will
predict a value of 10, and if x ∈ R2 we will predict a value
of 20.
Construction of the regions

I Divide the predictor space into high-dimensional rectangles, or boxes.
I Find boxes R1, ..., RJ that minimize the RSS, given by

RSS = Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²,

where ŷ_{R_j} is the mean response for the training observations within the jth box.



Construction of the regions
I In binary splitting, we first select the predictor Xj and the cut point s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in RSS.
I That is, we consider all predictors X1, ..., Xp and all possible values of the cut point s for each of the predictors, and then choose the predictor and cut point such that the resulting tree has the lowest RSS.
I That is, for any j and s, we define the pair of half-planes R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}.
I We seek the values of j and s that minimize

Σ_{i: x_i ∈ R_1(j,s)} (y_i − ŷ_{R_1})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ŷ_{R_2})²,

where ŷ_{R_1} and ŷ_{R_2} are the mean responses for the training observations in R1(j, s) and R2(j, s), respectively. A sketch of this greedy search is given below.
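The greedy search over j and s described above can be written down directly. The sketch below is illustrative only; the function name best_split and the simulated data are not from the lecture.

# Greedy search for the best single split (predictor index j and cut point s).
best_split <- function(X, y) {
  best <- list(j = NA, s = NA, rss = Inf)
  for (j in seq_along(X)) {
    for (s in sort(unique(X[[j]]))) {        # candidate cut points
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(j = j, s = s, rss = rss)
    }
  }
  best
}

# Toy example: the response depends mainly on whether x1 exceeds 5
set.seed(1)
X <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 10))
y <- ifelse(X$x1 > 5, 20, 10) + rnorm(50)
best_split(X, y)   # should return j = 1 with a cut point near 5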
Tree with five regions





Overfitting versus underfitting

I Underfitting occurs when a model is not able to make accurate predictions on the training data; accordingly, it may not have the capacity to generalize to new data.
I Machine learning models that underfit tend to perform poorly on both the training and the test sets.
I Underfitting models usually have high bias and low variance.
I A model is considered overfitting when it does extremely well on the training data but fails to perform at the same level on the validation/test data.
I Models that overfit usually have low bias and high variance.





Tree Pruning

I The process described above (recursive binary splitting) may produce good predictions on the training set, but is likely to overfit the data, leading to poor test-set performance.
I This is because the resulting tree might be too complex.
I A smaller tree with fewer splits (that is, fewer regions R1, . . . , RJ) might lead to lower variance and better interpretation at the cost of a little bias.
I A strategy is to grow a very large tree T0 and then prune it back in order to obtain a subtree.
I How do we determine the best way to prune the tree?
I Our goal is to select a subtree that leads to the lowest test error rate.



Tree Pruning
I Cost complexity pruning—also known as weakest link pruning—gives us a way to do just this.
I Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.
I For each value of α there corresponds a subtree T ⊂ T0 that minimizes

Σ_{m=1}^{|T|} Σ_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|,

where |T| is the number of terminal nodes of the tree T.



Tree Pruning: Algorithm

I Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
I Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
I Step 3: Use K-fold cross-validation to choose α (see next slide).
I Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α (a sketch in R follows).
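A minimal sketch of Steps 1-4 in R, assuming the rpart package and the Hitters data from ISLR; rpart's complexity parameter cp plays the role of a rescaled α, and its cptable already stores the cross-validated error for each value. None of this is prescribed by the slides.

library(rpart)
library(ISLR)

dat <- na.omit(Hitters)

# Step 1: grow a large tree (cp = 0 turns off early cost-complexity stopping)
big_tree <- rpart(log(Salary) ~ Years + Hits, data = dat, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# Steps 2-3: cptable lists the nested subtrees with their 10-fold CV error
printcp(big_tree)
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]

# Step 4: return the subtree corresponding to the chosen complexity value
pruned <- prune(big_tree, cp = best_cp)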



How to choose Alpha

I Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
I Repeat Steps 1 and 2 of the algorithm on all but the kth fold of the training data.
I Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
I Average the results for each value of α, and pick α to minimize the average error:

CV = (1/K) Σ_{k=1}^{K} MSE_k



Classification Trees
I A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
I We predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
I A natural alternative to RSS is the classification error rate.
I The Gini index is defined by

G = Σ_{k=1}^{K} p̂_{mk}(1 − p̂_{mk}),

where p̂_{mk} represents the proportion of training observations in the mth region that are from the kth class; it is a measure of total variance across the K classes.
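As a small illustration (not from the slides), the Gini index of a node can be computed directly from its vector of class labels:

gini_index <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p_mk within the node
  sum(p * (1 - p))
}

gini_index(c("spam", "spam", "email", "email", "email"))  # 2 * 0.4 * 0.6 = 0.48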
Advantages and Disadvantages of Trees

I Trees are very easy to explain to people.
I Decision trees more closely mirror human decision-making than do the regression and classification approaches seen earlier.
I Trees can be displayed graphically, and are easily interpreted even by a non-expert.
I Trees can easily handle qualitative predictors without the need to create dummy variables.
I Trees generally do not have the same level of predictive accuracy as other regression and classification approaches such as GLMs.
I Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.




Bootstrap aggregation or bagging

I Given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n.
I A natural way to increase the prediction accuracy of a statistical learning method is therefore to take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.





Bootstrap aggregation or bagging

I We could estimate f(x) by f̂^1(x), f̂^2(x), ..., f̂^B(x) using B separate training sets and average them in order to obtain a single low-variance statistical learning model:

f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

I In practice we do not have access to multiple training sets, so we generate B different bootstrapped training data sets, train the model on the bth set to obtain f̂^{*b}(x), and average the predictions:

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x)





Bagging: In regression tree

I To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets.
I Average the resulting predictions.
I Averaging these B trees reduces the variance; a sketch follows below.
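A minimal bagging sketch in R, continuing the earlier Hitters example; the helper names are illustrative and the rpart/ISLR packages are assumptions, not part of the slides.

library(rpart)
library(ISLR)

dat <- na.omit(Hitters)
B   <- 100
set.seed(1)

# Fit one unpruned regression tree per bootstrapped training set
bagged_trees <- lapply(seq_len(B), function(b) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  rpart(log(Salary) ~ Years + Hits, data = boot,
        control = rpart.control(cp = 0, minsplit = 5))
})

# The bagged prediction is the average of the B individual tree predictions
predict_bagged <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}

head(predict_bagged(bagged_trees, dat))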





Bagging: Classification

I For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote:
I the overall prediction is the most commonly occurring class among the B predictions.





Random Forests
I While building the decision trees in a bagged ensemble, each time we consider a split in a tree:
I Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
I In bagged trees, most of the trees will then use this strong predictor in the top split.
I Hence the predictions from the bagged trees will be highly correlated, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities.
I Random forests overcome this problem by forcing each split to consider only a subset of the predictors.
I A random sample of m predictors is chosen as split candidates from the full set of p predictors (typically m ≈ √p in classification and m ≈ p/3 in regression); a sketch follows below.
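A minimal random forest sketch in R using the randomForest package (the slides' ntree and mtry names match this package's arguments, but the package and the data used here are assumptions, not prescribed by the lecture).

library(randomForest)
library(ISLR)

dat <- na.omit(Hitters)
dat$logSalary <- log(dat$Salary)
dat$Salary    <- NULL
p <- ncol(dat) - 1                       # number of predictors

# Regression forest: roughly p/3 predictors tried as candidates at each split
set.seed(1)
rf_fit <- randomForest(logSalary ~ ., data = dat,
                       ntree = 500, mtry = max(floor(p / 3), 1),
                       importance = TRUE)
rf_fit                                   # prints the OOB estimate of the MSE
varImpPlot(rf_fit)                       # variable importance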


Tuning parameters in the Random Forest algorithm

I The two main tuning parameters are the number of trees grown in the forest (ntree) and the number of predictors considered as split candidates at each split (mtry).
I A tree with a low error rate is a strong classifier.
I As mtry decreases, both the correlation between trees and the strength of the individual trees decrease.
I We need to find an optimal mtry.





Tuning parameters in the Random Forest algorithm

I Set mtry to a default value.
I Try different ntree values.
I Record the OOB error rate for each.
I Find the ntree at which the OOB error rate stabilizes and reaches its minimum; a sketch follows below.
I See the link http://www.r2d3.us/visual-intro-to-machine-learning-part-1/ for a visual introduction to decision trees.
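A sketch of this OOB-based tuning of ntree, continuing the previous randomForest example (illustrative only):

set.seed(1)
rf_big <- randomForest(logSalary ~ ., data = dat, ntree = 1000)

# For a regression forest, rf_big$mse holds the OOB mean squared error after
# 1, 2, ..., ntree trees, so we can see where the error curve flattens out.
plot(rf_big$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")
which.min(rf_big$mse)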



I Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
I Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
I James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
