
Classification and Regression Trees
CLASSIFICATION TREES
Goal
• Classify an outcome based on a set of
predictors
• The output is a set of rules
Example
• Goal: classify a record as “will accept credit
card offer” or “will not accept”
• A rule might be “IF (Income > 92.5) AND
(Education < 1.5) AND (Family <= 2.5) THEN
Class = 0 (nonacceptor)”
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
Two key ideas
• Recursive partitioning:
Repeatedly split the records into two parts so
as to achieve maximum homogeneity within the
new parts
• Pruning:
Simplify the tree by pruning peripheral
branches to avoid overfitting
Recursive Partitioning
• Dependent (response) variable y
• The dependent variable is a categorical variable in
classification trees
• Predictor variables x1, x2, …, xp
• The predictor variables can be continuous,
binary, or ordinal
• Recursive partitioning divides the p-dimensional
space of the predictor variables into non-
overlapping multidimensional rectangles
Recursive Partitioning Steps
• Select one of the predictor variables, say xi
• Select a value of xi, say si, that divides the
training data into two (not necessarily equal)
portions
• Then, one of these two parts is divided in a
similar manner by choosing a variable again
and a split value for the variable
Recursive Partitioning Steps
• This results in three multi-dimensional
rectangular regions
• The process is continued so that smaller and
smaller rectangular regions are obtained
• The idea is to divide the entire predictor space
into rectangles such that each rectangle is as
homogeneous or “pure” as possible
Recursive Partitioning Steps
• At each step, we measure how “pure” or
homogeneous each of the resulting portions is
“Pure” = containing records of mostly one class
• In each split, the algorithm tries different
values of xi and si to maximize purity
Example: Riding Mowers
• Goal: Classify 24 households as owning or not
owning riding mowers
• Dependent variable: Categorical (owner, non-
owner)
• Predictors: Income, Lot Size
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
Algorithm: how to split
• Order records according to one variable, say lot
size
• R uses the predictor values as the split points
• XLMiner finds midpoints between successive
values, and uses them as split points
E.g. first midpoint is 14.4 (halfway between 14.0 and 14.8)

• Divide records into those with lot size > 14.4 and
those ≤ 14.4
• After evaluating that split, try the next one, which
is 15.4 (halfway between 14.8 and 16.0)
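A minimal R sketch of growing this tree with the rpart package, assuming the 24 records above sit in a data frame called mowers (the data frame name and control settings are illustrative):

```r
library(rpart)

# Grow a classification tree on the riding-mowers data.
# method = "class" requests a classification tree; Gini is rpart's default criterion.
mower_tree <- rpart(Ownership ~ Income + Lot_Size,
                    data = mowers, method = "class",
                    control = rpart.control(minsplit = 2, cp = 0))  # let it grow fully

print(mower_tree)                   # text listing of the splits and leaf classes
plot(mower_tree); text(mower_tree)  # quick tree diagram
```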
The first split: by lot size
The second split: by income
After all splits
Note: Categorical predictors
• Examine all possible ways in which the categories can
be split.
• E.g., categories A, B, C can be split in 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, number of splits becomes huge
• XLMiner supports only binary categorical variables
• R can handle any categorical variable
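The growth in the number of candidate splits is easy to quantify; a small R sketch (the helper name n_splits is just for illustration):

```r
# Distinct binary splits of a categorical predictor with c categories: 2^(c-1) - 1
n_splits <- function(c) 2^(c - 1) - 1
n_splits(3)    # 3, matching the {A}/{B,C}, {B}/{A,C}, {C}/{A,B} example
n_splits(10)   # 511 candidate splits -- this is why many categories get expensive
```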
MEASURING IMPURITY
Measuring Impurity
• Gini impurity index
• Entropy
Gini Impurity Index
• The Gini impurity index for rectangle A is given by
I(A) = 1 − Σ pk²   (sum over k = 1, …, m)
• m is the number of classes of the response
variable
• pk = proportion of observations in rectangle A
that belong to class k, k = 1, …, m
Gini Impurity Index
• It takes the value 0 when all the observations
belong to the same class; it is the minimum value
• It takes the value (m-1)/m when all m classes are
equally represented; it is the maximum value
• Clearly, for a two-class case, the Gini impurity
index is maximum when pk = 0.5 (that is, when
the rectangle contains 50% of each of the two
classes)
Entropy
• The entropy of rectangle A is given by
entropy(A) = − Σ pk log2(pk)   (sum over k = 1, …, m)
• pk = proportion of observations in rectangle A
that belong to class k, k = 1, …, m
• Entropy ranges between 0 (most pure) and
log2(m) (equal representation of classes)
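Both impurity measures are straightforward to compute from the class labels in a rectangle; a small R sketch (function names are illustrative):

```r
# Class proportions in a rectangle, given its vector of class labels
props <- function(y) table(y) / length(y)

# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(y) {
  p <- props(y)
  1 - sum(p^2)
}

# Entropy: -sum of p * log2(p), dropping zero proportions (0 * log2(0) is taken as 0)
entropy <- function(y) {
  p <- props(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

y <- c("owner", "owner", "owner", "non-owner")  # a toy rectangle
gini(y)      # 0.375
entropy(y)   # about 0.811
```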
Impurity and Recursive Partitioning
• Obtain overall impurity measure (weighted
avg. of individual rectangles)
• At each successive stage, compare this
measure across all possible splits in all
variables
• Choose the split that reduces impurity the
most
• Chosen split points become nodes on the tree
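A sketch of how one predictor is scanned for its best split, reusing the gini() helper above (object names are illustrative):

```r
# Weighted-average impurity of the two rectangles created by splitting
# predictor x at value s, for class labels y
split_impurity <- function(x, s, y) {
  left  <- y[x <= s]
  right <- y[x >  s]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Score every candidate split value of Lot_Size and keep the best one
candidates <- sort(unique(mowers$Lot_Size))
scores <- sapply(candidates, function(s)
  split_impurity(mowers$Lot_Size, s, mowers$Ownership))
candidates[which.min(scores)]   # split value giving the largest impurity reduction
```

The same scan is repeated for every predictor, and the predictor/value pair with the lowest weighted impurity becomes the next node.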
First Split – The Tree
Tree after three splits
Tree structure
• Split points become nodes on tree (circles with
split value in center)
• Terminal nodes represent “leaves” (terminal
points, no further splits, classification value
noted)
• Numbers over lines between nodes indicate
number of cases
• Read down tree to derive rule
E.g., If lot size < 19, and if income > 84.75, then class = “owner”
Determining Leaf Node Label
• Each leaf node label is determined by “voting”
of the records within it, and by the cutoff
value
• Records within each leaf node are from the
training data
• Default cutoff=0.5 means that the leaf node’s
label is the majority class
• Cutoff = 0.25: requires 25% or more “1”
records in the leaf to label it a “1” node
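A sketch of applying a non-default cutoff in R, assuming a fitted classification tree fit, a validation data frame valid, and the class labels "owner"/"non-owner" from the example (all names are illustrative):

```r
# Predicted probability of the "owner" class = proportion of owners
# among the training records in the record's leaf
p_owner <- predict(fit, newdata = valid, type = "prob")[, "owner"]

pred_50 <- ifelse(p_owner >= 0.5,  "owner", "non-owner")  # default: majority class
pred_25 <- ifelse(p_owner >= 0.25, "owner", "non-owner")  # 25% owners in the leaf is enough
```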
Classifying a new observation
• “Drop” the observation down the tree in such
a way that at each decision node an
appropriate branch is taken
• Continue until a node is reached that has no
successor (that is, a leaf node)
• Classify that observation according to the
label of that leaf node
• Build the tree using training data; assess its
accuracy using validation data
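A minimal end-to-end sketch of that workflow in R (the 60/40 split and object names are illustrative):

```r
library(rpart)
set.seed(1)
n <- nrow(mowers)
train_idx <- sample(n, round(0.6 * n))   # 60% training, 40% validation
train <- mowers[train_idx, ]
valid <- mowers[-train_idx, ]

fit <- rpart(Ownership ~ Income + Lot_Size, data = train, method = "class")

# Drop the validation records down the tree and compare to their actual labels
pred <- predict(fit, newdata = valid, type = "class")
table(predicted = pred, actual = valid$Ownership)   # confusion matrix
mean(pred == valid$Ownership)                       # validation accuracy
```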
Tree after all splits
THE OVERFITTING PROBLEM
Stopping Tree Growth
• Natural end of process is 100% purity in each
leaf
• This overfits the data: the tree ends up fitting
the noise in the training data
• Overfitting leads to low predictive accuracy on
new data
• Past a certain point, the error rate for the
validation data starts to increase
Full-grown Tree Error Rate
Some ways to stop tree growth
• One can control the following to stop tree growth
- Tree depth (i.e., number of splits)
- Minimum number of records in a terminal
node
- Minimum reduction in impurity
• The problem is that it is not clear which of the
above provides a good stopping rule
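In R's rpart these controls map onto arguments of rpart.control; a sketch with arbitrary values:

```r
library(rpart)
ctrl <- rpart.control(maxdepth  = 5,     # limit the depth (number of split levels)
                      minbucket = 10,    # minimum number of records in a terminal node
                      cp        = 0.01)  # minimum required improvement (complexity parameter)
fit_small <- rpart(Ownership ~ ., data = train, method = "class", control = ctrl)
```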
CHAID
• CHAID: Chi-squared automatic interaction
detection
• Uses chi-square statistical test to limit tree
growth
• Splitting stops when purity improvement is
not statistically significant
• Widely used in database marketing
CHAID process
• At each node, we split on the predictor that
has strongest association with the response
• Strength of association is measured by the p-
value of a chi-squared test of independence
• If for the best predictor, the test does not
show a significant improvement, the split is
not carried out, and the tree is terminated
• Suitable for categorical predictors; can be
adapted to continuous predictors by binning
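The test behind the stopping rule is the usual chi-squared test of independence; a one-line sketch in R, where Education_Level is a hypothetical categorical predictor:

```r
# p-value of the chi-squared test between a candidate predictor and the response;
# a large p-value (e.g. > 0.05) means the split would not be carried out
chisq.test(table(train$Education_Level, train$Ownership))$p.value
```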
Pruning
• Let the tree grow to full extent, then prune it
back
• Pruning consists of successively selecting a
decision node and redesignating it as a leaf
node (by lopping off branches extending
beyond that decision node and thereby
reducing the size of the tree)
Pruning
• Idea is to find that point at which the
validation error begins to rise
• Generate successively smaller trees by
pruning leaves
• At each pruning stage, multiple trees are
possible
• Use cost complexity to choose the best tree at
that stage
Cost Complexity
• The cost complexity of a tree T is given by
CC(T) = Err(T) + α L(T)
CC(T) = cost complexity of the tree
Err(T) = proportion of misclassified records for the training data
L(T) = number of leaf (terminal) nodes in the tree
α = penalty factor attached to tree size (set by user)
• When α is very small, we get a full-grown unpruned tree
• When α is very large (infinity), we get the tree with the fewest
nodes
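The trade-off is easy to see numerically; a toy sketch of CC(T) with made-up numbers:

```r
# Cost complexity: training error plus a penalty per unit of tree size
cc <- function(err, leaves, alpha) err + alpha * leaves

cc(err = 0.02, leaves = 20, alpha = 0)     # 0.02 -- alpha = 0 favors the full tree
cc(err = 0.02, leaves = 20, alpha = 0.01)  # 0.22
cc(err = 0.10, leaves = 3,  alpha = 0.01)  # 0.13 -- a larger alpha favors the smaller tree
```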
Algorithm with Cost Complexity
• Among trees of given size, choose the one
with lowest CC
• Do this for each size of tree
• Finally, among these, choose the tree that gives
the smallest misclassification error on the
validation data (minimum error tree)
Using Validation Error to Prune
Pruning process yields a set of trees of different
sizes and associated error rates
Two trees of interest:
• Minimum error tree
Has lowest error rate on validation data
• Best pruned tree
Smallest tree within one std. error of min. error
This adds a bonus for simplicity/parsimony
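A sketch of both choices with rpart, whose built-in cross-validation error plays the role of the validation error here (object names are illustrative):

```r
library(rpart)
full_tree <- rpart(Ownership ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))  # grow to full extent

printcp(full_tree)   # subtree sizes with cross-validated error (xerror) and its std. error (xstd)

cptab  <- full_tree$cptable
best   <- which.min(cptab[, "xerror"])                 # row of the minimum error tree
thresh <- cptab[best, "xerror"] + cptab[best, "xstd"]  # one std. error above the minimum

min_error_tree   <- prune(full_tree, cp = cptab[best, "CP"])
best_pruned_tree <- prune(full_tree,                   # smallest tree within one std. error
                          cp = cptab[which(cptab[, "xerror"] <= thresh)[1], "CP"])
```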
REGRESSION TREES
Regression Trees for Prediction
• Used with numerical (continuous) outcome
variable
• Procedure similar to classification tree
• Many splits are attempted, then the one that
minimizes impurity is chosen
Regression Trees for Prediction
• Prediction is computed as the average of the
numerical target variable in the rectangle (in
classification trees it is the majority vote)
• Impurity is measured by the sum of squared
deviations from the leaf mean
• Performance is measured by RMSE (root mean
squared error)
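A minimal regression-tree sketch in R, assuming hypothetical data frames housing_train and housing_valid with a numeric response price:

```r
library(rpart)
reg_tree <- rpart(price ~ ., data = housing_train, method = "anova")  # "anova" = regression tree

pred <- predict(reg_tree, newdata = housing_valid)    # prediction = mean of the leaf
sqrt(mean((housing_valid$price - pred)^2))            # RMSE on the validation data
```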
IMPROVING PREDICTIONS: BAGGING,
RANDOM FORESTS, AND BOOSTING
Bagging
• Decision trees suffer from high variance
• This means that if we split the training data
into two parts at random, and fit a tree to
both halves, the results could be quite
different
• But we know that a procedure with low
variance will give similar results if applied
repeatedly to distinct datasets
Bagging
• Bootstrap aggregation, or bagging, is a general
purpose tool for reducing variance of a statistical
learning method
• It follows from the principle that averaging a set
of observations reduces variance
• The idea is to take many training sets, build a
separate prediction model on each training set,
and then average the results
• Since we do not have “many” training sets, we
use a statistical resampling procedure called
the bootstrap
Decision Trees and Bagging: Algorithm
• The basic algorithm is the following:
- Draw multiple random samples (say, B), with
replacement, from the given data
(bootstrapping)
- To each sample, fit a tree
- Combine the predictions/classifications from
the individual trees to obtain improved
prediction
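A hand-rolled sketch of this loop in R for a classification problem (B, the data frames train and valid, and the response Ownership are carried over from the earlier sketches; in practice a package would do this):

```r
library(rpart)
set.seed(1)
B <- 150
n <- nrow(train)

# Steps 1-2: draw B bootstrap samples and fit a deep, unpruned tree to each
bag_trees <- lapply(1:B, function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Ownership ~ ., data = train[idx, ], method = "class",
        control = rpart.control(cp = 0, minsplit = 2))
})

# Step 3: combine -- majority vote across the B trees for each validation record
votes <- sapply(bag_trees, function(t)
  as.character(predict(t, newdata = valid, type = "class")))
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bag_pred == valid$Ownership)   # bagged accuracy on the validation data
```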
Choosing Number of Trees to Fit
• B is not a critical parameter with bagging
• Using very large value of B will not lead to
over-fitting
• B should be sufficiently large (like 150 or so) to
have satisfactory performance
Combining the Results
• For regression trees:
- Construct B regression trees using B
bootstrapped training sets
- Average resulting predictions
- These are deep, un-pruned trees, with high
variance but low bias
- Averaging these B trees reduces variance
Combining the Results
• For classification trees:
- For a given test observation, record the class
predicted by each of the B trees
- Then, take a majority vote
- That is, the final prediction is the most
commonly occurring class among the B
predictions
Random Forests
• Random Forests (RF) is an approach similar to
bagging, based on multiple trees
• Improves the predictive performance of trees
• Based on an idea of bootstrapping
• Unlike single trees (and like bagging), not easily
interpretable
• Unlike single trees (and like bagging), does not
have a nice graphical representation
Random Forests: Algorithm
• The algorithm, in a very basic form, is the
following:
- Draw multiple random samples, with
replacement, from the given data (bootstrapping)
- To each sample, fit a tree using a random subset
of predictors
- Combine the predictions/classifications from
the individual trees to obtain improved prediction
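A sketch with the randomForest package, assuming Ownership is a factor (argument values are illustrative):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Ownership ~ Income + Lot_Size, data = train,
                   ntree = 500,         # number of bootstrapped trees
                   mtry  = 1,           # random subset of predictors tried at each split
                   importance = TRUE)   # keep variable importance scores

rf_pred <- predict(rf, newdata = valid)
table(predicted = rf_pred, actual = valid$Ownership)
```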
Random Forests and Bagging: Further
Details
• RF and Bagging cannot be displayed in tree
diagrams, so the visual representation is lost
• RF and Bagging can produce “variable
importance” (VI) scores
• VI measures relative contribution of each
predictor
• The VI score for a predictor is obtained by summing
the decrease in the impurity index due to that
predictor over all the trees
• The higher the score, the more important the variable
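Continuing the randomForest sketch above, the VI scores come out of the fitted object directly:

```r
importance(rf)   # per-predictor importance, including the mean decrease in Gini impurity
varImpPlot(rf)   # higher values = more important predictors
```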
Boosted Trees for Classification
Problems
• Another type of multi-tree improvement
• A sequence of trees is fitted
• Each tree concentrates on misclassified
records from the previous tree
Boosted Trees for Classification
Problems
• Algorithm:
- Step 1: Fit a single tree
- Step 2: Draw a sample that gives higher
selection probabilities to misclassified records
- Step 3: Fit a tree to the new sample
- Step 4: Repeat steps 2 and 3 multiple times
- Step 5: Use weighted voting to classify
records, with heavier weights for later trees
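A hand-rolled sketch of these five steps in R, using an AdaBoost-style weighting scheme for the “heavier weights for later trees” part (class labels and object names follow the earlier mower sketches; in practice packages such as adabag or gbm would be used):

```r
library(rpart)
set.seed(1)
B <- 25
n <- nrow(train)
w <- rep(1 / n, n)              # selection probabilities, initially uniform
trees <- vector("list", B)
alpha <- numeric(B)             # voting weight of each tree

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE, prob = w)    # Step 2: oversample hard records
  trees[[b]] <- rpart(Ownership ~ ., data = train[idx, ], method = "class")  # Steps 1 & 3
  miss <- predict(trees[[b]], train, type = "class") != train$Ownership
  err <- min(max(sum(w * miss), 1e-10), 1 - 1e-10)  # weighted training error
  alpha[b] <- log((1 - err) / err)                  # more accurate trees vote with more weight
  w <- w * exp(alpha[b] * miss); w <- w / sum(w)    # Step 4: up-weight misclassified records
}

# Step 5: weighted vote for the class "owner" on the validation data
owner_votes <- sapply(1:B, function(b)
  alpha[b] * (predict(trees[[b]], valid, type = "class") == "owner"))
boost_pred  <- ifelse(rowSums(owner_votes) > sum(alpha) / 2, "owner", "non-owner")
```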
Boosted Trees for Classification
Problems: Why Does This Work?
• Consider a binary response – 0 (unimportant class),
and 1 (important class)
• Typically, 0s are dominant in numbers
• Basic classifiers are tempted to classify cases as
belonging to the dominant class
• Naturally, 1s in this case constitute most of the
misclassifications with the single best-pruned tree
• Boosting concentrates on misclassifications (which are
mostly 1s)
• So, this naturally reduces misclassification of 1s!
Boosted Trees for Regression Problems
• The process is similar
• Trees are grown sequentially, learning from
the trees previously fitted
• In this case, the trees are fitted sequentially to
the residuals of the previous trees rather than
the outcome Y as the response
• The residuals are updated after each tree, and
the fitted trees are combined to form the final
prediction
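A minimal sketch of this fit-to-the-residuals idea for a numeric response, reusing the hypothetical housing data; the shrinkage rate lambda and the shallow-tree depth are assumptions in the spirit of gradient boosting, not a particular package's algorithm:

```r
library(rpart)
B <- 100; lambda <- 0.1                      # number of trees and learning rate
r <- housing_train$price                     # residuals start out as the outcome itself
f_hat <- rep(0, nrow(housing_train))         # current fitted values on the training data
boost_trees <- vector("list", B)

for (b in 1:B) {
  d <- housing_train
  d$price <- r                               # fit the next tree to the current residuals
  boost_trees[[b]] <- rpart(price ~ ., data = d, method = "anova",
                            control = rpart.control(maxdepth = 2))
  step  <- predict(boost_trees[[b]], housing_train)
  f_hat <- f_hat + lambda * step             # add a shrunken copy of the new tree
  r     <- r - lambda * step                 # update the residuals
}

# Final prediction for new data: sum of the shrunken trees
boost_predict <- function(newdata)
  Reduce(`+`, lapply(boost_trees, function(t) lambda * predict(t, newdata)))
```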
Advantages of Trees
• Easy to use, understand
• Produce rules that are easy to interpret &
implement
• Variable selection & reduction is automatic
• Do not require the assumptions of statistical
models
• Can work without extensive handling of
missing data
Disadvantages of Trees
• May not perform well where there is structure
in the data that is not well captured by
horizontal or vertical splits

• Since the process deals with one variable at a
time, there is no direct way to capture
interactions between variables
Summary
• Classification and Regression Trees are an easily understandable
and transparent method for predicting or classifying new records

• A tree is a graphical representation of a set of rules

• Trees must be pruned to avoid over-fitting of the training data

• As trees do not make any assumptions about the data structure,
they usually require large samples

• Bagging, Random Forests, and Boosting are tools that can improve
predictions/classifications, at the cost of interpretability and
representability
