
Machine Learning — Statistical Methods for Machine Learning

Tree Predictors
Instructor: Nicolò Cesa-Bianchi version of March 13, 2023

As we already said, while certain types of data (like images and texts) have a natural representation
as vectors x ∈ Rd , others (like medical records and other structured data) do not. For example,
consider a problem of medical diagnosis, where data consist of medical records containing the
following fields
age ∈ {12, . . . , 90}
smoker ∈ {yes, no, ex}
weight ∈ [10, 120]
sex ∈ {M, F}
therapy ∈ {antibiotics, cortisone, none}
Even if we compute rescaled numerical representations for the features (including the categorical
fields smoker and sex), algorithms based on Euclidean distance like k-NN may not work well.

In order to learn data whose features vary in heterogeneous sets X1 , . . . , Xd (i.e., sets with incomparable ranges, including ranges corresponding to categorical variables), we introduce a new family of predictors: the tree predictors.

A tree predictor has the structure of an ordered and rooted tree where each node is either a leaf (if
it has zero children) or an internal node (if it has at least two children). Recall that an ordered tree
is one where the children of any internal node are numbered consecutively. Hence, if the internal
node v has k ≥ 2 children, we can access the first child, the second child, and so on until the k-th
child.

[Figure 1 shows a tree whose root tests outlook with branches sunny, overcast, and rainy. The sunny branch tests humidity (≤ 70 gives +1, > 70 gives −1), the overcast branch is a leaf labeled +1, and the rainy branch tests windy (yes gives −1, no gives +1).]

Figure 1: A classical example of a tree classifier for a binary classification task. The features are:
outlook, humidity, and windy.

Fix X = X1 ×· · ·×Xd , where Xi is the range of the i-th attribute xi . A tree predictor hT : X → Y
is a predictor defined by a tree T whose internal nodes are tagged with tests and whose leaves are
tagged with labels in Y. A test on attribute i for an internal node with k children is a function
f : Xi → {1, . . . , k}. The function f maps each element of Xi to one of the node’s k children. For example,
if Xi ≡ {a, b, c, d} and k = 3, then f could be defined by
\[
f(x_i) = \begin{cases}
1 & \text{if } x_i = c,\\
2 & \text{if } x_i = d,\\
3 & \text{if } x_i \in \{a, b\}.
\end{cases}
\]

An example with Xi = R and k = 3 is the following:
\[
f(x_i) = \begin{cases}
1 & \text{if } x_i \in (-\infty, \alpha],\\
2 & \text{if } x_i \in (\beta, +\infty),\\
3 & \text{if } x_i \in (\alpha, \beta],
\end{cases}
\]

where α < β are arbitrary values. See Figure 1 for an example.
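As a minimal illustration (hypothetical code, not from the notes), a three-way test of this kind can be written as a small function parameterized by the thresholds α and β; the name make_test is an illustrative choice.

```python
# A three-way test f: R -> {1, 2, 3} with thresholds alpha < beta (illustrative).
def make_test(alpha, beta):
    def f(x_i):
        if x_i <= alpha:        # x_i in (-infinity, alpha]
            return 1
        elif x_i > beta:        # x_i in (beta, +infinity)
            return 2
        else:                   # x_i in (alpha, beta]
            return 3
    return f

f = make_test(alpha=10.0, beta=20.0)
print(f(5.0), f(25.0), f(15.0))   # prints: 1 2 3
```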

The prediction hT (x) is computed as follows. Start by assigning v ← r, where r is the root of T ;

1. if v is a leaf ℓ, then stop and let hT (x) be the label y ∈ Y associated with ℓ;
2. otherwise, if f : Xi → {1, . . . , k} is the test associated with v, then assign v ← vj where
j = f (xi ) and vj denotes the j-th child of v;
3. go to step 1.

If the computation of hT (x) terminates in leaf ℓ, we say that the example x is routed to ℓ. Hence
hT (x) is always the label of the leaf to which x is routed.
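To make the routing procedure concrete, here is a minimal Python sketch (not part of the original notes). The names Leaf, Node, and predict, and the use of 0-based child indices, are illustrative choices; the example tree encodes the classifier of Figure 1.

```python
# Minimal sketch of a tree predictor and the routing loop computing h_T(x).
# Leaf, Node, and predict are illustrative names; child indices are 0-based.

class Leaf:
    def __init__(self, label):
        self.label = label          # label y in Y associated with the leaf

class Node:
    def __init__(self, attribute, test, children):
        self.attribute = attribute  # index i of the attribute tested here
        self.test = test            # test f: X_i -> {0, ..., k-1}
        self.children = children    # ordered list of k >= 2 subtrees

def predict(tree, x):
    """Route x from the root to a leaf and return that leaf's label."""
    v = tree
    while not isinstance(v, Leaf):      # step 2: apply the test and descend
        j = v.test(x[v.attribute])
        v = v.children[j]
    return v.label                      # step 1: stop at a leaf

# The tree of Figure 1, with x = (outlook, humidity, windy).
figure1_tree = Node(
    attribute=0,  # outlook
    test=lambda o: {"sunny": 0, "overcast": 1, "rainy": 2}[o],
    children=[
        Node(attribute=1, test=lambda h: 0 if h <= 70 else 1,      # humidity
             children=[Leaf(+1), Leaf(-1)]),
        Leaf(+1),
        Node(attribute=2, test=lambda w: 0 if w == "yes" else 1,   # windy
             children=[Leaf(-1), Leaf(+1)]),
    ],
)
print(predict(figure1_tree, ("sunny", 65, "no")))   # prints 1
```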

How do we build a tree predictor given a training set S? For simplicity, we focus on the case of
binary classification Y = {−1, 1} and we only consider complete binary trees, i.e., all internal nodes
have exactly two children. The idea is to grow the tree classifier starting from a single-node tree
(which must be a leaf) that corresponds to the classifier assigning to any data point the label that
occurs most frequently in the training set. The tree is grown by picking a leaf (at the beginning
there is only a leaf to pick) and replacing it with an internal node and two new leaves.

Suppose we have grown a tree T up to a certain point, and the resulting classifier is hT . We start by
computing the contributions of each leaf to the training error ℓS (hT ) (recall that each x is classified
by some leaf, the leaf which x is routed to). For each leaf ℓ, define Sℓ ≡ {(xt , yt ) ∈ S : xt is routed to ℓ}.
That is, Sℓ is the subset of training examples that are routed to ℓ. Define further two subsets of
Sℓ , namely Sℓ+ ≡ {(xt , yt ) ∈ Sℓ : yt = +1} and Sℓ− ≡ {(xt , yt ) ∈ Sℓ : yt = −1}.

For each leaf ℓ, let Nℓ+ = |Sℓ+|, Nℓ− = |Sℓ−|, and Nℓ = |Sℓ| = Nℓ− + Nℓ+. In order to minimize the training error ℓS(hT), the label associated with ℓ must be
\[
y_\ell = \begin{cases} +1 & \text{if } N_\ell^+ \ge N_\ell^-,\\ -1 & \text{otherwise.} \end{cases}
\]
Thus, ℓ errs on exactly min{Nℓ−, Nℓ+} training examples in Sℓ. Therefore, we can write the training error as a sum of contributions due to all leaves,
\[
\hat{\ell}_S(h_T) = \frac{1}{m} \sum_{\ell} \min\left\{ \frac{N_\ell^-}{N_\ell},\, \frac{N_\ell^+}{N_\ell} \right\} N_\ell
= \frac{1}{m} \sum_{\ell} \psi\!\left( \frac{N_\ell^+}{N_\ell} \right) N_\ell
\]

where we introduced the function ψ(a) = min{a, 1 − a} defined on [0, 1]. Recall that Nℓ+/Nℓ + Nℓ−/Nℓ = 1, so the argument of ψ is a number between zero and one.
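As a small worked illustration (not from the notes; the per-leaf counts below are made up), the decomposition can be evaluated directly from the counts Nℓ+ and Nℓ−:

```python
# Computing the training error from per-leaf counts via the psi decomposition.
# The leaf counts are hypothetical; psi is min{a, 1-a} as in the text.

def psi(a):
    return min(a, 1.0 - a)

leaves = [(8, 2), (1, 5), (4, 4)]          # (N_plus, N_minus) for each leaf
m = sum(n_plus + n_minus for n_plus, n_minus in leaves)   # m = 24 examples

error = sum(psi(n_plus / (n_plus + n_minus)) * (n_plus + n_minus)
            for n_plus, n_minus in leaves) / m
# Each leaf errs on min{N_plus, N_minus} examples: 2 + 1 + 4 = 7 out of 24.
print(error)   # ~0.2917, i.e. 7/24
```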


Figure 2: A step in the growth of a tree classifier: a leaf ℓ is replaced by an internal node v and
two new leaves ℓ′ and ℓ′′.

Suppose we replace a leaf ℓ in T with an internal node (together with its associated test) and two new leaves ℓ′ and ℓ′′ (see Figure 2). Can the training error of the new tree be larger than the training error of T? To answer this question it is sufficient to observe that ψ is a concave function (just like the logarithm). We can then apply Jensen’s inequality, stating that ψ(αa + (1 − α)b) ≥ αψ(a) + (1 − α)ψ(b) for all a, b ∈ [0, 1] and all α ∈ [0, 1].

Hence, via Jensen’s inequality, we can study how the training error changes when ℓ is replaced by
two new leaves ℓ′ and ℓ′′ ,
\[
\underbrace{\psi\!\left(\frac{N_\ell^+}{N_\ell}\right) N_\ell}_{\text{contribution of } \ell}
= \psi\!\left(\frac{N_{\ell'}}{N_\ell}\,\frac{N_{\ell'}^+}{N_{\ell'}} + \frac{N_{\ell''}}{N_\ell}\,\frac{N_{\ell''}^+}{N_{\ell''}}\right) N_\ell
\ge \left(\frac{N_{\ell'}}{N_\ell}\,\psi\!\left(\frac{N_{\ell'}^+}{N_{\ell'}}\right) + \frac{N_{\ell''}}{N_\ell}\,\psi\!\left(\frac{N_{\ell''}^+}{N_{\ell''}}\right)\right) N_\ell
= \underbrace{\psi\!\left(\frac{N_{\ell'}^+}{N_{\ell'}}\right) N_{\ell'}}_{\text{contribution of } \ell'}
+ \underbrace{\psi\!\left(\frac{N_{\ell''}^+}{N_{\ell''}}\right) N_{\ell''}}_{\text{contribution of } \ell''}
\]

meaning that a split never increases the training error.

A leaf ℓ such that Nℓ+ ∈ {0, Nℓ} is called pure because it does not contribute to the training error.
Note that ℓS(hT) > 0 unless all leaves are pure.

We now describe a generic method to construct a binary tree given a training set S; a code sketch of this procedure follows the two steps below.

1. Initialization: Create T with only the root ℓ and let Sℓ = S. Let the label associated with
the root be the most frequent label in Sℓ .
2. Main loop: pick a leaf ℓ and replace it with an internal node v creating two children ℓ′ (first
child) and ℓ′′ (second child). Pick an attribute i and a test f : Xi → {1, 2}. Associate the
test f with v and partition Sℓ into the two subsets

Sℓ′ = {(xt , yt ) ∈ Sℓ : f (xt,i ) = 1} and Sℓ′′ = {(xt , yt ) ∈ Sℓ : f (xt,i ) = 2} .

Let the labels associated with ℓ′ and ℓ′′ be, respectively, the most frequent labels in Sℓ′ and
Sℓ′′ .
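The two steps above can be turned into a simple greedy loop. The sketch below is not the notes’ pseudocode: choose_leaf and choose_split are hypothetical callbacks that pick the leaf to expand and the attribute/test to use, and max_nodes is an illustrative stopping criterion.

```python
# Rough sketch of the generic growth procedure for a binary tree classifier
# with labels in {-1, +1}. choose_leaf, choose_split, and max_nodes are
# illustrative assumptions, not part of the original notes.

from collections import Counter

def most_frequent_label(S):
    """Majority label among the examples (x, y) in S (ties broken as +1)."""
    counts = Counter(y for _, y in S)
    return +1 if counts[+1] >= counts[-1] else -1

def grow_tree(S, choose_leaf, choose_split, max_nodes=50):
    # Initialization: a single leaf labeled with the majority label of S.
    root = {"kind": "leaf", "S": S, "label": most_frequent_label(S)}
    leaves = [root]
    for _ in range(max_nodes):
        if not leaves:
            break
        leaf = choose_leaf(leaves)            # pick a leaf to expand
        i, f = choose_split(leaf["S"])        # pick attribute i and test f
        S1 = [(x, y) for x, y in leaf["S"] if f(x[i]) == 1]
        S2 = [(x, y) for x, y in leaf["S"] if f(x[i]) == 2]
        leaves.remove(leaf)
        if not S1 or not S2:                  # degenerate split: leave as a leaf
            continue
        # Replace the leaf with an internal node and two new labeled leaves.
        children = [{"kind": "leaf", "S": S1, "label": most_frequent_label(S1)},
                    {"kind": "leaf", "S": S2, "label": most_frequent_label(S2)}]
        leaf.update(kind="node", attribute=i, test=f, children=children)
        leaves.extend(children)
    return root
```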

Just like the classifiers generated by the k-NN algorithm, tree predictors may also suffer from
overfitting. In this case the relevant parameter is the number of tree nodes. If the number of tree
nodes grows too much compared to the cardinality of the training set, then the tree may overfit
the training data. For this reason, the choice of the leaf to expand should at least approximately
guarantee the largest decrease in the training error.

In practice, functions different from ψ(p) = min{p, 1 − p} are used to measure this decrease. This
happens because the min function might be problematic in certain circumstances. For example,
consider splitting a leaf where p = Nℓ+/Nℓ = 0.8, q = Nℓ′+/Nℓ′ = 0.6, r = Nℓ′′+/Nℓ′′ = 1, and α = Nℓ′/Nℓ = 0.5. In this case, when ψ(p) = min{p, 1 − p} we have that
\[
\psi(p) - \big( \alpha\,\psi(q) + (1-\alpha)\,\psi(r) \big) = 0.2 - \big( 0.5 \times 0.4 + 0.5 \times 0 \big) = 0\;.
\]

As this split leaves the training error unchanged, it would not be considered when growing the
tree, and the algorithm might even get stuck if no split can be found that decreases the training error.
On the other hand, the test in the new internal node correctly classifies half of the examples
in Sℓ, and all these correctly classified examples are routed to leaf ℓ′′, which is pure. Hence, half of
the data in Sℓ is “explained” by the split.

In order to fix this problem, different functions ψ are used in practice. These functions are similar to
min because they are symmetric around 1/2 and satisfy ψ(0) = ψ(1) = 0. However, unlike min, they
have a nonzero curvature (i.e., a strictly negative second derivative); see Figure 3. The curvature
helps in cases like the one described in the example above, that is, when p, q, r are all on the same
side with respect to 1/2 and p = αq + (1 − α)r. In this case, ψ(p) − (αψ(q) + (1 − α)ψ(r)) = 0
because on either side of 1/2 the function ψ(a) = min{a, 1 − a} is a straight line.

Some examples of functions ψ used in practice are:

• Gini function: ψ2(p) = 2p(1 − p).

• Scaled entropy: ψ3(p) = −(p/2) log2(p) − ((1 − p)/2) log2(1 − p).

• ψ4(p) = √(p(1 − p)).

The following inequalities hold: min{p, 1 − p} ≤ ψ2(p) ≤ ψ3(p) ≤ ψ4(p).
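The following small script (an illustration, not part of the notes) implements these functions and revisits the split from the example above: with ψ(p) = min{p, 1 − p} the decrease is zero, while the curved functions report a strictly positive decrease.

```python
# Comparing split-quality decreases psi(p) - (alpha*psi(q) + (1-alpha)*psi(r))
# for the example p = 0.8, q = 0.6, r = 1, alpha = 0.5 discussed in the text.

import math

def psi_min(p):
    return min(p, 1 - p)

def psi_gini(p):                       # psi_2
    return 2 * p * (1 - p)

def psi_entropy(p):                    # psi_3, with 0*log2(0) taken as 0
    term = lambda a: 0.0 if a in (0.0, 1.0) else -(a / 2) * math.log2(a)
    return term(p) + term(1 - p)

def psi_sqrt(p):                       # psi_4
    return math.sqrt(p * (1 - p))

p, q, r, alpha = 0.8, 0.6, 1.0, 0.5
for psi in (psi_min, psi_gini, psi_entropy, psi_sqrt):
    decrease = psi(p) - (alpha * psi(q) + (1 - alpha) * psi(r))
    print(psi.__name__, round(decrease, 4))
# psi_min yields zero (up to floating point), while the three curved functions
# yield a strictly positive decrease (e.g. 0.08 for the Gini function), so the
# split would not be discarded.
```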

Note that tree predictors can naturally be used to solve multiclass classification or regression
tasks as well. In the first case, the label associated with a leaf is, once more, the most frequent label among
all training examples routed to that leaf. In the regression case, where Y = R, the label associated
with a leaf is the mean of the labels of all training examples that are routed to that leaf.

An interesting feature of tree predictors for binary classification is that they can be represented with
a formula of propositional logic in disjunctive normal form (DNF). This representation is obtained
by considering the clauses (conjunctions of predicates) that result from the tests on each path that
leads from the root to a leaf associated with label +1. For example, the classifier corresponding to
the tree of Figure 1 is represented by the formula

((outlook = sunny) ∧ (humidity ≤ 70%)) ∨ (outlook = overcast) ∨ ((outlook = rainy) ∧ (windy = false)) .
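As a small illustration (hypothetical code, not from the notes), the formula can be evaluated directly as a Boolean predicate that assigns the same labels as the tree of Figure 1:

```python
# The DNF representation of the Figure 1 classifier as a Python predicate.
# One clause per root-to-leaf path ending in a +1 leaf.

def dnf_classifier(outlook, humidity, windy):
    clause_sunny = (outlook == "sunny") and (humidity <= 70)
    clause_overcast = (outlook == "overcast")
    clause_rainy = (outlook == "rainy") and (windy == "no")   # windy = false
    return +1 if (clause_sunny or clause_overcast or clause_rainy) else -1

print(dnf_classifier("sunny", 65, "no"))    # prints 1
print(dnf_classifier("rainy", 80, "yes"))   # prints -1
```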

Figure 3: Plots of the curves min{p, 1 − p} (green line) and ψ2, ψ3, ψ4.

This “rule-based” representation of the tree classifier is very intuitive, and lends itself to being
manipulated using the tools of propositional logic; for example, to obtain more compact representations
of the same classifier. More importantly, this representation provides an interpretable
description of the knowledge the learning algorithm extracted from the training set.
