
Advanced Analytics in Business

Big Data Platforms & Technologies


Supervised Learning: Basic Techniques
Overview
Regression
Logistic regression
K-NN
Decision and regression trees

2
The analytics process

3
Recall
Supervised learning
You have a labelled data set at your disposal
Correlate features to target
Common case: predict the future based on patterns observed now (predictive)
Classification (categorical) versus regression (continuous)

Unsupervised learning
Describe patterns in data
Clustering, association rules, sequence rules
No labelling required
Common case: descriptive, explanatory

For supervised learning, our data set will contain a label

4
Recall
Regression: continuous label
Classification: categorical label

For classification:
Binary classification (positive/negative outcome)
Multiclass classification (more than two possible outcomes)
Ordinal classification (target is ordinal)
Multilabel classification (multiple outcomes are possible)

For regression:
Absolute values
Delta values
Quantile regression

Single versus multi-output models

Most classification use cases use a binary categorical variable:
Churn prediction: churn yes/no
Credit scoring: default yes/no
Fraud detection: suspicious yes/no
Response modeling: customer buys yes/no
Predictive maintenance: needs check yes/no

5
Defining your target
Recommender system: a form of multi-class? Multi-label?
Survival analysis: instead of yes/no predict the “time until event occurs”

Oftentimes, different approaches are possible

Regression, quantile regression, mean residuals regression?


Or: predicting the absolute value or the change?
Or: convert manually to a number of bins and perform classification?
Or: reduce the groups to two outcomes?
Or: sequential binary classification (“classifier chaining”)?
Or: perform segmentation first and build a model per segment?

6
Regression

7
Regression

https://xkcd.com/605/

8
Linear regression
A simple linear parametric model:

y = β0 + β1 x1 + β2 x2 + … + ε with ε ∼ N(0, σ)

β0: mean response when x = 0 (the y-intercept)
β1: change in the mean response when x1 increases by one unit
The weights are the only parameters we can optimize

How to determine the parameters? Set them so that we minimize a loss function:

argmin_β ∑_{i=1..n} (yi − ŷi)²: minimize the sum of squared errors (SSE) (or the MSE)

OLS: “Ordinary Least Squares”. Why SSE though?
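
As a minimal illustration (toy data and coefficients assumed, not from the slides), OLS can be fit in a couple of lines with scikit-learn; the estimated weights are the β's that minimize the SSE:

# Minimal OLS sketch on simulated data: estimate beta_0, beta_1, beta_2 by
# minimizing the sum of squared errors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # two features
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

ols = LinearRegression().fit(X, y)                         # ordinary least squares
print(ols.intercept_, ols.coef_)                           # roughly 1.5, [2.0, -0.5]
print(((y - ols.predict(X)) ** 2).sum())                   # the SSE being minimized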

9
Logistic Regression

10
Logistic regression
Classification is solved as well?

Customer Age Income Gender … Response


John 30 1200 M No → 0
Sophie 24 2200 F No → 0
Sarah 53 1400 F Yes → 1
David 48 1900 M No → 0

Seppe 35 800 M Yes → 1

ŷ = β0 + β1 age + β2 income + β3 gender

But no guarantee that output is 0 or 1


Okay fine, a probability then (this is actually better), but no guarantee that outcome is between 0 and 1 either
Target and errors also not normally distributed (assumption of OLS violated)

11
Logistic regression
We use a bounding function (the logistic, sigmoid function) to limit the outcome between 0 and 1:

f(z) = 1 / (1 + e^(−z))

Same basic formula, but now with the goal of binary classification

Two possible outcomes: either 0 or 1, no or yes – a categorical, binary label, not continuous
Logistic regression is thus a technique for classification rather than regression (target is a binary categorical)
Though the predictions are still continuous: between ]0,1[ – so a probability

12
Logistic regression
Linear regression with a transformation such that the output is always between 0 and 1,
and can thus be interpreted as a probability (e.g. response or churn probability)

P(response = yes | age, income, gender) =
1 − P(response = no | age, income, gender) =
1 / (1 + e^(−(β0 + β1 age + β2 income + β3 gender)))

Or (“logit” – natural logarithm of the odds):

ln( P(response = yes | age, income, gender) / P(response = no | age, income, gender) ) = β0 + β1 age + β2 income + β3 gender

13
Logistic regression
Customer Age Income Gender … Response
John 30 1200 M No → 0
Sophie 24 2200 F No → 0
Sarah 53 1400 F Yes → 1
David 48 1900 M No → 0
Seppe 35 800 M Yes → 1

↓

The predictive model: a formula

1 / (1 + e^(−(0.10 + 0.22 age + 0.05 income − 0.80 gender)))

↓

Customer Age Income Gender … Response Score
Will 44 1500 M 0.76
Emma 28 1000 F 0.44

Not very spectacular, but note:

Easy to understand
Easy to construct
Easy to implement
Well-calibrated
In some settings, the end result will be a logistic model “extracted” from more complex approaches
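
A sketch of the same idea in scikit-learn (the toy data follows the table above, but the fitted coefficients will not match the slide's illustrative formula):

# Fit a logistic regression on the toy customer table and score new customers.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "age":      [30, 24, 53, 48, 35],
    "income":   [1200, 2200, 1400, 1900, 800],
    "gender":   [1, 0, 0, 1, 1],               # dummy encoded: M = 1, F = 0
    "response": [0, 0, 1, 0, 1],
})
clf = LogisticRegression().fit(train[["age", "income", "gender"]], train["response"])

new = pd.DataFrame({"age": [44, 28], "income": [1500, 1000], "gender": [1, 0]})
print(clf.predict_proba(new)[:, 1])             # response scores in ]0, 1[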

14
Logistic regression
If Xi increases by 1:

logit|Xi+1 = logit|Xi + βi
odds|Xi+1 = odds|Xi × e^βi

e^βi: the “odds ratio”: multiplicative increase in the odds when Xi increases by 1 (other variables constant)

βi > 0 → e^βi > 1 → odds/probability increase with Xi
βi < 0 → e^βi < 1 → odds/probability decrease with Xi

Doubling amount:

Amount of change in Xi required for doubling the primary outcome odds
Doubling amount for Xi = log(2)/βi
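
A small worked example of these quantities (the coefficient value is hypothetical):

# Interpreting a logistic regression weight as an odds ratio and a doubling amount.
import math

beta_income = 0.05                      # assumed coefficient for income
odds_ratio = math.exp(beta_income)      # multiplicative change in odds per unit increase
doubling = math.log(2) / beta_income    # change in income needed to double the odds

print(round(odds_ratio, 3))             # ~1.051: odds increase ~5.1% per unit
print(round(doubling, 1))               # ~13.9 units to double the odds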

15
Logistic regression
Easy to interpret and understand
Statistical rigor, a “well-calibrated” classifier
Linear decision boundary, though interaction effects can be taken into the model (explicitly – which allows for easy investigation)
Sensitive to outliers and correlated features
Categorical variables need to be converted (dummy encoding being the most common approach, though recall the ways to reduce a large number of dummies)
Sensitive to the curse of dimensionality…

16
Regularization

17
Stepwise approaches
“Parsimonious” models:

If a “smaller” model works just as well as a “larger” one, prefer the smaller one
Also: “curse of dimensionality”

Makes sense: most statistical techniques don’t like dumping in your whole feature set all
at once

Selection based approaches (build up final model step-by-step):

Forward selection
Backward selection
Hybrid (stepwise) selection

See MASS::stepAIC, leaps::regsubsets, caret, or simply step in R

Only recently implemented in scikit-learn
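
In scikit-learn this is available as SequentialFeatureSelector (since version 0.24); a sketch on a built-in dataset:

# Forward stepwise selection driven by cross-validated performance.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(estimator,
                                n_features_to_select=5,   # target model size
                                direction="forward",      # or "backward"
                                cv=5)                     # score via cross-validation
sfs.fit(X, y)
print(sfs.get_support())                                  # mask of selected features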
18
Stepwise approaches
Trying to get the best, smallest model given some information about a large number of
variables is a reasonable idea

Many sources cover stepwise selection methods


However, stepwise selection is not really a legitimate way to do this

Frank Harrell (1996):

“It yields R-squared values that are badly biased to be high. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are falsely narrow (Altman and Andersen, 1989). It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; Tibshirani, 1996). It has severe problems in the presence of collinearity. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses. Increasing the sample size does not help very much (Derksen and Keselman, 1992). It uses a lot of paper.”
(https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)

19
Stepwise approaches
Some of these issues have been / can be fixed (e.g. using proper tests), but…

Developing and confirming a model based on the same dataset is called


“ data dredging. Although there is some underlying relationship amongst
the variables, and stronger relationships are expected to yield stronger
scores, these are random variables and the realized values contain
error. Thus, when you select variables based on having better realized
values, they may be such because of their underlying true value, error,
or both. True, using the AIC is better than using p-values, because it
penalizes the model for complexity, but the AIC is itself a random
variable (if you run a study several times and fit the same model, the “
AIC will bounce around just like everything else).

This actually already reveals something we’ll visit again when talking about evaluation!

Take-away: use a proper train-validation-test setup!


20
Regularization

SSE_Model1 = (1 − 1)² + (2 − 2)² + (3 − 3)² + (8 − 4)² = 16
SSE_Model2 = (1 − (−1))² + (2 − 2)² + (3 − 5)² + (8 − 8)² = 8

21
Regularization
Key insight: introduce a penalty on the size of the weights

Constrained, rather than fewer parameters!


Makes the model less sensitive to outliers, improves generalization

Lasso and ridge regression:

Standard: y = β0 + β1 x1 + … + βp xp + ε with argmin_β ∑_{i=1..n} (yi − ŷi)²
Lasso regression (L1 regularization): argmin_β ∑_{i=1..n} (yi − ŷi)² + λ ∑_{j=1..p} |βj|
Ridge regression (L2 regularization): argmin_β ∑_{i=1..n} (yi − ŷi)² + λ ∑_{j=1..p} βj²

No penalization on the intercept!

Obviously: standardization/normalization is now required!

22
Lasso and ridge regression

https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.1

23
Lasso and ridge regression
Lasso will force coefficients to become zero, ridge only keeps them within bounds
Variable selection “for free” with Lasso
Why ridge, then? Easier to implement (slightly) and faster to compute (slightly), or when you have a limited number of variables to
begin with
Lasso will also not consider grouping effects (i.e. picks a variable at random when variables are correlated), will not work when
number of instances is less than number of features (but then again nothing really will)

In practice lasso tends to work well even with small sample sizes
How to pick a good value for lambda: cross-validation! (See later)
Works both for linear and logistic regression, concept of L1 and L2 regularization also pops up with other
model types (e.g. SVM’s, Neural Networks) and fields:
Tikhonov regularization (Andrey Tikhonov), ridge regression (statistics), weight decay (machine learning), the Tikhonov–Miller
method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization

Need to normalize variables beforehand to ensure that the regularisation term λ regularises/affects the variables involved in a similar manner!

MATLAB always uses the centred and scaled variables for the computations within ridge. It just back-transforms them
“ before returning them.
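
A sketch of both penalties in scikit-learn (which calls λ “alpha”), with standardization in a pipeline and the penalty strength picked by cross-validation; the dataset is just a built-in example:

# Lasso and ridge with standardization and cross-validated penalty strength.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))).fit(X, y)

print(lasso[-1].alpha_, lasso[-1].coef_)   # some coefficients driven exactly to 0
print(ridge[-1].alpha_, ridge[-1].coef_)   # shrunk, but typically all non-zero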

24
Lasso and ridge regression

25
Elastic net
Every time you have two similar approaches, there’s an easy extension opportunity by
proposing to combine them (and giving the result a new name)…

argmin_β ∑_{i=1..n} (yi − ŷi)² + λ1 ∑_{j=1..p} |βj| + λ2 ∑_{j=1..p} βj²

Combines L1 and L2 penalties


Retains benefit of introducing sparsity
Good at getting grouping effects
Implemented in R and Python (check the documentation: everybody disagrees on how to call λ1 and λ2)
Grid search on two parameters necessary
Lasso parameter will be the most pronounced in most practical settings
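
A possible scikit-learn sketch; ElasticNetCV searches the penalty strength for each value of the L1/L2 mix (l1_ratio), which covers the two-parameter grid mentioned above:

# Elastic net: grid over the L1/L2 mix, penalty strength tuned automatically per mix.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 0.99], cv=5),
).fit(X, y)
print(enet[-1].l1_ratio_, enet[-1].alpha_)   # selected mix and penalty strength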

26
Some Other Forms of Regression

27
Non-parametric regression
“Non-parametric” meaning “no underlying
distribution assumed, purely data-driven”

“Smoothers” such as LOESS (locally weighted scatterplot


smoothing)
Does not require specification of a function to fit a model,
only a “smoothing” parameter
Very flexible, but requires large data samples (because
LOESS relies on local data structure to provide local
fitting), does not produce a regression function
Take care when using this as a “model”, more an
exploratory means!

28
Generalized additive models
(GAMs): similar in concept to normal regression, but uses splines and other smoothing functions in a linear combination
Benefit: capture non-linearities by smooth functions
Functions can be parametric, non-parametric, polynomial, local weighted mean, …
Flexible, best of both worlds approach
Danger of overfitting: stringent validation required
Theoretical relation to boosting (which we’ll discuss later)

Very nice technique but not that well known

aerosolve - Machine learning for humans


“ A machine learning library designed from the ground up to be human friendly.
A general additive linear piecewise spline model. The training is done at a higher resolution specified by num_buckets between the min
and max of a feature’s range. At the end of each iteration we attempt to project the linear piecewise spline into a lower dimensional “
function such as a polynomial spline with Dirac delta endpoints.
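
A possible Python sketch using the third-party pygam package (an assumption, not mentioned on the slide; R users would typically reach for mgcv), where each feature gets its own smooth spline term:

# GAM sketch with pygam: y modeled as a sum of smooth functions of the features.
from pygam import LinearGAM, s
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X[:, :3], y)   # one smooth term per feature
gam.summary()                                          # per-term effective dof, p-values
preds = gam.predict(X[:, :3])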

29
Generalized additive models
Henckaerts, Antonio, et al., 2017:

log(E(nclaims)) = log(exp) + β0 + β1 coveragePO + β2 coverageFO + β3 fueldiesel + f1(ageph) + f2(power) + f3(bm) + f4(ageph, power) + f5(long, lat)

30
Multinomial logistic regression
P(yi = k | Xi) = β0,k + β1,k x1,i + ⋯ + βM,k xM,i

K possible outcomes, M features

For K outcomes, construct K − 1 binary logistic regression models:

ln( P(yi = 1 | Xi) / P(yi = K | Xi) ) = β∙,1 ⋅ Xi
ln( P(yi = 2 | Xi) / P(yi = K | Xi) ) = β∙,2 ⋅ Xi
…
ln( P(yi = K−1 | Xi) / P(yi = K | Xi) ) = β∙,K−1 ⋅ Xi

P(yi = K | Xi) = 1 − ∑_{k=1..K−1} P(yi = k | Xi) = 1 − ∑_{k=1..K−1} P(yi = K | Xi) e^{β∙,k ⋅ Xi}

and thus P(yi = K | Xi) = 1 / (1 + ∑_{k=1..K−1} e^{β∙,k ⋅ Xi})
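
In scikit-learn, LogisticRegression fits a multinomial (softmax) model when there are more than two classes and the lbfgs solver is used; note that this parametrization keeps one coefficient vector per class rather than K − 1 logits against a reference class. A short sketch:

# Multinomial logistic regression on a built-in three-class dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)               # K = 3 outcomes, M = 4 features
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_.shape)                          # (3, 4): one coefficient vector per class
print(clf.predict_proba(X[:2]).sum(axis=1))     # class probabilities sum to 1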

31
Ordinal logistic regression
P (yi = D|Xi ) = P (yi ≤ D|Xi )

P (yi = C|Xi ) = P (yi ≤ C|Xi ) − P (yi ≤ D|Xi )

P (yi = B|Xi ) = P (yi ≤ B|Xi ) − P (yi ≤ C|Xi )

P (yi = A|Xi ) = P (yi ≤ A|Xi ) − P (yi ≤ B|Xi )

P (yi = AA|Xi ) = P (yi ≤ AA|Xi ) − P (yi ≤ A|Xi )

P (yi = AAA|Xi ) = 1 − P (yi ≤ AA|Xi )

ln( P(yi ≤ R | Xi) / (1 − P(yi ≤ R | Xi)) ) = −θR + β1 x1 + ⋯ + βn xn

but since P (yi ≤ AAA|Xi ) = 1, θAAA = ∞

The logit functions for all ratings are parallel since they only differ in the intercept
(“proportional odds model”)
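
A possible proportional-odds sketch using statsmodels' OrderedModel (assumed available in recent statsmodels versions; the simplified rating scale and the data are made up):

# Ordinal (proportional odds) logistic regression: one set of slopes, K-1 thresholds.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
latent = 1.0 * X["x1"] - 0.5 * X["x2"] + rng.logistic(size=300)
y = pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
           labels=["D", "C", "B"], ordered=True)   # hypothetical ordered ratings

model = OrderedModel(y, X, distr="logit")          # parallel (proportional) logits
result = model.fit(method="bfgs", disp=False)
print(result.summary())                            # slopes plus threshold parameters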

32
PCR and PLS
Principal Component Regression (PCR):

Key idea: perform PCA first on the features and then perform normal regression
Number of components to be tuned using cross-validation (see later)
Standardization required as PCA is scaling-sensitive

Partial Least Squares (PLS) regression:

PCR does not take the response into account
PLS performs PCA but now includes the target as well
The variance aspect often dominates, so PLS will behave closely to PCR in many settings
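
A sketch of both in scikit-learn, with PCR written as a pipeline and PLS via cross_decomposition; the number of components would normally be tuned by cross-validation:

# PCR (PCA then OLS) versus PLS, which uses the target when building components.
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()).fit(X, y)
pls = PLSRegression(n_components=3).fit(X, y)    # scales X internally by default

print(pcr.score(X, y), pls.score(X, y))          # R^2 on the training data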

33
Decision Trees

34
Decision trees
Both for classification and regression

We’ll discuss classification first

Based on recursively partitioning the data

Splitting decision: How to split a node?


Age < 30, income < 1000, status = married?

Stopping decision: When to stop splitting?


When to stop growing the tree?

Assignment decision: How to assign a label outcome in the leaf nodes?


Which class to assign to the leaf node?

35
Terminology

36
ID3
ID3 (Iterative Dichotomiser 3)

Most basic decision tree algorithm, by Ross Quinlan (1979,


1986)
Begin with the original set S as the root node
On each iteration of the algorithm, iterate through every unused attribute of the set S and calculate a measure for that attribute, e.g. entropy H(S) and information gain IG(A, S)
Select the best attribute, split S on the selected attribute to
produce subsets
Continue to recurse on each subset, considering only attributes
not selected before (for this particular branch of the tree)
Recursion stops when every element in a subset belongs to the same class label, or there are no more
attributes to be selected, or there are no instances left in the subset

37
ID3

38
ID3

39
ID3

40
Impurity measures
Which measure? Based on impurity

••••••
⤿ Minimal impurity

••••••
⤿ Also minimal impurity

••••••
⤿ Maximal impurity

41
Impurity measures

Intuitively, it’s easy to see that:

•••••• → ••• + •••


Is better than:

•••••• → ••• + •••


But what about:

•••••• → •••• + ••

We need a measure…

42
Entropy
Entropy is a measure of the amount of uncertainty in a data set (information theory)

H(S) = −∑_{x∈X} p(x) log2(p(x)) with S the data (sub)set, X the classes (e.g. {yes, no}), and p(x) the proportion of elements with class x over |S|

When H(S) = 0, the set is completely pure (all elements belong to the same class)

43
Entropy
Expected amount of “information” from a distribution:

E[info] = ∑_{x∈X} p(x) × (measure of information)

In information theory, the standard unit of information is “a bit” or “the Shannon (Sh)”:

I = −log2(p)

E.g. an observation or rule that cuts the space of possibilities in half provides one bit of information, as −log2(1/2) = 1
In a quarter: two bits, as −log2(1/4) = 2

I = −log2(p) ⟺ I = log2(1/p) ⟺ 2^I = 1/p ⟺ (1/2)^I = p

E[info] = ∑_{x∈X} p(x) × −log2(p(x)): the expected value of information (expected number of bits)
Only bounded between 0 and 1 in the case of two possible outcomes!

Information theory worked on by Claude Shannon in 1940s, and talking about these ideas with John von Neumann who suggested “entropy” as
a name for the uncertainty value as it “will give you an advantage in a debate”. (https://www.youtube.com/watch?v=v68zYyaEmEA)

44
Entropy

For the original data set, we get:

#yes #no x = yes x = no Entropy


9 5 -0.41 -0.53 0.94

45
Information gain

•••••• → •••• + ••
We can calculate the entropy of the original set and all the subsets, but how to measure the
improvement?

Information gain: measure of the difference in impurity before and after the split

How much uncertainty was reduced by a particular split?

IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t) with T the set of subsets obtained by splitting the original set S on attribute A and p(t) = |t|/|S|

46
Information gain
Original set S:

#yes #no x = yes x = no Entropy
9    5   -0.41   -0.53  0.94

Calculate the entropy for all subsets created by all candidate splitting features:

Attribute    Subset    #yes #no x = yes x = no Entropy
Outlook      Sunny     2    3   -0.53   -0.44  0.97
             Overcast  4    0    0.00    0.00  0
             Rain      3    2   -0.44   -0.53  0.97
Temperature  Hot       2    2   -0.50   -0.50  1
             Mild      4    2   -0.39   -0.53  0.92
             Cool      3    1   -0.31   -0.50  0.81
Humidity     High      3    4   -0.52   -0.46  0.99
             Normal    6    1   -0.19   -0.40  0.59
Wind         Strong    3    3   -0.50   -0.50  1
             Weak      6    2   -0.31   -0.50  0.81

47
Information gain
IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t)

IG(outlook, S) = 0.94 − (5/14 × 0.97 + 4/14 × 0.00 + 5/14 × 0.97) = 0.94 − 0.69 = 0.25 ← highest IG
IG(temperature, S) = 0.03
IG(humidity, S) = 0.15
IG(wind, S) = 0.05
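
The numbers above can be reproduced in a few lines (hypothetical helper function, not part of any library):

# Entropy of the 9-yes/5-no set and the information gain of splitting on outlook.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

H_S = entropy([9, 5])                                   # ~0.94
subsets = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
weighted = sum(sum(c) / 14 * entropy(c) for c in subsets.values())
print(round(H_S, 2), round(H_S - weighted, 2))          # 0.94 and IG(outlook) ~0.25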


48
ID3: after one split

49
ID3: continued

Recursion stops when every element in a subset belongs to the same class label, or there
are no more attributes to be selected, or there are no instances left in the subset

50
ID3: final tree

Assign labels to the leaf nodes: easy – just pick the most common class

51
Impurity measures
Entropy (Shannon index/uncertainty) is not the only
measure of impurity that can be used:

Entropy: H(S) = −∑_{x∈X} p(x) log2(p(x))

Gini diversity index: Gini(S) = 1 − ∑_{x∈X} p(x)²

Not very different: Gini works a bit better for continuous variables (see after) and is a little faster
Most implementations default to this approach

Classification error: ClassErr(S) = 1 − max_{x∈X} p(x)

Something to think about: why not use accuracy directly? Or another metric of interest
such as AUC or precision, F1, …?

52
Summary so far
Using the tree to predict new labels is easy: just follow the questions in the tree and look at the outcome
Easy to understand, easy (for a computer) to construct
Can be easily expressed as simple IF…THEN rules as well, can hence be easily implemented in existing
programs (even as a SQL procedure)

Side note: algorithms exist which directly try to introduce prediction models in the form of
a rule base, with RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
being the most well known. It’s just as old and leads to very similar models, but has some
interesting differences and is (nowadays) not widely implemented or known about
anymore
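
For example, scikit-learn can dump a fitted tree as IF…THEN-style rules via export_text (sketch on a built-in dataset):

# Train a small tree and print it as nested decision rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))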

53
https://christophm.github.io/interpretable-ml-book/rules.html

See also RuleFit (https://github.com/christophM/rulefit) and Skope-Rules


(https://github.com/scikit-learn-contrib/skope-rules) for interesting, newer approaches

54
Problems still to solve
ID3 is greedy: it never backtracks (i.e. retraces on previous
steps) during the construction of the tree, it only moves
forward

This means that global optimality of the tree is not guaranteed


Algorithms exist which overcome this (see e.g. evtree package for R),
though they’re often slow and do not give much better results (so
greedy is good enough)

A bigger problem however is the fact that we do not have a


way to tackle continuous variables, such as temperature = 21, 24, 26, 27, 30, …

Another problem is that the “grow for as long as you can” strategy leads to trees which
will be horribly overfitting!

55
Spotting overfitting

(Note that this is a good motivating case to illustrate the difference between “supervised
methods for predictive analytics” or “for descriptive analytics”)

56
C4.5
Also by Ross Quinlan
Extension of ID3: still uses information gain
Main contribution: dealing with continuous variables
Can also deal with missing values
The original paper describes that you just ignore them when calculating the impurity measure and information gain
Though most implementations do not implement this!

Also allows setting importance weights on attributes (biasing the information gain, basically)
Describes methods to prune trees

57
C4.5: continuous variables
Say we want to split on temperature = 21, 24, 26, 27, 30, …

Obviously, using the values just as we do with categorical features would not be a good idea
Would lead to a lot of subsets, many of which potentially having only a few instances
And when applying the tree on new data: the chance of encountering a value which was unseen during training is much higher than for categoricals, e.g. what if the temperature is 22?

Instead, enforce binary splits by:

Splitting on temperature <= 21 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 24 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 26 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 27 → two subsets (yes, no) – and calculate the information gain
And so on…
Important: only the distinct set of values seen in the training set are considered (other values wouldn’t
change the information gain), though some papers propose changes to make this a little more stable
58
C4.5: continuous variables
Note that each of our “temperature <= …” splits lead to a yes/no outcome

Couldn’t we do the same for categorical features as well?

Humidity = “high” → two subsets (yes, no) – and calculate the information gain
Humidity = “medium” → two subsets (yes, no) – and calculate the information gain
Humidity = “low” → two subsets (yes, no) – and calculate the information gain

Yes, and it turns out that constructing such a binary tree is better

Information gain measure is biased towards preferring splits that lead to more subsets (avoided now as we
always get two subsets)
Obviously: we can now re-use attributes throughout the tree as each attribute leads to multiple binary
subsets
Tree can be deeper, but less wide
Weka is a weird exception

59
C4.5: preventing overfitting
Early stopping: stop based on a stop criterion

E.g. when number of instances in subset goes below a threshold


Or when depth gets too high
Or: set aside a validation set during training and stop when performance on this set starts to decrease (easy
strategy if you have enough training data to begin with)

60
C4.5: preventing overfitting
Pruning: grow the full tree, and then reduce it

Merge back leaf nodes if they add little power to the classification accuracy
“Inside every big tree is a small, perfect tree waiting to come out.” – Dan Steinberg
Many forms exist: weakest-link pruning, cost-complexity pruning (common), etc…
Oftentimes governed by a “complexity” parameter in most implementations
Only recently in sklearn: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py

Scoring of new instances can now return a “probability”

61
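
A sketch of both strategies with scikit-learn's DecisionTreeClassifier: early stopping via max_depth / min_samples_leaf, pruning via the cost-complexity parameter ccp_alpha (the values used here are arbitrary):

# Early stopping versus cost-complexity pruning for a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stopped = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)   # grow fully, then cut back

for clf in (stopped, pruned):
    clf.fit(X_tr, y_tr)
    print(clf.get_n_leaves(), clf.score(X_te, y_te))              # tree size and test accuracy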


C4.5: probabilities
Note that fully-grown decision trees will most likely not be able to provide a probability

The leaf node contains either only positive or negative instances

If you prune or stop early:

Can be used to rank predictions across different instances

But: not well-calibrated!


Also doesn’t take leaf node frequency into account by default
62
C5 (See5)
Also by Ross Quinlan
Commercial offering
Faster, memory efficient, smaller trees, weighting of cases and supplying misclassification costs
Not widely adopted… only recently made open source, better open source implementations available
Support for boosting (see next course)

Outdated

63
A few final remarks…
Conditional inference trees:

Another method which instead of using information gain uses a significance test procedure in order to select
variables

Preprocessing:

Decision trees are robust to outliers


Only missing value treatment needed
Some implementations have proposed three-way splits (yes / no / NA)

Multiclass:

Concept of decision trees easily extends to multiclass setting

64
A few final remarks…
Categorization:

Recall preprocessing: possible to run a decision tree on one continuous variable only to suggest a good
binning based on the leaf nodes

Interaction effects and nonlinearities:

Considered by default by decision trees


CHAID (Chi-square automatic interaction detection): chi-square based test to split trees

Variable selection:

Based on features that pop up earlier in the tree

Not well-calibrated and an unstable classifier:

Sensitive to changes in training data: a small change can cause your tree to look different

65
A few final remarks
Remember to prevent overfitting trees

But a deep tree does not necessarily mean that you have a problem
And a short tree does not necessarily mean that it’s not overfitting

It is very likely that your leaf nodes will not be completely pure (i.e. not containing 100%
yes cases or no cases)

66
A few final remarks
Decision trees are simple to understand and present, require very little data preparation,
and can learn non-linear relationships

In fact, many ways exist to take your favorite implementation and extend it

E.g. when domain experts have their favorite set of features, you can easily constrain the tree to only consider
a subset of the features in the first n levels
You might even consider playing with more candidates to generate binary split rules, e.g. “feature X between
A and B?”, see also oblique decision trees

67
Regression trees
As made popular by CART: Classification And Regression Trees
Instead of calculating the #yes’s vs. #no’s to get the predicted
class label, take the mean of the continuous label and use that as
the outcome

But how to select the splitting criterion?

Squared residuals minimization algorithm which implies that


expected sum of variances for two resulting nodes should be
minimized
Find nodes with minimal within variance… and therefore
greatest between variance (a little bit like k-means, a clustering
technique we’ll visit later)
Based on sum of squared errors: find the split that produces the greatest separation in SSE in each subset
Or MSE × #samples in group

Important: pruning/early stopping still applies, though a regression tree will typically need to be deeper than
a classification one
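
A minimal scikit-learn sketch; DecisionTreeRegressor splits on (weighted) squared error and predicts the leaf mean (parameter values are arbitrary):

# Regression tree: each leaf predicts the mean target of its training instances.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error",   # the default splitting criterion
                            max_depth=4, min_samples_leaf=20,
                            random_state=0).fit(X, y)
print(reg.predict(X[:3]))                                # each prediction is a leaf mean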

68
Regression trees

69
Regression trees

70
Regression trees

71
Regression trees

72
A few final remarks
Visualizing:

R: use fancyRpartPlot() from the rattle package for nicer visualizations


Python: use dtreeviz (https://github.com/parrt/dtreeviz) for nicer visualizations or pybaobabdt
(https://pypi.org/project/pybaobabdt/)

73
Decision trees versus (logistic) regression

Decision boundary of decision trees: rectangular regions, with splits orthogonal to a feature dimension

74
Decision trees versus (logistic) regression
Decision trees can struggle to capture
linear relationships

The best it can do is a step function


approximation of a linear relationship
This is strictly related to how decision trees
work: it splits the input features in several
“orthogonal” regions and assigns a prediction
value to each region
Here, a deeper tree would be necessary to
approximate the linear relationship
Or: apply a transformation first (e.g. PCA)

75
K-nearest Neighbors

76
K-nearest neighbors (K-NN)
A non-parametric method used for classification and regression
“The data is the model”
“There’s a model that perfectly fits the training data: it’s called a database”
Trivially easy
Based on the concept of distances (so normalization/standardization required)

77
K-nearest neighbors (K-NN)
Has some appealing properties: easy to understand
Fun to tweak with custom distances, different values for k, dynamic k-values, custom distance measures… –
even a recommender system can be built using this approach
Provides surprisingly good results, given enough data
Regression: e.g. use a (weighted) average of the k nearest neighbors, weighted by the inverse of their distance
The main disadvantage of k-NN is that it is a “lazy learner”: it does not learn anything from the training data
and simply uses the training data itself as a model
This means you don’t get a formula, or a tree as a summarizing model
And that you need to keep training data around
Relatively slow
Not as stable for noisy data, might be hard to generalize, unstable with regards to k setting
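
A minimal sketch with scikit-learn, standardizing first (k-NN is distance-based) and weighting neighbors by inverse distance:

# k-NN classification with scaling and distance-weighted voting.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),  # inverse-distance weighting
).fit(X, y)
print(knn.predict_proba(X[:2]))                                # class "probabilities" from the vote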

78
K-nearest neighbors (K-NN)

https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea

79
Wrap up
Linear regression, logistic regression and decision trees still widely used techniques for tabular data
White box, statistical rigor, easy to construct and interpret

K-NN is used less, but can provide a quick baseline and is very extensible


Decision between regression and (which) classification not always a clear cut choice!
Neither is the decision between unsupervised and supervised
Iteration, domain expert involvement required

80
