3 - SupervisedIntro
2
The analytics process
3
Recall
Supervised learning
You have a labelled data set at your disposal
Correlate features to target
Common case: predict the future based on patterns observed now (predictive)
Classification (categorical) versus regression (continuous)
Unsupervised learning
Describe patterns in data
Clustering, association rules, sequence rules
No labelling required
Common case: descriptive, explanatory
4
Recall
Regression: continuous label
Classification: categorical label
For classification:
Binary classification (positive/negative outcome)
Multiclass classification (more than two possible outcomes)
For regression:
Absolute values
Delta values
Quantile regression
Most classification use cases use a binary categorical variable:
Churn prediction: churn yes/no
Credit scoring: default yes/no
Fraud detection: suspicious yes/no
Response modeling: customer buys yes/no
Predictive maintenance: needs check yes/no
5
Defining your target
Recommender system: a form of multi-class? Multi-label?
Survival analysis: instead of yes/no predict the “time until event occurs”
6
Regression
7
Regression
https://xkcd.com/605/
8
Linear regression
A simple linear parametric model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \epsilon$ with $\epsilon \sim N(0, \sigma)$
$\beta_0$: mean response when all $x = 0$ (y-intercept)
$\beta_1$: change in mean response when $x_1$ increases by one unit (other variables held constant)
The weights (the $\beta$'s) are the only parameters we can optimize
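A minimal sketch of fitting such a model with scikit-learn on made-up data (the coefficients and noise level below are purely illustrative); the fitted weights come out as `intercept_` and `coef_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features, continuous target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                               # x1, x2
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)  # + noise eps

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of beta_0 (mean response when all x = 0)
print(model.coef_)        # estimates of beta_1, beta_2
```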
9
Logistic Regression
10
Logistic regression
Can classification be solved this way as well?
11
Logistic regression
We use a bounding function to limit the outcome to between 0 and 1:
Two possible outcomes: either 0 or 1, no or yes – a categorical, binary label, not a continuous one
Logistic regression is thus a technique for classification rather than regression (the target is a binary categorical)
Though the predictions are still continuous: in ]0,1[ – so a probability
12
Logistic regression
Linear regression with a transformation such that the output is always between 0 and 1,
and can thus be interpreted as a probability (e.g. response or churn probability)
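Concretely, the bounding transformation is the logistic (sigmoid) function applied to the linear model; the same formula reappears with illustrative coefficients on the next slide:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} \quad\Longleftrightarrow\quad \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$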
13
Logistic regression
The predictive model: a formula
Easy to understand
Easy to construct
Easy to implement
Well-calibrated
In some settings, the end result will be a logistic model "extracted" from more complex approaches

Input data:
Customer | Age | Income | Gender | … | Response
John | 30 | 1200 | M | … | No → 0
Sophie | 24 | 2200 | F | … | No → 0
↓
$$\text{Score} = \frac{1}{1 + e^{-(0.10 + 0.22\,\mathrm{age} + 0.05\,\mathrm{income} - 0.80\,\mathrm{gender})}}$$
↓
Scored data:
Customer | Age | Income | Gender | … | Response | Score
14
Logistic regression
If $X_i$ increases by 1 (other variables held constant), the odds are multiplied by $e^{\beta_i}$: the "odds ratio"
$\beta_i > 0 \Rightarrow e^{\beta_i} > 1 \Rightarrow$ odds/probability increase with $X_i$
$\beta_i < 0 \Rightarrow e^{\beta_i} < 1 \Rightarrow$ odds/probability decrease with $X_i$
Doubling amount:
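The "doubling amount" is commonly defined as the increase in $X_i$ needed to double the odds, i.e. $\ln(2)/\beta_i$. As a worked example with the illustrative age coefficient from the previous slide ($\beta_{\text{age}} = 0.22$):

$$e^{0.22} \approx 1.25 \qquad \frac{\ln 2}{0.22} \approx 3.15$$

so each extra year of age multiplies the odds of response by roughly 1.25, and the odds double after roughly 3.2 years.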
15
Logistic regression
Easy to interpret and understand
Statistical rigor, a “well-calibrated” classifier
Linear decision boundary, though interaction effects can be added to the model (explicitly – which allows for easy investigation)
Sensitive to outliers and correlated features
Categorical variables need to be converted (dummy encoding being the most common approach, though recall the ways to reduce a large number of dummies)
Sensitive to the curse of dimensionality…
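A minimal scikit-learn sketch with made-up column names, illustrating the dummy-encoding step and the fit (note that sklearn's LogisticRegression applies L2 regularization by default; see the next section, and in practice you would also scale the numeric features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data frame; column names and values are purely illustrative
df = pd.DataFrame({
    "age":      [30, 24, 45, 52, 37, 29],
    "income":   [1200, 2200, 1800, 3100, 2500, 1400],
    "gender":   ["M", "F", "F", "M", "F", "M"],
    "response": [0, 0, 1, 1, 1, 0],
})

# Dummy-encode the categorical variable (drop_first avoids a redundant column)
X = pd.get_dummies(df[["age", "income", "gender"]], drop_first=True)
y = df["response"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.intercept_, clf.coef_)    # beta_0 and the other betas
print(clf.predict_proba(X)[:, 1])   # continuous scores in ]0,1[
```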
16
Regularization
17
Stepwise approaches
“Parsimonious” models:
If a “smaller” model works just as well as a “larger” one, prefer the smaller one
Also: “curse of dimensionality”
Makes sense: most statistical techniques don’t like dumping in your whole feature set all
at once
Forward selection
Backward selection
Hybrid (stepwise) selection
"It yields R-squared values that are badly biased to be high, the F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are falsely narrow (Altman and Andersen, 1989). It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; Tibshirani, 1996). It has severe problems in the presence of collinearity. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses. Increasing the sample size does not help very much (Derksen and Keselman, 1992). It uses a lot of paper."
(https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
19
Stepwise approaches
Some of these issues have / can be fixed (e.g. using
proper tests), but…
$SSE_{\text{Model 1}} = (1-1)^2 + (2-2)^2 + (3-3)^2 + (8-4)^2 = 16$
$SSE_{\text{Model 2}} = (1-(-1))^2 + (2-2)^2 + (3-5)^2 + (8-8)^2 = 8$
21
Regularization
Key insight: introduce a penalty on the size of the weights
Lasso regression (L1 regularization): $\arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
Ridge regression (L2 regularization): $\arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
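A minimal sketch with scikit-learn on made-up data ($\lambda$ is called `alpha` there; features are standardized first since the penalty is scale-sensitive):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X_std, y)   # L1: many coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X_std, y)   # L2: coefficients shrink but stay non-zero

print(lasso.coef_)
print(ridge.coef_)
```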
22
Lasso and ridge regression
https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.1
23
Lasso and ridge regression
Lasso will force coefficients to become zero, ridge only keeps them within bounds
Variable selection “for free” with Lasso
Why ridge, then? Easier to implement (slightly) and faster to compute (slightly), or when you have a limited number of variables to
begin with
Lasso will also not consider grouping effects (i.e. it picks one variable more or less at random from a set of correlated variables), and when features outnumber instances it can select at most as many variables as there are instances (but then again, few methods handle that setting well)
In practice lasso tends to work well even with small sample sizes
How to pick a good value for lambda: cross-validation! (See later)
Works both for linear and logistic regression, concept of L1 and L2 regularization also pops up with other
model types (e.g. SVM’s, Neural Networks) and fields:
Tikhonov regularization (Andrey Tikhonov), ridge regression (statistics), weight decay (machine learning), the Tikhonov–Miller
method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization
"MATLAB always uses the centred and scaled variables for the computations within ridge. It just back-transforms them before returning them."
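Following up on the cross-validation remark above, a minimal sketch of picking $\lambda$ with scikit-learn's LassoCV (the alpha grid and made-up data are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_std = StandardScaler().fit_transform(X)

# 5-fold cross-validation over a grid of candidate lambdas (called alphas here)
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5).fit(X_std, y)
print(lasso_cv.alpha_)   # the selected lambda
print(lasso_cv.coef_)
```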
24
Lasso and ridge regression
25
Elastic net
Every time you have two similar approaches, there’s an easy extension opportunity by
proposing to combine them (and giving the result a new name)…
$\arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$
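scikit-learn's ElasticNet implements this combination, though it is parameterized by an overall strength `alpha` and a mixing weight `l1_ratio` rather than by separate $\lambda_1$ and $\lambda_2$; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_std = StandardScaler().fit_transform(X)

# l1_ratio=0.5 puts equal weight on the L1 and L2 parts of the penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_std, y)
print(enet.coef_)
```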
26
Some Other Forms of Regression
27
Non-parametric regression
“Non-parametric” meaning “no underlying
distribution assumed, purely data-driven”
28
Generalized additive models
(GAMs): a similar concept to normal regression, but using splines and other smoothing functions in a linear combination
Benefit: capture non-linearities by smooth functions
Functions can be parametric, non-parametric, polynomial, local weighted mean, …
Flexible, best of both worlds approach
Danger of overfitting: stringent validation required
Theoretical relation to boosting (which we’ll discuss later)
29
Generalized additive models
Henckaerts, Antonio, et al., 2017:
30
Multinomial logistic regression
One linear predictor per class $k$: $\beta_{0,k} + \beta_{1,k} x_{1,i} + \cdots + \beta_{M,k} x_{M,i}$, with one class (here $K$) as the reference:
$\ln\left(\dfrac{P(y_i = 1 \mid X_i)}{P(y_i = K \mid X_i)}\right) = \beta_{\cdot,1} \cdot X_i$
$\ln\left(\dfrac{P(y_i = 2 \mid X_i)}{P(y_i = K \mid X_i)}\right) = \beta_{\cdot,2} \cdot X_i$
…
$\ln\left(\dfrac{P(y_i = K-1 \mid X_i)}{P(y_i = K \mid X_i)}\right) = \beta_{\cdot,K-1} \cdot X_i$
31
Ordinal logistic regression
$P(y_i = D \mid X_i) = P(y_i \le D \mid X_i)$
$\ln\left(\dfrac{P(y_i \le R \mid X_i)}{1 - P(y_i \le R \mid X_i)}\right) = -\theta_R + \beta_1 x_1 + \cdots + \beta_n x_n$
The logit functions for all ratings are parallel since they only differ in the intercept
(“proportional odds model”)
32
PCR and PLS
Principal Component Regression (PCR):
Key idea: perform PCA first on the features and then perform normal regression
Number of components to be tuned using cross-validation (see later)
Standardization required as PCA is scaling-sensitive
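A minimal sketch of PCR as a scikit-learn pipeline (standardize, then PCA, then ordinary linear regression; the data and the number of components are placeholders, the latter to be tuned by cross-validation):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)

pcr = Pipeline([
    ("scale", StandardScaler()),    # PCA is scaling-sensitive
    ("pca", PCA(n_components=5)),   # number of components: tune via CV
    ("reg", LinearRegression()),
])
pcr.fit(X, y)
print(pcr.score(X, y))   # R^2 on the training data
```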
33
Decision Trees
34
Decision trees
Both for classification and regression
35
Terminology
36
ID3
ID3 (Iterative Dichotomiser 3)
37
ID3
38
ID3
39
ID3
40
Impurity measures
Which measure? Based on impurity
••••••
⤿ Minimal impurity
••••••
⤿ Also minimal impurity
••••••
⤿ Maximal impurity
41
Impurity measures
•••••• → •••• + ••
We need a measure…
42
Entropy
Entropy is a measure of the amount of uncertainty in a data set (information theory)
$H(S) = -\sum_{x \in X} p(x) \log_2(p(x))$ with $S$ the data (sub)set, $X$ the classes (e.g. {yes, no}), and $p(x)$ the proportion of elements with class $x$ over $|S|$
When $H(S) = 0$, the set is completely pure (all elements belong to the same class)
43
Entropy
Expected amount of "information" from a distribution: $E[\text{info}] = \sum_{x \in X} p(x) \times$ "measure of information"
Measure of information: $I = -\log_2(p)$
E.g. an observation or rule that cuts the space of possibilities in half provides one bit of information, as $-\log_2(1/2) = 1$
$I = -\log_2(p) \iff I = \log_2(1/p) \iff 2^I = 1/p \iff \left(\tfrac{1}{2}\right)^I = p$
$E[\text{info}] = \sum_{x \in X} p(x) \times -\log_2(p(x))$: expected value of information (expected number of bits)
Note: entropy is only bounded between 0 and 1 in the case of two possible outcomes!
Information theory was developed by Claude Shannon in the 1940s; when discussing these ideas, John von Neumann suggested "entropy" as a name for the uncertainty measure, as it "will give you an advantage in a debate". (https://www.youtube.com/watch?v=v68zYyaEmEA)
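A small sketch of the entropy computation for a set of class labels (plain Python, no assumptions beyond the formula above; the example counts echo the information-gain slide coming up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.94: the 9-yes / 5-no example set
print(entropy(["yes"] * 3 + ["no"] * 3))   # 1.0: maximal impurity for two classes
```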
44
Entropy
45
Information gain
•••••• → •••• + ••
We can calculate the entropy of the original set and all the subsets, but how to measure the
improvement?
Information gain: measure of the difference in impurity before and after the split
$IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)$ with $T$ the set of subsets obtained by splitting the original set $S$ on attribute $A$, and $p(t) = |t|/|S|$
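And a small sketch of the information-gain computation itself, splitting on one categorical attribute (the toy attribute values and labels are made up; the entropy helper is repeated here so the snippet is self-contained):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(A, S) = H(S) - sum over subsets t of |t|/|S| * H(t)."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)                 # one subset per attribute value
    n = len(labels)
    weighted = sum(len(t) / n * entropy(t) for t in subsets.values())
    return entropy(labels) - weighted

# Toy example: a binary attribute that separates the classes reasonably well
attr   = ["a", "a", "a", "b", "b", "b"]
labels = ["yes", "yes", "no", "no", "no", "no"]
print(information_gain(attr, labels))
```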
46
Information gain
Original set S :
Calculate the entropy for all subsets created by all candidate splitting features:
47
Information gain
$IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)$
$IG(\text{outlook}, S) = 0.94 - \left(\tfrac{5}{14} \cdot 0.97 + \tfrac{4}{14} \cdot 0 + \tfrac{5}{14} \cdot 0.97\right) = 0.94 - 0.69 = 0.25$ ← highest IG
$IG(\text{temperature}, S) = 0.03$
$IG(\text{humidity}, S) = 0.15$
$IG(\text{wind}, S) = 0.05$
(Table: for each attribute, the subsets with their #yes and #no counts and entropies)
48
ID3: after one split
49
ID3: continued
Recursion stops when every element in a subset belongs to the same class label, or there
are no more attributes to be selected, or there are no instances left in the subset
50
ID3: final tree
Assign labels to the leaf nodes: easy – just pick the most common class
51
Impurity measures
Entropy (Shannon index/uncertainty) is not the only
measure of impurity that can be used:
Entropy: $H(S) = -\sum_{x \in X} p(x) \log_2(p(x))$
Gini impurity: not very different, works a bit better for continuous variables (see later) and is a little faster to compute
Most implementations default to Gini
Something to think about: why not use accuracy directly? Or another metric of interest
such as AUC or precision, F1, …?
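For reference, Gini impurity (one minus the sum of squared class proportions) is just as easy to compute; a small sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes x of p(x)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 3 + ["no"] * 3))   # 0.5: maximal impurity for two classes
print(gini(["yes"] * 6))                # 0.0: completely pure
```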
52
Summary so far
Using the tree to predict new labels is easy: just follow the questions in the tree and look at the outcome
Easy to understand, easy (for a computer) to construct
Can be easily expressed as simple IF…THEN rules as well, can hence be easily implemented in existing
programs (even as a SQL procedure)
Side note: algorithms exist which directly try to introduce prediction models in the form of
a rule base, with RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
being the most well known. It’s just as old and leads to very similar models, but has some
interesting differences and is (nowadays) not widely implemented or known about
anymore
53
https://christophm.github.io/interpretable-ml-book/rules.html
54
Problems still to solve
ID3 is greedy: it never backtracks (i.e. revisits earlier splits) during the construction of the tree, it only moves forward
Another problem is that the "grow for as long as you can" strategy leads to trees that will overfit horribly!
55
Spotting overfitting
(Note that this is a good motivating case to illustrate the difference between “supervised
methods for predictive analytics” or “for descriptive analytics”)
56
C4.5
Also by Ross Quinlan
Extension of ID3: still uses information gain
Main contribution: dealing with continuous variables
Can also deal with missing values
The original paper describes that you can simply ignore them when calculating the impurity measure and information gain
Though most implementations do not implement this!
Also allows setting importance weights on attributes (biasing the information gain, basically)
Describes methods to prune trees
57
C4.5: continuous variables
Say we want to split on temperature = 21, 24, 26, 27, 30, …
Obviously, using the values directly as we do with categorical features would not be a good idea
It would lead to a lot of subsets, many of which potentially contain only a few instances
And when applying the tree to new data, the chance of encountering a value unseen during training is much higher than for categoricals – e.g. what if temperature is 22?
Splitting on temperature <= 21 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 24 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 26 → two subsets (yes, no) – and calculate the information gain
Splitting on temperature <= 27 → two subsets (yes, no) – and calculate the information gain
And so on…
Important: only the distinct set of values seen in the training set are considered (other values wouldn’t
change the information gain), though some papers propose changes to make this a little more stable
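A sketch of that threshold search for one continuous feature: scan the distinct observed values and score each candidate binary split by information gain (the temperature values are the illustrative ones from the slide; the yes/no labels are made up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try 'feature <= v' for every distinct observed value v, return the best split."""
    base = entropy(labels)
    n = len(labels)
    best = None
    for v in sorted(set(values)):
        left  = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        if not right:   # splitting on the maximum value puts everything on one side
            continue
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

temperature = [21, 24, 26, 27, 30]
play        = ["yes", "yes", "yes", "no", "no"]
print(best_threshold(temperature, play))   # (26, ...): split on temperature <= 26
```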
58
C4.5: continuous variables
Note that each of our "temperature <= …" splits leads to a yes/no outcome
Could we do the same for categorical attributes?
Humidity = "high" → two subsets (yes, no) – and calculate the information gain
Humidity = "medium" → two subsets (yes, no) – and calculate the information gain
Humidity = "low" → two subsets (yes, no) – and calculate the information gain
Yes – and it turns out that constructing such a binary tree is better
Information gain measure is biased towards preferring splits that lead to more subsets (avoided now as we
always get two subsets)
Obviously: we can now re-use attributes throughout the tree as each attribute leads to multiple binary
subsets
Tree can be deeper, but less wide
Weka is a weird exception
59
C4.5: preventing overfitting
Early stopping: stop based on a stop criterion
60
C4.5: preventing overfitting
Pruning: grow the full tree, and then reduce it
Merge back leaf nodes if they add little power to the classification accuracy
“Inside every big tree is a small, perfect tree waiting to come out.” – Dan Steinberg
Many forms exist: weakest-link pruning, cost-complexity pruning (common), etc…
Oftentimes governed by a “complexity” parameter in most implementations
Only recently in sklearn: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py
Outdated
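A minimal sketch of cost-complexity pruning in scikit-learn via the `ccp_alpha` parameter (larger values prune more aggressively; the data below is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)   # noisy binary target

full   = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)   # the pruned tree is smaller
```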
63
A few final remarks…
Conditional inference trees:
Another method which instead of using information gain uses a significance test procedure in order to select
variables
Preprocessing:
Multiclass:
64
A few final remarks…
Categorization:
Recall preprocessing: possible to run a decision tree on one continuous variable only to suggest a good
binning based on the leaf nodes
Variable selection:
Sensitive to changes in training data: a small change can cause your tree to look different
65
A few final remarks
Remember to prevent overfitting trees
But a deep tree does not necessarily mean that you have a problem
And a short tree does not necessarily mean that it’s not overfitting
It is very likely that your leaf nodes will not be completely pure (i.e. not containing 100%
yes cases or no cases)
66
A few final remarks
Decision trees are simple to understand and present, require very little data preparation,
and can learn non-linear relationships
In fact, many ways exist to take your favorite implementation and extend it
E.g. when domain experts have their favorite set of features, you can easily constrain the tree to only consider
a subset of the features in the first n levels
You might even consider playing with more candidates to generate binary split rules, e.g. “feature X between
A and B?”, see also oblique decision trees
67
Regression trees
As made popular by CART: Classification And Regression Trees
Instead of calculating the #yes’s vs. #no’s to get the predicted
class label, take the mean of the continuous label and use that as
the outcome
Important: pruning/early stopping still applies, though a regression tree will typically need to be deeper than
a classification one
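A minimal sketch with scikit-learn's DecisionTreeRegressor on made-up one-dimensional data; each leaf predicts the mean of the training targets that end up in it, so the fitted function is piecewise constant:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# max_depth acts as a simple early-stopping / pruning knob
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[2.5], [7.5]]))   # leaf means for two query points
```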
68
Regression trees
69
Regression trees
70
Regression trees
71
Regression trees
72
A few final remarks
Visualizing:
73
Decision trees versus (logistic) regression
74
Decision trees versus (logistic) regression
Decision trees can struggle to capture
linear relationships
75
K-nearest Neighbors
76
K-nearest neighbors (K-NN)
A non-parametric method used for classification and regression
“The data is the model”
“There’s a model that perfectly fits the training data: it’s called a database”
Trivially easy
Based on the concept of distances (so normalization/standardization required)
77
K-nearest neighbors (K-NN)
Has some appealing properties: easy to understand
Fun to tweak: different values for k, dynamic k values, custom distance measures… – even a recommender system can be built using this approach
Provides surprisingly good results, given enough data
Regression: e.g. use a (weighted) average of the k nearest neighbors, weighted by the inverse of their distance
The main disadvantage of k-NN is that it is a “lazy learner”: it does not learn anything from the training data
and simply uses the training data itself as a model
This means you don’t get a formula, or a tree as a summarizing model
And that you need to keep training data around
Relatively slow
Not as stable for noisy data, might be hard to generalize, unstable with regard to the choice of k
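A minimal scikit-learn sketch on made-up data (scaling first, since k-NN is distance-based; `weights="distance"` gives the inverse-distance weighting mentioned above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

knn = Pipeline([
    ("scale", StandardScaler()),   # normalization/standardization required
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance")),
])
knn.fit(X, y)

# "The data is the model": prediction just looks up the nearest training points
print(knn.predict(rng.normal(size=(3, 3))))
```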
78
K-nearest neighbors (K-NN)
https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea
79
Wrap up
Linear regression, logistic regression and decision trees are still widely used techniques for tabular data
White box, statistical rigor, easy to construct and interpret
80