
Introduction to Machine Learning
Isabelle Guyon
isabelle@clopinet.com
What is Machine Learning?

[Diagram: TRAINING DATA → Learning algorithm → Trained machine; a Query sent to the trained machine returns an Answer]
What for?

• Classification
• Time series prediction
• Regression
• Clustering
Some Learning Machines

• Linear models
• Kernel methods
• Neural networks
• Decision trees
Applications

[Chart: number of training examples (10 to 10^5) vs. number of inputs (10 to 10^5), with application domains placed by typical problem size: Market Analysis, Ecology, Machine Vision, Text Categorization, OCR, HWR, System diagnosis, Bioinformatics]
Banking / Telecom / Retail

• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics

• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery

• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet

• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves

• Internet
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Challenges

[Chart: number of training examples vs. number of inputs for the NIPS 2003 & WCCI 2006 challenge datasets: Ada, Sylva, Dexter, Nova, Gisette, Gina, Madelon, Arcene, Dorothea, Hiva]
Ten Classification Tasks
[Histograms of test BER (%) over challenge entries for each of the ten tasks: ARCENE, ADA, DEXTER, GINA, DOROTHEA, HIVA, GISETTE, NOVA, MADELON, SYLVA]
Challenge Winning Methods

[Bar chart: BER/<BER> of the winning methods, grouped by method family (Linear/Kernel, Neural Nets, Trees/RF, Naïve Bayes), for each dataset: Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology)]
Conventions

[Diagram: data matrix X = {xij} with m rows (patterns xi) and n columns (features), target vector y = {yj}, weight vector w]
Learning problem
Data matrix: X
m lines = patterns (data points, examples): samples, patients, documents, images, …
n columns = features (attributes, input variables): genes, proteins, words, pixels, …

Unsupervised learning
Is there structure in data?
Supervised learning
Predict an outcome y.
Colon cancer, Alon et al 1999
Linear Models

• f(x) = w ⋅ x + b = Σj=1:n wj xj + b
Linearity in the parameters, NOT in the input components.
• f(x) = w ⋅ Φ(x) + b = Σj wj φj(x) + b (Perceptron)

• f(x) = Σi=1:m αi k(xi, x) + b (Kernel method)


Artificial Neurons
[Diagram: activations x1 … xn of other neurons arrive through synapses and dendrites with weights w1 … wn, are summed together with the bias b (constant input 1) to form the cell potential, and the activation function outputs f(x) on the axon]
McCulloch and Pitts, 1943: f(x) = w ⋅ x + b


Linear Decision Boundary

[3-D scatter plot in (x1, x2, x3): two classes of points separated by a hyperplane]
Perceptron
Rosenblatt, 1957
x1 1(x)
w1
x2 2(x)
w2


f(x)

xn wN
N(x) b
f(x) = w  (x) + b
1
NL Decision Boundary

[3-D plot: a non-linear decision boundary separating two classes of points in the space of three gene-expression features (Hs.7780, Hs.234680, Hs.128749)]
Kernel Method
Potential functions, Aizerman et al 1964
[Diagram: the input x = (x1 … xn) is compared to each training point through similarity functions k(x1, x) … k(xm, x), whose outputs are weighted by α1 … αm and summed with the bias b (constant input 1) to give f(x)]
f(x) = Σi αi k(xi, x) + b
k(·, ·) is a similarity measure or “kernel”.
Hebb’s Rule

wj ← wj + yi xij

[Diagram: the activation xj of another neuron reaches the neuron through a synapse and dendrite with weight wj; the neuron outputs y on its axon]
Link to “Naïve Bayes”
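A minimal sketch (not from the slides) of Hebb's rule accumulated over a training set with labels in {−1, +1}:

```python
import numpy as np

def hebb_train(X, y):
    """Hebb's rule: w_j <- w_j + y_i * x_ij, accumulated over all training examples."""
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):   # y_i in {-1, +1}
        w += y_i * x_i
    return w                     # equivalently: X.T @ y
```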


Kernel “Trick” (for Hebb’s rule)

• Hebb’s rule for the Perceptron:


w = Σi yi Φ(xi)
f(x) = w ⋅ Φ(x) = Σi yi Φ(xi) ⋅ Φ(x)

• Define a dot product:

k(xi, x) = Φ(xi) ⋅ Φ(x)
f(x) = Σi yi k(xi, x)
Kernel “Trick” (general)

• f(x) = i i k(xi, x)

• k(xi, x) = (xi)  (x)

Dual forms
• f(x) = w  (x)
• w = i i (xi)
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) ⋅ Φ(t)

But we do not need to know the Φ representation.

Examples:
• k(s, t) = exp(−||s − t||^2 / 2σ^2) Gaussian kernel
• k(s, t) = (s ⋅ t)^q Polynomial kernel


Multi-Layer Perceptron
Back-propagation, Rumelhart et al, 1986


[Diagram: a multi-layer perceptron mapping the inputs xj through internal “latent” variables (“hidden units”) to the output]
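Not from the slides: a minimal forward pass through a one-hidden-layer perceptron, where the hidden-unit activations play the role of the internal “latent” variables; tanh hidden units are an illustrative choice.

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer MLP: f(x) = w2 . tanh(W1 x + b1) + b2."""
    h = np.tanh(W1 @ x + b1)     # hidden-unit activations ("latent" variables)
    return np.dot(w2, h) + b2    # linear output unit
```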
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
[Diagram: starting from all the data, the input space (f1, f2) is split recursively: first choose f2, then choose f1, …]
At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.
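A minimal sketch (not from the slides) of the “entropy reduction” criterion used to pick the splitting feature and threshold:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class labels y (0 for an empty node)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy_reduction(y, feature_values, threshold):
    """Decrease in entropy obtained by splitting on feature <= threshold."""
    left = y[feature_values <= threshold]
    right = y[feature_values > threshold]
    m = len(y)
    return entropy(y) - (len(left) / m) * entropy(left) - (len(right) / m) * entropy(right)
```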
Iris Data (Fisher, 1936)
Figure from Norbert Jankowski and Krzysztof Grabczewski.
[Four decision-boundary plots on the Iris data (classes: setosa, versicolor, virginica): Linear discriminant, Tree classifier, Gaussian mixture, Kernel method (SVM)]
Fit / Robustness Tradeoff

[Two plots in (x1, x2) illustrating the fit vs. robustness tradeoff of a decision boundary]
Performance evaluation

[Two plots in (x1, x2): the decision boundary f(x) = 0 separates the region f(x) < 0 from the region f(x) > 0]
Performance evaluation

[The same two plots, showing the level set f(x) = −1 between the regions f(x) < −1 and f(x) > −1]
Performance evaluation
[The same two plots, showing the level set f(x) = 1 between the regions f(x) < 1 and f(x) > 1]
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[Plot: positive class success rate (hit rate, sensitivity) vs. 1 − negative class success rate (false alarm rate, 1 − specificity), both from 0 to 100%; curves shown: ideal ROC curve, actual ROC curve, random ROC curve]
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[The same plot, annotated with areas under the curves: ideal ROC curve (AUC = 1), random ROC curve (AUC = 0.5)]
0 ≤ AUC ≤ 1
Lift Curve

Customers are ranked according to f(x); the top-ranking customers are selected.
[Plot: hit rate (fraction of good customers selected) vs. fraction of customers selected, both from 0 to 100%; curves shown: ideal lift, actual lift, random lift; M is the area between the actual and random lift curves, and Gini ∝ M]
Gini = 2 AUC − 1
0 ≤ Gini ≤ 1
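Not on the slides: a small sketch computing the ROC curve, the AUC, and the Gini index from scores f(x) and labels in {−1, +1}; ties in the scores are ignored for simplicity.

```python
import numpy as np

def roc_curve(scores, y):
    """Hit rate vs. false alarm rate as the decision threshold sweeps over f(x)."""
    order = np.argsort(-scores)                   # decreasing score
    y_sorted = y[order]
    hits = np.concatenate(([0.0], np.cumsum(y_sorted == 1) / np.sum(y == 1)))
    false_alarms = np.concatenate(([0.0], np.cumsum(y_sorted == -1) / np.sum(y == -1)))
    return false_alarms, hits

def auc(false_alarms, hits):
    """Area under the ROC curve (trapezoidal rule)."""
    return np.trapz(hits, false_alarms)

# Gini index, as on the slide: gini = 2 * auc(fa, hr) - 1
```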
Performance Assessment
Predictions: F(x)

Cost matrix:

                      Class −1        Class +1        Total              Class +1 / Total
Truth y: Class −1     tn              fp              neg = tn + fp      False alarm rate = fp/neg
Truth y: Class +1     fn              tp              pos = fn + tp      Hit rate = tp/pos
Total                 rej = tn + fn   sel = fp + tp   m = tn+fp+fn+tp    Frac. selected = sel/m
Class +1 / Total                      Precision = tp/sel

False alarm rate = type I error rate = 1 − specificity
Hit rate = 1 − type II error rate = sensitivity = recall = test power
Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate , False alarm rate} or {Hit rate , Precision} or {Hit rate , Frac.selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 – (sensitivity+specificity)/2
• F measure = 2 × precision × recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
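A minimal sketch (not from the slides) computing the quantities above from predictions F and targets y in {−1, +1}:

```python
import numpy as np

def assess(F, y):
    """Confusion counts and the reported metrics, for labels in {-1, +1}."""
    tp = np.sum((F == 1) & (y == 1))
    tn = np.sum((F == -1) & (y == -1))
    fp = np.sum((F == 1) & (y == -1))
    fn = np.sum((F == -1) & (y == 1))
    pos, neg, sel, m = tp + fn, tn + fp, tp + fp, len(y)
    error_rate = (fn + fp) / m
    hit_rate = tp / pos                  # sensitivity = recall
    false_alarm_rate = fp / neg          # 1 - specificity
    ber = (fn / pos + fp / neg) / 2      # balanced error rate
    precision = tp / sel
    f_measure = 2 * precision * hit_rate / (precision + hit_rate)
    return error_rate, ber, hit_rate, false_alarm_rate, precision, f_measure
```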
What is a Risk Functional?

A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 − AUC (Gini index = 2 AUC − 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) − yi)^2
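A minimal sketch of these two empirical risk functionals:

```python
import numpy as np

def error_rate_risk(F, y):
    """Classification risk: (1/m) * sum_i 1(F(x_i) != y_i)."""
    return np.mean(F != y)

def mse_risk(f, y):
    """Regression risk: (1/m) * sum_i (f(x_i) - y_i)^2."""
    return np.mean((f - y) ** 2)
```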
How to train?

• Define a risk functional R[f(x,w)]


• Optimize it w.r.t. w (gradient descent,
mathematical programming, simulated
annealing, genetic algorithms, etc.)
[Plot: the risk R[f(x, w)] over the parameter space (w), with its minimum at w*]
(… to be continued in the next lecture)
How to Train?

• Define a risk functional R[f(x,w)]


• Find a method to optimize it, typically
“gradient descent”
wj  wj -  R/wj
or any optimization method (mathematical
programming, simulated annealing,
genetic algorithms, etc.)
(… to be continued in the next lecture)
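Not from the slides: a minimal gradient-descent sketch of the update rule above, assuming the caller supplies a function grad_R(w) returning ∂R/∂w (for example, the gradient of the mean-square-error risk of a linear model).

```python
import numpy as np

def gradient_descent(w0, grad_R, eta=0.01, n_steps=1000):
    """Repeat w_j <- w_j - eta * dR/dw_j for a fixed number of steps."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= eta * grad_R(w)
    return w

# Example: MSE risk of a linear model f(x) = w . x on data (X, y)
# grad_R = lambda w: (2 / len(y)) * X.T @ (X @ w - y)
```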
Summary
• With linear threshold units (“neurons”) we can build:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions Φ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
– Be aware of the fit vs. robustness tradeoff
Want to Learn More?

• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147–169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
