Machine Learning Algorithms
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
8: Polynomial Regression
So far we've mainly been dealing with linear regression:

  X1  X2  Y
   3   2  7
   1   1  3
   :   :  :        x1 = (3, 2), y1 = 7, …

Prepend a constant to each input vector, zk = (1, xk1, xk2), so z1 = (1, 3, 2), y1 = 7, …:

  Z = [ 1  3  2 ]      y = [ 7 ]
      [ 1  1  1 ]          [ 3 ]
      [ :  :  : ]          [ : ]

  β = (ZᵀZ)⁻¹(Zᵀy)

  y_est = β0 + β1 x1 + β2 x2
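A minimal numeric sketch of this solve (not from the original slides), assuming numpy; the third data row is invented so that ZᵀZ is invertible:

    # Sketch of the linear-regression fit above. The first two records match
    # the slide's example (x1 = (3, 2) -> 7, x2 = (1, 1) -> 3); the third is
    # made up so the normal equations have a unique solution.
    import numpy as np

    X = np.array([[3.0, 2.0],
                  [1.0, 1.0],
                  [2.0, 5.0]])            # R x m inputs
    y = np.array([7.0, 3.0, 8.0])         # R outputs

    # Prepend a column of ones: z_k = (1, xk1, xk2)
    Z = np.hstack([np.ones((X.shape[0], 1)), X])

    # beta = (Z^T Z)^-1 (Z^T y); lstsq solves the same problem more stably.
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

    y_est = Z @ beta                      # y_est = beta0 + beta1*x1 + beta2*x2
    print(beta, y_est)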
Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions. Same data:

  X1  X2  Y
   3   2  7
   1   1  3
   :   :  :        x1 = (3, 2), y1 = 7, …

Each record becomes z = (1, x1, x2, x1², x1·x2, x2²):

  Z = [ 1  3  2  9  6  4 ]      y = [ 7 ]
      [ 1  1  1  1  1  1 ]          [ 3 ]
      [ :  :  :  :  :  : ]          [ : ]

  β = (ZᵀZ)⁻¹(Zᵀy)

  y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1·x2 + β5 x2²
Quadratic Regression (continued)

• Each component of a z vector is called a term.
• Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms
• (m+2)-choose-2 terms in total = O(m²)

Note that solving β = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶).
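A small sketch of the quadratic basis expansion and the term count (not from the slides), assuming numpy and using invented records beyond the two shown above:

    from itertools import combinations_with_replacement
    import numpy as np

    def quadratic_features(x):
        """Map an m-vector x to its quadratic term vector z."""
        terms = [1.0]                                    # 1 constant term
        terms.extend(x)                                  # m linear terms
        for i, j in combinations_with_replacement(range(len(x)), 2):
            terms.append(x[i] * x[j])                    # m(m+1)/2 quadratic terms
        return np.array(terms)

    X = np.array([[3.0, 2.0], [1.0, 1.0], [2.0, 5.0], [4.0, 1.0],
                  [0.0, 3.0], [5.0, 2.0], [1.0, 4.0], [2.0, 2.0]])
    y = np.array([7.0, 3.0, 8.0, 6.0, 4.0, 9.0, 5.0, 5.0])   # invented outputs

    Z = np.vstack([quadratic_features(x) for x in X])
    m = X.shape[1]
    assert Z.shape[1] == (m + 2) * (m + 1) // 2              # (m+2)-choose-2 terms

    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)             # same solve as before
    print(beta)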
Qth-degree Polynomial Regression

Same data; now the terms run up to degree Q, so each term is a product of input powers whose exponents sum to at most Q (e.g. Q = 11, m = 4: q0 = 2, q1 = 2, q2 = 0, q3 = 4, q4 = 3):

  Z = [ 1  3  2  9  6  … ]      y = [ 7 ]
      [ 1  1  1  1  1  … ]          [ 3 ]
      [ :  :  :  :  :  … ]          [ : ]

  β = (ZᵀZ)⁻¹(Zᵀy)
7: Radial Basis Functions (RBFs)

Same data:

  X1  X2  Y
   3   2  7
   1   1  3
   :   :  :        x1 = (3, 2), y1 = 7, …

Now each term column of Z holds one basis function evaluated at every input:

  Z = [ …  …  …  …  …  … ]      y = [ 7 ]
      [ …  …  …  …  …  … ]          [ 3 ]
      [ …  …  …  …  …  … ]          [ : ]

  β = (ZᵀZ)⁻¹(Zᵀy)
1-d RBFs

[Figure: y against x with three bump-shaped basis functions centred at c1, c2 and c3.]
Example

[Figure: an example fit built from the three basis functions centred at c1, c2 and c3.]
RBFs with Linear Regression

  y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)

  where φi(x) = KernelFunction( |x − ci| / KW )

• All the ci's are held constant (initialized randomly or on a grid in m-dimensional input space).
• KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap on my diagram).

Then, given Q basis functions, define the matrix Z such that Zkj = KernelFunction( |xk − cj| / KW ), where xk is the kth vector of inputs.

And as before, β = (ZᵀZ)⁻¹(Zᵀy).
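A minimal sketch of this fit (not from the slides), assuming numpy, synthetic 1-d data, and a Gaussian-shaped KernelFunction, which is only one possible choice:

    import numpy as np

    def kernel(r):
        return np.exp(-0.5 * r ** 2)          # one possible KernelFunction

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 10.0, size=50)       # 1-d inputs
    y = np.sin(x) + 0.1 * rng.standard_normal(50)

    centers = np.linspace(0.0, 10.0, 8)       # the c_i, fixed on a grid
    KW = 1.5                                  # kernel width, also held constant

    # Z_kj = KernelFunction(|x_k - c_j| / KW)
    Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)

    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^-1 (Z^T y)
    print(np.mean((y - Z @ beta) ** 2))            # training error of the fit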
RBFs with NonLinear Regression
Radial Basis Functions in 2-d

Two inputs. Outputs (heights sticking out of the page) not shown.

[Figure (two views): centres in the (x1, x2) plane, each with a sphere of significant influence.]
Crabby RBFs in 2-d

What's the problem in this example?

Blue dots denote coordinates of input vectors.

[Figure (two views): the centres and their spheres of significant influence overlaid on the input points.]
Hopeless!

Even before seeing the data, you should understand that this is a disaster!

[Figure (two views): centres and their spheres of significant influence in the (x1, x2) plane.]
6: Robust Regression
Robust Regression

This is the best fit that Quadratic Regression can manage.

[Figure: the quadratic fit to the dataset.]
Robust Regression

…but this is what we'd probably prefer.

[Figure: a fit that tracks the bulk of the data and ignores the outlying points.]
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted…
Robust Regression

For k = 1 to R…
• Let (xk, yk) be the kth datapoint.
• Let yest_k be the predicted value of yk.
• Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

  wk = KernelFn( [yk − yest_k]² )
Robust Regression (continued)

Then redo the regression using weighted datapoints. (I taught you how to do this in the "Instance-based" lecture; only then the weights depended on distance in input-space.)

Guess what happens next?
Repeat the whole thing until converged!
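A minimal sketch of that loop (not from the slides), assuming numpy, a quadratic base fit and a Gaussian-shaped KernelFn; the data, bandwidth and fixed iteration count are illustrative choices:

    import numpy as np

    def weighted_quadratic_fit(x, y, w):
        """Weighted least squares for y ~ beta2*x^2 + beta1*x + beta0."""
        Z = np.vander(x, 3)                       # columns x^2, x, 1
        W = np.diag(w)
        beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
        return beta, Z

    rng = np.random.default_rng(1)
    x = np.linspace(-3.0, 3.0, 60)
    y = 1.0 + 0.5 * x - 0.3 * x ** 2 + 0.2 * rng.standard_normal(60)
    y[::15] += 5.0                                # a few gross outliers

    w = np.ones_like(y)                           # first fit is unweighted
    for _ in range(10):                           # "repeat until converged"
        beta, Z = weighted_quadratic_fit(x, y, w)
        residual_sq = (y - Z @ beta) ** 2         # [y_k - yest_k]^2
        w = np.exp(-residual_sq / (2.0 * np.median(residual_sq)))   # KernelFn
    print(beta)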
Robust Regression: what we're doing

What LOESS robust regression does:

Assume yk was originally generated using the following recipe:

  With probability p:   yk = β0 + β1 xk + β2 xk² + N(0, σ²)
  But otherwise:        yk ~ N(µ, σ_huge²)

The computational task is to find the Maximum Likelihood β0, β1, β2, p, µ and σ_huge.
5: Regression Trees

• "Decision trees for regression"

[Figure: a decision tree whose leaves make predictions such as "Predict age = 47".]
A one-split regression tree

        Gender?
       /       \
   Female      Male

• We can't use information gain.
• What should we use?
Choosing the attribute to split on

  Gender   Rich?   Num. Children   Num. Beany Babies   Age
  Female   No      2               1                   38
  Male     No      0               0                   24
  Male     Yes     0               5+                  72
  :        :       :               :                   :

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}.

Then:

  MSE(Y|X) = (1/R) Σ_{j=1..N_X} Σ_{k such that xk = j} ( yk − µ_{y|x=j} )²
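A small sketch of this score (not from the slides), assuming numpy; the fourth record is invented:

    import numpy as np

    def mse_given_split(x_values, y_values):
        """MSE(Y|X) = (1/R) * sum_j sum_{k: x_k = j} (y_k - mean(Y | x = j))^2"""
        x_values = np.asarray(x_values)
        y_values = np.asarray(y_values, dtype=float)
        total = 0.0
        for j in np.unique(x_values):
            y_j = y_values[x_values == j]
            total += np.sum((y_j - y_j.mean()) ** 2)
        return total / len(y_values)

    gender = ["Female", "Male", "Male", "Female"]     # candidate split attribute
    age    = [38, 24, 72, 31]                         # Y values
    print(mse_given_split(gender, age))               # lowest-MSE attribute wins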
Pruning Decision

…property-owner = Yes

        Gender?
       /       \
   Female      Male

[The candidate split asks: "Do I deserve to live?"]
Linear Regression Trees

Also known as "Model Trees".

…property-owner = Yes

        Gender?
       /       \
   Female      Male

  Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
  Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation

• Leaves contain linear functions (trained using linear regression on all records matching that leaf).
• Split attribute chosen to minimize MSE of regressed children.
• Pruning with a different Chi-squared test.

Detail: you typically ignore any categorical attribute that has been tested higher up in the tree, but use all untested attributes, and use real-valued attributes even if they've been tested above.
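A minimal sketch of scoring one candidate split in such a tree (not from the slides), assuming numpy; the records are invented, and weighting each child's MSE by its share of the records is one reasonable reading of "MSE of regressed children":

    import numpy as np

    def leaf_mse(X_real, y):
        """Fit a linear model (constant + real-valued attributes) in one leaf."""
        Z = np.hstack([np.ones((len(y), 1)), X_real])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.mean((y - Z @ beta) ** 2)

    def split_score(cat_column, X_real, y):
        """Weighted MSE of the regressed children after splitting on cat_column."""
        score = 0.0
        for value in np.unique(cat_column):
            mask = cat_column == value
            score += mask.sum() / len(y) * leaf_mse(X_real[mask], y[mask])
        return score

    gender = np.array(["F", "M", "M", "F", "M", "F"])
    X_real = np.array([[2, 12], [0, 16], [0, 10], [1, 14], [3, 11], [2, 18]], float)
    age    = np.array([38.0, 24.0, 72.0, 31.0, 45.0, 40.0])
    print(split_score(gender, X_real, age))      # pick the attribute minimizing this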
Test your understanding
Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?
4: GMDH (c.f. BACON, AIM)

• Group Method of Data Handling
• A very simple but very good idea:
  1. Do linear regression
  2. Use cross-validation to discover whether any …

Typical learned function:

  age_est = height − 3.1 sqrt(weight) + …
3: Cascade Correlation
• A super-fast form of Neural Network learning
• Begins with 0 hidden units
• Incrementally adds hidden units, one by one,
doing ingeniously little recomputation between
each unit addition
Cascade beginning
Begin with a regular dataset
Nonstandard notation:
•X (i) is the i’th attribute
•x(i)k is the value of the i’th
attribute in the k’th record
Consider our errors…

Begin with a regular dataset. Find weights w(0)j to best fit Y, i.e. to minimize

  Σ_{k=1..R} ( yk − y(0)k )²   where   y(0)k = Σ_{j=1..m} w(0)j x(j)k
Cascade next step

Find weights w(1)j and p(1)0 to better fit Y, i.e. to minimize

  Σ_{k=1..R} ( yk − y(1)k )²   where   y(1)k = Σ_{j=1..m} w(1)j x(j)k + p(1)0 h(0)k
Create next hidden unit…

Find weights u(1)j and v(1)0 to define a new basis function H(1)(x) of the inputs. Make it specialize in predicting the errors in our current fit: find {u(1)j, v(1)0} to maximize the correlation between H(1)(x) and E(1), where

  H(1)(x) = g( Σ_{j=1..m} u(1)j x(j) + v(1)0 h(0) )

The dataset grows a new group of columns at each step:

  X(0) X(1) … X(m)  | Y  | Y(0) E(0) H(0)   | Y(1) E(1) H(1)   | … | Y(n)
  x(0)1 x(1)1 … x(m)1 | y1 | y(0)1 e(0)1 h(0)1 | y(1)1 e(1)1 h(1)1 | … | y(n)1
  x(0)2 x(1)2 … x(m)2 | y2 | y(0)2 e(0)2 h(0)2 | y(1)2 e(1)2 h(1)2 | … | y(n)2
  :     :       :     | :  | :    :     :      | :    :     :      |   | :
Now look at new errors

Find weights w(n)j and p(n)i to better fit Y, i.e. to minimize

  Σ_{k=1..R} ( yk − y(n)k )²   where   y(n)k = Σ_{j=1..m} w(n)j x(j)k + Σ_{i=0..n−1} p(n)i h(i)k
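A compact sketch of the whole loop (not Fahlman and Lebiere's implementation, and not from the slides), assuming numpy. The candidate hidden unit is chosen from a random pool by correlation with the current errors, whereas the real algorithm trains candidates by gradient ascent on that correlation; g is taken to be tanh:

    import numpy as np

    rng = np.random.default_rng(0)
    R, m = 200, 3
    X = rng.standard_normal((R, m))
    y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]            # some nonlinear target

    def fit_output(Z, y):
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)      # output weights
        return w, Z @ w

    Z = np.hstack([np.ones((R, 1)), X])                # bias + raw inputs
    for n in range(3):                                 # add three hidden units
        w, y_hat = fit_output(Z, y)
        err = y - y_hat                                # current errors E
        best_h, best_corr = None, -1.0
        for _ in range(200):                           # pool of random candidates
            u = rng.standard_normal(Z.shape[1])
            h = np.tanh(Z @ u)                         # unit sees inputs + old units
            corr = abs(np.corrcoef(h, err)[0, 1])
            if corr > best_corr:
                best_h, best_corr = h, corr
        Z = np.hstack([Z, best_h[:, None]])            # cascade the new unit in
        print(n, np.mean(err ** 2), round(best_corr, 3))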
Visualizing first iteration
Visualizing third iteration
Training two spirals: Steps 1-6
If you like Cascade Correlation…
See Also
• Projection Pursuit
In which you add together many non-linear non-
parametric scalar functions of carefully chosen directions
Each direction is chosen to maximize error-reduction from
the best scalar function
• AdaBoost
An additive combination of regression trees in which the
n+1’th tree learns the error of the n’th tree
2: Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a
continuous and piecewise linear fit to the data
Multilinear Interpolation
Create a set of knot points: selected X-coordinates
(usually equally spaced) that cover the data
[Figure: knot points q1, q2, q3, q4, q5 equally spaced along the x-axis, covering the data.]
Multilinear Interpolation
We are going to assume the data was generated by a
noisy version of a function that can only bend at the
knots. Here are 3 examples (none fits the data well)
[Figure: three example functions over knots q1…q5, each bending only at the knots; none fits the data well.]
How to find the best fit?
Idea 1: Simply perform a separate regression in each
segment for each part of the curve
What’s the problem with this idea?
[Figure: the data over knots q1…q5, with heights h2 and h3 marked at knots q2 and q3.]
How to find the best fit?

In the red segment (between q2 and q3):

  y_est(x) = h2 φ2(x) + h3 φ3(x)

  where φ2(x) = 1 − (x − q2)/w   and   φ3(x) = 1 − (q3 − x)/w

[Figure: the hat functions φ2(x) and φ3(x), with heights h2 and h3 at knots q2 and q3.]
How to find the best fit?

The same thing, written with absolute values. In the red segment:

  y_est(x) = h2 φ2(x) + h3 φ3(x)

  where φ2(x) = 1 − |x − q2| / w   and   φ3(x) = 1 − |x − q3| / w
How to find the best fit?

In general:

  y_est(x) = Σ_{i=1..N_K} hi φi(x)

  where φi(x) = 1 − |x − qi| / w   if |x − qi| < w
                0                  otherwise

[Figure: the overlapping hat functions over knots q1…q5, with heights h2 and h3 highlighted.]
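A minimal sketch of this 1-d fit (not from the slides), assuming numpy and synthetic data; the matrix of hat-function values plays the role of Z, and the knot heights hi come from the usual least-squares solve:

    import numpy as np

    def hat_basis(x, knots, w):
        """phi_i(x) = 1 - |x - q_i|/w when |x - q_i| < w, else 0."""
        d = np.abs(x[:, None] - knots[None, :]) / w
        return np.where(d < 1.0, 1.0 - d, 0.0)

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 4.0, 80))
    y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(80)

    knots = np.linspace(0.0, 4.0, 5)            # q1..q5, equally spaced
    w = knots[1] - knots[0]                     # knot spacing

    Phi = hat_basis(x, knots, w)
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None) # heights at the knots
    y_est = Phi @ h                             # continuous, piecewise linear
    print(h)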
In two dimensions…

Blue dots show locations of input vectors (outputs not depicted).

[Figure: input points scattered in the (x1, x2) plane.]
In two dimensions…

Blue dots show locations of input vectors (outputs not depicted). Each purple dot is a knot point; it will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

[Figure: a grid of knot points over the (x1, x2) plane; four nearby knots hold heights 9, 3, 7 and 8.]
In two dimensions…

To predict the value here… first interpolate its value on two opposite edges of the cell containing it.

[Figure: the query point sits in a cell whose surrounding knots hold heights 9, 7, 3 and 7, 8; interpolating along one edge gives 7.33.]
In two dimensions…

To predict the value here… first interpolate its value on two opposite edges… then interpolate between those two values.

[Figure: the two edge interpolations (one of them 7.33) combine to give 7.05 at the query point.]

Notes:
• This can easily be generalized to m dimensions.
• It should be easy to see that it ensures continuity.
• The patches are not linear.
Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?

[Figure: the same knot grid with heights 9, 3, 7 and 8.]
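A minimal plain-Python sketch of the two-step interpolation inside one grid cell (not from the slides); the cell coordinates and query point are invented, and the corner heights echo the figure:

    def bilinear(x1, x2, cell, heights):
        """cell = (x1_lo, x1_hi, x2_lo, x2_hi); heights = ((h_ll, h_lr), (h_ul, h_ur))."""
        x1_lo, x1_hi, x2_lo, x2_hi = cell
        t = (x1 - x1_lo) / (x1_hi - x1_lo)
        u = (x2 - x2_lo) / (x2_hi - x2_lo)
        (h_ll, h_lr), (h_ul, h_ur) = heights
        lower = (1 - t) * h_ll + t * h_lr      # interpolate along the lower edge
        upper = (1 - t) * h_ul + t * h_ur      # ...and along the opposite edge
        return (1 - u) * lower + u * upper     # then between those two values

    # Knot heights 7, 8 on the lower edge and 9, 3 on the upper edge:
    print(bilinear(0.4, 0.1, (0.0, 1.0, 0.0, 1.0), ((7.0, 8.0), (9.0, 3.0))))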
1: MARS

• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of Andrew's heroes)
• Simplest version:

  y_est(x) = Σ_{k=1..m} gk(xk)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

Idea: each gk is one of these:

[Figure: a 1-d piecewise-linear hat-basis fit over knots q1…q5, as in the multilinear interpolation slides.]
MARS

  y_est(x) = Σ_{k=1..m} gk(xk)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs:

  y_est(x) = Σ_{k=1..m} Σ_{j=1..N_K} h^k_j φ^k_j(xk)

  where φ^k_j(xk) = 1 − |xk − q^k_j| / wk   if |xk − q^k_j| < wk
                    0                        otherwise

  q^k_j : the location of the j'th knot in the k'th dimension
  h^k_j : the regressed height of the j'th knot in the k'th dimension
  wk    : the spacing between knots in the k'th dimension
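A minimal sketch of this simplest MARS form (not Friedman's algorithm, and not from the slides), assuming numpy and synthetic data; for simplicity the same knots and spacing are reused in every input dimension:

    import numpy as np

    def hat_basis(x, knots, w):
        d = np.abs(x[:, None] - knots[None, :]) / w
        return np.where(d < 1.0, 1.0 - d, 0.0)

    rng = np.random.default_rng(3)
    R, m, NK = 300, 2, 7
    X = rng.uniform(0.0, 1.0, (R, m))
    y = np.sin(6 * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + 0.05 * rng.standard_normal(R)

    knots = np.linspace(0.0, 1.0, NK)
    w = knots[1] - knots[0]

    # One block of NK term columns per input dimension: m * NK columns in all.
    Phi = np.hstack([hat_basis(X[:, k], knots, w) for k in range(m)])
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # all knot heights at once
    print(np.mean((y - Phi @ h) ** 2))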
If you like MARS…
…See also CMAC (Cerebellar Model Articulated
Controller) by James Albus (another of
Andrew’s heroes)
• Many of the same gut-level intuitions
• But entirely in a neural-network, biologically
plausible way
• (All the low dimensional functions are by
means of lookup tables, trained with a delta-
rule and using a clever blurred update and
hash-tables)
What You Should Know
• For each of the eight methods you should be able
to summarize briefly what they do and outline
how they work.
• You should understand them well enough that, given access to the notes, you can quickly re-understand them at a moment's notice.
• But you don’t have to memorize all the details
• In the right context any one of these eight might
end up being really useful to you one day! You
should be able to recognize this when it occurs.
Citations

Radial Basis Functions
  T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.

LOESS
  W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74, 368, 829-836, December 1979.

GMDH etc.
  http://www.inf.kiev.ua/GMDH-home/
  P. Langley, G. L. Bradshaw and H. A. Simon, Rediscovering Chemistry with the BACON System, in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Morgan Kaufmann, 1983.

Regression Trees etc.
  L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
  J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.

Cascade Correlation etc.
  S. E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990. http://citeseer.nj.nec.com/fahlman91cascadecorrelation.html
  J. H. Friedman and W. Stuetzle, Projection Pursuit Regression, Journal of the American Statistical Association, 76, 376, December 1981.

MARS
  J. H. Friedman, Multivariate Adaptive Regression Splines, Technical Report No. 102, Department of Statistics, Stanford University, 1988.