Introduction to
Statistical Machine Learning
Marcus Hutter
Canberra, ACT, 0200, Australia
ANU
RSISE
NICTA
Abstract
This course provides a broad introduction to the methods and practice
of statistical machine learning, which is concerned with the development
of algorithms and techniques that learn from observed data by
constructing stochastic models that can be used for making predictions
and decisions. Topics covered include Bayesian inference and maximum
likelihood modeling; regression, classification, density estimation,
clustering, principal component analysis; parametric, semi-parametric,
and non-parametric models; basis functions, neural networks, kernel
methods, and graphical models; deterministic and stochastic
optimization; overfitting, regularization, and validation.
Table of Contents
1. Introduction / Overview / Preliminaries
2. Linear Methods for Regression
3. Nonlinear Methods for Regression
4. Model Assessment & Selection
5. Large Problems
6. Unsupervised Learning
7. Sequential & (Re)Active Settings
8. Summary
INTRO/OVERVIEW/PRELIMINARIES
What is Machine Learning? Why Learn?
Related Fields
Applications of Machine Learning
Supervised / Unsupervised / Reinforcement Learning
Dichotomies in Machine Learning
Mini-Introduction to Probabilities
Related Fields
[Figure: diagram of fields related to machine learning.]
Why Learn?
There is no need to learn to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars),
Humans are unable to explain their expertise (speech recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user biometrics)
Example: it is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.
Task: from examples, learn a general mapping from pixel images to digits.
Reinforcement Learning: agents
Unsupervised Learning: association, clustering, density estimation, others
Semi-Supervised Learning
Active Learning
Supervised Learning
Prediction of future cases:
Use the rule to predict the output for future inputs
Knowledge extraction:
The rule is easy to understand
Compression:
The rule is simpler than the data it explains
Outlier detection:
Exceptions that are not covered by the rule, e.g., fraud
Classification
Example: credit scoring.
Differentiating between low-risk and high-risk customers from their income and savings.
Regression
Example: Price y = f(x) + noise of a used car as a function of its age x
Unsupervised Learning
Learning what normally happens
No output
Clustering: Grouping similar instances
Example applications:
Customer segmentation in CRM
Image compression: Color quantization
Bioinformatics: Learning motifs
Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
Probability Basics
Probability is used to describe uncertain events;
the chance or belief that something is or will be true.
Example: Fair Six-Sided Die:
Sample space: Ω = {1, 2, 3, 4, 5, 6}
Events: Even = {2, 4, 6}, Odd = {1, 3, 5}
Conditional probability: P(6|Even) = P(6 and Even)/P(Even) = (1/6)/(1/2) = 1/3
General Axioms:
P(∅) = 0 ≤ P(A) ≤ 1 = P(Ω)
P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
P(A ∩ B) = P(A|B) P(B)
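A minimal sketch (my illustration, not from the slides) verifying these identities by enumerating the die's sample space with exact rational arithmetic:

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space of the fair die
even, odd = {2, 4, 6}, {1, 3, 5}      # events

def prob(event):
    # uniform measure: P(A) = |A| / |Omega|
    return Fraction(len(event & omega), len(omega))

def cond(a, b):
    # conditional probability P(A|B) = P(A and B) / P(B)
    return prob(a & b) / prob(b)

assert prob(set()) == 0 and prob(omega) == 1
assert prob(even | odd) + prob(even & odd) == prob(even) + prob(odd)
assert cond({6}, even) == Fraction(1, 3)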
Probability Jargon
Example: (Un)fair coin: Ω = {Tail, Head} ≃ {0, 1}. P(1) = θ ∈ [0, 1]:
Likelihood: P(1101|θ) = θ³(1 − θ)
Maximum Likelihood (ML) estimate: θ̂ = arg max_θ P(1101|θ) = 3/4
Posterior: P(θ|1101) = P(1101|θ)P(θ)/P(1101) ∝ θ³(1 − θ) for a uniform prior (BAYES RULE!)
Posterior mean and variance: E[θ|1101] = 2/3, Var(θ|1101) = 2/63
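A small sketch (illustration only) checking these values exactly; the uniform prior makes the posterior after observing 1101 a Beta(4, 2) distribution:

from fractions import Fraction

data = [1, 1, 0, 1]                       # the observed sequence 1101
n1, n0 = data.count(1), data.count(0)

# ML estimate: theta^n1 (1-theta)^n0 is maximized at n1 / (n1 + n0)
theta_ml = Fraction(n1, n1 + n0)          # = 3/4

# uniform prior => posterior is Beta(n1 + 1, n0 + 1) = Beta(4, 2)
a, b = n1 + 1, n0 + 1
post_mean = Fraction(a, a + b)                           # E[theta|1101] = 2/3
post_var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))   # Var(theta|1101) = 2/63

assert (theta_ml, post_mean, post_var) == (Fraction(3, 4), Fraction(2, 3), Fraction(2, 63))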
Linear Regression
Fitting a linear function to the data:
Input feature vector: x := (1 ≡ x^(0), x^(1), ..., x^(d)) ∈ ℝ^{d+1}
Real-valued noisy response: y ∈ ℝ
Linear regression model: y = f_w(x) = w_0 x^(0) + ... + w_d x^(d)
Data: D = ((x_1, y_1), ..., (x_n, y_n))
Error or loss function, e.g. residual sum of squares: Loss(w) = Σ_{i=1}^n (y_i − f_w(x_i))²
Least squares (LSQ) regression: ŵ = arg min_w Loss(w)
Example: a person's weight y as a function of age x_1 and height x_2.
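A minimal numpy sketch of LSQ regression on synthetic data (the data, dimensions, and true weights below are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # column of 1s = x^(0)
w_true = np.array([1.0, 2.0, -0.5])                        # invented weights
y = X @ w_true + 0.1 * rng.normal(size=n)                  # noisy response

# w_hat = arg min_w sum_i (y_i - f_w(x_i))^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # approximately recovers w_true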
Coefficient Shrinkage
Solution 2: shrinkage methods: shrink the least squares ŵ by penalizing the loss:
Ridge regression: add λ||w||₂².
Lasso: add λ||w||₁.
Bayesian linear regression: compute the MAP estimate ŵ = arg max_w P(w|D) from prior P(w) and sampling model P(D|w).
Weights of low-variance components shrink most.
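A sketch of ridge regression via its closed form, the λ-penalized normal equations; note this illustration also penalizes the bias weight w_0, which in practice is usually left unpenalized:

import numpy as np

def ridge(X, y, lam):
    # minimize ||y - X w||^2 + lam ||w||^2  =>  w = (X'X + lam I)^(-1) X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks w toward 0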
Kernel regression estimate: f(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i)
Nearest-neighbor kernel: K(x, x_i) = 1 if |x − x_i| < a, and 0 else
Generalization to other K, e.g. the quadratic (Epanechnikov) kernel: K(x, x_i) = max{0, a² − (x − x_i)²}
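A sketch of this estimator with the Epanechnikov kernel above; the window width a is a free smoothing parameter:

import numpy as np

def epanechnikov(x, xs, a):
    # quadratic kernel K(x, x_i) = max{0, a^2 - (x - x_i)^2}
    return np.maximum(0.0, a**2 - (x - xs) ** 2)

def kernel_regression(x, xs, ys, a=1.0):
    # f(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
    w = epanechnikov(x, xs, a)
    return np.sum(w * ys) / np.sum(w)   # undefined if no x_i is within a of x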
NONLINEAR REGRESSION
Artificial Neural Networks
Kernel Trick
Maximum Margin Classifier
Sparse Kernel Methods / SVMs
Kernel Trick
The kernel trick allows one to reduce a functional minimization to a finite-dimensional optimization.
Let L(y, f(x)) be any loss function, and let J(f) be a penalty quadratic in f.
Then the minimizer of the penalized loss Σ_{i=1}^n L(y_i, f(x_i)) + λJ(f)
has the form f(x) = Σ_{i=1}^n α_i K(x, x_i),
with α minimizing Σ_{i=1}^n L(y_i, (Kα)_i) + λ α⊤Kα,
and the kernel matrix K_ij = K(x_i, x_j) following from J.
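For squared loss this finite-dimensional problem has a closed form, α = (K + λI)⁻¹ y; a minimal sketch (λ here is an illustrative regularization constant):

import numpy as np

def kernel_ridge_fit(K, y, lam=1.0):
    # alpha = (K + lam I)^(-1) y solves min_alpha ||y - K alpha||^2 + lam alpha' K alpha
    n = len(y)
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(K_test, alpha):
    # f(x) = sum_i alpha_i K(x, x_i), with K_test[j, i] = K(x_test_j, x_train_i)
    return K_test @ alpha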
Maximum Margin Classifier:
max_{w:||w||=1} min_i y_i ⟨w, x_i⟩, with labels y_i ∈ {−1, 1}
f(x) = ŵ⊤φ(x) = Σ_{i=1}^n a_i K(x_i, x)
depends on φ only via the kernel K(x_i, x) = Σ_{b=1}^d φ_b(x_i) φ_b(x).
Huge time savings if d ≫ n.
Example kernels K(x, x′):
polynomial: (1 + ⟨x, x′⟩)^d
Gaussian: exp(−||x − x′||₂²)
neural network: tanh(⟨x, x′⟩)
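A sketch of these three kernels on numpy vectors (scale parameters such as a Gaussian width or a tanh gain are common in practice but fixed to 1 here):

import numpy as np

def poly_kernel(x, xp, deg=3):
    return (1.0 + x @ xp) ** deg            # (1 + <x, x'>)^deg

def gauss_kernel(x, xp):
    return np.exp(-np.sum((x - xp) ** 2))   # exp(-||x - x'||_2^2)

def nn_kernel(x, xp):
    return np.tanh(x @ xp)                  # tanh(<x, x'>)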
Regression Trees
f(x) = c_b for x ∈ R_b, and c_b = Average[y_i | x_i ∈ R_b]
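A minimal one-dimensional sketch of this prediction rule; the split points defining the regions R_b are hypothetical (a real tree learns them greedily):

import numpy as np

splits = [2.0, 5.0]   # hypothetical splits: R_1 = (-inf,2), R_2 = [2,5), R_3 = [5,inf)

def fit_leaf_means(xs, ys):
    # c_b = average of the y_i whose x_i fall in region R_b
    bins = np.digitize(xs, splits)
    return {b: ys[bins == b].mean() for b in np.unique(bins)}

def tree_predict(x, leaf_means):
    # f(x) = c_b for x in R_b (fails if R_b held no training point)
    return leaf_means[np.digitize([x], splits)[0]]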
Non-Parametric Learning
= prototype methods = instance-based learning = model-free
Examples:
K-means: data clusters around K centers with cluster means μ_1, ..., μ_K; assign each x_i to the closest cluster center.
k nearest neighbors regression (kNN): estimate f(x) by averaging the y_i of the k points x_i closest to x (see the sketch below).
Kernel regression: f(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i)
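A sketch of kNN regression in one dimension (k is a free smoothing parameter):

import numpy as np

def knn_regress(x, xs, ys, k=3):
    # average the y_i of the k training points x_i closest to x
    idx = np.argsort(np.abs(xs - x))[:k]
    return ys[idx].mean()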
[Figure: examples of Gaussian densities.]
Combining Models
Performance can often be improved by combining multiple models in some way, instead of just using a single model in isolation.
Committees: average the predictions of a set of individual models.
Bayesian model averaging: P(x) = Σ_{Models} P(x|Model) P(Model) (see the sketch below)
Mixture models: P(x|θ, π) = Σ_k π_k P_k(x|θ_k)
Decision trees: each model is responsible for a different part of the input space.
Boosting: train many weak classifiers in sequence and combine their outputs to produce a powerful committee.
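A minimal sketch of the averaging formula; likelihoods is a hypothetical list of callables returning P(x|Model), and priors the corresponding model probabilities:

def model_average(x, likelihoods, priors):
    # P(x) = sum over models of P(x|Model) * P(Model)
    return sum(p * lik(x) for lik, p in zip(likelihoods, priors))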
Boosting
Idea: train many weak classifiers G_m in sequence and combine their outputs to produce a powerful committee G.
AdaBoost.M1 [Freund & Schapire (1997), winners of the famous Gödel Prize]:
Initialize observation weights w_i uniformly.
For m = 1 to M do:
(a) Train G_m weighing data point i with w_i; G_m classifies x as G_m(x) ∈ {−1, 1}.
(b) Give G_m a high/low weight α_m if it performed well/badly.
(c) Increase the attention (weight) w_i on observations x_i misclassified by G_m.
Output the weighted majority vote: G(x) = sign(Σ_{m=1}^M α_m G_m(x))
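A compact sketch of AdaBoost.M1; the choice of depth-1 decision trees (stumps) from sklearn as the weak learners G_m is mine, not the slides':

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):                   # labels y in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)                     # uniform observation weights
    stumps, alphas = [], []
    for _ in range(M):
        # (a) train G_m weighing data point i with w_i
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        err = w[miss].sum() / w.sum()
        # (b) high weight alpha_m if G_m performed well (assumes 0 < err < 1)
        alpha = np.log((1 - err) / err)
        # (c) increase the weight of misclassified observations
        w[miss] *= np.exp(alpha)
        stumps.append(g); alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # weighted majority vote: sign(sum_m alpha_m G_m(x))
    return np.sign(sum(a * g.predict(X) for g, a in zip(stumps, alphas)))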
UNSUPERVISED LEARNING
K-Means Clustering
Mixture Models
Expectation Maximization Algorithm
Unsupervised Learning
Supervised learning: find a functional relationship f : X → Y from I/O data {(x_i, y_i)}.
Unsupervised learning: find patterns in the data (x_1, ..., x_n) without an explicit teacher (no y values).
Example: clustering, e.g. by K-means (see the sketch below).
Implicit goal: find a simple explanation, i.e. compress the data (MDL, Occam's razor).
Density estimation: from which probability distribution P are the x_i drawn?
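A minimal numpy K-means sketch (random initialization, fixed iteration count, empty clusters not handled):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initial centers mu_1..mu_K
    for _ in range(iters):
        # assign each x_i to the closest cluster center
        z = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
    return mu, z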
Sequential Data
General stochastic time-series: P(x_1 ... x_n) = Π_{i=1}^n P(x_i | x_1 ... x_{i−1})
Example loss function Loss(x_t, y_t) for weather x_t ∈ {sunny, rainy}:

                sunny   rainy
umbrella         0.1     0.3
sunglasses       0.0     1.0

Optimal prediction: ŷ_t = arg min_{y_t} Σ_{x_t} Loss(x_t, y_t) P(x_t | x_1...x_{t−1}),
i.e. a function of the predictive distribution P(x_t | x_1...x_{t−1}).
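A sketch computing the loss-minimizing decision from this table, where p_rain stands in for the predictive probability P(x_t = rainy | x_1...x_{t−1}):

# Loss(x, y):          sunny  rainy
loss = {"umbrella":    (0.1,  0.3),
        "sunglasses":  (0.0,  1.0)}

def best_action(p_rain):
    # y_hat = arg min_y [(1 - p_rain) * Loss(sunny, y) + p_rain * Loss(rainy, y)]
    exp_loss = {y: (1 - p_rain) * l_sun + p_rain * l_rain
                for y, (l_sun, l_rain) in loss.items()}
    return min(exp_loss, key=exp_loss.get)

print(best_action(0.1))   # sunglasses (expected loss 0.10 vs 0.12)
print(best_action(0.9))   # umbrella   (expected loss 0.28 vs 0.90)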
Learning Agents 1
[Figure: agent-environment loop: the agent receives percepts from the environment through sensors and acts on it through actuators.]
Additional complication: the learner can influence the environment, and hence what it observes next.
⇒ Farsightedness, planning, and exploration become necessary.
Exploration ↔ Exploitation problem
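A toy ε-greedy sketch of the exploration-exploitation trade-off on a multi-armed bandit; the bandit setup and parameters are my illustration, not from the slides:

import numpy as np

def eps_greedy_bandit(true_means, steps=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, values = np.zeros(k), np.zeros(k)      # visit counts, estimated means
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))               # explore: try a random arm
        else:
            a = int(np.argmax(values))             # exploit: best arm so far
        r = rng.normal(true_means[a])              # noisy reward from arm a
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental mean update
        total += r
    return total

print(eps_greedy_bandit([0.1, 0.5, 0.9]))          # per-step reward approaches 0.9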
Learning Agents 2
[Figure: learning-agent architecture: a performance standard feeds a critic, whose feedback drives changes in the learning element; the learning element exchanges knowledge with the performance element and sets learning goals for a problem generator; the agent connects to the environment through sensors and actuators.]
[Figure: agent p interacting with environment q in cycles, exchanging symbols y1, y2, y3, y4, y5, y6, ...]
Learning in Games
Learning through self-play.
Backgammon (TD-Gammon).
Samuel's checkers program.
SUMMARY
More Learning
Concept Learning
Bayesian Learning
Computational Learning Theory (PAC learning)
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
Data Sets
UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
Statlib:
http://lib.stat.cmu.edu/
Delve:
http://www.cs.utoronto.ca/~delve/
Journals
Journal of Machine Learning Research
Machine Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
Annals of Statistics
Journal of the American Statistical Association
...
Annual Conferences
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
Neural Information Processing Systems (NIPS)
European Conference on Machine Learning (ECML)
International Conference on Machine Learning (ICML)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Artificial Neural Networks (ICANN)
Recommended Literature
[Bis06]