Introduction To Statistical Machine Learning: Marcus Hutter

Introduction to Statistical Machine Learning
-1-
Marcus Hutter
Introduction to
Statistical Machine Learning
Marcus Hutter
Canberra, ACT, 0200, Australia
Machine Learning Summer School

MLSS-2008, 2 15 March, Kioloa
ANU
RSISE
NICTA
-2-
Marcus Hutter
Abstract
This course provides a broad introduction to the methods and practice
of statistical machine learning, which is concerned with the development
of algorithms and techniques that learn from observed data by
constructing stochastic models that can be used for making predictions
and decisions. Topics covered include Bayesian inference and maximum
likelihood modeling; regression, classification, density estimation,
clustering, principal component analysis; parametric, semi-parametric,
and non-parametric models; basis functions, neural networks, kernel
methods, and graphical models; deterministic and stochastic
optimization; overfitting, regularization, and validation.
-3-
Marcus Hutter
Table of Contents
1. Introduction / Overview / Preliminaries
2. Linear Methods for Regression
3. Nonlinear Methods for Regression
4. Model Assessment & Selection
5. Large Problems
6. Unsupervised Learning
7. Sequential & (Re)Active Settings
8. Summary
Intro/Overview/Preliminaries
-4-
Marcus Hutter
INTRO/OVERVIEW/PRELIMINARIES
What is Machine Learning? Why Learn?
Related Fields
Applications of Machine Learning
SupervisedUnsupervisedReinforcement Learning
Dichotomies in Machine Learning
Mini-Introduction to Probabilities
-5-
Marcus Hutter
What is Machine Learning?

Machine Learning is concerned with the development of
algorithms and techniques that allow computers to learn
Learning in this context is the process of gaining understanding by
constructing models of observed data with the intention to use them for
prediction.
Related fields
Artificial Intelligence: smart algorithms

Statistics: inference from a sample
Data Mining: searching through large volumes of data
Computer Science: efficient algorithms and complex models
-6-
Marcus Hutter
Why Learn ?
There is no need to learn to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars),
Humans are unable to explain their expertise (speech recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user biometrics)
Example: It is easier to write a program that learns to play checkers or
backgammon well by self-play rather than converting the expertise of a
master player to a program.
-7-
Marcus Hutter
Handwritten Character Recognition

an example of a difficult machine learning problem
Task: Learn general mapping from pixel images to digits from examples
-8-
Marcus Hutter
Applications of Machine Learning

machine learning has a wide spectrum of applications including:
natural language processing,
search engines,
medical diagnosis,
detecting credit card fraud,
stock market analysis,
bio-informatics, e.g. classifying DNA sequences,
speech and handwriting recognition,
object recognition in computer vision,
playing games learning by self-play: Checkers, Backgammon.
robot locomotion.
-9-
Marcus Hutter
Some Fundamental Types of Learning

Supervised Learning
Classification
Regression
Reinforcement Learning
Agents
Unsupervised Learning
Association
Clustering
Density Estimation
Others
SemiSupervised Learning
Active Learning
- 10 -
Supervised Learning
Prediction of future cases:
Use the rule to predict the output for future inputs
Knowledge extraction:
The rule is easy to understand
Compression:
The rule is simpler than the data it explains
Outlier detection:
Exceptions that are not covered by the rule, e.g., fraud
Marcus Hutter
- 11 -
Classification
Example:
Credit scoring
Differentiating between
low-risk and high-risk
customers from their
Income and Savings
Discriminant: IF income > 1 AND savings > 2

THEN low-risk ELSE high-risk
Marcus Hutter
- 12 -
Marcus Hutter
Regression
Example: Price y = f (x)+noise of a used car as function of age x
- 13 -
Learning what normally happens
No output
Clustering: Grouping similar instances
Example applications:
Customer segmentation in CRM
Image compression: Color quantization
Bioinformatics: Learning motifs
Marcus Hutter
- 14 -
Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
Marcus Hutter
- 15 -
Marcus Hutter
Dichotomies in Machine Learning

scope of my lecture
(machine) learning / statistical
induction prediction
regression
independent identically distributed
online learning
passive prediction
parametric
conceptual/mathematical
exact/principled
supervised learning
scope of other lectures

logic/knowledge-based (GOFAI)
decision action
classification
sequential / non-iid
offline/batch learning
active learning
non-parametric
computational issues
heuristic
unsupervised RL learning
- 16 -
Marcus Hutter
Probability Basics
Probability is used to describe uncertain events;
the chance or belief that something is or will be true.
Example: Fair Six-Sided Die:
Sample space: = {1, 2, 3, 4, 5, 6}
Events: Even= {2, 4, 6}, Odd= {1, 3, 5}
Probability: P(6) = 16 , P(Even) = P(Odd) =

Outcome: 6 E.
1
2
P (6and Even)
1/6
1
Conditional probability: P (6|Even) =
=
=
P (Even)
1/2
3
General Axioms:
P({}) = 0 P(A) 1 = P(),
P(A B) + P(A B) = P(A) + P(B),
P(A B) = P(A|B)P(B).
- 17 -
Marcus Hutter
Probability Jargon
Example: (Un)fair coin: = {Tail,Head} ' {0, 1}. P(1) = [0, 1]:
Likelihood: P(1101|) = (1 )
Maximum Likelihood (ML) estimate: = arg max P(1101|) = 3
4
Prior: If we are indifferent, then P() =const.

R
P
1
Evidence: P(1101) = P(1101|)P() = 20 (actually )
Posterior: P(|1101) =
P(1101|)P()
P(1101)
3 (1 ) (BAYES RULE!).
Maximum a Posterior (MAP) estimate: = arg max P(|1101) =

2
Predictive distribution: P(1|1101) = P(11011)
=
P(1101)
3
P
Expectation: E[f |...] = f ()P(|...), e.g. E[|1101] =
Variance: Var() = E[( E)2 |1101] =
2
63
Probability density: P() = 1 P([, + ]) for 0
2
3
3
4
Linear Methods for Regression
- 18 -
Marcus Hutter
LINEAR METHODS FOR REGRESSION

Linear Regression
Coefficient Subset Selection
Coefficient Shrinkage
Linear Methods for Classifiction
Linear Basis Function Regression (LBFR)
Piecewise linear, Splines, Wavelets
Local Smoothing & Kernel Regression
Regularization & 1D Smoothing Splines
- 19 -
Marcus Hutter
Linear Regression
fitting a linear function to the data
Input feature vector x := (1 x(0) , x(1) , ..., x(d) ) IRd+1
Real-valued noisy response y IR.
Linear regression model:
y = fw (x) = w0 x(0) + ... + wd x(d)
Data: D = (x1 , y1 ), ..., (xn , yn )
Error or loss function:
Example: Residual sum of squares:
Pn
Loss(w) = i=1 (yi fw (xi ))2
Least squares (LSQ) regression:
w
= arg minw Loss(w)
Example: Persons weight y as a function of age x1 , height x2 .
- 20 -
Marcus Hutter
Coefficient Subset Selection

Problems with least squares regression if d is large:
Overfitting: The plane fits the data well (perfect for d n),
but predicts (generalizes) badly.
Interpretation: We want to identify a small subset of
features important/relevant for predicting y.
Solution 1: Subset selection:
Take those k out of d features that minimize the LSQ error.
- 21 -
Marcus Hutter
Coefficient Shrinkage
Solution 2: Shrinkage methods:
Shrink the least squares w
by penalizing the Loss:
Ridge regression: Add ||w||22 .
Lasso: Add ||w||1 .
Bayesian linear regression:
Comp. MAP arg maxw P(w|D)
from prior P (w) and
sampling model P (D|w).
Weights of low variance components shrink most.
- 22 -
Marcus Hutter
Linear Methods for Classification

Example: Y = {spam,non-spam} ' {1, 1}
(or {0, 1})
Reduction to regression: Regard y IR w

from linear regression.
Binary classification: If fw (x) > 0 then y = 1 else y = 1.
Probabilistic classification: Predict probability that new x is in class y.
P(y=1|x,D)
:= fw
Log-odds log P(y=0|x,D)
(x)
Improvements:
Linear Discriminant Analysis (LDA)
Logistic Regression
Perceptron
Maximum Margin Hyperplane
Support Vector Machine
Generalization to non-binary Y possible.
- 23 -
Marcus Hutter
Linear Basis Function Regression (LBFR)

= powerful generalization of and reduction to linear regression!
Problem: Response y IR is often not linear in x IRd .
Solution: Transform x ; (x) with : IRd IRp .
Assume/hope response y IR is now linear in .
Examples:
Linear regression: p = d and i (x) = xi .
Polynomial regression: d = 1 and i (x) = xi .
Piecewise constant regression:
E.g. d = 1 with i (x) = 1 for i x < i + 1 and 0 else.
Piecewise polynomials ...
Splines ...
- 24 -
Marcus Hutter
- 25 -
Marcus Hutter
2D Spline LBFR and 1D Symmlet-8 Wavelets
- 26 -
Marcus Hutter
Local Smoothing & Kernel Regression

Estimate f (x) by averaging the yi for all xi a-close to x:
f(x) =
Pn
K(x,xi )yi
Pi=1
n
i=1
K(x,xi )
Nearest-Neighbor Kernel:
i |<a
K(x, xi ) = 1 if { |xx
and 0 else
Generalization to other K,
e.g. quadratic (Epanechnikov) Kernel:
K(x, xi ) = max{0, a2 (x xi )2 }
- 27 -
Marcus Hutter
Regularization & 1D Smoothing Splines

to avoid overfitting if function class is large
R 00
Pn
2
f = arg minf { i=1 (yi f (xi )) + (f (x))2 dx}

=0
f =any function
through data
=
f=least
squares line fit
0<<
f=piecwise cubic
with continuous
derivative
Nonlinear Regression
- 28 -
Marcus Hutter
NONLINEAR REGRESSION
Artificial Neural Networks
Kernel Trick
Maximum Margin Classifier
Sparse Kernel Methods / SVMs
- 29 -
Artificial Neural Networks 1

as non-linear function approximator
Single hidden layer feedforward neural network
Pd
(1)
Hidden layer: zj = h( i=0 wji xi )
PM
(2)
Output: fw,k (x) = ( j=0 wkj zj )
Sigmoidal activation functions:
h() and () f non-linear
Goal: Find network weights
best modeling the data:
Back-propagation algorithm:
Pn
Minimize i=1 ||y i f w (xi )||22
w.r.t. w by gradient decent.
Marcus Hutter
- 30 -
Artificial Neural Networks 2

Avoid overfitting by early stopping or small M .
Avoid local minima by random initial weights
or stochastic gradient descent.
Example: Image Processing
Marcus Hutter
- 31 -
Marcus Hutter
Kernel Trick
The Kernel-Trick allows to reduce the functional
minimization to finite-dimensional optimization.
Let L(y, f (x)) be any loss function
and J(f ) be a penalty quadratic in f .
Pn
then minimum of penalized loss i=1 L(yi , f (xi )) + J(f )
Pn
has form f (x) = i=1 i K(x, xi )
Pn
with minimizing i=1 L(yi , (K)i ) + >K.
and Kernel K ij = K(xi , xj ) following from J.
- 32 -
Marcus Hutter
Maximum Margin Classifier

= arg
w
max
w:||w||=1
min{yi (w>(xi ))}

i
Linear boundary for

b (x) = x(b) .
Boundary is determined by
Support Vectors
(circled data points)
Margin negative if
classes not separable.
with
y {1, 1}
- 33 -
Marcus Hutter
Sparse Kernel Methods / SVMs

Non-linear boundary for general b (x)
Pn
w
= i=1 ai (xi ) for some a.
Pn
>
f (x) = w
(x) = i=1 ai K(xi , x)
depends only on via Kernel
Pd
K(xi , x) = b=1 b (xi )b (x).
Huge time savings if d n
Example K(x, x0 ):
polynomial (1 + hx, x0 i)d ,
Gaussian exp(||x x0 ||22 ),
neural network tanh(hx, x0 i).
Model Assessment & Selection
- 34 -
Marcus Hutter
MODEL ASSESSMENT & SELECTION

Example: Polynomial Regression
Training=Empirical versus Test=True Error
Empirical Model Selection
Theoretical Model Selection
The Loss Rank Principle for Model Selection
- 35 -
Marcus Hutter
Example: Polynomial Regression

Straight line does not fit data well (large training error)
high bias poor predictive performance
High order polynomial fits data perfectly (zero training error)
high variance (overfitting)
poor prediction too!
Reasonable polynomial
degree d performs well.
How to select d?
minimizing training error
obviously does not work.
- 36 -
Marcus Hutter
Training=Empirical versus Test=True Error

Learn functional relation f for data D = {(x1 , y1 ), ..., (xn , yn )}.
We can compute the empirical error on past data:
Pn
1
ErrD (f ) = n i=1 (yi f (xi ))2 .
Assume data D is sample from some distribution P.
We want to know the expected true error on future examples:
ErrP (f ) = EP [(y f (x))].
How good an estimate of ErrP (f ) is ErrD (f ) ?
Problem: ErrD (f ) decreases with increasing model complexity,
but not ErrP (f ).
- 37 -
Empirical Model Selection

How to select complexity parameter
Kernel width a,
penalization constant ,
number k of nearest neighbors,
the polynomial degree d?
Empirical test-set-based methods:
Regress on training set and minimize empirical error w.r.t. complexity parameter (a, , k, d) on
a separate test-set.
Sophistication: cross-validation, bootstrap, ...
Marcus Hutter
- 38 -
Marcus Hutter
Theoretical Model Selection

How to select complexity or flexibility or smoothness parameter:
Kernel width a, penalization constant , number k of nearest neighbors,
the polynomial degree d?
For
parametric regression with d parameters:

Bayesian model selection,
Akaike Information Criterion (AIC),
Bayesian Information Criterion (BIC),
Minimum Description Length (MDL),
They all add a penalty proportional to d to the loss.

For non-parametric linear regression:
Add trace of on-data regressor = effective # of parameters to loss.
Loss Rank Principle (LoRP).
- 39 -
Marcus Hutter
The Loss Rank Principle for Model Selection

c
: X Y be the (best) regressor of complexity c on data D.
Let fD
c
The loss Rank of fD
is defined as the number of other (fictitious) data
c
c
D0 that are fitted better by fD
0 than D is fitted by fD .
c
c is small fD
fits D badly many other D0 can be fitted better
Rank is large.
c is large many D0 can be fitted well Rank is large.

c
c is appropriate fD
fits D well and not too many other D0
can be fitted well Rank is small.
LoRP: Select model complexity c that has minimal loss Rank
Unlike most penalized maximum likelihood variants (AIC,BIC,MDL),
LoRP only depends on the regression and the loss function.
It works without a stochastic noise model, and
is directly applicable to any non-parametric regressor, like kNN.
How to Attack Large Problems
- 40 -
Marcus Hutter
HOW TO ATTACK LARGE PROBLEMS

Probabilistic Graphical Models (PGM)
Trees Models
Non-Parametric Learning
Approximate (Variational) Inference
Sampling Methods
Combining Models
Boosting
- 41 -
Marcus Hutter
Probabilistic Graphical Models (PGM)

Visualize structure of model and (in)dependence
faster and more comprehensible algorithms
Nodes = random variables.
Edges = stochastic dependencies.
Bayesian network = directed PGM
Markov random field = undirected PGM
Example:
P(x1 )P(x2 )P(x3 )P(x4 |x1 x2 x3 )P(x5 |x1 x3 )P (x6 |x4 )P(x7 |x4 x5 )
- 42 -
Marcus Hutter
Additive Models & Trees & Related Methods

Generalized additive model: f (x) = + f1 (x1 ) + ... + fd (xd )
Reduces determining f : IRd IR to d 1d functions fb : IR IR
Classification/decision trees:
Outlook:
PRIM,
bump hunting,
How to learn
tree structures.
- 43 -
Marcus Hutter
Regression Trees
f (x) = cb for x Rb , and cb = Average[yi |xi Rb ]
- 44 -
Marcus Hutter
Non-Parametric Learning
= prototype methods = instance-based learning = model-free
Examples:
K-means:
Data clusters around K centers with cluster means 1 , ...K .
Assign xi to closest cluster center.
K Nearest neighbors regression (kNN):
Estimate f (x) by averaging the yi
for the k xi closest to x:
Kernel regression:
Pn
K(x, xi )yi
i=1
Take a weighted average f (x) = Pn

.
i=1 K(x, xi )
- 45 -
Marcus Hutter
- 46 -
Marcus Hutter
Approximate (Variational) Inference

Approximate full distribution P(z) by q(z)
Popular: Factorized distribution:
q(z) = q1 (z1 ) ... qM (zM ).
Measure of fit: Relative entropy
R
KL(p||q) = p(z) log p(z)
q(z) dz
Red curves: Left minimizes KL(P||q), Middle and Right are the
two local minima of
KL(q||P).
Examples: Gaussians
- 47 -
Marcus Hutter
Elementary Sampling Methods

How to sample from P : Z [0, 1]?
Special sampling algorithms for standard distributions P.
Rejection sampling: Sample z uniformly from domain Z,
but accept sample only with probability P(z).
Importance sampling:
R
PL
1
E[f ] = f (z)p(z)dz ' L l=1 f (z l )p(z l )/q(z l ),
where z l are sampled from q.
Choose q easy to sample and large where f (z l )p(z l ) is large.
Others: Slice sampling
- 48 -
Marcus Hutter
Markov Chain Monte Carlo (MCMC) Sampling

Metropolis: Choose some convenient q with q(z|z 0 ) = q(z 0 |z).
Sample z l+1 from q(|z l ) but
accept only with probability
min{1, p(zl+1 )/p(z l )}.
Gibbs Sampling: Metropolis with
q leaving z unchanged from l ;
l + 1,except resample coordinate
i from P(zi |z \i ).
Green lines are accepted and red lines are rejected Metropolis steps.
- 49 -
Marcus Hutter
Combining Models
Performance can often be improved
by combining multiple models in some way,
instead of just using a single model in isolation
Committees: Average the predictions of a set of individual models.
P
Bayesian model averaging: P(x) = Models P(x|Model)P(Model)
P
Mixture models: P(x|, ) = k k Pk (x|k )
Decision tree: Each model is responsible
for a different part of the input space.
Boosting: Train many weak classifiers in sequence
and combine output to produce a powerful committee.
- 50 -
Marcus Hutter
Boosting
Idea: Train many weak classifiers Gm in sequence and
combine output to produce a powerful committee G.
AdaBoost.M1: [Freund & Schapire (1997) received famous Godel-Prize]
Initialize observation weights wi uniformly.
For m = 1 to M do:
(a) Gm classifies x as Gm (x) {1, 1}.
Train Gm weighing data i with wi .
(b) Give Gm high/low weight i if it performed well/bad.
(c) Increase attention=weight wi for obs. xi misclassified by Gm .
PM
Output weighted majority vote: G(x) = sign( m=1 m Gm (x))
- 51 -
Marcus Hutter
UNSUPERVISED LEARNING
K-Means Clustering
Mixture Models
Expectation Maximization Algorithm
- 52 -
Supervised learning: Find functional relationship f : X Y from I/O data {(xi , yi )}
Unsupervised learning:
Find pattern in data (x1 , ..., xn )
without an explicit teacher (no y values).
Example: Clustering e.g. by K-means
Implicit goal: Find simple explanation,
i.e. compress data (MDL, Occams razor).
Density estimation: From which probability
distribution P are the xi drawn?
Marcus Hutter
- 53 -
Marcus Hutter
K-means Clustering and EM

Data points seem to cluster around two centers.
Assign each data point i to a cluster ki .
Let k = center of cluster k.
Distortion measure: Total distance2
of data points from cluster centers:
Pn
J(k, ) := i=1 ||xi ki ||2
Choose centers k initially at random.
M-step: Minimize J w.r.t. k:
Assign each point to closest center
E-step: Minimize J w.r.t. :
Let k be the mean of points belonging to cluster k
Iteration of M-step and E-step converges to local minimum of J.
- 54 -
Marcus Hutter
Iterations of K-means EM Algorithm
- 55 -
Mixture Models and EM

Mixture of Gaussians:
PK
P(x|) = i=1 Gauss(x|k , k )k
Maximize likelihood P(x|...) w.r.t. , , .
E-Step: Compute probability ik that data point i
belongs to cluster k, based
on estimates , , .
M-Step: Re-estimate ,
, (take empirical
mean/variance) given ik .
Marcus Hutter
Non-IID: Sequential & (Re)Active Settings
- 56 -
Marcus Hutter
NON-IID: SEQUENTIAL & (RE)ACTIVE

SETTINGS
Sequential Data
Sequential Decision Theory
Learning Agents
Reinforcement Learning (RL)
Learning in Games
- 57 -
Marcus Hutter
Sequential Data
General stochastic time-series: P(x1 ...xn ) =
Qn
i=1
P(xi |x1 ...xi1 )
Independent identically distributed (roulette,dice,classification):

P(x1 ...xn ) = P(x1 )P(x2 )...P(xn )
First order Markov chain (Backgammon):
P(x1 ...xn ) = P(x1 )P(x2 |x1 )P(x3 |x2 )...P(xn |xn1 )
Second order Markov chain (mechanical systems):
P(x1 ...xn ) = P(x1 )P(x2 |x1 )P(x3 |x1 x2 )
... P(xn |xn1 xn2 )
Hidden Markov model (speech recognition):
R
P(x1 ...xn ) = P(z1 )P(z2 |z1 )...P(zn |zn1 )
Qn
i=1 P(xi |zi )dz
- 58 -
Marcus Hutter
Sequential Decision Theory

Setup:
For t = 1, 2, 3, 4, ...
Given sequence x1 , x2 , ..., xt1
(1) predict/make decision yt ,
(2) observe xt ,
(3) suffer loss Loss(xt , yt ),
(4) t t + 1, goto (1)
Example: Weather Forecasting

xt X = {sunny, rainy}
yt Y = {umbrella, sunglasses}
Loss
sunny
rainy
umbrella
0.1
0.3
sunglasses
0.0
1.0
Goal: Minimize expected Loss: yt = arg minyt E[Loss(xt , yt )|x1 ...xt1 ]

Greedy minimization of expected loss is optimal if:
Important: Decision yt does not influence env. (future observations).
Examples:
Loss
y
=
=
square / absolute / 0-1 error

mean / median / mode
function
of P(xt | )
- 59 -
Marcus Hutter
Learning Agents 1
sensors
percepts
?
environment
actions
agent
actuators
Additional complication:
Learner can influence environment,
and hence what he observes next.
farsightedness, planning, exploration necessary.
Exploration Exploitation problem
- 60 -
Marcus Hutter
Learning Agents 2
Performance standard
Sensors
Critic
changes
Learning
element
knowledge
Performance
element
learning
goals
Problem
generator
Agent
Actuators
Environment
feedback
- 61 -
Marcus Hutter
Reinforcement Learning (RL)

r1 | o1 r2 | o2 r3 | o3 r4 | o4 r5 | o5 r6 | o6 ...
H
Y
HH
H
Agent
p
Environment q
PP
P
y1
y2
PP
P
q
y3
y4
y5
y6
...
RL is concerned with how an agent ought to take actions

in an environment so as to maximize its long-term reward.
Find policy that maps states of the world
to the actions the agent ought to take in those states.
The environment is typically formulated
as a finite-state Markov decision process (MDP).
- 62 -
Learning in Games
Learning though self-play.
Backgammon (TD-Gammon).
Samuels checker program.
Marcus Hutter
Summary
- 63 -
Marcus Hutter
SUMMARY
Important Loss Functions

Learning Algorithm Characteristics Comparison
More Learning
Data Sets
Journals
Annual Conferences
Recommended Literature
Summary
- 64 -
Important Loss Functions
Marcus Hutter
Summary
- 65 -
Marcus Hutter
Learning Algorithm Characteristics Comparison
Summary
- 66 -
More Learning
Concept Learning
Bayesian Learning
Computational Learning Theory (PAC learning)
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
Marcus Hutter
Summary
- 67 -
Data Sets
UCI Repository:
http://www.ics.uci.edu/ mlearn/MLRepository.html
UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
Statlib:
http://lib.stat.cmu.edu/
Delve:
http://www.cs.utoronto.ca/ delve/
Marcus Hutter
Summary
- 68 -
Marcus Hutter
Journals
Journal of Machine Learning Research
Machine Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
Annals of Statistics
Journal of the American Statistical Association
...
Summary
- 69 -
Marcus Hutter
Annual Conferences
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
Neural Information Processing Systems (NIPS)
European Conference on Machine Learning (ECML)
International Conference on Machine Learning (ICML)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Artificial Neural Networks (ICANN)
Summary
- 70 -
Marcus Hutter
Recommended Literature
[Bis06]
C. M. Bishop. Pattern Recognition and Machine Learning.

Springer, 2006.
[HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman.

The Elements of Statistical Learning. Springer, 2001.
[Alp04]
E. Alpaydin. Introduction to Machine Learning.

MIT Press, 2004.
[Hut05] M. Hutter. Universal Artificial Intelligence:

Sequential Decisions based on Algorithmic Probability.
Springer, Berlin, 2005. http://www.hutter1.net/ai/uaibook.htm.

Introduction To Statistical Machine Learning: Marcus Hutter

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Statistical Machine Learning: Marcus Hutter

Uploaded by

Copyright:

Available Formats

Introduction to Statistical Machine Learning

Machine Learning Summer School

Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning

What is Machine Learning?

Artificial Intelligence: smart algorithms

Handwritten Character Recognition

Applications of Machine Learning

Some Fundamental Types of Learning

Discriminant: IF income > 1 AND savings > 2

Dichotomies in Machine Learning

scope of other lectures

Probability: P(6) = 16 , P(Even) = P(Odd) =

Prior: If we are indifferent, then P() =const.

Maximum a Posterior (MAP) estimate: = arg max P(|1101) =

Variance: Var() = E[( E)2 |1101] =

Probability density: P() = 1 P([, + ]) for 0

Linear Methods for Regression

LINEAR METHODS FOR REGRESSION

Linear Methods for Regression

Linear Methods for Regression

Coefficient Subset Selection

Linear Methods for Regression

Linear Methods for Regression

Linear Methods for Classification

(or {0, 1})

Reduction to regression: Regard y IR w

Linear Methods for Regression

Linear Basis Function Regression (LBFR)

Linear Methods for Regression

Linear Methods for Regression

2D Spline LBFR and 1D Symmlet-8 Wavelets

Linear Methods for Regression

Local Smoothing & Kernel Regression

Linear Methods for Regression

Regularization & 1D Smoothing Splines

f = arg minf { i=1 (yi f (xi )) + (f (x))2 dx}

Artificial Neural Networks 1

Artificial Neural Networks 2

Maximum Margin Classifier

min{yi (w>(xi ))}

Linear boundary for

Sparse Kernel Methods / SVMs

Model Assessment & Selection

MODEL ASSESSMENT & SELECTION

Model Assessment & Selection

Example: Polynomial Regression

Model Assessment & Selection

Training=Empirical versus Test=True Error

Model Assessment & Selection

Empirical Model Selection

Model Assessment & Selection

Theoretical Model Selection

parametric regression with d parameters:

They all add a penalty proportional to d to the loss.

Model Assessment & Selection

The Loss Rank Principle for Model Selection

c is large many D0 can be fitted well Rank is large.

How to Attack Large Problems

HOW TO ATTACK LARGE PROBLEMS

How to Attack Large Problems

Probabilistic Graphical Models (PGM)

How to Attack Large Problems