
AMCS 215

Mathematical Foundations of Machine Learning

Introduction
What is Machine Learning?
Input spaces can be large
Data visualization
Supervised Learning

• mapping from inputs x ∈ X to outputs y ∈ Y


• inputs x are also called features. Often a fixed-dimensional
vector of numbers, such as the height and weight of a person,
or the pixels in an image. X = R^D
• outputs y also called labels. In classification problems, the
output space is a set of C unordered and mutually exclusive
labels known as classes, Y = {1, 2, . . . , C}
• we are given a set of N input-output pairs, known as a
training set, D = {(xn, yn)}_{n=1}^{N}. N is called the sample size
Learning a classifier

f(x; θ) = Setosa                    if petal length < 2.45
          Versicolor or Virginica   otherwise
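
Below is a minimal Python sketch of this one-split rule; the 2.45 cm threshold is the one quoted above, and the example petal lengths are made up.

```python
# Minimal sketch of the one-split Iris classifier above.
def classify_iris(petal_length_cm: float) -> str:
    """Predict the Iris class from petal length alone (threshold from the slide)."""
    if petal_length_cm < 2.45:
        return "Setosa"
    return "Versicolor or Virginica"

print(classify_iris(1.4))  # Setosa
print(classify_iris(4.7))  # Versicolor or Virginica
```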
Empirical risk

• misclassification rate on the training set

L(θ) ≜ (1/N) Σ_{n=1}^{N} I(yn ≠ f(xn; θ))

where I(·) is the binary indicator function


• we may also define a loss function ℓ(y, ŷ) when some errors
are more costly than others, to get the empirical risk

L(θ) ≜ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ))
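
A sketch of both quantities in Python; with the zero-one loss the empirical risk is exactly the misclassification rate. The labels below are placeholders.

```python
import numpy as np

# Sketch: empirical risk = average loss over the training set (toy labels).
def empirical_risk(loss, y, y_hat):
    """L(theta) = (1/N) * sum_n loss(y_n, yhat_n)"""
    return np.mean([loss(yn, yhn) for yn, yhn in zip(y, y_hat)])

zero_one = lambda y, y_hat: float(y != y_hat)    # I(y != f(x; theta))
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(empirical_risk(zero_one, y_true, y_pred))  # 0.25, the misclassification rate
```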
Training

• empirical risk minimization

θ* = argmin_θ L(θ) = argmin_θ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ))

• However, our true goal is to minimize the expected loss on
future data that we have not yet seen. That is, we want to
generalize, rather than just do well on the training set.
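
As a toy illustration of empirical risk minimization, the sketch below searches over a single threshold parameter for an Iris-style rule and keeps the value with the lowest training misclassification rate; the data values are invented.

```python
import numpy as np

# Toy ERM: choose theta minimizing the training misclassification rate of the
# rule "class 0 if x < theta, else class 1" (made-up petal lengths and labels).
x = np.array([1.3, 1.4, 1.5, 4.5, 4.7, 5.1])
y = np.array([0,   0,   0,   1,   1,   1])

def risk(theta):
    y_hat = (x >= theta).astype(int)
    return np.mean(y_hat != y)                 # empirical risk with 0-1 loss

candidates = np.linspace(x.min(), x.max(), 200)
theta_star = candidates[np.argmin([risk(t) for t in candidates])]
print(theta_star, risk(theta_star))            # a threshold between the classes, risk 0.0
```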
Uncertainty

• In many cases, we will not be able to perfectly predict the
exact output given the input, due to model uncertainty or
data uncertainty
• we can capture our uncertainty using a conditional probability
distribution

p(y = c|x; θ) = fc (x; θ) = Sc (f (x; θ))

where f : X → [0, 1]^C


• S is the softmax function

S(a) ≜ [ e^{a_1} / Σ_{j=1}^{C} e^{a_j}, ··· , e^{a_C} / Σ_{j=1}^{C} e^{a_j} ]
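
A direct implementation of this definition; subtracting the maximum entry before exponentiating is a standard numerical-stability step, not part of the formula itself.

```python
import numpy as np

def softmax(a):
    """S(a)_c = exp(a_c) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shifting by max(a) leaves the result unchanged
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # [0.659 0.242 0.099]; entries are positive and sum to 1
```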
Logistic regression

• common special case when f is an affine function of the form

f(x; θ) = b + w^T x = b + w1 x1 + w2 x2 + ··· + wD xD
θ = (b, w) are model parameters, known as bias and weights
• to reduce notational clutter, we often absorb the bias term
into the weights by defining w̃ = [b, w1 , · · · , wD ] and
x̃ = [1, x1 , · · · , xD ] so that
w̃^T x̃ = b + w^T x

• this converts the affine function into a linear function. Assume
this is done from now on, so that
f(x; w) = w^T x
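
A quick numerical check of the bias-absorption trick, with made-up weights and inputs:

```python
import numpy as np

# Sketch: absorbing the bias b into the weight vector (illustrative numbers).
b = 0.5
w = np.array([1.0, -2.0, 3.0])
x = np.array([0.2,  0.4, 0.1])

w_tilde = np.concatenate(([b], w))     # w~ = [b, w1, ..., wD]
x_tilde = np.concatenate(([1.0], x))   # x~ = [1, x1, ..., xD]

print(b + w @ x)           # affine form  b + w^T x
print(w_tilde @ x_tilde)   # linear form  w~^T x~  (same value)
```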
Maximum likelihood estimation

• common to use
ℓ(y, f (x; θ)) = − log p(y|f (x; θ))
minimizing this loss encourages the model to assign a high
probability to the true output y for each corresponding input x
• average negative log probability of the training set is given by

NLL(θ) = −(1/N) Σ_{n=1}^{N} log p(yn | f(xn; θ))
known as the negative log likelihood
• compute maximum likelihood estimate

θmle = argmin_θ NLL(θ)
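
A sketch of the average negative log likelihood for a classifier that outputs class probabilities; the probability table and labels are invented.

```python
import numpy as np

def nll(probs, y):
    """NLL = -(1/N) * sum_n log p(y_n | x_n), with probs[n, c] = p(y = c | x_n)."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])        # true class of each example
print(nll(probs, y))           # ~0.50; smaller when true classes get higher probability
```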
Regression

• suppose we want to predict a real-valued quantity y ∈ R
instead of a class label y ∈ {1, . . . , C}
• we need a different loss function
• quadratic loss, also called l2 loss, is common

ℓ(y, ŷ) = (y − ŷ)2

• empirical risk is the mean square error

MSE(θ) = (1/N) Σ_{n=1}^{N} (yn − f(xn; θ))^2
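
The same empirical-risk computation with the quadratic loss, on toy numbers:

```python
import numpy as np

def mse(y, y_hat):
    """MSE(theta) = (1/N) * sum_n (y_n - f(x_n; theta))^2"""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))   # ~0.0467
```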
Uncertainty in regression tasks

• common to assume Gaussian or normal distribution


N(y | µ, σ^2) ≜ (1/√(2πσ^2)) e^{−(y − µ)^2 / (2σ^2)}

with mean µ and variance σ^2

• we can make the mean depend on the inputs by defining
µ = f(x; θ) to get the probability distribution
p(y|x; θ) = N(y | f(x; θ), σ^2)

• if we assume the variance σ^2 is fixed, the negative log
likelihood becomes

NLL(θ) = −(1/N) Σ_{n=1}^{N} log [ (1/√(2πσ^2)) exp(−(1/(2σ^2)) (yn − f(xn; θ))^2) ]
       = (1/(2σ^2)) MSE(θ) + constant
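
A quick numerical check of this identity on synthetic predictions (the data and σ² below are arbitrary):

```python
import numpy as np

# Sketch: with fixed variance, the Gaussian NLL equals MSE/(2*sigma^2) plus a constant.
rng = np.random.default_rng(0)
y     = rng.normal(size=100)
y_hat = y + 0.3 * rng.normal(size=100)       # imperfect predictions
sigma2 = 0.25

mse = np.mean((y - y_hat) ** 2)
nll = np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - y_hat) ** 2 / (2 * sigma2))
print(nll, mse / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))   # the two values agree
```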
Linear regression

• f (x; θ) = b + wx
• least squares solution θ* = argmin_θ MSE(θ)
• if we have multiple input features, we can extend this to multiple linear regression
Multiple linear regression

• consider the task of predicting temperature as a function of
2d location in a room
• f (x; θ) = b + w1 x1 + w2 x2
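
A sketch of fitting f(x; θ) = b + w1 x1 + w2 x2 by least squares on synthetic temperature-versus-location data; the coefficients 20, 1.5, −0.8 are invented ground truth.

```python
import numpy as np

# Sketch: least squares for f(x; theta) = b + w1*x1 + w2*x2 on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 2))                  # 2-d locations in the room
y = 20.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.1 * rng.normal(size=50)

X_tilde = np.column_stack([np.ones(len(X)), X])      # absorb the bias: [1, x1, x2]
theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)  # minimizes the MSE
print(theta)                                         # approximately [20.0, 1.5, -0.8]
```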
Polynomial regression

• can improve the fit by using a polynomial regression model of
degree d
• has the form f(x; θ) = w^T ϕ(x) where ϕ(x) is a feature vector
derived from the input
• for example ϕ(x) = [1, x, x^2, ···, x^d]
• a simple example of feature preprocessing, also called feature
engineering
• a process that can be done manually
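
A sketch of polynomial regression as linear regression on ϕ(x) = [1, x, x^2, ..., x^d], with invented 1-d data:

```python
import numpy as np

# Sketch: build phi(x) = [1, x, x^2, ..., x^d] and fit w by least squares.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.normal(size=x.size)   # toy quadratic data

d = 2
Phi = np.column_stack([x**k for k in range(d + 1)])   # one row per example
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                              # approximately [1.0, -2.0, 3.0]
```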
Automatic feature extraction

• can create much more powerful models by learning to do such
nonlinear feature extraction automatically
• deep neural networks decompose the feature extractor ϕ(x, V )
into a composition of simpler functions; stack of L nested
functions

f (x; θ) = fL (fL−1 (· · · (f1 (x)) · · · ))

where fl (x) = f (x; θl ) is the function at layer l


• and have been wildly successful
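
A minimal sketch of a model as a stack of nested functions f(x) = f2(f1(x)); the weights here are random placeholders rather than learned parameters.

```python
import numpy as np

# Sketch: f(x; theta) = f2(f1(x)), a two-layer composition with random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: R^3 -> R^4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: R^4 -> R^2

f1 = lambda x: np.maximum(0.0, W1 @ x + b1)     # affine map followed by a ReLU nonlinearity
f2 = lambda h: W2 @ h + b2                      # final affine map

x = np.array([0.5, -1.0, 2.0])
print(f2(f1(x)))                                # the composed model's output
```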
ImageNet
Overfitting and generalization

• the empirical risk is not really what we want to minimize

L(θ; Dtrain) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} ℓ(y, f(x; θ))

• what we really want to minimize is the theoretical expected
loss, aka the population risk

L(θ; p*) ≜ E_{p*(x,y)} [ℓ(y, f(x; θ))]

• difference is the generalization gap


• training, test, and validation data
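
A sketch of measuring the generalization gap: fit a flexible model on a small training set and compare its training risk with the risk on held-out data (all data here is synthetic).

```python
import numpy as np

# Sketch: training risk vs. held-out risk; their difference estimates the generalization gap.
rng = np.random.default_rng(0)
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.1 * rng.normal(size=n)

x_tr, y_tr = make_data(15)      # small training set, easy to overfit
x_te, y_te = make_data(500)     # large held-out set approximating the population risk

d = 12                          # deliberately high-degree polynomial
phi = lambda x: np.column_stack([x**k for k in range(d + 1)])
w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)

train_risk = np.mean((y_tr - phi(x_tr) @ w) ** 2)
test_risk  = np.mean((y_te - phi(x_te) @ w) ** 2)
print(train_risk, test_risk, test_risk - train_risk)   # a positive gap signals overfitting
```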
Unsupervised learning — clustering

• we only have observed data xn, without labels yn
• to choose the number of clusters we need to consider the tradeoff
between model complexity and fit to the data
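
One concrete clustering method is k-means; below is a short sketch of Lloyd's algorithm on invented 2-d data (the slides do not prescribe this particular algorithm).

```python
import numpy as np

# Sketch: a few iterations of k-means (Lloyd's algorithm) on toy 2-d data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

K = 3
centers = X[rng.choice(len(X), K, replace=False)]   # random initial centers
for _ in range(20):
    # assign each point to its nearest center, then move centers to the cluster means
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centers)   # close to [0,0], [3,3], [0,3], up to permutation
```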
Unsupervised learning — dimensionality reduction

• lower dimensional space that captures the “essence” of the
data
• hidden or latent factors zn ∈ R^K → xn ∈ R^D
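
One standard way to recover such a lower-dimensional representation is principal component analysis via the SVD; the sketch below uses synthetic data, and PCA itself is an assumption here rather than something the slide names.

```python
import numpy as np

# Sketch: PCA via the SVD to map x_n in R^D down to z_n in R^K (synthetic data, D=5, K=2).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                     # hidden factors
A = rng.normal(size=(2, 5))
X = Z @ A + 0.05 * rng.normal(size=(200, 5))      # observed high-dimensional data

Xc = X - X.mean(axis=0)                           # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 2
Z_hat = Xc @ Vt[:K].T                             # project onto the top-K principal directions
print(Z_hat.shape)                                # (200, 2)
```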
Reinforcement Learning

• the system has to learn how to interact with its environment
• this is encoded by means of a policy a = π(x), the action to take
in response to input from the environment
• unlike supervised or unsupervised learning, the system receives a
reward in response to its actions
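
A policy is just a mapping from observations to actions; the toy deterministic policy below uses a made-up "cart position" observation and hypothetical action names.

```python
# Sketch: a policy a = pi(x) as a plain function from observation to action.
# The observation (cart position) and action names are hypothetical.
def policy(x: float) -> str:
    """Push the cart back toward the origin."""
    return "push_left" if x > 0 else "push_right"

# The environment would respond to each chosen action with a reward signal.
print(policy(0.7))    # push_left
print(policy(-1.2))   # push_right
```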
Data

• image data sets
• text data sets
• preprocessing discrete input data
• preprocessing text data
• missing data
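
Two of the preprocessing steps listed above, sketched on made-up data: one-hot encoding a discrete input and mean-imputing missing values.

```python
import numpy as np

# Sketch: one-hot encode a categorical feature and fill in missing numeric values.
colors = ["red", "green", "red", "blue"]            # discrete input data
categories = sorted(set(colors))                    # ['blue', 'green', 'red']
one_hot = np.array([[c == cat for cat in categories] for c in colors], dtype=float)
print(one_hot)                                      # one row per example, one column per category

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])       # numeric feature with missing entries
x_filled = np.where(np.isnan(x), np.nanmean(x), x)  # simple mean imputation
print(x_filled)                                     # [1. 3. 3. 3. 5.]
```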
