
AMCS 215

Mathematical Foundations of Machine Learning

Introduction
What is Machine Learning?
Input spaces can be large
Data visualization
Supervised Learning

• mapping from inputs x ∈ X to outputs y ∈ Y


• inputs x are also called features. Often a fixed-dimensional
vector of numbers, such as the height and weight of a person,
or the pixels in an image. X = R^D
• outputs y also called labels. In classification problems, the
output space is a set of C unordered and mutually exclusive
labels known as classes, Y = {1, 2, . . . , C}
• we are given a set of N input-output pairs, known as a
training set, D = {(xn, yn)}_{n=1}^{N}. N is called the sample size
Learning a classifier

f(x; θ) = Setosa                    if petal length < 2.45
          Versicolor or Virginica   otherwise
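
Below is a minimal Python sketch of this one-split rule; the 2.45 cm threshold is the one quoted above, and the example petal lengths are made up.

```python
# Minimal sketch of the one-split Iris classifier above.
def classify_iris(petal_length_cm: float) -> str:
    """Predict the Iris class from petal length alone (threshold from the slide)."""
    if petal_length_cm < 2.45:
        return "Setosa"
    return "Versicolor or Virginica"

print(classify_iris(1.4))  # Setosa
print(classify_iris(4.7))  # Versicolor or Virginica
```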
Empirical risk

• misclassification rate on the training set

L(θ) ≜ (1/N) Σ_{n=1}^{N} I(yn ≠ f(xn; θ))

where I(·) is the binary indicator function


• we may also define a loss function ℓ(y, ŷ) when some errors
are more costly than others, to get the empirical risk

L(θ) ≜ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ))
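
A sketch of both quantities in Python; with the zero-one loss the empirical risk is exactly the misclassification rate. The labels below are placeholders.

```python
import numpy as np

# Sketch: empirical risk = average loss over the training set (toy labels).
def empirical_risk(loss, y, y_hat):
    """L(theta) = (1/N) * sum_n loss(y_n, yhat_n)"""
    return np.mean([loss(yn, yhn) for yn, yhn in zip(y, y_hat)])

zero_one = lambda y, y_hat: float(y != y_hat)    # I(y != f(x; theta))
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(empirical_risk(zero_one, y_true, y_pred))  # 0.25, the misclassification rate
```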
Training

• empirical risk minimization

θ* = argmin_θ L(θ) = argmin_θ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ))

• However, our true goal is to minimize the expected loss on
future data that we have not yet seen. That is, we want to
generalize, rather than just do well on the training set.
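
As a toy illustration of empirical risk minimization, the sketch below searches over a single threshold parameter for an Iris-style rule and keeps the value with the lowest training misclassification rate; the data values are invented.

```python
import numpy as np

# Toy ERM: choose theta minimizing the training misclassification rate of the
# rule "class 0 if x < theta, else class 1" (made-up petal lengths and labels).
x = np.array([1.3, 1.4, 1.5, 4.5, 4.7, 5.1])
y = np.array([0,   0,   0,   1,   1,   1])

def risk(theta):
    y_hat = (x >= theta).astype(int)
    return np.mean(y_hat != y)                 # empirical risk with 0-1 loss

candidates = np.linspace(x.min(), x.max(), 200)
theta_star = candidates[np.argmin([risk(t) for t in candidates])]
print(theta_star, risk(theta_star))            # a threshold between the classes, risk 0.0
```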
Uncertainty

• In many cases, we will not be able to perfectly predict the
exact output given the input, due to model uncertainty or
data uncertainty
• we can capture our uncertainty using a conditional probability
distribution

p(y = c|x; θ) = fc (x; θ) = Sc (f (x; θ))

where f : X → [0, 1]^C


• S is the softmax function

S(a) ≜ [ e^{a_1} / Σ_{j=1}^{C} e^{a_j}, ··· , e^{a_C} / Σ_{j=1}^{C} e^{a_j} ]
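
A direct implementation of this definition; subtracting the maximum entry before exponentiating is a standard numerical-stability step, not part of the formula itself.

```python
import numpy as np

def softmax(a):
    """S(a)_c = exp(a_c) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shifting by max(a) leaves the result unchanged
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # [0.659 0.242 0.099]; entries are positive and sum to 1
```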
Logistic regression

• common special case when f is an affine function of the form

f(x; θ) = b + w^T x = b + w1 x1 + w2 x2 + ··· + wD xD
θ = (b, w) are model parameters, known as bias and weights
• to reduce notational clutter, we often absorb the bias term
into the weights by defining w̃ = [b, w1 , · · · , wD ] and
x̃ = [1, x1 , · · · , xD ] so that
w̃^T x̃ = b + w^T x

• this converts the affine function into a linear function. Assume
this is done from now on, so that
f(x; w) = w^T x
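
A quick numerical check of the bias-absorption trick, with made-up weights and inputs:

```python
import numpy as np

# Sketch: absorbing the bias b into the weight vector (illustrative numbers).
b = 0.5
w = np.array([1.0, -2.0, 3.0])
x = np.array([0.2,  0.4, 0.1])

w_tilde = np.concatenate(([b], w))     # w~ = [b, w1, ..., wD]
x_tilde = np.concatenate(([1.0], x))   # x~ = [1, x1, ..., xD]

print(b + w @ x)           # affine form  b + w^T x
print(w_tilde @ x_tilde)   # linear form  w~^T x~  (same value)
```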
Maximum likelihood estimation

• common to use
ℓ(y, f (x; θ)) = − log p(y|f (x; θ))
minimizing this loss encourages the model to assign a high
probability to the true output y for each corresponding input x
• average negative log probability of the training set is given by

NLL(θ) = −(1/N) Σ_{n=1}^{N} log p(yn | f(xn; θ))
known as the negative log likelihood
• compute maximum likelihood estimate

θmle = argmin_θ NLL(θ)
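
A sketch of the average negative log likelihood for a classifier that outputs class probabilities; the probability table and labels are invented.

```python
import numpy as np

def nll(probs, y):
    """NLL = -(1/N) * sum_n log p(y_n | x_n), with probs[n, c] = p(y = c | x_n)."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])        # true class of each example
print(nll(probs, y))           # ~0.50; smaller when true classes get higher probability
```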
Regression

• suppose we want to predict a real-valued quantity y ∈ R
instead of a class label y ∈ {1, . . . , C}
• we need a different loss function
• quadratic loss, also called l2 loss, is common

ℓ(y, ŷ) = (y − ŷ)2

• empirical risk is the mean square error

MSE(θ) = (1/N) Σ_{n=1}^{N} (yn − f(xn; θ))^2
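
The same empirical-risk computation with the quadratic loss, on toy numbers:

```python
import numpy as np

def mse(y, y_hat):
    """MSE(theta) = (1/N) * sum_n (y_n - f(x_n; theta))^2"""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))   # ~0.0467
```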
Uncertainty in regression tasks

• common to assume Gaussian or normal distribution


N(y | µ, σ^2) ≜ (1/√(2πσ^2)) e^{−(y − µ)^2 / (2σ^2)}

with mean µ and variance σ^2

• we can make the mean depend on the inputs by defining
µ = f(x; θ) to get the probability distribution
p(y|x; θ) = N(y | f(x; θ), σ^2)

• if we assume the variance σ^2 is fixed, the negative log
likelihood becomes

NLL(θ) = −(1/N) Σ_{n=1}^{N} log [ (1/√(2πσ^2)) exp(−(1/(2σ^2)) (yn − f(xn; θ))^2) ]
       = (1/(2σ^2)) MSE(θ) + constant
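
A quick numerical check of this identity on synthetic predictions (the data and σ² below are arbitrary):

```python
import numpy as np

# Sketch: with fixed variance, the Gaussian NLL equals MSE/(2*sigma^2) plus a constant.
rng = np.random.default_rng(0)
y     = rng.normal(size=100)
y_hat = y + 0.3 * rng.normal(size=100)       # imperfect predictions
sigma2 = 0.25

mse = np.mean((y - y_hat) ** 2)
nll = np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - y_hat) ** 2 / (2 * sigma2))
print(nll, mse / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))   # the two values agree
```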
Linear regression

• f (x; θ) = b + wx
• least squares solution θ* = argmin_θ MSE(θ)
• if we have multiple input features, we can extend this to multiple linear regression
Multiple linear regression

• consider the task of predicting temperature as a function of
2d location in a room
• f (x; θ) = b + w1 x1 + w2 x2
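
A sketch of fitting f(x; θ) = b + w1 x1 + w2 x2 by least squares on synthetic temperature-versus-location data; the coefficients 20, 1.5, −0.8 are invented ground truth.

```python
import numpy as np

# Sketch: least squares for f(x; theta) = b + w1*x1 + w2*x2 on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 2))                  # 2-d locations in the room
y = 20.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.1 * rng.normal(size=50)

X_tilde = np.column_stack([np.ones(len(X)), X])      # absorb the bias: [1, x1, x2]
theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)  # minimizes the MSE
print(theta)                                         # approximately [20.0, 1.5, -0.8]
```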
Polynomial regression

• can improve the fit by using a polynomial regression model of
degree d
• has the form f(x; θ) = w^T ϕ(x) where ϕ(x) is a feature vector
derived from the input
• for example ϕ(x) = [1, x, x^2, ···, x^d]
• a simple example of feature preprocessing, also called feature
engineering
• a process that can be done manually
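
A sketch of polynomial regression as linear regression on ϕ(x) = [1, x, x^2, ..., x^d], with invented 1-d data:

```python
import numpy as np

# Sketch: build phi(x) = [1, x, x^2, ..., x^d] and fit w by least squares.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.normal(size=x.size)   # toy quadratic data

d = 2
Phi = np.column_stack([x**k for k in range(d + 1)])   # one row per example
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                              # approximately [1.0, -2.0, 3.0]
```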
Automatic feature extraction

• can create much more powerful models by learning to do such
nonlinear feature extraction automatically
• deep neural networks decompose the feature extractor ϕ(x, V )
into a composition of simpler functions; stack of L nested
functions

f (x; θ) = fL (fL−1 (· · · (f1 (x)) · · · ))

where fl (x) = f (x; θl ) is the function at layer l


• and have been wildly successful
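
A minimal sketch of a model as a stack of nested functions f(x) = f2(f1(x)); the weights here are random placeholders rather than learned parameters.

```python
import numpy as np

# Sketch: f(x; theta) = f2(f1(x)), a two-layer composition with random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: R^3 -> R^4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: R^4 -> R^2

f1 = lambda x: np.maximum(0.0, W1 @ x + b1)     # affine map followed by a ReLU nonlinearity
f2 = lambda h: W2 @ h + b2                      # final affine map

x = np.array([0.5, -1.0, 2.0])
print(f2(f1(x)))                                # the composed model's output
```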
ImageNet
Overfitting and generalization

• the empirical risk is not really what we want to minimize

L(θ; Dtrain) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} ℓ(y, f(x; θ))

• what we really want to minimize is the theoretical expected
loss, aka the population risk

L(θ; p*) ≜ E_{p*(x,y)} [ℓ(y, f(x; θ))]

• difference is the generalization gap


• training, test, and validation data
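
A sketch of measuring the generalization gap: fit a flexible model on a small training set and compare its training risk with the risk on held-out data (all data here is synthetic).

```python
import numpy as np

# Sketch: training risk vs. held-out risk; their difference estimates the generalization gap.
rng = np.random.default_rng(0)
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.1 * rng.normal(size=n)

x_tr, y_tr = make_data(15)      # small training set, easy to overfit
x_te, y_te = make_data(500)     # large held-out set approximating the population risk

d = 12                          # deliberately high-degree polynomial
phi = lambda x: np.column_stack([x**k for k in range(d + 1)])
w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)

train_risk = np.mean((y_tr - phi(x_tr) @ w) ** 2)
test_risk  = np.mean((y_te - phi(x_te) @ w) ** 2)
print(train_risk, test_risk, test_risk - train_risk)   # a positive gap signals overfitting
```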
Unsupervised learning — clustering

• we only have observed data xn, without labels yn
• to choose the number of clusters we need to consider the tradeoff
between model complexity and fit to the data
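
One concrete clustering method is k-means; below is a short sketch of Lloyd's algorithm on invented 2-d data (the slides do not prescribe this particular algorithm).

```python
import numpy as np

# Sketch: a few iterations of k-means (Lloyd's algorithm) on toy 2-d data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

K = 3
centers = X[rng.choice(len(X), K, replace=False)]   # random initial centers
for _ in range(20):
    # assign each point to its nearest center, then move centers to the cluster means
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centers)   # close to [0,0], [3,3], [0,3], up to permutation
```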
Unsupervised learning — dimensionality reduction

• lower dimensional space that captures the “essence” of the
data
• hidden or latent factors zn ∈ R^K → xn ∈ R^D
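
One standard way to recover such a lower-dimensional representation is principal component analysis via the SVD; the sketch below uses synthetic data, and PCA itself is an assumption here rather than something the slide names.

```python
import numpy as np

# Sketch: PCA via the SVD to map x_n in R^D down to z_n in R^K (synthetic data, D=5, K=2).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                     # hidden factors
A = rng.normal(size=(2, 5))
X = Z @ A + 0.05 * rng.normal(size=(200, 5))      # observed high-dimensional data

Xc = X - X.mean(axis=0)                           # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 2
Z_hat = Xc @ Vt[:K].T                             # project onto the top-K principal directions
print(Z_hat.shape)                                # (200, 2)
```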
Reinforcement Learning

• the system has to learn how to interact with its environment
• this is encoded by means of a policy a = π(x), the action to take
in response to input from the environment
• unlike supervised or unsupervised learning, the system receives a
reward in response to its actions
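
A policy is just a mapping from observations to actions; the toy deterministic policy below uses a made-up "cart position" observation and hypothetical action names.

```python
# Sketch: a policy a = pi(x) as a plain function from observation to action.
# The observation (cart position) and action names are hypothetical.
def policy(x: float) -> str:
    """Push the cart back toward the origin."""
    return "push_left" if x > 0 else "push_right"

# The environment would respond to each chosen action with a reward signal.
print(policy(0.7))    # push_left
print(policy(-1.2))   # push_right
```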
Data

• image data sets
• text data sets
• preprocessing discrete input data
• preprocessing text data
• missing data
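
Two of the preprocessing steps listed above, sketched on made-up data: one-hot encoding a discrete input and mean-imputing missing values.

```python
import numpy as np

# Sketch: one-hot encode a categorical feature and fill in missing numeric values.
colors = ["red", "green", "red", "blue"]            # discrete input data
categories = sorted(set(colors))                    # ['blue', 'green', 'red']
one_hot = np.array([[c == cat for cat in categories] for c in colors], dtype=float)
print(one_hot)                                      # one row per example, one column per category

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])       # numeric feature with missing entries
x_filled = np.where(np.isnan(x), np.nanmean(x), x)  # simple mean imputation
print(x_filled)                                     # [1. 3. 3. 3. 5.]
```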
