
Machine Learning Basics


AI vs ML vs DL

• Deep Learning is a subset of Machine Learning, which in turn is a subset of Artificial Intelligence.


A Simple Task: Image Classification


Image Classification

• Task: assign a class label to an input image.
• Challenges: occlusions, background clutter, pose, illumination, appearance.
• Representation: how the input image is represented for the classifier.


A Simple Classifier



Nearest Neighbor

• NN classifier: assign the label of the nearest training sample according to a distance function (e.g., NN classifier = dog).
• k-NN classifier: take a majority vote over the k nearest training samples (e.g., k-NN classifier = cat).
• Comparison on a toy dataset: the data, NN classifier, 5-NN classifier.

Questions:
• How does the NN classifier perform on training data?
• Which classifier is more likely to perform best on test data?
• What are we actually learning?

Source: https://commons.wikimedia.org/wiki/File:Data3classes.png
Nearest Neighbor

• Hyperparameters:
  – Distance metric: L1 distance $|x - c|$ or L2 distance $\|x - c\|_2$
  – Number of neighbors: $k$
• These parameters are problem dependent.
• How do we choose these hyperparameters? (A small sketch follows below.)
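To make the k-NN idea concrete, here is a minimal NumPy sketch of a k-NN classifier with an L2 distance on made-up 2D points; it is only an illustration, not the course's reference implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test point by a majority vote over its k nearest training points (L2 distance)."""
    predictions = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)    # L2 distance to every training sample
        nearest = np.argsort(dists)[:k]                # indices of the k closest samples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])  # majority vote
    return np.array(predictions)

# toy usage: 0 = cat, 1 = dog (hypothetical feature vectors)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.2, 0.1]]), k=3))   # -> [0]
```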


Machine Learning for Classification


Machine Learning

• How can we learn to perform image classification?
  – Task: image classification
  – Experience: data


Machine Learning

• Model: $M_\theta(I) \in \{\text{DOG}, \text{CAT}\}$, mapping an input image $I$ to a class label using the model parameters $\theta$.

• Given training images $I_i$ with labels $Y_i \in \{\text{DOG}, \text{CAT}\}$, find

  $\theta^* = \arg\min_\theta \sum_i D\!\left(M_\theta(I_i) - Y_i\right)$

  where $D$ is a "distance" function.
Basic Recipe for Machine Learning

• Split your data, e.g., 60% / 20% / 20% into train / validation / test.
  – Train set: find the model parameters $\theta$.
  – Validation set: find your hyperparameters.
  – Test set: only used once!
• Other splits are also possible (e.g., 80%/10%/10%).
• A minimal sketch of such a split follows below.
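As an illustration of this recipe, a minimal NumPy sketch (a hypothetical helper, not part of the lecture code) that shuffles a dataset once and splits it 60/20/20:

```python
import numpy as np

def split_data(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle once, then split into train / validation / test according to `ratios`."""
    assert abs(sum(ratios) - 1.0) < 1e-8
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                        # random shuffle of sample indices
    n_train = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# toy usage with made-up data: 100 samples, 4 features each
X, y = np.random.rand(100, 4), np.random.randint(0, 2, size=100)
train_set, val_set, test_set = split_data(X, y)
print(len(train_set[0]), len(val_set[0]), len(test_set[0]))   # 60 20 20
```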


Machine Learning

• How can we learn to perform image classification?
  – Task: image classification
  – Experience: data
  – Performance measure: accuracy


Machine Learning

• Supervised learning: labels or target classes are given (e.g., each training image is labeled CAT or DOG).

• Unsupervised learning:
  – No label or target class.
  – Find out properties of the structure of the data.
  – Clustering (k-means, PCA, etc.)


Machine Learning

• Reinforcement learning: agents interact with an environment and receive a reward.


Linear Decision Boundaries

• Let's start with a simple linear model!
• What are the pros and cons of using linear decision boundaries?


Linear Regression



Linear Regression

• Supervised learning.
• Find a linear model that explains a target $y$ given inputs $x$.

Training: $\{x_{1:n}, y_{1:n}\} \rightarrow$ Learner $\rightarrow \theta$
• Data points: inputs (e.g., image, measurement) and labels (e.g., cat/dog).
• $\theta$: model parameters (these can also be the parameters of a neural network).

Testing: $(x_{n+1}, \theta) \rightarrow$ Predictor $\rightarrow \hat{y}_{n+1}$ (the estimation)
Linear Prediction

• A linear model is expressed in the form

  $\hat{y}_i = \sum_{j=1}^{d} x_{ij}\,\theta_j$

  where $d$ is the input dimension, $x_{ij}$ is the input data (features), and $\theta_j$ are the weights (i.e., the model parameters).

• Adding a bias $\theta_0$:

  $\hat{y}_i = \theta_0 + \sum_{j=1}^{d} x_{ij}\,\theta_j = \theta_0 + x_{i1}\theta_1 + x_{i2}\theta_2 + \dots + x_{id}\theta_d$
Linear Prediction

• Example: predict the temperature of a building from
  – $x_1$: outside temperature (weight $\theta_1$)
  – $x_2$: level of humidity (weight $\theta_2$)
  – $x_3$: number of people (weight $\theta_3$)
  – $x_4$: sun exposure (weight $\theta_4$)


Linear Prediction

$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \theta_0 + \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ x_{21} & \cdots & x_{2d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix} \cdot \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix}$

Absorbing the bias by prepending a column of ones:

$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix} \cdot \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \;\Rightarrow\; \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$

• $\hat{\mathbf{y}}$: prediction
• $\mathbf{X}$: input features (one sample per row, each with $d$ features)
• $\boldsymbol{\theta}$: model parameters ($d$ weights and 1 bias)


Linear Prediction

• Example: temperature of the building, with the MODEL (the parameters $\boldsymbol{\theta}$) on the right:

  $\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \end{bmatrix} = \begin{bmatrix} 1 & 25 & 50 & 2 & 50 \\ 1 & -10 & 50 & 0 & 10 \end{bmatrix} \cdot \begin{bmatrix} 0.2 \\ 0.64 \\ 0 \\ 1 \\ 0.14 \end{bmatrix}$
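A minimal NumPy check of this prediction, reading the slide's numbers as the design matrix above and $\boldsymbol{\theta} = (0.2, 0.64, 0, 1, 0.14)^T$ (the exact grouping of the printed numbers is an assumption from the slide layout):

```python
import numpy as np

# design matrix X (leading 1 = bias feature) and model parameters θ as read from the example
X = np.array([[1.0,  25.0, 50.0, 2.0, 50.0],
              [1.0, -10.0, 50.0, 0.0, 10.0]])
theta = np.array([0.2, 0.64, 0.0, 1.0, 0.14])

y_hat = X @ theta    # ŷ = Xθ: one predicted building temperature per row of X
print(y_hat)         # [25.2, -4.8] under this reading of the slide's numbers
```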
How to Obtain the Model?

• Data points $\mathbf{X}$ and labels (ground truth) $\mathbf{y}$.
• The model parameters $\boldsymbol{\theta}$ produce an estimation $\hat{\mathbf{y}}$.
• A loss function compares the estimation against the ground truth, and optimization adjusts the model parameters.
How to Obtain the Model?

• Loss function: measures how good my estimation is (how good my model is) and tells the optimization method how to make it better.

• Optimization: changes the model in order to improve the loss function (i.e., to improve my estimation).


Linear Regression: Loss Function

• Prediction: temperature of the building.

• Minimize

  $J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$

  also called the objective function, energy, or cost function.
Optimization: Linear Least Squares

• Linear least squares: an approach to fit a linear model to the data

  $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$

• Convex problem: there exists a closed-form solution that is unique.

• The estimation comes from the linear model, so for $n$ training samples

  $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}_i\boldsymbol{\theta} - y_i\right)^2$

• In matrix notation ($n$ training samples and $n$ labels, each input vector of size $d$; more on matrix notation in the next exercise session):

  $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = (\mathbf{X}\boldsymbol{\theta} - \mathbf{y})^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$

• Since the problem is convex, the optimum is where $\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 0$.
Optimization

$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} - 2\mathbf{X}^T\mathbf{y} = 0$

$\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

• Details in the exercise session!
• We have found an analytical solution to a convex problem.
• Inputs: outside temperature, number of people, ...; true output: temperature of the building.
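A minimal NumPy sketch of this closed-form solution on made-up data (the helper name and the synthetic data are assumptions, not the lecture's code); np.linalg.solve is used instead of explicitly inverting $\mathbf{X}^T\mathbf{X}$, which is numerically more stable:

```python
import numpy as np

def fit_linear_least_squares(X_raw, y):
    """Closed-form least squares: θ = (XᵀX)⁻¹ Xᵀy, with a bias column prepended to X."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)   # solves (XᵀX) θ = Xᵀy

# toy usage: 100 made-up samples of [outside temp, humidity, #people, sun exposure]
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))
true_theta = np.array([0.2, 0.64, 0.0, 1.0, 0.14])                    # [bias, θ1..θ4]
y = np.hstack([np.ones((100, 1)), X_raw]) @ true_theta + 0.1 * rng.normal(size=100)
print(fit_linear_least_squares(X_raw, y))                              # ≈ true_theta
```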
Is this the best Estimate?

• Least squares estimate:

  $J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$


Maximum Likelihood



Maximum Likelihood Estimate

• $p_{data}(\mathbf{y}|\mathbf{X})$: the true underlying distribution.
• $p_{model}(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$: a parametric family of distributions, controlled by the parameter(s) $\boldsymbol{\theta}$.

• A method of estimating the parameters of a statistical model given observations (drawn from $p_{data}(\mathbf{y}|\mathbf{X})$), by finding the parameter values that maximize the likelihood of making the observations given the parameters:

  $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}}\, p_{model}(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$

• MLE assumes that the training samples are independent and generated by the same probability distribution (the "i.i.d." assumption):

  $p_{model}(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{n} p_{model}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$


Maximum Likelihood Estimate

$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{n} p_{model}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$

$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log p_{model}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$

• Logarithmic property: $\log(ab) = \log a + \log b$; since $\log$ is monotonically increasing, taking the log does not change the maximizer.
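A tiny NumPy illustration (with made-up per-sample likelihoods) of why the log form is also preferred in practice: the raw product of many likelihoods underflows to zero, while the sum of logs stays finite and has the same maximizer:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=1000)   # hypothetical per-sample likelihoods p_model(y_i | x_i, θ)

print(np.prod(p))           # ≈ 0.0 — the product of 1000 small numbers underflows
print(np.sum(np.log(p)))    # finite log-likelihood; maximizing it maximizes the product too
```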


Back to Linear Regression

$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log p_{model}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$

• What shape does our probability distribution $p(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ have?

• Assume Gaussian noise: $y_i = \mathbf{x}_i\boldsymbol{\theta} + \sigma_i$ with $\sigma_i \sim \mathcal{N}(0, \sigma^2)$, i.e., the mean of the distribution is $\mathbf{x}_i\boldsymbol{\theta}$.

• Gaussian: $y_i \sim \mathcal{N}(\mu, \sigma^2)$ means $p(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}$

• Therefore $y_i \sim \mathcal{N}(\mathbf{x}_i\boldsymbol{\theta}, \sigma^2)$ and

  $p(y_i|\mathbf{x}_i, \boldsymbol{\theta}) = (2\pi\sigma^2)^{-1/2}\, e^{-\frac{1}{2\sigma^2}(y_i - \mathbf{x}_i\boldsymbol{\theta})^2}$
Back to Linear Regression

• Plugging $p(y_i|\mathbf{x}_i, \boldsymbol{\theta}) = (2\pi\sigma^2)^{-1/2}\, e^{-\frac{1}{2\sigma^2}(y_i - \mathbf{x}_i\boldsymbol{\theta})^2}$ into the original optimization problem

  $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log p_{model}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$

  gives

  $\sum_{i=1}^{n} \log\left[(2\pi\sigma^2)^{-\frac{1}{2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mathbf{x}_i\boldsymbol{\theta})^2}\right]$

• Canceling $\log$ and $e$:

  $\sum_{i=1}^{n} -\frac{1}{2}\log(2\pi\sigma^2) + \sum_{i=1}^{n} -\frac{1}{2\sigma^2}\left(y_i - \mathbf{x}_i\boldsymbol{\theta}\right)^2$

• In matrix notation:

  $-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\right)^T\left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\right)$

• How can we find the estimate of theta? Set $\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 0$ (details in the exercise session!):

  $\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$


Linear Regression

• The Maximum Likelihood Estimate (MLE) corresponds to the Least Squares Estimate (given the assumptions).

• Introduced the concepts of loss function and optimization to obtain the best model for regression.


Image Classification



Regression vs Classification

• Regression: predict a continuous output value (e.g., temperature of a room).

• Classification: predict a discrete value
  – Binary classification: output is either 0 or 1
  – Multi-class classification: set of N classes


Logistic Regression

• CAT classifier


Sigmoid for Binary Predictions

• The inputs $x_0, x_1, x_2$ are weighted by $\theta_0, \theta_1, \theta_2$ and summed ($\Sigma$); the sum is passed through the sigmoid

  $\sigma(x) = \frac{1}{1 + e^{-x}}$

• The output lies between 0 and 1 and can be interpreted as a probability $p(y_i = 1|\mathbf{x}_i, \boldsymbol{\theta})$.


Spoiler Alert: 1-Layer Neural Network

• The same structure (a weighted sum of the inputs followed by a sigmoid, interpreted as $p(y_i = 1|\mathbf{x}_i, \boldsymbol{\theta})$) is exactly a 1-layer neural network.
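A minimal NumPy sketch of this forward pass, with a made-up sample and made-up weights (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    """σ(x) = 1 / (1 + e^(-x)): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x_i = np.array([1.0, 0.5, -1.2])     # [x0 = 1 (bias input), x1, x2], hypothetical features
theta = np.array([0.1, 2.0, 0.7])    # [θ0, θ1, θ2], hypothetical weights
p_cat = sigmoid(x_i @ theta)         # weighted sum Σ, then sigmoid -> p(y_i = 1 | x_i, θ)
print(p_cat)                         # ≈ 0.56
```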


Logistic Regression

• Probability of a binary output:

  $\hat{\mathbf{y}} = p(\mathbf{y} = 1|\mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{n} p(y_i = 1|\mathbf{x}_i, \boldsymbol{\theta})$

  where $\hat{y}_i = \sigma(\mathbf{x}_i\boldsymbol{\theta})$ is the prediction of our sigmoid.

• Bernoulli trial (the model for coins):

  $p(z|\phi) = \phi^z (1 - \phi)^{1-z} = \begin{cases} \phi, & \text{if } z = 1 \\ 1 - \phi, & \text{if } z = 0 \end{cases}$

• Applying it with the sigmoid prediction $\hat{y}_i$ (continuous) and the true labels $y_i$ (0 or 1):

  $\hat{\mathbf{y}} = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} \left(1 - \hat{y}_i\right)^{(1-y_i)}$
Logistic Regression: Loss Function

• Probability of a binary output:

  $p(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) = \hat{\mathbf{y}} = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} \left(1 - \hat{y}_i\right)^{(1-y_i)}$

• Maximum Likelihood Estimate:

  $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \log p(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$

• Taking the log:

  $\sum_{i=1}^{n} \log\left[\hat{y}_i^{\,y_i} \left(1 - \hat{y}_i\right)^{(1-y_i)}\right] = \sum_{i=1}^{n} y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)$
Logistic Regression: Loss Function

$\mathcal{L}(\hat{y}_i, y_i) = -\left[y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$

• We maximize the likelihood, $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} \log p(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta})$, by minimizing this loss function.

• $y_i = 1$: $\mathcal{L}(\hat{y}_i, 1) = -\log\hat{y}_i$
  To minimize $\mathcal{L}(\hat{y}_i, y_i)$, we want $\log\hat{y}_i$ to be large; since the logarithm is a monotonically increasing function, we want a large $\hat{y}_i$ (1 is the largest value our model's estimate can take!).

• $y_i = 0$: $\mathcal{L}(\hat{y}_i, 0) = -\log(1 - \hat{y}_i)$
  We want $\log(1 - \hat{y}_i)$ to be large, so we want $\hat{y}_i$ to be small (0 is the smallest value our model's estimate can take!).


Logistic Regression: Loss Function

$\mathcal{L}(\hat{y}_i, y_i) = -\left[y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$

• Referred to as the binary cross-entropy loss (BCE).
• Related to the multi-class loss you will see in this course (also called softmax loss).
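A small NumPy sketch of the binary cross-entropy loss as defined above, with made-up labels and sigmoid outputs (the epsilon clipping is a common practical guard against log(0), not part of the formula on the slide):

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy, averaged over the samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 0])                        # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.4])                # hypothetical sigmoid outputs
print(bce_loss(y_hat, y))                             # ≈ 0.30: predictions mostly agree with labels
```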


Logistic Regression: Optimization

• Loss function:

  $\mathcal{L}(\hat{y}_i, y_i) = -\left[y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$

• Cost function (to be minimized), with $\hat{y}_i = \sigma(\mathbf{x}_i\boldsymbol{\theta})$:

  $C(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\hat{y}_i, y_i) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$
Logistic Regression: Optimization

• No closed-form solution.
• Make use of an iterative method → gradient descent (covered later on!).
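Gradient descent is introduced later in the course; purely as a preview, here is a minimal NumPy sketch of logistic regression trained with gradient descent on made-up data. It uses the standard gradient of the averaged BCE cost, $\nabla_{\boldsymbol{\theta}} C = \frac{1}{n}\mathbf{X}^T(\sigma(\mathbf{X}\boldsymbol{\theta}) - \mathbf{y})$, which is not derived on these slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_logistic_regression(X, y, lr=0.1, steps=1000):
    """Minimize the averaged BCE cost with plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        y_hat = sigmoid(X @ theta)            # current predictions
        grad = X.T @ (y_hat - y) / len(y)     # gradient of the BCE cost w.r.t. θ
        theta -= lr * grad                    # gradient descent step
    return theta

# toy usage: two made-up 2D clusters, labels 0/1, bias column prepended
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X = np.hstack([np.ones((100, 1)), X_raw])
theta = train_logistic_regression(X, y)
print(((sigmoid(X @ theta) > 0.5) == y).mean())   # training accuracy, should be close to 1
```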


Why is Machine Learning so Cool?

• We can learn from experience
  → intelligence, a certain ability to infer the future!

• Even linear models are often pretty good for complex phenomena, e.g., the weather:
  – A linear combination of time of day, day of year, etc. is often pretty good.


Next Lectures

• Next exercise session: Math Recap II

• Next lecture (Lecture 3):
  – Jumping towards our first Neural Networks and Computational Graphs


References for further Reading

• Cross validation:
  – https://medium.com/@zstern/k-fold-cross-validation-explained-5aeba90ebb3
  – https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

• General Machine Learning book:
  – Pattern Recognition and Machine Learning. C. Bishop.


See you next week ☺
