
CSE 546: Machine Learning

Simon Du and Jamie Morgenstern


Traditional Algorithms

Social media mentions of Cats vs. Dogs
(Figure: mention counts from Reddit, Google, and Twitter)

Write a program that sorts tweets into those containing "cat", "dog", or other:

def sort_tweets(tweets):
    cats, dogs, other = [], [], []
    for tweet in tweets:
        if "cat" in tweet:
            cats.append(tweet)
        elif "dog" in tweet:
            dogs.append(tweet)
        else:
            other.append(tweet)
    return cats, dogs, other

Graphics courtesy of https://theoutline.com/post/3128/dogs-cats-internet-popularity?zd=1


Machine Learning algorithms

Write a program that sorts images into those containing "birds", "airplanes", or other:

def sort_images(images):
    birds, planes, other = [], [], []
    for image in images:
        if bird in image:        # pseudocode: there is no simple hand-written test for this
            birds.append(image)
        elif plane in image:
            planes.append(image)
        else:
            other.append(image)
    return birds, planes, other
Machine Learning algorithms

Write a program that sorts images into those containing "birds", "airplanes", or other.

(Figure: images labeled airplane, bird, and other plotted as points in a 2-D feature space with axes feature 1 and feature 2)

The decision rule of
    if "cat" in tweet:
is hard coded by an expert.

The decision rule of
    if bird in image:
is LEARNED using DATA.

Machine Learning Ingredients


• Data: past observations
• Hypotheses/Models: devised to capture the patterns in data
• Prediction: apply model to forecast future observations

You may also like…

ML uses past data to make personalized predictions


Flavors of ML

Regression
    Predict a continuous value:
    ex: stock market, credit score, temperature, Netflix rating

Classification
    Predict a categorical value:
    loan or not? spam or not? what disease is this?

Unsupervised Learning
    Predict structure:
    tree of life from DNA, find similar images, community detection

Mix of statistics (theory) and algorithms (programming)

CSE446/546: Machine Learning


Instructors: Simon Du and Jamie Morgenstern
Contact: cse446-staff@cs.washington.edu
Course Website: https://courses.cs.washington.edu/courses/cse446/21sp/

What this class is:


• Fundamentals of ML: bias/variance tradeoff, overfitting,
optimization and computational tradeoffs, supervised learning
(e.g., linear, boosting, deep learning), unsupervised models
(e.g., k-means, EM, PCA)
• Preparation for further learning: the field is fast-moving; you will
be able to apply the basics and teach yourself the latest

What this class is not:


• Survey course: laundry list of algorithms, how to win Kaggle
• An easy course: familiarity with intro linear algebra and probability
is assumed; homework will be time-consuming

Prerequisites

■ Formally:
■ MATH 308, CSE 312, STAT 390 or equivalent
■ Familiarity with:
■ Linear algebra
■ linear dependence, rank, linear equations, SVD
■ Multivariate calculus
■ Probability and statistics
■ Distributions, marginalization, moments, conditional expectation
■ Algorithms
■ Basic data structures, complexity
■ “Can I learn these topics concurrently?”
■ Use HW0 to judge skills
■ See website for review materials!

Grading

■ 5 homework assignments (100% = 10% + 20% + 20% + 20% + 30%)
    Each contains both theoretical and programming questions
    Collaboration is okay, but you must list who you collaborated
    with. You must write, submit, and understand your answers
    and code (which will run on an autograder)
    Do not Google for answers.
■ NO exams
■ We will assign random subgroups as PODs (once the dust
    clears)


Homework

HW 0 is out (due next Wednesday, 10/6, at midnight)
    Short review
    Work individually; treat it as a barometer for readiness
HW 1, 2, 3, 4
    They are not easy or short. Start early.
    Submit to Gradescope
    Regrade requests on Gradescope
    You have 5 late days; beyond those, late work receives no credit

1. All code must be written in Python


2. All written work must be typeset (e.g., LaTeX)

See course website for tutorials and references.



Homework & Grading

CSE 446:
    Do only the A problems
    Doing B problems will not earn a higher grade
    Grades are on a 4.0 scale (relative to students in 446)
CSE 546:
    If you do only the A problems, your grade is capped at 3.8
    B problems account for the remaining 0.2
    Final grade = A grade + B grade (relative to students in 546)

Communication Channels

■ Announcements, questions about the class, homework help
    EdStem (invitation sent; contact TAs if you need access)
    Weekly section
    Office hours
■ Regrade requests
    Directly on Gradescope
■ Personal concerns
    Email: cse446-staff@cs.washington.edu
■ Anonymous feedback
    See website for link

Textbooks

■ Required Textbook:
Machine Learning: a Probabilistic Perspective;
Kevin Murphy

■ Optional Books (free PDF):


The Elements of Statistical Learning: Data Mining,
Inference, and Prediction; Trevor Hastie, Robert
Tibshirani, Jerome Friedman

Add Codes

■ Email Elle Brown (ellean@cs.washington.edu) for add codes

Enjoy!

■ ML is becoming ubiquitous in science, engineering and


beyond
■ It’s one of the hottest topics in industry today
■ This class should give you the basic foundation for applying
ML and developing new methods
■ The fun begins…

Maximum Likelihood
Estimation
Jamie Morgenstern

Your first consulting job

Billionaire: I have a special coin; if I flip it, what's the
probability it will be heads?
You: Please flip it a few times:

You: The probability is:

Billionaire: Why?

©2019 Kevin Jamieson


Coin – Binomial Distribution

■ Data: sequence D = (HHTHT…), k heads out of n flips

■ Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ
    Flips are i.i.d.:
        Independent events
        Identically distributed according to a Binomial distribution

■ P(D|θ) = θ^k (1 - θ)^(n-k)

©2019 Kevin Jamieson

Maximum Likelihood Estimation

■ Data: sequence D = (HHTHT…), k heads out of n flips

■ Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ

    P(D|θ) = θ^k (1 - θ)^(n-k)

■ Maximum likelihood estimation (MLE): Choose θ that
  maximizes the probability of observed data:

    θ̂_MLE = arg max_θ P(D|θ)
           = arg max_θ log P(D|θ)

©2019 Kevin Jamieson

Your first learning algorithm

    θ̂_MLE = arg max_θ log P(D|θ)
           = arg max_θ log [θ^k (1 - θ)^(n-k)]

■ Set derivative to zero:   d/dθ log P(D|θ) = 0

©2019 Kevin Jamieson
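As a sanity check, here is a minimal Python sketch (my addition, not from the original slides; the sample size and true θ are arbitrary) that maximizes the binomial log-likelihood over a grid of candidate θ values and compares the result to k/n, the closed form shown on the next slide.

import numpy as np

# Sketch: simulate coin flips and maximize the binomial log-likelihood on a grid.
# n and theta_true are arbitrary illustration values.
rng = np.random.default_rng(0)
n, theta_true = 50, 0.6
k = rng.binomial(n, theta_true)          # number of heads observed

# log P(D | theta) = k log(theta) + (n - k) log(1 - theta)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])        # grid maximizer
print(k / n)                             # closed-form MLE; the two should agree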


How many flips do I need?

    θ̂_MLE = k / n

■ You: flip the coin 5 times. Billionaire: I got 3 heads.

    θ̂_MLE = 3/5 = 0.6

■ You: flip the coin 50 times. Billionaire: I got 20 heads.

    θ̂_MLE = 20/50 = 0.4

■ Billionaire: Which one is right? Why?

©2019 Kevin Jamieson

Simple bound (based on Hoeffding's inequality)

■ For n flips and k heads, the MLE is unbiased for the true θ*:

    θ̂_MLE = k / n        E[θ̂_MLE] = θ*

■ Hoeffding's inequality says that for any ε > 0:

    P(|θ̂_MLE - θ*| ≥ ε) ≤ 2 e^(-2nε²)

©2019 Kevin Jamieson

PAC Learning

■ PAC: Probably Approximately Correct

■ Billionaire: I want to know the parameter θ*, within ε = 0.1, with
  probability at least 1 - δ = 0.95. How many flips?

    P(|θ̂_MLE - θ*| ≥ ε) ≤ 2 e^(-2nε²)

©2019 Kevin Jamieson
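A small sketch of one way to answer (my addition, not from the slides): require 2e^(-2nε²) ≤ δ and solve for n, which gives n ≥ ln(2/δ) / (2ε²).

import math

def flips_needed(eps, delta):
    # Invert the Hoeffding bound: 2*exp(-2*n*eps**2) <= delta
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# eps = 0.1 and 1 - delta = 0.95, as in the billionaire's request: about 185 flips.
print(flips_needed(eps=0.1, delta=0.05))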


What about continuous variables?


■ Billionaire: What if I am measuring a continuous variable?
■ You: Let me tell you about Gaussians…

©2019 Kevin Jamieson

Some properties of Gaussians

■ Affine transformation (multiplying by a scalar and adding a constant):
    X ~ N(µ, σ²)
    Y = aX + b  ➔  Y ~ N(aµ + b, a²σ²)

■ Sum of independent Gaussians:
    X ~ N(µ_X, σ_X²)
    Y ~ N(µ_Y, σ_Y²)
    Z = X + Y  ➔  Z ~ N(µ_X + µ_Y, σ_X² + σ_Y²)

©2019 Kevin Jamieson
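A quick Monte Carlo sanity check of both facts (a sketch I'm adding; the parameter values are arbitrary, and the sum rule is checked with independent draws):

import numpy as np

rng = np.random.default_rng(1)

# Affine transformation: Y = aX + b should have mean a*mu + b and std |a|*sigma.
mu, sigma, a, b = 2.0, 3.0, 0.5, -1.0
X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b
print(Y.mean(), Y.std())     # roughly 0.0 and 1.5

# Sum of independent Gaussians: means add, variances add.
Z = rng.normal(1.0, 2.0, 1_000_000) + rng.normal(-3.0, 0.5, 1_000_000)
print(Z.mean(), Z.std())     # roughly -2.0 and sqrt(2.0**2 + 0.5**2) ≈ 2.06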


MLE for Gaussian

■ Prob. of i.i.d. samples D = {x1, …, xn} (e.g., exam scores):

    P(D|µ, σ) = P(x1, …, xn | µ, σ)
              = (1 / (σ√(2π)))^n ∏_{i=1}^n exp( -(xi - µ)² / (2σ²) )

■ Log-likelihood of data:

    log P(D|µ, σ) = -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²)

©2019 Kevin Jamieson

Your second learning algorithm:
MLE for mean of a Gaussian

• What's the MLE for the mean?

    d/dµ log P(D|µ, σ) = d/dµ [ -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²) ]

©2019 Kevin Jamieson


MLE for variance

• Again, set the derivative to zero:

    d/dσ log P(D|µ, σ) = d/dσ [ -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²) ]

©2019 Kevin Jamieson


Learning Gaussian parameters

■ MLE:
    µ̂_MLE = (1/n) Σ_{i=1}^n xi
    σ̂²_MLE = (1/n) Σ_{i=1}^n (xi - µ̂_MLE)²

■ MLE for the variance of a Gaussian is biased:

    E[σ̂²_MLE] ≠ σ²

    Unbiased variance estimator:

    σ̂²_unbiased = (1/(n-1)) Σ_{i=1}^n (xi - µ̂_MLE)²

©2019 Kevin Jamieson
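To see the bias concretely, here is a small simulation sketch I'm adding (arbitrary parameters): with samples of size n = 5, the MLE variance estimator averages about (n-1)/n · σ², while the unbiased estimator averages about σ².

import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 0.0, 1.0, 5

# 100,000 independent samples of size n from N(mu_true, sigma_true^2)
samples = rng.normal(mu_true, sigma_true, size=(100_000, n))

var_mle = samples.var(axis=1, ddof=0)        # divide by n     (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (unbiased)

print(var_mle.mean())        # about 0.8 = (n-1)/n * sigma_true**2  (biased low)
print(var_unbiased.mean())   # about 1.0 = sigma_true**2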


Maximum Likelihood Estimation

Observe X1, X2, …, Xn drawn i.i.d. from f(x; θ) for some "true" θ = θ*

    Likelihood function:        L_n(θ) = ∏_{i=1}^n f(Xi; θ)

    Log-likelihood function:    ℓ_n(θ) = log(L_n(θ)) = Σ_{i=1}^n log(f(Xi; θ))

    Maximum Likelihood Estimator (MLE):   θ̂_MLE = arg max_θ L_n(θ)

Properties (under benign regularity conditions: smoothness, identifiability, etc.):

    Asymptotically consistent and normal:   (θ̂_MLE - θ*) / ŝe  ~  N(0, 1)
    Asymptotic optimality, minimum variance (see the Cramér–Rao lower bound)

©2019 Kevin Jamieson
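A simulation sketch of the asymptotic normality claim for the coin-flip MLE (my addition; here the standard error uses the true θ*, which is fine for an illustration):

import numpy as np

rng = np.random.default_rng(0)
theta_star, n, reps = 0.3, 1_000, 50_000

# For each repetition, estimate theta from n flips and standardize the error.
k = rng.binomial(n, theta_star, size=reps)
theta_hat = k / n
se = np.sqrt(theta_star * (1 - theta_star) / n)
z = (theta_hat - theta_star) / se

print(z.mean(), z.std())     # roughly 0 and 1, as N(0, 1) predicts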


Recap

■ Learning is…
    Collect some data
        E.g., coin flips
    Choose a hypothesis class or model
        E.g., binomial
    Choose a loss function
        E.g., data likelihood
    Choose an optimization procedure
        E.g., set derivative to zero to obtain MLE
    Justify the accuracy of the estimate
        E.g., Hoeffding's inequality

©2019 Kevin Jamieson


Maximum Likelihood Estimation, cont.

Machine Learning – CSE446
Jamie Morgenstern
University of Washington

©2019 Kevin Jamieson



The 1d regression problem

Given past sales data on zillow.com, predict:
    y = house sale price   from   x = # sq. ft.

(Figure: scatter plot of Sale Price vs. # square feet)

Training Data:   {(xi, yi)}_{i=1}^n

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)
         (linear model)         (noise model)

Loss function:   -Σ_{i=1}^n log p(yi | xi, w, b)

©2018 Kevin Jamieson



The 1d regression problem

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)

Loss function:

    -Σ_{i=1}^n log p(yi | xi, w, b)
        = -Σ_{i=1}^n log( (1/√(2πσ²)) exp{ -(yi - (wxi + b))² / (2σ²) } )

    arg min_{w,b} -Σ_{i=1}^n log p(yi | xi, w, b)  ≡  arg min_{w,b} Σ_{i=1}^n (yi - (wxi + b))²

©2018 Kevin Jamieson


The 1d regression problem

Given past sales data on zillow.com, predict:
    y = house sale price   from   x = # sq. ft.

(Figure: scatter plot of Sale Price vs. # square feet)

Training Data:   {(xi, yi)}_{i=1}^n

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)

Loss function:   Σ_{i=1}^n (yi - (wxi + b))²

©2018 Kevin Jamieson
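To tie the pieces together, here is a sketch I'm adding (synthetic data, arbitrary parameters) that fits w and b by minimizing the squared loss with a least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 200, 300.0, 50_000.0, 20_000.0

x = rng.uniform(500, 3500, size=n)                  # square feet
y = w_true * x + b_true + rng.normal(0, sigma, n)   # sale price = linear model + Gaussian noise

# Least squares on the design matrix [x, 1] recovers (w, b).
A = np.column_stack([x, np.ones(n)])
w_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(w_hat, b_hat)          # close to w_true and b_true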
