
CSE 546: Machine Learning

Simon Du and Jamie Morgenstern


Traditional Algorithms

Social media mentions of Cats vs. Dogs
(Figure: mention counts from Reddit, Google, and Twitter)

Write a program that sorts tweets into those containing "cat", "dog", or other:

def sort_tweets(tweets):
    cats, dogs, other = [], [], []
    for tweet in tweets:
        if "cat" in tweet:
            cats.append(tweet)
        elif "dog" in tweet:
            dogs.append(tweet)
        else:
            other.append(tweet)
    return cats, dogs, other

Graphics courtesy of https://theoutline.com/post/3128/dogs-cats-internet-popularity?zd=1


Machine Learning algorithms

Write a program that sorts images into those containing "birds", "airplanes", or other:

def sort_images(images):
    birds, planes, other = [], [], []
    for image in images:
        if bird in image:        # pseudocode: there is no simple hand-written test for this
            birds.append(image)
        elif plane in image:
            planes.append(image)
        else:
            other.append(image)
    return birds, planes, other
Machine Learning algorithms

Write a program that sorts images into those containing "birds", "airplanes", or other.

(Figure: images labeled airplane, bird, and other plotted as points in a 2-D feature space with axes feature 1 and feature 2)

The decision rule of
    if "cat" in tweet:
is hard coded by an expert.

The decision rule of
    if bird in image:
is LEARNED using DATA.

Machine Learning Ingredients


• Data: past observations
• Hypotheses/Models: devised to capture the patterns in data
• Prediction: apply model to forecast future observations

You may also like…

ML uses past data to make personalized predictions


Flavors of ML

Regression
    Predict a continuous value:
    ex: stock market, credit score, temperature, Netflix rating

Classification
    Predict a categorical value:
    loan or not? spam or not? what disease is this?

Unsupervised Learning
    Predict structure:
    tree of life from DNA, find similar images, community detection

Mix of statistics (theory) and algorithms (programming)

CSE446/546: Machine Learning


Instructors: Simon Du and Jamie Morgenstern
Contact: cse446-staff@cs.washington.edu
Course Website: https://courses.cs.washington.edu/courses/cse446/21sp/

What this class is:


• Fundamentals of ML: bias/variance tradeoff, overfitting,
optimization and computational tradeoffs, supervised learning
(e.g., linear, boosting, deep learning), unsupervised models
(e.g., k-means, EM, PCA)
• Preparation for further learning: the field is fast-moving; you will
be able to apply the basics and teach yourself the latest

What this class is not:


• Survey course: laundry list of algorithms, how to win Kaggle
• An easy course: familiarity with intro linear algebra and probability
is assumed; homework will be time-consuming

Prerequisites

■ Formally:
■ MATH 308, CSE 312, STAT 390 or equivalent
■ Familiarity with:
■ Linear algebra
■ linear dependence, rank, linear equations, SVD
■ Multivariate calculus
■ Probability and statistics
■ Distributions, marginalization, moments, conditional expectation
■ Algorithms
■ Basic data structures, complexity
■ “Can I learn these topics concurrently?”
■ Use HW0 to judge skills
■ See website for review materials!

Grading

■ 5 homework assignments (100% = 10% + 20% + 20% + 20% + 30%)
    Each contains both theoretical and programming questions
    Collaboration is okay, but you must list who you collaborated
    with. You must write, submit, and understand your answers
    and code (which will run on an autograder)
    Do not Google for answers.
■ NO exams
■ We will assign random subgroups as PODs (once the dust
    clears)


Homework

HW 0 is out (due next Wednesday, 10/6, at midnight)
    Short review
    Work individually; treat it as a barometer for readiness
HW 1, 2, 3, 4
    They are not easy or short. Start early.
    Submit to Gradescope
    Regrade requests on Gradescope
    You have 5 late days; beyond those, late work receives no credit

1. All code must be written in Python


2. All written work must be typeset (e.g., LaTeX)

See course website for tutorials and references.



Homework & Grading

CSE 446:
    Do only the A problems
    Doing B problems will not earn a higher grade
    Grades are on a 4.0 scale (relative to students in 446)
CSE 546:
    If you do only the A problems, your grade is capped at 3.8
    B problems account for the remaining 0.2
    Final grade = A grade + B grade (relative to students in 546)

Communication Channels

■ Announcements, questions about the class, homework help
    EdStem (invitation sent; contact TAs if you need access)
    Weekly section
    Office hours
■ Regrade requests
    Directly on Gradescope
■ Personal concerns
    Email: cse446-staff@cs.washington.edu
■ Anonymous feedback
    See website for link

Textbooks

■ Required Textbook:
Machine Learning: a Probabilistic Perspective;
Kevin Murphy

■ Optional Books (free PDF):


The Elements of Statistical Learning: Data Mining,
Inference, and Prediction; Trevor Hastie, Robert
Tibshirani, Jerome Friedman

Add Codes

■ Email Elle Brown (ellean@cs.washington.edu) for add codes

Enjoy!

■ ML is becoming ubiquitous in science, engineering and


beyond
■ It’s one of the hottest topics in industry today
■ This class should give you the basic foundation for applying
ML and developing new methods
■ The fun begins…

Maximum Likelihood
Estimation
Jamie Morgenstern

Your first consulting job

Billionaire: I have a special coin; if I flip it, what's the
probability it will be heads?
You: Please flip it a few times:

You: The probability is:

Billionaire: Why?

©2019 Kevin Jamieson


Coin – Binomial Distribution

■ Data: sequence D = (HHTHT…), k heads out of n flips

■ Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ
    Flips are i.i.d.:
        Independent events
        Identically distributed according to a Binomial distribution

■ P(D|θ) = θ^k (1 - θ)^(n-k)

©2019 Kevin Jamieson

Maximum Likelihood Estimation

■ Data: sequence D = (HHTHT…), k heads out of n flips

■ Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ

    P(D|θ) = θ^k (1 - θ)^(n-k)

■ Maximum likelihood estimation (MLE): Choose θ that
  maximizes the probability of observed data:

    θ̂_MLE = arg max_θ P(D|θ)
           = arg max_θ log P(D|θ)

©2019 Kevin Jamieson

Your first learning algorithm

    θ̂_MLE = arg max_θ log P(D|θ)
           = arg max_θ log [θ^k (1 - θ)^(n-k)]

■ Set derivative to zero:   d/dθ log P(D|θ) = 0

©2019 Kevin Jamieson
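As a sanity check, here is a minimal Python sketch (my addition, not from the original slides; the sample size and true θ are arbitrary) that maximizes the binomial log-likelihood over a grid of candidate θ values and compares the result to k/n, the closed form shown on the next slide.

import numpy as np

# Sketch: simulate coin flips and maximize the binomial log-likelihood on a grid.
# n and theta_true are arbitrary illustration values.
rng = np.random.default_rng(0)
n, theta_true = 50, 0.6
k = rng.binomial(n, theta_true)          # number of heads observed

# log P(D | theta) = k log(theta) + (n - k) log(1 - theta)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])        # grid maximizer
print(k / n)                             # closed-form MLE; the two should agree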


How many flips do I need?

    θ̂_MLE = k / n

■ You: flip the coin 5 times. Billionaire: I got 3 heads.

    θ̂_MLE = 3/5 = 0.6

■ You: flip the coin 50 times. Billionaire: I got 20 heads.

    θ̂_MLE = 20/50 = 0.4

■ Billionaire: Which one is right? Why?

©2019 Kevin Jamieson

Simple bound (based on Hoeffding's inequality)

■ For n flips and k heads, the MLE is unbiased for the true θ*:

    θ̂_MLE = k / n        E[θ̂_MLE] = θ*

■ Hoeffding's inequality says that for any ε > 0:

    P(|θ̂_MLE - θ*| ≥ ε) ≤ 2 e^(-2nε²)

©2019 Kevin Jamieson

PAC Learning

■ PAC: Probably Approximately Correct

■ Billionaire: I want to know the parameter θ*, within ε = 0.1, with
  probability at least 1 - δ = 0.95. How many flips?

    P(|θ̂_MLE - θ*| ≥ ε) ≤ 2 e^(-2nε²)

©2019 Kevin Jamieson
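A small sketch of one way to answer (my addition, not from the slides): require 2e^(-2nε²) ≤ δ and solve for n, which gives n ≥ ln(2/δ) / (2ε²).

import math

def flips_needed(eps, delta):
    # Invert the Hoeffding bound: 2*exp(-2*n*eps**2) <= delta
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# eps = 0.1 and 1 - delta = 0.95, as in the billionaire's request: about 185 flips.
print(flips_needed(eps=0.1, delta=0.05))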


What about continuous variables?


■ Billionaire: What if I am measuring a continuous variable?
■ You: Let me tell you about Gaussians…

©2019 Kevin Jamieson

Some properties of Gaussians

■ Affine transformation (multiplying by a scalar and adding a constant):
    X ~ N(µ, σ²)
    Y = aX + b  ➔  Y ~ N(aµ + b, a²σ²)

■ Sum of independent Gaussians:
    X ~ N(µ_X, σ_X²)
    Y ~ N(µ_Y, σ_Y²)
    Z = X + Y  ➔  Z ~ N(µ_X + µ_Y, σ_X² + σ_Y²)

©2019 Kevin Jamieson
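A quick Monte Carlo sanity check of both facts (a sketch I'm adding; the parameter values are arbitrary, and the sum rule is checked with independent draws):

import numpy as np

rng = np.random.default_rng(1)

# Affine transformation: Y = aX + b should have mean a*mu + b and std |a|*sigma.
mu, sigma, a, b = 2.0, 3.0, 0.5, -1.0
X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b
print(Y.mean(), Y.std())     # roughly 0.0 and 1.5

# Sum of independent Gaussians: means add, variances add.
Z = rng.normal(1.0, 2.0, 1_000_000) + rng.normal(-3.0, 0.5, 1_000_000)
print(Z.mean(), Z.std())     # roughly -2.0 and sqrt(2.0**2 + 0.5**2) ≈ 2.06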


MLE for Gaussian

■ Prob. of i.i.d. samples D = {x1, …, xn} (e.g., exam scores):

    P(D|µ, σ) = P(x1, …, xn | µ, σ)
              = (1 / (σ√(2π)))^n ∏_{i=1}^n exp( -(xi - µ)² / (2σ²) )

■ Log-likelihood of data:

    log P(D|µ, σ) = -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²)

©2019 Kevin Jamieson

Your second learning algorithm:
MLE for mean of a Gaussian

• What's the MLE for the mean?

    d/dµ log P(D|µ, σ) = d/dµ [ -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²) ]

©2019 Kevin Jamieson


MLE for variance

• Again, set the derivative to zero:

    d/dσ log P(D|µ, σ) = d/dσ [ -n log(σ√(2π)) - Σ_{i=1}^n (xi - µ)² / (2σ²) ]

©2019 Kevin Jamieson


Learning Gaussian parameters

■ MLE:
    µ̂_MLE = (1/n) Σ_{i=1}^n xi
    σ̂²_MLE = (1/n) Σ_{i=1}^n (xi - µ̂_MLE)²

■ MLE for the variance of a Gaussian is biased:

    E[σ̂²_MLE] ≠ σ²

    Unbiased variance estimator:

    σ̂²_unbiased = (1/(n-1)) Σ_{i=1}^n (xi - µ̂_MLE)²

©2019 Kevin Jamieson
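To see the bias concretely, here is a small simulation sketch I'm adding (arbitrary parameters): with samples of size n = 5, the MLE variance estimator averages about (n-1)/n · σ², while the unbiased estimator averages about σ².

import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 0.0, 1.0, 5

# 100,000 independent samples of size n from N(mu_true, sigma_true^2)
samples = rng.normal(mu_true, sigma_true, size=(100_000, n))

var_mle = samples.var(axis=1, ddof=0)        # divide by n     (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (unbiased)

print(var_mle.mean())        # about 0.8 = (n-1)/n * sigma_true**2  (biased low)
print(var_unbiased.mean())   # about 1.0 = sigma_true**2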


Maximum Likelihood Estimation

Observe X1, X2, …, Xn drawn i.i.d. from f(x; θ) for some "true" θ = θ*

    Likelihood function:        L_n(θ) = ∏_{i=1}^n f(Xi; θ)

    Log-likelihood function:    ℓ_n(θ) = log(L_n(θ)) = Σ_{i=1}^n log(f(Xi; θ))

    Maximum Likelihood Estimator (MLE):   θ̂_MLE = arg max_θ L_n(θ)

Properties (under benign regularity conditions: smoothness, identifiability, etc.):

    Asymptotically consistent and normal:   (θ̂_MLE - θ*) / ŝe  ~  N(0, 1)
    Asymptotic optimality, minimum variance (see the Cramér–Rao lower bound)

©2019 Kevin Jamieson
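A simulation sketch of the asymptotic normality claim for the coin-flip MLE (my addition; here the standard error uses the true θ*, which is fine for an illustration):

import numpy as np

rng = np.random.default_rng(0)
theta_star, n, reps = 0.3, 1_000, 50_000

# For each repetition, estimate theta from n flips and standardize the error.
k = rng.binomial(n, theta_star, size=reps)
theta_hat = k / n
se = np.sqrt(theta_star * (1 - theta_star) / n)
z = (theta_hat - theta_star) / se

print(z.mean(), z.std())     # roughly 0 and 1, as N(0, 1) predicts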


Recap

■ Learning is…
    Collect some data
        E.g., coin flips
    Choose a hypothesis class or model
        E.g., binomial
    Choose a loss function
        E.g., data likelihood
    Choose an optimization procedure
        E.g., set derivative to zero to obtain MLE
    Justify the accuracy of the estimate
        E.g., Hoeffding's inequality

©2019 Kevin Jamieson


Maximum Likelihood Estimation, cont.

Machine Learning – CSE446
Jamie Morgenstern
University of Washington

©2019 Kevin Jamieson



The 1d regression problem

Given past sales data on zillow.com, predict:
    y = house sale price   from   x = # sq. ft.

(Figure: scatter plot of Sale Price vs. # square feet)

Training Data:   {(xi, yi)}_{i=1}^n

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)
         (linear model)         (noise model)

Loss function:   -Σ_{i=1}^n log p(yi | xi, w, b)

©2018 Kevin Jamieson



The 1d regression problem

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)

Loss function:

    -Σ_{i=1}^n log p(yi | xi, w, b)
        = -Σ_{i=1}^n log( (1/√(2πσ²)) exp{ -(yi - (wxi + b))² / (2σ²) } )

    arg min_{w,b} -Σ_{i=1}^n log p(yi | xi, w, b)  ≡  arg min_{w,b} Σ_{i=1}^n (yi - (wxi + b))²

©2018 Kevin Jamieson


The 1d regression problem

Given past sales data on zillow.com, predict:
    y = house sale price   from   x = # sq. ft.

(Figure: scatter plot of Sale Price vs. # square feet)

Training Data:   {(xi, yi)}_{i=1}^n

Model:   yi = xi w + b + εi,    εi ~ N(0, σ²)

Loss function:   Σ_{i=1}^n (yi - (wxi + b))²

©2018 Kevin Jamieson
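To tie the pieces together, here is a sketch I'm adding (synthetic data, arbitrary parameters) that fits w and b by minimizing the squared loss with a least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 200, 300.0, 50_000.0, 20_000.0

x = rng.uniform(500, 3500, size=n)                  # square feet
y = w_true * x + b_true + rng.normal(0, sigma, n)   # sale price = linear model + Gaussian noise

# Least squares on the design matrix [x, 1] recovers (w, b).
A = np.column_stack([x, np.ones(n)])
w_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(w_hat, b_hat)          # close to w_true and b_true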
