
CS 229 Lecture Three

Supervised Learning: Classification

Chris Ré

April 2, 2023
Supervised Learning and Classification

▶ Linear Regression via a Probabilistic Interpretation
▶ Logistic Regression
▶ Optimization Method: Newton's Method

We'll learn the maximum likelihood method (a probabilistic
interpretation) to generalize from linear regression to more
sophisticated models.

A Justification for Least Squares?

▶ Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} in which
  x^(i) ∈ R^(d+1) and y^(i) ∈ R.
▶ Do: find θ ∈ R^(d+1) s.t. θ = argmin_θ ∑_{i=1}^n (hθ(x^(i)) − y^(i))^2, in
  which hθ(x) = θ^T x.

Where did this model come from?

One way to view it is via a probabilistic interpretation (helpful
throughout the course).

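For reference, here is a minimal NumPy sketch of this hypothesis and objective; the function and array names are our own, and X stacks the inputs x^(i) as rows.

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def least_squares_objective(theta, X, y):
    """sum_i (h_theta(x^(i)) - y^(i))^2, with the x^(i) as rows of X."""
    return np.sum((X @ theta - y) ** 2)
```
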
A Justification for Least Squares?

We make an assumption (common in statistics) that the data are
generated according to some model (that may contain random
choices). That is,

      y^(i) = θ^T x^(i) + ε^(i).

Here, ε^(i) is a random variable that captures "noise", that is,
unmodeled effects, measurement errors, etc.

Please keep in mind: this is just a model! As they say, all
models are wrong but some models are useful. This model
has been shockingly useful.

What do we expect of the noise?

What properties should we expect from ε^(i) in

      y^(i) = θ^T x^(i) + ε^(i)?

Again, it's a model and ε^(i) is a random variable:

▶ E[ε^(i)] = 0 – the noise is unbiased.
▶ The errors for different points are independent and identically
  distributed (called iid):

      E[ε^(i) ε^(j)] = E[ε^(i)] E[ε^(j)] for i ≠ j,

  and

      E[(ε^(i))^2] = σ^2.

Here σ^2 is some measure of how noisy the data are. Turns out,
this effectively defines the Gaussian or Normal distribution.

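Putting these assumptions together, here is a minimal NumPy sketch that samples a synthetic training set from y^(i) = θ^T x^(i) + ε^(i) with Gaussian noise; the sizes, the "true" θ, and σ are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                        # number of examples and input dimension (illustrative)
theta_true = rng.normal(size=d + 1)  # hypothetical "true" parameters (intercept + d weights)
sigma = 0.5                          # noise standard deviation (illustrative)

# Each x^(i) lives in R^(d+1); the leading 1 is the intercept feature.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
eps = rng.normal(0.0, sigma, size=n)     # eps^(i) ~ N(0, sigma^2)
y = X @ theta_true + eps                 # y^(i) = theta^T x^(i) + eps^(i)
```
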
Notation for the Gaussian

We write z ∼ N(µ, σ^2) and read these symbols as

      z is distributed as a normal with mean µ and variance σ^2,

or equivalently,

      P(z) = 1/(σ√(2π)) exp( −(z − µ)^2 / (2σ^2) ).

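As a quick sanity check on the notation, a small sketch (the function name is ours) that evaluates this density:

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z: exp(-(z - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The standard normal density at z = 0 is 1/sqrt(2*pi) ≈ 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))
```
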
Notation for Gaussians in our Problem

Recall in our model,

      y^(i) = θ^T x^(i) + ε^(i)   in which ε^(i) ∼ N(0, σ^2),

or, in more compact notation,

      y^(i) | x^(i); θ ∼ N(θ^T x^(i), σ^2),

or equivalently,

      P(y^(i) | x^(i); θ) = 1/(σ√(2π)) exp( −(y^(i) − θ^T x^(i))^2 / (2σ^2) ).

▶ We condition on x^(i).
▶ In contrast, θ parameterizes or "picks" a distribution.

We use bar (|) versus semicolon (;) notation above.

(Log) Likelihoods!

Intuition: among many distributions, pick the one that agrees with
the data the most (is most "likely").

      L(θ) = p(y | X; θ) = ∏_{i=1}^n p(y^(i) | x^(i); θ)        (iid assumption)

           = ∏_{i=1}^n 1/(σ√(2π)) exp( −(θ^T x^(i) − y^(i))^2 / (2σ^2) )

For convenience, we use the log likelihood ℓ(θ) = log L(θ):

      ℓ(θ) = ∑_{i=1}^n [ log 1/(σ√(2π)) − (θ^T x^(i) − y^(i))^2 / (2σ^2) ]

           = n log 1/(σ√(2π)) − 1/(2σ^2) ∑_{i=1}^n (θ^T x^(i) − y^(i))^2 = C(σ, n) − 1/σ^2 J(θ),

where C(σ, n) = n log 1/(σ√(2π)).
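
The identity above can be checked numerically. Here is a minimal sketch that reuses X, y, theta_true, and sigma from the earlier synthetic-data sketch (the function names are ours): since ℓ(θ) = C(σ, n) − J(θ)/σ², the least-squares solution also maximizes the likelihood.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """l(theta) = n log(1/(sigma sqrt(2 pi))) - J(theta) / sigma^2 for the Gaussian model."""
    n = len(y)
    r = X @ theta - y
    return n * np.log(1.0 / (sigma * np.sqrt(2.0 * np.pi))) - (r @ r) / (2.0 * sigma ** 2)

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) sum_i (theta^T x^(i) - y^(i))^2."""
    r = X @ theta - y
    return 0.5 * (r @ r)

# The least-squares solution minimizes J, so it also maximizes l.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert J(theta_hat, X, y) <= J(theta_true, X, y)
assert log_likelihood(theta_hat, X, y, sigma) >= log_likelihood(theta_true, X, y, sigma)
```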


(Log) Likelihoods!

So we've shown that finding a θ to maximize L(θ) is the same as
maximizing

      ℓ(θ) = C(σ, n) − 1/σ^2 J(θ),

or minimizing J(θ) directly (why?).

Takeaway: "Under the hood," solving least squares is
solving a maximum likelihood problem for a particular
probabilistic model.

This view shows a path to generalize to new situations!

Summary of Least Squares

▶ We introduced the maximum likelihood framework – super
  powerful (next lectures).
▶ We showed that least squares was actually a version of
  maximum likelihood.
▶ We learned some notation that will help us later in the
  course. . .

Classification

Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} let y^(i) ∈ {0, 1}.
Why not use regression, say least squares? A picture . . .

Logistic Regression: Link Functions

Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} let y^(i) ∈ {0, 1}.
Want hθ(x) ∈ [0, 1]. Let's pick a smooth function:

      hθ(x) = g(θ^T x)

Here, g is a link function. There are many. . . but we'll pick one!

      g(z) = 1 / (1 + e^(−z)).

How do we interpret hθ(x)?

      P(y = 1 | x; θ) = hθ(x)
      P(y = 0 | x; θ) = 1 − hθ(x)
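
A minimal sketch of this link and its probabilistic reading; the parameter and input values below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + e^(-z)); maps all of R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])     # hypothetical parameters
x = np.array([1.0, 0.3, 0.8])          # hypothetical input (leading 1 = intercept feature)
print(h(theta, x), 1.0 - h(theta, x))  # P(y = 1 | x; theta) and P(y = 0 | x; theta)
```
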
Logistic Regression: Link Functions

Let's write the likelihood function. Recall:

      P(y = 1 | x; θ) = hθ(x)
      P(y = 0 | x; θ) = 1 − hθ(x)

Then,

      L(θ) = P(y | X; θ) = ∏_{i=1}^n p(y^(i) | x^(i); θ)

           = ∏_{i=1}^n hθ(x^(i))^(y^(i)) (1 − hθ(x^(i)))^(1−y^(i))        (the exponents encode "if-then")

Taking logs to compute the log likelihood ℓ(θ), we have:

      ℓ(θ) = log L(θ) = ∑_{i=1}^n [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
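
A minimal sketch of this log likelihood for a whole training set; the array names are ours, X has the x^(i) as rows, and y holds 0/1 labels.

```python
import numpy as np

def logistic_log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y^(i) log h_theta(x^(i)) + (1 - y^(i)) log(1 - h_theta(x^(i))) ]."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every row of X
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```
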
Now to solve it. . .

      ℓ(θ) = log L(θ) = ∑_{i=1}^n [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

We maximize over θ, but we already saw how to do this! Just
compute the derivative, run (S)GD, and you're done!

Takeaway: This is another example of the maximum likelihood
method: we set up the likelihood, take logs, and compute
derivatives.

Time Permitting: There is magic in the derivative. . .

Even more, the batch update can be written in a remarkably
familiar form:

      θ^(t+1) = θ^(t) + ∑_{j∈B} (y^(j) − hθ(x^(j))) x^(j).

We sketch why (you can check!). We drop superscripts to simplify
notation and examine a single data point:

      y log hθ(x) + (1 − y) log(1 − hθ(x))
        = −y log(1 + e^(−θ^T x)) + (1 − y)(−θ^T x) − (1 − y) log(1 + e^(−θ^T x))
        = −log(1 + e^(−θ^T x)) − (1 − y)(θ^T x)

We used 1 − hθ(x) = e^(−θ^T x) / (1 + e^(−θ^T x)). We now compute the
derivative of this expression wrt θ and get:

      e^(−θ^T x) / (1 + e^(−θ^T x)) · x − (1 − y) x = (y − hθ(x)) x
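
A sketch of one (mini)batch ascent step using exactly this derivative; the learning rate lr and the array names are our own additions.

```python
import numpy as np

def ascent_step(theta, X_batch, y_batch, lr=0.1):
    """theta <- theta + lr * sum_{j in B} (y^(j) - h_theta(x^(j))) x^(j)."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ theta)))   # h_theta(x^(j)) for each row of the batch
    return theta + lr * X_batch.T @ (y_batch - p)
```
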
Summary of Introduction to Classification

▶ We used the principle of maximum likelihood (and a
  probabilistic model) to extend to classification.
▶ We developed logistic regression from this principle.
▶ Logistic regression is widely used today.
▶ We noticed a familiar pattern: take derivatives of the
  likelihood, and the derivatives had this (hopefully) intuitive
  "misprediction form".

Newton's Method

Given f : R^d → R, find x s.t. f(x) = 0.

We apply this with f(θ) = ∇_θ ℓ(θ), the gradient of the log likelihood.

Newton's Method (Drawn in Class)

Given f : R^d → R, find x s.t. f(x) = 0.
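
In place of the picture drawn in class, here is a tiny 1-d illustration of the idea, anticipating the update rule on the next slide; the example function f(x) = x² − 2 and the starting point are our choices.

```python
def newton_1d(f, fprime, x0, steps=6):
    """Iterate x <- x - f(x) / f'(x) to approximate a root of f."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Root of f(x) = x^2 - 2 starting from x0 = 1.0: converges to sqrt(2) ≈ 1.41421356 in a few steps.
print(newton_1d(lambda x: x ** 2 - 2.0, lambda x: 2.0 * x, x0=1.0))
```
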
Newton's Method Summary

Given f : R^d → R, find x s.t. f(x) = 0.

▶ This is the update rule in 1d:

      x^(t+1) = x^(t) − f(x^(t)) / f'(x^(t))

▶ It may converge very fast (quadratic local convergence!)
▶ For the likelihood, i.e., f(θ) = ∇_θ ℓ(θ), we need to generalize
  to a vector-valued function, which has:

      θ^(t+1) = θ^(t) − (H(θ^(t)))^(−1) ∇_θ ℓ(θ^(t)),

  in which H_{i,j}(θ) = ∂_{θ_i} ∂_{θ_j} ℓ(θ).
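
A sketch of this multivariate update for logistic regression, under the assumption (standard, but not spelled out on the slide) that the Hessian is H(θ) = −X^T diag(p(1 − p)) X with p_i = hθ(x^(i)); we solve a linear system rather than forming an explicit inverse.

```python
import numpy as np

def newton_step(theta, X, y):
    """One step theta <- theta - H(theta)^{-1} grad l(theta) for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))     # h_theta(x^(i)) for every row of X
    grad = X.T @ (y - p)                       # gradient of the log likelihood
    H = -(X.T * (p * (1.0 - p))) @ X           # Hessian of the log likelihood
    return theta - np.linalg.solve(H, grad)    # solve H d = grad instead of inverting H

# Hypothetical usage on binary-labeled data X, y:
# theta = np.zeros(X.shape[1])
# for _ in range(10):
#     theta = newton_step(theta, X, y)
```
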
Optimization Method Summary

      Method              Compute per Step      Number of Steps
      SGD
      Minibatch SGD
      GD
      Newton

▶ In classical stats, d is small (< 100), n is often small, and
  exact parameters matter.
▶ In modern ML, d is huge (billions, trillions), n is huge
  (trillions), and parameters are used only for prediction.
▶ As a result, (minibatch) SGD is the workhorse of ML.

Classification Lecture Summary

▶ We saw the differences between classification and regression.
▶ We learned a principle that gives a probabilistic interpretation of
  linear regression and classification: maximum likelihood.
▶ We used this to derive logistic regression.
▶ The maximum likelihood principle will be used again next
  lecture (and in the future).
▶ We saw Newton's method, which is classically used to fit models
  (more statistics than ML – it's not used in most modern ML).
