
CS 229 Lecture Three

Supervised Learning: Classification

Chris Ré

April 2, 2023
Supervised Learning and Classification

▶ Linear Regression via a Probabilistic Interpretation
▶ Logistic Regression
▶ Optimization Method: Newton's Method

We'll learn the maximum likelihood method (a probabilistic
interpretation) to generalize from linear regression to more
sophisticated models.

A Justification for Least Squares?

▶ Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} in which
  x^(i) ∈ R^(d+1) and y^(i) ∈ R.
▶ Do: find θ ∈ R^(d+1) s.t. θ = argmin_θ ∑_{i=1}^n (hθ(x^(i)) − y^(i))^2, in
  which hθ(x) = θ^T x.

Where did this model come from?

One way to view it is via a probabilistic interpretation (helpful
throughout the course).

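For reference, here is a minimal NumPy sketch of this hypothesis and objective; the function and array names are our own, and X stacks the inputs x^(i) as rows.

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def least_squares_objective(theta, X, y):
    """sum_i (h_theta(x^(i)) - y^(i))^2, with the x^(i) as rows of X."""
    return np.sum((X @ theta - y) ** 2)
```
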
A Justification for Least Squares?

We make an assumption (common in statistics) that the data are
generated according to some model (that may contain random
choices). That is,

      y^(i) = θ^T x^(i) + ε^(i).

Here, ε^(i) is a random variable that captures "noise", that is,
unmodeled effects, measurement errors, etc.

Please keep in mind: this is just a model! As they say, all
models are wrong but some models are useful. This model
has been shockingly useful.

What do we expect of the noise?

What properties should we expect from ε^(i) in

      y^(i) = θ^T x^(i) + ε^(i)?

Again, it's a model and ε^(i) is a random variable:

▶ E[ε^(i)] = 0 – the noise is unbiased.
▶ The errors for different points are independent and identically
  distributed (called iid):

      E[ε^(i) ε^(j)] = E[ε^(i)] E[ε^(j)] for i ≠ j,

  and

      E[(ε^(i))^2] = σ^2.

Here σ^2 is some measure of how noisy the data are. Turns out,
this effectively defines the Gaussian or Normal distribution.

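Putting these assumptions together, here is a minimal NumPy sketch that samples a synthetic training set from y^(i) = θ^T x^(i) + ε^(i) with Gaussian noise; the sizes, the "true" θ, and σ are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                        # number of examples and input dimension (illustrative)
theta_true = rng.normal(size=d + 1)  # hypothetical "true" parameters (intercept + d weights)
sigma = 0.5                          # noise standard deviation (illustrative)

# Each x^(i) lives in R^(d+1); the leading 1 is the intercept feature.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
eps = rng.normal(0.0, sigma, size=n)     # eps^(i) ~ N(0, sigma^2)
y = X @ theta_true + eps                 # y^(i) = theta^T x^(i) + eps^(i)
```
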
Notation for the Gaussian

We write z ∼ N(µ, σ^2) and read these symbols as

      z is distributed as a normal with mean µ and variance σ^2,

or equivalently,

      P(z) = 1/(σ√(2π)) exp( −(z − µ)^2 / (2σ^2) ).

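As a quick sanity check on the notation, a small sketch (the function name is ours) that evaluates this density:

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z: exp(-(z - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The standard normal density at z = 0 is 1/sqrt(2*pi) ≈ 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))
```
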
Notation for Gaussians in our Problem

Recall in our model,

      y^(i) = θ^T x^(i) + ε^(i)   in which ε^(i) ∼ N(0, σ^2),

or, in more compact notation,

      y^(i) | x^(i); θ ∼ N(θ^T x^(i), σ^2),

or equivalently,

      P(y^(i) | x^(i); θ) = 1/(σ√(2π)) exp( −(y^(i) − θ^T x^(i))^2 / (2σ^2) ).

▶ We condition on x^(i).
▶ In contrast, θ parameterizes or "picks" a distribution.

We use bar (|) versus semicolon (;) notation above.

(Log) Likelihoods!

Intuition: among many distributions, pick the one that agrees with
the data the most (is most "likely").

      L(θ) = p(y | X; θ) = ∏_{i=1}^n p(y^(i) | x^(i); θ)        (iid assumption)

           = ∏_{i=1}^n 1/(σ√(2π)) exp( −(θ^T x^(i) − y^(i))^2 / (2σ^2) )

For convenience, we use the log likelihood ℓ(θ) = log L(θ):

      ℓ(θ) = ∑_{i=1}^n [ log 1/(σ√(2π)) − (θ^T x^(i) − y^(i))^2 / (2σ^2) ]

           = n log 1/(σ√(2π)) − 1/(2σ^2) ∑_{i=1}^n (θ^T x^(i) − y^(i))^2 = C(σ, n) − 1/σ^2 J(θ),

where C(σ, n) = n log 1/(σ√(2π)).
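
The identity above can be checked numerically. Here is a minimal sketch that reuses X, y, theta_true, and sigma from the earlier synthetic-data sketch (the function names are ours): since ℓ(θ) = C(σ, n) − J(θ)/σ², the least-squares solution also maximizes the likelihood.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """l(theta) = n log(1/(sigma sqrt(2 pi))) - J(theta) / sigma^2 for the Gaussian model."""
    n = len(y)
    r = X @ theta - y
    return n * np.log(1.0 / (sigma * np.sqrt(2.0 * np.pi))) - (r @ r) / (2.0 * sigma ** 2)

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) sum_i (theta^T x^(i) - y^(i))^2."""
    r = X @ theta - y
    return 0.5 * (r @ r)

# The least-squares solution minimizes J, so it also maximizes l.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert J(theta_hat, X, y) <= J(theta_true, X, y)
assert log_likelihood(theta_hat, X, y, sigma) >= log_likelihood(theta_true, X, y, sigma)
```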


(Log) Likelihoods!

So we've shown that finding a θ to maximize L(θ) is the same as
maximizing

      ℓ(θ) = C(σ, n) − 1/σ^2 J(θ),

or minimizing J(θ) directly (why?).

Takeaway: "Under the hood," solving least squares is
solving a maximum likelihood problem for a particular
probabilistic model.

This view shows a path to generalize to new situations!

Summary of Least Squares

▶ We introduced the maximum likelihood framework – super
  powerful (next lectures).
▶ We showed that least squares was actually a version of
  maximum likelihood.
▶ We learned some notation that will help us later in the
  course. . .

Classification

Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} let y^(i) ∈ {0, 1}.
Why not use regression, say least squares? A picture . . .

Logistic Regression: Link Functions

Given a training set {(x^(i), y^(i)) for i = 1, . . . , n} let y^(i) ∈ {0, 1}.
Want hθ(x) ∈ [0, 1]. Let's pick a smooth function:

      hθ(x) = g(θ^T x)

Here, g is a link function. There are many. . . but we'll pick one!

      g(z) = 1 / (1 + e^(−z)).

How do we interpret hθ(x)?

      P(y = 1 | x; θ) = hθ(x)
      P(y = 0 | x; θ) = 1 − hθ(x)
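
A minimal sketch of this link and its probabilistic reading; the parameter and input values below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + e^(-z)); maps all of R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])     # hypothetical parameters
x = np.array([1.0, 0.3, 0.8])          # hypothetical input (leading 1 = intercept feature)
print(h(theta, x), 1.0 - h(theta, x))  # P(y = 1 | x; theta) and P(y = 0 | x; theta)
```
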
Logistic Regression: Link Functions

Let's write the likelihood function. Recall:

      P(y = 1 | x; θ) = hθ(x)
      P(y = 0 | x; θ) = 1 − hθ(x)

Then,

      L(θ) = P(y | X; θ) = ∏_{i=1}^n p(y^(i) | x^(i); θ)

           = ∏_{i=1}^n hθ(x^(i))^(y^(i)) (1 − hθ(x^(i)))^(1−y^(i))        (the exponents encode "if-then")

Taking logs to compute the log likelihood ℓ(θ), we have:

      ℓ(θ) = log L(θ) = ∑_{i=1}^n [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
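
A minimal sketch of this log likelihood for a whole training set; the array names are ours, X has the x^(i) as rows, and y holds 0/1 labels.

```python
import numpy as np

def logistic_log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y^(i) log h_theta(x^(i)) + (1 - y^(i)) log(1 - h_theta(x^(i))) ]."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every row of X
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```
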
Now to solve it. . .

      ℓ(θ) = log L(θ) = ∑_{i=1}^n [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

We maximize over θ, but we already saw how to do this! Just
compute the derivative, run (S)GD, and you're done!

Takeaway: This is another example of the maximum likelihood
method: we set up the likelihood, take logs, and compute
derivatives.

Time Permitting: There is magic in the derivative. . .

Even more, the batch update can be written in a remarkably
familiar form:

      θ^(t+1) = θ^(t) + ∑_{j∈B} (y^(j) − hθ(x^(j))) x^(j).

We sketch why (you can check!). We drop superscripts to simplify
notation and examine a single data point:

      y log hθ(x) + (1 − y) log(1 − hθ(x))
        = −y log(1 + e^(−θ^T x)) + (1 − y)(−θ^T x) − (1 − y) log(1 + e^(−θ^T x))
        = −log(1 + e^(−θ^T x)) − (1 − y)(θ^T x)

We used 1 − hθ(x) = e^(−θ^T x) / (1 + e^(−θ^T x)). We now compute the
derivative of this expression wrt θ and get:

      e^(−θ^T x) / (1 + e^(−θ^T x)) · x − (1 − y) x = (y − hθ(x)) x
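
A sketch of one (mini)batch ascent step using exactly this derivative; the learning rate lr and the array names are our own additions.

```python
import numpy as np

def ascent_step(theta, X_batch, y_batch, lr=0.1):
    """theta <- theta + lr * sum_{j in B} (y^(j) - h_theta(x^(j))) x^(j)."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ theta)))   # h_theta(x^(j)) for each row of the batch
    return theta + lr * X_batch.T @ (y_batch - p)
```
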
Summary of Introduction to Classification

▶ We used the principle of maximum likelihood (and a
  probabilistic model) to extend to classification.
▶ We developed logistic regression from this principle.
▶ Logistic regression is widely used today.
▶ We noticed a familiar pattern: take derivatives of the
  likelihood, and the derivatives had this (hopefully) intuitive
  "misprediction form".

Newton's Method

Given f : R^d → R, find x s.t. f(x) = 0.

We apply this with f(θ) = ∇_θ ℓ(θ), the gradient of the log likelihood.

Newton's Method (Drawn in Class)

Given f : R^d → R, find x s.t. f(x) = 0.
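
In place of the picture drawn in class, here is a tiny 1-d illustration of the idea, anticipating the update rule on the next slide; the example function f(x) = x² − 2 and the starting point are our choices.

```python
def newton_1d(f, fprime, x0, steps=6):
    """Iterate x <- x - f(x) / f'(x) to approximate a root of f."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Root of f(x) = x^2 - 2 starting from x0 = 1.0: converges to sqrt(2) ≈ 1.41421356 in a few steps.
print(newton_1d(lambda x: x ** 2 - 2.0, lambda x: 2.0 * x, x0=1.0))
```
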
Newton's Method Summary

Given f : R^d → R, find x s.t. f(x) = 0.

▶ This is the update rule in 1d:

      x^(t+1) = x^(t) − f(x^(t)) / f'(x^(t))

▶ It may converge very fast (quadratic local convergence!)
▶ For the likelihood, i.e., f(θ) = ∇_θ ℓ(θ), we need to generalize
  to a vector-valued function, which has:

      θ^(t+1) = θ^(t) − (H(θ^(t)))^(−1) ∇_θ ℓ(θ^(t)),

  in which H_{i,j}(θ) = ∂_{θ_i} ∂_{θ_j} ℓ(θ).
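
A sketch of this multivariate update for logistic regression, under the assumption (standard, but not spelled out on the slide) that the Hessian is H(θ) = −X^T diag(p(1 − p)) X with p_i = hθ(x^(i)); we solve a linear system rather than forming an explicit inverse.

```python
import numpy as np

def newton_step(theta, X, y):
    """One step theta <- theta - H(theta)^{-1} grad l(theta) for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))     # h_theta(x^(i)) for every row of X
    grad = X.T @ (y - p)                       # gradient of the log likelihood
    H = -(X.T * (p * (1.0 - p))) @ X           # Hessian of the log likelihood
    return theta - np.linalg.solve(H, grad)    # solve H d = grad instead of inverting H

# Hypothetical usage on binary-labeled data X, y:
# theta = np.zeros(X.shape[1])
# for _ in range(10):
#     theta = newton_step(theta, X, y)
```
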
Optimization Method Summary

      Method              Compute per Step      Number of Steps
      SGD
      Minibatch SGD
      GD
      Newton

▶ In classical stats, d is small (< 100), n is often small, and
  exact parameters matter.
▶ In modern ML, d is huge (billions, trillions), n is huge
  (trillions), and parameters are used only for prediction.
▶ As a result, (minibatch) SGD is the workhorse of ML.

Classification Lecture Summary

▶ We saw the differences between classification and regression.
▶ We learned a principle that gives a probabilistic interpretation of
  linear regression and classification: maximum likelihood.
▶ We used this to derive logistic regression.
▶ The maximum likelihood principle will be used again next
  lecture (and in the future).
▶ We saw Newton's method, which is classically used to fit models
  (more statistics than ML – it's not used in most modern ML).
