
Pattern Recognition — Theory Exercises

Logistic Regression
Siming Bayer, Paul Stoewer
Pattern Recognition Lab (CS 5)
Winter Semester 2022/23
Last Exercise ...

Bayes Classifier

• Take the class with the highest posterior probability p(y | x)

+ Optimal decision when using the 0/1 loss function

– We need to know all posterior probabilities

• In practice the true posteriors are unknown, so models are needed to estimate them

Generative vs Discriminative Models

• Generative models: model the priors p(y) and the class-conditional densities p(x | y)

• Discriminative models: model the posteriors p(y | x) directly

Logistic Regression

• Discriminative model using the logistic (sigmoid) function

Classification vs Regression

Logistic Regression

a) How can posterior probabilities be modeled?

How can posterior probabilities be modeled?

For two classes y ∈ {0, 1} we get:

    p(y = 0 | x) = \frac{p(y = 0) \, p(x | y = 0)}{p(x)}

                 = \frac{p(y = 0) \, p(x | y = 0)}{p(y = 0) \, p(x | y = 0) + p(y = 1) \, p(x | y = 1)}

                 = \frac{1}{1 + \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}

Posteriors and the Logistic Function (cont.)

    p(y = 0 | x) = \frac{1}{1 + \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}

(Trick: extend with exponential and logarithm)

                 = \frac{1}{1 + e^{\log \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}}

                 = \frac{1}{1 + e^{-\log \frac{p(y = 0 | x)}{p(y = 1 | x)}}}

How can posterior probabilities be modeled?
We see that the posterior for class y = 0 can be written in terms of a logistic function:

    p(y = 0 | x) = \frac{1}{1 + e^{-F(x)}}

And thus the posterior for the other class y = 1:

    p(y = 1 | x) = 1 - p(y = 0 | x) = \frac{e^{-F(x)}}{1 + e^{-F(x)}} = \frac{1}{1 + e^{F(x)}}

Logistic Regression

a) How can posterior probabilities be modeled?

Answer:
• p(y = 0 | x) = g(F(x)) can be modeled in terms of a function F(x) and the logistic function g
• F(x) > 0 ⇒ decide y = 0
• F(x) ≤ 0 ⇒ decide y = 1

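As a quick illustration (not part of the original exercise): a minimal NumPy sketch of this mapping from a discriminant function F(x) to a posterior via the logistic function. The linear form F(x) = θ^T x, the function names and the parameter values are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def posterior_y0(x, theta):
    # p(y = 0 | x) = g(F(x)); F(x) = theta^T x is an illustrative choice
    return sigmoid(theta @ x)

theta = np.array([1.0, -2.0, 0.5])   # hypothetical parameters
x = np.array([0.3, 0.1, 1.0])        # feature vector (last entry acts as bias)
p0 = posterior_y0(x, theta)
decision = 0 if p0 > 0.5 else 1      # F(x) > 0  <=>  p(y = 0 | x) > 0.5
print(f"p(y=0|x) = {p0:.3f} -> decide y = {decision}")
```
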
Logistic Regression

b) What is the shape of the decision boundary for Gaussian distributed class-conditional densities?
Does it change if their covariance matrices are identical?

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example
Let us assume both classes have normally distributed d-dimensional feature vectors:

    p(x | y) = \frac{1}{\sqrt{\det(2 \pi \Sigma_y)}} \, e^{-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}

Then we can write the posterior of y = 0 in terms of a logistic function:

    p(y = 0 | x) = \frac{1}{1 + e^{-F(x)}} = \frac{1}{1 + e^{-(x^T A x + \alpha^T x + \alpha_0)}}

with

    F(x) = \log \frac{p(y = 0 | x)}{p(y = 1 | x)} = \log \frac{p(y = 0) \, p(x | y = 0)}{p(y = 1) \, p(x | y = 1)}

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.

    F(x) = \log \frac{p(y = 0)}{p(y = 1)} + \log \frac{\frac{1}{\sqrt{\det(2 \pi \Sigma_0)}} e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}

This function has the constant component:

    c = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)}

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.
Furthermore we have:

    \log \frac{e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
      = \frac{1}{2} \left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right]
      = \frac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_0^{-1}) x - 2 (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1}) x + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.
Now we have:

    A = \frac{1}{2} (\Sigma_1^{-1} - \Sigma_0^{-1})

    \alpha^T = \mu_0^T \Sigma_0^{-1} - \mu_1^T \Sigma_1^{-1}

    \alpha_0 = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \left[ \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)} + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right]

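To make the derivation tangible, a small NumPy sketch (not from the slides; the means, covariances and priors below are made-up example values) that computes A, α and α_0 from given Gaussian parameters and evaluates the discriminant F(x) = x^T A x + α^T x + α_0:

```python
import numpy as np

def boundary_params(mu0, Sigma0, p0, mu1, Sigma1, p1):
    # Parameters of the quadratic discriminant F(x) = x^T A x + alpha^T x + alpha_0
    S0_inv, S1_inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    A = 0.5 * (S1_inv - S0_inv)
    alpha = S0_inv @ mu0 - S1_inv @ mu1
    alpha0 = (np.log(p0 / p1)
              + 0.5 * (np.log(np.linalg.det(2 * np.pi * Sigma1)
                              / np.linalg.det(2 * np.pi * Sigma0))
                       + mu1 @ S1_inv @ mu1 - mu0 @ S0_inv @ mu0))
    return A, alpha, alpha0

def F(x, A, alpha, alpha0):
    return x @ A @ x + alpha @ x + alpha0

# Made-up example: two 2-D Gaussians with different covariances
mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu1, Sigma1 = np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
A, alpha, alpha0 = boundary_params(mu0, Sigma0, 0.5, mu1, Sigma1, 0.5)
x = np.array([1.0, 0.5])
print("decide y =", 0 if F(x, A, alpha, alpha0) > 0 else 1)   # F(x) > 0 => class 0
```

Since Σ_0 ≠ Σ_1 in this toy setup, A ≠ 0 and the level set F(x) = 0 is a quadratic curve.
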
What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary, p(y = 0) = 0.5, p(y = 1) = 0.5]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary, p(y = 0) = 0.8, p(y = 1) = 0.2]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary on a larger axis range, p(y = 0) = 0.5, p(y = 1) = 0.5]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

The decision boundary F(x) = 0 is a quadratic polynomial in the 2 variables x_1 and x_2:

    F(x) = x^T A x + \alpha^T x + \alpha_0 = a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + e x_2 + f \overset{!}{=} 0

[Figure: the possible conic sections — (a) circles and ellipses, (b) parabolas, (c) hyperbolas]
Pbroks13, CC BY 3.0, via Wikimedia Commons

Does it change if their covariance matrices are identical?

Example cont.
If both classes share the same covariance matrix, i.e. Σ = Σ_0 = Σ_1, then the argument of the sigmoid function is linear in the components of x:

    A = 0

    \alpha^T = (\mu_0 - \mu_1)^T \Sigma^{-1}

    \alpha_0 = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)

Does it change if their covariance matrices are identical?

[Figure: two Gaussian sample sets with p(y = 0) = 0.5, p(y = 1) = 0.5 — identical covariances lead to a linear decision boundary]

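A brief continuation of the earlier sketch (again with made-up numbers and the same conventions): with a shared covariance matrix the quadratic term vanishes, so the boundary F(x) = α^T x + α_0 = 0 is a hyperplane.

```python
import numpy as np

# Shared covariance: Sigma0 = Sigma1 = Sigma  =>  A = 0.5 * (Sigma^-1 - Sigma^-1) = 0
Sigma = np.array([[1.0, 0.2], [0.2, 0.8]])
S_inv = np.linalg.inv(Sigma)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p0, p1 = 0.5, 0.5

alpha = S_inv @ (mu0 - mu1)                                    # linear term
alpha0 = np.log(p0 / p1) + 0.5 * (mu1 + mu0) @ S_inv @ (mu1 - mu0)

x = np.array([1.0, 0.5])
F = alpha @ x + alpha0                                         # linear in x
print("decide y =", 0 if F > 0 else 1)
```
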
Logistic Regression

c) How does a change of prior affect the decision boundary?

How does a change of prior affect the decision boundary?

Example cont.

    F(x) = \log \frac{p(y = 0)}{p(y = 1)} + \log \frac{\frac{1}{\sqrt{\det(2 \pi \Sigma_0)}} e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}

This function has the constant component:

    c = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)}

We observe:
• Priors imply a constant offset of the decision boundary.
• If the priors and the covariance matrices of both classes are identical, this offset is c = 0.

Logistic Regression

d) For Gaussian distributed class-conditional densities, is it easier to calculate the decision boundary directly or to estimate the parameters of the distributions?

Answer:
• Estimating the Gaussians: p(y = 0), µ0, µ1, Σ0, Σ1
• Estimating the decision boundary: A, α, α0

⇒ Estimating the decision boundary directly is easier (fewer parameters to estimate)!

Logistic Regression

e) How can we obtain a decision boundary that is non-linear in the components of the feature vector?

Answer:
• We get a non-linear decision boundary when the covariance matrices of the involved Gaussians are
not identical

Logistic Regression

f) What is the underlying assumption of optimizing the log-likelihood?

Log-Likelihood Function

• Let us assume the posteriors are given by

    p(y = 0 | x) = 1 - g(\theta^T x)
    p(y = 1 | x) = g(\theta^T x)

where g(θ^T x) is the sigmoid function parameterized in θ.

• The parameter vector θ has to be estimated from a set S of m training samples:

    S = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_m, y_m)\}

• Method of choice: Maximum Likelihood Estimation

Log-Likelihood Function (cont.)

Before we work on the formulas of the ML-estimator, we rewrite the posteriors using Bernoulli probability:

    p(y | x) = g(\theta^T x)^y \, (1 - g(\theta^T x))^{1 - y}

which shows the great benefit of the chosen notation for class numbers.

Log-Likelihood Function (cont.)

Now we can compute the log-likelihood function (assuming that the training samples are mutually independent):

    L(\theta) = \log \prod_{i=1}^{m} p(y_i | x_i)
              = \sum_{i=1}^{m} \log \left[ g(\theta^T x_i)^{y_i} \, (1 - g(\theta^T x_i))^{1 - y_i} \right]
              = \sum_{i=1}^{m} \left[ y_i \log g(\theta^T x_i) + (1 - y_i) \log (1 - g(\theta^T x_i)) \right]

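A minimal NumPy sketch of this log-likelihood (function and variable names are assumptions; the clipping of g is an added numerical safeguard, not part of the formula on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # L(theta) = sum_i [ y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i)) ]
    g = sigmoid(X @ theta)
    g = np.clip(g, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return np.sum(y * np.log(g) + (1.0 - y) * np.log(1.0 - g))

# Toy data: rows of X are feature vectors x_i (last column = bias), labels y_i in {0, 1}
X = np.array([[0.5, 1.0], [1.5, 1.0], [-0.3, 1.0]])
y = np.array([1, 1, 0])
theta = np.zeros(2)
print(log_likelihood(theta, X, y))       # equals 3 * log(0.5) for theta = 0
```
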
Logistic Regression

g) Why do we optimize the log-likelihood and not simply the likelihood?

• The logarithm is strictly monotonic, so maximizing the log-likelihood yields the same solution as maximizing the likelihood

• Addition is easier to handle than multiplication

• The likelihood is a product of many small probabilities; the logarithm transforms it to a reasonable number range and avoids numerical underflow

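A tiny numerical illustration of the last point (the probabilities are made up): the raw likelihood product underflows to zero in double precision, while the sum of logarithms stays in a perfectly usable range.

```python
import numpy as np

probs = np.full(1000, 1e-4)      # 1000 likelihood factors, each a small probability
print(np.prod(probs))            # 0.0 -> the product underflows in double precision
print(np.sum(np.log(probs)))     # about -9210.3 -> well-behaved for optimization
```
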
Logistic Regression

h) What important property with respect to optimization does the log-likelihood function fulfill?

What important property with respect to optimization does the log-likelihood function fulfill?

Notes for the expert:

• The negative of the log-likelihood function is the cross entropy of y and g(θ^T x).

• The negative of the log-likelihood function is a convex function.

Convex functions are easier to optimize:

• the function is just a single “hole” (one basin, no further local minima)

• gradient descent brings you to the minimum

Logistic Regression

i) What method did we use to maximize the log-likelihood?

What method did we use to maximize the log-likelihood?

• The log-likelihood function is concave.

• We use the Newton-Raphson algorithm to solve the unconstrained optimization problem:


For the (k + 1)-st iteration step, we get:

    \theta^{(k+1)} = \theta^{(k)} - \left[ \frac{\partial^2 L(\theta^{(k)})}{\partial \theta \, \partial \theta^T} \right]^{-1} \frac{\partial L(\theta^{(k)})}{\partial \theta}

Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.

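A compact sketch of such a Newton-Raphson update for logistic regression (illustrative function names and toy data; it uses the standard gradient X^T(y - g) and Hessian -X^T W X with W = diag(g_i (1 - g_i)) of the log-likelihood above, which is where the weighted least squares interpretation comes from):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_logreg(X, y, n_iter=10):
    # Maximize L(theta): theta <- theta - H^{-1} grad L  (H is the Hessian of L)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = sigmoid(X @ theta)            # g(theta^T x_i) for all samples
        grad = X.T @ (y - g)              # gradient of the log-likelihood
        W = np.diag(g * (1.0 - g))        # weights of the equivalent least-squares problem
        H = -X.T @ W @ X                  # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Toy usage (made-up, non-separable 1-D data; last column is the bias)
X = np.array([[0.2, 1.0], [1.5, 1.0], [0.8, 1.0], [2.5, 1.0]])
y = np.array([0, 0, 1, 1])
print(newton_raphson_logreg(X, y))
```
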
Logistic Regression

j) Could we also use a different method? What effect does the choice of optimization method have on
the convergence speed of the algorithm?

Answer:
• Alternative method: gradient descent (on the negative log-likelihood) without step-size control

• The Newton method uses second-order information as an implicit step-size control ⇒ fewer steps are necessary for convergence

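For comparison, a sketch of the first-order alternative mentioned above: plain gradient ascent on L(θ) with a fixed step size (the learning rate and iteration count are arbitrary assumptions). It typically needs far more iterations than the Newton-Raphson update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_logreg(X, y, lr=0.1, n_iter=5000):
    # theta <- theta + lr * grad L(theta), with a fixed (uncontrolled) step size lr
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
        theta = theta + lr * grad
    return theta

# Same toy data as in the Newton-Raphson sketch above
X = np.array([[0.2, 1.0], [1.5, 1.0], [0.8, 1.0], [2.5, 1.0]])
y = np.array([0, 0, 1, 1])
print(gradient_ascent_logreg(X, y))
```
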
