
Pattern Recognition — Theory Exercises

Logistic Regression
Siming Bayer, Paul Stoewer
Pattern Recognition Lab (CS 5)
Winter Semester 2022/23
Last Exercise ...

Bayes Classifier

• Take the class with the highest posterior probability p(y | x)

+ Optimal decision when using the 0/1 loss function

– We need to know all posterior probabilities

• In practice the true posteriors are unknown, so models are needed to estimate them

Generative vs Discriminative Models

• Generative models: model the priors p(y) and the class-conditional densities p(x | y)

• Discriminative models: model the posteriors p(y | x) directly

Logistic Regression

• Discriminative model using the logistic (sigmoid) function

Classification vs Regression

Logistic Regression

a) How can posterior probabilities be modeled?

How can posterior probabilities be modeled?

For two classes y ∈ {0, 1} we get:

    p(y = 0 | x) = \frac{p(y = 0) \, p(x | y = 0)}{p(x)}

                 = \frac{p(y = 0) \, p(x | y = 0)}{p(y = 0) \, p(x | y = 0) + p(y = 1) \, p(x | y = 1)}

                 = \frac{1}{1 + \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}

Posteriors and the Logistic Function (cont.)

    p(y = 0 | x) = \frac{1}{1 + \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}

(Trick: extend with exponential and logarithm)

                 = \frac{1}{1 + e^{\log \frac{p(y = 1) \, p(x | y = 1)}{p(y = 0) \, p(x | y = 0)}}}

                 = \frac{1}{1 + e^{-\log \frac{p(y = 0 | x)}{p(y = 1 | x)}}}

How can posterior probabilities be modeled?
We see that the posterior for class y = 0 can be written in terms of a logistic function:

    p(y = 0 | x) = \frac{1}{1 + e^{-F(x)}}

And thus the posterior for the other class y = 1:

    p(y = 1 | x) = 1 - p(y = 0 | x) = \frac{e^{-F(x)}}{1 + e^{-F(x)}} = \frac{1}{1 + e^{F(x)}}

Logistic Regression

a) How can posterior probabilities be modeled?

Answer:
• p(y = 0 | x) = g(F(x)) can be modeled in terms of a function F(x) and the logistic function g
• F(x) > 0 ⇒ decide y = 0
• F(x) ≤ 0 ⇒ decide y = 1

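As a quick illustration (not part of the original exercise): a minimal NumPy sketch of this mapping from a discriminant function F(x) to a posterior via the logistic function. The linear form F(x) = θ^T x, the function names and the parameter values are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def posterior_y0(x, theta):
    # p(y = 0 | x) = g(F(x)); F(x) = theta^T x is an illustrative choice
    return sigmoid(theta @ x)

theta = np.array([1.0, -2.0, 0.5])   # hypothetical parameters
x = np.array([0.3, 0.1, 1.0])        # feature vector (last entry acts as bias)
p0 = posterior_y0(x, theta)
decision = 0 if p0 > 0.5 else 1      # F(x) > 0  <=>  p(y = 0 | x) > 0.5
print(f"p(y=0|x) = {p0:.3f} -> decide y = {decision}")
```
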
Logistic Regression

b) What is the shape of the decision boundary for Gaussian distributed class-conditional densities?
Does it change if their covariance matrices are identical?

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example
Let us assume both classes have normally distributed d-dimensional feature vectors:

    p(x | y) = \frac{1}{\sqrt{\det(2 \pi \Sigma_y)}} \, e^{-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}

Then we can write the posterior of y = 0 in terms of a logistic function:

    p(y = 0 | x) = \frac{1}{1 + e^{-F(x)}} = \frac{1}{1 + e^{-(x^T A x + \alpha^T x + \alpha_0)}}

with

    F(x) = \log \frac{p(y = 0 | x)}{p(y = 1 | x)} = \log \frac{p(y = 0) \, p(x | y = 0)}{p(y = 1) \, p(x | y = 1)}

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.

    F(x) = \log \frac{p(y = 0)}{p(y = 1)} + \log \frac{\frac{1}{\sqrt{\det(2 \pi \Sigma_0)}} e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}

This function has the constant component:

    c = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)}

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.
Furthermore we have:

    \log \frac{e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
      = \frac{1}{2} \left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right]
      = \frac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_0^{-1}) x - 2 (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1}) x + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

Example cont.
Now we have:

    A = \frac{1}{2} (\Sigma_1^{-1} - \Sigma_0^{-1})

    \alpha^T = \mu_0^T \Sigma_0^{-1} - \mu_1^T \Sigma_1^{-1}

    \alpha_0 = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \left[ \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)} + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right]

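To make the derivation tangible, a small NumPy sketch (not from the slides; the means, covariances and priors below are made-up example values) that computes A, α and α_0 from given Gaussian parameters and evaluates the discriminant F(x) = x^T A x + α^T x + α_0:

```python
import numpy as np

def boundary_params(mu0, Sigma0, p0, mu1, Sigma1, p1):
    # Parameters of the quadratic discriminant F(x) = x^T A x + alpha^T x + alpha_0
    S0_inv, S1_inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    A = 0.5 * (S1_inv - S0_inv)
    alpha = S0_inv @ mu0 - S1_inv @ mu1
    alpha0 = (np.log(p0 / p1)
              + 0.5 * (np.log(np.linalg.det(2 * np.pi * Sigma1)
                              / np.linalg.det(2 * np.pi * Sigma0))
                       + mu1 @ S1_inv @ mu1 - mu0 @ S0_inv @ mu0))
    return A, alpha, alpha0

def F(x, A, alpha, alpha0):
    return x @ A @ x + alpha @ x + alpha0

# Made-up example: two 2-D Gaussians with different covariances
mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu1, Sigma1 = np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
A, alpha, alpha0 = boundary_params(mu0, Sigma0, 0.5, mu1, Sigma1, 0.5)
x = np.array([1.0, 0.5])
print("decide y =", 0 if F(x, A, alpha, alpha0) > 0 else 1)   # F(x) > 0 => class 0
```

Since Σ_0 ≠ Σ_1 in this toy setup, A ≠ 0 and the level set F(x) = 0 is a quadratic curve.
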
What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary, p(y = 0) = 0.5, p(y = 1) = 0.5]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary, p(y = 0) = 0.8, p(y = 1) = 0.2]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

[Figure: two Gaussian sample sets and the decision boundary on a larger axis range, p(y = 0) = 0.5, p(y = 1) = 0.5]

What is the shape of the decision boundary for Gaussian distributed class-conditional densities?

The decision boundary F(x) = 0 is a quadratic polynomial in the 2 variables x_1 and x_2:

    F(x) = x^T A x + \alpha^T x + \alpha_0 = a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + e x_2 + f \overset{!}{=} 0

[Figure: the possible conic sections — (a) circles and ellipses, (b) parabolas, (c) hyperbolas]
Pbroks13, CC BY 3.0, via Wikimedia Commons

Does it change if their covariance matrices are identical?

Example cont.
If both classes share the same covariance matrix, i.e. Σ = Σ_0 = Σ_1, then the argument of the sigmoid function is linear in the components of x:

    A = 0

    \alpha^T = (\mu_0 - \mu_1)^T \Sigma^{-1}

    \alpha_0 = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)

Does it change if their covariance matrices are identical?

[Figure: two Gaussian sample sets with p(y = 0) = 0.5, p(y = 1) = 0.5 — identical covariances lead to a linear decision boundary]

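A brief continuation of the earlier sketch (again with made-up numbers and the same conventions): with a shared covariance matrix the quadratic term vanishes, so the boundary F(x) = α^T x + α_0 = 0 is a hyperplane.

```python
import numpy as np

# Shared covariance: Sigma0 = Sigma1 = Sigma  =>  A = 0.5 * (Sigma^-1 - Sigma^-1) = 0
Sigma = np.array([[1.0, 0.2], [0.2, 0.8]])
S_inv = np.linalg.inv(Sigma)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p0, p1 = 0.5, 0.5

alpha = S_inv @ (mu0 - mu1)                                    # linear term
alpha0 = np.log(p0 / p1) + 0.5 * (mu1 + mu0) @ S_inv @ (mu1 - mu0)

x = np.array([1.0, 0.5])
F = alpha @ x + alpha0                                         # linear in x
print("decide y =", 0 if F > 0 else 1)
```
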
Logistic Regression

c) How does a change of prior affect the decision boundary?

How does a change of prior affect the decision boundary?

Example cont.

    F(x) = \log \frac{p(y = 0)}{p(y = 1)} + \log \frac{\frac{1}{\sqrt{\det(2 \pi \Sigma_0)}} e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}

This function has the constant component:

    c = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \log \frac{\det(2 \pi \Sigma_1)}{\det(2 \pi \Sigma_0)}

We observe:
• Priors imply a constant offset of the decision boundary.
• If the priors and the covariance matrices of both classes are identical, this offset is c = 0.

Logistic Regression

d) For Gaussian distributed class-conditional densities, is it easier to calculate the decision boundary directly or to estimate the parameters of the distributions?

Answer:
• Estimating the Gaussians: p(y = 0), µ0, µ1, Σ0, Σ1
• Estimating the decision boundary: A, α, α0

⇒ Estimating the decision boundary directly is easier (fewer parameters to estimate)!

Logistic Regression

e) How can we obtain a decision boundary that is non-linear in the components of the feature vector?

Answer:
• We get a non-linear decision boundary when the covariance matrices of the involved Gaussians are
not identical

Logistic Regression

f) What is the underlying assumption of optimizing the log-likelihood?

Log-Likelihood Function

• Let us assume the posteriors are given by

    p(y = 0 | x) = 1 - g(\theta^T x)
    p(y = 1 | x) = g(\theta^T x)

where g(θ^T x) is the sigmoid function parameterized in θ.

• The parameter vector θ has to be estimated from a set S of m training samples:

    S = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_m, y_m)\}

• Method of choice: Maximum Likelihood Estimation

Log-Likelihood Function (cont.)

Before we work on the formulas of the ML-estimator, we rewrite the posteriors using Bernoulli probability:

    p(y | x) = g(\theta^T x)^y \, (1 - g(\theta^T x))^{1 - y}

which shows the great benefit of the chosen notation for class numbers.

Log-Likelihood Function (cont.)

Now we can compute the log-likelihood function (assuming that the training samples are mutually independent):

    L(\theta) = \log \prod_{i=1}^{m} p(y_i | x_i)
              = \sum_{i=1}^{m} \log \left[ g(\theta^T x_i)^{y_i} \, (1 - g(\theta^T x_i))^{1 - y_i} \right]
              = \sum_{i=1}^{m} \left[ y_i \log g(\theta^T x_i) + (1 - y_i) \log (1 - g(\theta^T x_i)) \right]

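A minimal NumPy sketch of this log-likelihood (function and variable names are assumptions; the clipping of g is an added numerical safeguard, not part of the formula on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # L(theta) = sum_i [ y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i)) ]
    g = sigmoid(X @ theta)
    g = np.clip(g, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return np.sum(y * np.log(g) + (1.0 - y) * np.log(1.0 - g))

# Toy data: rows of X are feature vectors x_i (last column = bias), labels y_i in {0, 1}
X = np.array([[0.5, 1.0], [1.5, 1.0], [-0.3, 1.0]])
y = np.array([1, 1, 0])
theta = np.zeros(2)
print(log_likelihood(theta, X, y))       # equals 3 * log(0.5) for theta = 0
```
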
Logistic Regression

g) Why do we optimize the log-likelihood and not simply the likelihood?

• The logarithm is strictly monotonic, so maximizing the log-likelihood yields the same solution as maximizing the likelihood

• Addition is easier to handle than multiplication

• The likelihood is a product of many small probabilities; the logarithm transforms it to a reasonable number range and avoids numerical underflow

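A tiny numerical illustration of the last point (the probabilities are made up): the raw likelihood product underflows to zero in double precision, while the sum of logarithms stays in a perfectly usable range.

```python
import numpy as np

probs = np.full(1000, 1e-4)      # 1000 likelihood factors, each a small probability
print(np.prod(probs))            # 0.0 -> the product underflows in double precision
print(np.sum(np.log(probs)))     # about -9210.3 -> well-behaved for optimization
```
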
Logistic Regression

h) What important property with respect to optimization does the log-likelihood function fulfill?

What important property with respect to optimization does the log-likelihood function fulfill?

Notes for the expert:

• The negative of the log-likelihood function is the cross entropy of y and g(θ^T x).

• The negative of the log-likelihood function is a convex function.

Convex functions are easier to optimize:

• the function is just a single “hole” (one basin, no further local minima)

• gradient descent brings you to the minimum

Logistic Regression

i) What method did we use to maximize the log-likelihood?

What method did we use to maximize the log-likelihood?

• The log-likelihood function is concave.

• We use the Newton-Raphson algorithm to solve the unconstrained optimization problem:


For the (k + 1)-st iteration step, we get:

    \theta^{(k+1)} = \theta^{(k)} - \left[ \frac{\partial^2 L(\theta^{(k)})}{\partial \theta \, \partial \theta^T} \right]^{-1} \frac{\partial L(\theta^{(k)})}{\partial \theta}

Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.

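A compact sketch of such a Newton-Raphson update for logistic regression (illustrative function names and toy data; it uses the standard gradient X^T(y - g) and Hessian -X^T W X with W = diag(g_i (1 - g_i)) of the log-likelihood above, which is where the weighted least squares interpretation comes from):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_logreg(X, y, n_iter=10):
    # Maximize L(theta): theta <- theta - H^{-1} grad L  (H is the Hessian of L)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = sigmoid(X @ theta)            # g(theta^T x_i) for all samples
        grad = X.T @ (y - g)              # gradient of the log-likelihood
        W = np.diag(g * (1.0 - g))        # weights of the equivalent least-squares problem
        H = -X.T @ W @ X                  # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Toy usage (made-up, non-separable 1-D data; last column is the bias)
X = np.array([[0.2, 1.0], [1.5, 1.0], [0.8, 1.0], [2.5, 1.0]])
y = np.array([0, 0, 1, 1])
print(newton_raphson_logreg(X, y))
```
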
Logistic Regression

j) Could we also use a different method? What effect does the choice of optimization method have on
the convergence speed of the algorithm?

Answer:
• Alternative method: gradient descent (on the negative log-likelihood) without step-size control

• The Newton method uses second-order information as an implicit step-size control ⇒ fewer steps are necessary for convergence

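For comparison, a sketch of the first-order alternative mentioned above: plain gradient ascent on L(θ) with a fixed step size (the learning rate and iteration count are arbitrary assumptions). It typically needs far more iterations than the Newton-Raphson update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_logreg(X, y, lr=0.1, n_iter=5000):
    # theta <- theta + lr * grad L(theta), with a fixed (uncontrolled) step size lr
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
        theta = theta + lr * grad
    return theta

# Same toy data as in the Newton-Raphson sketch above
X = np.array([[0.2, 1.0], [1.5, 1.0], [0.8, 1.0], [2.5, 1.0]])
y = np.array([0, 0, 1, 1])
print(gradient_ascent_logreg(X, y))
```
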
