
Machine Learning II

The Linear Model

Souhaib Ben Taieb


March 8, 2022
University of Mons
Table of contents

Linear Classification with PLA and Pocket

Linear Regression

Useful notions in optimization

Back to linear regression

Logistic Regression

Linear Classification with PLA and Pocket
Input representation: the raw input x = (x0, x1, x2, ..., x256) requires a weight for every coordinate, (w0, w1, w2, ..., w256). Extracting a small number of informative features instead gives x = (x0, x1, x2), with only the weights (w0, w1, w2).
[Figure: the data plotted in the feature space (x1, x2), with input vector x = (x0, x1, x2).]
The perceptron hypothesis is h(x) = sign(w^T x). Given training data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), PLA picks a misclassified point, i.e. one with sign(w^T x_n) ≠ y_n, and updates

w ← w + y_n x_n.

The update moves w towards x_n when y_n = +1 and away from x_n when y_n = −1, so the chosen point becomes better classified.
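A minimal sketch of PLA in Python/NumPy, assuming synthetic data with labels in {−1, +1} and a constant first coordinate x_0 = 1; the names and settings are illustrative, not from the slides. (The pocket variant would additionally keep the best weights, in terms of E_in, seen so far.)

```python
import numpy as np

def pla(X, y, max_iter=10_000, rng=np.random.default_rng(0)):
    """Perceptron learning algorithm.
    X: N x (d+1) matrix whose rows are the inputs x_n (first column = 1).
    y: length-N vector of labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break                          # all points correctly classified
        n = rng.choice(misclassified)      # pick any misclassified point
        w = w + y[n] * X[n]                # PLA update: w <- w + y_n x_n
    return w
```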
[Figures: PLA and the pocket algorithm compared on the classification task — evolution of the error measures over the iterations and the resulting classifiers.]
Linear Regression
The input is a vector x = (x0, x1, ..., x_d)^T, with the convention x0 = 1, and the linear hypothesis is

h(x) = Σ_{i=0}^{d} w_i x_i = w^T x.
The training data are (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where each target y_n ∈ R is a real value associated with the input x_n.
How to measure the error in linear regression

How well does h(x) = w^T x approximate f(x)?

In linear regression, we often use the squared error loss function:

L(f(x), h(x)) = (f(x) − h(x))^2.

Then, the in-sample error is given by

E_in(h) = (1/N) Σ_{n=1}^{N} (y_n − h(x_n))^2,

or, equivalently, by

E_in(w) = (1/N) Σ_{n=1}^{N} (y_n − w^T x_n)^2.
How to measure the error in linear regression

E_in(w) = (1/N) Σ_{n=1}^{N} (y_n − w^T x_n)^2   (1)
        = (1/N) ‖y − Xw‖_2^2,   (2)

where ‖·‖_2 is the Euclidean norm of a vector,

y = (y_1, y_2, ..., y_N)^T ∈ R^N,

and X ∈ R^{N×(d+1)} is the matrix whose n-th row is x_n^T.
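As a quick numerical sanity check (with made-up data), the sum-of-squares form and the norm form of E_in agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows x_n^T, with x_0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

Ein_sum  = np.mean([(y[n] - w @ X[n]) ** 2 for n in range(N)])   # (1/N) sum (y_n - w^T x_n)^2
Ein_norm = np.linalg.norm(y - X @ w) ** 2 / N                    # (1/N) ||y - Xw||_2^2
assert np.isclose(Ein_sum, Ein_norm)
```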
Minimizing Ein

E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2   (3)
        = (1/N) ‖y − Xw‖_2^2   (4)
        = (1/N) (y − Xw)^T (y − Xw)   (since ‖v‖_2^2 = v^T v)   (5)
        = (1/N) (y^T y − 2 w^T X^T y + w^T X^T X w).   (6)

We need to solve the following optimization problem:

w_lin = argmin_{w ∈ R^{d+1}} E_in(w),

where E_in(w) is a single-valued (i.e. scalar-valued) multivariable function.
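Because E_in(w) is just a scalar-valued function of w, it can in principle be handed to any general-purpose numerical optimizer; the sketch below uses scipy.optimize.minimize on synthetic data purely as an illustration (the closed-form solution derived later is the standard way to solve this problem).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, d = 200, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ rng.normal(size=d + 1) + 0.1 * rng.normal(size=N)

def Ein(w):
    r = y - X @ w
    return r @ r / N                                   # (1/N) ||y - Xw||^2

w_numeric = minimize(Ein, x0=np.zeros(d + 1)).x        # generic minimizer (BFGS by default)
w_closed  = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares solution, for comparison
print(np.max(np.abs(w_numeric - w_closed)))            # should be close to zero
```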
Useful notions in optimization
Multivariate calculus - Gradient

Consider a function r(z), where z = [z_1, z_2, ..., z_d]^T is a d-vector.

The gradient vector of this function is given by the partial derivatives with respect to each of the independent variables:

∇r(z) ≡ g(z) ≡ [∂r/∂z_1 (z), ∂r/∂z_2 (z), ..., ∂r/∂z_d (z)]^T.
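To make the definition concrete, here is a sketch that approximates the gradient of a made-up function r by central finite differences; comparing such an approximation with an analytic gradient is a standard debugging check.

```python
import numpy as np

def r(z):
    # toy scalar-valued function of a 3-vector (chosen only for this example)
    return z[0] ** 2 + 3.0 * z[0] * z[1] + np.sin(z[2])

def grad_fd(f, z, eps=1e-6):
    """Entry-wise central finite-difference approximation of the gradient of f at z."""
    g = np.zeros_like(z, dtype=float)
    for i in range(len(z)):
        e = np.zeros_like(z, dtype=float)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

z = np.array([1.0, -2.0, 0.5])
print(grad_fd(r, z))   # ≈ [2*z[0] + 3*z[1], 3*z[0], cos(z[2])] = [-4, 3, 0.8776...]
```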
Multivariate calculus - Gradient

The gradient generalizes the derivative: it captures the local rate of change of the function in each of the coordinate directions.

It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value.

If the gradient exists at every point, the function is said to be differentiable. If each entry of the gradient is continuous, we say the function is once continuously differentiable.
Multivariate calculus - Hessian

While the gradient of a function of d variables (i.e. the "first derivative") is a d-vector, the "second derivative" of a d-variable function is defined by d^2 partial derivatives (the derivatives of the d first partial derivatives with respect to the d variables):

∂/∂z_i (∂r/∂z_j) = ∂²r/(∂z_i ∂z_j) for i ≠ j,   and   ∂/∂z_i (∂r/∂z_i) = ∂²r/∂z_i² for i = j,

where i, j = 1, ..., d.
Multivariate calculus - Hessian

If r is single-valued and the partial derivatives ∂r/∂z_i, ∂r/∂z_j and ∂²r/(∂z_i ∂z_j) are continuous, then ∂²r/(∂z_j ∂z_i) exists and ∂²r/(∂z_i ∂z_j) = ∂²r/(∂z_j ∂z_i).

Therefore the second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix:

∇²r(z) ≡ H(z) ≡ [ ∂²r/∂z_1²(z) ··· ∂²r/(∂z_1 ∂z_d)(z) ; ⋮ ⋱ ⋮ ; ∂²r/(∂z_d ∂z_1)(z) ··· ∂²r/∂z_d²(z) ],

which contains d(d + 1)/2 independent elements.

If a function has a Hessian matrix at every point, we say that the function is twice differentiable. If each entry of the Hessian is continuous, we say the function is twice continuously differentiable.
Multivariate calculus - Hessian

Note on notations: ∇z r means the gradient of r where the ith par-


tial derivative is taken with respect to zi .
For functions of a vector, the gradient is a vector, and we cannot
take the gradient of a vector. Therefore, it is not the case that
the Hessian is the gradient of the gradient. However, this is almost
true, in the following sense: If we look at the ith entry of the gradient
(∇z r (z))i = ∂r∂z(z)
i
, and take the gradient with respect to z, we get
 ∂r 
∂zi ∂z1 (z)
 ∂r 
∂r (z)   ∂zi ∂z2 (z) 
,
∇z ≡ .. 
∂zi  . 
∂r
∂zi ∂zd (z)
which is the ith column (or row) of the Hessian. Therefore,
∇2z = [∇z (∇z r (z))1 ∇z (∇z r (z))2 · · · ∇z (∇z r (z))d ]. 21
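In the same spirit, a finite-difference Hessian can be assembled entry by entry and checked for the symmetry discussed above; the function below is the same made-up example as in the gradient sketch.

```python
import numpy as np

def r(z):
    return z[0] ** 2 + 3.0 * z[0] * z[1] + np.sin(z[2])

def hessian_fd(f, z, eps=1e-5):
    """Central finite-difference approximation of the Hessian of f at z."""
    d = len(z)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(z + ei + ej) - f(z + ei - ej)
                       - f(z - ei + ej) + f(z - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian_fd(r, np.array([1.0, -2.0, 0.5]))
print(np.allclose(H, H.T, atol=1e-4))   # the Hessian is (numerically) symmetric
```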
Definiteness of a matrix

• A ∈ Rn×n is positive semi-definite, denoted A  0, if


v T Av ≥ 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
larger or equal to zero.
• A ∈ Rn×n is positive definite, denoted A  0, if
v T Av > 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
strictly positive.
• A ∈ Rn×n is negative definite, denoted A ≺ 0, if
v T Av < 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
strictly negative.
• A ∈ Rn×n is indefinite if there exists v1 , v2 ∈ Rn>0 such that
v1T Av1 > 0 and v2T Av2 < 0. If A = AT , A has eigenvalues of
mixed sign.

22
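For a symmetric matrix, these sign conditions can be read off the eigenvalues, which is easy to check numerically; a small sketch with arbitrary example matrices (the negative semi-definite case, not listed above, is added for completeness):

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix from the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)            # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semi-definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semi-definite"    # not listed on the slide, added for completeness
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))    # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```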
Convex functions

A function r : R^d → R is called convex if for any z_1, z_2 ∈ R^d and every λ ∈ [0, 1], we have

r(λ z_1 + (1 − λ) z_2) ≤ λ r(z_1) + (1 − λ) r(z_2).

• A twice-differentiable r is convex if and only if its Hessian is positive semi-definite everywhere.
• For a convex function, any local minimum (optimum) is also a global minimum (optimum).
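A quick numerical illustration of the definition (not a proof): for the convex function r(z) = ‖z‖², the inequality holds for randomly drawn z_1, z_2 and λ.

```python
import numpy as np

def r(z):
    return float(z @ z)                    # r(z) = ||z||^2 is convex

rng = np.random.default_rng(2)
for _ in range(1000):
    z1, z2 = rng.normal(size=4), rng.normal(size=4)
    lam = rng.uniform()
    # convexity inequality: r(lam*z1 + (1-lam)*z2) <= lam*r(z1) + (1-lam)*r(z2)
    assert r(lam * z1 + (1 - lam) * z2) <= lam * r(z1) + (1 - lam) * r(z2) + 1e-12
```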
Optimality conditions for unconstrained optimization

• Necessary optimality conditions
  • First-order necessary condition: if z* is a local minimum of r and r is once continuously differentiable, then ∇r(z*) = 0.
  • Second-order necessary condition: if z* is a local minimum of r and r is twice continuously differentiable, then ∇²r(z*) ⪰ 0.
  • There may exist points that satisfy these conditions but are not local minima, e.g. z = 0 for r(z) = z³ or r(z) = −|z|³; note that z = 0 satisfies the same conditions for r(z) = |z|³, where it is a minimum, so the conditions alone cannot distinguish these cases.
• Sufficient optimality conditions
  • First-order sufficient condition: if r is once continuously differentiable and convex, and ∇r(z*) = 0, then z* is a global minimum of r.
  • Second-order sufficient condition: if r is twice continuously differentiable, ∇r(z*) = 0 and ∇²r(z*) ≻ 0, then z* is a (strict) local minimum of r.
Back to linear regression
Minimizing Ein

The gradient of E_in(w) is given by

∇E_in(w) = ∇( (1/N) (y^T y − 2 w^T X^T y + w^T X^T X w) )   (7)
         = (2/N) (X^T X w − X^T y),   (8)

and the Hessian is given by

∇²E_in(w) = ∇( (2/N) (X^T X w − X^T y) )   (9)
          = (2/N) X^T X,   (10)

which is positive semi-definite.

Gradient identities:
• ∇(w^T A w) = (A + A^T) w
• ∇(w^T b) = b
• ∇(A w) = A (the Jacobian of A w with respect to w)
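The expressions (8) and (10) are easy to verify numerically; the sketch below (with random data) compares the analytic gradient with finite differences and checks that the eigenvalues of the Hessian (2/N) X^T X are non-negative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

def Ein(w):
    res = y - X @ w
    return res @ res / N

grad_analytic = (2.0 / N) * (X.T @ X @ w - X.T @ y)
eps = 1e-6
grad_fd = np.array([(Ein(w + eps * e) - Ein(w - eps * e)) / (2 * eps) for e in np.eye(d + 1)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-5)

H = (2.0 / N) * X.T @ X
print(np.min(np.linalg.eigvalsh(H)))   # >= 0 (up to rounding): positive semi-definite
```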
Minimizing Ein

Since E_in(w) is differentiable and convex, we can find the global minimum of E_in(w) by requiring ∇E_in(w) = 0:

∇E_in(w) = 0   (11)
⇐⇒ (2/N) (X^T X w − X^T y) = 0   (12)
⇐⇒ X^T X w = X^T y   (13)
⇐⇒ w = (X^T X)^{-1} X^T y   (if X^T X is invertible).   (14)
The linear regression algorithm:

• Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each input x_n includes the constant coordinate x_0 = 1:
  X = [x_1^T ; x_2^T ; ... ; x_N^T],   y = (y_1, y_2, ..., y_N)^T.
• Compute the pseudo-inverse X† = (X^T X)^{-1} X^T.
• Return w = X† y.
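A sketch of this one-step algorithm in NumPy on synthetic data. Solving the normal equations is shown to mirror the slide, but in practice np.linalg.lstsq or np.linalg.pinv is the numerically preferable way to compute the same w.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows x_n^T, with x_0 = 1
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)     # w = (X^T X)^{-1} X^T y (normal equations)
w_pinv   = np.linalg.pinv(X) @ y                 # w = X† y via the pseudo-inverse
w_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]  # dedicated least-squares solver

assert np.allclose(w_normal, w_pinv) and np.allclose(w_normal, w_lstsq)
```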
[Figure: the linear regression fit — a line through the data in one dimension (x, y) and a plane in two dimensions (x_1, x_2, y).]
Learning curves in linear regression

The learning curves plot the expected in-sample and out-of-sample errors, E_D[E_in(g^(D))] and E_D[E_out(g^(D))], as a function of the size N of the data set D.
Learning curves in linear regression

Consider a noisy linear target y = w*^T x + ε. Given a data set D = {(x_1, y_1), ..., (x_N, y_N)}, linear regression returns

w = (X^T X)^{-1} X^T y.

The in-sample error is measured through the residual vector Xw − y, while the out-of-sample behaviour is assessed through Xw − y′, where y′ is a second, independently generated target vector.
Learning curves in linear regression

For a noisy linear target with noise variance σ², the best achievable error is σ², and one can show that

• the expected in-sample error is E_D[E_in] = σ² (1 − (d+1)/N),
• the expected out-of-sample error is E_D[E_out] = σ² (1 + (d+1)/N),
• the expected generalization error is therefore 2σ² (d+1)/N,

which decreases as N grows relative to the number of parameters d + 1.
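A hedged simulation sketch of these learning curves: data are generated from a noisy linear target with made-up parameters, fitted by least squares, and E_in and an E_out estimate (computed on a fresh sample from the same distribution) are averaged over many data sets; the averages should land close to the σ²(1 ∓ (d+1)/N) expressions above.

```python
import numpy as np

rng = np.random.default_rng(5)
d, sigma, trials = 5, 0.5, 2000
w_star = rng.normal(size=d + 1)

def average_errors(N):
    Ein, Eout = [], []
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        y = X @ w_star + sigma * rng.normal(size=N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        Ein.append(np.mean((X @ w - y) ** 2))
        Xt = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # fresh test data
        yt = Xt @ w_star + sigma * rng.normal(size=N)
        Eout.append(np.mean((Xt @ w - yt) ** 2))
    return np.mean(Ein), np.mean(Eout)

for N in (10, 20, 50, 100):
    Ein, Eout = average_errors(N)
    print(N, round(Ein, 4), round(sigma**2 * (1 - (d + 1) / N), 4),
             round(Eout, 4), round(sigma**2 * (1 + (d + 1) / N), 4))
```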
Linear regression can also be used for classification. Binary labels are real values too: y = f(x) = ±1 ∈ R. So we can run linear regression on the ±1 targets to find w with w^T x_n ≈ y_n = ±1, and then classify using sign(w^T x_n).
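A short sketch on synthetic ±1 data (an illustrative setup, not from the slides): regress on the ±1 labels, then classify with the sign of the fitted signal.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))         # labels y_n = ±1 (real numbers too)

w = np.linalg.lstsq(X, y, rcond=None)[0]            # fit w with w^T x_n ≈ y_n = ±1
pred = np.sign(X @ w)                               # classify with sign(w^T x_n)
print("training accuracy:", np.mean(pred == y))
```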
Logistic Regression
The three linear models all start from the same signal

s = Σ_{i=0}^{d} w_i x_i = w^T x,

computed from the inputs x_0, x_1, ..., x_d. They differ in how the signal is turned into the output: linear classification uses h(x) = sign(s), linear regression uses h(x) = s, and logistic regression uses h(x) = θ(s).
The logistic function θ

θ(s) = e^s / (1 + e^s).

It is a smooth, monotonically increasing "soft threshold" that maps the real line to (0, 1) and crosses 1/2 at s = 0.
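A minimal implementation of θ; computing e^s/(1+e^s) directly overflows for large positive s, so the sketch below uses a standard numerically stable rearrangement (an implementation detail, not something discussed on the slide).

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = e^s / (1 + e^s), computed without overflow."""
    s = np.asarray(s, dtype=float)
    e = np.exp(-np.abs(s))                           # always in (0, 1]
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(theta(np.array([-5.0, 0.0, 5.0])))             # ≈ [0.0067, 0.5, 0.9933]
```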
The logistic regression hypothesis applies θ to the linear signal:

h(x) = θ(s), with s = w^T x,

so h(x) takes values in (0, 1) and can be interpreted as a probability.
The target is a noisy, probabilistic one: for a data point (x, y), the binary label y is generated according to

P(y | x) = f(x) if y = +1;  1 − f(x) if y = −1,

where f : R^d → [0, 1]. Logistic regression learns a hypothesis

g(x) = θ(w^T x) ≈ f(x).

The data does not give us the value of f explicitly. It gives us samples generated by this probability. How do we learn from such data?
To learn from such data, we ask how likely it is that the observed labels y would be generated from the inputs x if a candidate hypothesis h played the role of the target f, i.e. if h = f. For a single point (x, y), this likelihood is

P(y | x) = h(x) if y = +1;  1 − h(x) if y = −1.
Formula for likelihood

Since the data points are assumed to be (conditionally) independently generated, the probability of observing all the y_n's in the data set from the corresponding x_n's is given by

P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N)   (15)
= Π_{n=1}^{N} P(y_n | x_1, x_2, ..., x_N)   (16)
= Π_{n=1}^{N} P(y_n | x_n),   (17)

where

P(y | x) = h(x) for y = +1;  1 − h(x) for y = −1.

The method of maximum likelihood selects the hypothesis h which maximizes this probability.
From likelihood to Ein

Maximize Π_{n=1}^{N} P(y_n | x_n)
≡ Maximize ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ Maximize (1/N) ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ Minimize −(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) ).

Furthermore, we can write

−(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) )
= (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) )
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / h(x_n)) + 1{y_n = −1} ln(1 / (1 − h(x_n))) ].
From likelihood to Ein

We have h(x) = θ(w^T x), where θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}) satisfies θ(−s) = 1 − θ(s). So we can write

(1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / h(x_n)) + 1{y_n = −1} ln(1 / (1 − h(x_n))) ]
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / θ(w^T x_n)) + 1{y_n = −1} ln(1 / (1 − θ(w^T x_n))) ]
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / θ(w^T x_n)) + 1{y_n = −1} ln(1 / θ(−w^T x_n)) ]
= (1/N) Σ_{n=1}^{N} ln(1 / θ(y_n w^T x_n))        (i.e. P(y_n | x_n) = θ(y_n w^T x_n))
= (1/N) Σ_{n=1}^{N} ln(1 + e^{−y_n w^T x_n}),

and the last expression is the in-sample error E_in(w) of logistic regression.
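Unlike the squared-error case, this E_in has no closed-form minimizer, so it is typically minimized iteratively. The sketch below runs plain gradient descent with an arbitrary fixed step size on synthetic data, using the gradient ∇E_in(w) = −(1/N) Σ_n y_n x_n θ(−y_n w^T x_n); it is an illustration, not the exact algorithm prescribed in the course.

```python
import numpy as np

def theta(s):
    e = np.exp(-np.abs(s))
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def Ein(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))      # (1/N) Σ ln(1 + exp(-y_n w^T x_n))

def grad_Ein(w, X, y):
    return -(X.T @ (y * theta(-y * (X @ w)))) / len(y)   # -(1/N) Σ y_n x_n θ(-y_n w^T x_n)

rng = np.random.default_rng(7)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.uniform(size=N) < theta(X @ w_true), 1.0, -1.0)   # y sampled from P(y|x)

w, eta = np.zeros(d + 1), 0.5            # arbitrary fixed learning rate
for _ in range(2000):
    w = w - eta * grad_Ein(w, X, y)
print(Ein(w, X, y), w)                   # w should be roughly aligned with w_true
```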
Cross-entropy

For two probability distributions with binary outcomes, {p, 1 − p} and {q, 1 − q}, the cross-entropy (from information theory) is

p log(1/q) + (1 − p) log(1/(1 − q)).

The in-sample error above corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = 1{y_n = +1} and q = h(x_n).
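A small numerical check of this correspondence (with arbitrary values): taking p = 1{y = +1} and q = h(x) = θ(w^T x), the cross-entropy equals the pointwise term ln(1 + e^{−y w^T x}) of E_in.

```python
import numpy as np

rng = np.random.default_rng(8)
for _ in range(100):
    s = 3.0 * rng.normal()                     # the signal s = w^T x
    y = rng.choice([-1.0, 1.0])
    q = 1.0 / (1.0 + np.exp(-s))               # q = h(x) = θ(s)
    p = 1.0 if y == 1.0 else 0.0               # p = 1{y = +1}
    ce = -np.log(q) if p == 1.0 else -np.log(1.0 - q)   # p ln(1/q) + (1-p) ln(1/(1-q))
    assert np.isclose(ce, np.log1p(np.exp(-y * s)))
```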
Comparing the two in-sample errors:

• Logistic regression: E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{−y_n w^T x_n}) ←− minimized iteratively (no closed-form solution).
• Linear regression: E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2 ←− minimized analytically (closed-form solution w = (X^T X)^{-1} X^T y).