
Machine Learning II

The Linear Model

Souhaib Ben Taieb


March 8, 2022
University of Mons
Table of contents

Linear Classification with PLA and Pocket

Linear Regression

Useful notions in optimization

Back to linear regression

Logistic Regression

Linear Classification with PLA and Pocket
Input representation: the raw input x = (x0, x1, x2, ..., x256) requires a weight for every coordinate, (w0, w1, w2, ..., w256). Extracting a small number of informative features instead gives x = (x0, x1, x2), with only the weights (w0, w1, w2).
[Figure: the data plotted in the feature space (x1, x2), with input vector x = (x0, x1, x2).]
The perceptron hypothesis is h(x) = sign(w^T x). Given training data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), PLA picks a misclassified point, i.e. one with sign(w^T x_n) ≠ y_n, and updates

w ← w + y_n x_n.

The update moves w towards x_n when y_n = +1 and away from x_n when y_n = −1, so the chosen point becomes better classified.
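A minimal sketch of PLA in Python/NumPy, assuming synthetic data with labels in {−1, +1} and a constant first coordinate x_0 = 1; the names and settings are illustrative, not from the slides. (The pocket variant would additionally keep the best weights, in terms of E_in, seen so far.)

```python
import numpy as np

def pla(X, y, max_iter=10_000, rng=np.random.default_rng(0)):
    """Perceptron learning algorithm.
    X: N x (d+1) matrix whose rows are the inputs x_n (first column = 1).
    y: length-N vector of labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break                          # all points correctly classified
        n = rng.choice(misclassified)      # pick any misclassified point
        w = w + y[n] * X[n]                # PLA update: w <- w + y_n x_n
    return w
```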
[Figures: PLA and the pocket algorithm compared on the classification task — evolution of the error measures over the iterations and the resulting classifiers.]
Linear Regression
The input is a vector x = (x0, x1, ..., x_d)^T, with the convention x0 = 1, and the linear hypothesis is

h(x) = Σ_{i=0}^{d} w_i x_i = w^T x.
The training data are (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where each target y_n ∈ R is a real value associated with the input x_n.
How to measure the error in linear regression

How well does h(x) = w^T x approximate f(x)?

In linear regression, we often use the squared error loss function:

L(f(x), h(x)) = (f(x) − h(x))^2.

Then, the in-sample error is given by

E_in(h) = (1/N) Σ_{n=1}^{N} (y_n − h(x_n))^2,

or, equivalently, by

E_in(w) = (1/N) Σ_{n=1}^{N} (y_n − w^T x_n)^2.
How to measure the error in linear regression

E_in(w) = (1/N) Σ_{n=1}^{N} (y_n − w^T x_n)^2   (1)
        = (1/N) ‖y − Xw‖_2^2,   (2)

where ‖·‖_2 is the Euclidean norm of a vector,

y = (y_1, y_2, ..., y_N)^T ∈ R^N,

and X ∈ R^{N×(d+1)} is the matrix whose n-th row is x_n^T.
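As a quick numerical sanity check (with made-up data), the sum-of-squares form and the norm form of E_in agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows x_n^T, with x_0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

Ein_sum  = np.mean([(y[n] - w @ X[n]) ** 2 for n in range(N)])   # (1/N) sum (y_n - w^T x_n)^2
Ein_norm = np.linalg.norm(y - X @ w) ** 2 / N                    # (1/N) ||y - Xw||_2^2
assert np.isclose(Ein_sum, Ein_norm)
```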
Minimizing Ein

E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2   (3)
        = (1/N) ‖y − Xw‖_2^2   (4)
        = (1/N) (y − Xw)^T (y − Xw)   (since ‖v‖_2^2 = v^T v)   (5)
        = (1/N) (y^T y − 2 w^T X^T y + w^T X^T X w).   (6)

We need to solve the following optimization problem:

w_lin = argmin_{w ∈ R^{d+1}} E_in(w),

where E_in(w) is a single-valued (i.e. scalar-valued) multivariable function.
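Because E_in(w) is just a scalar-valued function of w, it can in principle be handed to any general-purpose numerical optimizer; the sketch below uses scipy.optimize.minimize on synthetic data purely as an illustration (the closed-form solution derived later is the standard way to solve this problem).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, d = 200, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ rng.normal(size=d + 1) + 0.1 * rng.normal(size=N)

def Ein(w):
    r = y - X @ w
    return r @ r / N                                   # (1/N) ||y - Xw||^2

w_numeric = minimize(Ein, x0=np.zeros(d + 1)).x        # generic minimizer (BFGS by default)
w_closed  = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares solution, for comparison
print(np.max(np.abs(w_numeric - w_closed)))            # should be close to zero
```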
Useful notions in optimization
Multivariate calculus - Gradient

Consider a function r(z), where z = [z_1, z_2, ..., z_d]^T is a d-vector.

The gradient vector of this function is given by the partial derivatives with respect to each of the independent variables:

∇r(z) ≡ g(z) ≡ [∂r/∂z_1 (z), ∂r/∂z_2 (z), ..., ∂r/∂z_d (z)]^T.
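To make the definition concrete, here is a sketch that approximates the gradient of a made-up function r by central finite differences; comparing such an approximation with an analytic gradient is a standard debugging check.

```python
import numpy as np

def r(z):
    # toy scalar-valued function of a 3-vector (chosen only for this example)
    return z[0] ** 2 + 3.0 * z[0] * z[1] + np.sin(z[2])

def grad_fd(f, z, eps=1e-6):
    """Entry-wise central finite-difference approximation of the gradient of f at z."""
    g = np.zeros_like(z, dtype=float)
    for i in range(len(z)):
        e = np.zeros_like(z, dtype=float)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

z = np.array([1.0, -2.0, 0.5])
print(grad_fd(r, z))   # ≈ [2*z[0] + 3*z[1], 3*z[0], cos(z[2])] = [-4, 3, 0.8776...]
```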
Multivariate calculus - Gradient

The gradient generalizes the derivative: it captures the local rate of change of the function in each of the coordinate directions.

It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value.

If the gradient exists at every point, the function is said to be differentiable. If each entry of the gradient is continuous, we say the function is once continuously differentiable.
Multivariate calculus - Hessian

While the gradient of a function of d variables (i.e. the "first derivative") is a d-vector, the "second derivative" of a d-variable function is defined by d^2 partial derivatives (the derivatives of the d first partial derivatives with respect to the d variables):

∂/∂z_i (∂r/∂z_j) = ∂²r/(∂z_i ∂z_j) for i ≠ j,   and   ∂/∂z_i (∂r/∂z_i) = ∂²r/∂z_i² for i = j,

where i, j = 1, ..., d.
Multivariate calculus - Hessian

If r is single-valued and the partial derivatives ∂r/∂z_i, ∂r/∂z_j and ∂²r/(∂z_i ∂z_j) are continuous, then ∂²r/(∂z_j ∂z_i) exists and ∂²r/(∂z_i ∂z_j) = ∂²r/(∂z_j ∂z_i).

Therefore the second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix:

∇²r(z) ≡ H(z) ≡ [ ∂²r/∂z_1²(z) ··· ∂²r/(∂z_1 ∂z_d)(z) ; ⋮ ⋱ ⋮ ; ∂²r/(∂z_d ∂z_1)(z) ··· ∂²r/∂z_d²(z) ],

which contains d(d + 1)/2 independent elements.

If a function has a Hessian matrix at every point, we say that the function is twice differentiable. If each entry of the Hessian is continuous, we say the function is twice continuously differentiable.
Multivariate calculus - Hessian

Note on notations: ∇z r means the gradient of r where the ith par-


tial derivative is taken with respect to zi .
For functions of a vector, the gradient is a vector, and we cannot
take the gradient of a vector. Therefore, it is not the case that
the Hessian is the gradient of the gradient. However, this is almost
true, in the following sense: If we look at the ith entry of the gradient
(∇z r (z))i = ∂r∂z(z)
i
, and take the gradient with respect to z, we get
 ∂r 
∂zi ∂z1 (z)
 ∂r 
∂r (z)   ∂zi ∂z2 (z) 
,
∇z ≡ .. 
∂zi  . 
∂r
∂zi ∂zd (z)
which is the ith column (or row) of the Hessian. Therefore,
∇2z = [∇z (∇z r (z))1 ∇z (∇z r (z))2 · · · ∇z (∇z r (z))d ]. 21
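In the same spirit, a finite-difference Hessian can be assembled entry by entry and checked for the symmetry discussed above; the function below is the same made-up example as in the gradient sketch.

```python
import numpy as np

def r(z):
    return z[0] ** 2 + 3.0 * z[0] * z[1] + np.sin(z[2])

def hessian_fd(f, z, eps=1e-5):
    """Central finite-difference approximation of the Hessian of f at z."""
    d = len(z)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(z + ei + ej) - f(z + ei - ej)
                       - f(z - ei + ej) + f(z - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian_fd(r, np.array([1.0, -2.0, 0.5]))
print(np.allclose(H, H.T, atol=1e-4))   # the Hessian is (numerically) symmetric
```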
Definiteness of a matrix

• A ∈ Rn×n is positive semi-definite, denoted A  0, if


v T Av ≥ 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
larger or equal to zero.
• A ∈ Rn×n is positive definite, denoted A  0, if
v T Av > 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
strictly positive.
• A ∈ Rn×n is negative definite, denoted A ≺ 0, if
v T Av < 0, ∀v ∈ Rn>0 . If A = AT , all the eigenvalues of A are
strictly negative.
• A ∈ Rn×n is indefinite if there exists v1 , v2 ∈ Rn>0 such that
v1T Av1 > 0 and v2T Av2 < 0. If A = AT , A has eigenvalues of
mixed sign.

22
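For a symmetric matrix, these sign conditions can be read off the eigenvalues, which is easy to check numerically; a small sketch with arbitrary example matrices (the negative semi-definite case, not listed above, is added for completeness):

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix from the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)            # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semi-definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semi-definite"    # not listed on the slide, added for completeness
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))    # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```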
Convex functions

A function r : R^d → R is called convex if for any z_1, z_2 ∈ R^d and every λ ∈ [0, 1], we have

r(λ z_1 + (1 − λ) z_2) ≤ λ r(z_1) + (1 − λ) r(z_2).

• A twice-differentiable r is convex if and only if its Hessian is positive semi-definite everywhere.
• For a convex function, any local minimum (optimum) is also a global minimum (optimum).
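A quick numerical illustration of the definition (not a proof): for the convex function r(z) = ‖z‖², the inequality holds for randomly drawn z_1, z_2 and λ.

```python
import numpy as np

def r(z):
    return float(z @ z)                    # r(z) = ||z||^2 is convex

rng = np.random.default_rng(2)
for _ in range(1000):
    z1, z2 = rng.normal(size=4), rng.normal(size=4)
    lam = rng.uniform()
    # convexity inequality: r(lam*z1 + (1-lam)*z2) <= lam*r(z1) + (1-lam)*r(z2)
    assert r(lam * z1 + (1 - lam) * z2) <= lam * r(z1) + (1 - lam) * r(z2) + 1e-12
```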
Optimality conditions for unconstrained optimization

• Necessary optimality conditions
  • First-order necessary condition: if z* is a local minimum of r and r is once continuously differentiable, then ∇r(z*) = 0.
  • Second-order necessary condition: if z* is a local minimum of r and r is twice continuously differentiable, then ∇²r(z*) ⪰ 0.
  • There may exist points that satisfy these conditions but are not local minima, e.g. z = 0 for r(z) = z³ or r(z) = −|z|³; note that z = 0 satisfies the same conditions for r(z) = |z|³, where it is a minimum, so the conditions alone cannot distinguish these cases.
• Sufficient optimality conditions
  • First-order sufficient condition: if r is once continuously differentiable and convex, and ∇r(z*) = 0, then z* is a global minimum of r.
  • Second-order sufficient condition: if r is twice continuously differentiable, ∇r(z*) = 0 and ∇²r(z*) ≻ 0, then z* is a (strict) local minimum of r.
Back to linear regression
Minimizing Ein

The gradient of E_in(w) is given by

∇E_in(w) = ∇( (1/N) (y^T y − 2 w^T X^T y + w^T X^T X w) )   (7)
         = (2/N) (X^T X w − X^T y),   (8)

and the Hessian is given by

∇²E_in(w) = ∇( (2/N) (X^T X w − X^T y) )   (9)
          = (2/N) X^T X,   (10)

which is positive semi-definite.

Gradient identities:
• ∇(w^T A w) = (A + A^T) w
• ∇(w^T b) = b
• ∇(A w) = A (the Jacobian of A w with respect to w)
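The expressions (8) and (10) are easy to verify numerically; the sketch below (with random data) compares the analytic gradient with finite differences and checks that the eigenvalues of the Hessian (2/N) X^T X are non-negative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

def Ein(w):
    res = y - X @ w
    return res @ res / N

grad_analytic = (2.0 / N) * (X.T @ X @ w - X.T @ y)
eps = 1e-6
grad_fd = np.array([(Ein(w + eps * e) - Ein(w - eps * e)) / (2 * eps) for e in np.eye(d + 1)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-5)

H = (2.0 / N) * X.T @ X
print(np.min(np.linalg.eigvalsh(H)))   # >= 0 (up to rounding): positive semi-definite
```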
Minimizing Ein

Since E_in(w) is differentiable and convex, we can find the global minimum of E_in(w) by requiring ∇E_in(w) = 0:

∇E_in(w) = 0   (11)
⇐⇒ (2/N) (X^T X w − X^T y) = 0   (12)
⇐⇒ X^T X w = X^T y   (13)
⇐⇒ w = (X^T X)^{-1} X^T y   (if X^T X is invertible).   (14)
The linear regression algorithm:

• Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each input x_n includes the constant coordinate x_0 = 1:
  X = [x_1^T ; x_2^T ; ... ; x_N^T],   y = (y_1, y_2, ..., y_N)^T.
• Compute the pseudo-inverse X† = (X^T X)^{-1} X^T.
• Return w = X† y.
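A sketch of this one-step algorithm in NumPy on synthetic data. Solving the normal equations is shown to mirror the slide, but in practice np.linalg.lstsq or np.linalg.pinv is the numerically preferable way to compute the same w.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows x_n^T, with x_0 = 1
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)     # w = (X^T X)^{-1} X^T y (normal equations)
w_pinv   = np.linalg.pinv(X) @ y                 # w = X† y via the pseudo-inverse
w_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]  # dedicated least-squares solver

assert np.allclose(w_normal, w_pinv) and np.allclose(w_normal, w_lstsq)
```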
[Figure: the linear regression fit — a line through the data in one dimension (x, y) and a plane in two dimensions (x_1, x_2, y).]
Learning curves in linear regression

The learning curves plot the expected in-sample and out-of-sample errors, E_D[E_in(g^(D))] and E_D[E_out(g^(D))], as a function of the size N of the data set D.
Learning curves in linear regression

Consider a noisy linear target y = w*^T x + ε. Given a data set D = {(x_1, y_1), ..., (x_N, y_N)}, linear regression returns

w = (X^T X)^{-1} X^T y.

The in-sample error is measured through the residual vector Xw − y, while the out-of-sample behaviour is assessed through Xw − y′, where y′ is a second, independently generated target vector.
Learning curves in linear regression

For a noisy linear target with noise variance σ², the best achievable error is σ², and one can show that

• the expected in-sample error is E_D[E_in] = σ² (1 − (d+1)/N),
• the expected out-of-sample error is E_D[E_out] = σ² (1 + (d+1)/N),
• the expected generalization error is therefore 2σ² (d+1)/N,

which decreases as N grows relative to the number of parameters d + 1.
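A hedged simulation sketch of these learning curves: data are generated from a noisy linear target with made-up parameters, fitted by least squares, and E_in and an E_out estimate (computed on a fresh sample from the same distribution) are averaged over many data sets; the averages should land close to the σ²(1 ∓ (d+1)/N) expressions above.

```python
import numpy as np

rng = np.random.default_rng(5)
d, sigma, trials = 5, 0.5, 2000
w_star = rng.normal(size=d + 1)

def average_errors(N):
    Ein, Eout = [], []
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        y = X @ w_star + sigma * rng.normal(size=N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        Ein.append(np.mean((X @ w - y) ** 2))
        Xt = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # fresh test data
        yt = Xt @ w_star + sigma * rng.normal(size=N)
        Eout.append(np.mean((Xt @ w - yt) ** 2))
    return np.mean(Ein), np.mean(Eout)

for N in (10, 20, 50, 100):
    Ein, Eout = average_errors(N)
    print(N, round(Ein, 4), round(sigma**2 * (1 - (d + 1) / N), 4),
             round(Eout, 4), round(sigma**2 * (1 + (d + 1) / N), 4))
```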
Linear regression can also be used for classification. Binary labels are real values too: y = f(x) = ±1 ∈ R. So we can run linear regression on the ±1 targets to find w with w^T x_n ≈ y_n = ±1, and then classify using sign(w^T x_n).
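A short sketch on synthetic ±1 data (an illustrative setup, not from the slides): regress on the ±1 labels, then classify with the sign of the fitted signal.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))         # labels y_n = ±1 (real numbers too)

w = np.linalg.lstsq(X, y, rcond=None)[0]            # fit w with w^T x_n ≈ y_n = ±1
pred = np.sign(X @ w)                               # classify with sign(w^T x_n)
print("training accuracy:", np.mean(pred == y))
```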
Logistic Regression
The three linear models all start from the same signal

s = Σ_{i=0}^{d} w_i x_i = w^T x,

computed from the inputs x_0, x_1, ..., x_d. They differ in how the signal is turned into the output: linear classification uses h(x) = sign(s), linear regression uses h(x) = s, and logistic regression uses h(x) = θ(s).
The logistic function θ

θ(s) = e^s / (1 + e^s).

It is a smooth, monotonically increasing "soft threshold" that maps the real line to (0, 1) and crosses 1/2 at s = 0.
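A minimal implementation of θ; computing e^s/(1+e^s) directly overflows for large positive s, so the sketch below uses a standard numerically stable rearrangement (an implementation detail, not something discussed on the slide).

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = e^s / (1 + e^s), computed without overflow."""
    s = np.asarray(s, dtype=float)
    e = np.exp(-np.abs(s))                           # always in (0, 1]
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(theta(np.array([-5.0, 0.0, 5.0])))             # ≈ [0.0067, 0.5, 0.9933]
```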
The logistic regression hypothesis applies θ to the linear signal:

h(x) = θ(s), with s = w^T x,

so h(x) takes values in (0, 1) and can be interpreted as a probability.
The target is a noisy, probabilistic one: for a data point (x, y), the binary label y is generated according to

P(y | x) = f(x) if y = +1;  1 − f(x) if y = −1,

where f : R^d → [0, 1]. Logistic regression learns a hypothesis

g(x) = θ(w^T x) ≈ f(x).

The data does not give us the value of f explicitly. It gives us samples generated by this probability. How do we learn from such data?
To learn from such data, we ask how likely it is that the observed labels y would be generated from the inputs x if a candidate hypothesis h played the role of the target f, i.e. if h = f. For a single point (x, y), this likelihood is

P(y | x) = h(x) if y = +1;  1 − h(x) if y = −1.
Formula for likelihood

Since the data points are assumed to be (conditionally) independently generated, the probability of observing all the y_n's in the data set from the corresponding x_n's is given by

P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N)   (15)
= Π_{n=1}^{N} P(y_n | x_1, x_2, ..., x_N)   (16)
= Π_{n=1}^{N} P(y_n | x_n),   (17)

where

P(y | x) = h(x) for y = +1;  1 − h(x) for y = −1.

The method of maximum likelihood selects the hypothesis h which maximizes this probability.
From likelihood to Ein

Maximize Π_{n=1}^{N} P(y_n | x_n)
≡ Maximize ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ Maximize (1/N) ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ Minimize −(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) ).

Furthermore, we can write

−(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) )
= (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) )
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / h(x_n)) + 1{y_n = −1} ln(1 / (1 − h(x_n))) ].
From likelihood to Ein

We have h(x) = θ(w^T x), where θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}) satisfies θ(−s) = 1 − θ(s). So we can write

(1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / h(x_n)) + 1{y_n = −1} ln(1 / (1 − h(x_n))) ]
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / θ(w^T x_n)) + 1{y_n = −1} ln(1 / (1 − θ(w^T x_n))) ]
= (1/N) Σ_{n=1}^{N} [ 1{y_n = +1} ln(1 / θ(w^T x_n)) + 1{y_n = −1} ln(1 / θ(−w^T x_n)) ]
= (1/N) Σ_{n=1}^{N} ln(1 / θ(y_n w^T x_n))        (i.e. P(y_n | x_n) = θ(y_n w^T x_n))
= (1/N) Σ_{n=1}^{N} ln(1 + e^{−y_n w^T x_n}),

and the last expression is the in-sample error E_in(w) of logistic regression.
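Unlike the squared-error case, this E_in has no closed-form minimizer, so it is typically minimized iteratively. The sketch below runs plain gradient descent with an arbitrary fixed step size on synthetic data, using the gradient ∇E_in(w) = −(1/N) Σ_n y_n x_n θ(−y_n w^T x_n); it is an illustration, not the exact algorithm prescribed in the course.

```python
import numpy as np

def theta(s):
    e = np.exp(-np.abs(s))
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def Ein(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))      # (1/N) Σ ln(1 + exp(-y_n w^T x_n))

def grad_Ein(w, X, y):
    return -(X.T @ (y * theta(-y * (X @ w)))) / len(y)   # -(1/N) Σ y_n x_n θ(-y_n w^T x_n)

rng = np.random.default_rng(7)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.uniform(size=N) < theta(X @ w_true), 1.0, -1.0)   # y sampled from P(y|x)

w, eta = np.zeros(d + 1), 0.5            # arbitrary fixed learning rate
for _ in range(2000):
    w = w - eta * grad_Ein(w, X, y)
print(Ein(w, X, y), w)                   # w should be roughly aligned with w_true
```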
Cross-entropy

For two probability distributions with binary outcomes, {p, 1 − p} and {q, 1 − q}, the cross-entropy (from information theory) is

p log(1/q) + (1 − p) log(1/(1 − q)).

The in-sample error above corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = 1{y_n = +1} and q = h(x_n).
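A small numerical check of this correspondence (with arbitrary values): taking p = 1{y = +1} and q = h(x) = θ(w^T x), the cross-entropy equals the pointwise term ln(1 + e^{−y w^T x}) of E_in.

```python
import numpy as np

rng = np.random.default_rng(8)
for _ in range(100):
    s = 3.0 * rng.normal()                     # the signal s = w^T x
    y = rng.choice([-1.0, 1.0])
    q = 1.0 / (1.0 + np.exp(-s))               # q = h(x) = θ(s)
    p = 1.0 if y == 1.0 else 0.0               # p = 1{y = +1}
    ce = -np.log(q) if p == 1.0 else -np.log(1.0 - q)   # p ln(1/q) + (1-p) ln(1/(1-q))
    assert np.isclose(ce, np.log1p(np.exp(-y * s)))
```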
Comparing the two in-sample errors:

• Logistic regression: E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{−y_n w^T x_n}) ←− minimized iteratively (no closed-form solution).
• Linear regression: E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2 ←− minimized analytically (closed-form solution w = (X^T X)^{-1} X^T y).