Learning From Data: 5: The Linear Model
Linear Models
Linear Regression
Linear Classification
Logistic Regression
Non-Linear Transformations
Bibliography
[Diagram: the learning setup. An unknown probability distribution P(x) generates the training examples (x_1, y_1), ..., (x_n, y_n); the learning algorithm searches the hypothesis set H and outputs the final hypothesis g ≈ h.]
\[ \beta := X^t (X X^t)^{-1} y. \]
We will see that we can use this approach in a more general setting.
where the expected value is taken w.r.t. the joint probability distribution P(x, y).
Recall that
\[ E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \big( h(x_n) - y_n \big)^2. \]
The hypothesis h can be written as
\[ h_w(x) = \sum_{i=0}^{d} w_i x_i = w^t x, \qquad (1) \]
where we use the usual compact notation x_0 = 1 and x ∈ {1} × R^d for the offset.
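In Python, the compact notation and the in-sample error above are a few lines of numpy (a minimal sketch; the data and weights are made up for illustration):

import numpy as np

# Toy data: N = 4 points in R^2 (made-up values).
X_raw = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8], [0.2, 0.1]])
y = np.array([1.0, 0.2, -0.5, 0.3])

# Compact notation: prepend x_0 = 1 so the offset w_0 is part of w.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.1, 0.4, -0.2])   # some hypothesis h_w, eq. (1)
h = X @ w                        # h_w(x_n) = w^t x_n for all n at once

E_in = np.mean((h - y) ** 2)     # squared in-sample error
print(E_in)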
Definition 3
Let C ⊂ R^n be a non-empty convex set. A function f : C → R is called convex iff for all x, y ∈ C and any t ∈ (0, 1) we have
\[ f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y). \]
It is called strictly convex if the inequality is strict.
It is clear that a function is convex iff every restriction to a one-dimensional line
is convex. Thus, the following characterization of one-dimensional convex
functions is useful:
Lemma 4
Let
\[ R(x_1, x_2) := \frac{f(x_1) - f(x_2)}{x_1 - x_2}. \]
A function is convex iff R is monotonically non-decreasing in x_1 (and hence in x_2 too, as R is symmetric in its arguments). It is strictly convex iff R is monotonically increasing.
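For example, for f(x) = x² the difference quotient is
\[ R(x_1, x_2) = \frac{x_1^2 - x_2^2}{x_1 - x_2} = x_1 + x_2, \]
which is monotonically increasing in each argument; hence x² is strictly convex.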
Convex Functions Example
[Figure: the graph of x² with the chord from (−1, f(−1)) to (3, f(3)); the chord lies above the graph, illustrating convexity, with the interpolation point t(−1) + (1 − t)·3 on the x-axis.]
We set
\[ t = \frac{c - b}{c - a} \in (0, 1) \]
and verify that, indeed, b = ta + (1 − t)c. Then the last inequality holds since f is convex.
⇐ If eq. 2 holds, then f is convex, as we can set b = ta + (1 − t)c.
Strict convexity follows similarly.
Theorem 5
Let C ⊂ R^N be a non-empty, open and convex set and let f : C → R be a differentiable function. Then f is convex iff for any x*, x ∈ C
\[ f(x) \ge \nabla f(x^*) (x - x^*) + f(x^*). \qquad (3) \]
Proof.
⇒ If f is convex, then for any x*, x ∈ C, x ≠ x*, and any t ∈ (0, 1), f(tx + (1 − t)x*) ≤ t f(x) + (1 − t) f(x*). This yields
\[ f(x) - f(x^*) \ge \frac{f(x^* + t(x - x^*)) - f(x^*)}{t}. \]
Letting t → 0, we get eq. 3.
Proof.
⇐ Conversely, take any a, b ∈ C and any t ∈ (0, 1). We set x* = ta + (1 − t)b. Note that a − x* = −(1 − t)(b − a) and b − x* = t(b − a). Therefore, by inequality 3:
\[ f(a) \ge \nabla f(x^*)(a - x^*) + f(x^*), \qquad f(b) \ge \nabla f(x^*)(b - x^*) + f(x^*). \]
Multiplying the first by t > 0 and the second by 1 − t > 0, and adding (the gradient terms cancel, since t(a − x*) + (1 − t)(b − x*) = 0), yields
\[ t f(a) + (1 - t) f(b) \ge f(x^*). \]
Proof.
Let x* be such that ∇f(x*) = 0. From the theorem above it follows that for any x ∈ C we have f(x) ≥ f(x*).
Global Optima
It can be shown that every local minimum of a convex function is a global minimum (even if the function is not differentiable). This is very important, because it is easy to get stuck in local minima (a problem plaguing many ML approaches).
Lemma 7
Any norm is a convex function.
Proof.
Exercise!
Proof.
\[
\begin{aligned}
\Phi(t w_1 + (1 - t) w_2) &= \tfrac{1}{\sqrt{2}} \| (t w_1 + (1 - t) w_2)^t X - y \| \\
&= \tfrac{1}{\sqrt{2}} \| t w_1^t X + (1 - t) w_2^t X - t y - (1 - t) y \| \\
&= \tfrac{1}{\sqrt{2}} \big\| (t w_1^t X - t y) + \big((1 - t) w_2^t X - (1 - t) y\big) \big\| \\
&\le t\, \tfrac{1}{\sqrt{2}} \| w_1^t X - y \| + (1 - t)\, \tfrac{1}{\sqrt{2}} \| w_2^t X - y \| \\
&= t\, \Phi(w_1) + (1 - t)\, \Phi(w_2)
\end{aligned}
\]
Proof.
As Φ(w) is convex, we have Φ(tw_1 + (1 − t)w_2) ≤ tΦ(w_1) + (1 − t)Φ(w_2). Define Ψ(x) := x². Then Ψ is convex and non-decreasing on [0, ∞); hence E_in, as a non-decreasing convex function of the convex function Φ (up to a positive constant), is convex as well.
The optimal weights are given by
\[ w_{opt} = X^{\dagger} y. \qquad (4) \]
Proof.
We claim that the function w ↦ E_in(w) is convex. This follows easily from the properties of a norm and
\[ E_{in}(w) = \frac{1}{N} \| X w - y \|^2. \]
Now,
\[ \nabla_w E_{in}(w) = \frac{2}{N} \big( X^t X w - X^t y \big). \]
From the theory of convex differentiable functions we know that we only have to solve ∇_w E_in(w) = 0. This is equivalent to
\[ X^t X w = X^t y. \]
As the Moore–Penrose pseudo-inverse is defined by this equation, this completes the proof.
Comments
\[ X^{\dagger} = (X^t X)^{-1} X^t \]
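In Python this amounts to a few numpy calls (a minimal sketch on synthetic data; np.linalg.pinv computes the Moore–Penrose pseudo-inverse, and np.linalg.lstsq solves the same normal equations more stably):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x + noise, with x_0 = 1 prepended.
N = 100
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(N)

# w_opt = X^dagger y, eq. (4).
w_pinv = np.linalg.pinv(X) @ y

# Numerically preferable: solve X^t X w = X^t y via least squares
# instead of forming an explicit inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_pinv, w_lstsq)  # both close to [2, 3]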
Algorithm 1 Perceptron
1: while misclassified do
2: for any x [i] misclassified do
3: if y [i] = +1 then
4: w ← w + x [i]
5: b ← b + ||w ||2
6: else
7: w ← w − x [i]
8: b ← b − ||w ||2
9: end if
10: end for
11: end while
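A minimal Python sketch of the perceptron, using the common variant that absorbs the bias into w via x_0 = 1 as in eq. (1) (the explicit b-updates of Algorithm 1 are carried along inside w here):

import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron: X has x_0 = 1 prepended, labels y in {-1, +1}.

    Repeatedly picks a misclassified point and moves w toward it.
    Terminates with E_in = 0 only if the data is linearly separable;
    max_epochs guards against the non-separable case.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = np.sign(X @ w) != y
        if not misclassified.any():
            return w                      # E_in = 0 reached
        i = np.flatnonzero(misclassified)[0]
        w = w + y[i] * X[i]               # w <- w + y_i x_i
    return w                              # no convergence guarantee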
Problem: The Perceptron algorithm will find a solution with Ein = 0 iff the
data is linearly separable.
But if the data is not linearly separable it will not converge.
So, in the general case we seek a solution to
\[ \operatorname*{arg\,min}_{w \in \mathbb{R}^{d+1}} \; \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{\operatorname{sign}(w^t x_i) \ne y_i\}}. \]
Algorithm 2 Pocket Algorithm
1: Set the pocket weight vector ŵ to w0
2: for t = 0, . . . , T − 1 do
3: if any x [i] misclassified then
4: if y [i] = +1 then
5: w ← w + x [i]
6: b ← b + ||w ||2
7: else
8: w ← w − x [i]
9: b ← b − ||w ||2
10: end if
11: Evaluate E_in(w_t).
12: if w_t is better than the pocket weight vector ŵ then
13: ŵ ← w_t
14: end if
15: if Break Condition satisfied then
16: Break
17: end if
18: end if
19: end for
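The same loop with a pocket, in Python (a sketch; helper names are mine, and the bias is again absorbed into w): after each update, E_in is evaluated and the best weights seen so far are kept.

import numpy as np

def e_in_classification(w, X, y):
    """Fraction of misclassified points: the 0/1 in-sample error."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    w = np.zeros(X.shape[1])
    w_hat = w.copy()                          # pocket weight vector
    best = e_in_classification(w_hat, X, y)
    for _ in range(T):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w                          # separable: E_in = 0
        i = misclassified[0]
        w = w + y[i] * X[i]                   # perceptron update
        e = e_in_classification(w, X, y)
        if e < best:                          # keep the better hypothesis
            best, w_hat = e, w.copy()
    return w_hat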
Logistic Regression
Suppose we want to classify with a probability of getting it “right”.
Idea: Smooth the step function
[Figure: smoothing the step function. Left panel (“logit”): the logistic function exp(x)/(1 + exp(x)), increasing from 0 to 1. Right panel (“tanh”): tanh(x), increasing from −1 to 1.]
\[ P(y \mid x) = \theta(y\, w^t x) \]
Definition 12
The quantity
\[ E_{in} := \frac{1}{N} \sum_{i=1}^{N} \log\big( 1 + e^{-y_i w^t x_i} \big) \]
is called the cross entropy error.
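Evaluating this naively overflows for large |wᵗx|; a numerically safe Python sketch uses np.logaddexp, since log(1 + eˢ) = logaddexp(0, s):

import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_i log(1 + exp(-y_i w^t x_i)).

    X: (N, d+1) with x_0 = 1 prepended, y: labels in {-1, +1}.
    """
    s = -y * (X @ w)
    return np.mean(np.logaddexp(0.0, s))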
How to Compute Cross Entropy?
Practical Problem
How to compute the arg max?
Answer: No closed-form solution exists, so we have to rely on numerical iteration.
Algorithm 3 Gradient Descent
1: Fix threshold ε
2: Fix step-size η_0 (and scheme η_k)
3: Choose (random) start value x_0
4: while ‖x_{k+1} − x_k‖ ≥ ε do
5: x_{k+1} = x_k ± η_k ∇f(x_k) ▷ Choose "+" for max and "−" for min.
6: end while
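A direct Python translation of Algorithm 3 (a sketch; the gradient is supplied as a function, and the toy example at the end is made up):

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-8, max_iter=10_000):
    """Minimize f by x_{k+1} = x_k - eta * grad_f(x_k).

    Stops when the step ||x_{k+1} - x_k|| falls below the threshold eps.
    (For a maximum, flip the sign of the update.)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 3)^2 has gradient 2*(x - 3); minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))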
\[ \| \nabla f(x) - \nabla f(y) \| \le L\, \| x - y \| \qquad \forall\, x, y \in \mathbb{R}^n \]
Lemma 14
Let f be L-Lipschitz. Then
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\, \| x - y \|^2. \]
Proof.
Exercises!
Lemma 15
Let f be L-Lipschitz and f* := min_x f(x) > −∞. Then the gradient descent algorithm with fixed step size η such that η < 2/L will converge to a stationary point.
More Math (cont.)
Proof.
Let x_{k+1} = x_k − η∇f(x_k). Then using the lemma above we get
\[
\begin{aligned}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \| x_k - x_{k+1} \|^2 \\
&= f(x_k) - \eta \| \nabla f(x_k) \|^2 + \frac{\eta^2}{2} L \| \nabla f(x_k) \|^2 \\
&= f(x_k) - \eta \Big( 1 - \frac{\eta}{2} L \Big) \| \nabla f(x_k) \|^2 \\
\Rightarrow\ \| \nabla f(x_k) \|^2 &\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_k) - f(x_{k+1}) \big) \\
\Rightarrow\ \sum_{k=0}^{N} \| \nabla f(x_k) \|^2 &\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_0) - f(x_N) \big) \quad \text{(telescope trick)} \\
&\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_0) - f^* \big) \qquad \forall\, N.
\end{aligned}
\]
Lemma 16
The log likelihood of logistic regression,
\[ w \mapsto \frac{1}{N} \sum_{i=1}^{N} \log\big( \theta(y_i w^t x_i) \big), \]
is a concave L-Lipschitz function.
Proof.
Exercises!
Open question:
How to fix the step size in general?
Answer: We will see!
Logistic Regression – the Algorithm
Choose a value for η and then compute:
Algorithm 4 Logistic Regression
1: Initialize the weights w(0)
2: while Error bigger than Target Approximation Error do
3: Compute the gradient
\[ \nabla E_{in} = -\frac{1}{N} \sum_{i=1}^{N} \frac{y_i x_i}{1 + e^{y_i w^t x_i}} \]
4: Update the weights: w ← w − η ∇E_in
5: end while
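Putting Algorithm 4 together in Python (a minimal sketch; same conventions as before: x_0 = 1 prepended, labels in {−1, +1}, fixed step size η):

import numpy as np

def logistic_regression(X, y, eta=0.5, tol=1e-6, max_iter=10_000):
    """Gradient descent on the cross entropy error.

    X: (N, d+1) with x_0 = 1, y in {-1, +1}. Returns the weights w.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        s = y * (X @ w)
        # grad E_in = -(1/N) sum_i y_i x_i / (1 + exp(y_i w^t x_i))
        grad = -(y / (1 + np.exp(s))) @ X / len(y)
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w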
Consider one-dimensional data that is not linearly separable. If we transform it with the following map Φ : R → R², defined as Φ(x) := (x, x²), it becomes linearly separable:
[Figure: points on the real line lifted onto the parabola x²; in the (x, x²)-plane a straight line separates the two classes.]
General Idea
If we cannot linearly separate a problem in dimension d, we might be able to define a map Φ : R^d → R^{d′} with d′ > d such that in this dimension the problem becomes separable.
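A tiny numerical illustration of this idea with the map Φ(x) = (x, x²) from the example above (the data values are made up):

import numpy as np

# 1-D data: class +1 near the origin, class -1 further out.
# Not separable by any threshold on the real line.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Lift to R^2 with Phi(x) = (x, x^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space the horizontal line z_2 = 2 separates the classes:
print(np.all(np.sign(2 - Z[:, 1]) == y))  # True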
[Figure: two classes of points in the (x_1, x_2)-plane that are separable by a circle of radius r; under the map Φ the lifted point sets become separable by a plane.]
Lemma 17
Two sets of points R_i, i = 1, 2, are separable by a circle in two dimensions iff Φ(R_i) are separable by a plane in three dimensions, where Φ(x_1, x_2) := (x_1, x_2, x_1² + x_2²).
Proof.
“⇒” Let S := {(x_1, x_2) : (x_1 − a)² + (x_2 − b)² ≤ r²} be the circle containing all points of R_1 and having every point of R_2 outside. By definition (x_1 − a)² + (x_2 − b)² ≤ r² for all the points of R_1. Thus
\[ x_1^2 + x_2^2 - 2 a x_1 - 2 b x_2 + a^2 + b^2 - r^2 \le 0, \]
i.e. h(Φ(x)) ≤ 0 for the affine function h(z_1, z_2, z_3) := −2a z_1 − 2b z_2 + z_3 + a² + b² − r². Therefore, x ∈ S iff h(Φ(x)) ≤ 0. This proves that if the point set is separable by a circle, then the lifted point sets Φ(R_i) are separable by a plane.
“⇐” Conversely, let h(z_1, z_2, z_3) = a z_1 + b z_2 + c z_3 + d define a plane such that h is less than zero on all points in Φ(R_1) and bigger than zero on all points in Φ(R_2). In particular, for any z ∈ Φ(R_1), z = (x_1, x_2, x_1² + x_2²), we have a x_1 + b x_2 + c(x_1² + x_2²) + d ≤ 0. Define U(h) := {(x_1, x_2) | h(x_1, x_2, x_1² + x_2²) ≤ 0}; then we just have to show that U(h) is the region bounded by a circle, because R_1 ⊂ U(h) and R_2 ∩ U(h) = ∅. But completing the square in a x_1 + b x_2 + c(x_1² + x_2²) + d ≤ 0 (for c > 0; for c < 0 the inequality reverses and the roles of R_1 and R_2 swap) gives
\[ \Big( x_1 + \frac{a}{2c} \Big)^2 + \Big( x_2 + \frac{b}{2c} \Big)^2 \le \frac{a^2 + b^2}{4 c^2} - \frac{d}{c}, \]
which is exactly a disc. This completes the proof.
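A quick numerical check of Lemma 17 in Python (a sketch with synthetic data; names are mine): the lift Φ(x_1, x_2) = (x_1, x_2, x_1² + x_2²) turns the circle test into a linear test, using the plane h from the proof.

import numpy as np

rng = np.random.default_rng(1)

# Class R1: inside the circle of radius r = 1 around (a, b) = (0, 0);
# class R2: on a ring clearly outside it.
R1 = rng.uniform(-0.6, 0.6, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
R2 = np.column_stack([np.cos(angles), np.sin(angles)]) * rng.uniform(1.5, 2.5, (50, 1))

def lift(P):
    """Phi(x1, x2) = (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([P, (P ** 2).sum(axis=1)])

# From the proof: h(z) = -2a z1 - 2b z2 + z3 + (a^2 + b^2 - r^2);
# here a = b = 0 and r = 1, so h(z) = z3 - 1.
h = lambda Z: Z[:, 2] - 1.0

print(np.all(h(lift(R1)) <= 0), np.all(h(lift(R2)) > 0))  # True True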