
Learning From Data

5: The Linear Model


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Wissen durch Praxis stärkt (knowledge through practice strengthens)


Content

Linear Models

Linear Regression

Linear Classification

Logistic Regression

Non-Linear Transformations

Bibliography



Everything should be made as simple as possible, but not simpler
– Albert Einstein



Motivation: Why Linear Models?

Linear Models are

simple to begin with
capable of explaining many things
easy to understand (conceptually and mathematically)
a good, solid workhorse
a building block for more sophisticated models



The Context

Figure: Learning Model – an unknown target function f : X → Y together with an unknown input probability distribution P(x) generates the training examples (x_1, y_1), . . . , (x_N, y_N); a learning algorithm, guided by the error measure(s), selects the final hypothesis g ≈ f from the hypothesis set H. Source: [AMMIL12]

For Linear Models, we usually assume X = R^N, and Y = R for Linear Regression or Y = {−1, +1} for classification.
The Context (cont.)

We want to learn, i.e. obtain a small E_out. In general,

P[ |E_in(h) − E_out(h)| > ε ] ≤ 4 m_H(2N) e^(−ε²N/8),

thus we need to ensure

1. that m_H(2N) is polynomially bounded – equivalently, the VC dimension of H is finite (and low) – and N is big enough (we have enough data), and
2. that E_in is low.



Recap: Fitting Polynomials

Remember from lecture 2:


Lemma 1
Let (x_1, y_1), . . . , (x_N, y_N) ∈ R² be any N points with distinct x_i. Then there exists a polynomial p(x) = Σ_{i=0}^{M} β_i x^i of degree M such that p(x_i) = y_i for all i ∈ 1, . . . , N if M ≥ N − 1.

In the proof, we found a solution for the coefficients β_i as follows:

β := X^t (X X^t)^(−1) y.

We will see that we can use this approach in a more general setting.
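As an illustration (not part of the original slides), a minimal numpy sketch of this formula on a made-up set of points:

import numpy as np

# Toy data: N = 4 points with distinct x-values, degree M = N - 1 = 3.
x = np.array([-1.0, 0.0, 2.0, 3.0])
y = np.array([ 1.0, 0.5, 4.0, 9.0])
M = len(x) - 1

# X[i, j] = x_i^j, i.e. the Vandermonde matrix of the points.
X = np.vander(x, M + 1, increasing=True)

# beta = X^t (X X^t)^{-1} y, as in the proof of Lemma 1.
beta = X.T @ np.linalg.inv(X @ X.T) @ y

print(np.allclose(X @ beta, y))   # True: p(x_i) = y_i for all i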



The Algorithm
Linear regression minimizes the squared error between our hypothesis h(x) and y:

E_out(h) = E[ (h(x) − y)² ],

where the expected value is taken w.r.t. the joint probability distribution P(x, y).
Recall that

E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)².

The hypothesis h can be written as

h_w(x) = Σ_{i=0}^{d} w_i x_i = w^t x,   (1)

where we use the usual compact notation x_0 = 1 and x ∈ {1} × R^d for the offset.



The Algorithm (cont.)
Thus,

E_in(w) := E_in(h_w)
         = (1/N) Σ_{n=1}^{N} (w^t x_n − y_n)²
         = (1/N) ||X w − y||²   (using matrix notation – X is the matrix of data points)
         = (1/N) ( w^t X^t X w − 2 w^t X^t y + y^t y )

Linear Regression Algorithm
The linear regression algorithm is defined by a solution w_opt to the following optimization problem:

w_opt := arg min_{w ∈ R^(d+1)} E_in(w)



Digression: Convex Functions – Definition
Definition 2
A set C ⊂ R^n is called convex iff for all x, y ∈ C and any t ∈ (0, 1) we have tx + (1 − t)y ∈ C.

Definition 3
Let C ⊂ R^n be a non-empty convex set. A function f : C → R is called convex iff for all x, y ∈ C and any t ∈ (0, 1) we have f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y). It is called strictly convex if the inequality is strict.
It is clear that a function is convex iff every restriction to a one-dimensional line
is convex. Thus, the following characterization of one-dimensional convex
functions is useful:
Lemma 4
Let R(x_1, x_2) := (f(x_1) − f(x_2)) / (x_1 − x_2). A function is convex iff R is monotonically non-decreasing in x_1 (and hence in x_2, too, as R is symmetric with respect to its arguments). It is strictly convex iff R is monotonically increasing.
Convex Functions Example

Figure: the function f(x) = x² on (−4, 4), together with the chord from (−1, f(−1)) to (3, f(3)); at t·(−1) + (1 − t)·3 the chord lies above the graph, illustrating convexity.

Note that R(x_1, x_2) equals the slope of such a chord!


Proof of Lemma 4
Proof.
⇒ Take any a, b, c ∈ C, a < b < c. Then R is monotonically non-decreasing iff

(f(b) − f(a)) / (b − a) ≤ (f(c) − f(b)) / (c − b).   (2)

This is equivalent to the following chain of inequalities:

[f(b) − f(a)](c − b) ≤ [f(c) − f(b)](b − a),
f(b)(c − a) ≤ f(a)(c − b) + f(c)(b − a),
f(b) ≤ ((c − b)/(c − a)) f(a) + ((b − a)/(c − a)) f(c).

We set t = (c − b)/(c − a) ∈ (0, 1) and verify that, indeed, b = ta + (1 − t)c. Then the last inequality holds since f is convex.
⇐ If eq. 2 holds, then f is convex, as we can set b = ta + (1 − t)c.
Strict convexity follows similarly.



Convex Differentiable Functions

Theorem 5
Let C ⊂ R^N be a non-empty, open and convex set and let f : C → R be a differentiable function. Then f is convex iff for any x*, x ∈ C

f(x) ≥ ∇f(x*)(x − x*) + f(x*).   (3)

Proof.
⇒ If f is convex, then for any x*, x ∈ C, x ≠ x*, and any t ∈ (0, 1),
f(tx + (1 − t)x*) ≤ t f(x) + (1 − t) f(x*). This yields

f(x) − f(x*) ≥ ( f(x* + t(x − x*)) − f(x*) ) / t.

Letting t → 0, we get eq. 3.



Convex Differentiable Functions (cont.)

Proof.
⇐ Conversely, take any a, b ∈ C , any t ∈ (0, 1). We set x ∗ = ta + (1 − t)b.
Note that a − x ∗ = −(1 − t)(b − a) and b − x ∗ = t(b − a). Therefore, by
inequality 3:

f (a) ≥ ∇f (x ∗ )[−(1 − t)(b − a)] + f (x ∗ )


f (b) ≥ ∇f (x ∗ )[t(b − a)] + f (x ∗ )

Multiplying the first by t > 0 and the second by 1 − t > 0, and adding, yields
tf (a) + (1 − t)f (b) ≥ f (x ∗ ).



Convex Differentiable Functions (cont.)
Convex functions are important because for them local optima are global ones:
Theorem 6
If f is convex and differentiable and ∇f(x*) = 0, then f has a global minimum at x*.

Proof.
Let x* be such that ∇f(x*) = 0. From the theorem above, it follows that for any x ∈ C we have f(x) ≥ f(x*).

Global Optima
It can be shown that for a convex function every local minimum is a global minimum (even if the function is not differentiable). This is very important, because getting stuck in local minima plagues many ML approaches.



Norms are Convex Functions

Lemma 7
Any norm is a convex function.

Proof.
Exercise!



E_in is a Convex Function
Lemma 8
The function Φ(w) := (1/√2) ||X w − y|| is convex.

Proof.

Φ(t w_1 + (1 − t) w_2) = (1/√2) ||X (t w_1 + (1 − t) w_2) − y||
                       = (1/√2) ||t X w_1 + (1 − t) X w_2 − t y − (1 − t) y||
                       = (1/√2) ||t (X w_1 − y) + (1 − t) (X w_2 − y)||
                       ≤ t (1/√2) ||X w_1 − y|| + (1 − t) (1/√2) ||X w_2 − y||
                       = t Φ(w_1) + (1 − t) Φ(w_2)



E_in is a Convex Function (cont.)
Lemma 9
The function Φ′(w) := Φ²(w) = (1/2) ||X w − y||² is convex.

Proof.
As Φ(w) is convex, we have Φ(t w_1 + (1 − t) w_2) ≤ t Φ(w_1) + (1 − t) Φ(w_2).
Define Ψ(x) := x². Then Ψ is convex and non-decreasing on [0, ∞), which suffices since Φ ≥ 0. Hence:

Φ′(t w_1 + (1 − t) w_2) = Ψ( Φ(t w_1 + (1 − t) w_2) )
                        ≤ Ψ( t Φ(w_1) + (1 − t) Φ(w_2) )
                        ≤ t Ψ(Φ(w_1)) + (1 − t) Ψ(Φ(w_2))
                        = t Φ²(w_1) + (1 − t) Φ²(w_2)
                        = t Φ′(w_1) + (1 − t) Φ′(w_2)



Deriving Linear Regression Solution
Lemma 10
The solution to the linear regression problem is given by

w_opt = X† y,   (4)

where X† denotes the Moore-Penrose pseudo-inverse.

Proof.
We claim that the function w → E_in(w) is convex. This follows easily from the properties of a norm and E_in(w) = (1/N) ||X w − y||².
Now,

∇_w E_in(w) = (2/N) (X^t X w − X^t y).

From the theory of convex differentiable functions we know that we only have to solve ∇_w E_in(w) = 0. This is equivalent to

X^t X w = X^t y.

As the Moore-Penrose pseudo-inverse provides a solution to this equation, this completes the proof.
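As a hedged illustration (not from the slides), a short numpy sketch of eq. 4 on synthetic data; the data-generating line and the noise level are invented for the example:

import numpy as np

def linear_regression(X, y):
    # w_opt = X^+ y (eq. 4), via the Moore-Penrose pseudo-inverse.
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])          # x_0 = 1 convention, d = 1
y = 2.0 - 3.0 * x + 0.1 * rng.standard_normal(50)  # noisy linear target
w_opt = linear_regression(X, y)
E_in = np.mean((X @ w_opt - y) ** 2)               # in-sample squared error
print(w_opt, E_in)                                 # w_opt is close to (2, -3)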
Comments

If X^t X is invertible, the Moore-Penrose pseudo-inverse can be obtained from the inverse of X^t X, i.e.

X† = (X^t X)^(−1) X^t.

If X^t X is not invertible, a solution still exists but is not unique.

The crucial result is that we have found a globally(!) optimal solution and have an exact formula for it! (This is a rare exception in ML.)



Where is the Learning Taking Place?

In Linear Regression, where does the learning take place?

When we calculate w_opt we “learn” from the data d ∈ D, as E_in depends on the data set presented.
The fact that we get the solution (eq. 4) “immediately” is uncommon in ML, but should not confuse anybody – learning just means selecting the “best” hypothesis given the data available.



Generalization Ability of Linear Model

Although we have not proven the generalization bound for regression estimates, such results exist. In fact, the following theorem can be proven [AMMIL12]:
Theorem 11
For linear regression with d variables and N data points,

E_out(g) = E_in(g) + O(d/N)

holds.



What is Linear in Linear Regression?
If we look at the hypothesis h_w (see eq. 1), we can conclude that it is linear in w and in x.
A careful inspection of the proof reveals that the linearity in x was not used at all.
Thus we can easily extend the linear model to make use of any functional relationship, as long as the coefficients (weights) stay linear:

h_w(x) = Σ_{i=0}^{d} w_i f_i(x)

For example, we might use polynomials, i.e. f_i(x) = x^i for i = 0, . . . , d.
This is a powerful idea (and we will pursue it later, see section “Non-Linear Transformations”).

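A toy sketch (my own example, not from the slides) fitting a quadratic with the same pseudo-inverse formula – the model is non-linear in x but still linear in the weights:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=100)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.1 * rng.standard_normal(100)

# Features f_i(x) = x^i: non-linear in x, but the model stays linear in w.
Z = np.column_stack([np.ones_like(x), x, x**2])
w = np.linalg.pinv(Z) @ y
print(w)   # approximately [1.0, 0.5, -2.0]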


Linear Classification
Assume X = R^d and Y = {−1, +1}. We have

h(x) = sign(w^t x),

where w ∈ R^(d+1) (again using the compact notation x_0 = 1). We already know that the VC dimension of this problem is d + 1.
Thus, for learning, we only have to find a hypothesis such that E_in is low.
Recall the perceptron:

Algorithm 1 Perceptron
1: while misclassified do
2:   for any x[i] misclassified do
3:     if y[i] = +1 then
4:       w ← w + x[i]
5:       b ← b + ||w||²
6:     else
7:       w ← w − x[i]
8:       b ← b − ||w||²
9:     end if
10:   end for
11: end while
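For comparison, a sketch of the standard perceptron update in homogeneous notation (the column x_0 = 1 absorbs the bias into w); this common variant differs slightly from the explicit-bias bookkeeping of Algorithm 1:

import numpy as np

def pla(X, y, max_iter=1000):
    # Perceptron learning algorithm; X contains a leading column of ones.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                  # E_in = 0: the data was linearly separable
        i = misclassified[0]
        w = w + y[i] * X[i]           # rotate w towards the misclassified point
    return w                          # no convergence: data probably not separable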



Linear Classification (cont.)

Problem: The Perceptron algorithm will find a solution with E_in = 0 iff the data is linearly separable.
But if the data is not linearly separable, it will not converge.
So, in the general case we seek a solution to

arg min_{w ∈ R^(d+1)} (1/N) Σ_{i=1}^{N} 1_{sign(w^t x_i) ≠ y_i}

Problem: This is NP-hard [AMMIL12]! It is a hard combinatorial problem (analytical problems are often nicer, see below!)



Alternative: Pocket Algorithm
The pocket algorithm approximates the solution:

Algorithm 2 Pocket Algorithm
1: Set the pocket weight vector ŵ to w_0
2: for t = 0, . . . , T − 1 do
3:   if any x[i] misclassified then
4:     if y[i] = +1 then
5:       w ← w + x[i]
6:       b ← b + ||w||²
7:     else
8:       w ← w − x[i]
9:       b ← b − ||w||²
10:    end if
11:    Evaluate E_in[w_t].
12:    if w_t is better than the pocket weight vector ŵ then
13:      ŵ ← w_t
14:    end if
15:    if Break Condition satisfied then
16:      Break
17:    end if
18:  end if
19: end for
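A sketch of the pocket idea with the same homogeneous-notation updates as above (the error used to compare weights is the plain misclassification rate):

import numpy as np

def pocket(X, y, T=1000):
    # Run perceptron-style updates for T steps, but keep the best weights seen so far.
    w = np.zeros(X.shape[1])
    best_w = w.copy()
    best_err = np.mean(np.sign(X @ w) != y)          # E_in of the initial weights
    for _ in range(T):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w
        i = misclassified[0]
        w = w + y[i] * X[i]
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:                           # better than the pocket? keep it
            best_w, best_err = w.copy(), err
    return best_w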
Logistic Regression
Suppose we want to classify with a probability of getting it “right”.
Idea: Smooth the step function

Figure: the step function (jumping from 0 to 1 at x = 0) on the interval (−4, 4)

by a continuous, smooth (smoothing) function:


Figure: smooth S-shaped curves, labelled logit and tanh, on the interval (−4, 4)



Smoothing Functions
Commonly used are e^x / (1 + e^x) and tanh(x).

Figure: plots of exp(x)/(1 + exp(x)) (values in (0, 1)) and tanh(x) (values in (−1, 1)) on the interval (−4, 4)


Example: Sigmoid Function
The commonly used function

θ(s) := e^s / (1 + e^s) = 1 / (1 + e^(−s))

is called the sigmoid or soft threshold function. Applied to a linear signal it reads

θ(w^t x) = 1 / (1 + e^(−w^t x)).

The sigmoid function is similar to the step function as far as the behaviour for large positive and negative arguments is concerned.
For inputs of modest size it avoids the discontinuity at 0 by interpolating smoothly.
Hence – as smoothness implies continuity – small changes in the inputs produce small changes in the output.
As range(θ) = (0, 1), we can interpret its outputs as probabilities.
(The other function, tanh(x), has similar properties.)
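A small sketch of θ (my own helper, written so that large negative inputs do not overflow):

import numpy as np

def theta(s):
    # Sigmoid theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}).
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    pos = s >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))   # safe for large positive s
    exp_s = np.exp(s[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)          # safe for large negative s
    return out

print(theta(np.array([-4.0, 0.0, 4.0])))       # approx. [0.018, 0.5, 0.982]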
Logistic Regression
We try to estimate the best target distribution P(y|x) given the observed data D, i.e.

P(y|x) = f(x)        for y = +1,
P(y|x) = 1 − f(x)    for y = −1.

Thus, the target function f is a probability, and we can try as a hypothesis any function whose values can be interpreted as probabilities, for instance θ:

P(y|x) = h(x) = θ(w^t x)   for y = +1,
P(y|x) = 1 − θ(w^t x)      for y = −1.

Since 1 − θ(s) = θ(−s), this can be written compactly as

P(y|x) = θ(y w^t x)



Logistic Regression (cont.)
The probability of observing y_1, . . . , y_N in D given x_1, . . . , x_N is

P(y_1, . . . , y_N | x_1, . . . , x_N) = Π_{i=1}^{N} θ(y_i w^t x_i).

To maximize the likelihood, we select an h that maximizes this probability:

arg max_w log( Π_{i=1}^{N} θ(y_i w^t x_i) ) = arg max_w Σ_{i=1}^{N} log θ(y_i w^t x_i).

This is equivalent to minimizing

arg min_w −(1/N) log( Π_{i=1}^{N} θ(y_i w^t x_i) ) = arg min_w (1/N) Σ_{i=1}^{N} log( 1/θ(y_i w^t x_i) ).

Definition 12
Since 1/θ(s) = 1 + e^(−s), the quantity

E_in(w) := (1/N) Σ_{i=1}^{N} log( 1 + e^(−y_i w^t x_i) )

is called the cross entropy error.
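A one-line sketch of this error (my own naming; np.logaddexp(0, −s) computes log(1 + e^(−s)) without overflow):

import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_i log(1 + exp(-y_i w^t x_i)), with labels y_i in {-1, +1}.
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))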
How to Compute Cross Entropy?

If a sample is correctly classified, then y_i and w^t x_i have the same sign; hence the corresponding term is small (we exponentiate a negative number).
If a sample is incorrectly classified, then y_i and w^t x_i have different signs; hence the corresponding term is large (we exponentiate a positive number).

Practical Problem
How do we compute the arg max?

Answer: No closed-form solution exists – thus we have to rely on numerical iteration.



Digression: Gradient Descent (aka Steepest Descent)

Given any (differentiable) function f, how do we find maxima (or minima)?
Idea: We climb uphill (downhill) by following the direction of steepest ascent (descent):

Algorithm 3 Gradient Descent
1: Fix threshold ε
2: Fix step size η_0 (and scheme η_k)
3: Choose (random) start value x_0
4: while |x_{k+1} − x_k| ≥ ε do
5:   x_{k+1} = x_k ± η_k ∇f(x_k)    ▷ Choose "+" for max and "−" for min.
6: end while
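A generic sketch of Algorithm 3 with a fixed step size (the stopping rule and the example function are my own choices):

import numpy as np

def gradient_descent(grad, x0, eta=0.1, eps=1e-8, max_iter=100_000):
    # Minimize f via x_{k+1} = x_k - eta * grad(x_k); stop when the step gets tiny.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - eta * grad(x)      # use "+" instead of "-" to maximize
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 3)^2 is convex with gradient 2(x - 3); descent converges to 3.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0]))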



More Math
Definition 13
Let L be a positive number. A differentiable function f : R^n → R is called L-Lipschitz iff

||∇f(x) − ∇f(y)|| ≤ L ||x − y||   for all x, y ∈ R^n.

Lemma 14
Let f be L-Lipschitz. Then f(y) ≤ f(x) + <∇f(x), y − x> + (L/2) ||x − y||².

Proof.
Exercises!

Lemma 15
Let f be L-Lipschitz and f* := min_x f(x) > −∞. Then the gradient descent algorithm with fixed step size η such that η < 2/L will converge to a stationary point.
More Math (cont.)
Proof.
Let x_{k+1} = x_k − η ∇f(x_k). Then, using the lemma above, we get

f(x_{k+1}) ≤ f(x_k) + <∇f(x_k), x_{k+1} − x_k> + (L/2) ||x_k − x_{k+1}||²
          = f(x_k) − η ||∇f(x_k)||² + (η² L / 2) ||∇f(x_k)||²
          = f(x_k) − η (1 − (η/2) L) ||∇f(x_k)||²

⇒ ||∇f(x_k)||² ≤ ( f(x_k) − f(x_{k+1}) ) / ( η (1 − (η/2) L) )

⇒ Σ_{k=0}^{N} ||∇f(x_k)||² ≤ ( f(x_0) − f(x_N) ) / ( η (1 − (η/2) L) )   (telescope trick)
                           ≤ ( f(x_0) − f* ) / ( η (1 − (η/2) L) )   ∀N.

This is only possible for all N if lim_{k→∞} ∇f(x_k) = 0.
More Math (cont.)
Remember: if f is convex, any stationary point is a global minimum.

Gradient Descent always works for Convex Functions
Thus, for convex (concave) L-Lipschitz functions, gradient descent always works and yields the global minimum (maximum).

Lemma 16
The log-likelihood of logistic regression, w → (1/N) Σ_{i=1}^{N} log θ(y_i w^t x_i), is a concave L-Lipschitz function.

Proof.
Exercises!

Open question: How do we choose the step size in general?
Answer: We will see!
Logistic Regression – the Algorithm
Choose a value for η and then compute:

Algorithm 4 Logistic Regression
1: Initialize the weights w(0)
2: while Error bigger than Target Approximation Error do
3:   Compute the gradient

       ∇E_in(w_t) = −(1/N) Σ_{i=1}^{N} y_i x_i / (1 + e^(y_i w_t^t x_i))

4:   Update the weights: w_{t+1} = w_t − η ∇E_in(w_t)
5:   Compute the Error
6: end while
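A sketch of Algorithm 4 as batch gradient descent on the cross-entropy error (step size, stopping rule and names are my own choices):

import numpy as np

def logistic_regression(X, y, eta=0.5, eps=1e-6, max_iter=10_000):
    # X has a leading column of ones (x_0 = 1); y contains labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ w)
        # gradient of E_in: -(1/N) * sum_i y_i x_i / (1 + exp(y_i w^t x_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < eps:
            return w_new
        w = w_new
    return w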



Digression: Stochastic Gradient Descent

Computing the gradient can be expensive, in particular if there are many summands.
In Stochastic Gradient Descent we approximate the gradient by choosing a summand at random as a proxy.
For details we refer to [RM51].
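A sketch of the stochastic variant for the logistic-regression gradient (one randomly chosen example per update; epoch count and learning rate are arbitrary):

import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    # Stochastic gradient descent: update w using one example at a time.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):      # visit the points in random order
            g = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
            w = w - eta * g
    return w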



Lifting Non Separable Problems
Consider the following one-dimensional problem:

Figure: a set of labelled points on the real line between −4 and 4

It is clearly not linearly separable (why?).

However, if we apply the map Φ : R → R² defined as Φ(x) := (x, x²), it becomes linearly separable:

Figure: the same points lifted onto the parabola f(x) = x², where a straight line now separates the two classes


Lifting

General Idea
If we cannot linearly separate a problem in dimension d, we might be able to define a map Φ : R^d → R^(d′) with d′ > d such that in this higher dimension the problem becomes separable.

Note that, given any N points p_i ∈ R^d with binary labels y_i ∈ Y = {−1, +1}, we can define a map Φ to R^(d+1) by Φ(x) := (x, ±1) depending on whether y = +1 or y = −1. It maps any data set to a linearly separable set by just adding one dimension. This is, however, a really bad idea (why?)!

Warning
You must not use the expected output to define any preprocessing! Thus, any mapping Φ must be defined using x ∈ X only.


Coordinates Matter

Figure: a two-class data set plotted in the original coordinates (x_1, x_2) and, next to it, the same data in transformed coordinates (r, φ)


Solving the Separability Problem for Circles – Lifting into
Higher Dimensions
Let Φ : R² → R³ be defined as Φ(x_1, x_2) := (z_1, z_2, z_3) = (x_1, x_2, x_1² + x_2²).

Lemma 17
Two sets of points R_i, i = 1, 2, are separable by a circle in two dimensions iff the Φ(R_i) are separable by a plane in three dimensions.

Proof.
“⇒” Let S := {(x_1, x_2) : (x_1 − a)² + (x_2 − b)² ≤ r²} be a disk containing all points of R_1 and no point of R_2.
By definition, (x_1 − a)² + (x_2 − b)² ≤ r² for all points of R_1. Thus

−2a x_1 − 2b x_2 + x_1² + x_2² ≤ r² − a² − b²
h := −2a z_1 − 2b z_2 + z_3 − (r² − a² − b²) ≤ 0.

Therefore, x ∈ S iff h(Φ(x)) ≤ 0. This proves that if the point sets are separable by a circle, then the lifted point sets Φ(R_i) are separable by a plane.



Solving the Separability Problem for Circles – Lifting into
Higher Dimensions (cont.)
Proof.
“⇐” Assume that the Φ(R_i) are separable in R³. By definition there exist a, b, c, and d such that the plane

h := a z_1 + b z_2 + c z_3 + d = 0

has all points of Φ(R_1) on its negative side and all points of Φ(R_2) on its positive side.
In particular, for any z ∈ Φ(R_1), i.e. z = (x_1, x_2, x_1² + x_2²), we have a x_1 + b x_2 + c (x_1² + x_2²) + d ≤ 0.
Define U(h) := {(x_1, x_2) | h(x_1, x_2, x_1² + x_2²) ≤ 0}; then we just have to show that U(h) is a disk, because R_1 ⊂ U(h) and R_2 ∩ U(h) = ∅. But (assuming c > 0)

a x_1 + b x_2 + c (x_1² + x_2²) ≤ −d
⇒ (x_1² + (a/c) x_1) + (x_2² + (b/c) x_2) ≤ −d/c
⇒ (x_1 + a/(2c))² + (x_2 + b/(2c))² ≤ (a² + b²)/(4c²) − d/c,

and the last inequality describes a disk.



Solving the Separability Problem for Circles – Lifting into
Higher Dimensions (cont.)

Note that the lifting is by no means unique.
Often, one defines the following transformation Φ : R² → R³ as Φ(x_1, x_2) := (z_1, z_2, z_3) = (x_1², x_2², √2 x_1 x_2). This is inspired by the so-called kernel trick used for support vector machines (see later).
One can prove that this transformation also separates circles (and/or ellipses).


References I

[AMMIL12] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.

[RM51] H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Stat., vol. 22, pp. 400–407, 1951.
