Learning From Data: 5: The Linear Model
Linear Models
Linear Regression
Linear Classification
Logistic Regression
Non-Linear Transformations
Bibliography
[Diagram: the learning setup. An unknown probability distribution P(x) generates the training examples (x_1, y_1), ..., (x_n, y_n); the learning algorithm searches the hypothesis set H and outputs the final hypothesis g ≈ h.]
\[ \beta := X^t (X X^t)^{-1} y. \]
We will see that we can use this approach in a more general setting.
where the expected value is taken w.r.t. the joint probability distribution P(x, y).
Recall that
\[ E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \big( h(x_n) - y_n \big)^2. \]
The hypothesis h can be written as
\[ h_w(x) = \sum_{i=0}^{d} w_i x_i = w^t x, \qquad (1) \]
where we use the usual compact notation x_0 = 1 and x ∈ {1} × R^d for the offset.
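In Python, the compact notation and the in-sample error above are a few lines of numpy (a minimal sketch; the data and weights are made up for illustration):

import numpy as np

# Toy data: N = 4 points in R^2 (made-up values).
X_raw = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8], [0.2, 0.1]])
y = np.array([1.0, 0.2, -0.5, 0.3])

# Compact notation: prepend x_0 = 1 so the offset w_0 is part of w.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.1, 0.4, -0.2])   # some hypothesis h_w, eq. (1)
h = X @ w                        # h_w(x_n) = w^t x_n for all n at once

E_in = np.mean((h - y) ** 2)     # squared in-sample error
print(E_in)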
Definition 3
Let C ⊂ R^n be a non-empty convex set. A function f : C → R is called convex iff for all x, y ∈ C and any t ∈ (0, 1) we have
\[ f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y). \]
It is called strictly convex if the inequality is strict.
It is clear that a function is convex iff every restriction to a one-dimensional line
is convex. Thus, the following characterization of one-dimensional convex
functions is useful:
Lemma 4
Let
\[ R(x_1, x_2) := \frac{f(x_1) - f(x_2)}{x_1 - x_2}. \]
A function is convex iff R is monotonically non-decreasing in x_1 (and hence in x_2 too, as R is symmetric in its arguments). It is strictly convex iff R is monotonically increasing.
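For example, for f(x) = x² the difference quotient is
\[ R(x_1, x_2) = \frac{x_1^2 - x_2^2}{x_1 - x_2} = x_1 + x_2, \]
which is monotonically increasing in each argument; hence x² is strictly convex.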
Convex Functions Example
[Figure: the graph of x² with the chord from (−1, f(−1)) to (3, f(3)); the chord lies above the graph, illustrating convexity, with the interpolation point t(−1) + (1 − t)·3 on the x-axis.]
We set
\[ t = \frac{c - b}{c - a} \in (0, 1) \]
and verify that, indeed, b = ta + (1 − t)c. Then the last inequality holds since f is convex.
⇐ If eq. 2 holds, then f is convex, as we can set b = ta + (1 − t)c.
Strict convexity follows similarly.
Theorem 5
Let C ⊂ R^N be a non-empty, open and convex set and let f : C → R be a differentiable function. Then f is convex iff for any x*, x ∈ C
\[ f(x) \ge \nabla f(x^*) (x - x^*) + f(x^*). \qquad (3) \]
Proof.
⇒ If f is convex, then for any x*, x ∈ C, x ≠ x*, and any t ∈ (0, 1), f(tx + (1 − t)x*) ≤ t f(x) + (1 − t) f(x*). This yields
\[ f(x) - f(x^*) \ge \frac{f(x^* + t(x - x^*)) - f(x^*)}{t}. \]
Letting t → 0, we get eq. 3.
Proof.
⇐ Conversely, take any a, b ∈ C and any t ∈ (0, 1). We set x* = ta + (1 − t)b. Note that a − x* = −(1 − t)(b − a) and b − x* = t(b − a). Therefore, by inequality 3:
\[ f(a) \ge \nabla f(x^*)(a - x^*) + f(x^*), \qquad f(b) \ge \nabla f(x^*)(b - x^*) + f(x^*). \]
Multiplying the first by t > 0 and the second by 1 − t > 0, and adding (the gradient terms cancel, since t(a − x*) + (1 − t)(b − x*) = 0), yields
\[ t f(a) + (1 - t) f(b) \ge f(x^*). \]
Proof.
Let x* be such that ∇f(x*) = 0. From the theorem above it follows that for any x ∈ C we have f(x) ≥ f(x*).
Global Optima
It can be shown that every local minimum of a convex function is a global minimum (even if the function is not differentiable). This is very important, because it is easy to get stuck in local minima (a problem plaguing many ML approaches).
Lemma 7
Any norm is a convex function.
Proof.
Exercise!
Proof.
\[
\begin{aligned}
\Phi(t w_1 + (1 - t) w_2) &= \tfrac{1}{\sqrt{2}} \| (t w_1 + (1 - t) w_2)^t X - y \| \\
&= \tfrac{1}{\sqrt{2}} \| t w_1^t X + (1 - t) w_2^t X - t y - (1 - t) y \| \\
&= \tfrac{1}{\sqrt{2}} \big\| (t w_1^t X - t y) + \big((1 - t) w_2^t X - (1 - t) y\big) \big\| \\
&\le t\, \tfrac{1}{\sqrt{2}} \| w_1^t X - y \| + (1 - t)\, \tfrac{1}{\sqrt{2}} \| w_2^t X - y \| \\
&= t\, \Phi(w_1) + (1 - t)\, \Phi(w_2)
\end{aligned}
\]
Proof.
As Φ(w) is convex, we have Φ(tw_1 + (1 − t)w_2) ≤ tΦ(w_1) + (1 − t)Φ(w_2). Define Ψ(x) := x². Then Ψ is convex and non-decreasing on [0, ∞); hence E_in, as a non-decreasing convex function of the convex function Φ (up to a positive constant), is convex as well.
The optimal weights are given by
\[ w_{opt} = X^{\dagger} y. \qquad (4) \]
Proof.
We claim that the function w ↦ E_in(w) is convex. This follows easily from the properties of a norm and
\[ E_{in}(w) = \frac{1}{N} \| X w - y \|^2. \]
Now,
\[ \nabla_w E_{in}(w) = \frac{2}{N} \big( X^t X w - X^t y \big). \]
From the theory of convex differentiable functions we know that we only have to solve ∇_w E_in(w) = 0. This is equivalent to
\[ X^t X w = X^t y. \]
As the Moore–Penrose pseudo-inverse is defined by this equation, this completes the proof.
Comments
\[ X^{\dagger} = (X^t X)^{-1} X^t \]
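In Python this amounts to a few numpy calls (a minimal sketch on synthetic data; np.linalg.pinv computes the Moore–Penrose pseudo-inverse, and np.linalg.lstsq solves the same normal equations more stably):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x + noise, with x_0 = 1 prepended.
N = 100
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(N)

# w_opt = X^dagger y, eq. (4).
w_pinv = np.linalg.pinv(X) @ y

# Numerically preferable: solve X^t X w = X^t y via least squares
# instead of forming an explicit inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_pinv, w_lstsq)  # both close to [2, 3]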
Algorithm 1 Perceptron
1: while misclassified do
2: for any x [i] misclassified do
3: if y [i] = +1 then
4: w ← w + x [i]
5: b ← b + ||w ||2
6: else
7: w ← w − x [i]
8: b ← b − ||w ||2
9: end if
10: end for
11: end while
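A minimal Python sketch of the perceptron, using the common variant that absorbs the bias into w via x_0 = 1 as in eq. (1) (the explicit b-updates of Algorithm 1 are carried along inside w here):

import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron: X has x_0 = 1 prepended, labels y in {-1, +1}.

    Repeatedly picks a misclassified point and moves w toward it.
    Terminates with E_in = 0 only if the data is linearly separable;
    max_epochs guards against the non-separable case.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = np.sign(X @ w) != y
        if not misclassified.any():
            return w                      # E_in = 0 reached
        i = np.flatnonzero(misclassified)[0]
        w = w + y[i] * X[i]               # w <- w + y_i x_i
    return w                              # no convergence guarantee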
Problem: The Perceptron algorithm will find a solution with Ein = 0 iff the
data is linearly separable.
But if the data is not linearly separable it will not converge.
So, in the general case we seek a solution to
\[ \operatorname*{arg\,min}_{w \in \mathbb{R}^{d+1}} \; \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{\operatorname{sign}(w^t x_i) \ne y_i\}}. \]
Algorithm 2 Pocket Algorithm
1: Set the pocket weight vector ŵ to w0
2: for t = 0, . . . , T − 1 do
3: if any x [i] misclassified then
4: if y [i] = +1 then
5: w ← w + x [i]
6: b ← b + ||w ||2
7: else
8: w ← w − x [i]
9: b ← b − ||w ||2
10: end if
11: Evaluate E_in(w_t).
12: if w_t is better than the pocket weight vector ŵ then
13: ŵ ← w_t
14: end if
15: if Break Condition satisfied then
16: Break
17: end if
18: end if
19: end for
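The same loop with a pocket, in Python (a sketch; helper names are mine, and the bias is again absorbed into w): after each update, E_in is evaluated and the best weights seen so far are kept.

import numpy as np

def e_in_classification(w, X, y):
    """Fraction of misclassified points: the 0/1 in-sample error."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    w = np.zeros(X.shape[1])
    w_hat = w.copy()                          # pocket weight vector
    best = e_in_classification(w_hat, X, y)
    for _ in range(T):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w                          # separable: E_in = 0
        i = misclassified[0]
        w = w + y[i] * X[i]                   # perceptron update
        e = e_in_classification(w, X, y)
        if e < best:                          # keep the better hypothesis
            best, w_hat = e, w.copy()
    return w_hat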
Logistic Regression
Suppose we want to classify with a probability of getting it “right”.
Idea: Smooth the step function
[Figure: smoothing the step function. Left panel (“logit”): the logistic function exp(x)/(1 + exp(x)), increasing from 0 to 1. Right panel (“tanh”): tanh(x), increasing from −1 to 1.]
\[ P(y \mid x) = \theta(y\, w^t x) \]
Definition 12
The quantity
\[ E_{in} := \frac{1}{N} \sum_{i=1}^{N} \log\big( 1 + e^{-y_i w^t x_i} \big) \]
is called the cross entropy error.
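Evaluating this naively overflows for large |wᵗx|; a numerically safe Python sketch uses np.logaddexp, since log(1 + eˢ) = logaddexp(0, s):

import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_i log(1 + exp(-y_i w^t x_i)).

    X: (N, d+1) with x_0 = 1 prepended, y: labels in {-1, +1}.
    """
    s = -y * (X @ w)
    return np.mean(np.logaddexp(0.0, s))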
How to Compute Cross Entropy?
Practical Problem
How to compute the arg max?
Answer: No closed-form solution exists, so we have to rely on numerical iteration.
Algorithm 3 Gradient Descent
1: Fix threshold ε
2: Fix step-size η_0 (and scheme η_k)
3: Choose (random) start value x_0
4: while ‖x_{k+1} − x_k‖ ≥ ε do
5: x_{k+1} = x_k ± η_k ∇f(x_k) ▷ Choose "+" for max and "−" for min.
6: end while
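A direct Python translation of Algorithm 3 (a sketch; the gradient is supplied as a function, and the toy example at the end is made up):

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-8, max_iter=10_000):
    """Minimize f by x_{k+1} = x_k - eta * grad_f(x_k).

    Stops when the step ||x_{k+1} - x_k|| falls below the threshold eps.
    (For a maximum, flip the sign of the update.)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 3)^2 has gradient 2*(x - 3); minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))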
\[ \| \nabla f(x) - \nabla f(y) \| \le L\, \| x - y \| \qquad \forall\, x, y \in \mathbb{R}^n \]
Lemma 14
Let f be L-Lipschitz. Then
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\, \| x - y \|^2. \]
Proof.
Exercises!
Lemma 15
Let f be L-Lipschitz and f* := min_x f(x) > −∞. Then the gradient descent algorithm with fixed step size η such that η < 2/L will converge to a stationary point.
More Math (cont.)
Proof.
Let x_{k+1} = x_k − η∇f(x_k). Then using the lemma above we get
\[
\begin{aligned}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \| x_k - x_{k+1} \|^2 \\
&= f(x_k) - \eta \| \nabla f(x_k) \|^2 + \frac{\eta^2}{2} L \| \nabla f(x_k) \|^2 \\
&= f(x_k) - \eta \Big( 1 - \frac{\eta}{2} L \Big) \| \nabla f(x_k) \|^2 \\
\Rightarrow\ \| \nabla f(x_k) \|^2 &\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_k) - f(x_{k+1}) \big) \\
\Rightarrow\ \sum_{k=0}^{N} \| \nabla f(x_k) \|^2 &\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_0) - f(x_N) \big) \quad \text{(telescope trick)} \\
&\le \frac{1}{\eta \big( 1 - \frac{\eta}{2} L \big)} \big( f(x_0) - f^* \big) \qquad \forall\, N.
\end{aligned}
\]
Lemma 16
The log likelihood of logistic regression,
\[ w \mapsto \frac{1}{N} \sum_{i=1}^{N} \log\big( \theta(y_i w^t x_i) \big), \]
is a concave L-Lipschitz function.
Proof.
Exercises!
Open question:
How to fix the step size in general?
Answer: We will see!
Logistic Regression – the Algorithm
Choose a value for η and then compute:
Algorithm 4 Logistic Regression
1: Initialize the weights w(0)
2: while Error bigger than Target Approximation Error do
3: Compute the gradient
\[ \nabla E_{in} = -\frac{1}{N} \sum_{i=1}^{N} \frac{y_i x_i}{1 + e^{y_i w^t x_i}} \]
4: Update the weights: w ← w − η ∇E_in
5: end while
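Putting Algorithm 4 together in Python (a minimal sketch; same conventions as before: x_0 = 1 prepended, labels in {−1, +1}, fixed step size η):

import numpy as np

def logistic_regression(X, y, eta=0.5, tol=1e-6, max_iter=10_000):
    """Gradient descent on the cross entropy error.

    X: (N, d+1) with x_0 = 1, y in {-1, +1}. Returns the weights w.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        s = y * (X @ w)
        # grad E_in = -(1/N) sum_i y_i x_i / (1 + exp(y_i w^t x_i))
        grad = -(y / (1 + np.exp(s))) @ X / len(y)
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w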
Consider one-dimensional data that is not linearly separable. If we transform it with the following map Φ : R → R², defined as Φ(x) := (x, x²), it becomes linearly separable:
[Figure: points on the real line lifted onto the parabola x²; in the (x, x²)-plane a straight line separates the two classes.]
General Idea
If we cannot linearly separate a problem in dimension d, we might be able to define a map Φ : R^d → R^{d′} with d′ > d such that in this dimension the problem becomes separable.
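A tiny numerical illustration of this idea with the map Φ(x) = (x, x²) from the example above (the data values are made up):

import numpy as np

# 1-D data: class +1 near the origin, class -1 further out.
# Not separable by any threshold on the real line.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Lift to R^2 with Phi(x) = (x, x^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space the horizontal line z_2 = 2 separates the classes:
print(np.all(np.sign(2 - Z[:, 1]) == y))  # True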
[Figure: two classes of points in the (x_1, x_2)-plane that are separable by a circle of radius r; under the map Φ the lifted point sets become separable by a plane.]
Lemma 17
Two sets of points R_i, i = 1, 2, are separable by a circle in two dimensions iff Φ(R_i) are separable by a plane in three dimensions, where Φ(x_1, x_2) := (x_1, x_2, x_1² + x_2²).
Proof.
“⇒” Let S := {(x_1, x_2) : (x_1 − a)² + (x_2 − b)² ≤ r²} be the circle containing all points of R_1 and having every point of R_2 outside. By definition (x_1 − a)² + (x_2 − b)² ≤ r² for all the points of R_1. Thus
\[ x_1^2 + x_2^2 - 2 a x_1 - 2 b x_2 + a^2 + b^2 - r^2 \le 0, \]
i.e. h(Φ(x)) ≤ 0 for the affine function h(z_1, z_2, z_3) := −2a z_1 − 2b z_2 + z_3 + a² + b² − r². Therefore, x ∈ S iff h(Φ(x)) ≤ 0. This proves that if the point set is separable by a circle, then the lifted point sets Φ(R_i) are separable by a plane.
“⇐” Conversely, let h(z_1, z_2, z_3) = a z_1 + b z_2 + c z_3 + d define a plane such that h is less than zero on all points in Φ(R_1) and bigger than zero on all points in Φ(R_2). In particular, for any z ∈ Φ(R_1), z = (x_1, x_2, x_1² + x_2²), we have a x_1 + b x_2 + c(x_1² + x_2²) + d ≤ 0. Define U(h) := {(x_1, x_2) | h(x_1, x_2, x_1² + x_2²) ≤ 0}; then we just have to show that U(h) is the region bounded by a circle, because R_1 ⊂ U(h) and R_2 ∩ U(h) = ∅. But completing the square in a x_1 + b x_2 + c(x_1² + x_2²) + d ≤ 0 (for c > 0; for c < 0 the inequality reverses and the roles of R_1 and R_2 swap) gives
\[ \Big( x_1 + \frac{a}{2c} \Big)^2 + \Big( x_2 + \frac{b}{2c} \Big)^2 \le \frac{a^2 + b^2}{4 c^2} - \frac{d}{c}, \]
which is exactly a disc. This completes the proof.
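A quick numerical check of Lemma 17 in Python (a sketch with synthetic data; names are mine): the lift Φ(x_1, x_2) = (x_1, x_2, x_1² + x_2²) turns the circle test into a linear test, using the plane h from the proof.

import numpy as np

rng = np.random.default_rng(1)

# Class R1: inside the circle of radius r = 1 around (a, b) = (0, 0);
# class R2: on a ring clearly outside it.
R1 = rng.uniform(-0.6, 0.6, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
R2 = np.column_stack([np.cos(angles), np.sin(angles)]) * rng.uniform(1.5, 2.5, (50, 1))

def lift(P):
    """Phi(x1, x2) = (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([P, (P ** 2).sum(axis=1)])

# From the proof: h(z) = -2a z1 - 2b z2 + z3 + (a^2 + b^2 - r^2);
# here a = b = 0 and r = 1, so h(z) = z3 - 1.
h = lambda Z: Z[:, 2] - 1.0

print(np.all(h(lift(R1)) <= 0), np.all(h(lift(R2)) > 0))  # True True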