IE643 Lecture7 2020sep4 Moodle
IE 643
Lecture 7
September 4, 2020.
1 Recap
MLP for XOR
Description of MLP
Perceptron - Caveat
x1 x2 y = x1 ⊕ x2 ŷ = sign(w1 x1 + w2 x2 − θ)
0 0 -1 sign(−θ)
0 1 1 sign(w2 − θ)
1 0 1 sign(w1 − θ)
1 1 -1 sign(w1 + w2 − θ)
The perceptron is not suitable when the linear separability assumption fails.
Example: the classical XOR problem. Fitting the table above requires:
sign(−θ) = −1 =⇒ θ > 0
sign(w2 − θ) = 1 =⇒ w2 − θ ≥ 0
sign(w1 − θ) = 1 =⇒ w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1 =⇒ −w1 − w2 + θ > 0
Adding the second and third conditions gives w1 + w2 ≥ 2θ > θ (using θ > 0), which contradicts the fourth condition w1 + w2 < θ. Hence no choice of (w1, w2, θ) realizes XOR.
With hidden activations a11 = max{p x1 + q x2 + b1, 0} and a21 = max{r x1 + s x2 + b2, 0}:

x1  x2  a11                   a21                   ŷ                          y
0   0   max{b1, 0}            max{b2, 0}            sign(t a11 + u a21 + b3)   -1
0   1   max{q + b1, 0}        max{s + b2, 0}        sign(t a11 + u a21 + b3)   +1
1   0   max{p + b1, 0}        max{r + b2, 0}        sign(t a11 + u a21 + b3)   +1
1   1   max{p + q + b1, 0}    max{r + s + b2, 0}    sign(t a11 + u a21 + b3)   -1
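One concrete choice of the parameters p, q, r, s, t, u, b1, b2, b3 that realizes XOR can be checked in a few lines. These particular numeric values are illustrative only; many other choices work.

```python
# One concrete (illustrative) choice of parameters.
p, q, b1 = 1.0, 1.0, -0.5    # first hidden neuron:  a11 = max{p*x1 + q*x2 + b1, 0}
r, s, b2 = 1.0, 1.0, -1.5    # second hidden neuron: a21 = max{r*x1 + s*x2 + b2, 0}
t, u, b3 = 1.0, -4.0, -0.25  # output neuron: yhat = sign(t*a11 + u*a21 + b3)

def mlp_xor(x1, x2):
    a11 = max(p * x1 + q * x2 + b1, 0.0)  # ReLU activations
    a21 = max(r * x1 + s * x2 + b2, 0.0)
    return 1 if t * a11 + u * a21 + b3 > 0 else -1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), mlp_xor(x1, x2))  # -1, +1, +1, -1: XOR is realized
```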
Notable features:
- Multiple layers stacked together.
- The zero-th layer is usually called the input layer.
- The final layer is usually called the output layer.
- Intermediate layers are called hidden layers.
- Each neuron in the hidden and output layers is like a perceptron.
- However, unlike the perceptron, different activation functions are used.
Recall:
n_k^ℓ denotes the k-th neuron at the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.
We have at layer L1:
a^1 = [a_1^1 ; a_2^1] = φ([z_1^1 ; z_2^1]) = φ([w_11^1 x_1 + w_12^1 x_2 ; w_21^1 x_1 + w_22^1 x_2]) = φ(W^1 x)
We have at layer L2:
a^2 = [a_1^2 ; a_2^2] = φ([z_1^2 ; z_2^2]) = φ([w_11^2 a_1^1 + w_12^2 a_2^1 ; w_21^2 a_1^1 + w_22^2 a_2^1]) = φ(W^2 a^1)
At layer L3:
a^3 = [a_1^3] = φ([z_1^3]) = φ([w_11^3 a_1^2 + w_12^3 a_2^2]) = φ(W^3 a^2)
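The layer-by-layer computation above can be sketched directly. The weight values below are placeholders chosen only to illustrate the shapes (2x2, 2x2, 1x2); they are not values from the lecture.

```python
import numpy as np

def phi(z):
    """Sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weight matrices; only the shapes matter here.
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])
W2 = np.array([[0.1,  0.4],
               [-0.6, 0.7]])
W3 = np.array([[0.9, -0.2]])

def forward(x):
    a1 = phi(W1 @ x)   # a^1 = phi(W^1 x)
    a2 = phi(W2 @ a1)  # a^2 = phi(W^2 a^1)
    a3 = phi(W3 @ a2)  # a^3 = phi(W^3 a^2)
    return a3

x = np.array([1.0, 0.0])
print(forward(x))  # a single activation in (0, 1)
```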
Optimization perspective
Given training data D = {(x^i, y^i)}_{i=1}^S, solve

  min Σ_{i=1}^S e^i = Σ_{i=1}^S Err(y^i, ŷ^i) = Σ_{i=1}^S Err(y^i, MLP(x^i)).

For a single-neuron MLP with sigmoid activation,

  ŷ = σ(w_11^1 x_1 + w_12^1 x_2) = 1 / (1 + exp(−[w_11^1 x_1 + w_12^1 x_2])).
Let

x1   x2   y   ŷ = σ(w_11^1 x_1 + w_12^1 x_2)
-3   -3   1   σ(−3w_11^1 − 3w_12^1)
-2   -2   1   σ(−2w_11^1 − 2w_12^1)
 4    4   0   σ(4w_11^1 + 4w_12^1)
 2   -5   0   σ(2w_11^1 − 5w_12^1)

Assume: Err(y, ŷ) = (y − ŷ)^2.
Popularly called the squared error.
P. Balamurugan Deep Learning - Theory and Practice September 4, 2020. 30 / 67
MLP - Data Perspective
For the table above, the total error over the four samples is:

  E = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−[w_11^1 x_1^i + w_12^1 x_2^i])) )^2
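A quick sketch of evaluating E at candidate weights; the weight values passed in below are arbitrary, chosen only to show that the loss is computable.

```python
import math

# The four training samples from the table above.
data = [(-3, -3, 1), (-2, -2, 1), (4, 4, 0), (2, -5, 0)]

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def E(w11, w12):
    """Total squared error over the four samples."""
    return sum((y - sigma(w11 * x1 + w12 * x2)) ** 2 for x1, x2, y in data)

print(E(0.0, 0.0))   # every prediction is sigma(0) = 0.5, so E = 4 * 0.25 = 1.0
print(E(-1.0, -1.0)) # a different (still arbitrary) weight choice
```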
Optimization Concepts

  min_{x ∈ C} f(x)

General Assumption: C ⊆ R^d.
Neighborhood of z ∈ C: the set N_ε(z) = {x ∈ R^d : ‖x − z‖_2 < ε} for some ε > 0.
ℓ2 norm of a vector x:
  ‖x‖_2 = √(x^⊤ x) = (|x_1|^2 + |x_2|^2 + . . . + |x_d|^2)^{1/2}.
ℓ1 norm of a vector x:
  ‖x‖_1 = |x_1| + |x_2| + . . . + |x_d|.
Note: |x_1| denotes the absolute value of x_1.
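Both norms are easy to check numerically on a small example:

```python
import numpy as np

x = np.array([3.0, -4.0])
l2 = (x @ x) ** 0.5     # (|3|^2 + |-4|^2)^(1/2)
l1 = np.abs(x).sum()    # |3| + |-4|
print(l2, l1)           # 5.0 7.0
```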
  min_{x ∈ C} f(x), where C ⊆ R^d and f : C −→ R.
Directional derivative

  f′(x; d) = lim_{α↓0} [f(x + αd) − f(x)] / α

Interior of a set C
Let C ⊆ R^d. Then int(C) is defined by:
  int(C) = {x ∈ C : N_ε(x) ⊆ C for some ε > 0}.
Note: If all partial derivatives of f exist at x, then f′(x; d) = ⟨∇f(x), d⟩, where ∇f(x) = [∂f(x)/∂x_1 . . . ∂f(x)/∂x_d]^⊤.
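The note above can be verified numerically. The function f below is a made-up smooth example, not from the lecture; the difference quotient at a small α should agree with the inner-product formula.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2        # a smooth test function

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, 2.0])
d = np.array([1.0, -1.0])

alpha = 1e-6
numeric = (f(x + alpha * d) - f(x)) / alpha   # difference quotient at small alpha
analytic = grad_f(x) @ d                      # <grad f(x), d>
print(numeric, analytic)                      # both close to 2 - 12 = -10
```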
Descent Direction
A vector 0 ≠ d ∈ R^d is a descent direction of f at x if f′(x; d) < 0.

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let 0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such that ∀α ∈ (0, ε] we have f(x + αd) < f(x).

Proof idea: since f′(x; d) = lim_{α↓0} [f(x + αd) − f(x)]/α < 0, the difference quotient is negative for all sufficiently small α > 0, which gives f(x + αd) < f(x).
(GEN-OPT)  min_{x ∈ R^d} f(x), where f : R^d −→ R.
Algorithm to solve (GEN-OPT):
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - Find a descent direction d^k of f at x^k and α_k > 0 such that f(x^k + α_k d^k) < f(x^k).
  - x^{k+1} = x^k + α_k d^k.
  - Check for some stopping criterion and break from loop.
Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x* ∈ int(C) be a local optimum point of f. Let all partial derivatives of f exist at x*. Then ∇f(x*) = 0.

Proof idea: along each coordinate direction, the one-variable function t ↦ f(x* + t e_j) has a local optimum at t = 0, so each partial derivative ∂f(x*)/∂x_j vanishes.

(GEN-OPT)  min_{x ∈ R^d} f(x), where f : R^d −→ R.
Algorithm to solve (GEN-OPT):
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - Find a descent direction d^k of f at x^k and α_k > 0 such that f(x^k + α_k d^k) < f(x^k).
  - x^{k+1} = x^k + α_k d^k.
  - If ‖∇f(x^{k+1})‖_2 = 0, set x* = x^{k+1}, break from loop.
- Output x*.
Gradient Descent Algorithm to solve (GEN-OPT), where f : R^d −→ R:
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - d^k = −∇f(x^k).
  - α_k = argmin_{α>0} f(x^k + α d^k).
  - x^{k+1} = x^k + α_k d^k.
  - If ‖∇f(x^{k+1})‖_2 = 0, set x* = x^{k+1}, break from loop.
- Output x*.
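A runnable sketch of the algorithm. As a practical assumption, the exact line search argmin_α is replaced by backtracking (halving α until descent), and the exact stopping test ‖∇f‖_2 = 0 by a small tolerance; the test function at the bottom is a made-up quadratic.

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:              # approximate stopping test
            break
        d = -g                                    # steepest-descent direction
        alpha = 1.0                               # backtracking line search:
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5                          # halve until descent
        if f(x + alpha * d) >= f(x):              # no further progress possible
            break
        x = x + alpha * d
    return x

# Minimize f(x) = (x1 - 1)^2 + 2*x2^2, whose minimizer is (1, 0).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
print(gradient_descent(f, grad, np.array([5.0, -3.0])))  # -> [1. 0.]
```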
Gradient Descent Algorithm to solve the MLP Loss Minimization Problem, where E : R^2 −→ R:
- Start with w^0 ∈ R^2.
- For k = 0, 1, 2, . . .
  - d^k = −∇E(w^k).
  - α_k = argmin_{α>0} E(w^k + α d^k).
  - w^{k+1} = w^k + α_k d^k.
  - If ‖∇E(w^{k+1})‖_2 = 0, set w* = w^{k+1}, break from loop.
- Output w*.
  min_{w = (w_11^1, w_12^1)} E(w) = Σ_{i=1}^4 e^i(w) = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−[w_11^1 x_1^i + w_12^1 x_2^i])) )^2
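A minimal sketch of applying gradient descent to this two-parameter loss, assuming a small fixed step size (0.01) in place of the exact line search; the step size, iteration count, and starting point are illustrative choices, not values from the lecture.

```python
import math

data = [(-3, -3, 1), (-2, -2, 1), (4, 4, 0), (2, -5, 0)]

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def E(w):
    return sum((y - sigma(w[0] * x1 + w[1] * x2)) ** 2 for x1, x2, y in data)

def grad_E(w):
    """Gradient of E, using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    g = [0.0, 0.0]
    for x1, x2, y in data:
        s = sigma(w[0] * x1 + w[1] * x2)
        c = -2.0 * (y - s) * s * (1.0 - s)   # d e^i / d z
        g[0] += c * x1
        g[1] += c * x2
    return g

w = [0.0, 0.0]
for _ in range(5000):
    g = grad_E(w)
    w = [w[0] - 0.01 * g[0], w[1] - 0.01 * g[1]]  # fixed step instead of argmin
print(w, E(w))  # the loss drops well below its starting value E([0, 0]) = 1.0
```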
Acknowledgments:
CalcPlot3D website for plotting.