
Deep Learning - Theory and Practice

IE 643
Lecture 7

September 4, 2020.



Outline

1 Recap
   MLP for XOR
   Description of MLP

2 MLP - Data Perspective
   Optimization Concepts
   Gradient Descent


Recap MLP for XOR

Perceptron - Caveat

Not suitable when linear separability assumption fails


Example: Classical XOR problem

x1 x2 y = x1 ⊕ x2 ŷ = sign(w1 x1 + w2 x2 − θ)
0 0 -1 sign(−θ)
0 1 1 sign(w2 − θ)
1 0 1 sign(w1 − θ)
1 1 -1 sign(w1 + w2 − θ)



For ŷ to match y on all four inputs, the weights must satisfy:

sign(−θ) = −1 =⇒ θ > 0
sign(w2 − θ) = 1 =⇒ w2 − θ ≥ 0
sign(w1 − θ) = 1 =⇒ w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1 =⇒ −w1 − w2 + θ > 0

Note: This system is inconsistent: θ > 0, w1 ≥ θ and w2 ≥ θ together force w1 + w2 ≥ 2θ > θ, which contradicts the last condition w1 + w2 < θ.


Recall: We verified this using code for linear separability check.
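
The check below is a minimal sketch (assuming NumPy and SciPy; not the exact script used in class) that poses the four conditions as a linear feasibility problem. Since the conditions are invariant to positive scaling of (w1, w2, θ), the strict inequalities can be replaced by a margin of 1.

```python
# Feasibility check for the XOR constraints, written as A @ v <= b with v = (w1, w2, theta):
#   -theta <= -1            (theta > 0, rescaled to a margin of 1)
#   -(w2 - theta) <= 0      (w2 - theta >= 0)
#   -(w1 - theta) <= 0      (w1 - theta >= 0)
#   w1 + w2 - theta <= -1   (w1 + w2 - theta < 0, rescaled to a margin of 1)
import numpy as np
from scipy.optimize import linprog

A = np.array([[0.0, 0.0, -1.0],
              [0.0, -1.0, 1.0],
              [-1.0, 0.0, 1.0],
              [1.0, 1.0, -1.0]])
b = np.array([-1.0, 0.0, 0.0, -1.0])

# No objective (c = 0): we only ask whether any feasible point exists.
res = linprog(c=np.zeros(3), A_ub=A, b_ub=b, bounds=[(None, None)] * 3)
print("Feasible?", res.success)   # expected: False, i.e. no (w1, w2, theta) exists
```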
Recap MLP for XOR

Moving away from perceptron - Dealing with XOR problem




x1   x2   a_1^1                  a_2^1                  ŷ                              y
0    0    max{b1, 0}             max{b2, 0}             sign(t a_1^1 + u a_2^1 + b3)   -1
0    1    max{q + b1, 0}         max{s + b2, 0}         sign(t a_1^1 + u a_2^1 + b3)   +1
1    0    max{p + b1, 0}         max{r + b2, 0}         sign(t a_1^1 + u a_2^1 + b3)   +1
1    1    max{p + q + b1, 0}     max{r + s + b2, 0}     sign(t a_1^1 + u a_2^1 + b3)   -1
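
As an illustration, the following sketch plugs in one concrete choice of the parameters above, p = q = r = s = 1, b1 = 0, b2 = −1, t = 1, u = −2, b3 = −0.5 (an assumed solution used for illustration, not the only one), and checks that it reproduces the XOR labels.

```python
import numpy as np

def xor_mlp(x1, x2):
    a1 = max(1 * x1 + 1 * x2 + 0, 0)       # a_1^1 = max{p x1 + q x2 + b1, 0}
    a2 = max(1 * x1 + 1 * x2 - 1, 0)       # a_2^1 = max{r x1 + s x2 + b2, 0}
    return np.sign(1 * a1 - 2 * a2 - 0.5)  # ŷ = sign(t a_1^1 + u a_2^1 + b3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), int(xor_mlp(x1, x2)))  # expected: -1, +1, +1, -1
```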





Recap Description of MLP

Multi Layer Perceptron

Notable features:
Multiple layers stacked together.
Zero-th layer usually called input layer.
Final layer usually called output layer.
Intermediate layers are called hidden layers.

Each neuron in the hidden and output layers is like a perceptron.
However, unlike the perceptron, different activation functions are used.



Recap Description of MLP

Multi Layer Perceptron - More notations

Recall:
n_k^ℓ denotes the k-th neuron in the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.


w_ij^ℓ denotes the weight of the connection from neuron n_j^{ℓ−1} to neuron n_i^ℓ.



We have at layer L1:

    a^1 = (a_1^1, a_2^1)^T = φ((z_1^1, z_2^1)^T) = φ((w_11^1 x_1 + w_12^1 x_2, w_21^1 x_1 + w_22^1 x_2)^T) = φ(W^1 x)


At layer L2:

    a^2 = (a_1^2, a_2^2)^T = φ((z_1^2, z_2^2)^T) = φ((w_11^2 a_1^1 + w_12^2 a_2^1, w_21^2 a_1^1 + w_22^2 a_2^1)^T) = φ(W^2 a^1)


At layer L3:

    a^3 = (a_1^3) = φ((z_1^3)) = φ((w_11^3 a_1^2 + w_12^3 a_2^2)) = φ(W^3 a^2)


Putting the layers together:

    ŷ = a^3 = φ(W^3 a^2) = φ(W^3 φ(W^2 a^1)) = φ(W^3 φ(W^2 φ(W^1 x)))
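A minimal NumPy sketch of this 2-2-1 forward pass, assuming a sigmoid activation for φ and arbitrary placeholder weight values:

```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation, as an example choice of φ

W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])             # weights of layer 1 (2 x 2), placeholder values
W2 = np.array([[1.0, -1.0],
               [0.4,  0.6]])             # weights of layer 2 (2 x 2), placeholder values
W3 = np.array([[0.7, -0.2]])             # weights of layer 3 (1 x 2), placeholder values

def mlp(x):
    a1 = phi(W1 @ x)                     # a^1 = φ(W^1 x)
    a2 = phi(W2 @ a1)                    # a^2 = φ(W^2 a^1)
    a3 = phi(W3 @ a2)                    # a^3 = φ(W^3 a^2)
    return a3

x = np.array([1.0, 0.0])
print(mlp(x))                            # ŷ = φ(W^3 φ(W^2 φ(W^1 x)))
```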




MLP - Data Perspective

Multi Layer Perceptron - Data Perspective

Given data (x, y), the multi layer perceptron predicts:

    ŷ = φ(W^3 φ(W^2 φ(W^1 x))) =: MLP(x)

Similar to the perceptron, if y ≠ ŷ an error Err(y, ŷ) is incurred.

Aim: To change the weights W^1, W^2, W^3 such that the error Err(y, ŷ) is minimized.
This leads to an error minimization problem.



MLP - Data Perspective

Multi Layer Perceptron - Data Perspective

Input: Training data D = {(x^i, y^i)}_{i=1}^S, with x^i ∈ X ⊆ R^d and y^i ∈ Y.

For each sample x^i, the prediction is ŷ^i = MLP(x^i).
Error: e^i = Err(y^i, ŷ^i).
Aim: To minimize Σ_{i=1}^S e^i.





MLP - Data Perspective

Multi Layer Perceptron - Data Perspective

Optimization perspective
Given training data D = {(x^i, y^i)}_{i=1}^S,

    min Σ_{i=1}^S e^i = Σ_{i=1}^S Err(y^i, ŷ^i) = Σ_{i=1}^S Err(y^i, MLP(x^i))

Note: The minimization is over the weights W^1, . . . , W^L of the MLP, where L
denotes the number of layers in the MLP.
MLP - Data Perspective

MLP - Data Perspective: A Simple Example

Consider a single sigmoid neuron with weights w_11^1, w_12^1 applied to the input (x_1, x_2):

    ŷ = σ(w_11^1 x_1 + w_12^1 x_2) = 1 / (1 + exp(−(w_11^1 x_1 + w_12^1 x_2)))


MLP - Data Perspective

MLP - Data Perspective: A Simple Example

Let

    D = {(x^1 = (−3, −3), y^1 = 1),
         (x^2 = (−2, −2), y^2 = 1),
         (x^3 = (4, 4), y^3 = 0),
         (x^4 = (2, −5), y^4 = 0)}.


MLP - Data Perspective

MLP - Data Perspective: A Simple Example

x1   x2   y   ŷ = σ(w_11^1 x1 + w_12^1 x2)
-3   -3   1   σ(−3 w_11^1 − 3 w_12^1)
-2   -2   1   σ(−2 w_11^1 − 2 w_12^1)
 4    4   0   σ(4 w_11^1 + 4 w_12^1)
 2   -5   0   σ(2 w_11^1 − 5 w_12^1)


Assume: Err(y, ŷ) = (y − ŷ)^2.



Popularly called the squared error.

Total error (or loss):

    E = Σ_{i=1}^4 e^i = Σ_{i=1}^4 Err(y^i, ŷ^i)



In terms of the weights, this is

    E = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−(w_11^1 x_1^i + w_12^1 x_2^i))) )^2



Aim: To minimize the total error (or loss), which is

    min_{w_11^1, w_12^1} E = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−(w_11^1 x_1^i + w_12^1 x_2^i))) )^2
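
A small sketch (assuming NumPy; not part of the lecture) that evaluates this total error E(w) for the four training points at a couple of weight values:

```python
import numpy as np

X = np.array([[-3.0, -3.0],
              [-2.0, -2.0],
              [ 4.0,  4.0],
              [ 2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_error(w):
    y_hat = sigma(X @ w)             # predictions σ(w_11 x1 + w_12 x2) for all samples
    return np.sum((y - y_hat) ** 2)  # E(w) = Σ_i (y^i − ŷ^i)^2

print(total_error(np.array([0.0, 0.0])))    # at w = (0, 0): each error is 0.25, so E = 1.0
print(total_error(np.array([-1.0, -1.0])))  # a weight choice that fits the first three points well
```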


MLP - Data Perspective

MLP - Data Perspective: A Simple Example


Visualizing the loss surface:

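A plotting sketch, assuming NumPy and matplotlib (the lecture itself used the CalcPlot3D website), that draws E over a grid of (w_11^1, w_12^1) values:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[-3.0, -3.0], [-2.0, -2.0], [4.0, 4.0], [2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w1, w2 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))

# E(w) on the grid: sum over the four samples of (y^i - σ(w · x^i))^2
E = np.zeros_like(w1)
for xi, yi in zip(X, y):
    E += (yi - 1.0 / (1.0 + np.exp(-(w1 * xi[0] + w2 * xi[1])))) ** 2

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(w1, w2, E, cmap="viridis")
ax.set_xlabel("w_11"); ax.set_ylabel("w_12"); ax.set_zlabel("E(w)")
plt.show()
```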


MLP - Data Perspective Optimization Concepts

Optimization Concepts





MLP - Data Perspective Optimization Concepts

General Optimization Problem

    min_{x ∈ C} f(x)

f is called the objective function and C is called the feasible set.

Let f* = min_{x∈C} f(x) denote the optimal objective function value.
Optimal solution set: S* = {x ∈ C : f(x) = f*}.
Let us denote by x* an optimal solution in S*.




MLP - Data Perspective Optimization Concepts

General Optimization Problem

    min_{x ∈ C} f(x)                                     (OP)

Local Optimal Solution
A solution z to (OP) is called a local optimal solution if f(z) ≤ f(ẑ),
∀ ẑ ∈ N(z, ε), for some ε > 0.

Note: N(z, ε) denotes a suitable ε-neighborhood of z.

ε-Neighborhood of z ∈ C

    N(z, ε) = {u ∈ C : dist(z, u) ≤ ε}.



Global Optimal Solution
A solution z to (OP) is called a global optimal solution if f(z) ≤ f(ẑ), ∀ ẑ ∈ C.


MLP - Data Perspective Optimization Concepts

General Optimization Problem

    min_{x ∈ C} f(x)

General Assumption: C ⊆ R^d.


MLP - Data Perspective Optimization Concepts

High Dimensional Representation

Points x ∈ R^d, a Euclidean space of dimension d.


MLP - Data Perspective Optimization Concepts

High Dimensional Representation - Notations


Vector representation of x ∈ R^d:

    x = (x_1, x_2, . . . , x_d)^T (a column vector).

Transpose of vector x:

    x^T = (x_1  x_2  . . .  x_d).

ℓ2 norm of vector x:

    ||x||_2 = (x^T x)^{1/2} = (|x_1|^2 + |x_2|^2 + . . . + |x_d|^2)^{1/2}.

ℓ1 norm of vector x:

    ||x||_1 = |x_1| + |x_2| + . . . + |x_d|.

Note: |x_1| denotes the absolute value of x_1.

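A quick numerical check of the norm formulas above, assuming NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.sqrt(x @ x))          # ℓ2 norm via (x^T x)^{1/2}   -> 5.0
print(np.linalg.norm(x, 2))    # ℓ2 norm via NumPy            -> 5.0
print(np.sum(np.abs(x)))       # ℓ1 norm from the definition  -> 7.0
print(np.linalg.norm(x, 1))    # ℓ1 norm via NumPy            -> 7.0
```
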
MLP - Data Perspective Optimization Concepts

High Dimensional Representation - Notations

The zero vector has all of its components equal to zero: 0 = (0, 0, . . . , 0)^T.

A non-zero vector (x ≠ 0) has at least one non-zero component.

The one vector has all of its components equal to one: 1 = (1, 1, . . . , 1)^T.


MLP - Data Perspective Optimization Concepts

High Dimensional Representation - Notations

Gradient of a function f : R^d → R at a point x:

    ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, . . . , ∂f(x)/∂x_d )^T

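A finite-difference sketch, assuming NumPy, that approximates ∇f(x) component-wise for an example function (the function and step size are illustrative choices):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)   # ∂f(x)/∂x_i via a central difference
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1]                   # example: f(x) = x_1^2 + 3 x_2
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ≈ [2., 3.]
```
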


MLP - Data Perspective Optimization Concepts

General Optimization Problem

    min_{x ∈ C} f(x)

where C ⊆ R^d and f : C → R.


MLP - Data Perspective Optimization Concepts

Directional derivative

Let f : C → R be a function defined over C ⊆ R^d. Let x ∈ int(C). Let
d ∈ R^d, d ≠ 0. If the limit

    lim_{α↓0} [ f(x + αd) − f(x) ] / α

exists, then it is called the directional derivative of f at x along the
direction d, and is denoted by f'(x; d).


MLP - Data Perspective Optimization Concepts

Directional derivative

Interior of a set C
Let C ⊆ R^d. Then int(C) is defined by:

    int(C) = {x ∈ C : B(x, ε) ⊆ C, for some ε > 0},

where B(x, ε) is the open ball centered at x with radius ε, given by

    B(x, ε) = {y ∈ C : ||x − y|| < ε}.





Note: If all partial derivatives of f exist at x, then f'(x; d) = ⟨∇f(x), d⟩,
where ∇f(x) = ( ∂f(x)/∂x_1, . . . , ∂f(x)/∂x_d )^T.
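
A numerical sketch, assuming NumPy, checking f'(x; d) ≈ ⟨∇f(x), d⟩ on an example function (the function, point and direction are illustrative choices):

```python
import numpy as np

f = lambda x: x[0] ** 2 + np.sin(x[1])                   # example f : R^2 -> R
grad_f = lambda x: np.array([2 * x[0], np.cos(x[1])])    # its gradient, computed by hand

x = np.array([1.0, 0.5])
d = np.array([0.3, -0.7])

alpha = 1e-6
dir_deriv = (f(x + alpha * d) - f(x)) / alpha            # difference quotient for small α
print(dir_deriv, grad_f(x) @ d)                          # the two numbers should nearly agree
```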


MLP - Data Perspective Optimization Concepts

Descent Direction

Let f : R^d → R be a continuously differentiable function over R^d. Then
a vector d ∈ R^d, d ≠ 0, is called a descent direction of f at x if the
directional derivative of f at x along d is negative; that is,

    f'(x; d) = ⟨∇f(x), d⟩ < 0.



Note: A natural candidate for a descent direction is d = −∇f (x).



MLP - Data Perspective Optimization Concepts

Descent Direction

Proposition
Let f : R^d → R be a continuously differentiable function over R^d. Let
d ∈ R^d, d ≠ 0, be a descent direction of f at x. Then there exists ε > 0 such
that ∀ α ∈ (0, ε] we have

    f(x + αd) < f(x).





Proof idea: Since f'(x; d) = lim_{α↓0} [f(x + αd) − f(x)]/α < 0, the difference quotient [f(x + αd) − f(x)]/α is negative for all sufficiently small α > 0, and hence f(x + αd) < f(x) for those α.



MLP - Data Perspective Optimization Concepts

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x ∈ R^d} f(x)                                  (GEN-OPT)

where f : R^d → R.

Algorithm to solve (GEN-OPT)
  Start with x^0 ∈ R^d.
  For k = 0, 1, 2, . . .
    - Find a descent direction d^k of f at x^k and α_k > 0 such that
      f(x^k + α_k d^k) < f(x^k).
    - Set x^{k+1} = x^k + α_k d^k.
    - Check for some stopping criterion and break from the loop.


MLP - Data Perspective Optimization Concepts

Characterization Of Local Optimum

Proposition
Let f : C → R be a function over the set C ⊆ R^d. Let x* ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x*. Then
∇f(x*) = 0.



Proof idea: Suppose x* is a local minimum and ∇f(x*) ≠ 0. Then d = −∇f(x*) is a descent direction at x*, so by the previous proposition points arbitrarily close to x* have strictly smaller function value, contradicting local optimality. (For a local maximum, apply the same argument to −f.)



MLP - Data Perspective Optimization Concepts

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x ∈ R^d} f(x)                                  (GEN-OPT)

where f : R^d → R.

Algorithm to solve (GEN-OPT)
  Start with x^0 ∈ R^d.
  For k = 0, 1, 2, . . .
    - Find a descent direction d^k of f at x^k and α_k > 0 such that
      f(x^k + α_k d^k) < f(x^k).
    - Set x^{k+1} = x^k + α_k d^k.
    - If ||∇f(x^{k+1})||_2 = 0, set x* = x^{k+1} and break from the loop.
  Output x*.



Homework: Compare the structure of this algorithm with the Perceptron training
algorithm and try to understand the perceptron update rule from an optimization
perspective.


MLP - Data Perspective Gradient Descent

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x ∈ R^d} f(x)                                  (GEN-OPT)

where f : R^d → R.

Gradient Descent Algorithm to solve (GEN-OPT)
  Start with x^0 ∈ R^d.
  For k = 0, 1, 2, . . .
    - d^k = −∇f(x^k).
    - α_k = argmin_{α>0} f(x^k + α d^k).
    - x^{k+1} = x^k + α_k d^k.
    - If ||∇f(x^{k+1})||_2 = 0, set x* = x^{k+1} and break from the loop.
  Output x*.
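
A sketch of this gradient descent algorithm, assuming NumPy, on a simple quadratic f(x) = (1/2) x^T A x − b^T x. For such a quadratic the exact line search α_k = argmin_{α>0} f(x^k + α d^k) has the closed form α_k = (g^T g)/(g^T A g) with g = ∇f(x^k), and a small tolerance stands in for the exact condition ||∇f(x^{k+1})||_2 = 0.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_f(x):
    return A @ x - b                 # ∇f(x) = A x − b

x = np.zeros(2)                      # x^0
for k in range(100):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-10:    # stopping criterion ||∇f(x^k)||_2 ≈ 0
        break
    d = -g                           # descent direction d^k = −∇f(x^k)
    alpha = (g @ g) / (g @ (A @ g))  # exact line search step for a quadratic
    x = x + alpha * d                # x^{k+1} = x^k + α_k d^k

print(x, np.linalg.solve(A, b))      # the iterate should match the exact minimizer A^{-1} b
```
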
MLP - Data Perspective Gradient Descent

Gradient Descent for our MLP Problem

Recall: For the MLP example, the loss minimization problem is:

    min_{w = (w_11^1, w_12^1)} E(w) = Σ_{i=1}^4 e^i(w) = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−(w_11^1 x_1^i + w_12^1 x_2^i))) )^2

where E : R^2 → R.

Gradient Descent Algorithm to solve the MLP loss minimization problem
  Start with w^0 ∈ R^2.
  For k = 0, 1, 2, . . .
    - d^k = −∇E(w^k) = −Σ_{i=1}^4 ∇e^i(w^k).
    - α_k = argmin_{α>0} E(w^k + α d^k).
    - w^{k+1} = w^k + α_k d^k.
    - If ||∇E(w^{k+1})||_2 = 0, set w* = w^{k+1} and break from the loop.
  Output w*.
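A sketch, assuming NumPy, of this gradient descent loop on the two-weight example. The exact line search is replaced here by a fixed step size (a practical stand-in, not what the slide prescribes), and a small tolerance replaces the exact condition ||∇E(w^{k+1})||_2 = 0.

```python
import numpy as np

X = np.array([[-3.0, -3.0], [-2.0, -2.0], [4.0, 4.0], [2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_E(w):
    # ∇E(w) = Σ_i ∇e^i(w), with ∇e^i(w) = −2 (y^i − ŷ^i) ŷ^i (1 − ŷ^i) x^i
    y_hat = sigma(X @ w)
    coeff = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)
    return X.T @ coeff

w = np.zeros(2)                      # w^0
alpha = 0.1                          # fixed step size (stands in for the exact line search)
for k in range(5000):
    g = grad_E(w)
    if np.linalg.norm(g) < 1e-8:     # stopping criterion ||∇E(w^{k+1})||_2 ≈ 0
        break
    w = w - alpha * g                # w^{k+1} = w^k + α_k d^k with d^k = −∇E(w^k)

print(w, sigma(X @ w))               # learned weights and the resulting predictions ŷ^i
```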


MLP - Data Perspective Gradient Descent

Gradient Descent Introduced By:


Cauchy, A.: Méthode générale pour la résolution des systèmes
d'équations simultanées. Comptes rendus des séances de l'Académie
des sciences de Paris 25, 536–538, 1847.

Acknowledgments:
CalcPlot3D website for plotting.

