Deep Learning - Theory and Practice

IE 643
Lecture 10

September 25, 2020.

Outline

1 Recap

2 Popular examples of MLP



Recap

Multi Layer Perceptron for Classification Tasks

We have been looking at the multi-layer perceptron (MLP) for a variety of classification tasks:
- Binary classification
- Multi-class classification
- Multi-label classification

Popular examples of MLP

Multi Layer Perceptron: Encoder-Decoder Architecture

We consider a simple MLP architecture with:
- An input layer with n neurons
- A hidden layer with p neurons, where p ≤ n
- An output layer with n neurons
All neurons in the hidden and output layers have linear activations.

This architecture is popularly called the Encoder-Decoder Architecture.


Let B denote the p × n matrix connecting the input layer and the hidden layer.
Let A denote the n × p matrix connecting the hidden layer and the output layer.


Input: Training Data $D = \{(x^i, y^i)\}_{i=1}^{S}$, $x^i \in \mathbb{R}^n$, $y^i \in \mathbb{R}^n$, $\forall i \in \{1, \dots, S\}$.

MLP Parameters: Weight matrices B and A.

Error: $E(B, A) = \sum_{i=1}^{S} \|ABx^i - y^i\|_2^2$.
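
As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this architecture and its error; the dimensions follow the definitions above, and the random data is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 4, 2, 10                 # input/output width n, hidden width p <= n, S samples

B = rng.standard_normal((p, n))    # encoder weights: input layer -> hidden layer
A = rng.standard_normal((n, p))    # decoder weights: hidden layer -> output layer
X = rng.standard_normal((n, S))    # columns are the inputs x^i
Y = rng.standard_normal((n, S))    # columns are the targets y^i

# Linear activations everywhere: the network output for x^i is simply A B x^i.
E = sum(np.sum((A @ B @ X[:, i] - Y[:, i]) ** 2) for i in range(S))
print(E)                           # E(B, A) = sum_i ||A B x^i - y^i||_2^2
```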

Special case: When $y^i = x^i$, $\forall i$, the architecture is called an Autoencoder.

Error for the autoencoder: $E(B, A) = \sum_{i=1}^{S} \|ABx^i - x^i\|_2^2$.

Example: Consider the case n = p = 1.

Error: $E(b, a) = \sum_{i=1}^{S} (abx^i - y^i)^2$.


Sample Training Data:

      x^i        y^i
   0.708333   0.708333
   0.583333   0.583333
   0.166667   0.166667
   0.458333   0.458333
   0.875      0.875

Note that $y^i = x^i$ here, so this is autoencoder data.


Loss surface for the sample training data (figure not shown):

Observation: Though there are two valleys in the loss surface, they look symmetric.
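
A short sketch (assuming matplotlib is available; grid limits chosen arbitrarily) that reproduces this surface for the sample training data above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.708333, 0.583333, 0.166667, 0.458333, 0.875])
y = x.copy()                              # y^i = x^i for this sample data

b, a = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
E = sum((a * b * xi - yi) ** 2 for xi, yi in zip(x, y))   # E(b, a) on the grid

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(b, a, E, cmap="viridis")
ax.set_xlabel("b"); ax.set_ylabel("a"); ax.set_zlabel("E(b, a)")
plt.show()                                # two symmetric valleys along the curve ab = 1
```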

Question: Does this extend to cases where $y^i \neq x^i$ and for $n \geq 2$, $n \geq p$?
However, when we fix a, E(b, a) becomes a quadratic function of b alone, with a single valley (figures not shown).


Similarly, when we fix b, E(b, a) becomes a quadratic function of a alone, again with a single valley (figures not shown).


Error: $E(b, a) = \sum_{i=1}^{S} (abx^i - y^i)^2$.

When we fix a or when we fix b, we observe that the graph does not contain multiple valleys.

Thus when we fix a or when we fix b, the graph looks as if every local optimum is a global optimum.

Question: Does this behavior extend to higher dimensions?

We shall investigate this in higher dimensions!
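
A quick numerical illustration (same sample data as above) of the single-valley behavior: with a fixed, E(b, a) is a one-dimensional quadratic in b whose unique minimizer follows from setting dE/db = 0:

```python
import numpy as np

x = np.array([0.708333, 0.583333, 0.166667, 0.458333, 0.875])
y = x.copy()

a = 0.5                                 # fix a; E(b) = sum_i (a b x^i - y^i)^2
b_star = (x @ y) / (a * (x @ x))        # unique minimizer from dE/db = 0
print(b_star, a * b_star)               # a * b_star = 1 here, since y^i = x^i
```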



Some notations:

Let $X = [x^1 \; x^2 \; \dots \; x^S]$ and $Y = [y^1 \; y^2 \; \dots \; y^S]$, with the data points stacked as columns.

Note: X and Y are n × S matrices.


For an $n \times q$ matrix $H = \begin{bmatrix} h_{11} & h_{12} & \dots & h_{1q} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \dots & h_{nq} \end{bmatrix}$, let

$\mathrm{vec}(H) = \begin{bmatrix} h_{11} & h_{21} & \dots & h_{n1} & \dots & h_{1q} & h_{2q} & \dots & h_{nq} \end{bmatrix}^\top$

denote an $nq \times 1$ vector.

Note: vec(H) contains the columns of matrix H stacked upon one another.
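
As an illustrative aside, numpy realizes vec(·) as column-major (Fortran-order) flattening:

```python
import numpy as np

H = np.array([[1, 4],
              [2, 5],
              [3, 6]])                  # an n x q matrix with n = 3, q = 2
vec_H = H.flatten(order="F")            # stack the columns: [1 2 3 4 5 6]
print(vec_H)                            # an nq-dimensional vector
```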



Recall:

Input: Training Data $D = \{(x^i, y^i)\}_{i=1}^{S}$, $x^i \in \mathbb{R}^n$, $y^i \in \mathbb{R}^n$, $\forall i \in \{1, \dots, S\}$.

MLP Parameters: Weight matrices B and A.

Error: $E = \sum_{i=1}^{S} \|ABx^i - y^i\|_2^2$. (We write simply E since the context is clear!)

Using the new notations, we write: $E = \|\mathrm{vec}(ABX - Y)\|_2^2$.

Note: The norm is a simple vector norm.


Now note: $\mathrm{vec}(ABX - Y) = \mathrm{vec}(ABX) - \mathrm{vec}(Y)$. (Homework: Verify this claim!)

Thus we have: $E = \|\mathrm{vec}(ABX) - \mathrm{vec}(Y)\|_2^2$.


Some more new notations:

For an $m \times n$ matrix $G = \begin{bmatrix} g_{11} & \dots & g_{1n} \\ \vdots & \ddots & \vdots \\ g_{m1} & \dots & g_{mn} \end{bmatrix}$ and a $p \times q$ matrix $H = \begin{bmatrix} h_{11} & \dots & h_{1q} \\ \vdots & \ddots & \vdots \\ h_{p1} & \dots & h_{pq} \end{bmatrix}$,

define $G \otimes H = \begin{bmatrix} g_{11}H & \dots & g_{1n}H \\ \vdots & \ddots & \vdots \\ g_{m1}H & \dots & g_{mn}H \end{bmatrix}$ as the Kronecker product of G and H.

Note: $G \otimes H$ is of size $mp \times nq$.
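
numpy provides the Kronecker product directly as np.kron; a quick (illustrative) shape check:

```python
import numpy as np

G = np.ones((2, 3))                     # an m x n matrix
H = np.ones((4, 5))                     # a p x q matrix
print(np.kron(G, H).shape)              # (8, 15), i.e. mp x nq
```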


Claim: $\mathrm{vec}(ABX) = (X^\top \otimes A)\,\mathrm{vec}(B)$.

Proof idea: The $i$-th column of $ABX$ is $ABx^i = \sum_j x^i_j \, A B_{:,j}$, which is exactly the $i$-th block of $(X^\top \otimes A)\,\mathrm{vec}(B)$, since the $i$-th row of $X^\top$ is $(x^i)^\top$ and $\mathrm{vec}(B)$ stacks the columns $B_{:,j}$.
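
A numerical sanity check of the claim (shapes chosen arbitrarily, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S = 4, 2, 5
A = rng.standard_normal((n, p))
B = rng.standard_normal((p, n))
X = rng.standard_normal((n, S))

lhs = (A @ B @ X).flatten(order="F")            # vec(ABX), an nS-vector
rhs = np.kron(X.T, A) @ B.flatten(order="F")    # (X^T kron A) vec(B)
print(np.allclose(lhs, rhs))                    # True
```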



Using $\mathrm{vec}(ABX) = (X^\top \otimes A)\,\mathrm{vec}(B)$, we can write:

$E = \|\mathrm{vec}(ABX - Y)\|_2^2 = \|\mathrm{vec}(ABX) - \mathrm{vec}(Y)\|_2^2 = \|(X^\top \otimes A)\,\mathrm{vec}(B) - \mathrm{vec}(Y)\|_2^2.$

This is of the form $F(z) = \|Mz - c\|_2^2$, where $M = (X^\top \otimes A)$, $z = \mathrm{vec}(B)$ and $c = \mathrm{vec}(Y)$.

Observe: M is of size $nS \times np$, z is an $np \times 1$ vector and c is an $nS \times 1$ vector.


Consider the function $F(z) = \|Mz - c\|_2^2$. We have the following result:

Convexity of function F: The function F(z) is convex.

Question: What is a convex function? We discuss convex sets and convex functions in Part 2 of this lecture!


Hence $\min_z F(z)$ is a convex optimization problem.

Note:

$\operatorname{argmin}_z F(z) = \operatorname{argmin}_z \|Mz - c\|_2^2
= \operatorname{argmin}_z \; z^\top M^\top M z - 2 z^\top M^\top c + c^\top c
= \operatorname{argmin}_z \; z^\top M^\top M z - 2 z^\top M^\top c,$

since the constant term $c^\top c$ does not affect the minimizer.
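
A quick check of this expansion on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 3))
z = rng.standard_normal(3)
c = rng.standard_normal(6)

lhs = np.sum((M @ z - c) ** 2)                      # ||Mz - c||_2^2
rhs = z @ M.T @ M @ z - 2 * z @ M.T @ c + c @ c     # expanded quadratic form
print(np.allclose(lhs, rhs))                        # True
```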


By the first-order optimality characterization, $z^*$ is a solution of $\min_z F(z)$ if and only if $\nabla F(z^*) = 0$

$\implies 2M^\top M z^* - 2M^\top c = 0 \implies M^\top M z^* = M^\top c.$

Further, if $M^\top M$ is positive definite, $z^*$ is the unique optimal solution and is given by $z^* = (M^\top M)^{-1} M^\top c$.
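
Putting the pieces together, here is a sketch (illustrative random data; it assumes $M^\top M$ is positive definite, which holds below because X has full row rank and A has full column rank) that recovers the optimal vec(B) for a fixed A via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, S = 4, 2, 20
A = rng.standard_normal((n, p))           # decoder weights, held fixed
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)                       # M = X^T kron A, of size nS x np
c = Y.flatten(order="F")                  # c = vec(Y)

# Solve the normal equations M^T M z* = M^T c (avoid forming an explicit inverse).
z_star = np.linalg.solve(M.T @ M, M.T @ c)
B_star = z_star.reshape(p, n, order="F")  # un-vec z* back into the p x n matrix B

# Cross-check against numpy's least-squares solver on the same system.
z_lstsq, *_ = np.linalg.lstsq(M, c, rcond=None)
print(np.allclose(z_star, z_lstsq))       # True
```

Solving the normal equations is fine at this scale; for ill-conditioned M, the least-squares route (np.linalg.lstsq) is the numerically safer default.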
