Deep Learning - Theory and Practice

IE 643
Lecture 10

September 25, 2020.

Outline

1 Recap

2 Popular examples of MLP



Recap

Multi Layer Perceptron for Classification Tasks

We have been looking at the multi-layer perceptron (MLP) for a variety of classification tasks:
- Binary classification
- Multi-class classification
- Multi-label classification

Popular examples of MLP

Multi Layer Perceptron: Encoder-Decoder Architecture

We consider a simple MLP architecture with:
- An input layer with n neurons
- A hidden layer with p neurons, where p ≤ n
- An output layer with n neurons
All neurons in the hidden and output layers have linear activations.

This architecture is popularly called the Encoder-Decoder Architecture.


Let B denote the p × n matrix connecting the input layer and the hidden layer.
Let A denote the n × p matrix connecting the hidden layer and the output layer.


Input: Training Data $D = \{(x^i, y^i)\}_{i=1}^{S}$, $x^i \in \mathbb{R}^n$, $y^i \in \mathbb{R}^n$, $\forall i \in \{1, \dots, S\}$.

MLP Parameters: Weight matrices B and A.

Error: $E(B, A) = \sum_{i=1}^{S} \|ABx^i - y^i\|_2^2$.
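
As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this architecture and its error; the dimensions follow the definitions above, and the random data is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 4, 2, 10                 # input/output width n, hidden width p <= n, S samples

B = rng.standard_normal((p, n))    # encoder weights: input layer -> hidden layer
A = rng.standard_normal((n, p))    # decoder weights: hidden layer -> output layer
X = rng.standard_normal((n, S))    # columns are the inputs x^i
Y = rng.standard_normal((n, S))    # columns are the targets y^i

# Linear activations everywhere: the network output for x^i is simply A B x^i.
E = sum(np.sum((A @ B @ X[:, i] - Y[:, i]) ** 2) for i in range(S))
print(E)                           # E(B, A) = sum_i ||A B x^i - y^i||_2^2
```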

Special case: When $y^i = x^i$, $\forall i$, the architecture is called an Autoencoder.

Error for the autoencoder: $E(B, A) = \sum_{i=1}^{S} \|ABx^i - x^i\|_2^2$.

Example: Consider the case n = p = 1.

Error: $E(b, a) = \sum_{i=1}^{S} (abx^i - y^i)^2$.


Sample Training Data:

      x^i        y^i
   0.708333   0.708333
   0.583333   0.583333
   0.166667   0.166667
   0.458333   0.458333
   0.875      0.875

Note that $y^i = x^i$ here, so this is autoencoder data.


Loss surface for the sample training data (figure not shown):

Observation: Though there are two valleys in the loss surface, they look symmetric.
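
A short sketch (assuming matplotlib is available; grid limits chosen arbitrarily) that reproduces this surface for the sample training data above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.708333, 0.583333, 0.166667, 0.458333, 0.875])
y = x.copy()                              # y^i = x^i for this sample data

b, a = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
E = sum((a * b * xi - yi) ** 2 for xi, yi in zip(x, y))   # E(b, a) on the grid

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(b, a, E, cmap="viridis")
ax.set_xlabel("b"); ax.set_ylabel("a"); ax.set_zlabel("E(b, a)")
plt.show()                                # two symmetric valleys along the curve ab = 1
```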

Question: Does this extend to cases where $y^i \neq x^i$ and for $n \geq 2$, $n \geq p$?
However, when we fix a, E(b, a) becomes a quadratic function of b alone, with a single valley (figures not shown).


Similarly, when we fix b, E(b, a) becomes a quadratic function of a alone, again with a single valley (figures not shown).


Error: $E(b, a) = \sum_{i=1}^{S} (abx^i - y^i)^2$.

When we fix a or when we fix b, we observe that the graph does not contain multiple valleys.

Thus when we fix a or when we fix b, the graph looks as if every local optimum is a global optimum.

Question: Does this behavior extend to higher dimensions?

We shall investigate this in higher dimensions!
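
A quick numerical illustration (same sample data as above) of the single-valley behavior: with a fixed, E(b, a) is a one-dimensional quadratic in b whose unique minimizer follows from setting dE/db = 0:

```python
import numpy as np

x = np.array([0.708333, 0.583333, 0.166667, 0.458333, 0.875])
y = x.copy()

a = 0.5                                 # fix a; E(b) = sum_i (a b x^i - y^i)^2
b_star = (x @ y) / (a * (x @ x))        # unique minimizer from dE/db = 0
print(b_star, a * b_star)               # a * b_star = 1 here, since y^i = x^i
```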



Some notations:

Let $X = [x^1 \; x^2 \; \dots \; x^S]$ and $Y = [y^1 \; y^2 \; \dots \; y^S]$, with the data points stacked as columns.

Note: X and Y are n × S matrices.


For an $n \times q$ matrix $H = \begin{bmatrix} h_{11} & h_{12} & \dots & h_{1q} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \dots & h_{nq} \end{bmatrix}$, let

$\mathrm{vec}(H) = \begin{bmatrix} h_{11} & h_{21} & \dots & h_{n1} & \dots & h_{1q} & h_{2q} & \dots & h_{nq} \end{bmatrix}^\top$

denote an $nq \times 1$ vector.

Note: vec(H) contains the columns of matrix H stacked upon one another.
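
As an illustrative aside, numpy realizes vec(·) as column-major (Fortran-order) flattening:

```python
import numpy as np

H = np.array([[1, 4],
              [2, 5],
              [3, 6]])                  # an n x q matrix with n = 3, q = 2
vec_H = H.flatten(order="F")            # stack the columns: [1 2 3 4 5 6]
print(vec_H)                            # an nq-dimensional vector
```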



Recall:

Input: Training Data $D = \{(x^i, y^i)\}_{i=1}^{S}$, $x^i \in \mathbb{R}^n$, $y^i \in \mathbb{R}^n$, $\forall i \in \{1, \dots, S\}$.

MLP Parameters: Weight matrices B and A.

Error: $E = \sum_{i=1}^{S} \|ABx^i - y^i\|_2^2$. (We write simply E since the context is clear!)

Using the new notations, we write: $E = \|\mathrm{vec}(ABX - Y)\|_2^2$.

Note: The norm is a simple vector norm.


Now note: $\mathrm{vec}(ABX - Y) = \mathrm{vec}(ABX) - \mathrm{vec}(Y)$. (Homework: Verify this claim!)

Thus we have: $E = \|\mathrm{vec}(ABX) - \mathrm{vec}(Y)\|_2^2$.


Some more new notations:

For an $m \times n$ matrix $G = \begin{bmatrix} g_{11} & \dots & g_{1n} \\ \vdots & \ddots & \vdots \\ g_{m1} & \dots & g_{mn} \end{bmatrix}$ and a $p \times q$ matrix $H = \begin{bmatrix} h_{11} & \dots & h_{1q} \\ \vdots & \ddots & \vdots \\ h_{p1} & \dots & h_{pq} \end{bmatrix}$,

define $G \otimes H = \begin{bmatrix} g_{11}H & \dots & g_{1n}H \\ \vdots & \ddots & \vdots \\ g_{m1}H & \dots & g_{mn}H \end{bmatrix}$ as the Kronecker product of G and H.

Note: $G \otimes H$ is of size $mp \times nq$.
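
numpy provides the Kronecker product directly as np.kron; a quick (illustrative) shape check:

```python
import numpy as np

G = np.ones((2, 3))                     # an m x n matrix
H = np.ones((4, 5))                     # a p x q matrix
print(np.kron(G, H).shape)              # (8, 15), i.e. mp x nq
```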


Claim: $\mathrm{vec}(ABX) = (X^\top \otimes A)\,\mathrm{vec}(B)$.

Proof idea: The $i$-th column of $ABX$ is $ABx^i = \sum_j x^i_j \, A B_{:,j}$, which is exactly the $i$-th block of $(X^\top \otimes A)\,\mathrm{vec}(B)$, since the $i$-th row of $X^\top$ is $(x^i)^\top$ and $\mathrm{vec}(B)$ stacks the columns $B_{:,j}$.
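
A numerical sanity check of the claim (shapes chosen arbitrarily, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S = 4, 2, 5
A = rng.standard_normal((n, p))
B = rng.standard_normal((p, n))
X = rng.standard_normal((n, S))

lhs = (A @ B @ X).flatten(order="F")            # vec(ABX), an nS-vector
rhs = np.kron(X.T, A) @ B.flatten(order="F")    # (X^T kron A) vec(B)
print(np.allclose(lhs, rhs))                    # True
```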



Using $\mathrm{vec}(ABX) = (X^\top \otimes A)\,\mathrm{vec}(B)$, we can write:

$E = \|\mathrm{vec}(ABX - Y)\|_2^2 = \|\mathrm{vec}(ABX) - \mathrm{vec}(Y)\|_2^2 = \|(X^\top \otimes A)\,\mathrm{vec}(B) - \mathrm{vec}(Y)\|_2^2.$

This is of the form $F(z) = \|Mz - c\|_2^2$, where $M = (X^\top \otimes A)$, $z = \mathrm{vec}(B)$ and $c = \mathrm{vec}(Y)$.

Observe: M is of size $nS \times np$, z is an $np \times 1$ vector and c is an $nS \times 1$ vector.


Consider the function $F(z) = \|Mz - c\|_2^2$. We have the following result:

Convexity of function F: The function F(z) is convex.

Question: What is a convex function? We discuss convex sets and convex functions in Part 2 of this lecture!


Hence $\min_z F(z)$ is a convex optimization problem.

Note:

$\operatorname{argmin}_z F(z) = \operatorname{argmin}_z \|Mz - c\|_2^2
= \operatorname{argmin}_z \; z^\top M^\top M z - 2 z^\top M^\top c + c^\top c
= \operatorname{argmin}_z \; z^\top M^\top M z - 2 z^\top M^\top c,$

since the constant term $c^\top c$ does not affect the minimizer.
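
A quick check of this expansion on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 3))
z = rng.standard_normal(3)
c = rng.standard_normal(6)

lhs = np.sum((M @ z - c) ** 2)                      # ||Mz - c||_2^2
rhs = z @ M.T @ M @ z - 2 * z @ M.T @ c + c @ c     # expanded quadratic form
print(np.allclose(lhs, rhs))                        # True
```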


By the first-order optimality characterization, $z^*$ is a solution of $\min_z F(z)$ if and only if $\nabla F(z^*) = 0$

$\implies 2M^\top M z^* - 2M^\top c = 0 \implies M^\top M z^* = M^\top c.$

Further, if $M^\top M$ is positive definite, $z^*$ is the unique optimal solution and is given by $z^* = (M^\top M)^{-1} M^\top c$.
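
Putting the pieces together, here is a sketch (illustrative random data; it assumes $M^\top M$ is positive definite, which holds below because X has full row rank and A has full column rank) that recovers the optimal vec(B) for a fixed A via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, S = 4, 2, 20
A = rng.standard_normal((n, p))           # decoder weights, held fixed
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)                       # M = X^T kron A, of size nS x np
c = Y.flatten(order="F")                  # c = vec(Y)

# Solve the normal equations M^T M z* = M^T c (avoid forming an explicit inverse).
z_star = np.linalg.solve(M.T @ M, M.T @ c)
B_star = z_star.reshape(p, n, order="F")  # un-vec z* back into the p x n matrix B

# Cross-check against numpy's least-squares solver on the same system.
z_lstsq, *_ = np.linalg.lstsq(M, c, rcond=None)
print(np.allclose(z_star, z_lstsq))       # True
```

Solving the normal equations is fine at this scale; for ill-conditioned M, the least-squares route (np.linalg.lstsq) is the numerically safer default.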
