
Deep Learning

Dr. Sugata Ghosal


BITS Pilani sugata.ghosal@pilani.bits-pilani.ac.in
Pilani Campus

Problem Solving Session


Time: 7:30 – 9:30 PM
Date: 21/10/2020
These slides are assembled by the instructor with grateful acknowledgement of the many
others who made their course materials freely available online.
Weight Updates in Deep Networks

[Network diagram: inputs y1(0), y2(0); hidden units z1(1) -> y1(1) and z2(1) -> y2(1); output unit z1(2) -> y = y1(2). From the worked solution that follows, hidden node 1 receives weights 0.25 (from x1) and 0.75 (from x2), hidden node 2 receives 0.75 and 0.25, and the output node receives w31 = 0.67 and w32 = 0.33.]

• Training input: (x1, x2) = (1, 1); target output: d = 0.0
• div function: square error
• Activation function f( ): ReLU
• Bias: 0 at all nodes
• η = 0.1
• What are the values of w31 and w12 in the next iteration?
Weight Updates with ReLU, square error

z1(1) = 0.25*1 + 0.75*1 = 1.0        y1(1) = ReLU(z1(1)) = 1.0
z2(1) = 0.75*1 + 0.25*1 = 1.0        y2(1) = ReLU(z2(1)) = 1.0
z1(2) = 0.67*1 + 0.33*1 = 1.0        y = y1(2) = ReLU(z1(2)) = 1.0

div = (1/2)(d − y)^2

∂div/∂y = y − d = 1.0
∂div/∂z1(2) = ∂div/∂y1(2) · ∂y1(2)/∂z1(2) = 1·1 = 1
∂div/∂y1(1) = ∂div/∂z1(2) · ∂z1(2)/∂y1(1) = 1·w31 = 0.67
∂div/∂z1(1) = ∂div/∂y1(1) · ∂y1(1)/∂z1(1) = 0.67·1 = 0.67

∂div/∂w31 = ∂div/∂z1(2) · ∂z1(2)/∂w31 = 1·y1(1) = 1
w31 ← w31 − η·∂div/∂w31 = 0.67 − 0.1·1 = 0.57

∂div/∂w12 = ∂div/∂z1(1) · ∂z1(1)/∂w12 = 0.67·y2(0) = 0.67
w12 ← w12 − η·∂div/∂w12 = 0.75 − 0.1·0.67 = 0.683
Weight Updates with ReLU Hidden, sigmoid output
and binary cross-entropy

z1(1) = 0.25*1 + 0.75*1 = 1.0        y1(1) = ReLU(z1(1)) = 1.0
z2(1) = 0.75*1 + 0.25*1 = 1.0        y2(1) = ReLU(z2(1)) = 1.0
z1(2) = 0.67*1 + 0.33*1 = 1.0        y = y1(2) = sigmoid(z1(2)) = e/(1+e)

div = −d·log(y) − (1−d)·log(1−y)

∂div/∂y = −d/y + (1−d)/(1−y) = 1/(1−y) = 1+e        (since d = 0)

∂div/∂z1(2) = ∂div/∂y1(2) · ∂y1(2)/∂z1(2) = (1+e)·sigmoid(z1(2))·(1−sigmoid(z1(2))) = e/(1+e)

∂div/∂w31 = ∂div/∂z1(2) · ∂z1(2)/∂w31 = e/(1+e) · y1(1) = e/(1+e)

w31 ← w31 − η·∂div/∂w31 = 0.67 − 0.1·e/(1+e) = 0.597
Example Special Case: Softmax

y_i(k) = exp(z_i(k)) / Σ_j exp(z_j(k))

∂y_i(k)/∂z_j(k) = y_i(k) · (δ_ij − y_j(k))

∂Div/∂z_i(k) = Σ_j (∂Div/∂y_j(k)) · y_j(k) · (δ_ij − y_i(k))

• For future reference
• δ_ij is the Kronecker delta: δ_ij = 1 if i = j, and 0 otherwise
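The softmax Jacobian formula can be sanity-checked in code (an illustrative helper, not from the slides; each row of the Jacobian sums to zero because the softmax outputs sum to one):

```python
import math

def softmax(z):
    """Numerically plain softmax over a list of scores."""
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def softmax_jacobian(z):
    """J[i][j] = dy_i/dz_j = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

# Same inputs as the next slide: z = (1, 0)
J = softmax_jacobian([1.0, 0.0])
```

Because every row sums to (numerically) zero, adding a constant to all logits leaves the softmax output unchanged.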
Weight Updates with Softmax
y1 y2

𝑧1 𝑧2 Softmax Output Layer

𝑤11 = 1 𝑤12 = 0 𝑤21 = 0 𝑤22 = 1

h1 h2

• Training input: (h1, h2) = (1, 0) target output: (d1,d2)=(0,1)


• div function: cross-entropy
• Learning rate: 0.1
• Bias = 0
• What will be the value of w11 in next iteration?

• Input to the softmax nodes: z1 = w11·h1 = 1; z2 = w21·h1 + w22·h2 = 0

• y1 = e / (1+e)
• y2 = 1 / (1+e)
• div = −d1·log(y1) − d2·log(y2) = −log(y2) = log(1+e) = 1.313
Weight Updates with Softmax …

∂y_i/∂z_j = y_i · (δ_ij − y_j)        (softmax output layer)

w11 = 1   w12 = 0   w21 = 0   w22 = 1        inputs: h1, h2

• Change in w11: δw11 = −0.1 · ∂div/∂w11
  = −0.1 · ∂div/∂z1 · ∂z1/∂w11 = −0.1 · ∂div/∂z1 · h1

• ∂div/∂y1 = −d1/y1 = 0        ∂div/∂y2 = −d2/y2 = −1/y2

• ∂div/∂z1 = ∂div/∂y1 · y1·(1−y1) + ∂div/∂y2 · (−y1·y2)
  = (−1/y2)·(−y1·y2) = y1

• z1 = w11·h1 = 1; z2 = 0; so y1 = e/(1+e)

• δw11 = −0.1 · y1 · h1 = −0.1 · e/(1+e) = −0.0731, so w11(t+1) = 1 − 0.0731 = 0.9269
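The update can be verified numerically. The sketch below uses the standard simplification that, for a softmax output with cross-entropy, ∂div/∂z_i = y_i − d_i (it is equivalent to the term-by-term sum above):

```python
import math

eta = 0.1
h = [1.0, 0.0]                   # inputs
d = [0.0, 1.0]                   # targets
W = [[1.0, 0.0], [0.0, 1.0]]     # W[i][j]: weight from h_j into z_i

z = [sum(W[i][j] * h[j] for j in range(2)) for i in range(2)]   # [1, 0]
exps = [math.exp(v) for v in z]
y = [v / sum(exps) for v in exps]        # [e/(1+e), 1/(1+e)]

# Softmax + cross-entropy: ddiv/dz_i collapses to y_i - d_i
ddiv_dz1 = y[0] - d[0]                   # e/(1+e)
dw11 = -eta * ddiv_dz1 * h[0]            # ~ -0.0731
w11_new = W[0][0] + dw11                 # ~ 0.9269
```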
Weight Updates with Sigmoid Output
y1 y2

𝑧1 𝑧2 Sigmoid Output Layer

𝑤11 = 1 𝑤12 = 0 𝑤21 = 0 𝑤22 = 1

h1 h2

• Training input: (h1, h2) = (1, 0) target output: (d1,d2)=(0,1)


• Activation: sigmoid, Bias=0
• div function: cross-entropy
• Learning rate: 0.1
• What will be the value of w11 in next iteration?

• Input to the output nodes: z1 = w11·h1 = 1; z2 = 0

• y1 = 1/(1+e^−1) = e/(1+e)        y2 = 1/(1+e^0) = 1/2
• div = −d1·log(y1) − d2·log(y2) = −log(y2) = log 2
Weight Updates with Sigmoid Outputs …

Input: (h1, h2) = (1.0, 0.0); target output: (d1, d2) = (0, 1); w11(t+1) = ?

w11 = 1   w12 = 0   w21 = 0   w22 = 1        (sigmoid output layer)

div = −d1·log(y1) − d2·log(y2)

∂div/∂y1 = −d1/y1 = 0

∂y1/∂z1 = (1/(1+e^−1)) · (1 − 1/(1+e^−1)) = e/(1+e)^2

w11(t+1) = w11(t) − η·∂div/∂w11
         = w11(t) − 0.1 · ∂div/∂y1 · ∂y1/∂z1 · ∂z1/∂w11
         = w11(t) − 0.1 · 0 · e/(1+e)^2 · h1
         = 1.0
Example Problems
L1 and L2 Regularization
l2 regularization

• Gradient of the regularized objective:

∇L_R(θ) ≈ H(θ − θ*) + αθ

• At the regularized optimum θ_R*:

0 = ∇L_R(θ_R*) ≈ H(θ_R* − θ*) + αθ_R*
θ_R* ≈ (H + αI)^−1 H θ*
L2 Regularization
• Loss(x,y) = 3x^2 + 6xy + y^2
• What is the value of (x_opt, y_opt) that minimizes the loss, if L2
  regularization constant 0.1 is applied?

• dLoss/dx = 6x + 6y and dLoss/dy = 6x + 2y. So, in the
  unregularized case, x_opt = y_opt = 0.
• For the regularized case, the optimum θ_R* is given by

θ_R* ≈ (H + 0.1·I)^−1 H θ*

  where H is the Hessian of Loss(x,y) and θ* is the optimum
  of the unregularized case.
• Since x_opt = y_opt = 0 in the unregularized case, θ_R* is also the zero
  vector.
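The closed form θ_R* ≈ (H + αI)^−1 H θ* can be evaluated directly for this 2×2 case (a sketch with a hand-coded matrix inverse; since θ* = 0 here, the result is the zero vector):

```python
# For Loss = 3x^2 + 6xy + y^2 the Hessian is [[6, 6], [6, 2]]
# and the unregularized stationary point is (0, 0).
alpha = 0.1
H = [[6.0, 6.0], [6.0, 2.0]]
theta_star = [0.0, 0.0]

# A = H + alpha * I, inverted with the 2x2 cofactor formula
A = [[H[0][0] + alpha, H[0][1]], [H[1][0], H[1][1] + alpha]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]

# theta_R = A_inv @ (H @ theta_star)
Ht = [sum(H[i][j] * theta_star[j] for j in range(2)) for i in range(2)]
theta_R = [sum(A_inv[i][j] * Ht[j] for j in range(2)) for i in range(2)]
```

Shrinkage toward the origin leaves a zero optimum unchanged, which is exactly the slide's conclusion.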
l1 regularization

min_θ L_R(θ) = L(θ) + α·||θ||_1

• Further assume that H is diagonal and positive (H_ii > 0, ∀i)
  • not true in general, but assume it to build intuition

• The regularized objective is then (ignoring constants)

L_R(θ) ≈ Σ_i [ (1/2)·H_ii·(θ_i − θ_i*)^2 + α·|θ_i| ]

• The optimal θ_R*, coordinate by coordinate:

(θ_R*)_i ≈ max(θ_i* − α/H_ii, 0)   if θ_i* ≥ 0
(θ_R*)_i ≈ min(θ_i* + α/H_ii, 0)   if θ_i* < 0
L1 Regularization
• Loss(x,y) = 3x^2 − 6x + 3 + y^2
• What is the value of (x_opt, y_opt) that minimizes the loss, if L1
  regularization constant 0.1 is applied?

• H11 = d²Loss/dx² = 6, H12 = H21 = d²Loss/dxdy = 0,
  and H22 = d²Loss/dy² = 2

(θ_R*)_i ≈ max(θ_i* − α/H_ii, 0)   if θ_i* ≥ 0
(θ_R*)_i ≈ min(θ_i* + α/H_ii, 0)   if θ_i* < 0

• dLoss/dx = 6x − 6 and dLoss/dy = 2y. Thus, in the
  unregularized case, x_opt = 1 and y_opt = 0.
• So, in the regularized case, x_opt = 1 − 0.1/6 = 0.9833 and
  y_opt = max(0 − 0.1/2, 0) = 0.
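The coordinate-wise soft-threshold rule can be written as a small helper (illustrative; `l1_regularized_optimum` is my name, and the diagonal-Hessian assumption from the previous slide is baked in):

```python
def l1_regularized_optimum(theta_star, H_diag, alpha):
    """Per-coordinate soft threshold: shrink each theta_i* toward zero
    by alpha / H_ii, clipping at zero (diagonal-Hessian approximation)."""
    out = []
    for t, h in zip(theta_star, H_diag):
        if t >= 0:
            out.append(max(t - alpha / h, 0.0))
        else:
            out.append(min(t + alpha / h, 0.0))
    return out

# Loss(x, y) = 3x^2 - 6x + 3 + y^2: unregularized optimum (1, 0), H = diag(6, 2)
x_opt, y_opt = l1_regularized_optimum([1.0, 0.0], [6.0, 2.0], alpha=0.1)
# x_opt = 1 - 0.1/6 ~ 0.9833, y_opt clipped to 0.0
```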
Optimal Learning Rate: Multivariate Diagonal Quadratic
Error Function
The error surface is given by E(x,y,z) = 3x² + 2y² + 4z² + 6. What is the optimal
learning rate that leads to fastest convergence to the global minimum?

ɳx,opt = 1/6
ɳy,opt = 1/4
ɳz,opt = 1/8

Optimal learning rate =
min(ɳx,opt, ɳy,opt, ɳz,opt) = 0.125

Largest learning rate for convergence =
min(2ɳx,opt, 2ɳy,opt, 2ɳz,opt) = 0.25
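A quick experiment confirms the thresholds: along each decoupled axis with curvature c (the second derivative, here 6, 4, and 8), gradient descent contracts only when η < 2/c, so the whole system converges only below the minimum over axes (a sketch; the step count and starting point are arbitrary):

```python
# Gradient descent on E(x, y, z) = 3x^2 + 2y^2 + 4z^2 + 6.
# The gradient along each axis is c * w for curvature c.
curvatures = [6.0, 4.0, 8.0]
eta_opt = min(1.0 / c for c in curvatures)   # 0.125 (fastest safe rate)
eta_max = min(2.0 / c for c in curvatures)   # 0.25  (divergence threshold)

def descend(eta, steps=50):
    """Run gradient descent and return the largest remaining coordinate."""
    w = [1.0, 1.0, 1.0]
    for _ in range(steps):
        w = [wi - eta * c * wi for wi, c in zip(w, curvatures)]
    return max(abs(wi) for wi in w)

converged = descend(0.2) < 1e-3   # 0.2 < 0.25: all axes contract
diverged = descend(0.3) > 1e3     # 0.3 > 0.25: the z-axis blows up
```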
Dependence on learning rate –
Error Minimization

[Figure: error-minimization trajectories for learning rates below, at, and above the optimal rate; only the labels ɳ1,opt and ɳ2,opt survived extraction]
Minimization of Quadratic Error Function
Weight Updates – Ordinary Gradient Descent
Weight Updates – Momentum Method

w2(t+1)= 2.058+0.9*(2.0-1.0)=2.958
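The momentum update above can be written as a one-line helper (a sketch: the gradient value −0.58 below is inferred so that the numbers reproduce the slide's w2 update, with 2.0 − 0.1·(−0.58) = 2.058; η = 0.1 and β = 0.9 are likewise assumptions):

```python
def momentum_step(w, w_prev, grad, eta=0.1, beta=0.9):
    """Classical momentum expressed via the previous weight change:
    w(t+1) = w(t) - eta * grad + beta * (w(t) - w(t-1))."""
    return w - eta * grad + beta * (w - w_prev)

# Reproduces the slide's numbers: 2.058 + 0.9 * (2.0 - 1.0) = 2.958
w2_next = momentum_step(2.0, 1.0, -0.58)
```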
Weight Updates – RProp

Assume α = 1.5, β = 0.6.

What will be (w1, w2) at (t+1)?

At time t−1:
dE/dw1 = 0.5·(1−3) − (1−4)/6 = −0.5
dE/dw2 = 2/9·(1−4) − (1−3)/6 = −0.333

At time t:
dE/dw1 = 0.5·(1.5−3) − (2.0−4)/6 = 0.4167
dE/dw2 = 2/9·(2−4) − (1.5−3)/6 = −0.194

Δw1 = 1.5 − 1 = 0.5        Δw2 = 2 − 1 = 1

w1(t+1) = 1 + 0.5·0.6 = 1.3        (sign of the derivative changed, so the step is shrunk by β)
w2(t+1) = 2.0 + 1.5·1 = 3.5        (sign of the derivative stayed the same, so the step is grown by α)
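A generic sign-based RProp step can be sketched as follows (illustrative only: this is the basic grow-by-α / shrink-by-β rule, reproducing the slide's w2 update; RProp variants differ in exactly how the weight is moved after a sign flip, as in the w1 case above):

```python
def rprop_step(w, step, grad_prev, grad_now, alpha=1.5, beta=0.6,
               step_max=50.0, step_min=1e-6):
    """One RProp-style update for a single weight.

    The step size grows by alpha when the gradient sign is unchanged and
    shrinks by beta when it flips; the weight then moves one step against
    the sign of the current gradient.
    """
    if grad_prev * grad_now > 0:
        step = min(step * alpha, step_max)
    elif grad_prev * grad_now < 0:
        step = max(step * beta, step_min)
    sign = 1 if grad_now > 0 else (-1 if grad_now < 0 else 0)
    return w - step * sign, step

# Mirrors the slide's w2 numbers: previous step 1.0, gradients -0.333 then -0.194
w2, step2 = rprop_step(2.0, 1.0, -0.333, -0.194)   # step grows to 1.5, w2 -> 3.5
```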
Coding Related
...
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
...

How many inputs ?

How Many Outputs?

How Many Hidden Nodes?

Total number of trainable parameters?
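The answers can be computed without running Keras: a Dense layer with `fan_in` inputs and `units` outputs has fan_in·units weights plus `units` biases (a sketch; the layer sizes are transcribed from the model above):

```python
# Parameter count for Sequential([Dense(12, input_dim=8), Dense(8), Dense(1)])
layer_sizes = [(8, 12), (12, 8), (8, 1)]   # (fan_in, units) per Dense layer

params_per_layer = [fan_in * units + units for fan_in, units in layer_sizes]
total_params = sum(params_per_layer)       # 108 + 104 + 9 = 221

n_inputs = 8         # input_dim of the first layer
n_outputs = 1        # units of the final sigmoid layer
n_hidden = 12 + 8    # units in the two ReLU hidden layers
```

Calling `model.summary()` on the compiled Keras model would report the same totals.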
