Neural Network II - Part 1
1DT109 ASPLOC
2021 VT1-VT2
P1: a matrix-based approach to represent a neural network

[Figure: a three-layer network. Layer 1 has three neurons with activations $a^1_1, a^1_2, a^1_3$; layer 2 has four neurons, each with activation and bias $(a^2_j, b^2_j)$, reached through weights $w^2_{jk}$; layer 3 follows.]

Collect the layer-2 weights, the layer-1 activations, and the layer-2 biases into a matrix and two vectors:

$$w^2 = \begin{pmatrix} w^2_{11} & w^2_{12} & w^2_{13} \\ w^2_{21} & w^2_{22} & w^2_{23} \\ w^2_{31} & w^2_{32} & w^2_{33} \\ w^2_{41} & w^2_{42} & w^2_{43} \end{pmatrix}, \qquad a^1 = \begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix}, \qquad b^2 = \begin{pmatrix} b^2_1 \\ b^2_2 \\ b^2_3 \\ b^2_4 \end{pmatrix}$$

Denote that a function $f$ applied to a vector acts elementwise:

$$f\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} f(a) \\ f(b) \end{pmatrix}$$

Calculate the activations in layer 2, $a^2$, one neuron (one row of $w^2$) at a time:

$$a^2_1 = \sigma\left( \begin{pmatrix} w^2_{11} & w^2_{12} & w^2_{13} \end{pmatrix} \times \begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix} + b^2_1 \right)$$

$$a^2_2 = \sigma\left( \begin{pmatrix} w^2_{21} & w^2_{22} & w^2_{23} \end{pmatrix} \times \begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix} + b^2_2 \right)$$

$$a^2_3 = \sigma\left( \begin{pmatrix} w^2_{31} & w^2_{32} & w^2_{33} \end{pmatrix} \times \begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix} + b^2_3 \right)$$

$$a^2_4 = \sigma\left( \begin{pmatrix} w^2_{41} & w^2_{42} & w^2_{43} \end{pmatrix} \times \begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix} + b^2_4 \right)$$
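To make the matrix picture concrete, here is a minimal NumPy sketch of this slide's computation. The shapes match the figure (a 4x3 weight matrix w2, a 3-vector a1 of layer-1 activations, a 4-vector b2 of biases); the numeric values are made up for illustration:

import numpy as np

def sigma(z):
    # Logistic sigmoid, applied elementwise to a vector.
    return 1.0 / (1.0 + np.exp(-z))

# Layer-2 weights (4 neurons x 3 inputs), layer-1 activations,
# and layer-2 biases; values are arbitrary, for illustration only.
w2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9],
               [1.0, 1.1, 1.2]])
a1 = np.array([0.5, -0.3, 0.8])
b2 = np.array([0.01, 0.02, 0.03, 0.04])

# Matrix form of the slide: a^2 = sigma(w^2 . a^1 + b^2)
a2 = sigma(w2 @ a1 + b2)

# Per-neuron form: a^2_j = sigma(sum_k w^2_{jk} a^1_k + b^2_j)
a2_rowwise = np.array([sigma(w2[j] @ a1 + b2[j]) for j in range(4)])
assert np.allclose(a2, a2_rowwise)
print(a2)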
Summarizing it up

[Figure: the same three-layer network, with weights $w^2_{jk}$ and pairs $(a^2_j, b^2_j)$ in layer 2.]

• The activation of layer 2 can be modeled by
  $a^2 = \sigma\left( w^2 \cdot a^1 + b^2 \right)$
• The activation of layer $l$ can be modeled by
  $a^l = \sigma\left( w^l \cdot a^{l-1} + b^l \right)$
• This expression gives us a global view of how the activations in one layer relate to the activations in the previous layer.
• Apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function.

Why is $w^l_{jk}$ a better notation than $w^l_{kj}$? Because it avoids transposing the weight matrix! (See the sketch below.)
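To see the transpose point concretely: if w[j, k] stores the weight from neuron k in layer l-1 to neuron j in layer l (the $w^l_{jk}$ convention), the forward pass is a plain matrix-vector product; with the opposite $w^l_{kj}$ convention, every layer would need a transpose. A minimal sketch:

import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.standard_normal(3)        # a^{l-1}: 3 neurons in layer l-1
w_jk = rng.standard_normal((4, 3))     # w[j, k]: weight from neuron k to neuron j
w_kj = w_jk.T                          # the opposite convention, indexed w[k, j]

z_jk = w_jk @ a_prev        # w^l . a^{l-1}: no transpose needed
z_kj = w_kj.T @ a_prev      # the w_kj convention forces a transpose every time
assert np.allclose(z_jk, z_kj)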
Summarizing it up
• We can assign
  $z^l = w^l \cdot a^{l-1} + b^l$
• Thus
  $a^l = \sigma\left( w^l \cdot a^{l-1} + b^l \right) = \sigma\left( z^l \right)$
• Also note that
  $z^l_j = \sum_k w^l_{jk} \, a^{l-1}_k + b^l_j$
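A minimal sketch of a full feedforward pass under these definitions; storing every $z^l$ alongside every $a^l$ is exactly what backpropagation will need later. The list-of-arrays representation and the 3-4-2 layer sizes are assumptions for illustration (the slides do not give the size of layer 3):

import numpy as np

def sigma(z):
    # Logistic sigmoid, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, biases, a):
    # Compute z^l = w^l . a^{l-1} + b^l and a^l = sigma(z^l), layer by layer.
    # weights[i] and biases[i] belong to layer i + 2 (layer 1 is the input).
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigma(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

# A 3-4-2 network with random parameters (the sizes are an assumption).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

zs, activations = feedforward(weights, biases, np.array([0.5, -0.3, 0.8]))
print(activations[-1])  # a^L, the output of the last layer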
P2: assumptions of the cost function
• With the previous notation
  $a^l = \sigma\left( z^l \right)$
• The cost function of the network can be rewritten as
  $C(w, b) = \frac{1}{2n} \sum_{x} \left\| y(x) - a^L(x) \right\|^2$
• For a single input $x$
  $C_x(w, b) = \frac{1}{2} \left\| y(x) - a^L(x) \right\|^2 = \frac{1}{2} \sum_j \left( y_j - a^L_j \right)^2$
  (This works because the final $C(w, b)$ is the average value of $C_x$ over the $n$ training inputs.)
• Where $a^L$ is the activation vector of the last layer $L$.
  (Notice how $C$ and $C_x$ depend only on $a^L$. To calculate $a^L$ we need values from the previous layers of the NN, but those values do not appear in the definition of $C$ or $C_x$ itself.)
• N.B. In order that backpropagation can be applied:
  • We can compute $C$ for each single input $x$ as $C_x$
  • $C_x$ depends only on the last layer $a^L$ of the NN
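A minimal sketch of the two cost expressions; the target vectors ys and network outputs aLs below are made-up values for illustration:

import numpy as np

def cost_single(y, aL):
    # Quadratic cost for one input: C_x = 1/2 * ||y - a^L||^2
    return 0.5 * np.sum((y - aL) ** 2)

def cost_total(ys, aLs):
    # C(w, b) = (1/2n) * sum_x ||y(x) - a^L(x)||^2,
    # i.e. the average of C_x over the n training inputs.
    return np.mean([cost_single(y, aL) for y, aL in zip(ys, aLs)])

# Toy check with two training inputs (targets and outputs made up):
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
aLs = [np.array([0.8, 0.1]), np.array([0.3, 0.6])]
print(cost_single(ys[0], aLs[0]))  # C_x for the first input
print(cost_total(ys, aLs))         # C, the average over both inputs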
P3: The Hadamard product, $s \odot t$
• Suppose $s$ and $t$ are two vectors of the same dimension. $s \odot t$ denotes the elementwise product of the two vectors. That is,
  $(s \odot t)_j = s_j \cdot t_j$
• Example:
  $\begin{pmatrix} a \\ b \end{pmatrix} \odot \begin{pmatrix} c \\ d \end{pmatrix} = \begin{pmatrix} a \cdot c \\ b \cdot d \end{pmatrix}$
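In NumPy the Hadamard product is simply the * operator on arrays of equal shape:

import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])
print(s * t)  # elementwise (Hadamard) product: [ 4. 10. 18.]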
Backpropagation
• Backpropagation is about understanding how changing the weights and biases in a network changes the cost function.
• In math terms, this enables us to compute the partial derivatives
  $\frac{\partial C_x}{\partial w^l_{jk}}$ and $\frac{\partial C_x}{\partial b^l_j}$
• We define the error of neuron $j$ in layer $l$ as
  $\delta^l_j = \frac{\partial C_x}{\partial z^l_j}$

[Figure: a neuron in layer $l$ whose weighted input $z^l_j$ is perturbed by a small amount $\Delta z^l_j$.]
Backpropagation – Corollary 1
(Recall: $\delta^l_j = \frac{\partial C_x}{\partial z^l_j}$ and $C_x = \frac{1}{2} \sum_j \left( y_j - a^L_j \right)^2$.)

• We first derive $\delta^L_j$, that is, the error on the $j$-th neuron of the output layer $L$.
• By definition,
  $\delta^L_j = \frac{\partial C_x}{\partial z^L_j} = \frac{\partial C_x}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j}$
  (Notice how we get rid of the $\sum_j$: $z^L_j$ only affects $a^L_j$.)
• Recall that $a^L_j = \sigma\left( z^L_j \right)$; we have
  $\delta^L_j = \frac{\partial C_x}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C_x}{\partial a^L_j} \cdot \sigma'\left( z^L_j \right)$
• In summary
  $\delta^L_j = \frac{\partial C_x}{\partial a^L_j} \cdot \sigma'\left( z^L_j \right)$

[Figure: layers $L-1$ and $L$; each output neuron pairs its activation with its target, $(a^L_1, y_1), \ldots, (a^L_j, y_j), \ldots$, with $a^L_j = \sigma\left( z^L_j \right)$.]
Interpretation of Corollary 1: $\delta^L_j = \frac{\partial C_x}{\partial a^L_j} \cdot \sigma'\left( z^L_j \right)$
• The first term, $\frac{\partial C_x}{\partial a^L_j}$, measures how fast the cost is changing as a function of the $j$-th output activation.
• The second term, $\sigma'\left( z^L_j \right)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.
• Given $C_x = \frac{1}{2} \sum_j \left( y_j - a^L_j \right)^2$, it follows that $\frac{\partial C_x}{\partial a^L_j} = a^L_j - y_j$.
• Given $\sigma\left( z^L_j \right) = \frac{1}{1 + e^{-z^L_j}}$, it follows that $\sigma'\left( z^L_j \right) = \sigma\left( z^L_j \right) \cdot \left( 1 - \sigma\left( z^L_j \right) \right)$.

[Figure: layers $L-1$ and $L$, with output pairs $(a^L_1, y_1), \ldots$]
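Putting Corollary 1 together with the two facts above gives the vectorized form $\delta^L = (a^L - y) \odot \sigma'(z^L)$, using the Hadamard product from P3. Below is a minimal Python sketch; the weighted inputs zL and targets y are made-up values for illustration, and the finite-difference check confirms the derivative numerically:

import numpy as np

def sigma(z):
    # Logistic sigmoid, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)) for the logistic sigmoid.
    s = sigma(z)
    return s * (1.0 - s)

# Made-up weighted inputs z^L and targets y for a 2-neuron output layer.
zL = np.array([0.4, -1.2])
y = np.array([1.0, 0.0])
aL = sigma(zL)

# Corollary 1 in vectorized form: delta^L = (a^L - y) ⊙ sigma'(z^L)
deltaL = (aL - y) * sigma_prime(zL)
print(deltaL)

# Finite-difference check that delta^L_0 really equals dC_x/dz^L_0.
def Cx(z):
    return 0.5 * np.sum((y - sigma(z)) ** 2)

eps = 1e-6
numeric = (Cx(zL + np.array([eps, 0.0])) - Cx(zL - np.array([eps, 0.0]))) / (2 * eps)
assert np.isclose(numeric, deltaL[0])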