Neural Network II - Part 1


Accelerating Systems with

Programmable Logic Components


Lecture 12: Neural Network II, Part 1

1DT109 ASPLOC
2021 VT1-VT2

Yuan Yao, yuan.yao@it.uu.se


Agenda
• A brief introduction to backpropagation
• Notations
• Preparatory concepts
• Backpropagation
• Corollary 1
Backpropagation
• In the last lecture we saw how neural networks can learn their weights and biases using the gradient descent algorithm.
• However, there was a gap in our explanation: we didn't discuss how to compute the gradient of the cost function.
• For m training samples:

  (Δw, Δb) = −(η/m) ∇C
           = ( −(η/m) Σ_{x∈m} ∂C_x/∂w_1, …, −(η/m) Σ_{x∈m} ∂C_x/∂w_k,
               −(η/m) Σ_{x∈m} ∂C_x/∂b_1, …, −(η/m) Σ_{x∈m} ∂C_x/∂b_l )

• How do we compute ∂C_x/∂w_1?
• Backpropagation is a fast way to compute ∂C/∂w for cost functions of any complexity.
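Not part of the original slides: a minimal sketch of this mini-batch update in Python/NumPy. The helper grad_Cx is a hypothetical function that returns (∂C_x/∂w, ∂C_x/∂b) for one training sample x; computing it efficiently is exactly what backpropagation, introduced next, provides.

    import numpy as np

    def sgd_update(w, b, batch, eta, grad_Cx):
        """One gradient-descent step over a mini-batch of m samples.

        grad_Cx(w, b, x) is assumed (hypothetical) to return (dC_dw, dC_db)
        for a single training sample x.
        """
        m = len(batch)
        dw = np.zeros_like(w)
        db = np.zeros_like(b)
        for x in batch:
            dC_dw, dC_db = grad_Cx(w, b, x)
            dw += dC_dw
            db += dC_db
        # (delta_w, delta_b) = -(eta/m) * sum of per-sample gradients
        return w - (eta / m) * dw, b - (eta / m) * db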
Backpropagation – a brief history
• The backpropagation algorithm was originally introduced in the
1970s.
• Fully appreciated by David Rumelhart, Geoffrey Hinton, and
Ronald Williams in the paper:
• David Rumelhart, Geoffrey Hinton, and Ronald Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
• Abstract: The paper describes several neural networks where
backpropagation works far faster than earlier approaches to learning,
making it possible to use neural nets to solve problems which had
previously been insoluble.
• Today, the backpropagation algorithm is the workhorse of
learning in neural networks.
Backpropagation – a brief history
• Remember that the derivative of the cost function determines how (fast) a NN learns.
Backpropagation
• Backpropagation gives an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network.
• The expression tells us how quickly the cost changes when we
change the weights and biases.
• It gives us detailed insights into how changing the weights and
biases changes the overall behavior of the network.
Notations
• We'll use w^l_{jk} to denote the weight for the connection from the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer.
• You might think that it makes more sense to use j to refer to the input neuron and k to the output.
• We will see later why the current notation makes sense.
(Figure: a three-layer network; the highlighted weight w^3_{24} connects the 4th neuron (k) of layer 2 to the 2nd neuron (j) of layer 3.)
Notations
• We use a similar notation for the network's biases and activations.
• b^l_j denotes the bias of the j-th neuron in the l-th layer.
• a^l_j denotes the activation of the j-th neuron in the l-th layer.
(Figure: a three-layer network highlighting the bias b^2_3 of the 3rd neuron in layer 2 and the activation a^3_1 of the 1st neuron in layer 3.)
Notations
• With these notations, the activation a^l_j of the j-th neuron in the l-th layer is related to the activations in the (l−1)-th layer by the following equation:

  a^l_j = σ( Σ_k w^l_{jk} a^{l−1}_k + b^l_j )

• The sum is over all neurons k in the (l−1)-th layer.
• Recall from the last lecture that

  σ(z) = 1 / (1 + e^{−z}) = 1 / (1 + e^{−(w · x + b)})

• Note how the x is replaced by a.
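As a minimal illustration (not from the slides), the per-neuron formula above can be written in Python/NumPy as follows; the example values are made up.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def activation_j(w_row_j, a_prev, b_j):
        """a^l_j = sigma( sum_k w^l_jk * a^(l-1)_k + b^l_j )."""
        z_j = np.dot(w_row_j, a_prev) + b_j
        return sigmoid(z_j)

    # Example: 3 neurons in layer l-1 feeding neuron j in layer l.
    a_prev = np.array([0.5, 0.1, 0.9])     # a^(l-1)
    w_row_j = np.array([0.2, -0.4, 0.7])   # w^l_j1, w^l_j2, w^l_j3
    b_j = 0.1
    print(activation_j(w_row_j, a_prev, b_j))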
Preparations
• Before heading into how backpropagation works, we need to know three concepts:
  1. How to model an NN using matrices
  2. What kind of cost function can be used with backpropagation
  3. The Hadamard product
P1: a matrix-based approach to represent a neural network
• We use the matrix w^l to denote the weights from layer (l−1) to layer l.
• The entries of the weight matrix w^l are the weights connecting to the l-th layer.
• The entry in the j-th row and k-th column is w^l_{jk}.
• b^l denotes all the biases in layer l.
• a^l denotes all the activations in layer l.
• For the example network (3 neurons in layer 1, 4 neurons in layer 2):

        | w^2_11  w^2_12  w^2_13 |          | a^1_1 |          | b^2_1 |
  w^2 = | w^2_21  w^2_22  w^2_23 |    a^1 = | a^1_2 |    b^2 = | b^2_2 |
        | w^2_31  w^2_32  w^2_33 |          | a^1_3 |          | b^2_3 |
        | w^2_41  w^2_42  w^2_43 |                             | b^2_4 |

(Figure: the three-layer network, with layer-2 neurons labelled a^2_j, b^2_j.)
P1: a matrix-based approach to represent a neural network
• Denote elementwise application of a function to a vector:  f( [a ; b] ) = [ f(a) ; f(b) ]
• Calculate the activations in layer 2, a^2, one neuron at a time:

  a^2_1 = σ( [ w^2_11  w^2_12  w^2_13 ] × [ a^1_1 ; a^1_2 ; a^1_3 ] + b^2_1 )
  a^2_2 = σ( [ w^2_21  w^2_22  w^2_23 ] × [ a^1_1 ; a^1_2 ; a^1_3 ] + b^2_2 )
  a^2_3 = σ( [ w^2_31  w^2_32  w^2_33 ] × [ a^1_1 ; a^1_2 ; a^1_3 ] + b^2_3 )
  a^2_4 = σ( [ w^2_41  w^2_42  w^2_43 ] × [ a^1_1 ; a^1_2 ; a^1_3 ] + b^2_4 )

• Each a^2_j is σ of the j-th row of w^2 times a^1, plus b^2_j.
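A small sketch (assuming a 4×3 weight matrix w^2, a 3-element input a^1, and random example values) showing that the row-by-row computation above agrees with a single matrix-vector product:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    w2 = rng.normal(size=(4, 3))   # w^2: 4 neurons in layer 2, 3 in layer 1
    a1 = rng.normal(size=3)        # a^1
    b2 = rng.normal(size=4)        # b^2

    # Row by row, as on the slide: a^2_j = sigma( w^2[j,:] . a^1 + b^2_j )
    a2_rows = np.array([sigmoid(w2[j] @ a1 + b2[j]) for j in range(4)])

    # The same thing as a single matrix-vector product.
    a2_matrix = sigmoid(w2 @ a1 + b2)

    assert np.allclose(a2_rows, a2_matrix)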
Summarizing it up
• The activation of layer 2 can be modeled by

  a^2 = σ( w^2 · a^1 + b^2 )

• The activation of layer l can be modeled by

  a^l = σ( w^l · a^{l−1} + b^l )

• This expression gives us a global view of how the activations in one layer relate to the activations in the previous layer:
• Apply the weight matrix to the activations, then add the bias vector, and finally apply the σ function.
Summarizing it up
• Why is w^l_{jk} a better notation than w^l_{kj}?
• Because it avoids a transpose of the weight matrix: with this ordering of the indices, a^l = σ( w^l · a^{l−1} + b^l ) holds directly.
Summarizing it up
• We can define

  z^l = w^l · a^{l−1} + b^l

• Thus

  a^l = σ( w^l · a^{l−1} + b^l ) = σ( z^l )

• Also note that

  z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j
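A minimal feedforward sketch along these lines, assuming the network is given as Python lists of NumPy weight matrices and bias vectors (these names and shapes are illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feedforward(weights, biases, x):
        """Return the lists of weighted inputs z^l and activations a^l.

        weights[i], biases[i] hold the matrix and bias vector between two
        consecutive layers; x is the input activation a^1.
        """
        zs, activations = [], [x]
        a = x
        for w, b in zip(weights, biases):
            z = w @ a + b      # z^l = w^l . a^(l-1) + b^l
            a = sigmoid(z)     # a^l = sigma(z^l)
            zs.append(z)
            activations.append(a)
        return zs, activations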
P2: assumptions of the cost function
• With the previous notation, a^l = σ( z^l ).
• The cost function of the network can be rewritten as

  C(w, b) = (1 / 2n) Σ_{x∈n} ‖ y(x) − a^L(x) ‖²

• For a single input x,

  C_x(w, b) = ½ ‖ y(x) − a^L(x) ‖² = ½ Σ_j ( y_j − a^L_j )²

  (This works because the final C(w, b) is an average of the per-input costs C_x.)
• Here a^L is the activation vector of the last layer L.
• N.B. In order for backpropagation to be applicable:
  • We must be able to compute C for each single input x as C_x.
  • C_x must depend only on the last layer a^L of the NN.
• (Notice how C and C_x depend only on a^L. To calculate a^L we need values from the previous layers of the NN, but that is not how C and C_x are defined.)
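A sketch of this quadratic cost in Python/NumPy, assuming y and a_L are NumPy vectors of the same length:

    import numpy as np

    def quadratic_cost_single(y, a_L):
        """C_x = 1/2 * || y - a^L ||^2 = 1/2 * sum_j (y_j - a^L_j)^2."""
        return 0.5 * np.sum((y - a_L) ** 2)

    # The full cost C(w, b) is the average of C_x over the n training inputs.
    def quadratic_cost(ys, a_Ls):
        return np.mean([quadratic_cost_single(y, a_L) for y, a_L in zip(ys, a_Ls)])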
P3: The Hadamard product, s ⊙ t
• Suppose s and t are two vectors of the same dimension. s ⊙ t denotes the elementwise product of the two vectors. That is,

  (s ⊙ t)_j = s_j · t_j

• Example:

  [ a ; b ] ⊙ [ c ; d ] = [ a·c ; b·d ]
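In NumPy, a possible way to express the Hadamard product is the elementwise * operator:

    import numpy as np

    s = np.array([1.0, 2.0])
    t = np.array([3.0, 4.0])
    print(s * t)   # [3. 8.]  -- (s ⊙ t)_j = s_j * t_j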
Backpropagation
• Backpropagation is about understanding how changing the weights and biases in a network changes the cost function.
• In mathematical terms, this means being able to compute the partial derivatives ∂C_x/∂w^l_{jk} and ∂C_x/∂b^l_j.
• To ease notation, we first introduce an intermediate quantity, δ^l_j, which quantifies the error in the j-th neuron of the l-th layer.
• We will see how δ^l_j relates to ∂C_x/∂w^l_{jk} and ∂C_x/∂b^l_j.
(Reminder: a^l = σ( z^l ).)
Backpropagation
(Reminder: a^l = σ( z^l ),  z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j.)
• Assume that the j-th neuron of the l-th layer has an error Δz^l_j in its weighted input z^l_j.
• So, instead of outputting σ( z^l_j ), the neuron (shown in red in the figure) outputs σ( z^l_j + Δz^l_j ).
• Finally, the cost function changes by (∂C_x/∂z^l_j) · Δz^l_j because of the error.
(Figure: layers l−1, l, l+1, …, L; the perturbed neuron in layer l feeds forward into the cost C.)
Backpropagation
(Reminder: a^l = σ( z^l ),  z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j.)
• We can observe that:
  • If ∂C_x/∂z^l_j is large, then the neuron in red may suffer a large error (∂C_x/∂z^l_j) · Δz^l_j.
  • Otherwise, if ∂C_x/∂z^l_j approaches 0, the neuron may suffer only a limited error.
• We define

  δ^l_j = ∂C_x / ∂z^l_j

(Figure: layers l−1, l, l+1, …, L; the perturbed neuron in layer l feeds forward into the cost C.)
Backpropagation – Corollary 1
(Reminders: δ^l_j = ∂C_x/∂z^l_j,  C_x = ½ Σ_j ( y_j − a^L_j )²,  a^L_j = σ( z^L_j ).)
• We first deduce δ^L_j, that is, the error of the j-th neuron of the output layer L.
• By definition,

  δ^L_j = ∂C_x/∂z^L_j = (∂C_x/∂a^L_j) · (∂a^L_j/∂z^L_j)

  (Notice how we get rid of the sum Σ_j: only the term with a^L_j depends on z^L_j.)
• Recall that a^L_j = σ( z^L_j ), so

  δ^L_j = (∂C_x/∂a^L_j) · (∂a^L_j/∂z^L_j) = (∂C_x/∂a^L_j) · σ′( z^L_j )

• In summary,

  δ^L_j = (∂C_x/∂a^L_j) · σ′( z^L_j )

(Figure: output layer L with activations a^L_j = σ( z^L_j ) and targets y_j.)
Interpretation of Corollary 1:  δ^L_j = (∂C_x/∂a^L_j) · σ′( z^L_j )
• The first term, ∂C_x/∂a^L_j, measures how fast the cost is changing as a function of the j-th output activation.
• The second term, σ′( z^L_j ), measures how fast the activation function σ is changing at z^L_j.
• Given C_x = ½ Σ_j ( y_j − a^L_j )²,  we get  ∂C_x/∂a^L_j = a^L_j − y_j.
• Given σ( z^L_j ) = 1 / (1 + e^{−z^L_j}),  we get  σ′( z^L_j ) = σ( z^L_j ) · ( 1 − σ( z^L_j ) ).
• Given z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j,  we get

  ∂C_x/∂w^L_{jk} = δ^L_j · a^{L−1}_k

(Figure: output layer L with activations a^L_j = σ( z^L_j ) and targets y_j.)
Interpretation of Corollary 1:  δ^L_j = (∂C_x/∂a^L_j) · σ′( z^L_j )
• You can calculate ∂C_x/∂b^l_j following the same principle; for the output layer this gives ∂C_x/∂b^L_j = δ^L_j.
• That's how δ^L_j relates to ∂C_x/∂w^l_{jk} and ∂C_x/∂b^l_j.
Interpretation of Corollary 1:  δ^L_j = (∂C_x/∂a^L_j) · σ′( z^L_j )
• Notice that Corollary 1 is a component-wise expression, one δ^L_j per z^L_j:

  δ^L_1 = (∂C_x/∂a^L_1) · σ′( z^L_1 ),  δ^L_2 = (∂C_x/∂a^L_2) · σ′( z^L_2 ),  …,  δ^L_j = (∂C_x/∂a^L_j) · σ′( z^L_j )

• If we define ∇_a C_x = ( ∂C_x/∂a^L_1, ∂C_x/∂a^L_2, …, ∂C_x/∂a^L_j ) and σ′( z^L ) = ( σ′( z^L_1 ), σ′( z^L_2 ), …, σ′( z^L_j ) ), we have

  δ^L = ∇_a C_x ⊙ σ′( z^L )

• Further, if we use the quadratic cost function, we have

  δ^L = ( a^L − y ) ⊙ σ′( z^L )
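A minimal sketch (not from the slides) that puts Corollary 1 together with the weight and bias gradients stated above, for the output layer under the quadratic cost; a_prev stands for a^{L−1}, and all names and shapes are assumed for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1.0 - sigmoid(z))

    def output_layer_gradients(z_L, a_L, y, a_prev):
        """Corollary 1 with the quadratic cost.

        delta^L = (a^L - y) ⊙ sigma'(z^L)
        dC_x/dw^L_jk = delta^L_j * a^(L-1)_k,  dC_x/db^L_j = delta^L_j
        """
        delta_L = (a_L - y) * sigmoid_prime(z_L)   # Hadamard product
        dC_dw_L = np.outer(delta_L, a_prev)        # same shape as w^L
        dC_db_L = delta_L
        return delta_L, dC_dw_L, dC_db_L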
Questions?
