1.1.2 Backpropagation
Backpropagation
Daniel Manrique
2022
(Image source: The Neural Network Zoo, http://www.asimovinstitute.org/neural-network-zoo/)
The artificial neuron for FFNN

[Figure: an artificial neuron with inputs x1, …, xnx, weights w1, …, wnx, bias b, the net input (Net), and the activation function that produces the output y.]

net = Σi xi·wi + b,  i = 1, …, nx

y = f(net) = f(Σi xi·wi + b) = f(Wx + b),  Net = Wx + b

where W is a 1×nx weight (row) vector and x an nx×1 input (column) vector.
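A minimal NumPy sketch of this computation; the logistic activation used in the example is just one possible choice of f (non-linear activations are discussed below).

```python
import numpy as np

def neuron_forward(x, w, b, f):
    """Artificial neuron: net = sum_i x_i*w_i + b = W x + b, output y = f(net)."""
    net = np.dot(w, x) + b           # W is a 1 x n_x row vector, x an n_x x 1 column
    return f(net)

# Example usage (the logistic sigmoid here is just one possible activation f):
x = np.array([0.5, -1.0, 2.0])       # n_x = 3 inputs
w = np.array([0.1, 0.4, -0.3])       # one weight per input
b = 0.2                               # bias
y = neuron_forward(x, w, b, lambda net: 1.0 / (1.0 + np.exp(-net)))
```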
Feedforward neural network dynamics

[Figure: a 2-3-2-1 network: inputs x1 and x2, a first hidden layer with units 1–3 (activation f[1]), a second hidden layer with units 4–5 (activation f[2]), and one output unit 6 (activation f[3]). The outputs propagate forward layer by layer:]

y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
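A compact NumPy sketch of these dynamics, written layer by layer; the 2-3-2-1 shapes, the random weights, and the tanh activations are illustrative assumptions chosen to mirror the figure.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Layer-by-layer feedforward pass: a <- f[l](W[l] a + b[l])."""
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W @ a + b)             # row i of W holds the weights of neuron i
    return a

# Illustrative 2-3-2-1 network (shapes mirror the figure, values are arbitrary):
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [np.zeros(3), np.zeros(2), np.zeros(1)]
y = forward(np.array([0.5, -1.0]), weights, biases, [np.tanh, np.tanh, np.tanh])
```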
Linear activation function

- The linear activation function is the identity function:

  y = f(Wx + b) = Wx + b

- For each unit j:

  netj = Σi xi·wji + b,  i = 1, …, nx

  yj = f(netj) = Σi xi·wji + b

[Figure: plot of the identity activation function, f(x) = x.]
(Source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
Non-linear activation functions

- Linear activation functions cannot build more than one-layer neural networks. Therefore, linear models cannot be deep.
- The linear activation function may still be used in the output layer, combined with non-linear activation functions in the hidden layers.
- Non-linear activation functions are therefore required.
- The logistic sigmoid and tanh activation functions are the classical choices: their derivatives are computationally cheap (the function itself appears in its own derivative), they come from classical statistical logistic regression, and they are biologically plausible.

Logistic sigmoid: y = 1/(1 + e^(–x)), range [0, 1], derivative y' = y·(1 – y)
Hyperbolic tangent: y = tanh(x), range [–1, 1], derivative y' = 1 – y²

[Figure: plots of the logistic sigmoid and tanh activation functions.]
(Source: https://qph.ec.quoracdn.net/main-qimg-05edc1873d0103e36064862a45566dba)
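A minimal NumPy sketch of both activations and their derivatives, showing how each derivative reuses the function value it has already computed.

```python
import numpy as np

def logistic(x):
    """Logistic sigmoid; output in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_prime(x):
    """y' = y * (1 - y): the derivative reuses the function value."""
    y = logistic(x)
    return y * (1.0 - y)

def tanh_prime(x):
    """y' = 1 - y**2: again the derivative reuses the function value."""
    y = np.tanh(x)                   # output in [-1, 1]
    return 1.0 - y ** 2
```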
Softmax activation function

[Figure: a single-layer network with inputs x1, …, xnx, weights w1,1, …, wny,nx, biases b1, …, bny, and nets n1, …, nny feeding a Softmax output that produces y1, …, yny.]

- The logits are the nets feeding the Softmax activation function.

yi = e^(ni) / Σj e^(nj)
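A short NumPy sketch of the Softmax over a vector of logits; subtracting the maximum logit is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Softmax over the logits (the nets feeding the output layer)."""
    z = logits - np.max(logits)      # stabilises the exponentials; result unchanged
    e = np.exp(z)
    return e / np.sum(e)             # y_i = e^(n_i) / sum_j e^(n_j)

# Example: softmax(np.array([1.0, 0.5, 2.0])) returns probabilities summing to 1.
```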
One hidden layer neural networks

[Figure: a one-hidden-layer network. Each input example x(p), one of the m examples x(1), …, x(m), feeds the hidden layer [1] (activation f) and then the output layer [2] (Softmax):]

a(p) = f(W[1]·x(p) + b[1])
y(p) = Softmax(W[2]·a(p) + b[2])
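A minimal NumPy sketch of this forward pass for a single example; the tanh hidden activation is only an illustrative choice, and the Softmax is the stable version sketched above.

```python
import numpy as np

def one_hidden_layer_forward(x, W1, b1, W2, b2, f=np.tanh):
    """y = Softmax(W[2] · f(W[1] x + b[1]) + b[2]) for one input example x."""
    a = f(W1 @ x + b1)               # hidden layer [1]
    logits = W2 @ a + b2             # nets (logits) of the output layer [2]
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e)             # Softmax output y
```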
Cost and loss functions for regression problems

- We adopt the following convention:
  - The loss function, ℒ(W), compares, for each input example x(p), the computed output y(p) with the target t(p). It is a local error, e.g., the SSE:

    ℒ(W) = E(p)(W) = ½ Σk (tk(p) – yk(p))²,  k = 1, …, ny
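A one-line NumPy version of this per-example SSE loss.

```python
import numpy as np

def sse_loss(y, t):
    """Per-example SSE: 0.5 * sum_k (t_k - y_k)**2."""
    return 0.5 * np.sum((t - y) ** 2)
```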
Back-propagation algorithm (1)

1. Feed-forward.
- Feed a particular training example x(p) to the network input and move forward to calculate all the neuron nets (logits) and outputs:

  neti(p) = Σj wij·yj(p) ;  yi(p) = f(neti(p))
Step by step
1. Feed-forward

[Figure: the same 2-3-2-1 network, with the feed-forward computation written next to each unit:]

y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
Back-propagation algorithm (2)

Loss function: ℒ(W) = E(p)(W) = ½ Σk (tk(p) – yk(p))²,  k = 1, …, ny

2. Gradient back-propagation.
- Calculate δi(p) for the output layer:

  δi(p) = (ti(p) – yi(p))·f'(neti(p)).  If neti(p) is high, f'(neti(p)) saturates and δi(p) → 0.

- Calculate δi(p) for the layer preceding the output layer, right back to the input neurons, using the recursive formula:

  δi(p) = f'(neti(p))·Σk δk(p)·wki

  Note that if δk(p) → 0 and neti(p) is high, then δi(p) ≈ 0 (the vanishing-gradient problem in deep learning).

- The index k runs over all the nodes to which the i-th neuron is connected; the index j runs over all the nodes in the previous layer.
Step by step
2. Gradient back-propagation

[Figure: the same network, now annotated with the deltas flowing backwards from the output unit (δ6) through the second hidden layer (δ4, δ5) to the first hidden layer (δ1, δ2, δ3):]

δ5 = δ6·w6,5·f[2]'(n5)
δ1 = f[1]'(n1)·(δ4·w4,1 + δ5·w5,1)
δ2 = f[1]'(n2)·(δ4·w4,2 + δ5·w5,2)
δ3 = f[1]'(n3)·(δ4·w4,3 + δ5·w5,3)
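A minimal NumPy sketch of step 2, vectorised per layer; the assumed matrix layout, W_next[k, i] = w_k,i (the weight from unit i to unit k of the following layer), is a convention chosen for the sketch.

```python
import numpy as np

def output_deltas(t, y, net, f_prime):
    """Output layer: delta_i = (t_i - y_i) * f'(net_i)."""
    return (t - y) * f_prime(net)

def hidden_deltas(net, W_next, delta_next, f_prime):
    """Hidden layer (recursive step): delta_i = f'(net_i) * sum_k delta_k * w_ki.
    W_next[k, i] holds w_k,i, the weight from unit i to unit k of the next layer."""
    return f_prime(net) * (W_next.T @ delta_next)
```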
Back-propagation algorithm (3)

3. Weight update.
- Calculate the weight increments of the connections making up the neural network:

  Δ(p)wij = α·δi(p)·yj(p)

- Update the values of the network weights:
  1. Online or stochastic updating. Weights are updated for each single example x(p):

     w'ij = wij + Δ(p)wij

  2. Batch or mini-batch updating. The single weight increments are accumulated over the whole training dataset (batch) or over a subset, the mini-batch, of size m, and the update is then applied:

     w'ij = wij + Σp Δ(p)wij
Step by step
3. Weight update

[Figure: the same network, with the weight updates written next to each unit (α is the learning rate):]

W'3,0 = W3,0 + α·δ3·1 ;  W'3,x1 = W3,x1 + α·δ3·x1 ;  W'3,x2 = W3,x2 + α·δ3·x2
W'4,0 = W4,0 + α·δ4·1 ;  W'4,1 = W4,1 + α·δ4·y1 ;  W'4,2 = W4,2 + α·δ4·y2 ;  W'4,3 = W4,3 + α·δ4·y3
W'5,0 = W5,0 + α·δ5·1 ;  W'5,1 = W5,1 + α·δ5·y1 ;  W'5,2 = W5,2 + α·δ5·y2 ;  W'5,3 = W5,3 + α·δ5·y3
W'6,0 = W6,0 + α·δ6·1 ;  W'6,4 = W6,4 + α·δ6·y4 ;  W'6,5 = W6,5 + α·δ6·y5
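A minimal NumPy sketch of step 3 for one layer, using online updating; the learning-rate value is an arbitrary example, and for (mini-)batch updating the outer products would be accumulated over the examples before being added.

```python
import numpy as np

def update_layer(W, b, delta, y_prev, alpha=0.1):
    """Online update for one layer: w'_ij = w_ij + alpha * delta_i * y_j,
    and b'_i = b_i + alpha * delta_i (the bias input is the constant 1)."""
    W_new = W + alpha * np.outer(delta, y_prev)   # one increment per connection
    b_new = b + alpha * delta
    return W_new, b_new
```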
Stop conditions or criteria

1. The training cost function, e.g., the MSE, falls below a threshold.
2. The training error does not improve after some epochs.
3. A maximum number of epochs or iterations is reached.

- A (learning) iteration is one update of the weights during training.
- An epoch means presenting the entire training dataset to the network once.
- Following these definitions, with online updating an epoch consists of P iterations, where P is the number of training examples.
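A schematic sketch of how the three criteria might be combined in a training loop; the function names, threshold, and patience values are illustrative assumptions, not part of the original slides.

```python
def train(run_one_epoch, training_cost, max_epochs=1000, tol=1e-3, patience=10):
    """Schematic training loop combining the three stop criteria.

    run_one_epoch()  -- presents the whole training set once, updating the weights
    training_cost()  -- returns the current training cost (e.g. the MSE)
    """
    best_cost = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):                  # criterion 3: epoch budget
        run_one_epoch()
        cost = training_cost()
        if cost < tol:                               # criterion 1: cost below threshold
            break
        if cost < best_cost:
            best_cost = cost
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:   # criterion 2: no improvement
            break
```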
Overfitting

[Figure: cost (MSE) learning curves; the dev cost function (MSE) increases as training continues (overfitting). From Hands-On Machine Learning with Scikit-Learn and TensorFlow, Géron, 2017.]
Fair error measure for classification problems
Log-loss & Cross-entropy

- For each input example x, the output vector y provided by the Softmax function, S(n) = y, is compared to the one-hot encoded target vector t.

(See https://jamesmccaffrey.wordpress.com/2016/09/25/log-loss-and-cross-entropy-are-almost-the-same/)
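A minimal NumPy sketch of this comparison as a cross-entropy (log-loss); the small eps added inside the logarithm is a numerical-safety assumption, not part of the slides.

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Cross-entropy between the Softmax output y and the one-hot target t:
    L = -sum_k t_k * log(y_k); with a one-hot t this reduces to -log(y_c)
    for the correct class c, i.e. the log-loss."""
    return -np.sum(t * np.log(y + eps))

# Example: cross_entropy(np.array([0.1, 0.2, 0.7]), np.array([0.0, 0.0, 1.0]))
```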