
Course: Deep Learning

Unit 1: Training Methods for Deep Neural Networks

Backpropagation

Daniel Manrique
2022

This work is licensed under a Creative Commons


Attribution-NonCommercial-ShareAlike 4.0 International License:
http://creativecommons.org/licenses/by-nc-sa/4.0/
Deep Learning
Artificial neural networks and deep learning

1. Case study & software tools.


2. Backpropagation.
3. Deep Neural Networks.
4. Generative Adversarial Networks.

http://www.asimovinstitute.org/neural-network-zoo/
The artificial neuron for FFNN

[Figure: a single neuron with inputs x1 … xnx, weights w1 … wnx, bias b, the Net, and the activation function f producing the output y.]

Scalar model:
net = Σ_{i=1}^{nx} xi·wi + b
y = f( Σ_{i=1}^{nx} xi·wi + b )

Matrix model (x is an nx × 1 vector):
Net = Wx + b
y = f( Wx + b )
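The matrix model above can be sketched in a few lines of NumPy; the shapes and variable names here are illustrative choices, not part of the slides:

```python
import numpy as np

# Matrix model of one layer of artificial neurons: Net = W x + b, y = f(Net).
# Shapes follow the slide: x is (n_x, 1), W is (n_y, n_x), b is (n_y, 1).
rng = np.random.default_rng(0)
n_x, n_y = 3, 2
x = rng.normal(size=(n_x, 1))
W = rng.normal(size=(n_y, n_x))
b = np.zeros((n_y, 1))

net = W @ x + b          # the "Net" (logit) of each neuron
y = np.tanh(net)         # any activation function f, applied element-wise
print(y.shape)           # (2, 1): one output per neuron
```

The same code covers the scalar model when n_y = 1.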
Feedforward Neural network dynamics

[Figure: a 2-3-2-1 feedforward network; the outputs are computed layer by layer.]

Hidden layer [1]:
y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y3 = f[1](w3,0·1 + w3,x1·x1 + w3,x2·x2)

Hidden layer [2]:
y4 = f[2](w4,0·1 + w4,1·y1 + w4,2·y2 + w4,3·y3)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)

Output layer [3]:
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
Linear activation function

- The linear activation function is the identity function:
  y = f( Wx + b ) = Wx + b

For each unit k:
y_k = f( Σ_{i=1}^{nx} xi·w_{k,i} + b ) = Σ_{i=1}^{nx} xi·w_{k,i} + b
net_k = Σ_{i=1}^{nx} xi·w_{k,i} + b

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
Non-linear activation functions

- Linear activation functions cannot build neural networks with more than one layer: a stack of linear layers collapses into a single linear map, so linear models cannot be deep.
- The linear activation function may be used in the output layer, though, combined with non-linear activation functions in the hidden layers.
- Non-linear activation functions are therefore required.
- The logistic sigmoid and tanh activation functions are the classical choices because their derivatives are computationally cheap (the function value itself appears in its own derivative), they come from classical statistical logistic regression, and they are biologically plausible.

Logistic sigmoid, range [0, 1]:  y' = y·(1 − y)
Tanh, range [-1, 1]:  y' = 1 − y²

https://qph.ec.quoracdn.net/main-qimg-05edc1873d0103e36064862a45566dba
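The "cheap derivative" property can be checked numerically; this is a small illustrative sketch (function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(y):
    # derivative expressed through the function value itself: y' = y(1 - y)
    return y * (1.0 - y)

def tanh_prime(y):
    # same trick for tanh: y' = 1 - y^2
    return 1.0 - y**2

x = np.linspace(-3, 3, 7)
y = sigmoid(x)
# compare against a central-difference numerical derivative
h = 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(sigmoid_prime(y), numerical, atol=1e-6))  # True
```

No extra exponentials are needed once y is known, which is what makes back-propagation through these units cheap.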
Softmax activation function

[Figure: a layer of units n1 … n_ny with weights w and biases b feeding the Softmax function, which outputs y1 … y_ny. The logits are the nets feeding the Softmax activation function.]

y_i = S(n_i) = e^{n_i} / Σ_j e^{n_j}

Logits → Softmax probabilities (three-class example):
n1 = 0.5 → y1 = 0.14  (Cheap ❌)
n2 = 1.0 → y2 = 0.23  (Averaged ❌)
n3 = 2.0 → y3 = 0.63  (Expensive ✅)
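The worked example above can be reproduced with a short sketch (the stability trick of subtracting the maximum logit is standard practice, not something the slides mention):

```python
import numpy as np

def softmax(logits):
    # subtract the max logit for numerical stability; the result is unchanged
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.5, 1.0, 2.0])   # the n_i from the slide
probs = softmax(logits)
print(np.round(probs, 2))            # [0.14 0.23 0.63]
```

The outputs sum to 1, so they can be read as class probabilities; the largest logit (2.0) wins.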
One hidden layer neural networks

[Figure: each dataset example x(p) enters the hidden layer [1], which computes a(p) = f(W[1]x(p) + b[1]) with logistic/tanh f; the output layer [2] computes the logits W[2]·a(p) + b[2] and applies Softmax to produce y(p).]

- This model can approximate non-linear functions:
- The first layer consists of weights W[1] and biases b[1] applied to each input x(p) and passed through sigmoid / tanh activation functions f. The output of this layer, a(p), is fed to the next one but is not observable outside the network. Hence, it is known as a hidden layer.
- The second layer consists of the kernel W[2] and biases b[2] applied to these intermediate outputs a(p), followed by the softmax function to generate probabilities y(p). NOTE that the output layer can also employ the linear activation function for regression problems.
Cost and loss functions for regression problems

- We adopt the following convention:
- The loss function, ℒ(W), compares, for each input example x(p), the computed output y(p) and the target t(p). It is a local error, e.g., SSE:

  ℒ(W) = E^(p)(W) = ½ Σ_{k=1}^{ny} ( t_k^(p) − y_k^(p) )²

- The cost function, J(W), involves the whole dataset of size m. It is a global error, e.g., MSE:

  J(W) = MSE(W) = (1/m) Σ_{p=1}^{m} E^(p)(W)
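A minimal sketch of these two definitions, with made-up targets and outputs for two examples:

```python
import numpy as np

def sse_loss(t, y):
    # per-example loss: L = 1/2 * sum_k (t_k - y_k)^2
    return 0.5 * np.sum((t - y) ** 2)

def mse_cost(T, Y):
    # cost over the whole dataset of m examples: J = (1/m) * sum_p E^(p)
    m = T.shape[0]
    return sum(sse_loss(T[p], Y[p]) for p in range(m)) / m

T = np.array([[1.0, 0.0], [0.0, 1.0]])   # targets t^(p)
Y = np.array([[0.8, 0.1], [0.3, 0.6]])   # computed outputs y^(p)
print(mse_cost(T, Y))                     # 0.075
```

The loss is local (one example); the cost averages those losses over the dataset or mini-batch.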
Error gradient back-propagation learning algorithm

- It is the generalization of the delta rule, extended to networks with hidden layers whose neurons have non-linear activation functions that are differentiable and non-decreasing.
- Three major steps:
  1. The feedforward step calculates the output y(p) from an input training example x(p).
  2. Error gradient back-propagation, after comparing the target t(p) with the computed output y(p).
  3. Weights update.
- These steps involve intensive vector/matrix computation.
- Back-propagation needs roughly twice the memory and three times the computational resources of the forward pass.
Back-propagation algorithm (1)

1. Feedforward.
- Feed a particular training example x(p) to the network input and move forward to calculate all the neuron logits and outputs:

  net_i^(p) = Σ_j w_ij · y_j^(p) ;  y_i^(p) = f( net_i^(p) )
Step by step
1. Feedforward

[Figure: the 2-3-2-1 network from slide 5, computed layer by layer.]
y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y3 = f[1](w3,0·1 + w3,x1·x1 + w3,x2·x2)
y4 = f[2](w4,0·1 + w4,1·y1 + w4,2·y2 + w4,3·y3)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
Back-propagation algorithm (2)

Loss function ℒ:  E^(p)(W) = ½ Σ_{k=1}^{ny} ( t_k^(p) − y_k^(p) )²

2. Gradient back-propagation.
- Calculate δi(p) for the output layer:
  δi(p) = (ti(p) − yi(p)) · f'(neti(p)).  If |neti(p)| is high, f' saturates and δi(p)→0.
- Calculate δi(p) for the layer preceding the output layer, right back to the input neurons, using the recursive formula:
  δi(p) = f'(neti(p)) · Σ_k δk(p) · wki
  Note that if δk(p)→0 and |neti(p)| is high, then δi(p)≈0. (This vanishing of gradients is a core problem of deep learning.)
- The index k runs over all the nodes to which the i-th neuron is connected. The index j runs over all the nodes in the previous layer.
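A small sketch of the two delta formulas for one output layer and one hidden layer; the weights, targets, and outputs here are made-up numbers for illustration, and tanh stands in for a generic differentiable f:

```python
import numpy as np

def tanh_prime(net):
    # f'(net) for f = tanh, expressed directly in terms of net
    return 1.0 - np.tanh(net) ** 2

# Output-layer deltas: delta_i = (t_i - y_i) * f'(net_i)
t = np.array([1.0, 0.0])                 # targets
y = np.array([0.7, 0.2])                 # computed outputs, y = tanh(net)
net_out = np.arctanh(y)                  # the logits that produced y
delta_out = (t - y) * tanh_prime(net_out)

# Hidden-layer deltas via the recursive formula:
# delta_i = f'(net_i) * sum_k delta_k * w_ki
W_out = np.array([[0.5, -0.3, 0.8],      # w_ki: row k (output neuron), col i (hidden neuron)
                  [0.1,  0.4, -0.2]])
net_hidden = np.array([0.2, -0.1, 0.3])
delta_hidden = tanh_prime(net_hidden) * (W_out.T @ delta_out)
print(delta_hidden.shape)                # (3,): one delta per hidden neuron
```

Note how each hidden delta gathers the deltas of the next layer weighted by the connections w_ki, then scales by the local derivative.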
Step by Step
2. Gradient Back-propagation

[Figure: the deltas propagated backwards through the 2-3-2-1 network.]
δ6 = (t − y) · f[3]'(n6)
δ4 = δ6 · w6,4 · f[2]'(n4)
δ5 = δ6 · w6,5 · f[2]'(n5)
δ1 = f[1]'(n1) · (δ4·w4,1 + δ5·w5,1)
δ2 = f[1]'(n2) · (δ4·w4,2 + δ5·w5,2)
δ3 = f[1]'(n3) · (δ4·w4,3 + δ5·w5,3)
Back-propagation algorithm (3)

3. Weights update.
- Calculate the weight increments of the connections making up the neural network:
  Δ^(p)wij = α · δi(p) · yj(p)
- Update the values of the network weights:
  1. Online or stochastic updating. Weights are updated for each single example x(p):
     w'ij = wij + Δ^(p)wij
  2. Batch or mini-batch updating. The single weight increments are accumulated over the whole training dataset (batch) or over a subset, the mini-batch, of size m; the update is then applied:
     w'ij = wij + Σ_p Δ^(p)wij
- Given the training, dev, or testing dataset, or a mini-batch of size m, a usual cost function J for regression problems is the mean square error, MSE = (1/m) Σ_{p=1}^{m} E^(p).
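Both update modes can be sketched for one layer; the learning rate, deltas, and previous-layer outputs below are illustrative values, not from the slides:

```python
import numpy as np

alpha = 0.1                               # learning rate

# One stochastic update: Delta w_ij = alpha * delta_i * y_j
delta = np.array([0.2, -0.1])             # deltas of this layer's neurons (index i)
y_prev = np.array([0.5, 0.3, 0.8])        # outputs of the previous layer (index j)
W = np.zeros((2, 3))
W += alpha * np.outer(delta, y_prev)      # online update after a single example

# Mini-batch update: accumulate the increments over m examples, apply once.
# Here the m examples are identical only to keep the sketch short.
m = 4
increments = [alpha * np.outer(delta, y_prev) for _ in range(m)]
W_batch = np.zeros((2, 3)) + np.sum(increments, axis=0)
print(W.shape, W_batch.shape)             # (2, 3) (2, 3)
```

`np.outer(delta, y_prev)` builds the full increment matrix Δwij = α·δi·yj in one call.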
Step by step
3. Stochastic weights update

[Figure: every weight of the 2-3-2-1 network updated with its own delta.]
w'1,0 = w1,0 + α·δ1·1 ;  w'1,x1 = w1,x1 + α·δ1·x1 ;  w'1,x2 = w1,x2 + α·δ1·x2
w'2,0 = w2,0 + α·δ2·1 ;  w'2,x1 = w2,x1 + α·δ2·x1 ;  w'2,x2 = w2,x2 + α·δ2·x2
w'3,0 = w3,0 + α·δ3·1 ;  w'3,x1 = w3,x1 + α·δ3·x1 ;  w'3,x2 = w3,x2 + α·δ3·x2
w'4,0 = w4,0 + α·δ4·1 ;  w'4,1 = w4,1 + α·δ4·y1 ;  w'4,2 = w4,2 + α·δ4·y2 ;  w'4,3 = w4,3 + α·δ4·y3
w'5,0 = w5,0 + α·δ5·1 ;  w'5,1 = w5,1 + α·δ5·y1 ;  w'5,2 = w5,2 + α·δ5·y2 ;  w'5,3 = w5,3 + α·δ5·y3
w'6,0 = w6,0 + α·δ6·1 ;  w'6,4 = w6,4 + α·δ6·y4 ;  w'6,5 = w6,5 + α·δ6·y5
Stop conditions or criteria

1. The training cost function, e.g., MSE, falls below a threshold.
2. The training error does not improve after some epochs.
3. A maximum number of epochs or iterations is reached.
- A (learning) iteration is one update of the weights during training.
- An epoch means presenting the entire training dataset to the network.
- Following these definitions, with stochastic weight updates an epoch consists of P iterations for a P-example training dataset.
- With batch updates, an iteration equals an epoch.
Overfitting

The dev cost function (MSE) starts to increase while the training cost keeps decreasing: the network is overfitting.

[Figure: training and dev MSE curves over training time. From Hands-On Machine Learning with Scikit-Learn and TensorFlow, Géron 2017.]
Fair error measure for classification problems

- The MSE calculates the error by comparing each computed output with its corresponding desired target output.
- In classification problems, where the winner takes all, what matters is not the exact value of each output but whether the maximum corresponds to the right class.
Log-loss & Cross-entropy

- For each input example x, the output vector y provided by the Softmax function, S(n) = y, is compared to the one-hot encoded target vector t.
  https://jamesmccaffrey.wordpress.com/2016/09/25/log-loss-and-cross-entropy-are-almost-the-same/

Example: y = (0.14, 0.23, 0.63), t = (0, 0, 1) (the "Expensive" class).

Cross-entropy loss for one example:
  ℒ^(p)(y, t) = − Σ_{k=1}^{ny} t_k · log(y_k)     Note that ℒ(y, t) ≠ ℒ(t, y).

Log-loss over all training examples:
  J(W) = (1/m) Σ_{p=1}^{m} ℒ^(p) ;  ΔW = −α·∇J(W)

Back-propagation must then use the derivative of the log-loss instead of the quadratic error.
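The worked example can be checked with a short sketch (the small epsilon guarding log(0) is a standard implementation detail, not from the slides):

```python
import numpy as np

def cross_entropy(y, t):
    # L(y, t) = -sum_k t_k * log(y_k); note L(y, t) != L(t, y)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(t * np.log(y + eps))

y = np.array([0.14, 0.23, 0.63])     # softmax output from the slide
t = np.array([0.0, 0.0, 1.0])        # one-hot target
print(round(cross_entropy(y, t), 3)) # 0.462, i.e. -log(0.63)
```

With a one-hot target, only the probability assigned to the correct class contributes to the loss.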
Lecture slides of master course “Deep Learning”.
2022 Daniel Manrique

Suggested work citation:


D. Manrique. (2022). Training Methods for Deep Neural Networks. Backpropagation.
Master course (lecture slides). Department of Artificial Intelligence. Universidad
Politécnica de Madrid.

