
Course: Deep Learning

Unit 1: Training Methods for Deep Neural Networks

Backpropagation

Daniel Manrique
2022

This work is licensed under a Creative Commons


Attribution-NonCommercial-ShareAlike 4.0 International License:
http://creativecommons.org/licenses/by-nc-sa/4.0/
Deep Learning
Artificial neural networks and deep learning

1. Case study & software tools.


2. Backpropagation.
3. Deep Neural Networks.
4. Generative Adversarial Networks.

http://www.asimovinstitute.org/neural-network-zoo/
The artificial neuron for FFNN

[Figure: a single neuron with inputs x1 … xnx, weights w1 … wnx, bias b, the Net, and the activation function f producing the output y.]

Scalar model:
net = Σ_{i=1}^{nx} xi·wi + b
y = f( Σ_{i=1}^{nx} xi·wi + b )

Matrix model (x is an nx × 1 vector):
Net = Wx + b
y = f( Wx + b )
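The matrix model above can be sketched in a few lines of NumPy; the shapes and variable names here are illustrative choices, not part of the slides:

```python
import numpy as np

# Matrix model of one layer of artificial neurons: Net = W x + b, y = f(Net).
# Shapes follow the slide: x is (n_x, 1), W is (n_y, n_x), b is (n_y, 1).
rng = np.random.default_rng(0)
n_x, n_y = 3, 2
x = rng.normal(size=(n_x, 1))
W = rng.normal(size=(n_y, n_x))
b = np.zeros((n_y, 1))

net = W @ x + b          # the "Net" (logit) of each neuron
y = np.tanh(net)         # any activation function f, applied element-wise
print(y.shape)           # (2, 1): one output per neuron
```

The same code covers the scalar model when n_y = 1.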
Feedforward Neural network dynamics

[Figure: a 2-3-2-1 feedforward network; the outputs are computed layer by layer.]

Hidden layer [1]:
y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y3 = f[1](w3,0·1 + w3,x1·x1 + w3,x2·x2)

Hidden layer [2]:
y4 = f[2](w4,0·1 + w4,1·y1 + w4,2·y2 + w4,3·y3)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)

Output layer [3]:
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
Linear activation function

- The linear activation function is the identity function:
  y = f( Wx + b ) = Wx + b

For each unit k:
y_k = f( Σ_{i=1}^{nx} xi·w_{k,i} + b ) = Σ_{i=1}^{nx} xi·w_{k,i} + b
net_k = Σ_{i=1}^{nx} xi·w_{k,i} + b

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
Non-linear activation functions

- Linear activation functions cannot build neural networks with more than one layer: a stack of linear layers collapses into a single linear map, so linear models cannot be deep.
- The linear activation function may be used in the output layer, though, combined with non-linear activation functions in the hidden layers.
- Non-linear activation functions are therefore required.
- The logistic sigmoid and tanh activation functions are the classical choices because their derivatives are computationally cheap (the function value itself appears in its own derivative), they come from classical statistical logistic regression, and they are biologically plausible.

Logistic sigmoid, range [0, 1]:  y' = y·(1 − y)
Tanh, range [-1, 1]:  y' = 1 − y²

https://qph.ec.quoracdn.net/main-qimg-05edc1873d0103e36064862a45566dba
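The "cheap derivative" property can be checked numerically; this is a small illustrative sketch (function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(y):
    # derivative expressed through the function value itself: y' = y(1 - y)
    return y * (1.0 - y)

def tanh_prime(y):
    # same trick for tanh: y' = 1 - y^2
    return 1.0 - y**2

x = np.linspace(-3, 3, 7)
y = sigmoid(x)
# compare against a central-difference numerical derivative
h = 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(sigmoid_prime(y), numerical, atol=1e-6))  # True
```

No extra exponentials are needed once y is known, which is what makes back-propagation through these units cheap.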
Softmax activation function

[Figure: a layer of units n1 … n_ny with weights w and biases b feeding the Softmax function, which outputs y1 … y_ny. The logits are the nets feeding the Softmax activation function.]

y_i = S(n_i) = e^{n_i} / Σ_j e^{n_j}

Logits → Softmax probabilities (three-class example):
n1 = 0.5 → y1 = 0.14  (Cheap ❌)
n2 = 1.0 → y2 = 0.23  (Averaged ❌)
n3 = 2.0 → y3 = 0.63  (Expensive ✅)
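The worked example above can be reproduced with a short sketch (the stability trick of subtracting the maximum logit is standard practice, not something the slides mention):

```python
import numpy as np

def softmax(logits):
    # subtract the max logit for numerical stability; the result is unchanged
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.5, 1.0, 2.0])   # the n_i from the slide
probs = softmax(logits)
print(np.round(probs, 2))            # [0.14 0.23 0.63]
```

The outputs sum to 1, so they can be read as class probabilities; the largest logit (2.0) wins.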
One hidden layer neural networks

[Figure: each dataset example x(p) enters the hidden layer [1], which computes a(p) = f(W[1]x(p) + b[1]) with logistic/tanh f; the output layer [2] computes the logits W[2]·a(p) + b[2] and applies Softmax to produce y(p).]

- This model can approximate non-linear functions:
- The first layer consists of weights W[1] and biases b[1] applied to each input x(p) and passed through sigmoid / tanh activation functions f. The output of this layer, a(p), is fed to the next one but is not observable outside the network. Hence, it is known as a hidden layer.
- The second layer consists of the kernel W[2] and biases b[2] applied to these intermediate outputs a(p), followed by the softmax function to generate probabilities y(p). NOTE that the output layer can also employ the linear activation function for regression problems.
Cost and loss functions for regression problems

- We adopt the following convention:
- The loss function, ℒ(W), compares, for each input example x(p), the computed output y(p) and the target t(p). It is a local error, e.g., SSE:

  ℒ(W) = E^(p)(W) = ½ Σ_{k=1}^{ny} ( t_k^(p) − y_k^(p) )²

- The cost function, J(W), involves the whole dataset of size m. It is a global error, e.g., MSE:

  J(W) = MSE(W) = (1/m) Σ_{p=1}^{m} E^(p)(W)
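A minimal sketch of these two definitions, with made-up targets and outputs for two examples:

```python
import numpy as np

def sse_loss(t, y):
    # per-example loss: L = 1/2 * sum_k (t_k - y_k)^2
    return 0.5 * np.sum((t - y) ** 2)

def mse_cost(T, Y):
    # cost over the whole dataset of m examples: J = (1/m) * sum_p E^(p)
    m = T.shape[0]
    return sum(sse_loss(T[p], Y[p]) for p in range(m)) / m

T = np.array([[1.0, 0.0], [0.0, 1.0]])   # targets t^(p)
Y = np.array([[0.8, 0.1], [0.3, 0.6]])   # computed outputs y^(p)
print(mse_cost(T, Y))                     # 0.075
```

The loss is local (one example); the cost averages those losses over the dataset or mini-batch.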
Error gradient back-propagation learning algorithm

- It is the generalization of the delta rule, extended to networks with hidden layers whose neurons have non-linear activation functions that are differentiable and non-decreasing.
- Three major steps:
  1. The feedforward step calculates the output y(p) from an input training example x(p).
  2. Error gradient back-propagation, after comparing the target t(p) with the computed output y(p).
  3. Weights update.
- These steps involve intensive vector/matrix computation.
- Back-propagation needs roughly twice the memory and three times the computational resources of the forward pass.
Back-propagation algorithm (1)

1. Feedforward.
- Feed a particular training example x(p) to the network input and move forward to calculate all the neuron logits and outputs:

  net_i^(p) = Σ_j w_ij · y_j^(p) ;  y_i^(p) = f( net_i^(p) )
Step by step
1. Feedforward

[Figure: the 2-3-2-1 network from slide 5, computed layer by layer.]
y1 = f[1](w1,0·1 + w1,x1·x1 + w1,x2·x2)
y2 = f[1](w2,0·1 + w2,x1·x1 + w2,x2·x2)
y3 = f[1](w3,0·1 + w3,x1·x1 + w3,x2·x2)
y4 = f[2](w4,0·1 + w4,1·y1 + w4,2·y2 + w4,3·y3)
y5 = f[2](w5,0·1 + w5,1·y1 + w5,2·y2 + w5,3·y3)
y = f[3](w6,0·1 + w6,4·y4 + w6,5·y5)
Back-propagation algorithm (2)

Loss function ℒ:  E^(p)(W) = ½ Σ_{k=1}^{ny} ( t_k^(p) − y_k^(p) )²

2. Gradient back-propagation.
- Calculate δi(p) for the output layer:
  δi(p) = (ti(p) − yi(p)) · f'(neti(p)).  If |neti(p)| is high, f' saturates and δi(p)→0.
- Calculate δi(p) for the layer preceding the output layer, right back to the input neurons, using the recursive formula:
  δi(p) = f'(neti(p)) · Σ_k δk(p) · wki
  Note that if δk(p)→0 and |neti(p)| is high, then δi(p)≈0. (This vanishing of gradients is a core problem of deep learning.)
- The index k runs over all the nodes to which the i-th neuron is connected. The index j runs over all the nodes in the previous layer.
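A small sketch of the two delta formulas for one output layer and one hidden layer; the weights, targets, and outputs here are made-up numbers for illustration, and tanh stands in for a generic differentiable f:

```python
import numpy as np

def tanh_prime(net):
    # f'(net) for f = tanh, expressed directly in terms of net
    return 1.0 - np.tanh(net) ** 2

# Output-layer deltas: delta_i = (t_i - y_i) * f'(net_i)
t = np.array([1.0, 0.0])                 # targets
y = np.array([0.7, 0.2])                 # computed outputs, y = tanh(net)
net_out = np.arctanh(y)                  # the logits that produced y
delta_out = (t - y) * tanh_prime(net_out)

# Hidden-layer deltas via the recursive formula:
# delta_i = f'(net_i) * sum_k delta_k * w_ki
W_out = np.array([[0.5, -0.3, 0.8],      # w_ki: row k (output neuron), col i (hidden neuron)
                  [0.1,  0.4, -0.2]])
net_hidden = np.array([0.2, -0.1, 0.3])
delta_hidden = tanh_prime(net_hidden) * (W_out.T @ delta_out)
print(delta_hidden.shape)                # (3,): one delta per hidden neuron
```

Note how each hidden delta gathers the deltas of the next layer weighted by the connections w_ki, then scales by the local derivative.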
Step by Step
2. Gradient Back-propagation

[Figure: the deltas propagated backwards through the 2-3-2-1 network.]
δ6 = (t − y) · f[3]'(n6)
δ4 = δ6 · w6,4 · f[2]'(n4)
δ5 = δ6 · w6,5 · f[2]'(n5)
δ1 = f[1]'(n1) · (δ4·w4,1 + δ5·w5,1)
δ2 = f[1]'(n2) · (δ4·w4,2 + δ5·w5,2)
δ3 = f[1]'(n3) · (δ4·w4,3 + δ5·w5,3)
Back-propagation algorithm (3)

3. Weights update.
- Calculate the weight increments of the connections making up the neural network:
  Δ^(p)wij = α · δi(p) · yj(p)
- Update the values of the network weights:
  1. Online or stochastic updating. Weights are updated for each single example x(p):
     w'ij = wij + Δ^(p)wij
  2. Batch or mini-batch updating. The single weight increments are accumulated over the whole training dataset (batch) or over a subset, the mini-batch, of size m; the update is then applied:
     w'ij = wij + Σ_p Δ^(p)wij
- Given the training, dev, or testing dataset, or a mini-batch of size m, a usual cost function J for regression problems is the mean square error, MSE = (1/m) Σ_{p=1}^{m} E^(p).
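Both update modes can be sketched for one layer; the learning rate, deltas, and previous-layer outputs below are illustrative values, not from the slides:

```python
import numpy as np

alpha = 0.1                               # learning rate

# One stochastic update: Delta w_ij = alpha * delta_i * y_j
delta = np.array([0.2, -0.1])             # deltas of this layer's neurons (index i)
y_prev = np.array([0.5, 0.3, 0.8])        # outputs of the previous layer (index j)
W = np.zeros((2, 3))
W += alpha * np.outer(delta, y_prev)      # online update after a single example

# Mini-batch update: accumulate the increments over m examples, apply once.
# Here the m examples are identical only to keep the sketch short.
m = 4
increments = [alpha * np.outer(delta, y_prev) for _ in range(m)]
W_batch = np.zeros((2, 3)) + np.sum(increments, axis=0)
print(W.shape, W_batch.shape)             # (2, 3) (2, 3)
```

`np.outer(delta, y_prev)` builds the full increment matrix Δwij = α·δi·yj in one call.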
Step by step
3. Stochastic weights update

[Figure: every weight of the 2-3-2-1 network updated with its own delta.]
w'1,0 = w1,0 + α·δ1·1 ;  w'1,x1 = w1,x1 + α·δ1·x1 ;  w'1,x2 = w1,x2 + α·δ1·x2
w'2,0 = w2,0 + α·δ2·1 ;  w'2,x1 = w2,x1 + α·δ2·x1 ;  w'2,x2 = w2,x2 + α·δ2·x2
w'3,0 = w3,0 + α·δ3·1 ;  w'3,x1 = w3,x1 + α·δ3·x1 ;  w'3,x2 = w3,x2 + α·δ3·x2
w'4,0 = w4,0 + α·δ4·1 ;  w'4,1 = w4,1 + α·δ4·y1 ;  w'4,2 = w4,2 + α·δ4·y2 ;  w'4,3 = w4,3 + α·δ4·y3
w'5,0 = w5,0 + α·δ5·1 ;  w'5,1 = w5,1 + α·δ5·y1 ;  w'5,2 = w5,2 + α·δ5·y2 ;  w'5,3 = w5,3 + α·δ5·y3
w'6,0 = w6,0 + α·δ6·1 ;  w'6,4 = w6,4 + α·δ6·y4 ;  w'6,5 = w6,5 + α·δ6·y5
Stop conditions or criteria

1. The training cost function, e.g., MSE, falls below a threshold.
2. The training error does not improve after some epochs.
3. A maximum number of epochs or iterations is reached.
- A (learning) iteration is one update of the weights during training.
- An epoch means presenting the entire training dataset to the network.
- Following these definitions, with stochastic weight updates an epoch consists of P iterations for a P-example training dataset.
- With batch updates, an iteration equals an epoch.
Overfitting

The dev cost function (MSE) starts to increase while the training cost keeps decreasing: the network is overfitting.

[Figure: training and dev MSE curves over training time. From Hands-On Machine Learning with Scikit-Learn and TensorFlow, Géron 2017.]
Fair error measure for classification problems

- The MSE calculates the error by comparing each computed output with its corresponding desired target output.
- In classification problems, where the winner takes all, what matters is not the exact value of each output but whether the maximum corresponds to the right class.
Log-loss & Cross-entropy

- For each input example x, the output vector y provided by the Softmax function, S(n) = y, is compared to the one-hot encoded target vector t.
  https://jamesmccaffrey.wordpress.com/2016/09/25/log-loss-and-cross-entropy-are-almost-the-same/

Example: y = (0.14, 0.23, 0.63), t = (0, 0, 1) (the "Expensive" class).

Cross-entropy loss for one example:
  ℒ^(p)(y, t) = − Σ_{k=1}^{ny} t_k · log(y_k)     Note that ℒ(y, t) ≠ ℒ(t, y).

Log-loss over all training examples:
  J(W) = (1/m) Σ_{p=1}^{m} ℒ^(p) ;  ΔW = −α·∇J(W)

Back-propagation must then use the derivative of the log-loss instead of the quadratic error.
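The worked example can be checked with a short sketch (the small epsilon guarding log(0) is a standard implementation detail, not from the slides):

```python
import numpy as np

def cross_entropy(y, t):
    # L(y, t) = -sum_k t_k * log(y_k); note L(y, t) != L(t, y)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(t * np.log(y + eps))

y = np.array([0.14, 0.23, 0.63])     # softmax output from the slide
t = np.array([0.0, 0.0, 1.0])        # one-hot target
print(round(cross_entropy(y, t), 3)) # 0.462, i.e. -log(0.63)
```

With a one-hot target, only the probability assigned to the correct class contributes to the loss.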
Lecture slides of master course “Deep Learning”.
2022 Daniel Manrique

Suggested work citation:


D. Manrique. (2022). Training Methods for Deep Neural Networks. Backpropagation.
Master course (lecture slides). Department of Artificial Intelligence. Universidad
Politécnica de Madrid.

