CM01 Deep Learning
Romain Hérault
Autumn 2021
Outline
Object classification
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ”Imagenet classification with deep convolutional neural networks.” Advances in neural
information processing systems. 2012.
Object detection
Figure: Object detection examples — bounding boxes with class labels and confidence scores (person, dog, horse, bus, boat).
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in
neural information processing systems (pp. 91-99).
Semantic segmentation
Figure: Semantic segmentation examples — panels (a, d) test images, (b, e) ground truth, (c, f) predictions.
Guosheng Lin, Chunhua Shen, Ian Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013,
2015.
Image captioning
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 3128-3137.
Outline
Setup
Objective
Find the link f : X → Y (or the dependencies p(y|x) ) between the input and the output spaces.
Hypotheses space
f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision trees, ...).
How to choose f within H? By minimizing an empirical risk of the form \sum_{(x,y)} L(f(x), y) over the training data,
where L is a loss function that measures the accuracy of a prediction f(x) with respect to a target value y.
Regression
If Y = R^o, it is a regression task.
Standard losses are (y − f(x))^2 and |y − f(x)|.
Figure: Support Vector Machine regression — fitted function y = f(x) over the training points.
Classification / Discrimination
If Y is a discrete set of labels, it is a classification / discrimination task.
Figure: Classification example — decision regions in the (x1, x2) plane.
Available data
Data consist of a set of n examples (x, y), with x ∈ X and y ∈ Y, split into:
A training set that will be used to choose f ,
i.e. to learn the parameters w of the model
A test set to evaluate the chosen f
(A validation set to choose the hyper-parameters of f )
Because of the human cost of labelling data, one can often find a separate unlabelled set, i.e.
examples with only the features x (see semi-supervised learning).
R_S(f) = \frac{1}{\mathrm{card}(S)} \sum_{(x,y) \in S} L(f(x), y)
where S is the training set during learning and the test set during final evaluation.
Figure: Empirical risk on the learning set and on the test set as a function of model complexity (low to high).
The model is chosen by minimizing a regularized objective of the form L + λ Ω, where
L is a loss term that measures the accuracy of the model,
Ω is a regularization term that limits the capacity of the model,
λ ∈ [0, ∞[ is the regularization hyper-parameter.
Solution
w(λ) = (X^⊤ X + λ I)^{-1} X^⊤ Y
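This closed-form expression is the solution of linear regression with the squared loss and an L2 penalty (ridge regression). A minimal Octave sketch, assuming synthetic data X and Y (not from the slides):

% Hypothetical data: n examples, d features
n = 100; d = 5;
X = randn(n, d);
Y = X * [2; 0; -1; 0; 3] + 0.1 * randn(n, 1);

lambda = 0.5;                               % regularization hyper-parameter
w = (X' * X + lambda * eye(d)) \ (X' * Y);  % w(lambda) = (X'X + lambda I)^(-1) X'Y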
Regularization path:
Figure: Regularization path in the (w0, w1) plane, from λ = 0 to λ = +∞, with the level sets of the loss term and of the regularization term.
Introducing sparsity
Lasso
Linear regression with the sum of squared errors as loss and an L1 norm as regularization:
\arg\min_{w \in R^d} \; \|Y - X w\|^2 + λ \sum_d |w_d|
which is equivalent to the constrained problem: minimize \|Y - X w\|^2 subject to \sum_d |w_d| \le t.
Why is it sparse ?
Lasso: illustration
Figure: Lasso regularization path in the (w0, w1) plane, from λ = 0 to λ = +∞, with the level sets of the loss term and of the L1 regularization term.
Outline
History . . .
Biological neuron
Origin
Schematic
Figure: Artificial neuron — inputs x1, ..., xm weighted by w1, ..., wm, summed (Σ) with a bias b, then passed through an activation to produce ŷ1.
Formulation
ŷ = f(⟨w, x⟩ + b)    (1)
where
x, input vector,
ŷ, output estimation,
w, weights linked to each input (model parameter),
b, bias (model parameter),
f , activation function.
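A small numerical illustration of equation (1), with arbitrary values (not from the slides) and a sigmoid activation:

x = [0.5; -1.2; 2.0];          % input vector
w = [0.1;  0.4; -0.3];         % weights
b = 0.2;                       % bias
f = @(s) 1 ./ (1 + exp(-s));   % sigmoid activation
y_hat = f(dot(w, x) + b);      % y_hat = f(<w, x> + b)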
Evaluation
Typical losses are the squared error (regression) or the cross-entropy (classification).
Classification
Activation functions are typically the step function, the sigmoid function (range [0, 1]) or the hyperbolic tangent (range [−1, 1]).
Figure: Sigmoid activation, f(x) = sigm(x).
Figure: Hyperbolic tangent activation, f(x) = tanh(x).
If the loss and the activation function are differentiable, the parameters w and b can be learned by gradient descent.
A perceptron
Figure: A perceptron — inputs x0 = 1 (bias), x1, x2, x3, weights w_{ji}, sums S1, S2, activation f, outputs y1, y2.

S_j = \sum_i w_{ji} x_i
y_j = f(S_j)
PseudoCode
Code
function O = forward_tanh(W, I)
  % Forward pass of one MLP layer
  % with a tanh transfer function
  % - W  layer parameters: matrix (n_inputs+1) x n_outputs
  % - I  layer inputs:     matrix n_examples x n_inputs
  % - O  layer outputs:    matrix n_examples x n_outputs

  % Linear form
  % A column of ones is appended to the input to implement the bias
  nI = size(I, 1);
  S  = [I ones(nI, 1)] * W;

  % Transfer function
  O  = tanh(S);
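For example (illustrative sizes only), a layer with 3 inputs and 2 outputs applied to a batch of 4 examples:

W = 0.1 * randn(3 + 1, 2);   % (n_inputs+1) x n_outputs, the last row holds the biases
I = randn(4, 3);             % 4 examples, 3 inputs
O = forward_tanh(W, I);      % 4 x 2 matrix of outputs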
If a function F(w) taking its values in R is defined and differentiable at a point a, then F decreases fastest in the direction opposite to the gradient of F at a, i.e. −∇F(a).
For f : R → R, there is an η small enough so that f(a − η f'(a)) ≤ f(a).
For F : R^n → R, there is an η small enough so that F(a − η ∇F(a)) ≤ F(a).
1. Set the learning rate η_n ← η_0.
2. Compute w_{n+1} = w_n − η_n ∇F(w_n).
3. If F(w_{n+1}) > F(w_n), the learning rate η_n is too big; correct it by η_n ← α η_n, with 0 < α < 1.
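A minimal Octave sketch of this procedure on a toy one-dimensional function (the function F, its gradient and the constants are illustrative assumptions):

F     = @(w) (w - 2).^2 + 1;   % toy objective
gradF = @(w) 2 * (w - 2);      % its gradient
w     = -1;                    % initial point
eta   = 1.5;                   % initial learning rate eta_0
alpha = 0.5;                   % correction factor, 0 < alpha < 1
for n = 1:50
  w_new = w - eta * gradF(w);  % gradient step
  if F(w_new) > F(w)           % learning rate too big:
    eta = alpha * eta;         % correct it and retry
  else
    w = w_new;
  end
end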
Figure: Gradient descent iterations on a one-dimensional function f(w) and on a two-dimensional function f(w1, w2) (contour plots of the trajectory).
Gradient Machine I
Principle
The learning problem is cast as an optimization problem solved by gradient descent.
Notation
(x, y) ∈ (X_train, Y_train), a training example: x is the feature vector of the example, y its output or label,
F the model,
W the model parameters,
ŷ = F (x, W) the estimated output of the model when called on an example x with W
parameters.
L(x, W, y) = L(ŷ = F (x, W), y), the loss function used for the training.
We can use a gradient descent as long as the derivative of the loss with respect to each parameter of the model is computable, i.e. ∂L(x, W, y)/∂W_i exists ∀x, W, y.
Gradient Machine II
Wn+1 = Wn − ηn ∆Wn
Warning
\frac{\partial L(x_i, W, y_i)}{\partial W_k} \neq \frac{\partial L(x_j, W, y_j)}{\partial W_k}
How to compute the gradient on multiple examples?
Batch Gradient
The batch (X , Y) is a subset of the training set.
\Delta W_i = \frac{1}{\mathrm{card}(\mathcal{X})} \sum_{(x,y) \in (\mathcal{X}, \mathcal{Y})} \frac{\partial L(x, W, y)}{\partial W_i}
The gradient is averaged over all the examples in the batch and then used to update the parameters.
A batch is not processed again until all other batches are processed.
On-line Gradient
\Delta W_i = \frac{\partial L(x, W, y)}{\partial W_i}
Parameters are updated each time an example is processed. It is equivalent to a batch gradient with a batch size of 1, card(X) = 1.
Gradient Machine IV
Stochastic gradient
We speak of stochastic gradient when the batches are presented in a random order.
As with the batch gradient, a batch only reappears when all the other batches have been processed.
Extreme setups:
On a small dataset: a single batch, non-stochastic.
On a big dataset: on-line and stochastic.
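A minimal Octave sketch of a mini-batch stochastic gradient loop on a toy linear regression (the data, batch size and learning rate are illustrative assumptions):

% Toy data: linear regression learned by mini-batch stochastic gradient
n = 200; d = 3;
Xtrain = randn(n, d);
Ytrain = Xtrain * [1; -2; 0.5] + 0.05 * randn(n, 1);

W = zeros(d, 1);  eta = 0.05;  batch_size = 20;
for epoch = 1:100
  perm = randperm(n);                         % stochastic: batches in random order
  for k = 1:batch_size:n
    idx = perm(k:min(k + batch_size - 1, n));
    X = Xtrain(idx, :);  Y = Ytrain(idx, :);
    dW = X' * (X * W - Y) / numel(idx);       % average gradient of the squared loss on the batch
    W  = W - eta * dW;                        % update after each batch
  end
end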
A perceptron
Figure: A perceptron — inputs x0 = 1 (bias), x1, x2, x3, weights w_{ji}, sums S1, S2, activation f, outputs y1, y2.
As the loss is differentiable, we can compute ∂L/∂y_j.
\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial S_j} \frac{\partial S_j}{\partial w_{ji}}
\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j} f'(S_j) \, x_i
PseudoCode
Criterion Gradient
function [err, gradCrit] = criterion_mse(O, TARGET)
  % MSE criterion L
  % - O         MLP outputs
  % - TARGET    target values
  % - err       value of L
  % - gradCrit  gradient of L with respect to O
  err      = sum((TARGET - O).^2);
  gradCrit = -(TARGET - O);   % the factor 2 of the exact derivative is dropped here
Model Gradient
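The model-gradient code itself is not reproduced here. A minimal sketch of what it could look like for the tanh layer above (hypothetical backward_tanh function, following the conventions of forward_tanh and criterion_mse):

function [gradW, gradI] = backward_tanh(W, I, O, gradO)
  % Backward pass of one MLP layer with a tanh transfer function (sketch)
  % - W      layer parameters: matrix (n_inputs+1) x n_outputs
  % - I      layer inputs:     matrix n_examples x n_inputs
  % - O      layer outputs:    matrix n_examples x n_outputs (O = tanh(S))
  % - gradO  dL/dO:            matrix n_examples x n_outputs
  % - gradW  dL/dW, gradI dL/dI (to be propagated to the previous layer)
  nI    = size(I, 1);
  dS    = gradO .* (1 - O.^2);     % dL/dS = dL/dO * f'(S), with f'(S) = 1 - tanh(S)^2
  gradW = [I ones(nI, 1)]' * dS;   % dL/dW on the augmented input (bias column included)
  gradI = dS * W(1:end-1, :)';     % dL/dI, dropping the bias row of W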
Neural network
To solve more complex problems, we need to build a network of perceptrons
Principles
Figure: Feed-forward network, with two layers and one hidden representation (inputs x1..x4, outputs ŷ1, ŷ2).
Recurrent network
Figure: Recurrent network — the connection graph contains cycles.
Recurrent network
Figure: Recurrent network — the past outputs ŷ1,t−1, ŷ1,t−2, ŷ1,t−3 are fed back as inputs alongside x1,t, x2,t, x3,t to produce ŷ1,t.
Outline
Figure: Feed-forward network (inputs x1..x4, outputs ŷ1, ŷ2).
Figure: Layer (l) of an MLP — inputs I^{(l)}_0 = 1 (bias), I^{(l)}_1, I^{(l)}_2, I^{(l)}_3, weights w_{ji}, sums S^{(l)}_1, S^{(l)}_2, activation f^{(l)}, outputs O^{(l)}_1, O^{(l)}_2.
If we look at layer (l), let I^{(l)}_i be its input number i and O^{(l)}_j its output number j.
Figure: Layer (l) (same notation as above).
We assume ∂L/∂O^{(l)}_j to be known.
\frac{\partial L}{\partial w^{(l)}_{ji}} = \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial w^{(l)}_{ji}}
Figure: Layer (l) (same notation as above).
Now we compute ∂L/∂I^{(l)}_i:
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial I^{(l)}_i}
= \sum_j \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial I^{(l)}_i}
= \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
Figure: Layer (l) (same notation as above).
Start:
\frac{\partial L}{\partial O^{(\mathrm{last})}_j} = \frac{\partial L}{\partial \hat{y}_j}
Backward recurrence:
\frac{\partial L}{\partial w^{(l)}_{ji}} = \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, I^{(l)}_i
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
\frac{\partial L}{\partial O^{(l-1)}_i} = \frac{\partial L}{\partial I^{(l)}_i}
PseudoCode
Criterion Gradient
function [err, gradCrit] = criterion_mse(O, TARGET)
  % MSE criterion L
  % - O         MLP outputs
  % - TARGET    target values
  % - err       value of L
  % - gradCrit  gradient of L with respect to O
  err      = sum((TARGET - O).^2);
  gradCrit = -(TARGET - O);   % the factor 2 of the exact derivative is dropped here
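Putting the pieces together, one gradient step on a single layer could look as follows (an illustrative sketch with arbitrary data; backward_tanh is the hypothetical function sketched earlier):

X = randn(10, 3);             % 10 examples, 3 inputs (arbitrary data)
T = randn(10, 1);             % targets, 1 output
W = 0.1 * randn(4, 1);        % (3 inputs + bias) x 1 output
eta = 0.01;

O = forward_tanh(W, X);                          % forward pass
[err, gradO] = criterion_mse(O, T);              % loss L and dL/dO
[gradW, gradX] = backward_tanh(W, X, O, gradO);  % dL/dW (and dL/dI, unused here)
W = W - eta * gradW;                             % one gradient step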
Outline
Deep architecture
Figure: Deep feed-forward network (inputs x1..x5, output ŷ1, several hidden layers).
Why?
Figure: Hyperbolic tangent activation, f(x) = tanh(x) (saturating).
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
When neurons at higher layers are saturated, the gradient decreases toward zero (vanishing gradient).
Figure: Non-saturating activations — softplus f(x) = ln(1 + e^x) and ReLU f(x) = max(0, x).
Variants: LeakyReLU, ELU, ...
Idea
Perform normalization not only when preprocessing data but also in between each layer.
x' = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
Back-propagation
How to compute \frac{\partial L}{\partial x} from \frac{\partial L}{\partial x'}?
Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
x'_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\frac{\partial L}{\partial \sigma_B^2} = \frac{-1}{2} (\sigma_B^2 + \epsilon)^{-3/2} \sum_{i=1}^{m} \frac{\partial L}{\partial x'_i} (x_i - \mu_B)
\frac{\partial L}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \sum_{i=1}^{m} \frac{\partial L}{\partial x'_i} - \frac{1}{m} \frac{\partial L}{\partial \sigma_B^2} \sum_{i=1}^{m} 2 (x_i - \mu_B)
\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial x'_i} \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{1}{m} \left( 2 (x_i - \mu_B) \frac{\partial L}{\partial \sigma_B^2} + \frac{\partial L}{\partial \mu_B} \right)
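A minimal Octave sketch of these forward and backward computations for one feature over a batch of m values (all values and the epsilon are illustrative assumptions):

m       = 8;
epsilon = 1e-5;
x   = randn(m, 1);                        % one feature over a batch (arbitrary values)
mu  = mean(x);                            % batch mean mu_B
s2  = mean((x - mu).^2);                  % batch variance sigma_B^2
xh  = (x - mu) ./ sqrt(s2 + epsilon);     % normalized activations x'

dxh = randn(m, 1);                        % dL/dx', assumed to come from the next layer
ds2 = sum(dxh .* (x - mu)) * (-0.5) * (s2 + epsilon)^(-1.5);         % dL/dsigma_B^2
dmu = -sum(dxh) / sqrt(s2 + epsilon) - ds2 * sum(2 * (x - mu)) / m;  % dL/dmu_B
dx  = dxh ./ sqrt(s2 + epsilon) + (2 * (x - mu) * ds2 + dmu) / m;    % dL/dx_i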
Figure: Weight sharing in a convolutional layer — the same small set of weights (w1, w2, w3) is applied at every position of the input ((m + 1) ∗ n values mapped to (o + 1) ∗ n values).
Yu, Fisher, and Vladlen Koltun. ”Multi-scale context aggregation by dilated convolutions.” ICLR 2016, arXiv preprint arXiv:1511.07122 (2015).
Convolutional layer
Batch normalization
ReLU activation
(Subsampling)
Batch normalization
Dense layer (+ weight regularization)
ReLU/Tanh activation
LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-55.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS),
Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
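A minimal Octave sketch of one convolution / batch-normalization / ReLU / subsampling stage on a single feature map (the input size, the kernel and the epsilon are illustrative assumptions; the normalization is reduced to a single map):

X = rand(28, 28);                                % input map (arbitrary size)
K = 0.1 * randn(5, 5);                           % convolution kernel
S = conv2(X, K, 'valid');                        % convolutional layer
S = (S - mean(S(:))) / sqrt(var(S(:)) + 1e-5);   % batch normalization (over this map only)
A = max(S, 0);                                   % ReLU activation
P = A(1:2:end, 1:2:end);                         % (subsampling by a factor of 2)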
Figure: GoogLeNet [Szegedy 2015]
Convolutional network: 2D examples II
Convolutional layer
Batch normalization
ReLU activation
(Subsampling)
No dense blocks!
Advantages: variable input size, small number of parameters.
Disadvantages: variable output size, cannot model global dependencies.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference
on computer vision and pattern recognition. 2015.
Skip connection
Principle
When a layer (or a group of layers) has the same input and output size, add the input to the output.
⇒ Gradient can ”fly” over the layer.
h → [Layers] → (+) → h', i.e. h' = h + Layers(h)
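A minimal sketch, where layer stands for any block that keeps the input size (an arbitrary choice here):

layer = @(h) tanh(0.5 * h);   % any block with the same input and output size (arbitrary here)
h  = randn(4, 1);
hp = h + layer(h);            % skip connection: h' = h + Layers(h)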
Figure: Res-net
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Figure: U-net
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. ”U-net: Convolutional networks for biomedical image segmentation.” International Conference
on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
Pre-training
An unsupervised pre-training of the input layers with auto-encoders. Intuition: learning the manifold where the input data reside.
Can take into account an unlabelled dataset.
Finetuning
A finetuning of the whole network with supervised back-propagation.
Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp 1527-1554
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507,
28 July 2006.
Figure: Autoencoder — inputs x1..x5, hidden representation h1, h2, reconstructions x̂1..x̂5.
With 2 layers:
the input layer is called the encoder,
the output layer, the decoder.
Tied weights W_dec = W_enc^⊤; convergence? PCA?
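A minimal sketch of an autoencoder forward pass with tied weights (sizes, activations and the absence of biases are illustrative assumptions):

d = 5; k = 2;
W_enc = 0.1 * randn(k, d);    % encoder weights
x     = randn(d, 1);
h     = tanh(W_enc * x);      % hidden representation (encoder)
W_dec = W_enc';               % tied weights: W_dec = W_enc'
x_hat = W_dec * h;            % reconstruction (decoder, linear output here)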
Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are fixed until the finetuning.
Figure: Stacked pre-training — a first autoencoder is trained on the inputs x1..x5; a second autoencoder is then trained on the first hidden representation h1,1..h1,5; a third on the second hidden representation h2,1..h2,3; the pre-trained layers are finally stacked into the feed-forward network producing ŷ1.
Transfer learning
Figure: Network layers C1..C5 (convolutional) and FC1 (dense), reused from one task to another.
Cireşan, D. C., Meier, U., & Schmidhuber, J. (2012, June). Transfer learning for Latin and Chinese characters with deep neural networks. In Neural
Networks (IJCNN), The 2012 International Joint Conference on (pp. 1-6)
S. Belharbi, C. Chatelain, R. Hérault, S. Adam, S. Thureau, M. Chastan, R. Modzelewski, Spotting L3 slice in CT scans using deep convolutional
network and transfer learning, Computers in Biology and Medicine
Figure: Denoising autoencoder — the inputs x1..x5 are corrupted (disturbance) into x̃1..x̃5, encoded into h1, h2, and reconstructed as x̂1..x̂5 to match the clean inputs x.
Figure: Over-complete autoencoder — the hidden representation h1..h5 is larger than the input x1..x3.
Dropout
Figure: Dropout — at each training step, a random subset of the units is dropped, so a different sub-network is trained each time.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
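A minimal sketch of dropout applied to a vector of hidden activations during training (the drop probability and the rescaling convention are illustrative assumptions):

p    = 0.5;                    % probability of dropping a unit
h    = randn(10, 1);           % hidden activations (arbitrary)
mask = rand(size(h)) > p;      % random subset of units to keep
h    = (h .* mask) / (1 - p);  % dropped units are zeroed; kept units rescaled ("inverted" dropout)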
Notation
2-layer MLP
ŷ = f_MLP(x; w_in, w_out) = f_out(b_out + w_out · f_in(b_in + w_in · x))
AE
x̂ = f_AE(x; w_enc, w_dec) = f_dec(b_dec + w_dec · f_enc(b_enc + w_enc · x))
Tied weights: w_dec = w_enc^⊤
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1), 108-116
Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML’2004
Regularization on weights
J = \sum_i L(y_i, f_MLP(x_i; w)) + λ \, Ω(w_out)
L2 (Gaussian prior): Ω(w_out) = \sum_d w_d^2
L1 (Laplace prior): Ω(w_out) = \sum_d |w_d|
t-Student: Ω(w_out) = \sum_d \log(1 + w_d^2)
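A minimal sketch computing the three penalties on an arbitrary weight vector:

w        = [0.5; -2.0; 0.0; 1.3];    % output-layer weights (arbitrary)
omega_l2 = sum(w.^2);                % L2 (Gaussian prior)
omega_l1 = sum(abs(w));              % L1 (Laplace prior)
omega_ts = sum(log(1 + w.^2));       % t-Student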
Semi-supervised objective, where L is the set of labelled examples and U the set of unlabelled examples:
J = λ_L \sum_{i \in L} L(y_i, f_MLP(x_i; w_out, w_in)) + λ_U \sum_{i \in L \cup U} L(x_i, f_AE(x_i; w_in)) + λ_Ω \, Ω(w_out)
Szegedy, Christian, et al. ”Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).
Thank you
Questions ?