
Deep Learning

Romain Hérault

Normandie Université - INSA de Rouen - LITIS

Autumn 2021



What are we talking about ?

Outline

1 What are we talking about ?

2 Introduction to supervised learning

3 Introduction to Neural Networks

4 Multi-Layer Perceptron - Feed-forward network

5 Deep Neural Networks



What are we talking about ?

Object classification

Figure: ImageNet [Krizhevsky 2012]

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ”Imagenet classification with deep convolutional neural networks.” Advances in neural
information processing systems. 2012.



What are we talking about ?

Object detection

(Example detections with confidence scores: person, dog, horse, car, cat, bus, boat.)

Figure: PASCAL VOC 2007 [Ren 2015]

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in
neural information processing systems (pp. 91-99).



What are we talking about ?

Semantic classification / scene labelling

(a) Testing (b) Truth (c) Predict (d) Testing (e) Truth (f) Predict

Figure: PASCAL VOC 2012 [Lin 2015]

Guosheng Lin, Chunhua Shen, Ian Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013,
2015.



What are we talking about ?

Image captioning

Figure: Flickr8K, Flickr30K and MSCOCO [Karpathy 2015]

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2015. pp. 3128-3137.



What are we talking about ?

State of the art

State-of-the-art performance

All achieved by deep neural networks!


How does that work ?



Introduction to supervised learning

Outline

1 What are we talking about ?

2 Introduction to supervised learning

3 Introduction to Neural Networks

4 Multi-Layer Perceptron - Feed-forward network

5 Deep Neural Networks



Introduction to supervised learning

Supervised learning: Concept

Setup

An input (or feature) space X ⊆ R^m,


An output (or target) space Y,

Objective
Find the link f : X → Y (or the dependency p(y|x)) between the input and the output spaces.



Introduction to supervised learning

Supervised learning: general framework

Hypothesis space
f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision
trees, . . . ).
How to choose f within H ?

Expected Prediction Error


or generalization error, or generalization risk:

R(f) = E_{X,Y}[ L(f(X), Y) ] = ∫∫ L(f(x), y) p(x, y) dx dy

where L is a loss function that measures the discrepancy between a prediction f(x) and a target value y.



Introduction to supervised learning

Supervised learning: different tasks, different losses

Regression
Figure: Support Vector Machine Regression (1D toy example)

If Y ⊆ R^o, it is a regression task.
Standard losses are (y − f(x))² or |y − f(x)|.

Classification / Discrimination

If Y is a discrete set, it is a classification or
discrimination task.
A standard loss is Θ(−y f(x)), where Θ is the step
function, in the binary case.

(Figure: decision function of a binary classifier on a 2D toy example.)



Introduction to supervised learning

Supervised learning: Experimental setup

Available data
Data consists of a set of n examples (x, y), where x ∈ X and y ∈ Y, split into:
A training set that will be used to choose f ,
i.e. to learn the parameters w of the model,
A test set to evaluate the chosen f ,
(A validation set to choose the hyper-parameters of f .)
Because of the human cost of labelling data, one may also find a separate unlabelled set, i.e.
examples with only the features x (see semi-supervised learning).

Evaluation: Empirical risk

R_S(f) = (1 / card(S)) Σ_{(x,y)∈S} L(f(x), y)

where S is the training set during learning and the test set during the final evaluation.
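As an illustration (not from the slides; a minimal sketch with the squared loss and our own variable names), the empirical risk can be computed in Octave as:

Code (sketch)
function R = empirical_risk(f, X, Y)
% Empirical risk of a predictor f on a set S = (X, Y)
% - f function handle returning predictions for X
% - X matrix n_examples x n_features
% - Y matrix n_examples x n_outputs
P = f(X);                               % predictions on the whole set
R = sum(sum((Y - P).^2)) / size(X, 1);  % mean squared loss over S
end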



Introduction to supervised learning

Supervised learning: Overfitting

(Plot: empirical risk as a function of model complexity — the learning-set risk keeps decreasing while the test-set risk rises again at high complexity: overfitting.)

Remedies:
Adding noise to the data or to the model parameters (dark age)


Limiting model capacity ⇒ Regularization



Introduction to supervised learning

Supervised learning as an optimization problem

Tikhonov regularization scheme


arg min_w Σ_{(x,y)∈S_train} L(f(x; w), y) + λ Ω(w)

where
L is a loss term that measures the accuracy of the model,
Ω is a regularization term that limits the capacity of the model,
λ ∈ [0, ∞[ is the regularization hyper-parameter.

Example: Ridge regression


Linear regression with the sum of squared errors as loss and an L2-norm as regularization:

arg min_{w ∈ R^d} ||Y − X w||² + λ Σ_d |w_d|²

Solution
w(λ) = (X^T X + λI)^{-1} X^T Y

Regularization path:

{w(λ) | λ ∈ [0, ∞[}
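A minimal Octave sketch of this closed-form solution and of the regularization path (the lambda grid and variable names are our own assumptions):

Code (sketch)
% Ridge solution w(lambda) = (X'X + lambda*I)^-1 X'Y for a grid of lambdas
lambdas = logspace(-3, 3, 20);
d = size(X, 2);
W = zeros(d, numel(lambdas));   % one column of the regularization path per lambda
for k = 1:numel(lambdas)
W(:, k) = (X' * X + lambdas(k) * eye(d)) \ (X' * Y);   % backslash rather than an explicit inverse
end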



Introduction to supervised learning

Ridge regression: illustration

(Contour plot in the (w0, w1) plane: loss term, regularization term, and the regularization path from the unregularized solution at λ = 0 to λ = +∞.)



Introduction to supervised learning

Introducing sparsity

Lasso
Linear regression with the sum of squared errors as loss and an L1-norm as regularization:

arg min_{w ∈ R^d} ||Y − X w||² + λ Σ_d |w_d|

which is equivalent to

arg min_{w⁺ ∈ R^d, w⁻ ∈ R^d} ||Y − X (w⁺ − w⁻)||² + λ Σ_d (w_d⁺ + w_d⁻)
s.t. w_i⁺ ≥ 0 ∀i ∈ [1..d]
     w_i⁻ ≥ 0 ∀i ∈ [1..d]

Why is it sparse ?
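A hint at the answer (a standard one-dimensional argument, not spelled out on the slide): with a single weight and a unit design matrix, the two problems have closed-form solutions

arg min_w (w − z)² + λ|w| = sign(z) · max(|z| − λ/2, 0)      (lasso, soft-thresholding)
arg min_w (w − z)² + λw²  = z / (1 + λ)                      (ridge)

so the lasso solution is exactly 0 as soon as |z| ≤ λ/2, whereas the ridge solution only shrinks toward 0.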



Introduction to supervised learning

Lasso: illustration

(Contour plot in the (w0, w1) plane: loss term, L1 regularization term, and the regularization path from λ = 0 to λ = +∞.)



Introduction to Neural Networks

Outline

1 What are we talking about ?

2 Introduction to supervised learning

3 Introduction to Neural Networks

4 Multi-Layer Perceptron - Feed-forward network

5 Deep Neural Networks



Introduction to Neural Networks

History . . .

1943 : Formal neuron (McCulloch & Pitts)


1957 : Perceptron (Rosenblatt)
1969 : Limitations of the perceptron (Minsky & Papert)
1974 : Gradient back-propagation (Werbos)
no success !?!?
1986 : Gradient back-propagation bis (Rumelhart & McClelland, LeCun)
New neural network architectures
New applications:
Character recognition
Speech recognition and synthesis
Vision (image processing), CNN
2005 : Deep networks
Deep Belief Networks, DBN (Hinton and Salakhutdinov, 2006)
Deep Neural Networks, DNN
Generative Adversarial Networks, GAN, 2014



Introduction to Neural Networks

Biological neuron

Figure: Scheme of a biological neuron [Wikimedia commons - M. R. Villarreal]


Introduction to Neural Networks

Formal neuron (1)

Origin

Warren McCulloch and Walter Pitts (1943), Frank Rosenblatt (1957),


Mathematical representation of a biological neuron

Schematic

(Diagram: inputs x1 . . . xm, weighted by w1 . . . wm, are summed (Σ) with a bias b and passed through an activation to produce ŷ1.)



Introduction to Neural Networks

Formal neuron (2)

Formulation

ŷ = f(⟨w, x⟩ + b)    (1)
where
x, input vector,
ŷ, output estimation,
w, weights linked to each input (model parameter),
b, bias (model parameter),
f , activation function.

Evaluation
Typical losses are
Classification (cross-entropy)

L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ))


Regression (squared error)

L(ŷ, y) = ||y − ŷ||²
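A minimal Octave sketch of equation (1) and of both losses, with a sigmoid activation and our own toy values:

Code (sketch)
x = [0.5; -1.2; 2.0];    % input vector
w = [0.1; 0.4; -0.3];    % weights
b = 0.05;                % bias
y_hat = 1 ./ (1 + exp(-(w' * x + b)));   % y_hat = f(<w, x> + b), sigmoid activation

y = 1;                   % target
L_classification = -(y * log(y_hat) + (1 - y) * log(1 - y_hat));   % cross-entropy
L_regression = norm(y - y_hat)^2;                                  % squared error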



Introduction to Neural Networks

Formal neuron (3)

Activation functions are typically the step function, the sigmoid function (range [0, 1]) or the hyperbolic tangent
(range [−1, 1]).

Figure: Sigmoid, f(x) = sigm(x)

Figure: Hyperbolic tangent, f(x) = tanh(x)

If loss and activation function are differentiable, parameters w and b can be learned by gradient
descent.



Introduction to Neural Networks

A perceptron

(Diagram: a one-layer perceptron with inputs x0 = 1, x1, x2, x3, weights w_ji, sums S1, S2, activation f and outputs y1, y2.)

Let x_i be input number i and y_j output number j:

S_j = Σ_i W_ji x_i
y_j = f(S_j)

with W_j0 = b_j and x0 = 1.



Introduction to Neural Networks

PseudoCode

Code
function O = forward_tanh(W, I)
% Forward pass of one MLP layer
% with a tanh transfer function
% - W layer parameters: matrix (n_inputs+1) x n_outputs
% - I layer inputs: matrix n_examples x n_inputs
% - O layer outputs: matrix n_examples x n_outputs

% Linear form
% A column of ones is appended to the input to implement the bias
nI = size(I, 1);
S = [I ones(nI, 1)] * W;
% Transfer function
O = tanh(S);



Introduction to Neural Networks

Gradient descent: principle

If a function F(w) with values in R is defined and differentiable at a point a, then F(w) decreases fastest
in the direction opposite to the gradient of F at a, −∇F(a).

f : R → R
There is an η small enough so that

f(b) ≤ f(a) where b = a − η f′(a).

F : R^n → R
There is an η small enough so that

F(b) ≤ F(a) where b = a − η ∇F(a).

η is called the learning rate.



Introduction to Neural Networks

Gradient descent: Algorithm

1 Initialize w0 and η0 such that 0 < η0 < 1.


2 Assign η0 to ηn :

ηn ← η0

3 Compute

wn+1 = wn − ηn ∇F(wn )

4 If F(wn+1 ) > F(wn ), the learning rate ηn is too big; correct it by

ηn ← α ηn ,

with 0 < α < 1, and return to (3).


5 Loop to (2) until F(wn ) − F(wn+1 ) falls below ε, a given tolerance, or until the maximum number of
iterations is reached.
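A minimal Octave sketch of this algorithm on a toy convex function (the function F, its gradient and all numerical values are our own assumptions):

Code (sketch)
F = @(w) sum(w.^2);      % toy objective
gradF = @(w) 2 * w;      % its gradient
w = [2; -3];             % w0
eta0 = 0.5; alpha = 0.5; epsilon = 1e-8;
for n = 1:1000
eta = eta0;                        % step 2: reset the learning rate
w_new = w - eta * gradF(w);        % step 3: gradient step
while F(w_new) > F(w)              % step 4: learning rate too big
eta = alpha * eta;
w_new = w - eta * gradF(w);
end
if F(w) - F(w_new) < epsilon       % step 5: stopping criterion
break
end
w = w_new;
end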



Introduction to Neural Networks

Example : 1D, convex

(Plot of a convex 1D function f(w).)



Introduction to Neural Networks

Example : 1D, non-convex

(Plot of a non-convex 1D function f(w) with several local minima.)

We find a local minimum.



Introduction to Neural Networks

Example : 2D, convex

(Contour plot of a convex 2D function over (w1, w2).)



Introduction to Neural Networks

Example : 2D, non-convex

(Contour plot of a non-convex 2D function over (w1, w2).)

The local minimum depends on the initialization.





Introduction to Neural Networks

Gradient Machine I

Principle
The learning problem is treated as an optimization problem solved by gradient descent.

Notation
(x, y) ∈ (Xtrain , Ytrain ) a training example, x is the feature vector of the example, y its outputs
or label,
F the model,
W the model parameters,
ŷ = F (x, W) the estimated output of the model when called on an example x with W
parameters.
L(x, W, y) = L(ŷ = F (x, W), y), the loss function used for the training.

We can use a gradient descent as long as the derivative of the loss with respect to each parameter
of the model is computable, i.e. ∂L(x, W, y)/∂W_i exists ∀x, W, y.



Introduction to Neural Networks

Gradient Machine II

What are we minimizing ?


The optimization problem is

W* = arg min_W L(X_train, W, Y_train)

W* = arg min_W Σ_{(x,y)∈(X_train, Y_train)} L(x, W, y)

At each iteration of the gradient descent, we update

W_{n+1} = W_n − η_n ΔW_n

Warning

∂L(x_i, W, y_i)/∂W_k ≠ ∂L(x_j, W, y_j)/∂W_k

How to compute the gradient on multiple examples ?



Introduction to Neural Networks

Gradient Machine III

Two main strategies are possible :

Batch Gradient
The batch (X, Y) is a subset of the training set.

ΔW_i = (1 / card(X)) Σ_{(x,y)∈(X,Y)} ∂L(x, W, y)/∂W_i

The gradient is averaged over all the examples in the batch and then used to update the
parameters.
A batch is not processed again until all other batches have been processed.

On-line Gradient

ΔW_i = ∂L(x, W, y)/∂W_i

Parameters are updated each time an example is processed. This is equivalent to a batch gradient
with a batch size of 1, card(X) = 1.


Introduction to Neural Networks

Gradient Machine IV

Stochastic gradient
We speak of stochastic gradient when the batches are presented in a random order.
As with the batch gradient, a batch only reappears when all other batches have been
processed.

Extreme setups:
On a small dataset: a single batch, non-stochastic.
On a big dataset: on-line and stochastic.



Introduction to Neural Networks

Gradient descent : general algorithm

Input: Integer Nb : number of batches


Input: Boolean Sto : stochastic gradient ?
Input: (Xtrain , Ytrain ) : training set
W ← random initialization
(Xsplit , Ysplit ) ← split((Xtrain , Ytrain ), Nb)
while stopping criterion not reached do
  if Sto then
    (Xsplit , Ysplit ) ← randperm((Xsplit , Ysplit ))
  end if
  for (Xbloc , Ybloc ) ∈ (Xsplit , Ysplit ) do
    ΔW ← 0
    for (x, y) ∈ (Xbloc , Ybloc ) do
      ΔW_i ← ΔW_i + ∂L(x, W, y)/∂W_i  ∀i
    end for
    ΔW ← ΔW / card(Xbloc )
    W ← W − η ΔW
  end for
end while
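A hedged Octave sketch of this algorithm for a generic model (gradLoss, the per-example gradient of the loss, as well as X, Y, W, Nb, Sto, eta and max_epochs, are assumed given, as in the algorithm above):

Code (sketch)
% Mini-batch (optionally stochastic) gradient descent skeleton
% X: n x d features, Y: n x c targets, W: parameters, eta: learning rate
% gradLoss(x, W, y): gradient of the loss on one example (assumption)
n = size(X, 1);
idx = round(linspace(0, n, Nb + 1));   % batch boundaries
for epoch = 1:max_epochs               % simple stopping criterion
order = 1:Nb;
if Sto
order = randperm(Nb);                  % present the batches in a random order
end
for b = order
rows = (idx(b) + 1):idx(b + 1);        % examples of the current batch
if isempty(rows), continue; end
dW = zeros(size(W));
for r = rows
dW = dW + gradLoss(X(r, :), W, Y(r, :));
end
dW = dW / numel(rows);                 % average gradient over the batch
W = W - eta * dW;
end
end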



Introduction to Neural Networks

A perceptron

(Diagram: the one-layer perceptron shown earlier, with inputs x0 = 1, x1, x2, x3, weights w_ji, sums S_j, activation f and outputs y1, y2.)

As the loss is differentiable, we can compute ∂L/∂y_j , and

∂L/∂w_ji = (∂L/∂y_j) (∂y_j/∂S_j) (∂S_j/∂w_ji)
∂L/∂w_ji = (∂L/∂y_j) f′(S_j) x_i



Introduction to Neural Networks

PseudoCode
Criterion Gradient

function [err, gradCrit] = criterion_mse(O, TARGET)
% Criterion L: mean squared error
% - O MLP outputs
% - TARGET target values
% - err value of L
% - gradCrit gradient of L with respect to O

err = sum((TARGET - O).^2);
gradCrit = -(TARGET - O);   % the constant factor 2 is dropped here

Model Gradient

function [gradW, gradIn] = backward_tanh(W, I, O, gradCrit)

% Backward pass of one MLP layer
% with a tanh transfer function
% - W layer parameters: matrix (n_inputs+1) x n_outputs
% - I layer inputs: matrix 1 x n_inputs
% - O layer outputs: matrix 1 x n_outputs
% - gradCrit gradient of L with respect to O
% - gradW gradient of L with respect to W
% - gradIn gradient of L with respect to I

% Gradient of tanh with respect to S
fGrad = (1 - O.^2);
% Gradient of L with respect to W
gradW = [I 1]' * (gradCrit .* fGrad);



Introduction to Neural Networks

Neural network

A perceptron can only solve linearly separable problems.

Neural network
To solve more complex problems, we need to build a network of perceptrons.

Principles

The network is an oriented graph; each node represents a formal neuron,


Information flows along the graph edges,
Computation is distributed over the nodes.



Introduction to Neural Networks

Multi-Layer Perceptron - Feed-forward network

x1

x2 ŷ1

x3 ŷ2

x4

Figure: Feed-forward network, with two layers and one hidden representation

Neurons are layered.


Computation always flows in one direction.



Introduction to Neural Networks

Recurrent network

At least one feedback (retroactive) loop


Hysteresis effect

x1

x2 ŷ1

x3 ŷ2

x4

Figure: Recurrent network



Introduction to Neural Networks

Recurrent network

x1,t

x2,t

x3,t ŷ1,t

ŷ1,t−3

ŷ1,t−2

ŷ1,t−1

Figure: NARX Recurrent network



Multi-Layer Perceptron - Feed-forward network

Outline

1 What are we talking about ?

2 Introduction to supervised learning

3 Introduction to Neural Networks

4 Multi-Layer Perceptron - Feed-forward network

5 Deep Neural Networks



Multi-Layer Perceptron - Feed-forward network

Scheme of a Multi Layer Perceptron

x1

x2 ŷ1

x3 ŷ2

x4

Figure: Example of feed-forward network: a 2-layer perceptron

This MLP has
an input layer and an output layer (2 layers),
an input, a hidden and an output representation (3 representations).

Formalism:
Layer: computational element,
Representation: data element.



Multi-Layer Perceptron - Feed-forward network

Estimation of ŷ: Forward path

(Diagram: layer (l) with inputs I_0^(l) = 1, I_1^(l), I_2^(l), I_3^(l), weights w_ji, sums S_j^(l), activation f^(l) and outputs O_1^(l), O_2^(l).)

If we look at layer (l), let I_i^(l) be input number i and O_j^(l) output number j:

S_j^(l) = Σ_i W_ji^(l) I_i^(l)
O_j^(l) = f^(l)(S_j^(l)) = I_j^(l+1)

Start with I^(0) = x and finish with O^(last) = ŷ.
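A minimal sketch of this forward path in Octave, chaining the forward_tanh function defined earlier for a 2-layer MLP (the batch of inputs X and the weight matrices W1, W2 are assumed given):

Code (sketch)
I0 = X;                       % input representation I(0), n_examples x n_inputs
O1 = forward_tanh(W1, I0);    % hidden representation I(1)
O2 = forward_tanh(W2, O1);    % output representation O(last) = y_hat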



Multi-Layer Perceptron - Feed-forward network

How to learn parameters ? Gradient back-propagation

(Diagram: the same layer (l) as in the forward path.)

We assume that ∂L/∂O_j^(l) is known.

∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂w_ji^(l))

∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) f′^(l)(S_j^(l)) I_i^(l)



Multi-Layer Perceptron - Feed-forward network

How to learn parameters ? Gradient back-propagation

(Diagram: the same layer (l) as in the forward path.)

Now we compute ∂L/∂I_i^(l) :

∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂I_i^(l))

∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂I_i^(l))

∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) f′^(l)(S_j^(l)) w_ji^(l)



Multi-Layer Perceptron - Feed-forward network

How to learn parameters ? Gradient back-propagation

(Diagram: the same layer (l) as in the forward path.)

Start:
∂L/∂O_j^(last) = ∂L/∂ŷ_j

Backward recurrence:
∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) f′^(l)(S_j^(l)) I_i^(l)

∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) f′^(l)(S_j^(l)) w_ji^(l)

∂L/∂O_i^(l−1) = ∂L/∂I_i^(l)



Multi-Layer Perceptron - Feed-forward network

PseudoCode
Criterion Gradient

function [err, gradCrit] = criterion_mse(O, TARGET)
% Criterion L: mean squared error
% - O MLP outputs
% - TARGET target values
% - err value of L
% - gradCrit gradient of L with respect to O

err = sum((TARGET - O).^2);
gradCrit = -(TARGET - O);

Layer Gradient (repeated backward)

function [gradW, gradIn] = backward_tanh(W, I, O, gradOut)

% Backward pass of one MLP layer
% with a tanh transfer function
% - W layer parameters: matrix (n_inputs+1) x n_outputs
% - I layer inputs: matrix 1 x n_inputs
% - O layer outputs: matrix 1 x n_outputs
% - gradOut gradient of L with respect to O
% - gradW gradient of L with respect to W
% - gradIn gradient of L with respect to I

% Gradient of tanh with respect to S
fGrad = (1 - O.^2);
% Gradient of L with respect to W
gradW = [I 1]' * (gradOut .* fGrad);
% Gradient of L with respect to I
gradIn = (W(1:(end-1), :) * (gradOut .* fGrad)')';
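Putting the pieces together, a hedged sketch of one on-line training step for a 2-layer tanh MLP built from these functions (the learning rate eta, the weight matrices W1, W2 and the training example (x, target) are assumed given):

Code (sketch)
O1 = forward_tanh(W1, x);                         % forward, hidden representation (x: 1 x n_inputs)
O2 = forward_tanh(W2, O1);                        % forward, output
[err, gradCrit] = criterion_mse(O2, target);      % loss and its gradient w.r.t. the output
[gradW2, gradO1] = backward_tanh(W2, O1, O2, gradCrit);   % backward through the output layer
[gradW1, gradI0] = backward_tanh(W1, x, O1, gradO1);      % backward through the hidden layer
W2 = W2 - eta * gradW2;                           % gradient update
W1 = W1 - eta * gradW1;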
Deep Neural Networks

Outline

1 What are we talking about ?

2 Introduction to supervised learning

3 Introduction to Neural Networks

4 Multi-Layer Perceptron - Feed-forward network

5 Deep Neural Networks



Deep Neural Networks

Deep architecture

x1

x2

x3 ŷ1

x4

x5

Why ?

Some problems need an exponential number of neurons in a single hidden representation,


Building / extracting features inside the NN avoids relying on handcrafted extraction (human
prior).



Deep Neural Networks

The vanishing gradient problem

Figure: Hyperbolic tangent, f(x) = tanh(x)

∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) f′^(l)(S_j^(l)) w_ji^(l)

When neurons in the upper layers are saturated, f′ ≈ 0 and the back-propagated gradient decreases toward zero.



Deep Neural Networks

Rectified Linear Units (ReLU)

Less saturated units ⇒ more gradient

Figure: ReLU, f(x) = max(0, x), and Softplus, f(x) = ln(1 + eˣ)

LeakyReLU, ELU, . . .



Deep Neural Networks

Improve optimization with batch normalization

Idea
Perform normalization not only when preprocessing data but also in between each layer.

One normalization per dimension: no covariance matrix inversion,


Training phase: normalization parameters computed from the current batch statistics,
Testing phase: normalization parameters computed from the full training set statistics.

x′ = (x − µ_B) / √(σ²_B + ε)

where x is a single dimension.
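A minimal Octave sketch of this forward normalization on one mini-batch (the scale-and-shift parameters of the original paper are omitted; the handling of running statistics for test time is only sketched in a comment):

Code (sketch)
% Xb: m x d mini-batch (m examples, d features), one normalization per feature
epsilon = 1e-5;
mu_B = mean(Xb, 1);                             % per-feature batch mean (1 x d)
var_B = mean((Xb - mu_B).^2, 1);                % per-feature batch variance (1 x d)
Xp = (Xb - mu_B) ./ sqrt(var_B + epsilon);      % normalized activations
% At test time, mu_B and var_B are replaced by statistics of the full training set.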

Back-propagation
How to compute ∂L/∂x from ∂L/∂x′ ?

Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.



Deep Neural Networks

Back-propagation on batch normalization

Three propagation paths


Mean and variance also depend on x !

x′_i = (x_i − µ_B) / √(σ²_B + ε)

where x is only one feature, i.e. a scalar, and i ∈ B an example.

∂L/∂σ²_B = −(1/2) (σ²_B + ε)^(−3/2) Σ_{i=1}^{m} (∂L/∂x′_i) (x_i − µ_B)

∂L/∂µ_B = −(1/√(σ²_B + ε)) Σ_{i=1}^{m} ∂L/∂x′_i − (1/m) (∂L/∂σ²_B) Σ_{i=1}^{m} 2(x_i − µ_B)

∂L/∂x_i = (∂L/∂x′_i) (1/√(σ²_B + ε)) + (1/m) ( 2(x_i − µ_B) ∂L/∂σ²_B + ∂L/∂µ_B )

where m is the number of examples in B.


Deep Neural Networks

Convolutional network: 1D layer


A unit of representation (l) is connected to a sub-slice of o units from representation (l − 1). All the weights
between units are tied, leading to only o weights. Warning: biases are not tied.
If representation (l − 1) is in R^m and (l) is in R^n, the number of parameters goes from

(m + 1) · n   to   (o + 1) · n

(Diagram: each output unit applies the same weights w1, w2, w3 to a sliding window of the input.)

Figure: 1D convolutional network
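A minimal Octave sketch of such a layer with a kernel of o = 3 tied weights, a stride of 1 and no padding (these choices and the variable names are our own assumptions):

Code (sketch)
% x: m x 1 input representation, w: o x 1 shared weights, b: n x 1 untied biases
o = numel(w);
n = numel(x) - o + 1;                    % output size
s = zeros(n, 1);
for j = 1:n
s(j) = w' * x(j:j + o - 1) + b(j);       % same weights on every sliding window
end
y = tanh(s);                             % activation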



Deep Neural Networks

Convolutional network : 2D Convolution layer I

Input feature map Kernel filter Output feature map

Figure: One output feature



Deep Neural Networks

Convolutional network : 2D Convolution layer II

Input feature map 2 kernel filters 2 output feature maps

Figure: Two output features



Deep Neural Networks

Convolutional network : 2D Subsampling I

Input feature map Max operator Output feature map

Figure: Maxpooling (sliding by 2 elements)



Deep Neural Networks

Convolutional network : 2D Subsampling II

Input feature map Kernel filter Output feature map

Figure: Dilated convolution

Yu, Fisher, and Vladlen Koltun. ”Multi-scale context aggregation by dilated convolutions.” ICLR 2016, arXiv preprint arXiv:1511.07122 (2015).



Deep Neural Networks

Convolutional network : Global template

Convolutional blocks (n times)

Convolutional layer
Batch normalization
Relu activation
(Subsampling)

Dense blocks (n times)

Batch normalization
Dense layer (+ weights regularization)
Relu/Tanh activation

Last dense block


Batch normalization
Dense layer (+ weights regularization)
Tanh/Sigmoid/Softmax activation



Deep Neural Networks

Convolutional network : 2D examples I

Figure: [LeCun 2010]

LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-55.

LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS),
Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.



Deep Neural Networks

Convolutional network : 2D examples II

Figure: GoogLeNet [Szegedy 2015]
(Architecture diagram: a stack of Inception modules, each concatenating 1x1, 3x3 and 5x5 convolution branches with max-pooling, plus two auxiliary softmax classifiers and a final average-pooling, FC and softmax head.)

Szegedy, Christian, et al. ”Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Deep Neural Networks

Fully Convolutional networks I

Only Convolutional blocks (n times)

Convolutional layer
Batch normalization
Relu activation
(Subsampling)

No Dense blocks !
Advantages: variable input size, small number of parameters,
Disadvantages: variable output size, cannot model global dependencies.



Deep Neural Networks

Fully Convolutional networks II

Figure: Semantic image segmentation

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference
on computer vision and pattern recognition. 2015.



Deep Neural Networks

Skip connection

Principle
When a layer (or a group of layers) has the same input and output size, add the input to the output.
⇒ Gradient can ”fly” over the layer.

h Layers + h’

Figure: A skip connection
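A minimal Octave sketch, reusing forward_tanh for the wrapped layers (the weight matrices W1 and W2 are assumptions; input and output sizes must match for the addition):

Code (sketch)
h1 = forward_tanh(W1, h);     % first wrapped layer
h2 = forward_tanh(W2, h1);    % second wrapped layer
h_out = h + h2;               % skip connection: h' = h + Layers(h)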



Deep Neural Networks

Skip connection example: Res-Net

Figure: Res-net
(Architecture comparison: VGG-19, a 34-layer plain network and a 34-layer residual network; the residual network stacks 3x3 convolution blocks with skip connections, ending with average pooling and a fc 1000 layer.)

He, Kaiming, et al. ”Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Deep Neural Networks

Skip connection example: U-net

Figure: U-net

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. ”U-net: Convolutional networks for biomedical image segmentation.” International Conference
on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.



Deep Neural Networks

Better initialization through unsupervised learning (old fashioned)

The learning is split into two steps:

Pre-training
An unsupervised pre-training of the input layers with auto-encoders. Intuition: learning the manifold
where the input data resides.
It can also take advantage of an unlabelled dataset.

Finetuning
A finetuning of the whole network with supervised back-propagation.

Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp 1527-1554

Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507,
28 July 2006.



Deep Neural Networks

Diabolo network, Autoencoders


Autoencoders are neural networks where the input and output representations have the same
number of units. The learned target is the input itself.

x1 x̂1

x2 x̂2
h1
x3 x̂3 x
h2
x4 x̂4

x5 x̂5

Figure: Diabolo network

When there are 2 layers:
The input layer is called the encoder,
The output layer, the decoder.

Tied weights: W_dec = W_enc^T ; convergence? PCA?
Deep Neural Networks

Diabolo network, Autoencoders


Autoencoders are neural networks where the input and output representations have the same
number of units. The learned target is the input itself.

x1 x̂1

x2 x̂2
h1
x3 x̂3 x
h2
x4 x̂4

x5 x̂5

Figure: Diabolo network

Undercomplete, size(h) < size(x)


Overcomplete, size(x) < size(h).



Deep Neural Networks

Building from auto-encoders

Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are
fixed until the finetuning.

x1 x̂1

x2 x̂2

x3 x̂3

x4 x̂4

x5 x̂5



Deep Neural Networks

Building from auto-encoders

Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are
fixed until the finetuning.

x1 h1,1 ĥ1,1

x2 h1,2 ĥ1,2

x3 h1,3 ĥ1,3

x4 h1,4 ĥ1,4

x5 h1,5 ĥ1,5



Deep Neural Networks

Building from auto-encoders

Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are
fixed until the finetuning.

x1

x2 h2,1 ĥ2,1

x3 h2,2 ĥ2,2

x4 h2,3 ĥ2,3

x5



Deep Neural Networks

Building from auto-encoders

Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are
fixed until the finetuning.

x1

x2

x3 ŷ1

x4

x5



Deep Neural Networks

Simplified stacked AE Algorithm

Input: X , a training feature set of size Nb_examples × Nb_features


Input: Y , a corresponding training label set of size Nb_examples × Nb_labels
Input: N_input , the number of input layers to be pre-trained
Output: [w1 , w2 , . . . , wN ], the parameters for all the layers
Randomly initialize [w1 , w2 , . . . , wN ]
Input pre-training
R ← X
for i ← 1..N_input do
  {Train an AE on R and keep its encoding parameters}
  [wi , w_dummy ] ← MLPTrain([wi , wi^T ], R, R)
  Drop w_dummy
  R ← MLPForward([wi ], R)
end for
Final supervised learning
[w1 , w2 , . . . , wN ] ← MLPTrain([w1 , w2 , . . . , wN ], X , Y )



Deep Neural Networks

Pre-training ⇒ Transfer learning


Source Task: Classification
AlexNet, VGG16, VGG19, GoogLeNet, . . .
ImageNet (14M samples)

C1 C2 C3 C4 C5 FC1 FC2 FC3 → 1000 classes

Parameter transfer of C1 . . . FC1

Target Task: Regression

C1 C2 C3 C4 C5 FC1 → L3 slice prediction (pixel)

L3CT1 (642 samples)

Cireşan, D. C., Meier, U., & Schmidhuber, J. (2012, June). Transfer learning for Latin and Chinese characters with deep neural networks. In Neural
Networks (IJCNN), The 2012 International Joint Conference on (pp. 1-6)

S. Belharbi, C. Chatelain, R. Hérault, S. Adam, S. Thureau, M. Chastan, R. Modzelewski, Spotting L3 slice in CT scans using deep convolutional
network and transfer learning, Computers in Biology and Medicine



Deep Neural Networks

Improve optimization by adding noise 1/3

Denoising (undercomplete) auto-encoders


The auto-encoder is learned from x̃, a disturbed x; the target is still x.

x1 x̃1 x̂1

x2 x̃2 x̂2
Disturbance

h1
x3 x̃3 x̂3 x
h2
x4 x̃4 x̂4

x5 x̃5 x̂5



Deep Neural Networks

Improve optimization by adding noise 2/3

Prevent co-adaptation in (overcomplete) autoencoders


During training, randomly disconnect hidden units.

h1
x1 x̂1
h2
x2 x̂2
h3
x3 x̂3
h4

h5





Deep Neural Networks

Improve optimization by adding noise 2/3

Prevent co-adaptation in (overcomplete) autoencoders


During training, randomly disconnect hidden units.

Figure: MNIST [Hinton 2012]



Deep Neural Networks

Improve optimization by adding noise 3/3

Dropout

During training, at each iteration, randomly disconnect weights with probability p.


At testing, multiply the weights by (# actual disconnections / # iterations) (≠ p).

x1

x2

x3 ŷ1

x4

x5

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of
feature detectors. arXiv:1207.0580.
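A minimal Octave sketch of this scheme for one layer, dropping weights during training and rescaling them afterwards as described above (p, n_iter and the omitted update step are our own assumptions):

Code (sketch)
p = 0.5;                                 % disconnection probability
drop_count = zeros(size(W));             % bookkeeping of the actual disconnections
for it = 1:n_iter
mask = (rand(size(W)) > p);              % 1 = kept, 0 = disconnected at this iteration
O = forward_tanh(W .* mask, I);          % forward pass with the thinned weights
% ... backward pass and update of the kept weights ...
drop_count = drop_count + (1 - mask);
end
W_test = W .* (drop_count / n_iter);     % test-time rescaling by the empirical fraction, as stated above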





Deep Neural Networks

Improve optimization by adding noise 3/3

Dropout

During training, at each iteration, randomly disconnect weights with probability p.


At testing, multiply the weights by (# actual disconnections / # iterations) (≠ p).

Figure: Reuters dataset

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of
feature detectors. arXiv:1207.0580.



Deep Neural Networks

Tikhonov regularization scheme (new school)

Noise and early stopping are connected to regularization.


So why not use the Tikhonov regularization scheme ?

J = Σ_i L(y_i, f_MLP(x_i; w)) + λ Ω(w)

Notation
2-layer MLP

ŷ = f_MLP(x; w_in, w_out) = f_out(b_out + w_out · f_in(b_in + w_in · x))

AE

x̂ = f_AE(x; w_enc, w_dec) = f_dec(b_dec + w_dec · f_enc(b_enc + w_enc · x))

Tied weights

w_in ↔ w_enc , w_dec ↔ w_enc^T

Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1), 108-116

Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML’2004



Deep Neural Networks

Regularization on weights

J = Σ_i L(y_i, f_MLP(x_i; w)) + λ Ω(w_out)

L2 (Gaussian prior):
Ω(w_out) = Σ_d ||w_d||²

L1 (Laplace prior):
Ω(w_out) = Σ_d |w_d|

t-Student:
Ω(w_out) = Σ_d log(1 + w_d²)
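The three penalties, and the gradients needed to use them in back-propagation (the gradients are our own addition), in a short Octave sketch:

Code (sketch)
Omega_L2 = sum(wout(:).^2);            % L2, Gaussian prior
Omega_L1 = sum(abs(wout(:)));          % L1, Laplace prior
Omega_tS = sum(log(1 + wout(:).^2));   % t-Student

grad_L2 = 2 * wout;                    % d Omega / d wout, to be scaled by lambda
grad_L1 = sign(wout);
grad_tS = 2 * wout ./ (1 + wout.^2);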

With infinite units,


L1 : boosting
L2 : SVM
Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. In Advances in neural information processing
systems (pp. 123-130)



Deep Neural Networks

Regularization brought by multi-task learning / embedding

Combine multiple tasks in the same optimization problem.


Tasks share parameters.

J = λ_L Σ_{i∈L} L(y_i, f_MLP(x_i; w_out, w_in))
  + λ_U Σ_{i∈L∪U} L(x_i, f_AE(x_i; w_in))
  + λ_Ω Ω(w_out)

Mix supervised (L) and unsupervised (U) data.


Weston, J., Ratle, F., and Collobert, R. . Deep learning via semi-supervised embedding. ICML, pages 1168–1175, 2008



Deep Neural Networks

Open question: Adversarial examples

Forge misclassified examples by back-propagating the gradient into the input (example) space.

Figure: Some adversarial examples

Szegedy, Christian, et al. ”Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).



Deep Neural Networks

Thank you

Questions ?

