CM01 Deep Learning
Romain Hérault
Autumn 2021
Outline
Object classification
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ”Imagenet classification with deep convolutional neural networks.” Advances in neural
information processing systems. 2012.
Object detection
Figure: Object detection examples — bounding boxes with class labels and confidence scores (person, dog, horse, bus, boat).
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in
neural information processing systems (pp. 91-99).
Semantic segmentation
Figure: Semantic segmentation examples — panels (a, d) test images, (b, e) ground truth, (c, f) predictions.
Guosheng Lin, Chunhua Shen, Ian Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013,
2015.
Image captioning
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 3128-3137.
Outline
Setup
Objective
Find the link f : X → Y (or the dependencies p(y|x) ) between the input and the output spaces.
Hypotheses space
f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision trees, ...).
How to choose f within H? By minimizing an empirical risk of the form \sum_{(x,y)} L(f(x), y) over the training data,
where L is a loss function that measures the accuracy of a prediction f(x) with respect to a target value y.
Regression
If Y = R^o, it is a regression task.
Standard losses are (y − f(x))^2 and |y − f(x)|.
Figure: Support Vector Machine regression — fitted function y = f(x) over the training points.
Classification / Discrimination
If Y is a discrete set of labels, it is a classification / discrimination task.
Figure: Classification example — decision regions in the (x1, x2) plane.
Available data
Data consist of a set of n examples (x, y), with x ∈ X and y ∈ Y, split into:
A training set that will be used to choose f ,
i.e. to learn the parameters w of the model
A test set to evaluate the chosen f
(A validation set to choose the hyper-parameters of f )
Because of the human cost of labelling data, one can often find a separate unlabelled set, i.e.
examples with only the features x (see semi-supervised learning).
R_S(f) = \frac{1}{\mathrm{card}(S)} \sum_{(x,y) \in S} L(f(x), y)
where S is the training set during learning and the test set during final evaluation.
Figure: Empirical risk on the learning set and on the test set as a function of model complexity (low to high).
The model is chosen by minimizing a regularized objective of the form L + λ Ω, where
L is a loss term that measures the accuracy of the model,
Ω is a regularization term that limits the capacity of the model,
λ ∈ [0, ∞[ is the regularization hyper-parameter.
Solution
w(λ) = (X^⊤ X + λ I)^{-1} X^⊤ Y
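This closed-form expression is the solution of linear regression with the squared loss and an L2 penalty (ridge regression). A minimal Octave sketch, assuming synthetic data X and Y (not from the slides):

% Hypothetical data: n examples, d features
n = 100; d = 5;
X = randn(n, d);
Y = X * [2; 0; -1; 0; 3] + 0.1 * randn(n, 1);

lambda = 0.5;                               % regularization hyper-parameter
w = (X' * X + lambda * eye(d)) \ (X' * Y);  % w(lambda) = (X'X + lambda I)^(-1) X'Y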
Regularization path:
Figure: Regularization path in the (w0, w1) plane, from λ = 0 to λ = +∞, with the level sets of the loss term and of the regularization term.
Introducing sparsity
Lasso
Linear regression with the sum of squared errors as loss and an L1 norm as regularization:
\arg\min_{w \in R^d} \; \|Y - X w\|^2 + λ \sum_d |w_d|
which is equivalent to the constrained problem: minimize \|Y - X w\|^2 subject to \sum_d |w_d| \le t.
Why is it sparse ?
Lasso: illustration
Figure: Lasso regularization path in the (w0, w1) plane, from λ = 0 to λ = +∞, with the level sets of the loss term and of the L1 regularization term.
Outline
History . . .
Biological neuron
Origin
Schematic
Figure: Artificial neuron — inputs x1, ..., xm weighted by w1, ..., wm, summed (Σ) with a bias b, then passed through an activation to produce ŷ1.
Formulation
ŷ = f(⟨w, x⟩ + b)    (1)
where
x, input vector,
ŷ, output estimation,
w, weights linked to each input (model parameter),
b, bias (model parameter),
f , activation function.
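A small numerical illustration of equation (1), with arbitrary values (not from the slides) and a sigmoid activation:

x = [0.5; -1.2; 2.0];          % input vector
w = [0.1;  0.4; -0.3];         % weights
b = 0.2;                       % bias
f = @(s) 1 ./ (1 + exp(-s));   % sigmoid activation
y_hat = f(dot(w, x) + b);      % y_hat = f(<w, x> + b)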
Evaluation
Typical losses are the squared error (regression) or the cross-entropy (classification).
Classification
Activation functions are typically the step function, the sigmoid function (range [0, 1]) or the hyperbolic tangent (range [−1, 1]).
Figure: Sigmoid activation, f(x) = sigm(x).
Figure: Hyperbolic tangent activation, f(x) = tanh(x).
If the loss and the activation function are differentiable, the parameters w and b can be learned by gradient descent.
A perceptron
Figure: A perceptron — inputs x0 = 1 (bias), x1, x2, x3, weights w_{ji}, sums S1, S2, activation f, outputs y1, y2.

S_j = \sum_i w_{ji} x_i
y_j = f(S_j)
PseudoCode
Code
function O = forward_tanh(W, I)
  % Forward pass of one MLP layer
  % with a tanh transfer function
  % - W  layer parameters: matrix (n_inputs+1) x n_outputs
  % - I  layer inputs:     matrix n_examples x n_inputs
  % - O  layer outputs:    matrix n_examples x n_outputs

  % Linear form
  % A column of ones is appended to the input to implement the bias
  nI = size(I, 1);
  S  = [I ones(nI, 1)] * W;

  % Transfer function
  O  = tanh(S);
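For example (illustrative sizes only), a layer with 3 inputs and 2 outputs applied to a batch of 4 examples:

W = 0.1 * randn(3 + 1, 2);   % (n_inputs+1) x n_outputs, the last row holds the biases
I = randn(4, 3);             % 4 examples, 3 inputs
O = forward_tanh(W, I);      % 4 x 2 matrix of outputs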
If a function F(w) taking its values in R is defined and differentiable at a point a, then F decreases fastest in the direction opposite to the gradient of F at a, i.e. −∇F(a).
For f : R → R, there is an η small enough so that f(a − η f'(a)) ≤ f(a).
For F : R^n → R, there is an η small enough so that F(a − η ∇F(a)) ≤ F(a).
1. Set the learning rate η_n ← η_0.
2. Compute w_{n+1} = w_n − η_n ∇F(w_n).
3. If F(w_{n+1}) > F(w_n), the learning rate η_n is too big; correct it by η_n ← α η_n, with 0 < α < 1.
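A minimal Octave sketch of this procedure on a toy one-dimensional function (the function F, its gradient and the constants are illustrative assumptions):

F     = @(w) (w - 2).^2 + 1;   % toy objective
gradF = @(w) 2 * (w - 2);      % its gradient
w     = -1;                    % initial point
eta   = 1.5;                   % initial learning rate eta_0
alpha = 0.5;                   % correction factor, 0 < alpha < 1
for n = 1:50
  w_new = w - eta * gradF(w);  % gradient step
  if F(w_new) > F(w)           % learning rate too big:
    eta = alpha * eta;         % correct it and retry
  else
    w = w_new;
  end
end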
Figure: Gradient descent iterations on a one-dimensional function f(w) and on a two-dimensional function f(w1, w2) (contour plots of the trajectory).
Gradient Machine I
Principle
The learning problem is cast as an optimization problem solved by gradient descent.
Notation
(x, y) ∈ (X_train, Y_train), a training example: x is the feature vector of the example, y its output or label,
F the model,
W the model parameters,
ŷ = F (x, W) the estimated output of the model when called on an example x with W
parameters.
L(x, W, y) = L(ŷ = F (x, W), y), the loss function used for the training.
We can use a gradient descent as long as the derivative of the loss with respect to each parameter of the model is computable, i.e. ∂L(x, W, y)/∂W_i exists ∀x, W, y.
Gradient Machine II
Wn+1 = Wn − ηn ∆Wn
Warning
\frac{\partial L(x_i, W, y_i)}{\partial W_k} \neq \frac{\partial L(x_j, W, y_j)}{\partial W_k}
How to compute the gradient on multiple examples?
Batch Gradient
The batch (X , Y) is a subset of the training set.
\Delta W_i = \frac{1}{\mathrm{card}(\mathcal{X})} \sum_{(x,y) \in (\mathcal{X}, \mathcal{Y})} \frac{\partial L(x, W, y)}{\partial W_i}
The gradient is averaged over all the examples in the batch and then used to update the parameters.
A batch is not processed again until all other batches are processed.
On-line Gradient
\Delta W_i = \frac{\partial L(x, W, y)}{\partial W_i}
Parameters are updated each time an example is processed. It is equivalent to a batch gradient with a batch size of 1, card(X) = 1.
Gradient Machine IV
Stochastic gradient
We speak of stochastic gradient when the batches are presented in a random order.
As with the batch gradient, a batch only reappears when all the other batches have been processed.
Extreme setups:
On a small dataset: a single batch, non-stochastic.
On a big dataset: on-line and stochastic.
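A minimal Octave sketch of a mini-batch stochastic gradient loop on a toy linear regression (the data, batch size and learning rate are illustrative assumptions):

% Toy data: linear regression learned by mini-batch stochastic gradient
n = 200; d = 3;
Xtrain = randn(n, d);
Ytrain = Xtrain * [1; -2; 0.5] + 0.05 * randn(n, 1);

W = zeros(d, 1);  eta = 0.05;  batch_size = 20;
for epoch = 1:100
  perm = randperm(n);                         % stochastic: batches in random order
  for k = 1:batch_size:n
    idx = perm(k:min(k + batch_size - 1, n));
    X = Xtrain(idx, :);  Y = Ytrain(idx, :);
    dW = X' * (X * W - Y) / numel(idx);       % average gradient of the squared loss on the batch
    W  = W - eta * dW;                        % update after each batch
  end
end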
A perceptron
Figure: A perceptron — inputs x0 = 1 (bias), x1, x2, x3, weights w_{ji}, sums S1, S2, activation f, outputs y1, y2.
As the loss is differentiable, we can compute ∂L/∂y_j.
\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial S_j} \frac{\partial S_j}{\partial w_{ji}}
\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j} f'(S_j) \, x_i
PseudoCode
Criterion Gradient
function [err, gradCrit] = criterion_mse(O, TARGET)
  % MSE criterion L
  % - O         MLP outputs
  % - TARGET    target values
  % - err       value of L
  % - gradCrit  gradient of L with respect to O
  err      = sum((TARGET - O).^2);
  gradCrit = -(TARGET - O);   % the factor 2 of the exact derivative is dropped here
Model Gradient
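The model-gradient code itself is not reproduced here. A minimal sketch of what it could look like for the tanh layer above (hypothetical backward_tanh function, following the conventions of forward_tanh and criterion_mse):

function [gradW, gradI] = backward_tanh(W, I, O, gradO)
  % Backward pass of one MLP layer with a tanh transfer function (sketch)
  % - W      layer parameters: matrix (n_inputs+1) x n_outputs
  % - I      layer inputs:     matrix n_examples x n_inputs
  % - O      layer outputs:    matrix n_examples x n_outputs (O = tanh(S))
  % - gradO  dL/dO:            matrix n_examples x n_outputs
  % - gradW  dL/dW, gradI dL/dI (to be propagated to the previous layer)
  nI    = size(I, 1);
  dS    = gradO .* (1 - O.^2);     % dL/dS = dL/dO * f'(S), with f'(S) = 1 - tanh(S)^2
  gradW = [I ones(nI, 1)]' * dS;   % dL/dW on the augmented input (bias column included)
  gradI = dS * W(1:end-1, :)';     % dL/dI, dropping the bias row of W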
Neural network
To solve more complex problems, we need to build a network of perceptrons
Principles
Figure: Feed-forward network, with two layers and one hidden representation (inputs x1..x4, outputs ŷ1, ŷ2).
Recurrent network
Figure: Recurrent network — the connection graph contains cycles.
Recurrent network
Figure: Recurrent network — the past outputs ŷ1,t−1, ŷ1,t−2, ŷ1,t−3 are fed back as inputs alongside x1,t, x2,t, x3,t to produce ŷ1,t.
Outline
Figure: Feed-forward network (inputs x1..x4, outputs ŷ1, ŷ2).
Figure: Layer (l) of an MLP — inputs I^{(l)}_0 = 1 (bias), I^{(l)}_1, I^{(l)}_2, I^{(l)}_3, weights w_{ji}, sums S^{(l)}_1, S^{(l)}_2, activation f^{(l)}, outputs O^{(l)}_1, O^{(l)}_2.
If we look at layer (l), let I^{(l)}_i be its input number i and O^{(l)}_j its output number j.
Figure: Layer (l) (same notation as above).
We assume ∂L/∂O^{(l)}_j to be known.
\frac{\partial L}{\partial w^{(l)}_{ji}} = \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial w^{(l)}_{ji}}
Figure: Layer (l) (same notation as above).
Now we compute ∂L/∂I^{(l)}_i:
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial I^{(l)}_i}
= \sum_j \frac{\partial L}{\partial O^{(l)}_j} \frac{\partial O^{(l)}_j}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial I^{(l)}_i}
= \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
Figure: Layer (l) (same notation as above).
Start:
\frac{\partial L}{\partial O^{(\mathrm{last})}_j} = \frac{\partial L}{\partial \hat{y}_j}
Backward recurrence:
\frac{\partial L}{\partial w^{(l)}_{ji}} = \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, I^{(l)}_i
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
\frac{\partial L}{\partial O^{(l-1)}_i} = \frac{\partial L}{\partial I^{(l)}_i}
PseudoCode
Criterion Gradient
function [err, gradCrit] = criterion_mse(O, TARGET)
  % MSE criterion L
  % - O         MLP outputs
  % - TARGET    target values
  % - err       value of L
  % - gradCrit  gradient of L with respect to O
  err      = sum((TARGET - O).^2);
  gradCrit = -(TARGET - O);   % the factor 2 of the exact derivative is dropped here
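Putting the pieces together, one gradient step on a single layer could look as follows (an illustrative sketch with arbitrary data; backward_tanh is the hypothetical function sketched earlier):

X = randn(10, 3);             % 10 examples, 3 inputs (arbitrary data)
T = randn(10, 1);             % targets, 1 output
W = 0.1 * randn(4, 1);        % (3 inputs + bias) x 1 output
eta = 0.01;

O = forward_tanh(W, X);                          % forward pass
[err, gradO] = criterion_mse(O, T);              % loss L and dL/dO
[gradW, gradX] = backward_tanh(W, X, O, gradO);  % dL/dW (and dL/dI, unused here)
W = W - eta * gradW;                             % one gradient step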
Outline
Deep architecture
Figure: Deep feed-forward network (inputs x1..x5, output ŷ1, several hidden layers).
Why?
Figure: Hyperbolic tangent activation, f(x) = tanh(x) (saturating).
\frac{\partial L}{\partial I^{(l)}_i} = \sum_j \frac{\partial L}{\partial O^{(l)}_j} f'^{(l)}(S^{(l)}_j) \, w^{(l)}_{ji}
When neurons at higher layers are saturated, the gradient decreases toward zero (vanishing gradient).
Figure: Non-saturating activations — softplus f(x) = ln(1 + e^x) and ReLU f(x) = max(0, x).
Variants: LeakyReLU, ELU, ...
Idea
Perform normalization not only when preprocessing data but also in between each layer.
x' = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
Back-propagation
How to compute \frac{\partial L}{\partial x} from \frac{\partial L}{\partial x'}?
Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
x'_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\frac{\partial L}{\partial \sigma_B^2} = \frac{-1}{2} (\sigma_B^2 + \epsilon)^{-3/2} \sum_{i=1}^{m} \frac{\partial L}{\partial x'_i} (x_i - \mu_B)
\frac{\partial L}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \sum_{i=1}^{m} \frac{\partial L}{\partial x'_i} - \frac{1}{m} \frac{\partial L}{\partial \sigma_B^2} \sum_{i=1}^{m} 2 (x_i - \mu_B)
\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial x'_i} \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{1}{m} \left( 2 (x_i - \mu_B) \frac{\partial L}{\partial \sigma_B^2} + \frac{\partial L}{\partial \mu_B} \right)
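A minimal Octave sketch of these forward and backward computations for one feature over a batch of m values (all values and the epsilon are illustrative assumptions):

m       = 8;
epsilon = 1e-5;
x   = randn(m, 1);                        % one feature over a batch (arbitrary values)
mu  = mean(x);                            % batch mean mu_B
s2  = mean((x - mu).^2);                  % batch variance sigma_B^2
xh  = (x - mu) ./ sqrt(s2 + epsilon);     % normalized activations x'

dxh = randn(m, 1);                        % dL/dx', assumed to come from the next layer
ds2 = sum(dxh .* (x - mu)) * (-0.5) * (s2 + epsilon)^(-1.5);         % dL/dsigma_B^2
dmu = -sum(dxh) / sqrt(s2 + epsilon) - ds2 * sum(2 * (x - mu)) / m;  % dL/dmu_B
dx  = dxh ./ sqrt(s2 + epsilon) + (2 * (x - mu) * ds2 + dmu) / m;    % dL/dx_i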
Figure: Weight sharing in a convolutional layer — the same small set of weights (w1, w2, w3) is applied at every position of the input ((m + 1) ∗ n values mapped to (o + 1) ∗ n values).
Yu, Fisher, and Vladlen Koltun. ”Multi-scale context aggregation by dilated convolutions.” ICLR 2016, arXiv preprint arXiv:1511.07122 (2015).
Convolutional layer
Batch normalization
ReLU activation
(Subsampling)
Batch normalization
Dense layer (+ weight regularization)
ReLU/Tanh activation
LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-55.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS),
Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
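A minimal Octave sketch of one convolution / batch-normalization / ReLU / subsampling stage on a single feature map (the input size, the kernel and the epsilon are illustrative assumptions; the normalization is reduced to a single map):

X = rand(28, 28);                                % input map (arbitrary size)
K = 0.1 * randn(5, 5);                           % convolution kernel
S = conv2(X, K, 'valid');                        % convolutional layer
S = (S - mean(S(:))) / sqrt(var(S(:)) + 1e-5);   % batch normalization (over this map only)
A = max(S, 0);                                   % ReLU activation
P = A(1:2:end, 1:2:end);                         % (subsampling by a factor of 2)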
Figure: GoogLeNet [Szegedy 2015]
Convolutional network: 2D examples II
Convolutional layer
Batch normalization
ReLU activation
(Subsampling)
No dense blocks!
Advantages: variable input size, small number of parameters.
Disadvantages: variable output size, cannot model global dependencies.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference
on computer vision and pattern recognition. 2015.
Skip connection
Principle
When a layer (or a group of layers) has the same input and output size, add the input to the output.
⇒ Gradient can ”fly” over the layer.
h → [Layers] → (+) → h', i.e. h' = h + Layers(h)
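A minimal sketch, where layer stands for any block that keeps the input size (an arbitrary choice here):

layer = @(h) tanh(0.5 * h);   % any block with the same input and output size (arbitrary here)
h  = randn(4, 1);
hp = h + layer(h);            % skip connection: h' = h + Layers(h)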
Figure: Res-net
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Figure: U-net
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. ”U-net: Convolutional networks for biomedical image segmentation.” International Conference
on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
Pre-training
An unsupervised pre-training of the input layers with auto-encoders. Intuition: learning the manifold where the input data reside.
Can take into account an unlabelled dataset.
Finetuning
A finetuning of the whole network with supervised back-propagation.
Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp 1527-1554
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507,
28 July 2006.
Figure: Autoencoder — inputs x1..x5, hidden representation h1, h2, reconstructions x̂1..x̂5.
With 2 layers:
the input layer is called the encoder,
the output layer, the decoder.
Tied weights W_dec = W_enc^⊤; convergence? PCA?
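A minimal sketch of an autoencoder forward pass with tied weights (sizes, activations and the absence of biases are illustrative assumptions):

d = 5; k = 2;
W_enc = 0.1 * randn(k, d);    % encoder weights
x     = randn(d, 1);
h     = tanh(W_enc * x);      % hidden representation (encoder)
W_dec = W_enc';               % tied weights: W_dec = W_enc'
x_hat = W_dec * h;            % reconstruction (decoder, linear output here)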
Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are fixed until the finetuning.
Figure: Stacked pre-training — a first autoencoder is trained on the inputs x1..x5; a second autoencoder is then trained on the first hidden representation h1,1..h1,5; a third on the second hidden representation h2,1..h2,3; the pre-trained layers are finally stacked into the feed-forward network producing ŷ1.
Transfer learning
Figure: Network layers C1..C5 (convolutional) and FC1 (dense), reused from one task to another.
Cireşan, D. C., Meier, U., & Schmidhuber, J. (2012, June). Transfer learning for Latin and Chinese characters with deep neural networks. In Neural
Networks (IJCNN), The 2012 International Joint Conference on (pp. 1-6)
S. Belharbi, C. Chatelain, R. Hérault, S. Adam, S. Thureau, M. Chastan, R. Modzelewski, Spotting L3 slice in CT scans using deep convolutional
network and transfer learning, Computers in Biology and Medicine
Figure: Denoising autoencoder — the inputs x1..x5 are corrupted (disturbance) into x̃1..x̃5, encoded into h1, h2, and reconstructed as x̂1..x̂5 to match the clean inputs x.
Figure: Over-complete autoencoder — the hidden representation h1..h5 is larger than the input x1..x3.
Dropout
Figure: Dropout — at each training step, a random subset of the units is dropped, so a different sub-network is trained each time.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
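A minimal sketch of dropout applied to a vector of hidden activations during training (the drop probability and the rescaling convention are illustrative assumptions):

p    = 0.5;                    % probability of dropping a unit
h    = randn(10, 1);           % hidden activations (arbitrary)
mask = rand(size(h)) > p;      % random subset of units to keep
h    = (h .* mask) / (1 - p);  % dropped units are zeroed; kept units rescaled ("inverted" dropout)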
Notation
2-layer MLP
ŷ = f_MLP(x; w_in, w_out) = f_out(b_out + w_out · f_in(b_in + w_in · x))
AE
x̂ = f_AE(x; w_enc, w_dec) = f_dec(b_dec + w_dec · f_enc(b_enc + w_enc · x))
Tied weights: w_dec = w_enc^⊤
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1), 108-116
Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML’2004
Regularization on weights
J = \sum_i L(y_i, f_MLP(x_i; w)) + λ \, Ω(w_out)
L2 (Gaussian prior): Ω(w_out) = \sum_d w_d^2
L1 (Laplace prior): Ω(w_out) = \sum_d |w_d|
t-Student: Ω(w_out) = \sum_d \log(1 + w_d^2)
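A minimal sketch computing the three penalties on an arbitrary weight vector:

w        = [0.5; -2.0; 0.0; 1.3];    % output-layer weights (arbitrary)
omega_l2 = sum(w.^2);                % L2 (Gaussian prior)
omega_l1 = sum(abs(w));              % L1 (Laplace prior)
omega_ts = sum(log(1 + w.^2));       % t-Student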
Semi-supervised objective, where L is the set of labelled examples and U the set of unlabelled examples:
J = λ_L \sum_{i \in L} L(y_i, f_MLP(x_i; w_out, w_in)) + λ_U \sum_{i \in L \cup U} L(x_i, f_AE(x_i; w_in)) + λ_Ω \, Ω(w_out)
Szegedy, Christian, et al. ”Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).
Thank you
Questions ?