Regularization for
Deep Learning
Safak Ozkan
April 15, 2017
Chapter 7: Regularization for Deep Learning
L2 Parameter Regularization
L1 Parameter Regularization
Norm Penalties and Constrained Optimization
Regularization and Under-Constrained Problems
Dataset Augmentation
Noise Robustness
Injecting Noise at Output Targets
Early Stopping
Semi-Supervised Learning
Multi-Task Learning
Parameter Tying and Parameter Sharing
Bagging and Other Ensemble Methods
Dropout
Adversarial Training
Tangent Distance, Manifold Tangent Classifier
Definition: a regularized objective adds a parameter norm penalty Ω(θ) to the loss,
J̃(θ; X, y) = J(θ; X, y) + α Ω(θ),
where α ≥ 0 weighs the penalty term relative to the standard objective J.
L2 Regularization
(a.k.a. Weight decay, Tikhonov regularization, Ridge regression)
The penalty is Ω(w) = (1/2)‖w‖₂², so the objective becomes
J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw,
where α is the regularization parameter and (α/2) wᵀw is the additional term.
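A minimal sketch (my own, not from the slides) of how the penalty enters a gradient step: the gradient of (α/2) wᵀw is αw, so each update shrinks the weights toward zero. The toy linear-regression data below is a hypothetical example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

    alpha = 0.1       # regularization parameter
    lr = 0.01         # learning rate
    w = np.zeros(5)

    for _ in range(1000):
        grad_loss = X.T @ (X @ w - y) / len(y)   # gradient of the data term
        grad_penalty = alpha * w                 # gradient of (alpha/2) * w.T w
        w -= lr * (grad_loss + grad_penalty)     # "weight decay" update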
L2 Regularization
Minimizing J(w) + (α/2)‖w‖₂² is equivalent to optimizing J(w)
such that ‖w‖₂² ≤ k, for some constant k that depends on α.
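A hedged sketch of this constrained view (my own illustration, with an arbitrarily chosen k): instead of adding the penalty, project the weights back onto the ball ‖w‖₂² ≤ k after each gradient step.

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

    k, lr = 4.0, 0.01                         # constraint: ||w||_2^2 <= k
    w = np.zeros(5)
    for _ in range(2000):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        norm_sq = w @ w
        if norm_sq > k:                       # project back onto the constraint set
            w *= np.sqrt(k / norm_sq)
    print(w, w @ w)                           # ||w||^2 stays at (or below) k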
L2 Regularization
Figure: contours of the objective around the unregularized solution w*, together with the contours of the L2 penalty; the regularized solution lies between the two, shrunk least along directions where the objective has large curvature (large Hessian eigenvalues) and most along low-curvature directions.
L2 Regularization
At the minimum w̃ of the (quadratically approximated) regularized objective, the gradient vanishes:
α w̃ + H (w̃ − w*) = 0  ⟹  w̃ = (H + α I)⁻¹ H w*,
where H is the Hessian of J at the unregularized solution w*.
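A small numerical check of this relation (my own sketch): for a quadratic objective with known Hessian H and minimum w*, gradient descent on the regularized objective reaches the same point as the closed form (H + αI)⁻¹ H w*.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(4, 4))
    H = A @ A.T + np.eye(4)                 # a positive-definite Hessian
    w_star = rng.normal(size=4)             # unregularized minimum of the quadratic
    alpha = 0.5

    # Closed form from the analysis above: w_tilde = (H + alpha*I)^(-1) H w*
    w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

    # Gradient descent on J(w) + (alpha/2)||w||^2, where grad J = H (w - w*)
    w = np.zeros(4)
    for _ in range(5000):
        w -= 0.01 * (H @ (w - w_star) + alpha * w)

    print(np.allclose(w, w_tilde))          # True: both reach the same minimum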
L2 Regularization
Normal Equations for Linear Regression
Assume a linear-regression cost ‖Xw − y‖². The normal equations (XᵀX) w = Xᵀ y then become (XᵀX + α I) w = Xᵀ y under L2 regularization.
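A minimal numpy sketch of the two solutions (illustrative data of my own): weight decay simply adds αI to the Gram matrix XᵀX before solving.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)
    alpha = 0.5

    # Ordinary least squares:        (X^T X) w = X^T y
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # With L2 regularization:        (X^T X + alpha*I) w = X^T y
    w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

    print(w_ols, w_ridge)           # the ridge solution is shrunk toward zero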
L1 Regularization
(a.k.a. LASSO)
The penalty is Ω(w) = ‖w‖₁ = Σᵢ |wᵢ|, giving J̃(w; X, y) = J(w; X, y) + α ‖w‖₁.
The L1 regularization term induces sparsity: many weights are driven exactly to zero.
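A hedged sketch of why the L1 term induces sparsity (my own illustration, using a proximal/ISTA-style update rather than plain gradient descent): the penalty becomes a soft-thresholding step that sets small coordinates exactly to zero.

    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t*||v||_1: shrink toward zero, clip small values to 0."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 10))
    true_w = np.zeros(10)
    true_w[[0, 3]] = [2.0, -1.5]            # only two informative features
    y = X @ true_w + 0.1 * rng.normal(size=100)

    alpha, lr = 0.1, 0.01
    w = np.zeros(10)
    for _ in range(2000):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the smooth data term
        w = soft_threshold(w - lr * grad, lr * alpha)

    print(np.round(w, 3))                   # most coordinates are exactly zero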
Under-Constrained Problems
E.g. logistic regression on linearly separable classes: without regularization the weights grow without bound, and closed-form expressions that invert XᵀX fail when XᵀX is singular; weight decay makes the problem well posed.
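A small illustration of the under-constrained case (my own toy example): with perfectly collinear columns XᵀX is singular, so the normal equations have no unique solution, but adding αI makes the system invertible.

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])        # second column = 2 * first column
    y = np.array([1.0, 2.0, 3.0])
    alpha = 0.1

    print(np.linalg.matrix_rank(X.T @ X))               # 1: X^T X is singular
    w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
    print(w)                                             # regularization picks a unique solution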
Data Augmentation
The best way to improve generalization of a model is
to train it on more data.
Data Augmentation works particularly well for
Object Recognition tasks.
Injecting noise into the input works well for
Speech Recognition.
Figure: an original input image alongside augmented versions produced by affine distortion, added noise, and elastic deformation.
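A minimal sketch of input-side augmentation in the spirit of the figure (hypothetical transforms of my own choosing, plain numpy): random horizontal flips, small translations, and additive noise.

    import numpy as np

    def augment(image, rng):
        """Return a randomly transformed copy of a (H, W) image array."""
        out = image.copy()
        if rng.random() < 0.5:
            out = out[:, ::-1]                               # random horizontal flip
        shift = rng.integers(-2, 3)
        out = np.roll(out, shift, axis=1)                    # small random translation
        out = out + rng.normal(scale=0.05, size=out.shape)   # additive noise
        return out

    rng = np.random.default_rng(4)
    image = rng.random((28, 28))                 # placeholder "original input image"
    batch = np.stack([augment(image, rng) for _ in range(8)])
    print(batch.shape)                           # (8, 28, 28): 8 augmented variants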
Noise Robustness
Adding input noise with small variance is, for some models,
equivalent to imposing a norm penalty on the weights.
Noise on weights: a stochastic implementation of
Bayesian inference (uncertainty in the weights is
represented by a probability distribution).
For linear regression with Gaussian weight noise of variance η, the modified cost function is
E[(ŷ(x) − y)²] + η E[‖∇_W ŷ(x)‖²],
where the second term acts as a regularization term, favoring weights for which small perturbations have little effect on the output.
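A hedged sketch of weight-noise injection (one possible implementation of my own, not the slide's): Gaussian noise with variance η is added to the weights before each gradient evaluation.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 4))
    y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=200)

    eta = 0.01        # weight-noise variance
    lr = 0.01
    w = np.zeros(4)

    for _ in range(2000):
        noise = rng.normal(scale=np.sqrt(eta), size=w.shape)
        w_noisy = w + noise                         # perturb the weights for this step
        grad = X.T @ (X @ w_noisy - y) / len(y)     # gradient evaluated at noisy weights
        w -= lr * grad
    print(np.round(w, 2))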
Early Stopping
Early stopping behaves like L2 regularization with an effective regularization parameter
α ≈ 1 / (τ ε),
where τ is the number of parameter update steps and ε is the learning rate: fewer steps or a smaller learning rate correspond to stronger regularization.
Early Stopping
Early stopping: terminate training while validation-set performance is still better.
Figure 7.3: learning curves showing how the negative log-likelihood loss changes over time (x-axis: time in epochs; y-axis: loss). The training loss keeps decreasing while the validation loss eventually starts to rise again.
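A minimal early-stopping loop in the sense of this figure (my own toy sketch on synthetic linear-regression data): stop once the validation loss has not improved for "patience" consecutive epochs, and keep the best parameters seen.

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 20))
    w_true = rng.normal(size=20)
    y = X @ w_true + 0.5 * rng.normal(size=200)
    X_train, y_train = X[:150], y[:150]
    X_val, y_val = X[150:], y[150:]

    lr, patience = 0.1, 10
    w = np.zeros(20)
    best_val, best_w, bad_epochs = np.inf, w.copy(), 0

    for epoch in range(1000):
        grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
        w -= lr * grad                                   # one training step per epoch
        val_loss = np.mean((X_val @ w - y_val) ** 2)     # validation-set performance
        if val_loss < best_val:
            best_val, best_w, bad_epochs = val_loss, w.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                   # stop while validation is still good
                break

    w = best_w                                           # return to the best parameters seen
    print(epoch, best_val)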