L11+ Regularization
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2023
Contents
¡ Probabilistic modeling
¡ Regularization
¡ Reinforcement learning
¡ Practical advice
Revisiting overfitting
¨ However, those functions might generalize badly (Bishop, Figure 1.5).
[Figure (Bishop, Fig. 1.5): root-mean-square error $E_{RMS}$ of $f(x)$ on the training and test sets versus model complexity $M$ (from 0 to 9).]
The Bias-Variance Decomposition
$$\mathrm{Error}(x) = \sigma^2 + \mathrm{Bias}\big(\hat{f}\big)^2 + \mathrm{Var}\big(\hat{f}\big) = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
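A one-line sketch of where this comes from (not on the slide), assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, with expectations taken over training sets $\boldsymbol{D}$; the cross terms vanish because $\varepsilon$ is independent of $\boldsymbol{D}$:

$$\mathbb{E}\big[(y - \hat{f}(x; \boldsymbol{D}))^2\big] = \sigma^2 + \big(f(x) - \mathbb{E}[\hat{f}(x; \boldsymbol{D})]\big)^2 + \mathbb{E}\big[(\hat{f}(x; \boldsymbol{D}) - \mathbb{E}[\hat{f}(x; \boldsymbol{D})])^2\big]$$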
¡ The more complex the model $\hat{f}(x; \boldsymbol{D})$ is, the more data points it can capture, and the lower its bias can be.
v However, higher complexity makes the model "move" more to capture the data points, and hence its variance will be larger (a simulation sketch follows the figure below).
[Figures (Hastie et al., The Elements of Statistical Learning): prediction error on the test and training samples versus model complexity (Fig. 2.11), and expected prediction error, squared bias, and variance for k-NN regression and classification as the number of neighbors k runs from 50 down to 0 (Fig. 7.3); high bias / low variance on the left, low bias / high variance on the right.]
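As referenced above, a minimal simulation sketch of this tradeoff (the target function and all constants are illustrative assumptions, not from the slides): polynomials of increasing degree are refit on many freshly drawn training sets, and the bias² and variance of the prediction at a single test point are estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Illustrative target function (an assumption, not from the slides).
    return np.sin(2 * np.pi * x)

x_test, sigma, n_train, n_trials = 0.3, 0.3, 30, 500

for degree in [1, 3, 9]:
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a fresh training set D each trial.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
        preds[t] = np.polyval(coefs, x_test)      # prediction at the test point
    bias2 = (preds.mean() - true_f(x_test)) ** 2  # squared bias over datasets
    var = preds.var()                             # variance over datasets
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

With these settings, bias² should fall and variance should rise as the degree grows, matching the figure.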
Regularization: introduction
¨ L1-norm: $\|w\|_1 = \sum_{i=1}^{n} |w_i|$
¨ L2-norm: $\|w\|_2 = \sqrt{\sum_{i=1}^{n} w_i^2}$
¨ Lp-norm: $\|w\|_p = \big(|w_1|^p + \dots + |w_n|^p\big)^{1/p}$
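A quick numeric check of these definitions (a sketch using NumPy; the vector and p are arbitrary, and `np.linalg.norm` computes the same quantities):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])
p = 3  # any p >= 1

l1 = np.abs(w).sum()                      # ||w||_1 = sum_i |w_i|
l2 = np.sqrt((w ** 2).sum())              # ||w||_2 = sqrt(sum_i w_i^2)
lp = (np.abs(w) ** p).sum() ** (1 / p)    # ||w||_p = (sum_i |w_i|^p)^(1/p)

# Cross-check against NumPy's built-in norms.
assert np.isclose(l1, np.linalg.norm(w, 1))
assert np.isclose(l2, np.linalg.norm(w, 2))
assert np.isclose(lp, np.linalg.norm(w, p))
print(l1, l2, lp)  # 8.0  5.0990...  4.5143...
```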
Regularization in Ridge regression
$$p(w_i \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |w_i|}$$
¨ The larger λ, the more likely it is that $w_i = 0$.
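A quick numeric check of that claim (a sketch, not from the slides): integrating the density above over a small interval around 0 shows the prior mass there growing with λ.

```python
import numpy as np

def laplace_pdf(w, lam):
    # Density from the slide: p(w | lambda) = (lambda/2) * exp(-lambda*|w|)
    return 0.5 * lam * np.exp(-lam * np.abs(w))

w = np.linspace(-0.1, 0.1, 10001)   # a small interval around w = 0
dw = w[1] - w[0]
for lam in [0.5, 2.0, 10.0]:
    mass = laplace_pdf(w, lam).sum() * dw   # ~ Pr(|w_i| <= 0.1)
    print(f"lambda = {lam:5}: Pr(|w_i| <= 0.1) ~ {mass:.3f}")
```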
Regularization in SVM
$$w^* = \arg\max_{w \in \boldsymbol{W}} \big[\log \Pr(\boldsymbol{D} \mid w) + \log \Pr(w)\big] = \arg\max_{w \in \boldsymbol{W}} \log \Pr(w \mid \boldsymbol{D})$$
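To connect this MAP view to the penalty form of regularization, a sketch under the added assumption of a Gaussian prior $\Pr(w) \propto e^{-\frac{\lambda}{2}\|w\|_2^2}$ (not stated on the slide):

$$w^* = \arg\max_{w \in \boldsymbol{W}} \big[\log \Pr(\boldsymbol{D} \mid w) + \log \Pr(w)\big] = \arg\min_{w \in \boldsymbol{W}} \Big[-\log \Pr(\boldsymbol{D} \mid w) + \frac{\lambda}{2}\|w\|_2^2\Big]$$

So the log-prior acts as an L2 penalty; a Laplace prior would yield an L1 penalty instead.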
¡ The vector $w^* = (w_0, s_1, s_2, s_3, s_4, s_5, s_6, \text{Age}, \text{Sex}, \text{BMI}, \text{BP})$ changes as λ changes in Ridge regression.
¨ $w^*$ goes to 0 as λ increases.
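A minimal sketch of this effect (assuming scikit-learn's diabetes dataset, which carries exactly the features listed above): the L2 norm of the fitted Ridge coefficient vector shrinks as λ (called `alpha` in scikit-learn) grows.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# scikit-learn's diabetes data has the features named on the slide:
# age, sex, bmi, bp, s1, ..., s6.
X, y = load_diabetes(return_X_y=True)

# alpha plays the role of lambda here.
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"lambda = {alpha:>8}: ||w*||_2 = {np.linalg.norm(w):8.1f}")
```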
Regularization: practical effectiveness
¨ Why??
Bias-Variance tradeoff: revisited
¡ Modern phenomenon:
v Very rich models such as neural networks (DALLE-3, …) are trained to exactly fit the data, but can still generalize well.
v Lower bias, higher variance.
¡ Why???
[Figure (Hastie et al., Fig. 2.11): prediction error on the test sample and the training sample versus model complexity, from low to high.]
Excerpt (Hastie et al., The Elements of Statistical Learning): "... variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k. Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity. Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder."
Regularization: summary
¡ Advantages:
¨ Helps avoid overfitting.
¡ Limitations:
¨ Selecting a good regularization constant takes time (a cross-validation sketch follows below).
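The standard remedy for that limitation is to choose the constant by cross-validation; a minimal sketch with scikit-learn's `RidgeCV` (the alpha grid and the diabetes dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)

# Try a log-spaced grid of regularization constants and keep the one
# with the best 5-fold cross-validated score.
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected lambda:", model.alpha_)
```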
References
¡ Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
¡ Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).
¡ Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105.
¡ Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267-288.
¡ Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
¡ Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107-115.