L11+ Regularization
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2024
Contents
[Figure: a fitted curve y(x), and root-mean-square error (E_RMS) on the training and test sets versus model complexity M; training error keeps decreasing with complexity, while test error eventually rises.]
The Bias-Variance Decomposition
v Bias: how far is the true value from the mean of predictions by method 𝒜?
v Variance: how much does each prediction by 𝒜 vary around its mean?
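For concreteness, the standard decomposition of the expected squared error (a textbook identity; notation mine, with $f$ the true function, $\hat{f}(x; \boldsymbol{D})$ the predictor learned by $\mathcal{A}$ from training data $\boldsymbol{D}$, and $\sigma^2$ the noise variance):

$\mathbb{E}\left[ (y - \hat{f}(x; \boldsymbol{D}))^2 \right] = \underbrace{\left( f(x) - \mathbb{E}_{\boldsymbol{D}}[\hat{f}(x; \boldsymbol{D})] \right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\boldsymbol{D}}\left[ \left( \hat{f}(x; \boldsymbol{D}) - \mathbb{E}_{\boldsymbol{D}}[\hat{f}(x; \boldsymbol{D})] \right)^2 \right]}_{\text{Variance}} + \sigma^2$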
[Figure: prediction error versus model complexity, from low variance (simple models) to high variance (complex models); the expected prediction error on a test sample is U-shaped, with bias dominating for simple models and variance for complex ones, while training-sample error decreases monotonically.]
Regularization: introduction
¨ L1-norm: $\|\boldsymbol{w}\|_1 = \sum_{i=1}^{n} |w_i|$
¨ L2-norm: $\|\boldsymbol{w}\|_2^2 = \sum_{i=1}^{n} w_i^2$
¨ Lp-norm: $\|\boldsymbol{w}\|_p = \left( |w_1|^p + \cdots + |w_n|^p \right)^{1/p}$
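A minimal numpy sketch (mine, not from the slides) that evaluates these norms and checks the first two against numpy's built-in norm:

import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])

l1 = np.sum(np.abs(w))        # ||w||_1 = sum of |w_i|
l2 = np.sqrt(np.sum(w ** 2))  # ||w||_2 = square root of sum of w_i^2

def lp_norm(w, p):
    # Lp-norm: (|w_1|^p + ... + |w_n|^p)^(1/p)
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

assert np.isclose(l1, np.linalg.norm(w, 1))
assert np.isclose(l2, np.linalg.norm(w, 2))
print(l1, l2, lp_norm(w, 3))  # 5.5, ~3.64, ~3.27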
Regularization in Ridge regression
$p(w_i \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |w_i|}$ (a Laplace prior on each weight)
¨ The larger λ, the higher the probability that wi = 0.
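A small numpy check (mine) of this claim: under the Laplace prior, the probability mass in a small interval around 0 is $1 - e^{-\lambda \epsilon}$, which grows with λ:

import numpy as np

def laplace_prior(w, lam):
    # p(w | lambda) = (lambda / 2) * exp(-lambda * |w|)
    return 0.5 * lam * np.exp(-lam * np.abs(w))

eps = 0.1
for lam in (0.5, 2.0, 8.0):
    mass = 1.0 - np.exp(-lam * eps)  # P(|w| < eps) under the prior
    print(f"lambda={lam}: density at 0 = {laplace_prior(0.0, lam):.2f}, "
          f"P(|w| < {eps}) = {mass:.3f}")
# prints P(|w| < 0.1) of 0.049, 0.181, 0.551 as lambda grows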
Regularization in SVM
¨ Minimize (hard margin): $\frac{1}{2} \boldsymbol{w}^T \boldsymbol{w}$
¨ Minimize (soft margin): $\frac{1}{2} \boldsymbol{w}^T \boldsymbol{w} + C \sum_{i=1}^{N} \xi_i$
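A scikit-learn sketch (mine; LinearSVC minimizes an objective of this soft-margin form, with a squared hinge loss by default). Since C multiplies the slack term, a smaller C means stronger regularization and smaller weights:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a linearly separable toy problem

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    print(f"C={C}: ||w||_2 = {np.linalg.norm(clf.coef_):.3f}")  # grows with C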
$\boldsymbol{w}^* = \arg\max_{\boldsymbol{w} \in \boldsymbol{W}} \left[ \log \Pr(\boldsymbol{D} \mid \boldsymbol{w}) + \log \Pr(\boldsymbol{w}) \right] = \arg\max_{\boldsymbol{w} \in \boldsymbol{W}} \log \Pr(\boldsymbol{w} \mid \boldsymbol{D})$
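This MAP view links the penalties above to priors: a Gaussian prior on each weight yields an L2 penalty, a Laplace prior an L1 penalty (standard derivation, notation mine):

$\Pr(w_i) \propto e^{-\lambda w_i^2} \;\Rightarrow\; -\log \Pr(\boldsymbol{w}) = \lambda \|\boldsymbol{w}\|_2^2 + \text{const}$
$\Pr(w_i) = \frac{\lambda}{2} e^{-\lambda |w_i|} \;\Rightarrow\; -\log \Pr(\boldsymbol{w}) = \lambda \|\boldsymbol{w}\|_1 + \text{const}$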
[Figure: the Laplace density $f(x \mid \mu, b)$ and the Gaussian density $f(x \mid \mu, \sigma)$. Figure by Wikipedia.]
Regularization: limiting the search space
¡ Vector w* = (w0, s1, s2, s3, s4, s5, s6, Age, Sex, BMI, BP) changes when λ changes in Ridge regression.
¨ w* goes to 0 as λ increases.
[Figure: Ridge coefficient paths (beta, roughly −500 to 500) for the diabetes predictors AGE, SEX, BMI, BP, S1-S6 as λ varies; from Hesterberg et al., LARS and L1 penalized regression.]
¨ Why??
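A sketch (mine) that reproduces this kind of coefficient path with scikit-learn, whose diabetes dataset has exactly the predictors above (age, sex, bmi, bp, s1-s6):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

data = load_diabetes()
X, y = data.data, data.target

lambdas = np.logspace(-3, 4, 50)
# One fitted coefficient vector per value of the regularization constant
paths = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

for j, name in enumerate(data.feature_names):
    plt.plot(lambdas, paths[:, j], label=name)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("beta")
plt.legend(fontsize=7)
plt.show()  # every path shrinks toward 0 as lambda grows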
Bias-Variance tradeoff: revisit
¡ Classical view:
[Figure: the classical U-shaped curve of test-sample prediction error versus complexity of the model $\hat{f}(x; \boldsymbol{D})$, moving from low variance to high variance as the model gets more complex.]
¡ Advantages:
¨ Helps avoid overfitting.
¡ Limitations:
¨ It takes time to select a good regularization constant (one common remedy is cross-validation; see the sketch below).
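As a sketch (mine): choose the regularization constant by cross-validation, e.g. with scikit-learn's RidgeCV:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)
# 5-fold cross-validation over a log-spaced grid of candidate constants
model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
print("selected regularization constant:", model.alpha_)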
References
¡ Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
¡ Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International Conference on
Machine Learning (pp. 448-456).
¡ Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105.
¡ Hesterberg, T., Choi, N. H., Meier, L., & Fraley, C. (2008). Least angle and L1 penalized regression: A review. Statistics Surveys, 2, 61-93.
¡ Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
¡ Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
¡ Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding
deep learning (still) requires rethinking generalization. Communications of the
ACM, 64(3), 107-115.