L11+ Regularization

Machine Learning
(Học máy – IT3190E)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2024
Contents 2
¡ Introduction to Machine Learning

¡ Supervised learning
¡ Probabilistic modeling
¡ Regularization
¡ Reinforcement learning
¡ Practical advice
Revisiting overfitting 3
¡ The complexity of the learned function: 𝑦 = 𝑓# 𝒙; 𝑫

¨ ! the more
For a given training data D: the more complicated 𝑓,
possibility that 𝑓! fits D better.
Overfitting
¨ For a given D: there exist many functions that fit D perfectly
g set size,(i.e.,
varyno
H error on D).
complexity (e.g., degree of polynomials)
Bishop, Figure 1.5
¨ However, those functions might generalize badly.
1 y
Training
Test
ERMS
Error
0.5
0
0 3 6 9
M
Complexity
x
The Bias-Variance Decomposition 4
¡ Consider 𝑦 𝒙 = 𝑦 ∗ 𝒙 + 𝜖 as the (unknown) regression function

v 𝜖~𝑁𝑜𝑟𝑚𝑎𝑙 0, 𝜎 ! is a Gaussian noise with mean 0 and variance 𝜎 ! .
v 𝜖 may represent the noise due to measurement or data collection.
¡ Let 𝑓! 𝒙; 𝑫 be the regressor, learned by method 𝒜 from a training set D

¡ Note: We want that 𝑓, well approximates the truth y ∗ .
v 𝑓, 𝒙; 𝑫 is random, according to the randomness when collecting D.
"
¡ For any instance x, the error made by 𝑓! is 𝑦(𝒙) − 𝑓! 𝒙; 𝑫
¡ The error made by learning method 𝓐:

(Lỗi của thuật toán 𝒜 khi phán đoán x)
"
𝑒𝑟𝑟# 𝒙 = 𝔼𝑫,& 𝑦(𝒙) − 𝑓! 𝒙; 𝑫
v Why expectation? a different training set 𝑫’ will make 𝒜 to return a

different function 𝑓" 𝒙; 𝑫′
The Bias-Variance Decomposition (2) 5
𝑒𝑟𝑟# 𝒙 = 𝜎 " + 𝐵𝑖𝑎𝑠 " + 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒

"
v 𝐵𝑖𝑎𝑠 = 𝑦∗ 𝒙 − 𝔼𝑫𝑓! 𝒙; 𝑫 ; 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝔼𝑫 𝑓! 𝒙; 𝑫 − 𝔼𝑫' 𝑓! 𝒙; 𝑫′
¡ This is known as Bias-Variance Decomposition

(
v 𝜎 : cannot be avoided due to noises or uncontrolled factors
v Bias: how far is the true value from the mean of predictions by method
𝒜?
v Variance: how much does each prediction by 𝒜 vary around its
mean?
¡ To obtain a small prediction error:

v Small bias? Increase model complexity è Variance tends to increase
Trade-off
v Small variance? Decrease model complexity è Bias tends to increase
Bias-Variance tradeoff: classical view 6
¡ The more complex the model 𝑓! 𝒙; 𝑫 is, the more data

points it can capture, and the lower the bias can be. 7.3 The
Classical
v However, higher complexity will make the model "move" view
8 more to
2. Overview of capture
Supervisedthe data points, and hence its variance
Learning
will be larger.
k−NN − Regression
High Bias Low Bias
0.4
Low Variance High Variance
Prediction Error
0.3
Expected
prediction
error
0.2
Test Sample
Bias
0.1
Training Sample
Variance
0.0
50 40 30 20 10 0
Low High Number of Neighbors k
Model Complexity
Regularization: introduction 7
¡ Regularization is now a popular and useful technique in ML.
¡ It is a technique to exploit further information to

¨ Reduce overfitting in ML.
¨ Solve ill-posed problems in Maths.
¡ The further information is often enclosed in a penalty on the

complexity of 𝑓# 𝒙; 𝑫 .
¨ More penalty will be imposed on complex functions.
¨ We prefer simpler functions among all that fit well the training data.
Regularization: the principle 8
¡ We need to learn a function 𝑓(𝒙, 𝒘) from the training set D

¨ x is a data example and belongs to input space.
¨ w is the parameter and often belongs to a parameter space W.
¨ 𝑭 = {𝑓 𝒙, 𝒘 : 𝒘 ∈ 𝑾} is the function space, parameterized by w.
¡ For many ML models, the training problem is often reduced to an

optimization problem:
𝒘∗ = arg min 𝐿(𝑓 𝒙, 𝒘 , 𝑫) (1)
𝒘∈𝑾
¨ w sometimes tells the size/complexity of that function.
¨ 𝐿(𝑓 𝒙, 𝒘 , 𝑫) is an empirical loss/risk which depends on D. This loss

shows how well function f fits D.
¡ Another view:
𝑓 ∗ = arg min 𝐿(𝑓, 𝑫)
+∈𝑭
Regularization: the principle 9
¡ Adding a penalty to (1), we consider

𝒘∗ = arg min 𝐿(𝑓 𝒙, 𝒘 , 𝑫) + 𝜆𝑔(𝒘) (2)
𝒘∈𝑾
¨ Where 𝜆 > 0 is called the regularization/penalty constant.
¨ 𝑔(𝒘) measures the complexity of w: 𝑔(𝒘) ≥ 0
¡ 𝐿(𝑓, 𝑫) measures the goodness of function f on D.

¡ The penalty (regularization) term: 𝜆𝑔(𝑤)
¨ Allows to trade off the fitness on D and the generalization.
(cho phép đánh đổi lỗi trên tập học với khả năng tổng quát hoá)
¨ The greater λ, the heavier penalty, implying that 𝑔(𝒘) should be

smaller.
¨ In practice, λ should be neither too small nor too large.
(λ không nên quá lớn hoặc quá bé trong thực tế)
Regularization: popular types 10
¡ 𝑔(𝒘) often relates to some norms when w is an n-

dimensional vector.
¨ L0-norm: ||w||0 counts the number of non-zeros in w.
¨ L1-norm:
0
𝒘 - = G 𝑤.
./-
¨ L2-norm:
0
𝒘 "
" = G 𝑤."
./-
!
Lp-norm: 𝒘 = 𝑤- 1 + ⋯ + 𝑤0 1
¨ 1
Regularization in Ridge regression 11
¡ Ridge regression can be derived from OLS by adding a

penalty term into the objective function when learning.
¡ Learning a regressor in Ridge is reduced to
(
𝒘∗ = arg min 𝑅𝑆𝑆 𝒘, 𝑫 + 𝜆 𝒘 (
𝐰
¨ Where λ is a positive constant.

"
¨ The term 𝜆 𝒘 " plays the role as regularization.
¨ Large λ reduces the size of w.

Regularization in Lasso 12
¡ Lasso [Tibshirani, 1996] is a variant of OLS for linear regression by

using L1 to do regularization.
¡ Learning a linear regressor is reduced to
𝒘∗ = arg min 𝑅𝑆𝑆 𝒘, 𝑫 + 𝜆 𝒘 -

𝐰
¨ Where λ is a positive constant.
¨ 𝜆 𝒘 ! is the regularization term. Large λ reduces the size of w.
¡ Regularization here amounts to imposing a Laplace distribution

(as prior) over each wi, with density function:
𝜆 ,-|/ |
𝑝 𝑤+ 𝜆) = 𝑒 #
2
¨ The larger λ, the more possibility that wi = 0.
Regularization in SVM 13
¡ Learning a classifier in SVM is reduced to the following problem:
¨ Minimize "
#
𝒘3 𝒘
¨ Conditioned on: 𝑦" 𝒘# 𝒙" + 𝑏 ≥ 1, ∀𝑖 ∈ {1, … , 𝑟}
¡ In the cases of noises/errors, learning is reduced to
¨ Minimize
4
- 3
"𝒘 𝒘 + 𝐶 G 𝜉.
./-
𝑦" 𝒘# 𝒙" + 𝑏 ≥ 1 − 𝜉" ,

¨ Conditioned on A
𝜉" ≥ 0, ∀𝑖 ∈ {1, … , 𝑟}
¡ 𝜉1 + ⋯ + 𝜉𝑟 measures the training error,

𝒘3 𝒘 is the regularization term.
"
#
Some other regularization methods 14
¡ Dropout: (Hilton and his colleagues, 2012)

¨ At each iteration of the training process, randomly drop out some
parts and just update the other parts of our model.
¡ Batch normalization [Ioffe & Szegedy, 2015]
¨ Normalize the inputs at each neuron of a neural network
¨ Reduce input variance, easier training, faster convergence

¡ Data augmentation
¨ Produce different versions of an example in the training set, by
adding simple noises, translation, rotation, cropping, …
¨ Those versions are added to the training data set
¡ Early stopping
¨ Stop training early to avoid overtraining & reduce overfitting
Regularization: MAP role 15
¡ Under some conditions, we can view regularization as
𝒘∗ = arg min 𝐿 𝑓 𝒙, 𝒘 , 𝑫 + 𝜆𝑔(𝒘)

/∈𝑾
Likelihood Prior
¨ Where D is a sample from a probability distribution whose log

likelihood is −𝐿 𝑓 𝒙, 𝒘 , 𝑫 .
¨ w is a random variable and follows the prior with density

𝑝(𝒘) ∝ exp(−𝜆𝑔 𝒘 )
¡ Then 𝒘∗ = arg max {−𝐿 𝑓 𝒙, 𝒘 , 𝑫 − 𝜆𝑔 𝒘 }

𝒘∈𝑾
𝒘∗ = arg max log Pr(𝑫|𝒘) + log Pr(𝒘) = arg max log Pr(𝒘|𝑫)
𝒘∈𝑾 𝒘∈𝑾
¡ As a result, regularization in fact helps us to learn an MAP

solution w*.
Regularization: MAP in Ridge 16
¡ Consider the regression model: 𝒙 ∈ ℝ3

𝑦 thus follows
vPick 𝒘 ~ 𝑁𝑜𝑟𝑚𝑎𝑙 0, 𝜎2𝑰 , where I is the identity matrix a Normal
distribution
vGenerate sample 𝒙 ~ 𝑁𝑜𝑟𝑚𝑎𝑙(0, "! 𝑰) with mean 0
and variance
vLet 𝑦 = 𝒘𝑇𝒙 1
¡ Then the MAP estimation from the training data 𝑫 is

𝒘∗ = arg max log Pr(𝒘|𝑫) = arg max log[Pr(𝑫|𝒘) Pr 𝒘 ]
𝒘 𝒘
= arg max W log Pr(𝒙, 𝑦|𝒘) + log Pr(𝒘)

𝒘 Ridge regression
(𝒙,))∈𝑫
@@
1 1 #
= arg min W 𝑦 − 𝒘# 𝒙 - + 2
𝒘 𝒘 + 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡
𝒘 2 2𝜎
(𝒙,))∈𝑫
¡ Regularization using 𝐿2 with penalty constant 𝜆 = 𝜎 ,( .

Regularization: MAP in Ridge & Lasso 17
¡ The regularization constant in Ridge: 𝜆 = 𝜎 ,(

¡ The regularization constant in Lasso: 𝜆 = 𝑏 ,5
¡ Gaussian (left) and Laplace distribution (right)
1 |𝑥 − 𝜇|
1 𝑥−𝜇 $ 𝑓 𝑥 𝜇, 𝑏 = exp −
𝑓 𝑥 𝜇, 𝜎 = exp −
2𝑏 𝑏
𝜎 2𝜋 2𝜎 $
𝑓 𝑥 𝜇, 𝑏
𝑓 𝑥 𝜇, 𝜎
x x
(Figure by Wikipedia)
18
Regularization: limiting the search space
¡ The regularization constant in Ridge: 𝜆 = 𝜎 ,(

¡ The regularization constant in Lasso: 𝜆 = 𝑏 ,5
¡ The larger λ, the higher probability that x occurs around 0.
1 |𝑥 − 𝜇|
1 𝑥−𝜇 $ 𝑓 𝑥 𝜇, 𝑏 = exp −
𝑓 𝑥 𝜇, 𝜎 = exp − 2𝑏 𝑏
𝜎 2𝜋 2𝜎 $
𝑓 𝑥 𝜇, 𝑏
𝑓 𝑥 𝜇, 𝜎
x x
19
Regularization: limiting the search space
¡ The regularized problem:

𝑤 ∗ = arg min 𝐿(𝑓 𝑥, 𝑤 , 𝑫) + 𝜆𝑔(𝑤) (2)
/∈𝑾
¡ A result from the optimization literature shows that (2) is

equivalent to the following:
𝑤 ∗ = arg min 𝐿(𝑓 𝑥, 𝑤 , 𝑫) such that 𝑔 𝑤 ≤ 𝑠 (3)
/∈𝑾
¨ For some constant s.
¡ Note that the constraint of g(w) ≤ s plays the role as limiting

the search space of w.
Regularization: effects of λ 20
¡ Vector w* = (w0, s1, s2, s3, s4, s5, s6, Age, Sex, BMI, BP)
changes when λ changes in Ridge regression.
Hesterberg et al./LARS and !1 penalized regression 67
¨ w* goes to 0 as λ increases.
S5
500
BMI
S2
BP
S4
S3
S6
beta
AGE
SEX
−500
S1
0.0 0.1 1.0 10.0

Regularization: practical effectiveness 21
¡ Ridge regression was under investigation on a prostate

dataset with 67 observations.
¨ Performance was measured by RMSE (root mean square errors)
and Correlation coefficient.
λ 0.1 1 10 100 1000 10000

RMSE 0.74 0.74 0.74 0.84 1.08 1.16
Correlation 0.77 0.77 0.78 0.76 0.74 0.73
coeficient
¨ Too high or too low values of λ often result in bad predictions.
¨ Why??
Bias-Variance tradeoff: revisit 38 2. Overview of Supervised Learning
22
High Bias Low Bias
¡ Classical view:
Low Variance High Variance
Prediction Error
More complex model 𝑓# 𝑥; 𝑫
Test Sample
v Lower bias, higher variance
¡ Modern phenomenon: Training Sample
Very rich models such as neural networks

Low High
v Model Complexity
are trained to exactly fit the data, but
FIGURE 2.11. Test and training error as a function of model complexi
often obtain high accuracy on test data

[Belkin et al., 2019; Zhang et al., 2021] be close to f (x0 ). As k grows, the neighbors are further away, and
anything can happen.
The variance term is simply the variance of an average here, and
v 𝐵𝑖𝑎𝑠 ≅ 0 B creases as the inverse of k. So as k varies, there is a bias–variance trad
More generally, as the model complexity of our procedure is increased
variance tends to increase and the squared bias tends to decrease. The
v GPT-4, ResNets, posite behavior occurs as the model complexity is decreased. For k-nea
Risk (Error)
StyleGAN, neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade
DALLE-3, … off with variance in such a way as to minimize !the test error. An obv
estimate of test error is the training error N1 i (yi − ŷi )2 . Unfortuna
training error is not a good estimate of test error, as it does not prop
¡ Why??? account for model complexity.
Figure 2.11 shows the typical behavior of the test and training erro
model complexity is varied. The training error tends to decrease when
Model complexity
we increase the model complexity, that is, whenever we fit the data ha
However with too much fitting, the model adapts itself too closely to
Regularization: summary 23
¡ Advantages:
¨ Avoid overfitting.
¨ Limit the search space of the function to be learned.
¨ Reduce bad effects from noises or errors in observations.
¨ Might model data better. As an example, L1 often work well

with data/model which are inherently sparse.
¡ Limitations:
¨ Consume time to select a good regularization constant.
¨ Might pose some difficulties to design an efficient algorithm.

References 24
¡ Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-
learning practice and the classical bias–variance trade-off. Proceedings of the
National Academy of Sciences, 116(32), 15849-15854.
¡ Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International Conference on
Machine Learning (pp. 448-456).
¡ Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with
deep convolutional neural networks. Advances in Neural Information Processing
Systems, 25, 1097-1105.
¡ Hesterberg, T., Choi, N. H., Meier, L., & Fraley, C. (2008). Least angle and L1
penalized regression: A review. Statistics Surveys.
¡ Tibshirani, R (1996). Regression shrinkage and selection via the Lasso. Journal of the
Royal Statistical Society, vol. 58(1), pp. 267-288.
¡ Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical
Learning. Springer, 2009.
¡ Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding
deep learning (still) requires rethinking generalization. Communications of the
ACM, 64(3), 107-115.

L11+ Regularization

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

L11+ Regularization

Uploaded by

Copyright:

Available Formats

Machine Learning

(Học máy – IT3190E)

¡ Introduction to Machine Learning

¡ The complexity of the learned function: 𝑦 = 𝑓# 𝒙; 𝑫

¡ Consider 𝑦 𝒙 = 𝑦 ∗ 𝒙 + 𝜖 as the (unknown) regression function

¡ Let 𝑓! 𝒙; 𝑫 be the regressor, learned by method 𝒜 from a training set D

¡ The error made by learning method 𝓐:

v Why expectation? a different training set 𝑫’ will make 𝒜 to return a

𝑒𝑟𝑟# 𝒙 = 𝜎 " + 𝐵𝑖𝑎𝑠 " + 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒

¡ This is known as Bias-Variance Decomposition

¡ To obtain a small prediction error:

¡ The more complex the model 𝑓! 𝒙; 𝑫 is, the more data

Low High Number of Neighbors k

¡ Regularization is now a popular and useful technique in ML.

¡ It is a technique to exploit further information to

¡ The further information is often enclosed in a penalty on the

¡ We need to learn a function 𝑓(𝒙, 𝒘) from the training set D

¡ For many ML models, the training problem is often reduced to an

¨ w sometimes tells the size/complexity of that function.

¨ 𝐿(𝑓 𝒙, 𝒘 , 𝑫) is an empirical loss/risk which depends on D. This loss

¡ Adding a penalty to (1), we consider

¨ Where 𝜆 > 0 is called the regularization/penalty constant.

¨ 𝑔(𝒘) measures the complexity of w: 𝑔(𝒘) ≥ 0

¡ 𝐿(𝑓, 𝑫) measures the goodness of function f on D.

¨ The greater λ, the heavier penalty, implying that 𝑔(𝒘) should be

¡ 𝑔(𝒘) often relates to some norms when w is an n-

¡ Ridge regression can be derived from OLS by adding a

¨ Where λ is a positive constant.

¨ Large λ reduces the size of w.

¡ Lasso [Tibshirani, 1996] is a variant of OLS for linear regression by

𝒘∗ = arg min 𝑅𝑆𝑆 𝒘, 𝑫 + 𝜆 𝒘 -

¨ Where λ is a positive constant.

¨ 𝜆 𝒘 ! is the regularization term. Large λ reduces the size of w.

¡ Regularization here amounts to imposing a Laplace distribution

¡ Learning a classifier in SVM is reduced to the following problem:

¨ Conditioned on: 𝑦" 𝒘# 𝒙" + 𝑏 ≥ 1, ∀𝑖 ∈ {1, … , 𝑟}

¡ In the cases of noises/errors, learning is reduced to

𝑦" 𝒘# 𝒙" + 𝑏 ≥ 1 − 𝜉" ,

¡ 𝜉1 + ⋯ + 𝜉𝑟 measures the training error,

¡ Dropout: (Hilton and his colleagues, 2012)

¨ Reduce input variance, easier training, faster convergence

¡ Under some conditions, we can view regularization as

𝒘∗ = arg min 𝐿 𝑓 𝒙, 𝒘 , 𝑫 + 𝜆𝑔(𝒘)

¨ Where D is a sample from a probability distribution whose log

¨ w is a random variable and follows the prior with density

¡ Then 𝒘∗ = arg max {−𝐿 𝑓 𝒙, 𝒘 , 𝑫 − 𝜆𝑔 𝒘 }

¡ As a result, regularization in fact helps us to learn an MAP

¡ Consider the regression model: 𝒙 ∈ ℝ3

¡ Then the MAP estimation from the training data 𝑫 is

= arg max W log Pr(𝒙, 𝑦|𝒘) + log Pr(𝒘)

¡ Regularization using 𝐿2 with penalty constant 𝜆 = 𝜎 ,( .

¡ The regularization constant in Ridge: 𝜆 = 𝜎 ,(

¡ The regularization constant in Ridge: 𝜆 = 𝜎 ,(

¡ The regularized problem:

¡ A result from the optimization literature shows that (2) is

¨ For some constant s.

¡ Note that the constraint of g(w) ≤ s plays the role as limiting

0.0 0.1 1.0 10.0

¡ Ridge regression was under investigation on a prostate

λ 0.1 1 10 100 1000 10000

¨ Too high or too low values of λ often result in bad predictions.

High Bias Low Bias

v Lower bias, higher variance

¡ Modern phenomenon: Training Sample