Regularizations

Shusen Wang
The ℓ2-Norm Regularization
Linear Regression

Task
• Input: feature matrix 𝐗 ∈ ℝ^{n×d} and labels 𝐲 ∈ ℝ^n.
• Output: vector 𝐰 ∈ ℝ^d such that 𝐗𝐰 ≈ 𝐲.

Methods
• Least squares regression: min_𝐰 (1/n) ‖𝐗𝐰 − 𝐲‖₂².
• Ridge regression: min_𝐰 (1/n) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₂².
  (The first term is the loss function; the second term is the regularization.)
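A minimal NumPy sketch of the two objectives above, evaluated on synthetic data (the data and function names are my own, for illustration only):

import numpy as np

def least_squares_objective(X, y, w):
    # (1/n) * ||Xw - y||_2^2
    n = X.shape[0]
    r = X @ w - y
    return (r @ r) / n

def ridge_objective(X, y, w, gamma):
    # (1/n) * ||Xw - y||_2^2 + gamma * ||w||_2^2
    return least_squares_objective(X, y, w) + gamma * (w @ w)

# toy example: n = 100 samples, d = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = rng.normal(size=5)
print(least_squares_objective(X, y, w), ridge_objective(X, y, w, gamma=0.1))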


Ridge Regression: Algorithms

Algorithms
• Analytical solution: 𝐰⋆ = (𝐗ᵀ𝐗 + nγ𝐈_d)⁻¹ 𝐗ᵀ𝐲.
• Time complexity: O(nd² + d³).
• Derivation:
  • The objective function is Q(𝐰) = (1/n) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₂².
  • The gradient is ∇Q(𝐰) = (2/n) 𝐗ᵀ(𝐗𝐰 − 𝐲) + 2γ𝐰.
  • Setting ∇Q(𝐰⋆) = 0 leads to (2/n)(𝐗ᵀ𝐗 + nγ𝐈_d) 𝐰⋆ = (2/n) 𝐗ᵀ𝐲.
• Time complexity:
  • O(nd²) time for the multiplication 𝐗ᵀ𝐗.
  • O(d³) time for the inversion of the d×d matrix 𝐗ᵀ𝐗 + nγ𝐈_d.
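A minimal NumPy sketch of the analytical solution above (synthetic data; it calls np.linalg.solve on the normal equations rather than forming the inverse explicitly, which is the numerically preferable route):

import numpy as np

def ridge_closed_form(X, y, gamma):
    """Solve (X^T X + n*gamma*I) w = X^T y for the ridge minimizer w*."""
    n, d = X.shape
    A = X.T @ X + n * gamma * np.eye(d)   # O(n d^2) for X^T X
    b = X.T @ y
    return np.linalg.solve(A, b)          # O(d^3) for the d-by-d solve

# toy check
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w_star = ridge_closed_form(X, y, gamma=0.1)
print(w_star.shape)   # (10,)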
Ridge Regression: Algorithms

Algorithms
• Conjugate gradient (CG):
  • O(√κ · log(n/ε)) iterations to reach ε precision.
  • Hessian matrix: ∇²Q(𝐰) = (2/n)(𝐗ᵀ𝐗 + nγ𝐈_d).
  • κ = λ_max(𝐗ᵀ𝐗 + nγ𝐈) / λ_min(𝐗ᵀ𝐗 + nγ𝐈) is the condition number of the Hessian.
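A sketch of solving the same linear system with SciPy's conjugate gradient solver, wrapped in a LinearOperator so that 𝐗ᵀ𝐗 is never formed explicitly (the synthetic data and helper names are my own assumptions):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(X, y, gamma):
    """Solve (X^T X + n*gamma*I) w = X^T y by conjugate gradient."""
    n, d = X.shape
    # matrix-vector product with the positive-definite system matrix
    # (equal to the Hessian up to the constant factor 2/n, which cancels)
    matvec = lambda v: X.T @ (X @ v) + n * gamma * v
    A = LinearOperator((d, d), matvec=matvec)
    w, info = cg(A, X.T @ y)
    assert info == 0, "CG did not converge"
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=500)
w_cg = ridge_cg(X, y, gamma=0.1)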
Usefulness of Regularization

Question: Why do we use the ℓ2-norm regularization?

• Reason 1: easier to optimize.
  • Conjugate gradient (CG) requires O(√κ · log(n/ε)) iterations to reach ε precision.
  • Least squares: κ = λ_max(𝐗ᵀ𝐗) / λ_min(𝐗ᵀ𝐗).
  • Ridge regression: κ = λ_max(𝐗ᵀ𝐗 + nγ𝐈) / λ_min(𝐗ᵀ𝐗 + nγ𝐈). (γ ↑, κ ↓.)
  • CG converges faster as γ increases.
• Reason 2: better generalization.
  • Least squares has a lower training error (due to optimality).
  • Ridge regression makes better predictions on the test set (explained by the bias-variance decomposition).

[Figure: training and test MSE curves for least squares (LS) and ridge regression: Train MSE (LS), Train MSE (Ridge), Test MSE (Ridge), Test MSE (LS).]
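A small NumPy experiment, with synthetic data and parameter choices of my own, that illustrates both reasons: the condition number of 𝐗ᵀ𝐗 + nγ𝐈 shrinks as γ grows, and for a suitable γ the test MSE is often lower than that of least squares even though the training MSE is higher:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 40                              # few samples relative to features
X = rng.normal(size=(n, d))
w_true = np.concatenate([rng.normal(size=5), np.zeros(d - 5)])
y = X @ w_true + 0.5 * rng.normal(size=n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true + 0.5 * rng.normal(size=1000)

def fit_ridge(X, y, gamma):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

for gamma in [0.0, 0.01, 0.1, 1.0]:        # gamma = 0 is plain least squares
    H = X.T @ X + n * gamma * np.eye(d)
    kappa = np.linalg.cond(H)              # condition number (same as the Hessian's)
    w = fit_ridge(X, y, gamma)
    print(f"gamma={gamma:<5} kappa={kappa:12.1f} "
          f"train MSE={mse(X, y, w):.3f}  test MSE={mse(X_test, y_test, w):.3f}")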
The ℓ1-Norm Regularization
Motivations

𝐱 ∈ ℝ^d  ⟶  prediction  ⟶  y ∈ ℝ

• Fact 1: y can be independent of some of the d features.
• Fact 2: if d ≫ n, linear models are likely to overfit.

Example: using genomic data to predict disease.
• d is huge: humans have about 20K protein-coding genes.
• n is small: tens or hundreds of human participants in an experiment.
• Most genes are irrelevant to a specific disease.
• Goal 1: Select the features relevant to y.
• Goal 2: Prevent overfitting for large-d, small-n problems.
The ℓ1-Norm Constraint

• LASSO: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂²   s.t. ‖𝐰‖₁ ≤ t.
• The feasible set {𝐰 : ‖𝐰‖₁ ≤ t} is convex.
[Figure: the ℓ1 ball {𝐰 : ‖𝐰‖₁ ≤ t} in two dimensions, a diamond with vertices at ±t on the w₁ and w₂ axes.]
• It is a convex optimization model.
• The optimal solution 𝐰⋆ is sparse (i.e., most entries are zeros).
• Smaller t ⇒ sparser 𝐰⋆.
• Sparsity ⇒ feature selection. Why?
  • Let 𝐱′ be a test feature vector.
  • The prediction is 𝐱′ᵀ𝐰⋆ = w₁⋆x₁′ + w₂⋆x₂′ + ⋯ + w_d⋆x_d′.
  • If w₁⋆ = 0, then the prediction is independent of x₁′.
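A tiny worked sketch of this point (the sparse vector and data below are illustrative): perturbing only the features whose coefficients are zero leaves the prediction unchanged.

import numpy as np

rng = np.random.default_rng(0)
d = 6
w_star = np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.7])   # sparse solution: 3 nonzero entries
x_test = rng.normal(size=d)

# perturb only the features whose coefficients are zero
x_perturbed = x_test.copy()
zero_mask = (w_star == 0.0)
x_perturbed[zero_mask] += rng.normal(size=np.sum(zero_mask))

print(x_test @ w_star, x_perturbed @ w_star)   # identical predictions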
The ℓ1-Norm Regularization

• LASSO: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂²   s.t. ‖𝐰‖₁ ≤ t.
• Another form: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₁.
  (The first term is the loss function; the second term is the regularization.)
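A minimal sketch of solving the penalized form above with proximal gradient descent (ISTA); the soft-thresholding step is the proximal operator of γ‖𝐰‖₁, and the synthetic data, step size, and iteration count are my own assumptions:

import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: shrink each entry toward zero by tau
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, gamma, num_iters=500):
    """Minimize (1/(2n)) * ||Xw - y||_2^2 + gamma * ||w||_1 by ISTA."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                               # step size
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y) / n            # gradient of (1/(2n)) * ||Xw - y||^2
        w = soft_threshold(w - eta * grad, eta * gamma)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(47)])
y = X @ w_true + 0.1 * rng.normal(size=100)

for gamma in [0.01, 0.1, 0.5]:
    w = lasso_ista(X, y, gamma)
    print(f"gamma={gamma}: {np.sum(np.abs(w) > 1e-8)} nonzero entries out of 50")

Larger γ plays the same role as a smaller constraint level t: the solution becomes sparser.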


Summary
Regularized ERM

• Regularized empirical risk minimization:
  min_{𝐰∈ℝ^d} (1/n) Σ_{i=1}^n L(𝐰; 𝐱ᵢ, yᵢ) + R(𝐰).

Loss Function
• Linear regression: L(𝐰; 𝐱ᵢ, yᵢ) = (1/2)(𝐰ᵀ𝐱ᵢ − yᵢ)²
• Logistic regression: L(𝐰; 𝐱ᵢ, yᵢ) = log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
• SVM: L(𝐰; 𝐱ᵢ, yᵢ) = max(0, 1 − yᵢ 𝐰ᵀ𝐱ᵢ)
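A small sketch of these per-example losses in NumPy (function names are mine; labels yᵢ ∈ {−1, +1} are assumed for the logistic and SVM losses):

import numpy as np

def squared_loss(w, x, y):
    # linear regression: 0.5 * (w^T x - y)^2
    return 0.5 * (w @ x - y) ** 2

def logistic_loss(w, x, y):
    # logistic regression with labels y in {-1, +1}
    return np.log1p(np.exp(-y * (w @ x)))

def hinge_loss(w, x, y):
    # SVM hinge loss with labels y in {-1, +1}
    return max(0.0, 1.0 - y * (w @ x))

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(squared_loss(w, x, 1.0), logistic_loss(w, x, +1), hinge_loss(w, x, +1))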
Regularization
• ℓ1-norm: R(𝐰) = γ ‖𝐰‖₁
• ℓ2-norm: R(𝐰) = γ ‖𝐰‖₂²
• Elastic net: R(𝐰) = γ₁ ‖𝐰‖₁ + γ₂ ‖𝐰‖₂²
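And the corresponding regularizers as a matching sketch (the γ values are placeholders):

import numpy as np

def l1_penalty(w, gamma):
    # gamma * ||w||_1
    return gamma * np.sum(np.abs(w))

def l2_penalty(w, gamma):
    # gamma * ||w||_2^2
    return gamma * np.dot(w, w)

def elastic_net_penalty(w, gamma1, gamma2):
    # gamma1 * ||w||_1 + gamma2 * ||w||_2^2
    return l1_penalty(w, gamma1) + l2_penalty(w, gamma2)

w = np.array([0.5, 0.0, -2.0])
print(l1_penalty(w, 0.1), l2_penalty(w, 0.1), elastic_net_penalty(w, 0.1, 0.05))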
