Regularizations

Shusen Wang
The ℓ2-Norm Regularization
Linear Regression

Task
• Input: feature matrix 𝐗 ∈ ℝ^{n×d} and labels 𝐲 ∈ ℝ^n.
• Output: vector 𝐰 ∈ ℝ^d such that 𝐗𝐰 ≈ 𝐲.

Methods
• Least squares regression: min_𝐰 (1/n) ‖𝐗𝐰 − 𝐲‖₂².
• Ridge regression: min_𝐰 (1/n) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₂².
  (The first term is the loss function; the second term is the regularization.)
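A minimal NumPy sketch of the two objectives above, evaluated on synthetic data (the data and function names are my own, for illustration only):

import numpy as np

def least_squares_objective(X, y, w):
    # (1/n) * ||Xw - y||_2^2
    n = X.shape[0]
    r = X @ w - y
    return (r @ r) / n

def ridge_objective(X, y, w, gamma):
    # (1/n) * ||Xw - y||_2^2 + gamma * ||w||_2^2
    return least_squares_objective(X, y, w) + gamma * (w @ w)

# toy example: n = 100 samples, d = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = rng.normal(size=5)
print(least_squares_objective(X, y, w), ridge_objective(X, y, w, gamma=0.1))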


Ridge Regression: Algorithms

Algorithms
• Analytical solution: 𝐰⋆ = (𝐗ᵀ𝐗 + nγ𝐈_d)⁻¹ 𝐗ᵀ𝐲.
• Time complexity: O(nd² + d³).
• Derivation:
  • The objective function is Q(𝐰) = (1/n) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₂².
  • The gradient is ∇Q(𝐰) = (2/n) 𝐗ᵀ(𝐗𝐰 − 𝐲) + 2γ𝐰.
  • Setting ∇Q(𝐰⋆) = 0 leads to (2/n)(𝐗ᵀ𝐗 + nγ𝐈_d) 𝐰⋆ = (2/n) 𝐗ᵀ𝐲.
• Time complexity:
  • O(nd²) time for the multiplication 𝐗ᵀ𝐗.
  • O(d³) time for the inversion of the d×d matrix 𝐗ᵀ𝐗 + nγ𝐈_d.
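A minimal NumPy sketch of the analytical solution above (synthetic data; it calls np.linalg.solve on the normal equations rather than forming the inverse explicitly, which is the numerically preferable route):

import numpy as np

def ridge_closed_form(X, y, gamma):
    """Solve (X^T X + n*gamma*I) w = X^T y for the ridge minimizer w*."""
    n, d = X.shape
    A = X.T @ X + n * gamma * np.eye(d)   # O(n d^2) for X^T X
    b = X.T @ y
    return np.linalg.solve(A, b)          # O(d^3) for the d-by-d solve

# toy check
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w_star = ridge_closed_form(X, y, gamma=0.1)
print(w_star.shape)   # (10,)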
Ridge Regression: Algorithms

Algorithms
• Conjugate gradient (CG):
  • O(√κ · log(n/ε)) iterations to reach ε precision.
  • Hessian matrix: ∇²Q(𝐰) = (2/n)(𝐗ᵀ𝐗 + nγ𝐈_d).
  • κ = λ_max(𝐗ᵀ𝐗 + nγ𝐈) / λ_min(𝐗ᵀ𝐗 + nγ𝐈) is the condition number of the Hessian.
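A sketch of solving the same linear system with SciPy's conjugate gradient solver, wrapped in a LinearOperator so that 𝐗ᵀ𝐗 is never formed explicitly (the synthetic data and helper names are my own assumptions):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(X, y, gamma):
    """Solve (X^T X + n*gamma*I) w = X^T y by conjugate gradient."""
    n, d = X.shape
    # matrix-vector product with the positive-definite system matrix
    # (equal to the Hessian up to the constant factor 2/n, which cancels)
    matvec = lambda v: X.T @ (X @ v) + n * gamma * v
    A = LinearOperator((d, d), matvec=matvec)
    w, info = cg(A, X.T @ y)
    assert info == 0, "CG did not converge"
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=500)
w_cg = ridge_cg(X, y, gamma=0.1)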
Usefulness of Regularization

Question: Why do we use the ℓ2-norm regularization?

• Reason 1: easier to optimize.
  • Conjugate gradient (CG) requires O(√κ · log(n/ε)) iterations to reach ε precision.
  • Least squares: κ = λ_max(𝐗ᵀ𝐗) / λ_min(𝐗ᵀ𝐗).
  • Ridge regression: κ = λ_max(𝐗ᵀ𝐗 + nγ𝐈) / λ_min(𝐗ᵀ𝐗 + nγ𝐈). (γ ↑, κ ↓.)
  • CG converges faster as γ increases.
• Reason 2: better generalization.
  • Least squares has a lower training error (due to optimality).
  • Ridge regression makes better predictions on the test set (explained by the bias-variance decomposition).

[Figure: training and test MSE curves for least squares (LS) and ridge regression: Train MSE (LS), Train MSE (Ridge), Test MSE (Ridge), Test MSE (LS).]
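A small NumPy experiment, with synthetic data and parameter choices of my own, that illustrates both reasons: the condition number of 𝐗ᵀ𝐗 + nγ𝐈 shrinks as γ grows, and for a suitable γ the test MSE is often lower than that of least squares even though the training MSE is higher:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 40                              # few samples relative to features
X = rng.normal(size=(n, d))
w_true = np.concatenate([rng.normal(size=5), np.zeros(d - 5)])
y = X @ w_true + 0.5 * rng.normal(size=n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true + 0.5 * rng.normal(size=1000)

def fit_ridge(X, y, gamma):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

for gamma in [0.0, 0.01, 0.1, 1.0]:        # gamma = 0 is plain least squares
    H = X.T @ X + n * gamma * np.eye(d)
    kappa = np.linalg.cond(H)              # condition number (same as the Hessian's)
    w = fit_ridge(X, y, gamma)
    print(f"gamma={gamma:<5} kappa={kappa:12.1f} "
          f"train MSE={mse(X, y, w):.3f}  test MSE={mse(X_test, y_test, w):.3f}")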
The ℓ1-Norm Regularization
Motivations

𝐱 ∈ ℝ^d  ⟶  prediction  ⟶  y ∈ ℝ

• Fact 1: y can be independent of some of the d features.
• Fact 2: if d ≫ n, linear models are likely to overfit.

Example: using genomic data to predict disease.
• d is huge: humans have about 20K protein-coding genes.
• n is small: tens or hundreds of human participants in an experiment.
• Most genes are irrelevant to a specific disease.
• Goal 1: Select the features relevant to y.
• Goal 2: Prevent overfitting for large-d, small-n problems.
The ℓ1-Norm Constraint

• LASSO: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂²   s.t. ‖𝐰‖₁ ≤ t.
• The feasible set {𝐰 : ‖𝐰‖₁ ≤ t} is convex.
[Figure: the ℓ1 ball {𝐰 : ‖𝐰‖₁ ≤ t} in two dimensions, a diamond with vertices at ±t on the w₁ and w₂ axes.]
• It is a convex optimization model.
• The optimal solution 𝐰⋆ is sparse (i.e., most entries are zeros).
• Smaller t ⇒ sparser 𝐰⋆.
• Sparsity ⇒ feature selection. Why?
  • Let 𝐱′ be a test feature vector.
  • The prediction is 𝐱′ᵀ𝐰⋆ = w₁⋆x₁′ + w₂⋆x₂′ + ⋯ + w_d⋆x_d′.
  • If w₁⋆ = 0, then the prediction is independent of x₁′.
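A tiny worked sketch of this point (the sparse vector and data below are illustrative): perturbing only the features whose coefficients are zero leaves the prediction unchanged.

import numpy as np

rng = np.random.default_rng(0)
d = 6
w_star = np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.7])   # sparse solution: 3 nonzero entries
x_test = rng.normal(size=d)

# perturb only the features whose coefficients are zero
x_perturbed = x_test.copy()
zero_mask = (w_star == 0.0)
x_perturbed[zero_mask] += rng.normal(size=np.sum(zero_mask))

print(x_test @ w_star, x_perturbed @ w_star)   # identical predictions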
The ℓ1-Norm Regularization

• LASSO: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂²   s.t. ‖𝐰‖₁ ≤ t.
• Another form: min_𝐰 (1/(2n)) ‖𝐗𝐰 − 𝐲‖₂² + γ ‖𝐰‖₁.
  (The first term is the loss function; the second term is the regularization.)
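A minimal sketch of solving the penalized form above with proximal gradient descent (ISTA); the soft-thresholding step is the proximal operator of γ‖𝐰‖₁, and the synthetic data, step size, and iteration count are my own assumptions:

import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: shrink each entry toward zero by tau
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, gamma, num_iters=500):
    """Minimize (1/(2n)) * ||Xw - y||_2^2 + gamma * ||w||_1 by ISTA."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                               # step size
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y) / n            # gradient of (1/(2n)) * ||Xw - y||^2
        w = soft_threshold(w - eta * grad, eta * gamma)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(47)])
y = X @ w_true + 0.1 * rng.normal(size=100)

for gamma in [0.01, 0.1, 0.5]:
    w = lasso_ista(X, y, gamma)
    print(f"gamma={gamma}: {np.sum(np.abs(w) > 1e-8)} nonzero entries out of 50")

Larger γ plays the same role as a smaller constraint level t: the solution becomes sparser.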


Summary
Regularized ERM

• Regularized empirical risk minimization:
  min_{𝐰∈ℝ^d} (1/n) Σ_{i=1}^n L(𝐰; 𝐱ᵢ, yᵢ) + R(𝐰).

Loss Function
• Linear regression: L(𝐰; 𝐱ᵢ, yᵢ) = (1/2)(𝐰ᵀ𝐱ᵢ − yᵢ)²
• Logistic regression: L(𝐰; 𝐱ᵢ, yᵢ) = log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
• SVM: L(𝐰; 𝐱ᵢ, yᵢ) = max(0, 1 − yᵢ 𝐰ᵀ𝐱ᵢ)
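A small sketch of these per-example losses in NumPy (function names are mine; labels yᵢ ∈ {−1, +1} are assumed for the logistic and SVM losses):

import numpy as np

def squared_loss(w, x, y):
    # linear regression: 0.5 * (w^T x - y)^2
    return 0.5 * (w @ x - y) ** 2

def logistic_loss(w, x, y):
    # logistic regression with labels y in {-1, +1}
    return np.log1p(np.exp(-y * (w @ x)))

def hinge_loss(w, x, y):
    # SVM hinge loss with labels y in {-1, +1}
    return max(0.0, 1.0 - y * (w @ x))

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(squared_loss(w, x, 1.0), logistic_loss(w, x, +1), hinge_loss(w, x, +1))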
Regularization
• ℓ1-norm: R(𝐰) = γ ‖𝐰‖₁
• ℓ2-norm: R(𝐰) = γ ‖𝐰‖₂²
• Elastic net: R(𝐰) = γ₁ ‖𝐰‖₁ + γ₂ ‖𝐰‖₂²
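And the corresponding regularizers as a matching sketch (the γ values are placeholders):

import numpy as np

def l1_penalty(w, gamma):
    # gamma * ||w||_1
    return gamma * np.sum(np.abs(w))

def l2_penalty(w, gamma):
    # gamma * ||w||_2^2
    return gamma * np.dot(w, w)

def elastic_net_penalty(w, gamma1, gamma2):
    # gamma1 * ||w||_1 + gamma2 * ||w||_2^2
    return l1_penalty(w, gamma1) + l2_penalty(w, gamma2)

w = np.array([0.5, 0.0, -2.0])
print(l1_penalty(w, 0.1), l2_penalty(w, 0.1), elastic_net_penalty(w, 0.1, 0.05))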
