
Advanced Machine Learning

Loss Function and Regularization


Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture

• Write expressions for common loss functions

• Match loss functions to qualitative objectives

• List advantages and disadvantages of loss functions
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Assumptions behind MSE loss
• MSE is the square of RMSE, so minimizing one minimizes the other

• RMSE is the standard deviation of the error

• The mean of the error will be zero at the optimum of a convex least-squares problem (fitted with an intercept term)
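A minimal numeric sketch of these points, assuming a toy set of predictions (all names and values below are illustrative, not from the lecture):

```python
import numpy as np

# Toy predictions whose residuals happen to have zero mean
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.5, 6.5])

residuals = y_true - y_pred          # [0.5, -0.5, -0.5, 0.5], mean = 0
mse  = np.mean(residuals ** 2)       # 0.25
rmse = np.sqrt(mse)                  # 0.5

# With a zero-mean error, RMSE coincides with the standard deviation of the error
print(rmse, residuals.std())         # 0.5 0.5
```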
Regularization in regression
• Why regularize?
– Reduce variance, at the cost of bias
– Increase test (validation) accuracy
– Get interpretable models

• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge)
regularization (a.k.a. weight decay)
– Penalty on sums of squares of individual weights
$$J = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2; \qquad f(x_i) = \sum_{j=0}^{n} w_j\, x_{ij}$$
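A minimal numpy sketch of this objective (the function name and toy data below are mine, not from the lecture):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J = (1/N) * sum_i (y_i - f(x_i))^2 + (lam/2) * sum_{j=1..n} w_j^2,
    with f(x_i) = sum_{j=0..n} w_j * x_ij; column 0 of X is all ones,
    so the bias w_0 is left out of the penalty, as in the slide."""
    residuals = y - X @ w
    return np.mean(residuals ** 2) + 0.5 * lam * np.sum(w[1:] ** 2)

# Toy usage: 3 samples, bias column + 2 features
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
print(ridge_objective(np.array([0.1, 1.0, 0.5]), X, y, lam=0.1))
```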
Coefficient shrinkage using ridge

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Subset selection
• Set the coefficients with the lowest absolute values to zero
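A small sketch of this idea as hard thresholding of already-fitted coefficients (the helper name and numbers are hypothetical):

```python
import numpy as np

def keep_largest(coef, k):
    """Keep the k coefficients with the largest |value|; zero out the rest."""
    out = np.zeros_like(coef)
    keep = np.argsort(np.abs(coef))[-k:]   # indices of the k largest |coef|
    out[keep] = coef[keep]
    return out

# Example: keep the 2 largest of 5 fitted coefficients
print(keep_largest(np.array([0.1, -2.0, 0.3, 1.5, -0.05]), k=2))
# -> [ 0.  -2.   0.   1.5  0. ]
```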
Level sets of Lq norm of coefficients

Which one is ridge? Subset selection? Lasso?

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on sums of absolute values of weights
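A brief sketch of the effect using scikit-learn's Lasso; the data and the choice alpha=0.1 (playing the role of λ) are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: only 2 of the 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)         # most coefficients are driven exactly to zero
```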
Lasso coeff paths with decreasing λ

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to coeff shrinkage path of
ridge

Source: scikit-learn tutorial


Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Smoothly Clipped Absolute Deviation
(SCAD) Penalty

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
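The SCAD penalty acts like the L1 penalty near zero but flattens out for large coefficients, so large coefficients are not shrunk. A numpy sketch of its standard piecewise form from the Fan and Li paper (the function name is mine; a = 3.7 is the value they suggest):

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001), evaluated element-wise."""
    b = np.abs(np.asarray(beta, dtype=float))
    small  = b <= lam
    middle = (b > lam) & (b <= a * lam)
    large  = b > a * lam
    out = np.empty_like(b)
    out[small]  = lam * b[small]                         # L1-like near zero
    out[middle] = -(b[middle] ** 2 - 2 * a * lam * b[middle] + lam ** 2) / (2 * (a - 1))
    out[large]  = (a + 1) * lam ** 2 / 2                 # constant for large |beta|
    return out

print(scad_penalty([0.1, 1.0, 5.0], lam=0.5))
```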
Thresholding in three cases: large coefficients are not altered by SCAD and hard thresholding

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100.
– For genes sharing the same biological “pathway”, the correlations among them can be high.
• LASSO limitations
– If p > n, the lasso selects at most n variables; the number of selected variables is bounded by the sample size.
– Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.

Source: Elastic net, by Zou and Hastie


Elastic net: Use both L1 and L2
penalties

Source: Elastic net, by Zou and Hastie
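A small scikit-learn sketch of the combined penalty on a correlated “group” of features (ElasticNet's alpha/l1_ratio parameterization is scikit-learn's, not the paper's, and the data are made up):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Three nearly identical columns stand in for a correlated gene "group"
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.01 * rng.normal(size=(100, 1)) for _ in range(3)])
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)

# The lasso tends to concentrate the weight on one column of the group,
# while the L2 part of the elastic net spreads it across the group.
print(Lasso(alpha=0.1).fit(X, y).coef_)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```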


Geometry of elastic net

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables
as “group”

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables as
“group” and stabilizes the coefficient paths

Source: Elastic net, by Zou and Hastie


Why does the L2 penalty keep the coefficients of a group together?
• Try to think of an example with correlated variables (see the sketch below)
• This analysis can be generalized to the linear SVM

Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics, 2011
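Answering the prompt above, a minimal sketch with two identical features, so that only the sum c = w1 + w2 affects the fit (everything below is an illustration, not from the slides):

```python
# Compare the penalties over different splits of the same total weight c
c = 1.0
for w1 in (1.0, 0.75, 0.5):
    w2 = c - w1
    print(f"w=({w1:.2f},{w2:.2f})  L1={abs(w1)+abs(w2):.2f}  L2={w1**2+w2**2:.3f}")
# L1 is 1.00 for every split, so the lasso has no preference (and often keeps
# just one variable); L2 is smallest at the equal split w1 = w2 = c/2, which is
# why the ridge part of the penalty keeps a correlated group's coefficients together.
```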
A family of loss functions

Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
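The paper's general loss interpolates between common losses through a shape parameter α and a scale c. A minimal numpy sketch of its general form as I read it from Barron's paper (the function name is mine; the limiting cases α ∈ {0, 2} and α → −∞ are handled separately in the paper and omitted here):

```python
import numpy as np

def general_robust_loss(x, alpha, c):
    """rho(x) = |alpha - 2|/alpha * (((x/c)^2 / |alpha - 2| + 1)^(alpha/2) - 1),
    valid for alpha not in {0, 2}. alpha = 1 gives a smoothed L1 (Charbonnier)
    loss; more negative alpha gives a more robust, heavier-tailed loss."""
    z = (np.asarray(x, dtype=float) / c) ** 2
    b = abs(alpha - 2)
    return (b / alpha) * ((z / b + 1) ** (alpha / 2) - 1)

x = np.linspace(-5, 5, 5)
print(general_robust_loss(x, alpha=1.0, c=1.0))    # Charbonnier-like
print(general_robust_loss(x, alpha=-2.0, c=1.0))   # saturates for large |x|
```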
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


What is hinge loss?
• Surrogate loss function for the 0-1 loss

• There are other possible surrogate losses
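A small sketch of the hinge loss next to the 0-1 loss it upper-bounds (names and scores are illustrative). Only points with a non-zero hinge loss, i.e. those on the wrong side of the margin, contribute to the objective, which is the mechanism behind the small number of support vectors:

```python
import numpy as np

def hinge_loss(y, score):
    """max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(y, score):
    """The non-convex 0-1 loss that the hinge loss upper-bounds."""
    return (y * score <= 0).astype(float)

y     = np.array([+1, +1, -1, -1])
score = np.array([2.0, 0.3, -0.1, 1.5])
print(hinge_loss(y, score))      # [0.   0.7  0.9  2.5]
print(zero_one_loss(y, score))   # [0. 0. 0. 1.]
```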
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Why logistic regression and BCE
• Let us assume a Bernoulli distribution: $P(x) = \mu^x (1-\mu)^{1-x}$
• An exponential-family distribution has the form
$P(x \mid \theta, \varphi) = \exp\!\left( \frac{\theta x - b(\theta)}{a(\varphi)} + c(x, \varphi) \right)$
• So, the Bernoulli can be re-written as
$P(x) = \exp\!\left( x \log\frac{\mu}{1-\mu} + \log(1-\mu) \right)$
• The log-odds of success is $\theta = \log\frac{\mu}{1-\mu}$
• So, $\mu = \frac{1}{1 + e^{-\theta}}$
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan
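A minimal sketch tying the derivation above to the binary cross-entropy (BCE) loss; the labels and logits below are made up:

```python
import numpy as np

def sigmoid(theta):
    """mu = 1 / (1 + exp(-theta)): inverse of the log-odds theta = log(mu / (1 - mu))."""
    return 1.0 / (1.0 + np.exp(-theta))

def bce(y, theta):
    """Binary cross-entropy = negative Bernoulli log-likelihood with mu = sigmoid(theta)."""
    mu = sigmoid(theta)
    return -(y * np.log(mu) + (1 - y) * np.log(1 - mu))

y     = np.array([1.0, 0.0, 1.0])
theta = np.array([2.0, -1.0, -0.5])   # logits (log-odds)
print(bce(y, theta))
```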
Generative vs. discriminative
Generative
• Belief network A is more modular
– Class-conditional densities are likely to be local, characteristic functions of the objects being classified, invariant to the nature and number of the other classes
• More “natural”
– Deciding what kind of object to generate and then generating it from a recipe
• More efficient to estimate, if the model is correct

Discriminative
• More robust
– Don’t need a precise model specification, so long as it is from the exponential family
• Requires fewer parameters
– O(n) as opposed to O(n^2)

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss

• Cosine similarity

• Ranking
– Point-wise
– Pair-wise
• φ(z) = (1 − z)+, exp(−z), or log(1 + exp(−z))
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
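A small sketch of the three pairwise surrogates φ(z) listed above, where z = f(x_i) − f(x_j) is the score difference of a pair that should be ranked with i ahead of j (function names and values are mine):

```python
import numpy as np

def phi_hinge(z):    return np.maximum(0.0, 1.0 - z)   # (1 - z)+
def phi_exp(z):      return np.exp(-z)                  # exp(-z)
def phi_logistic(z): return np.log1p(np.exp(-z))        # log(1 + exp(-z))

z = np.array([-1.0, 0.0, 2.0])   # negative z means the pair is mis-ordered
for phi in (phi_hinge, phi_exp, phi_logistic):
    print(phi.__name__, phi(z))
```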
Dropout: Drop a unit out to prevent
co-adaptation

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Make other features unreliable to break co-adaptation
• Equivalent to adding noise
• Train several (dropped-out) architectures in one architecture (O(2^n))
• Average architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Model averaging

• Average output should be the same


• Alternatively,
– w/p at training time
– w at testing time
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
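A minimal numpy sketch of the “w/p at training time, w at testing time” rule above, applied to a layer's activations (shapes and the keep probability p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # keep probability
h = rng.normal(size=(4, 8))              # activations of a hidden layer

# Training: drop units at random and rescale by 1/p ("w/p at training time"),
# so the expected activation matches the test-time value.
mask = rng.random(h.shape) < p
h_train = (h * mask) / p

# Testing: keep all units with unscaled weights ("w at testing time");
# no sampling, which implicitly averages the O(2^n) thinned networks.
h_test = h

print(h_train.mean(), h_test.mean())     # equal in expectation
```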
Difference between non-DO and DO
features

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, DO leads to sparse activation

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with DO, even if
you increase the number of neurons

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
