
Advanced Machine Learning

Loss Function and Regularization


Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture

• Write expressions for common loss functions

• Match loss functions to qualitative objectives

• List advantages and disadvantages of loss functions
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Assumptions behind MSE loss
• MSE is the square of RMSE, so minimizing one minimizes the other

• RMSE is the standard deviation of the error

• The mean of the error will be zero at the optimum of a convex least-squares problem (fitted with an intercept term)
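A minimal numeric sketch of these points, assuming a toy set of predictions (all names and values below are illustrative, not from the lecture):

```python
import numpy as np

# Toy predictions whose residuals happen to have zero mean
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.5, 6.5])

residuals = y_true - y_pred          # [0.5, -0.5, -0.5, 0.5], mean = 0
mse  = np.mean(residuals ** 2)       # 0.25
rmse = np.sqrt(mse)                  # 0.5

# With a zero-mean error, RMSE coincides with the standard deviation of the error
print(rmse, residuals.std())         # 0.5 0.5
```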
Regularization in regression
• Why regularize?
– Reduce variance, at the cost of bias
– Increase test (validation) accuracy
– Get interpretable models

• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge)
regularization (a.k.a. weight decay)
– Penalty on sums of squares of individual weights
$$J = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2; \qquad f(x_i) = \sum_{j=0}^{n} w_j\, x_{ij}$$
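A minimal numpy sketch of this objective (the function name and toy data below are mine, not from the lecture):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J = (1/N) * sum_i (y_i - f(x_i))^2 + (lam/2) * sum_{j=1..n} w_j^2,
    with f(x_i) = sum_{j=0..n} w_j * x_ij; column 0 of X is all ones,
    so the bias w_0 is left out of the penalty, as in the slide."""
    residuals = y - X @ w
    return np.mean(residuals ** 2) + 0.5 * lam * np.sum(w[1:] ** 2)

# Toy usage: 3 samples, bias column + 2 features
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
print(ridge_objective(np.array([0.1, 1.0, 0.5]), X, y, lam=0.1))
```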
Coefficient shrinkage using ridge

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Subset selection
• Set the coefficients with the lowest absolute values to zero
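A small sketch of this idea as hard thresholding of already-fitted coefficients (the helper name and numbers are hypothetical):

```python
import numpy as np

def keep_largest(coef, k):
    """Keep the k coefficients with the largest |value|; zero out the rest."""
    out = np.zeros_like(coef)
    keep = np.argsort(np.abs(coef))[-k:]   # indices of the k largest |coef|
    out[keep] = coef[keep]
    return out

# Example: keep the 2 largest of 5 fitted coefficients
print(keep_largest(np.array([0.1, -2.0, 0.3, 1.5, -0.05]), k=2))
# -> [ 0.  -2.   0.   1.5  0. ]
```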
Level sets of Lq norm of coefficients

Which one is ridge? Subset selection? Lasso?

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on sums of absolute values of weights
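A brief sketch of the effect using scikit-learn's Lasso; the data and the choice alpha=0.1 (playing the role of λ) are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: only 2 of the 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)         # most coefficients are driven exactly to zero
```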
Lasso coeff paths with decreasing λ

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to coeff shrinkage path of
ridge

Source: scikit-learn tutorial


Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Smoothly Clipped Absolute Deviation
(SCAD) Penalty

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
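The SCAD penalty acts like the L1 penalty near zero but flattens out for large coefficients, so large coefficients are not shrunk. A numpy sketch of its standard piecewise form from the Fan and Li paper (the function name is mine; a = 3.7 is the value they suggest):

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001), evaluated element-wise."""
    b = np.abs(np.asarray(beta, dtype=float))
    small  = b <= lam
    middle = (b > lam) & (b <= a * lam)
    large  = b > a * lam
    out = np.empty_like(b)
    out[small]  = lam * b[small]                         # L1-like near zero
    out[middle] = -(b[middle] ** 2 - 2 * a * lam * b[middle] + lam ** 2) / (2 * (a - 1))
    out[large]  = (a + 1) * lam ** 2 / 2                 # constant for large |beta|
    return out

print(scad_penalty([0.1, 1.0, 5.0], lam=0.5))
```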
Thresholding in three cases: large coefficients are not altered by SCAD and hard thresholding

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100.
– For genes sharing the same biological “pathway”, the correlations among them can be high.
• LASSO limitations
– If p > n, the lasso selects at most n variables; the number of selected variables is bounded by the sample size.
– Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.

Source: Elastic net, by Zou and Hastie


Elastic net: Use both L1 and L2
penalties

Source: Elastic net, by Zou and Hastie
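A small scikit-learn sketch of the combined penalty on a correlated “group” of features (ElasticNet's alpha/l1_ratio parameterization is scikit-learn's, not the paper's, and the data are made up):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Three nearly identical columns stand in for a correlated gene "group"
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.01 * rng.normal(size=(100, 1)) for _ in range(3)])
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)

# The lasso tends to concentrate the weight on one column of the group,
# while the L2 part of the elastic net spreads it across the group.
print(Lasso(alpha=0.1).fit(X, y).coef_)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```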


Geometry of elastic net

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables
as “group”

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables as
“group” and stabilizes the coefficient paths

Source: Elastic net, by Zou and Hastie


Why does the L2 penalty keep the coefficients of a group together?
• Try to think of an example with correlated variables (see the sketch below)
• This analysis can be generalized to the linear SVM

Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics, 2011
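Answering the prompt above, a minimal sketch with two identical features, so that only the sum c = w1 + w2 affects the fit (everything below is an illustration, not from the slides):

```python
# Compare the penalties over different splits of the same total weight c
c = 1.0
for w1 in (1.0, 0.75, 0.5):
    w2 = c - w1
    print(f"w=({w1:.2f},{w2:.2f})  L1={abs(w1)+abs(w2):.2f}  L2={w1**2+w2**2:.3f}")
# L1 is 1.00 for every split, so the lasso has no preference (and often keeps
# just one variable); L2 is smallest at the equal split w1 = w2 = c/2, which is
# why the ridge part of the penalty keeps a correlated group's coefficients together.
```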
A family of loss functions

Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
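The paper's general loss interpolates between common losses through a shape parameter α and a scale c. A minimal numpy sketch of its general form as I read it from Barron's paper (the function name is mine; the limiting cases α ∈ {0, 2} and α → −∞ are handled separately in the paper and omitted here):

```python
import numpy as np

def general_robust_loss(x, alpha, c):
    """rho(x) = |alpha - 2|/alpha * (((x/c)^2 / |alpha - 2| + 1)^(alpha/2) - 1),
    valid for alpha not in {0, 2}. alpha = 1 gives a smoothed L1 (Charbonnier)
    loss; more negative alpha gives a more robust, heavier-tailed loss."""
    z = (np.asarray(x, dtype=float) / c) ** 2
    b = abs(alpha - 2)
    return (b / alpha) * ((z / b + 1) ** (alpha / 2) - 1)

x = np.linspace(-5, 5, 5)
print(general_robust_loss(x, alpha=1.0, c=1.0))    # Charbonnier-like
print(general_robust_loss(x, alpha=-2.0, c=1.0))   # saturates for large |x|
```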
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


What is hinge loss?
• Surrogate loss function for the 0-1 loss

• There are other possible surrogate losses
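A small sketch of the hinge loss next to the 0-1 loss it upper-bounds (names and scores are illustrative). Only points with a non-zero hinge loss, i.e. those on the wrong side of the margin, contribute to the objective, which is the mechanism behind the small number of support vectors:

```python
import numpy as np

def hinge_loss(y, score):
    """max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(y, score):
    """The non-convex 0-1 loss that the hinge loss upper-bounds."""
    return (y * score <= 0).astype(float)

y     = np.array([+1, +1, -1, -1])
score = np.array([2.0, 0.3, -0.1, 1.5])
print(hinge_loss(y, score))      # [0.   0.7  0.9  2.5]
print(zero_one_loss(y, score))   # [0. 0. 0. 1.]
```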
Contents
• Revisiting MSE and L2 regularization

• How L1 regularization leads to sparsity

• Other losses inspired by L1 and L2

• Hinge loss leads to a small number of support vectors

• Link between logistic regression and cross entropy


Why logistic regression and BCE
• Let us assume a Bernoulli distribution: $P(x) = \mu^x (1-\mu)^{1-x}$
• An exponential-family distribution has the form
$P(x \mid \theta, \varphi) = \exp\!\left( \frac{\theta x - b(\theta)}{a(\varphi)} + c(x, \varphi) \right)$
• So, the Bernoulli can be re-written as
$P(x) = \exp\!\left( x \log\frac{\mu}{1-\mu} + \log(1-\mu) \right)$
• The log-odds of success is $\theta = \log\frac{\mu}{1-\mu}$
• So, $\mu = \frac{1}{1 + e^{-\theta}}$
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan
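A minimal sketch tying the derivation above to the binary cross-entropy (BCE) loss; the labels and logits below are made up:

```python
import numpy as np

def sigmoid(theta):
    """mu = 1 / (1 + exp(-theta)): inverse of the log-odds theta = log(mu / (1 - mu))."""
    return 1.0 / (1.0 + np.exp(-theta))

def bce(y, theta):
    """Binary cross-entropy = negative Bernoulli log-likelihood with mu = sigmoid(theta)."""
    mu = sigmoid(theta)
    return -(y * np.log(mu) + (1 - y) * np.log(1 - mu))

y     = np.array([1.0, 0.0, 1.0])
theta = np.array([2.0, -1.0, -0.5])   # logits (log-odds)
print(bce(y, theta))
```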
Generative vs. discriminative
Generative
• Belief network A is more modular
– Class-conditional densities are likely to be local, characteristic functions of the objects being classified, invariant to the nature and number of the other classes
• More “natural”
– Deciding what kind of object to generate and then generating it from a recipe
• More efficient to estimate, if the model is correct

Discriminative
• More robust
– Don’t need a precise model specification, so long as it is from the exponential family
• Requires fewer parameters
– O(n) as opposed to O(n^2)

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss

• Cosine similarity

• Ranking
– Point-wise
– Pair-wise
• φ(z) = (1 − z)+, exp(−z), or log(1 + exp(−z))
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
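A small sketch of the three pairwise surrogates φ(z) listed above, where z = f(x_i) − f(x_j) is the score difference of a pair that should be ranked with i ahead of j (function names and values are mine):

```python
import numpy as np

def phi_hinge(z):    return np.maximum(0.0, 1.0 - z)   # (1 - z)+
def phi_exp(z):      return np.exp(-z)                  # exp(-z)
def phi_logistic(z): return np.log1p(np.exp(-z))        # log(1 + exp(-z))

z = np.array([-1.0, 0.0, 2.0])   # negative z means the pair is mis-ordered
for phi in (phi_hinge, phi_exp, phi_logistic):
    print(phi.__name__, phi(z))
```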
Dropout: Drop a unit out to prevent
co-adaptation

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Make other features unreliable to break co-adaptation
• Equivalent to adding noise
• Train several (dropped-out) architectures in one architecture (O(2^n))
• Average architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Model averaging

• Average output should be the same


• Alternatively,
– w/p at training time
– w at testing time
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
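A minimal numpy sketch of the “w/p at training time, w at testing time” rule above, applied to a layer's activations (shapes and the keep probability p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # keep probability
h = rng.normal(size=(4, 8))              # activations of a hidden layer

# Training: drop units at random and rescale by 1/p ("w/p at training time"),
# so the expected activation matches the test-time value.
mask = rng.random(h.shape) < p
h_train = (h * mask) / p

# Testing: keep all units with unscaled weights ("w at testing time");
# no sampling, which implicitly averages the O(2^n) thinned networks.
h_test = h

print(h_train.mean(), h_test.mean())     # equal in expectation
```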
Difference between non-DO and DO
features

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, DO leads to sparse activation

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with DO, even if
you increase the number of neurons

Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
