06 Optimization Basics PDF

CMSC498L
Introduction to Deep Learning

Abhinav Shrivastava
Optimization Basics
Convex functions
source
Convex functions
source
Convex functions
source
Concave functions
source
What about this?
source
Minimize a function
! " = "$
Recall: Supervised Learning
Formulation:
• Given training data: {(x1,y1), …, (xN,yN)}, i.i.d from distribution D
• Find y = f(x) ∈ " using training data
• such that f is correct on test data i.i.d from distribution D
Credit: Lazebnik, Liang

Recall: Supervised Learning
Formulation:
• such that expected loss is small:
# $ = &((,*)∽- [/ $, 0, 1 ]

Supervised Learning
Formulation:
• such that expected loss is small:
# $ = &((,*)∽- [/ $, 0, 1 ]

Training Linear Classifiers
• Given: i.i.d. training data !" , $" , % = 1, … , ) ,
$" ∈ {−1,1}

$" ∈ −1,1
• Hypothesis class: ,- (!) = sgn(3 4 !)
• Classification with bias, i.e. ,- ! = sgn(3 4 ! + 6), can be reduced to the

case w/o bias by letting 3 4 = 3; 6 and ! 4 = [!; 1]

$" ∈ −1,1
• Hypothesis class: ,- (!) = sgn(3 4 !)
• Loss: minimizing the number of mistakes

:
1
65 ,- = 7 ;[sgn 3 4 !" ≠ $" ]
)
"89
• Difficult to optimize directly!

Use Linear Regression
• Find !" ($) = ' ( $ that minimizes )* loss or mean squared error
3
1
,+ !" = /(' ( $0 − 50 )*
.
012
• Ignores the fact that 5 ∈ {−1,1} but is easy to optimize

Linear Regression: Optimization
• Let ! be a matrix whose ith row is "#$ , % be the vector ('( , … , '+ )$
+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(

+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• This is a convex function of the weights

+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• Find the gradient w.r.t. 6:
∇0 !6 − % 88

+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• Find the gradient w.r.t. 6:
∇0 !6 − % 88 = ∇0 !6 − % $
!6 − %
= ∇0 6 $ ! $ !6 − 26 $ ! $ % + % $ % = 2! $ !6 − 2! $ %
• Set gradient to zero to get the minimizer:
! $ !6 = ! $ %
6 = (! $ !)<( ! $ %
• Linear algebra view
• If ! is invertible, simply solve !" = $ and
get " = ! %&$
• But typically ! is a “tall” matrix so you need to find a least squares solution to an
over-constrained system

• MLE view
• Interpretation of !" loss: negative log likelihood assuming # is normally
distributed with mean $% & = ( ) & + +
y = wTx + b
(xi, yi)

Maximum Likelihood Estimation (MLE)
• Given: i.i.d. training data !" , $" , % = 1, … , )
• Let *+ $ ! , , ∈ Θ be a family of distributions parameterized by ,
• Maximum (conditional) likelihood estimate:
,/01 = argmax+ 7 *+ ($" |!" )
"
= argmin+ − ∑" log *+ ($" |!" )

Maximum Likelihood Estimation (MLE)
!"# = argmin+ − ∑. log 1+ (3. |5. )
• Assume 1+ 3 5 = Normal(3; 9+ 5 , ; < )

<
1 3 − 9+ 5
log 1+ 3. 5. = log exp − <
2?; < 2;
1 < 1
= − < 3 − 9+ 5 − log ; − log(2?)
2; 2
<
!"# = argmin+ C 3. − 9+ 5.
.
Problem with Linear Regression
Credit: Lazebnik, Figure from PRML

Linear Classifier: Attempt 2
• Given: i.i.d. training data
!" , $" , % = 1, … , ) ,
$" ∈ {−1,1}
• Hypothesis ./ (!) = sgn(5 6 !)
• $ = +1 if 5 9 ! > 0
• $ = −1 if 5 9 ! < 0
• Find ./ (!) = sgn(5 6 !) that minimizes the number of mistakes
B
1
>= ./ = ? C[sgn 5 6 !" ≠ $" ]
)
"@A

Attempt 2: Perceptron training algorithm
Perceptron: figure from the lecture note of Nina Balcan

Attempt 2: Perceptron training algorithm
• Initialize weights randomly
• Cycle through training examples in multiple passes (epochs)
• For each training example ("# , %# ):
• If current prediction sgn(* + "# ) does not match %# then update weights:
*,-. ← *, + %# "#

Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
(,-. ← (, + !" *"
If !" = 1, the response ( ) *" is initially negative and will be __________

)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )
If !" = −1, the response ( ) *" is initially positive and will be __________
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

(,-. ← (, + !" *"
If !" = 1, the response ( ) *" is initially negative and will be increased

)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )
If !" = −1, the response ( ) *" is initially positive and will be decreased
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

(,-. ← (, + !" *"
If !" = 1, the response ( ) *" is initially negative and will be increased

)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )
If !" = −1, the response ( ) *" is initially positive and will be decreased
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

(,-. ← (, + !" *"
• The raw response of the classifier changes to
(,) *" + !" *" 1

( ← ( + !" *"
( ) *" + !" *" .

Perceptron update rule with Learning Rate
( ← ( + .!" *"
( ) *" + .!" *" /
where . is a learning rate that should decay slowly* over time

Perceptron update rule with Learning Rate
( ← ( + .!" *"
( ) *" + .!" *" /
where . is a learning rate that should decay slowly* over time
* will go over this in detail later.

The Perceptron Theorem
• Intuitively, makes sense
• How do we know it will not get stuck on a bad local minimum?
• This was Rosenblatt’s seminal contribution:
• Theorem:
• Suppose there exists ! ∗ that correctly classifies $% , '%
• W.L.O.G., all $% and ! ∗ have length 1 ( $ = ! = 1 ), so the minimum distance of
any example to the decision boundary is:
* = min ! ∗ . $%
%
/ 1 / 1
• Then Perceptron makes at most 0
mistakes, and will converge in < 0 steps
Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

• Intuitively, makes sense
• How do we know it will not get stuck on a bad local minimum?
• This was Rosenblatt’s seminal contribution:
• Theorem:
• Suppose there exists ! ∗ that correctly classifies $% , '%
• W.L.O.G., all $% and ! ∗ have length 1 ( $ = ! = 1 ), so the minimum distance of
any example to the decision boundary is:
* = min ! ∗ . $%
%
/ 1 / 1
• Then Perceptron makes at most 0
mistakes, and will converge in < 0
steps

The Perceptron Theorem: Notes
• Not the only proof.
• This result was the start of learning theory
• For the first time there was a proof that a learning machine could actually
learn something!

• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simply logic problems that it cannot solve
Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
Side Notes: Connectionism vs Symbolism
Symbolism: AI can be achieved by representing concepts as symbols

• Example: rule-based expert system, formal grammar
Connectionism: explain intellectual abilities using connections between

neurons (i.e., artificial neural networks)
• Example: perceptron, larger scale neural networks

Symbolism Example
Example from Machine learning lecture notes by Tom Mitchell

Connectionism Example
Credit: Lazebnik, Liang, PRML

Side Notes: Connectionism vs Symbolism
Formal theories of logical reasoning, grammar, and other higher mental

faculties compel us to think of the mind as a machine for rule-based
manipulation of highly structured arrays of symbols. What we know of the
brain compels us to think of human information processing in terms of
manipulation of a large unstructured set of numbers, the activity levels of
interconnected neurons.
---- The Central Paradox of Cognition (Smolensky et al., 1992)

• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simply logic problems that it cannot solve
• This killed off the enthusiasm until an old result by Kolmogorov saved the
day
• Universal Approximation Theorem
• 2-layer neural network = Universal function approximator
Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
• Linearly separable data: converges to a perfect solution
1
1
! = argmin* - 2[sgn ! 5 6. ≠ 8. ]
∗
,
./0
• Is the solution unique?
• Non-separable data: converges to a minimum-error solution assuming:
a. Examples are presented in random sequence
b. The learning rate decays as : 1/< where < is the number of iterations and
examples are presented in random sequence

Linear Classification: Recap
We want to learn a linear classifier that minimizes the number of mistakes
• Attempt 1: relax objective by dropping the sign function, using !" loss –
linear regression
• Easy to optimize, not really appropriate for binary classification, sensitive to outliers
• Attempt 2: perceptron update rule
• Gives good convergence guarantees, hard to extend to optimizing more complicated
models
• Attempt 3: relax objective by turning the sign function into a differentiable
“soft threshold” – logistic regression

Recap: Sigmoid function
Squash the linear response of the classifier to the interval [0,1]:
(
1
& ' ) =
1 + exp(−' ( ))

Recap: Sigmoid function
• Output of sigmoid can be interpreted as posterior label probability or
confidence returned by classifier:
)
1
!" # = 1 & = ' ( & =
1 + exp(−( ) &)
!" # = −1 & = 1 − ' ( ) &
12345 6" 7 8 61 345(6" 7 8) 1

= = =
12345 6" 7 8 12345(6" 7 8) 345 " 7 8 21
= ' −( ) &
• Sigmoid is symmetric: 1 − ' 9 = ' −9
Sigmoid: Interpretation
We can model the posteriors !(#|%) in terms of class-conditional densities
! %# :
! %#=1 ! #=1
! #=1% =
!(%)
! %#=1 ! #=1
=
! % # = 1 ! # = 1 + ! % # = −1 !(# = −1)
1 !(# = 1|%)
= = /(.), . = log
1 + exp(−.) !(# = −1|%)
Sigmoid: Interpretation
• Adopting a linear + sigmoid model is equivalent to assuming linear log odds:
$(& = 1|*)
log = -.* + 0
$(& = −1|*)
• This happens when $ * & = 1 and $ * & = −1 are Gaussians with

different means and the same covariance matrices and - is related to the
difference between the means

Logistic Regression
• Given: !" , $" , % = 1, … , ) , $" ∈ {−1,1}
• Find . that minimizes
4
1
0/ . = − 1 log 89 $" !"
)
"23
3 3
= − 4 ∑4":<=23 log > . ? !" − 4 ∑4":<=2@3 log[1 − >(. ? !" )]
3 4
= − 4 ∑"23 log > $" . ? !"
• No closed-form solution, need to use gradient descent

Gradient Descent
• Goal: find ! to minimize loss #" (!)
• Start with some initial estimate of !
• At each step, find ∇#" (!), the gradient of the loss w.r.t. !, and take a small
step in the opposite direction
!'() ← !' − , ∇#" (!)

Gradient Descent
!'() ← !' − , ∇#" !
Note: step size plays a crucial role
Credit: Lazebnik, Liang, Figure source

Why Gradient Descent?
Credit: Lazebnik, Liang, Fouhey

Global minimum

• Each point in the picture is a function evaluation

• Some functions are easy/quick, some may take
optimize may take hours to evaluate
• Most functions we care about will take hours

Credit: Fouhey, Karpathy and Fei-Fei

Option 1: Grid Search
#systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
….
for dimNValue in dimNValues:
w = [dim1Value, …, dimNValue]
if L(w) < bestScore:
best, bestScore = w, L(w)
return best
Credit: Fouhey
Option 1: Grid Search
Pros
• Super simple
• Only requires being able to evaluate
model
Cons
• Scales horribly to high dimensional
spaces
Complexity: samplesPerDimnumberOfDims
Credit: Fouhey
What else?
Credit: Fouhey, Karpathy and Fei-Fei

Option 2: Random Search
#Do random stuff
best, bestScore = None, Inf
for iter in range(numIters):
w = random(N,1) #sample
score = ! " #evaluate
if score < bestScore:
best, bestScore = w, score
return best
Credit: Fouhey
Option 2: Random Search
Pros
• Super simple
• Only requires being able to sample
random vectors and evaluate model
Cons
• Slow and stupid –throwing darts at high
dimensional dart board
• Might miss something
Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze
Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze
Credit: Fouhey
Option 3: Use Gradients
Arrows:
gradient
Credit: Fouhey
Arrows:
gradient direction
(scaled to unit length)
Credit: Fouhey
Arrows:
gradient direction
Want: arg min ((')

'
Which is bigger (for small α)?
≤?
( ' ( ' + ,∇' ((')
>?
Credit: Fouhey
Arrows:
gradient direction
Want: arg min ((')

'
1
Which is bigger (for small α)?
1 + ,2
≤?
( ' ( ' + ,∇' ((')
>?
Credit: Fouhey
• w0 = initialize() #initialize
• for iter in range(numIters):
• g = ∇" # " #eval gradient
• w = w + -stepsize(iter)*g #update w
• return w
Credit: Fouhey
Gradient Descent
Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
|x|
Derivatives/Gradients
Defined everywhere but 0
!
# " = sign(") " ≠0
!"
undefined "=0
Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
Use Subgradient: any underestimate of function
|x|
Subderivatives/subgradients
Defined everywhere
!
# " = sign(") " ≠0
!"
!
# " ∈ [−1,1] "=0
!"
In practice: at discontinuity, pick value on either side.

Credit: Fouhey
Gradient Descent: Note on Learning Rate (LR)
1x10-2 10x10-2 12x10-2
falls short converges diverges
Credit: Fouhey
Back to Logistic Regression Gradient Descent
!'() ← !' − , ∇#" !
Note: step size plays a crucial role
Credit: Lazebnik, Liang, Figure source

Gradient Descent for Logistic Regression
,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
Derivative of sigmoid:
0 6 7 = 0 7 1 − 0(7) = 0 7 0 −7

,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
,
1 0 1) # 2 3) 0 −1) # 2 3) 1) 3)
=− (
' 0 1) # 2 3)
)*+
8
Credit: Lazebnik, Liang We also used the chain rule: 6 7 3 = 68 7 3 7′(3)
,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
, ,
1 0 1) # 2 3) 0 −1) # 2 3) 1) 3) 1
=− ( = − ( 0 −1) # 2 3) 1) 3)
' 0 1) # 2 3) '
)*+ )*+

-
1
∇#" $ = − ) . −/* $ 0 1* /* 1*
(
*+,
Update for $:
-
1
$ ← $ + 4 ) . −/* $ 0 1* /* 1*
(
*+,
• For a single parameter update, need to cycle through the entire training set!
• This is also called a batch update

Stochastic Gradient Descent
At each iteration, take a single data point !" , $" and perform a parameter
update using ∇& ', !" , $" , the gradient of the loss for that point
' ← ' − * ∇& ', !" , $"
This is called an online or stochastic update

(Stochastic) Gradient Descent
,
1
"! # = − ( log 0 1) # 2 3)
'
)*+
GD update for #:
,
1
# ← # + 6 ( 0 −1) # 2 3) 1) 3)
'
)*+
Loss for a single sample:
7 #, 3) , 1) = − log 0 1) # 2 3)
SGD update for #:
# ← # + 6 0 −1) # 2 3) 1) 3)
SGD update for logistic regression:
! ← ! + $ % −'( ! ) *( '( *(
SGD update for perceptron:

! ← ! + $ +['( ! ) *( < 0]'( *(

SGD update for logistic regression:
! ← ! + $ % −'( ! ) *( '( *(
SGD update for perceptron:

! ← ! + $ +['( ! ) *( < 0]'( *(
SGD update for linear regression

! ← ! − $ (! ) *( − '( )*(

Next to next class
• Gradient Descent in practice
• Impact of LR in detail
• Momentum in Gradient Descent
• Regularization

06 Optimization Basics PDF

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

06 Optimization Basics PDF

Uploaded by

Copyright:

Available Formats

CMSC498L

Introduction to Deep Learning

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

• Classification with bias, i.e. ,- ! = sgn(3 4 ! + 6), can be reduced to the

Credit: Lazebnik, Liang

• Loss: minimizing the number of mistakes

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

• Assume 1+ 3 5 = Normal(3; 9+ 5 , ; < )

Credit: Lazebnik, Figure from PRML

Credit: Lazebnik, Liang

Perceptron: figure from the lecture note of Nina Balcan

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

If !" = 1, the response ( ) *" is initially negative and will be __________

Credit: Lazebnik, Liang

If !" = 1, the response ( ) *" is initially negative and will be increased

Credit: Lazebnik, Liang

If !" = 1, the response ( ) *" is initially negative and will be increased

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

where . is a learning rate that should decay slowly* over time

Credit: Lazebnik, Liang

where . is a learning rate that should decay slowly* over time

* will go over this in detail later.

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

Symbolism: AI can be achieved by representing concepts as symbols

Connectionism: explain intellectual abilities using connections between

Credit: Lazebnik, Liang

Example from Machine learning lecture notes by Tom Mitchell

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang, PRML

Formal theories of logical reasoning, grammar, and other higher mental

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

12345 6" 7 8 61 345(6" 7 8) 1

• This happens when $ * & = 1 and $ * & = −1 are Gaussians with

Credit: Lazebnik, Liang

• No closed-form solution, need to use gradient descent

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang

Credit: Lazebnik, Liang, Figure source

Credit: Lazebnik, Liang, Fouhey

Credit: Lazebnik, Liang, Fouhey

• Each point in the picture is a function evaluation

Credit: Lazebnik, Liang, Fouhey

Credit: Fouhey, Karpathy and Fei-Fei