06: Optimization Basics
Convex functions
Concave functions
What about this?
Minimize a function
f(x) = x^2
Recall: Supervised Learning
Formulation:
• Given training data: {(x_1, y_1), …, (x_N, y_N)}, i.i.d. from distribution D
• Find y = f(x), with f in a hypothesis class ℋ, using training data
• such that f is correct on test data i.i.d. from distribution D
Linear model: y = w^T x + b, with training example (x_i, y_i)
Perceptron update: w_{t+1} ← w_t + y_i x_i
If y_i = −1, the response w^T x_i is initially positive and will be decreased:

w_{t+1}^T x_i = (w_t − x_i)^T x_i = w_t^T x_i − x_i^T x_i = w_t^T x_i + y_i (x_i^T x_i)
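A minimal sketch of this update in code; the toy numbers and the helper name `perceptron_step` are mine, not from the slides:

```python
import numpy as np

def perceptron_step(w, x_i, y_i):
    """One perceptron update: change w only when (x_i, y_i) is misclassified.

    If y_i = -1 and the response w @ x_i is positive, the update w + y_i * x_i
    subtracts x_i, lowering the response by ||x_i||^2, as derived above.
    """
    if np.sign(w @ x_i) != y_i:
        w = w + y_i * x_i
    return w

# Toy run: a single negative example with an initially positive response.
w = np.array([1.0, 0.5])
x_i, y_i = np.array([2.0, 1.0]), -1
print(w @ x_i)                      # 2.5 > 0: misclassified
w = perceptron_step(w, x_i, y_i)
print(w @ x_i)                      # 2.5 - ||x_i||^2 = -2.5
```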
Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
Side Notes: Connectionism vs Symbolism
Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
The Perceptron Theorem
• Linearly separable data: converges to a perfect solution
w* = argmin_w (1/N) Σ_{i=1}^{N} 1[sgn(w^T x_i) ≠ y_i]
• Is the solution unique?
• Non-separable data: converges to a minimum-error solution assuming:
  a. Examples are presented in random sequence
  b. The learning rate decays as O(1/t), where t is the iteration number
σ(w^T x) = 1 / (1 + exp(−w^T x))
1 − σ(w^T x) = σ(−w^T x)
• Sigmoid is symmetric: 1 − σ(a) = σ(−a)
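A quick numerical check of the definition and the symmetry property; the helper name `sigmoid` is mine:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Symmetry check: 1 - sigma(a) == sigma(-a) up to floating-point error.
a = np.linspace(-5.0, 5.0, 11)
assert np.allclose(1.0 - sigmoid(a), sigmoid(-a))
```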
Credit: Lazebnik, Liang
Sigmoid: Interpretation
We can model the posteriors P(y|x) in terms of the class-conditional densities P(x|y):

P(y = 1 | x) = P(x | y = 1) P(y = 1) / P(x)
             = P(x | y = 1) P(y = 1) / [P(x | y = 1) P(y = 1) + P(x | y = −1) P(y = −1)]
             = 1 / (1 + exp(−a)) = σ(a),  where a = log [P(y = 1 | x) / P(y = −1 | x)]
Credit: Lazebnik, Liang
Sigmoid: Interpretation
• Adopting a linear + sigmoid model is equivalent to assuming linear log odds:
log [P(y = 1 | x) / P(y = −1 | x)] = w^T x + b
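As a sanity check on this interpretation, a small sketch confirming that Bayes' rule and σ(a) give the same posterior; the Gaussian class-conditionals and priors below are my own toy choices:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative class-conditional densities and priors (my own choices).
p_pos, p_neg = norm(loc=+1.0, scale=1.0), norm(loc=-1.0, scale=1.0)
prior_pos, prior_neg = 0.4, 0.6

x = 0.3
num = p_pos.pdf(x) * prior_pos                 # P(x|y=+1) P(y=+1)
den = num + p_neg.pdf(x) * prior_neg           # P(x)
posterior = num / den                          # Bayes' rule, as on the slide

a = np.log(num / (p_neg.pdf(x) * prior_neg))   # log odds
assert np.isclose(posterior, sigmoid(a))       # same number via sigma(a)
```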
(Figure: loss landscape with the global minimum labeled)
Credit: Fouhey
Option 1: Grid Search
Pros
• Super simple
• Only requires being able to evaluate the model
Cons
• Scales horribly to high-dimensional spaces
Complexity: samplesPerDim^numberOfDims evaluations (see the sketch below)
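A minimal grid-search sketch that makes the samplesPerDim^numberOfDims cost explicit; the helper `grid_search` and the toy bowl objective are mine:

```python
import itertools
import numpy as np

def grid_search(f, bounds, samples_per_dim):
    """Evaluate f on a regular grid over the box `bounds`; keep the best point.

    Cost: samples_per_dim ** len(bounds) evaluations, hence the terrible
    scaling with dimension.
    """
    axes = [np.linspace(lo, hi, samples_per_dim) for lo, hi in bounds]
    best_x, best_val = None, np.inf
    for point in itertools.product(*axes):
        val = f(np.array(point))
        if val < best_val:
            best_x, best_val = np.array(point), val
    return best_x, best_val

# Toy objective: quadratic bowl with minimum at (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
print(grid_search(f, bounds=[(-5.0, 5.0), (-5.0, 5.0)], samples_per_dim=21))
```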
Credit: Fouhey
What else?
Credit: Fouhey
Option 2: Random Search
Pros
• Super simple
• Only requires being able to sample random vectors and evaluate the model
Cons
• Slow and stupid: throwing darts at a high-dimensional dartboard
• Might miss something
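The same toy problem with random search instead; again a sketch, with names, seed, and sample count my own choices:

```python
import numpy as np

def random_search(f, bounds, n_samples, seed=0):
    """Throw darts: sample points uniformly in the box, keep the best one."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    best_x, best_val = None, np.inf
    for _ in range(n_samples):
        x = rng.uniform(bounds[:, 0], bounds[:, 1])
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
print(random_search(f, bounds=[(-5.0, 5.0), (-5.0, 5.0)], n_samples=500))
```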
Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze
Credit: Fouhey
Option 3: Use Gradients
(Figure: arrows show the gradient direction at each point, scaled to unit length)
Is f(x + η ∇f(x)) ≤ or > f(x)? For a small step η it is ≥ f(x): the gradient points uphill, so to minimize we step in the opposite direction.
Credit: Fouhey
Gradient Descent
Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
(Figure: plot of |x|)
Derivatives/gradients: defined everywhere but 0

d/dx |x| = sign(x)   for x ≠ 0
d/dx |x| undefined   at x = 0
Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
Use a subgradient: the slope of any line that underestimates the function at that point
(Figure: plot of |x|)
Subderivatives/subgradients: defined everywhere

d/dx |x| = sign(x)    for x ≠ 0
d/dx |x| ∈ [−1, 1]    at x = 0
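A sketch of subgradient descent on f(x) = |x|, picking 0 ∈ [−1, 1] at the kink; the 1/t step-size decay and iteration count are my own choices:

```python
import numpy as np

def subgradient_abs(x):
    """A valid subgradient of |x|: sign(x) for x != 0; at 0 any value in
    [-1, 1] works, and np.sign conveniently returns 0, which is in [-1, 1]."""
    return np.sign(x)

# Subgradient descent on f(x) = |x| with a 1/t step-size decay.
x = 3.0
for t in range(1, 101):
    x -= (1.0 / t) * subgradient_abs(x)
print(x)  # oscillates near, and ends close to, the minimizer 0
```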
Credit: Fouhey
Back to Logistic Regression Gradient Descent
• Goal: find w to minimize the loss L(w)
• Start with some initial estimate of w
• At each step, compute ∇L(w), the gradient of the loss w.r.t. w, and take a small step in the opposite direction:

w_{t+1} ← w_t − η ∇L(w_t)

Note: the step size η plays a crucial role (see the sketch below).
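A minimal end-to-end sketch of this loop for the logistic loss L(w) = mean_i log(1 + exp(−y_i w^T x_i)); the toy data, η = 0.5, and 200 iterations are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_loss_grad(w, X, y):
    """Gradient of L(w) = mean_i log(1 + exp(-y_i w^T x_i)), labels in {-1,+1}."""
    margins = y * (X @ w)
    return -(X.T @ (y * sigmoid(-margins))) / len(y)

# Toy linearly separable data (my own construction).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([2.0, -1.0]))

w, eta = np.zeros(2), 0.5   # eta: too big diverges, too small crawls
for _ in range(200):
    w = w - eta * logistic_loss_grad(w, X, y)
print(w)                    # should align roughly with the direction (2, -1)
```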