Download as pdf or txt
Download as pdf or txt
You are on page 1of 82

CMSC498L

Introduction to Deep Learning


Abhinav Shrivastava
Optimization Basics
Convex functions

source
Convex functions

source
Convex functions

source
Concave functions

source
What about this?

source
Minimize a function

! " = "$
Recall: Supervised Learning
Formulation:
• Given training data: {(x1,y1), …, (xN,yN)}, i.i.d from distribution D
• Find y = f(x) ∈ " using training data
• such that f is correct on test data i.i.d from distribution D

Credit: Lazebnik, Liang


Recall: Supervised Learning
Formulation:
• Given training data: {(x1,y1), …, (xN,yN)}, i.i.d from distribution D
• Find y = f(x) ∈ " using training data
• such that expected loss is small:
# $ = &((,*)∽- [/ $, 0, 1 ]

Credit: Lazebnik, Liang


Supervised Learning
Formulation:
• Given training data: {(x1,y1), …, (xN,yN)}, i.i.d from distribution D
• Find y = f(x) ∈ " using training data
• such that expected loss is small:
# $ = &((,*)∽- [/ $, 0, 1 ]

Credit: Lazebnik, Liang


Training Linear Classifiers
• Given: i.i.d. training data !" , $" , % = 1, … , ) ,
$" ∈ {−1,1}

Credit: Lazebnik, Liang


Training Linear Classifiers
• Given: i.i.d. training data !" , $" , % = 1, … , ) ,
$" ∈ −1,1
• Hypothesis class: ,- (!) = sgn(3 4 !)

• Classification with bias, i.e. ,- ! = sgn(3 4 ! + 6), can be reduced to the


case w/o bias by letting 3 4 = 3; 6 and ! 4 = [!; 1]

Credit: Lazebnik, Liang


Training Linear Classifiers
• Given: i.i.d. training data !" , $" , % = 1, … , ) ,
$" ∈ −1,1
• Hypothesis class: ,- (!) = sgn(3 4 !)

• Loss: minimizing the number of mistakes


:
1
65 ,- = 7 ;[sgn 3 4 !" ≠ $" ]
)
"89
• Difficult to optimize directly!

Credit: Lazebnik, Liang


Use Linear Regression
• Find !" ($) = ' ( $ that minimizes )* loss or mean squared error
3
1
,+ !" = /(' ( $0 − 50 )*
.
012
• Ignores the fact that 5 ∈ {−1,1} but is easy to optimize

Credit: Lazebnik, Liang


Linear Regression: Optimization
• Let ! be a matrix whose ith row is "#$ , % be the vector ('( , … , '+ )$
+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(

Credit: Lazebnik, Liang


Linear Regression: Optimization
• Let ! be a matrix whose ith row is "#$ , % be the vector ('( , … , '+ )$
+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• This is a convex function of the weights

Credit: Lazebnik, Liang


Linear Regression: Optimization
• Let ! be a matrix whose ith row is "#$ , % be the vector ('( , … , '+ )$
+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• Find the gradient w.r.t. 6:
∇0 !6 − % 88

Credit: Lazebnik, Liang


Linear Regression: Optimization
• Let ! be a matrix whose ith row is "#$ , % be the vector ('( , … , '+ )$
+
1
.- /0 = 4(6 $ "# − '# )8 = !6 − % 8
8
3
#5(
• Find the gradient w.r.t. 6:
∇0 !6 − % 88 = ∇0 !6 − % $
!6 − %
= ∇0 6 $ ! $ !6 − 26 $ ! $ % + % $ % = 2! $ !6 − 2! $ %
• Set gradient to zero to get the minimizer:
! $ !6 = ! $ %
6 = (! $ !)<( ! $ %
Credit: Lazebnik, Liang
Linear Regression: Optimization
• Linear algebra view
• If ! is invertible, simply solve !" = $ and
get " = ! %&$
• But typically ! is a “tall” matrix so you need to find a least squares solution to an
over-constrained system

Credit: Lazebnik, Liang


Linear Regression: Optimization
• MLE view
• Interpretation of !" loss: negative log likelihood assuming # is normally
distributed with mean $% & = ( ) & + +

y = wTx + b
(xi, yi)

Credit: Lazebnik, Liang


Maximum Likelihood Estimation (MLE)
• Given: i.i.d. training data !" , $" , % = 1, … , )
• Let *+ $ ! , , ∈ Θ be a family of distributions parameterized by ,
• Maximum (conditional) likelihood estimate:
,/01 = argmax+ 7 *+ ($" |!" )
"
= argmin+ − ∑" log *+ ($" |!" )

Credit: Lazebnik, Liang


Maximum Likelihood Estimation (MLE)
!"# = argmin+ − ∑. log 1+ (3. |5. )

• Assume 1+ 3 5 = Normal(3; 9+ 5 , ; < )


<
1 3 − 9+ 5
log 1+ 3. 5. = log exp − <
2?; < 2;
1 < 1
= − < 3 − 9+ 5 − log ; − log(2?)
2; 2
<
!"# = argmin+ C 3. − 9+ 5.
.
Credit: Lazebnik, Liang
Problem with Linear Regression

Credit: Lazebnik, Figure from PRML


Linear Classifier: Attempt 2
• Given: i.i.d. training data
!" , $" , % = 1, … , ) ,
$" ∈ {−1,1}
• Hypothesis ./ (!) = sgn(5 6 !)
• $ = +1 if 5 9 ! > 0
• $ = −1 if 5 9 ! < 0
• Find ./ (!) = sgn(5 6 !) that minimizes the number of mistakes
B
1
>= ./ = ? C[sgn 5 6 !" ≠ $" ]
)
"@A

Credit: Lazebnik, Liang


Attempt 2: Perceptron training algorithm

Perceptron: figure from the lecture note of Nina Balcan

Credit: Lazebnik, Liang


Attempt 2: Perceptron training algorithm
• Initialize weights randomly
• Cycle through training examples in multiple passes (epochs)
• For each training example ("# , %# ):
• If current prediction sgn(* + "# ) does not match %# then update weights:

*,-. ← *, + %# "#

Credit: Lazebnik, Liang


Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
(,-. ← (, + !" *"

If !" = 1, the response ( ) *" is initially negative and will be __________


)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )

If !" = −1, the response ( ) *" is initially positive and will be __________
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

Credit: Lazebnik, Liang


Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
(,-. ← (, + !" *"

If !" = 1, the response ( ) *" is initially negative and will be increased


)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )

If !" = −1, the response ( ) *" is initially positive and will be decreased
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

Credit: Lazebnik, Liang


Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
(,-. ← (, + !" *"

If !" = 1, the response ( ) *" is initially negative and will be increased


)
(,-. *" = (, + *" ) *" = (,) *" + *") *" = (,) *" + !" (*") *" )

If !" = −1, the response ( ) *" is initially positive and will be decreased
)
(,-. *" = (, − *" ) *" = (,) *" − *") *" = (,) *" + !" (*") *" )

Credit: Lazebnik, Liang


Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
(,-. ← (, + !" *"
• The raw response of the classifier changes to
(,) *" + !" *" 1

Credit: Lazebnik, Liang


Understanding the Perceptron update rule
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
( ← ( + !" *"
• The raw response of the classifier changes to
( ) *" + !" *" .

Credit: Lazebnik, Liang


Perceptron update rule with Learning Rate
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
( ← ( + .!" *"
• The raw response of the classifier changes to
( ) *" + .!" *" /

where . is a learning rate that should decay slowly* over time

Credit: Lazebnik, Liang


Perceptron update rule with Learning Rate
• Perceptron update rule: If !" ≠ sgn(( ) *" ) then update weights:
( ← ( + .!" *"
• The raw response of the classifier changes to
( ) *" + .!" *" /

where . is a learning rate that should decay slowly* over time

* will go over this in detail later.

Credit: Lazebnik, Liang


The Perceptron Theorem
• Intuitively, makes sense
• How do we know it will not get stuck on a bad local minimum?
• This was Rosenblatt’s seminal contribution:
• Theorem:
• Suppose there exists ! ∗ that correctly classifies $% , '%
• W.L.O.G., all $% and ! ∗ have length 1 ( $ = ! = 1 ), so the minimum distance of
any example to the decision boundary is:
* = min ! ∗ . $%
%
/ 1 / 1
• Then Perceptron makes at most 0
mistakes, and will converge in < 0 steps

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source


The Perceptron Theorem
• Intuitively, makes sense
• How do we know it will not get stuck on a bad local minimum?
• This was Rosenblatt’s seminal contribution:
• Theorem:
• Suppose there exists ! ∗ that correctly classifies $% , '%
• W.L.O.G., all $% and ! ∗ have length 1 ( $ = ! = 1 ), so the minimum distance of
any example to the decision boundary is:
* = min ! ∗ . $%
%
/ 1 / 1
• Then Perceptron makes at most 0
mistakes, and will converge in < 0
steps

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source


The Perceptron Theorem: Notes
• Not the only proof.
• This result was the start of learning theory
• For the first time there was a proof that a learning machine could actually
learn something!

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source


The Perceptron Theorem: Notes
• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simply logic problems that it cannot solve

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
Side Notes: Connectionism vs Symbolism

Symbolism: AI can be achieved by representing concepts as symbols


• Example: rule-based expert system, formal grammar

Connectionism: explain intellectual abilities using connections between


neurons (i.e., artificial neural networks)
• Example: perceptron, larger scale neural networks

Credit: Lazebnik, Liang


Symbolism Example

Example from Machine learning lecture notes by Tom Mitchell

Credit: Lazebnik, Liang


Connectionism Example

Credit: Lazebnik, Liang, PRML


Side Notes: Connectionism vs Symbolism

Formal theories of logical reasoning, grammar, and other higher mental


faculties compel us to think of the mind as a machine for rule-based
manipulation of highly structured arrays of symbols. What we know of the
brain compels us to think of human information processing in terms of
manipulation of a large unstructured set of numbers, the activity levels of
interconnected neurons.
---- The Central Paradox of Cognition (Smolensky et al., 1992)

Credit: Lazebnik, Liang


The Perceptron Theorem: Notes
• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simply logic problems that it cannot solve
• This killed off the enthusiasm until an old result by Kolmogorov saved the
day
• Universal Approximation Theorem
• 2-layer neural network = Universal function approximator

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
The Perceptron Theorem
• Linearly separable data: converges to a perfect solution
1
1
! = argmin* - 2[sgn ! 5 6. ≠ 8. ]

,
./0
• Is the solution unique?
• Non-separable data: converges to a minimum-error solution assuming:
a. Examples are presented in random sequence
b. The learning rate decays as : 1/< where < is the number of iterations and
examples are presented in random sequence

Credit: Lazebnik, Liang


Linear Classification: Recap
We want to learn a linear classifier that minimizes the number of mistakes
• Attempt 1: relax objective by dropping the sign function, using !" loss –
linear regression
• Easy to optimize, not really appropriate for binary classification, sensitive to outliers
• Attempt 2: perceptron update rule
• Gives good convergence guarantees, hard to extend to optimizing more complicated
models
• Attempt 3: relax objective by turning the sign function into a differentiable
“soft threshold” – logistic regression

Credit: Lazebnik, Liang


Recap: Sigmoid function
Squash the linear response of the classifier to the interval [0,1]:

(
1
& ' ) =
1 + exp(−' ( ))

Credit: Lazebnik, Liang


Recap: Sigmoid function
• Output of sigmoid can be interpreted as posterior label probability or
confidence returned by classifier:
)
1
!" # = 1 & = ' ( & =
1 + exp(−( ) &)
!" # = −1 & = 1 − ' ( ) &

12345 6" 7 8 61 345(6" 7 8) 1


= = =
12345 6" 7 8 12345(6" 7 8) 345 " 7 8 21

= ' −( ) &
• Sigmoid is symmetric: 1 − ' 9 = ' −9
Credit: Lazebnik, Liang
Sigmoid: Interpretation
We can model the posteriors !(#|%) in terms of class-conditional densities
! %# :

! %#=1 ! #=1
! #=1% =
!(%)

! %#=1 ! #=1
=
! % # = 1 ! # = 1 + ! % # = −1 !(# = −1)

1 !(# = 1|%)
= = /(.), . = log
1 + exp(−.) !(# = −1|%)
Credit: Lazebnik, Liang
Sigmoid: Interpretation
• Adopting a linear + sigmoid model is equivalent to assuming linear log odds:

$(& = 1|*)
log = -.* + 0
$(& = −1|*)

• This happens when $ * & = 1 and $ * & = −1 are Gaussians with


different means and the same covariance matrices and - is related to the
difference between the means

Credit: Lazebnik, Liang


Logistic Regression
• Given: !" , $" , % = 1, … , ) , $" ∈ {−1,1}
• Find . that minimizes
4
1
0/ . = − 1 log 89 $" !"
)
"23
3 3
= − 4 ∑4":<=23 log > . ? !" − 4 ∑4":<=2@3 log[1 − >(. ? !" )]
3 4
= − 4 ∑"23 log > $" . ? !"

• No closed-form solution, need to use gradient descent

Credit: Lazebnik, Liang


Gradient Descent
• Goal: find ! to minimize loss #" (!)
• Start with some initial estimate of !
• At each step, find ∇#" (!), the gradient of the loss w.r.t. !, and take a small
step in the opposite direction
!'() ← !' − , ∇#" (!)

Credit: Lazebnik, Liang


Gradient Descent
• Goal: find ! to minimize loss #" (!)
• Start with some initial estimate of !
• At each step, find ∇#" (!), the gradient of the loss w.r.t. !, and take a small
step in the opposite direction
!'() ← !' − , ∇#" !
Note: step size plays a crucial role

Credit: Lazebnik, Liang, Figure source


Why Gradient Descent?

Credit: Lazebnik, Liang, Fouhey


Why Gradient Descent?

Global minimum

Credit: Lazebnik, Liang, Fouhey


Why Gradient Descent?

• Each point in the picture is a function evaluation


• Some functions are easy/quick, some may take
optimize may take hours to evaluate
• Most functions we care about will take hours

Credit: Lazebnik, Liang, Fouhey


Why Gradient Descent?

Credit: Fouhey, Karpathy and Fei-Fei


Option 1: Grid Search
#systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
….
for dimNValue in dimNValues:
w = [dim1Value, …, dimNValue]
if L(w) < bestScore:
best, bestScore = w, L(w)
return best

Credit: Fouhey
Option 1: Grid Search
Pros
• Super simple
• Only requires being able to evaluate
model
Cons
• Scales horribly to high dimensional
spaces

Complexity: samplesPerDimnumberOfDims
Credit: Fouhey
What else?

Credit: Fouhey, Karpathy and Fei-Fei


Option 2: Random Search
#Do random stuff
best, bestScore = None, Inf
for iter in range(numIters):
w = random(N,1) #sample
score = ! " #evaluate
if score < bestScore:
best, bestScore = w, score
return best

Credit: Fouhey
Option 2: Random Search
Pros
• Super simple
• Only requires being able to sample
random vectors and evaluate model
Cons
• Slow and stupid –throwing darts at high
dimensional dart board
• Might miss something

Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze

Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze

Credit: Fouhey
Option 3: Use Gradients
Arrows:
gradient

Credit: Fouhey
Option 3: Use Gradients
Arrows:
gradient direction
(scaled to unit length)

Credit: Fouhey
Option 3: Use Gradients
Arrows:
gradient direction
(scaled to unit length)

Want: arg min ((')


'
Which is bigger (for small α)?

≤?
( ' ( ' + ,∇' ((')
>?
Credit: Fouhey
Option 3: Use Gradients
Arrows:
gradient direction
(scaled to unit length)

Want: arg min ((')


'
1
Which is bigger (for small α)?
1 + ,2
≤?
( ' ( ' + ,∇' ((')
>?
Credit: Fouhey
Option 3: Use Gradients
• w0 = initialize() #initialize
• for iter in range(numIters):
• g = ∇" # " #eval gradient
• w = w + -stepsize(iter)*g #update w
• return w

Credit: Fouhey
Gradient Descent

Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
|x|
Derivatives/Gradients
Defined everywhere but 0
!
# " = sign(") " ≠0
!"

undefined "=0

Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of |x|?
Use Subgradient: any underestimate of function
|x|
Subderivatives/subgradients
Defined everywhere
!
# " = sign(") " ≠0
!"
!
# " ∈ [−1,1] "=0
!"

In practice: at discontinuity, pick value on either side.


Credit: Fouhey
Gradient Descent: Note on Learning Rate (LR)
1x10-2 10x10-2 12x10-2
falls short converges diverges

Credit: Fouhey
Back to Logistic Regression Gradient Descent
• Goal: find ! to minimize loss #" (!)
• Start with some initial estimate of !
• At each step, find ∇#" (!), the gradient of the loss w.r.t. !, and take a small
step in the opposite direction
!'() ← !' − , ∇#" !
Note: step size plays a crucial role

Credit: Lazebnik, Liang, Figure source


Gradient Descent for Logistic Regression
,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
Derivative of sigmoid:
0 6 7 = 0 7 1 − 0(7) = 0 7 0 −7

Credit: Lazebnik, Liang


Gradient Descent for Logistic Regression
,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
,
1 0 1) # 2 3) 0 −1) # 2 3) 1) 3)
=− (
' 0 1) # 2 3)
)*+
8
Credit: Lazebnik, Liang We also used the chain rule: 6 7 3 = 68 7 3 7′(3)
Gradient Descent for Logistic Regression
,
1
!" # = − ( log 0 1) # 2 3)
'
)*+
,
1
∇"! # = − ( ∇5 log 0 1) # 2 3)
'
)*+
,
1 ∇5 0 1) # 2 3)
=− (
' 0 1) # 2 3)
)*+
, ,
1 0 1) # 2 3) 0 −1) # 2 3) 1) 3) 1
=− ( = − ( 0 −1) # 2 3) 1) 3)
' 0 1) # 2 3) '
)*+ )*+

Credit: Lazebnik, Liang


Gradient Descent for Logistic Regression
-
1
∇#" $ = − ) . −/* $ 0 1* /* 1*
(
*+,
Update for $:
-
1
$ ← $ + 4 ) . −/* $ 0 1* /* 1*
(
*+,
• For a single parameter update, need to cycle through the entire training set!
• This is also called a batch update

Credit: Lazebnik, Liang


Stochastic Gradient Descent
At each iteration, take a single data point !" , $" and perform a parameter
update using ∇& ', !" , $" , the gradient of the loss for that point

' ← ' − * ∇& ', !" , $"

This is called an online or stochastic update

Credit: Lazebnik, Liang


(Stochastic) Gradient Descent
,
1
"! # = − ( log 0 1) # 2 3)
'
)*+
GD update for #:
,
1
# ← # + 6 ( 0 −1) # 2 3) 1) 3)
'
)*+
Loss for a single sample:
7 #, 3) , 1) = − log 0 1) # 2 3)
SGD update for #:
# ← # + 6 0 −1) # 2 3) 1) 3)
Credit: Lazebnik, Liang
Stochastic Gradient Descent
SGD update for logistic regression:
! ← ! + $ % −'( ! ) *( '( *(

SGD update for perceptron:


! ← ! + $ +['( ! ) *( < 0]'( *(

Credit: Lazebnik, Liang


Stochastic Gradient Descent
SGD update for logistic regression:
! ← ! + $ % −'( ! ) *( '( *(

SGD update for perceptron:


! ← ! + $ +['( ! ) *( < 0]'( *(

SGD update for linear regression


! ← ! − $ (! ) *( − '( )*(

Credit: Lazebnik, Liang


Next to next class
• Gradient Descent in practice
• Impact of LR in detail
• Momentum in Gradient Descent
• Regularization

You might also like