Introduction to Deep Learning

Abhinav Shrivastava
Optimization Basics
Convex functions

Concave functions

What about this?

Minimize a function

$f(x) = x^2$
Recall: Supervised Learning
• Given training data $\{(x_1, y_1), \dots, (x_N, y_N)\}$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using training data
• such that $f$ is correct on test data drawn i.i.d. from distribution $D$

Credit: Lazebnik, Liang

Recall: Supervised Learning
• Given training data $\{(x_1, y_1), \dots, (x_N, y_N)\}$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using training data
• such that the expected loss is small:
$$L(f) = \mathbb{E}_{(x,y) \sim D}\,[\,l(f, x, y)\,]$$

Credit: Lazebnik, Liang

Training Linear Classifiers
• Given: i.i.d. training data $\{(x_i, y_i),\ i = 1, \dots, n\}$, $y_i \in \{-1, 1\}$
• Hypothesis class: $f_w(x) = \mathrm{sgn}(w^T x)$
• Classification with bias, i.e. $f_w(x) = \mathrm{sgn}(w^T x + b)$, can be reduced to the case w/o bias by letting $w' = [w; b]$ and $x' = [x; 1]$

Credit: Lazebnik, Liang

Training Linear Classifiers
• Given: i.i.d. training data $\{(x_i, y_i),\ i = 1, \dots, n\}$, $y_i \in \{-1, 1\}$
• Hypothesis class: $f_w(x) = \mathrm{sgn}(w^T x)$

• Loss: minimizing the number of mistakes

$$\hat{L}(f_w) = \sum_{i=1}^{n} \mathbb{I}[\mathrm{sgn}(w^T x_i) \neq y_i]$$
• Difficult to optimize directly!

Credit: Lazebnik, Liang

Use Linear Regression
• Find $f_w(x) = w^T x$ that minimizes $\ell_2$ loss or mean squared error
$$\hat{L}(f_w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2$$
• Ignores the fact that $y \in \{-1, 1\}$ but is easy to optimize

Credit: Lazebnik, Liang

Linear Regression: Optimization
• Let $X$ be a matrix whose $i$th row is $x_i^T$, $y$ be the vector $(y_1, \dots, y_n)^T$
$$\hat{L}(f_w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2 = \|Xw - y\|_2^2$$
• This is a convex function of the weights

Credit: Lazebnik, Liang

Linear Regression: Optimization
• Let $X$ be a matrix whose $i$th row is $x_i^T$, $y$ be the vector $(y_1, \dots, y_n)^T$
$$\hat{L}(f_w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2 = \|Xw - y\|_2^2$$
• Find the gradient w.r.t. $w$:
$$\nabla_w \|Xw - y\|_2^2 = \nabla_w \left[ (Xw - y)^T (Xw - y) \right]$$
$$= \nabla_w \left[ w^T X^T X w - 2 w^T X^T y + y^T y \right] = 2 X^T X w - 2 X^T y$$
• Set gradient to zero to get the minimizer:
$$X^T X w = X^T y$$
$$w = (X^T X)^{-1} X^T y$$
Credit: Lazebnik, Liang
Linear Regression: Optimization
• Linear algebra view
• If $X$ is invertible, simply solve $Xw = y$ and get $w = X^{-1} y$
• But typically $X$ is a “tall” matrix, so you need to find a least squares solution to an over-constrained system

Credit: Lazebnik, Liang
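
As a rough illustration of the closed-form solution, the sketch below solves the normal equations $X^T X w = X^T y$ with NumPy on a tiny made-up dataset and compares the result with NumPy's least-squares solver. The data and the names `w_true`, `w_normal`, and `w_lstsq` are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

# Hypothetical toy data: 4 examples, 2 features plus a constant 1 (bias trick).
X = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 1.0],
              [3.0, 2.0, 1.0]])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true  # noiseless targets, so least squares should recover w_true

# Solve the normal equations X^T X w = X^T y without forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver for the over-constrained ("tall") system Xw = y.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T X w - 2 X^T y should vanish at the minimizer.
grad = 2 * X.T @ X @ w_normal - 2 * X.T @ y
```

Solving the linear system rather than computing $(X^T X)^{-1}$ explicitly is a common design choice, since it tends to be cheaper and numerically better behaved.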

Linear Regression: Optimization
• MLE view
• Interpretation of $\ell_2$ loss: negative log likelihood assuming $y$ is normally distributed with mean $f_w(x) = w^T x + b$

[Figure: the line $y = w^T x + b$ and data points $(x_i, y_i)$]

Credit: Lazebnik, Liang

Maximum Likelihood Estimation (MLE)
• Given: i.i.d. training data $\{(x_i, y_i),\ i = 1, \dots, n\}$
• Let $\{P_\theta(y \mid x),\ \theta \in \Theta\}$ be a family of distributions parameterized by $\theta$
• Maximum (conditional) likelihood estimate:
$$\theta_{ML} = \arg\max_\theta \prod_i P_\theta(y_i \mid x_i) = \arg\min_\theta \; -\sum_i \log P_\theta(y_i \mid x_i)$$

Credit: Lazebnik, Liang

Maximum Likelihood Estimation (MLE)
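$$\theta_{ML} = \arg\min_\theta \; -\sum_i \log P_\theta(y_i \mid x_i)$$

• Assume $P_\theta(y \mid x) = \mathrm{Normal}(y;\ f_\theta(x),\ \sigma^2)$

$$\log P_\theta(y_i \mid x_i) = \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right]$$
$$= -\frac{1}{2\sigma^2} (y_i - f_\theta(x_i))^2 - \log \sigma - \frac{1}{2} \log(2\pi)$$
$$\theta_{ML} = \arg\min_\theta \sum_i (y_i - f_\theta(x_i))^2$$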
Credit: Lazebnik, Liang
Problem with Linear Regression

Credit: Lazebnik, Figure from PRML

Linear Classifier: Attempt 2
• Given: i.i.d. training data $\{(x_i, y_i),\ i = 1, \dots, n\}$, $y_i \in \{-1, 1\}$
• Hypothesis $f_w(x) = \mathrm{sgn}(w^T x)$
• $y = +1$ if $w^T x > 0$
• $y = -1$ if $w^T x < 0$
• Find $f_w(x) = \mathrm{sgn}(w^T x)$ that minimizes the number of mistakes
$$\hat{L}(f_w) = \sum_{i=1}^{n} \mathbb{I}[\mathrm{sgn}(w^T x_i) \neq y_i]$$

Credit: Lazebnik, Liang

Attempt 2: Perceptron training algorithm

Perceptron: figure from the lecture note of Nina Balcan

Credit: Lazebnik, Liang

Attempt 2: Perceptron training algorithm
• Initialize weights randomly
• Cycle through training examples in multiple passes (epochs)
• For each training example $(x_i, y_i)$:
• If current prediction $\mathrm{sgn}(w^T x_i)$ does not match $y_i$ then update weights:

$$w_{t+1} \leftarrow w_t + y_i x_i$$

Credit: Lazebnik, Liang
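
The training loop described on this slide can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the lecture's reference implementation: the function names `sgn` and `train_perceptron`, the convention that `sgn(0)` is treated as -1, the early stop after a mistake-free pass, and the toy data are all hypothetical.

```python
import random

def sgn(z):
    # Sign convention assumed here: sgn(0) is treated as -1.
    return 1 if z > 0 else -1

def train_perceptron(data, num_epochs=100, seed=0):
    """data: list of (x, y) pairs, x a list of floats, y in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # random initialization
    for _ in range(num_epochs):                       # multiple passes (epochs)
        mistakes = 0
        for x, y in data:
            response = sum(wj * xj for wj, xj in zip(w, x))
            if sgn(response) != y:
                # Perceptron update: w <- w + y * x
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop early
            break
    return w

# Hypothetical linearly separable toy set; the last feature is a constant 1 (bias trick).
data = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w = train_perceptron(data)
predictions = [sgn(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in data]
```

On separable data like this toy set, the loop stops once every example is classified correctly, so `predictions` matches the labels.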

Understanding the Perceptron update rule
• Perceptron update rule: If $y_i \neq \mathrm{sgn}(w^T x_i)$ then update weights:
$$w_{t+1} \leftarrow w_t + y_i x_i$$

If $y_i = 1$, the response $w^T x_i$ is initially negative and will be increased:

$$w_{t+1}^T x_i = (w_t + x_i)^T x_i = w_t^T x_i + x_i^T x_i = w_t^T x_i + y_i (x_i^T x_i)$$

If $y_i = -1$, the response $w^T x_i$ is initially positive and will be decreased:
$$w_{t+1}^T x_i = (w_t - x_i)^T x_i = w_t^T x_i - x_i^T x_i = w_t^T x_i + y_i (x_i^T x_i)$$

Credit: Lazebnik, Liang

Understanding the Perceptron update rule
• Perceptron update rule: If $y_i \neq \mathrm{sgn}(w^T x_i)$ then update weights:
$$w_{t+1} \leftarrow w_t + y_i x_i$$
• The raw response of the classifier changes to
$$w_t^T x_i + y_i \|x_i\|^2$$

Credit: Lazebnik, Liang

Perceptron update rule with Learning Rate
• Perceptron update rule: If $y_i \neq \mathrm{sgn}(w^T x_i)$ then update weights:
$$w \leftarrow w + \eta\, y_i x_i$$
• The raw response of the classifier changes to
$$w^T x_i + \eta\, y_i \|x_i\|^2$$

where $\eta$ is a learning rate that should decay slowly* over time

* We will go over this in detail later.

Credit: Lazebnik, Liang

The Perceptron Theorem
• Intuitively, the update rule makes sense
• How do we know it will not get stuck on a bad local minimum?
• This was Rosenblatt’s seminal contribution:
• Theorem:
• Suppose there exists $w^*$ that correctly classifies $\{(x_i, y_i)\}$
• W.L.O.G., all $x_i$ and $w^*$ have length 1 ($\|x_i\| = \|w^*\| = 1$), so the minimum distance of any example to the decision boundary is:
$$\gamma = \min_i |{w^*}^T x_i|$$
• Then Perceptron makes at most $(1/\gamma)^2$ mistakes, and will converge in at most $(1/\gamma)^2$ steps

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

The Perceptron Theorem: Notes
• Not the only proof.
• This result was the start of learning theory
• For the first time there was a proof that a learning machine could actually
learn something!

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source

The Perceptron Theorem: Notes
• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simple logic problems that it cannot solve

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
Side Notes: Connectionism vs Symbolism

Symbolism: AI can be achieved by representing concepts as symbols

• Example: rule-based expert system, formal grammar

Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
• Example: perceptron, larger scale neural networks

Credit: Lazebnik, Liang

Symbolism Example

Example from Machine learning lecture notes by Tom Mitchell

Credit: Lazebnik, Liang

Connectionism Example

Credit: Lazebnik, Liang, PRML

Side Notes: Connectionism vs Symbolism

Formal theories of logical reasoning, grammar, and other higher mental
faculties compel us to think of the mind as a machine for rule-based
manipulation of highly structured arrays of symbols. What we know of the
brain compels us to think of human information processing in terms of
manipulation of a large unstructured set of numbers, the activity levels of
interconnected neurons.
-- The Central Paradox of Cognition (Smolensky et al., 1992)

Credit: Lazebnik, Liang

The Perceptron Theorem: Notes
• Rosenblatt’s result generated a lot of excitement about learning in the 50s
• Later, Minsky and Papert identified serious problems with the Perceptron
• There are very simple logic problems that it cannot solve
• This killed off the enthusiasm until an old result by Kolmogorov saved the day
• Universal Approximation Theorem
• 2-layer neural network = Universal function approximator

Credit: Lazebnik, Liang, source from Nuno Vasconcelos, another source, Figure from PRML
The Perceptron Theorem
• Linearly separable data: converges to a perfect solution
$$w = \arg\min_w \sum_i \mathbb{I}[\mathrm{sgn}(w^T x_i) \neq y_i]$$

• Is the solution unique?
• Non-separable data: converges to a minimum-error solution assuming:
a. Examples are presented in random sequence
b. The learning rate decays as $O(1/t)$, where $t$ is the number of iterations

Credit: Lazebnik, Liang

Linear Classification: Recap
We want to learn a linear classifier that minimizes the number of mistakes
• Attempt 1: relax objective by dropping the sign function, using $\ell_2$ loss: linear regression
• Easy to optimize, not really appropriate for binary classification, sensitive to outliers
• Attempt 2: perceptron update rule
• Gives good convergence guarantees, hard to extend to optimizing more complicated models
• Attempt 3: relax objective by turning the sign function into a differentiable “soft threshold”: logistic regression

Credit: Lazebnik, Liang

Recap: Sigmoid function
Squash the linear response of the classifier to the interval [0,1]:

$$\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

Credit: Lazebnik, Liang

Recap: Sigmoid function
• Output of sigmoid can be interpreted as posterior label probability or
confidence returned by classifier:
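$$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$
$$P_w(y = -1 \mid x) = 1 - \sigma(w^T x) = \frac{1 + \exp(-w^T x) - 1}{1 + \exp(-w^T x)} = \frac{\exp(-w^T x)}{1 + \exp(-w^T x)} = \frac{1}{\exp(w^T x) + 1} = \sigma(-w^T x)$$
• Sigmoid is symmetric: $1 - \sigma(t) = \sigma(-t)$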
Credit: Lazebnik, Liang
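
To make the symmetry concrete, here is a small sketch that evaluates the sigmoid at a hypothetical response and checks that the two posteriors sum to one. The branch on the sign of `t` is a common numerical-stability trick and an addition of this sketch, not something stated on the slide; the weights and input are made up.

```python
import math

def sigmoid(t):
    # Numerically stable form: never calls exp on a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

w = [0.5, -1.0]   # hypothetical weights
x = [4.0, 1.0]    # hypothetical input
response = sum(wj * xj for wj, xj in zip(w, x))  # w^T x = 1.0

p_pos = sigmoid(response)    # P(y = +1 | x) = sigma(w^T x)
p_neg = sigmoid(-response)   # P(y = -1 | x) = sigma(-w^T x) = 1 - sigma(w^T x)
```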
Sigmoid: Interpretation
We can model the posteriors $P(y \mid x)$ in terms of class-conditional densities $P(x \mid y)$:

$$P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x \mid y = 1)\, P(y = 1) + P(x \mid y = -1)\, P(y = -1)}$$

$$= \frac{1}{1 + \exp(-a)} = \sigma(a), \qquad a = \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}$$
Credit: Lazebnik, Liang
Sigmoid: Interpretation
• Adopting a linear + sigmoid model is equivalent to assuming linear log odds:

$$\log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = w^T x + b$$

• This happens when $P(x \mid y = 1)$ and $P(x \mid y = -1)$ are Gaussians with different means and the same covariance matrices, and $w$ is related to the difference between the means

Credit: Lazebnik, Liang

Logistic Regression
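• Given: $\{(x_i, y_i),\ i = 1, \dots, n\}$, $y_i \in \{-1, 1\}$
• Find $w$ that minimizes
$$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i \mid x_i)$$
$$= -\frac{1}{n} \sum_{i:\, y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i:\, y_i = -1} \log[1 - \sigma(w^T x_i)]$$
$$= -\frac{1}{n} \sum_{i=1}^{n} \log \sigma(y_i w^T x_i)$$

• No closed-form solution, need to use gradient descent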

Credit: Lazebnik, Liang

Gradient Descent
• Goal: find $w$ to minimize loss $\hat{L}(w)$
• Start with some initial estimate of $w$
• At each step, find $\nabla \hat{L}(w)$, the gradient of the loss w.r.t. $w$, and take a small step in the opposite direction
$$w_{t+1} \leftarrow w_t - \eta\, \nabla \hat{L}(w_t)$$
Note: step size plays a crucial role

Credit: Lazebnik, Liang, Figure source

Why Gradient Descent?

Credit: Lazebnik, Liang, Fouhey

Why Gradient Descent?

Global minimum

Credit: Lazebnik, Liang, Fouhey

Why Gradient Descent?

• Each point in the picture is a function evaluation

• Some functions are easy/quick to evaluate; the functions we want to optimize may take hours to evaluate
• Most functions we care about will take hours

Credit: Lazebnik, Liang, Fouhey

Why Gradient Descent?

Credit: Fouhey, Karpathy and Fei-Fei

Option 1: Grid Search
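# Systematically try every combination of candidate values
import itertools

def grid_search(L, dimValues):
    # dimValues: one list of candidate values per dimension
    best, bestScore = None, float("inf")
    for w in itertools.product(*dimValues):
        score = L(w)  # evaluate the loss once per grid point
        if score < bestScore:
            best, bestScore = w, score
    return best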

Credit: Fouhey
Option 1: Grid Search
• Super simple
• Only requires being able to evaluate the model
• Scales horribly to high-dimensional spaces

Complexity: samplesPerDim^numberOfDims
Credit: Fouhey
What else?

Credit: Fouhey, Karpathy and Fei-Fei

Option 2: Random Search
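# Try random points and keep the best one
import random

def random_search(L, N, numIters):
    best, bestScore = None, float("inf")
    for it in range(numIters):
        # sample; the range [0, 1) per dimension is an assumed choice
        w = [random.random() for _ in range(N)]
        score = L(w)  # evaluate
        if score < bestScore:
            best, bestScore = w, score
    return best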

Credit: Fouhey
Option 2: Random Search
• Super simple
• Only requires being able to sample
random vectors and evaluate model
• Slow and stupid: throwing darts at a high-dimensional dart board
• Might miss something

Credit: Fouhey
When can we use Option 1 or 2?
• Number of dimensions small, space bounded
• Objective is impossible to analyze

Credit: Fouhey
Option 3: Use Gradients
[Figure: gradient direction (scaled to unit length)]

Want: $\arg\min_w L(w)$

Which is bigger (for small $\alpha$)?

$L(w)$ or $L(w + \alpha \nabla_w L(w))$
Credit: Fouhey
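
One way to answer the question, sketched under the assumption that $L$ is differentiable at $w$ and $\alpha > 0$ is small enough for a first-order Taylor expansion to be accurate:

```latex
L\big(w + \alpha \nabla_w L(w)\big)
  \approx L(w) + \alpha \, \nabla_w L(w)^T \nabla_w L(w)
  = L(w) + \alpha \, \|\nabla_w L(w)\|^2
  \ge L(w)
```

So stepping along the gradient increases $L$ (unless the gradient is zero), and the same expansion with $-\alpha$ gives $L(w - \alpha \nabla_w L(w)) \approx L(w) - \alpha \|\nabla_w L(w)\|^2 \le L(w)$, which is why minimization steps against the gradient.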
Option 3: Use Gradients
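def gradient_descent(grad_L, initialize, stepsize, numIters):
    w = initialize()                  # initial estimate of w
    for it in range(numIters):
        g = grad_L(w)                 # evaluate the gradient of L at w
        w = w - stepsize(it) * g      # step against the gradient
    return w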

Credit: Fouhey
Gradient Descent

Credit: Fouhey
Gradient Descent: Note on Subgradients
What is the derivative of $|x|$?
Defined everywhere but 0:
$$f'(x) = \begin{cases} \mathrm{sign}(x) & x \neq 0 \\ \text{undefined} & x = 0 \end{cases}$$

Credit: Fouhey
Gradient Descent: Note on Subgradients
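What is the derivative of $|x|$?
Use a subgradient: the slope of any line through the point that underestimates the function
Defined everywhere:
$$f'(x) = \mathrm{sign}(x) \quad (x \neq 0), \qquad f'(x) \in [-1, 1] \quad (x = 0)$$

In practice: at a kink (where the derivative is discontinuous), pick the value from either side.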

Credit: Fouhey
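
A minimal sketch of descent with a subgradient on $f(x) = |x|$ follows. The choice of 0 at the kink (any value in $[-1, 1]$ would be valid), the starting point, and the $1/t$ step schedule are assumptions made for illustration.

```python
def subgradient_abs(x):
    # sign(x) away from 0; at the kink any value in [-1, 1] is a valid
    # subgradient, and this sketch picks 0.
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0

x = 3.0  # hypothetical starting point
for t in range(1, 201):
    step = 1.0 / t  # decaying step size (an assumed schedule)
    x = x - step * subgradient_abs(x)
```

With a fixed step the iterate would keep jumping back and forth across 0; the decaying step shrinks those jumps, so `x` ends up close to the minimizer.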
Gradient Descent: Note on Learning Rate (LR)
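[Figure: three runs with different learning rates]
• LR $1 \times 10^{-2}$: falls short
• LR $10 \times 10^{-2}$: converges
• LR $12 \times 10^{-2}$: diverges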

Credit: Fouhey
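
The three regimes can be reproduced on a toy problem. The sketch below runs gradient descent on $f(x) = x^2$; the function, the starting point, and the three learning rates are hypothetical and are not the ones used in the slide's figure.

```python
def run_gd(lr, x0=1.0, num_steps=20):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(num_steps):
        x = x - lr * 2.0 * x  # each step multiplies x by (1 - 2*lr)
    return x

x_small = run_gd(0.01)  # too small: still far from the minimum at 0 after 20 steps
x_good = run_gd(0.4)    # moderate: very close to 0
x_large = run_gd(1.1)   # too large: |x| grows at every step
```

For this particular function each step scales `x` by `1 - 2*lr`, so the iterates shrink only when that factor has magnitude below 1.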
Gradient Descent for Logistic Regression
$$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log \sigma(y_i w^T x_i)$$
$$\nabla \hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \nabla_w \log \sigma(y_i w^T x_i)$$
$$= -\frac{1}{n} \sum_{i=1}^{n} \frac{\nabla_w\, \sigma(y_i w^T x_i)}{\sigma(y_i w^T x_i)}$$
$$= -\frac{1}{n} \sum_{i=1}^{n} \frac{\sigma(y_i w^T x_i)\, \sigma(-y_i w^T x_i)\, y_i x_i}{\sigma(y_i w^T x_i)}$$
$$= -\frac{1}{n} \sum_{i=1}^{n} \sigma(-y_i w^T x_i)\, y_i x_i$$
Derivative of sigmoid:
$$\sigma'(a) = \sigma(a)\,(1 - \sigma(a)) = \sigma(a)\, \sigma(-a)$$
We also used the chain rule: $[f(g(x))]' = f'(g(x))\, g'(x)$

Credit: Lazebnik, Liang

Gradient Descent for Logistic Regression
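$$\nabla \hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \sigma(-y_i w^T x_i)\, y_i x_i$$
Update for $w$:
$$w \leftarrow w + \eta\, \frac{1}{n} \sum_{i=1}^{n} \sigma(-y_i w^T x_i)\, y_i x_i$$
• For a single parameter update, need to cycle through the entire training set!
• This is also called a batch update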

Credit: Lazebnik, Liang
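
The batch update can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the lecture's implementation: the helper names, the learning rate, the number of steps, and the toy data (chosen to be non-separable so the weights stay bounded) are all hypothetical, and the loss is averaged over the $n$ examples.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def loss(w, data):
    # L(w) = -(1/n) * sum_i log sigma(y_i * w^T x_i)
    return -sum(math.log(sigmoid(y * dot(w, x))) for x, y in data) / len(data)

def batch_gd(data, lr=0.5, num_steps=200):
    w = [0.0] * len(data[0][0])
    for _ in range(num_steps):
        # Full-batch gradient: -(1/n) * sum_i sigma(-y_i w^T x_i) * y_i * x_i
        grad = [0.0] * len(w)
        for x, y in data:  # every update touches the whole training set
            coeff = -sigmoid(-y * dot(w, x)) * y / len(data)
            grad = [gj + coeff * xj for gj, xj in zip(grad, x)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data with a constant 1 appended for the bias.
data = [([2.0, 1.0], 1), ([1.0, 1.0], 1), ([-1.0, 1.0], -1),
        ([-2.0, 1.0], -1), ([0.5, 1.0], -1), ([-0.5, 1.0], 1)]
w0 = [0.0, 0.0]
w = batch_gd(data)
```

Comparing `loss(w, data)` with `loss(w0, data)` shows the loss dropping from its starting value of $\log 2$.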

Stochastic Gradient Descent
At each iteration, take a single data point $(x_i, y_i)$ and perform a parameter update using $\nabla l(w, x_i, y_i)$, the gradient of the loss for that point:

$$w \leftarrow w - \eta\, \nabla l(w, x_i, y_i)$$

This is called an online or stochastic update

Credit: Lazebnik, Liang

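(Stochastic) Gradient Descent
$$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log \sigma(y_i w^T x_i)$$
GD update for $w$:
$$w \leftarrow w + \eta\, \frac{1}{n} \sum_{i=1}^{n} \sigma(-y_i w^T x_i)\, y_i x_i$$
Loss for a single sample:
$$l(w, x_i, y_i) = -\log \sigma(y_i w^T x_i)$$
SGD update for $w$:
$$w \leftarrow w + \eta\, \sigma(-y_i w^T x_i)\, y_i x_i$$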
Credit: Lazebnik, Liang
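
A minimal sketch of the stochastic variant for logistic regression, in which each update uses one randomly chosen example. The function name `sgd_logistic`, the fixed learning rate, the iteration count, the seed, and the toy data are assumptions for illustration.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sgd_logistic(data, lr=0.1, num_iters=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(num_iters):
        x, y = rng.choice(data)  # a single randomly chosen data point
        response = sum(wj * xj for wj, xj in zip(w, x))
        coeff = lr * sigmoid(-y * response) * y
        # SGD update: w <- w + eta * sigma(-y w^T x) * y * x
        w = [wj + coeff * xj for wj, xj in zip(w, x)]
    return w

# Hypothetical linearly separable toy data.
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = sgd_logistic(data)
margins = [y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in data]
```

Each update costs one example instead of a full pass over the training set, at the price of a noisier descent direction.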
Stochastic Gradient Descent
SGD update for logistic regression:
$$w \leftarrow w + \eta\, \sigma(-y_i w^T x_i)\, y_i x_i$$

SGD update for perceptron:

$$w \leftarrow w + \eta\, \mathbb{I}[y_i w^T x_i < 0]\, y_i x_i$$

SGD update for linear regression:

$$w \leftarrow w - \eta\, (w^T x_i - y_i)\, x_i$$

Credit: Lazebnik, Liang
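
To compare the three rules side by side, the sketch below applies each of them once to the same example. The function names and the example values are hypothetical; the example is chosen so that the current weights misclassify it.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def step_logistic(w, x, y, lr):
    # w <- w + eta * sigma(-y w^T x) * y * x
    c = lr * sigmoid(-y * dot(w, x)) * y
    return [wj + c * xj for wj, xj in zip(w, x)]

def step_perceptron(w, x, y, lr):
    # w <- w + eta * I[y w^T x < 0] * y * x
    c = lr * y if y * dot(w, x) < 0 else 0.0
    return [wj + c * xj for wj, xj in zip(w, x)]

def step_linear(w, x, y, lr):
    # w <- w - eta * (w^T x - y) * x
    c = lr * (dot(w, x) - y)
    return [wj - c * xj for wj, xj in zip(w, x)]

# Hypothetical example: w^T x = -1 while y = +1, so it is misclassified.
w, x, y, lr = [-1.0, 0.0], [1.0, 2.0], 1, 0.1
w_log = step_logistic(w, x, y, lr)    # soft weight sigma(1) on the correction
w_per = step_perceptron(w, x, y, lr)  # full correction, only on mistakes
w_lin = step_linear(w, x, y, lr)      # correction proportional to the residual
```

In all three cases the change to `w` is the input `x` scaled by an error term; the rules differ only in how that error term is computed.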

Next to next class
• Gradient Descent in practice
• Impact of LR in detail
• Momentum in Gradient Descent
• Regularization

