Stochastic Gradient Descent
Ryan Tibshirani
Convex Optimization 10-725
Last time: proximal gradient descent
Consider the problem
\[
\min_x \; g(x) + h(x)
\]
with $g$ convex and differentiable, and $h$ convex and "simple"; proximal gradient descent repeats
\[
x^{(k)} = \mathrm{prox}_{t_k h}\big(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots
\]
Outline
Today:
• Stochastic gradient descent
• Convergence rates
• Mini-batches
• Early stopping
Stochastic gradient descent
Consider minimizing an average of functions:
\[
\min_x \; \frac{1}{m} \sum_{i=1}^m f_i(x)
\]
As $\nabla \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat:
\[
x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{m}\sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots
\]
In comparison, stochastic gradient descent (SGD) repeats:
\[
x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \ldots
\]
where $i_k \in \{1, \ldots, m\}$ is an index chosen at iteration $k$.
Two rules for choosing index $i_k$ at iteration $k$:
• Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random
• Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
Randomized rule is more common in practice. For the randomized rule, note that
\[
\mathbb{E}[\nabla f_{i_k}(x)] = \nabla f(x),
\]
so we can view SGD as using an unbiased estimate of the gradient at each step
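To make the update concrete, here is a minimal Python sketch of SGD under the randomized rule (not from the slides; the per-component gradient callable `grad_fi` and the $1/k$ schedule used here are illustrative choices):

```python
import numpy as np

def sgd(grad_fi, m, x0, n_iters=1000, seed=0):
    """SGD with the randomized rule and diminishing step sizes t_k = 1/k.

    grad_fi(i, x) should return the gradient of f_i at x (0-indexed i).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, n_iters + 1):
        i = rng.integers(m)                   # randomized rule: i_k uniform on {0, ..., m-1}
        x = x - (1.0 / k) * grad_fi(i, x)     # step using a single component gradient
        # (the cyclic rule would instead use i = (k - 1) % m)
    return x

# Example use, for least squares f_i(x) = 0.5 * (a_i^T x - b_i)^2 with data A, b:
#   x_hat = sgd(lambda i, x: (A[i] @ x - b[i]) * A[i], m=len(b), x0=np.zeros(A.shape[1]))
```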
Example: stochastic logistic regression

Given data $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$, recall the logistic regression criterion:
\[
\min_\beta \; f(\beta) = \frac{1}{n} \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + e^{x_i^T \beta}\big) \Big)
\]
Each $f_i(\beta)$ depends on a single observation, so a stochastic gradient step touches one data point at a time.
Small example with n = 10, p = 2 to show the “classic picture” for
batch versus stochastic methods:
[Figure: batch versus stochastic iterates. Blue: batch steps, O(np); Red: stochastic steps, O(p)]

Stochastic methods:
• generally thrive far from optimum
• generally struggle close to optimum
Step sizes
Standard in SGD is to use diminishing step sizes, e.g., $t_k = 1/k$

Why not fixed step sizes? Here is some intuition. Suppose we take the cyclic rule for simplicity. Setting $t_k = t$ for $m$ updates in a row, we get:
\[
x^{(k+m)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k+i-1)})
\]
Meanwhile, a full gradient step with step size $mt$ would give
\[
x^{(k+1)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k)})
\]
The difference between the two is $t \sum_{i=1}^m \big[\nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)})\big]$, and if we hold $t$ constant, this difference does not generally go to zero: fixed-step SGD stalls at a noise level set by $t$, while diminishing step sizes allow convergence.
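To see this numerically, consider a toy example (purely illustrative, not from the slides): take $f_i(x) = \tfrac{1}{2}(x - c_i)^2$, so the average is minimized at $x^\star = \bar{c}$. With a fixed step, the SGD iterates keep bouncing around $x^\star$ at a scale set by $t$; with $t_k = 1/k$, the iterate is the running average of the sampled $c_i$ and keeps improving.

```python
import numpy as np

# Toy problem: f_i(x) = 0.5 * (x - c_i)^2, so f(x) = (1/m) sum_i f_i(x)
# is minimized at x* = mean(c), and grad f_i(x) = x - c_i.
rng = np.random.default_rng(0)
m = 100
c = rng.normal(size=m)
x_star = c.mean()

def run_sgd(step_rule, n_iters=50_000):
    x = 0.0
    for k in range(1, n_iters + 1):
        i = rng.integers(m)
        x -= step_rule(k) * (x - c[i])
    return abs(x - x_star)

print("fixed step t = 0.1   :", run_sgd(lambda k: 0.1))       # stalls at a noise level
print("diminishing step 1/k :", run_sgd(lambda k: 1.0 / k))   # keeps improving
```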
Convergence rates
For convex $f$, SGD with diminishing step sizes satisfies¹
\[
\mathbb{E}[f(x^{(k)})] - f^\star = O(1/\sqrt{k})
\]
Unlike full gradient descent, this rate does not improve to $O(1/k)$ when we further assume that $\nabla f$ is Lipschitz.

¹ For example, Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"
Even worse is the following discrepancy! When $f$ is strongly convex, full gradient descent converges at a linear rate, whereas SGD (even with diminishing step sizes) only achieves²
\[
\mathbb{E}[f(x^{(k)})] - f^\star = O(1/k)
\]
In other words, SGD cannot adapt to strong convexity the way gradient descent does.

² For example, Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"
Mini-batches
Also common is mini-batch stochastic gradient descent, where we choose a random subset $I_k \subseteq \{1, \ldots, m\}$ of size $|I_k| = b \ll m$, and repeat:
\[
x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{b} \sum_{i \in I_k} \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots
\]
As before, this uses an unbiased estimate of the full gradient; a mini-batch of size $b$ reduces its variance by a factor of $1/b$, but each update is also roughly $b$ times more expensive.
³ For example, Dekel et al. (2012), "Optimal distributed online prediction using mini-batches"
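A minimal sketch of the mini-batch update, reusing the illustrative `grad_fi` interface from the earlier sketch (the fixed step size here is an assumption; a diminishing schedule works the same way):

```python
import numpy as np

def minibatch_sgd(grad_fi, m, x0, b=10, t=0.05, n_iters=1000, seed=0):
    """Mini-batch SGD: average grad f_i over a random subset I_k of size b."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_iters):
        I_k = rng.choice(m, size=b, replace=False)          # random I_k of size b
        g = np.mean([grad_fi(i, x) for i in I_k], axis=0)   # (1/b) sum_{i in I_k} grad f_i(x)
        x = x - t * g
    return x
```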
Back to logistic regression, let's now consider a regularized version:
\[
\min_\beta \; \frac{1}{n} \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + e^{x_i^T \beta}\big) \Big) + \frac{\lambda}{2} \|\beta\|_2^2
\]
Writing $p_i(\beta) = e^{x_i^T \beta} / (1 + e^{x_i^T \beta})$, the full gradient computation is
\[
\nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n \big(p_i(\beta) - y_i\big) x_i + \lambda \beta
\]
Comparison between methods:
• One batch update costs O(np)
• One mini-batch update costs O(bp)
• One stochastic update costs O(p)
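As an illustrative sketch (names and interfaces assumed, not from the slides), the three gradient computations for the regularized logistic criterion above look like this, and their per-update costs are exactly the $O(np)$, $O(bp)$, $O(p)$ comparison listed:

```python
import numpy as np

def probs(beta, X):
    """p_i(beta) = exp(x_i^T beta) / (1 + exp(x_i^T beta)) for each row x_i of X."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

def full_grad(beta, X, y, lam):
    """Batch gradient, cost O(np): touches all n observations."""
    n = X.shape[0]
    return X.T @ (probs(beta, X) - y) / n + lam * beta

def minibatch_grad(beta, X, y, lam, idx):
    """Mini-batch gradient, cost O(bp): touches only the b rows indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (probs(beta, Xb) - yb) / len(idx) + lam * beta

def stochastic_grad(beta, X, y, lam, i):
    """Stochastic gradient, cost O(p): touches a single observation i."""
    p_i = 1.0 / (1.0 + np.exp(-X[i] @ beta))
    return (p_i - y[i]) * X[i] + lam * beta
```

Any of these can be plugged into the update loops sketched earlier.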
Example with n = 10, 000, p = 20, all methods use fixed step sizes:
[Figure: criterion value f_k versus iteration number k for full gradient, stochastic, and mini-batch (b = 10, b = 100) methods]
What’s happening? Now let’s parametrize by flops:
[Figure: the same criterion values, now plotted against flop count instead of iteration number]
Finally, looking at suboptimality gap (on log scale):
[Figure: suboptimality gap f_k − f* (log scale) versus iteration number k for the same four methods]
End of the story?
Short story:
• SGD can be super effective in terms of iteration cost, memory
• But SGD is slow to converge, can’t adapt to strong convexity
• And mini-batches seem to be a wash in terms of flops (though
they can still be useful in practice)
SGD in large-scale ML
⁴ For example, Bottou (2012), "Stochastic gradient descent tricks"
Early stopping
Consider the following, for a very small, constant step size $\epsilon$:
• Start at $\beta^{(0)} = 0$, the solution to the regularized problem at $t = 0$
• Perform gradient descent on the unregularized criterion:
\[
\beta^{(k)} = \beta^{(k-1)} - \epsilon \cdot \frac{1}{n} \sum_{i=1}^n \big(p_i(\beta^{(k-1)}) - y_i\big) x_i, \quad k = 1, 2, 3, \ldots
\]
An intriguing connection
When we plot the gradient descent iterates ... they resemble the solution path of the $\ell_2$ regularized problem for varying $t$!
[Figure: two panels of coordinate paths, one for the $\ell_2$ regularized solutions over $t$ and one for the gradient descent iterates over $k$, plotted side by side]
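A rough way to reproduce this picture (everything below, data included, is an illustrative assumption; the slides use the constrained form with radius $t$, while this sketch uses the penalized form with a grid of $\lambda$ values for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def grad_unreg(beta):
    """Gradient of the unregularized logistic criterion (1/n) sum_i f_i(beta)."""
    pr = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (pr - y) / n

# Path 1: gradient descent on the unregularized criterion, small constant step,
# started at beta = 0; record every iterate
eps, iters = 0.01, 5000
beta, gd_path = np.zeros(p), []
for k in range(iters):
    beta = beta - eps * grad_unreg(beta)
    gd_path.append(beta.copy())

# Path 2: ell_2 (ridge) penalized solutions over a decreasing grid of lambda,
# each computed here by plain gradient descent on the penalized criterion
ridge_path = []
for lam in np.logspace(1, -3, 20):
    b, step = np.zeros(p), 1.0 / (1.0 + lam)   # conservative step size for this lambda
    for _ in range(10_000):
        b = b - step * (grad_unreg(b) + lam * b)
    ridge_path.append(b)

# Plotting the coordinates of gd_path against k, and of ridge_path against
# decreasing lambda, gives two very similar-looking sets of solution paths.
```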
What’s the connection?
The intuitive connection comes from the steepest descent view of gradient descent. Let $\|\cdot\|$ and $\|\cdot\|_*$ be dual norms (e.g., $\ell_p$ and $\ell_q$ norms with $1/p + 1/q = 1$). Steepest descent updates are $x^+ = x + t \cdot \Delta x$, where
\[
\Delta x = \|\nabla f(x)\|_* \cdot u, \qquad u = \operatorname*{argmin}_{\|v\| \le 1} \; \nabla f(x)^T v
\]
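For instance, as a standard computation completing the thought (not quoted from the slides), take $\|\cdot\| = \|\cdot\|_2$, which is its own dual. Then
\[
u = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}, \qquad \Delta x = \|\nabla f(x)\|_2 \cdot u = -\nabla f(x),
\]
so gradient descent is exactly steepest descent with respect to the $\ell_2$ norm, which gives some intuition for why its early-stopped iterates track an $\ell_2$ regularization path.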
References and further reading