Learning Rate Free Learning by D Adaptation
Aaron Defazio
Meta AI, Fundamental AI Research (FAIR) team
Konstantin Mishchenko
Samsung AI Center
arXiv:2301.07733v2 [cs.LG] 20 Jan 2023
Abstract
The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of
learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the
distance D from the initial point to the solution set. In this work, we describe a single-loop method,
with no back-tracking or line searches, which does not require knowledge of D yet asymptotically
achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions.
Our approach is the first parameter-free method for this class without additional multiplicative log
factors in the convergence rate. We present extensive experiments for SGD and Adam variants of
our method, where the method automatically matches hand-tuned learning rates across more than
a dozen diverse machine learning problems, including large-scale vision and language problems.
Our method is practical and efficient, and requires no additional function value or gradient evaluations per step. An open-source implementation is available.1
1. Introduction
We consider the problem of unconstrained convex minimization,
min_{x∈R^p} f(x),
where f has Lipschitz constant G and a non-empty set of minimizers. The standard approach to
solving it is the subgradient method that, starting at a point x0 , produces new iterates following the
update rule:
xk+1 = xk − γk gk ,
where g_k ∈ ∂f(x_k) is a subgradient of f. The learning rate γ_k, also known as the step size, is
the main quantity controlling if and how fast the method converges. If the learning rate sequence is
chosen too large, the method might oscillate around the solution, whereas small values lead to very
slow progress.
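In code, the update rule above looks as follows. This is a minimal sketch of the subgradient method on the paper's toy objective f(x) = |x|, using the optimal constant step size D/(G√n) discussed next (the solver below is our own illustration, not the paper's algorithm):

```python
import numpy as np

def subgradient_method(grad, x0, gammas):
    """Run x_{k+1} = x_k - gamma_k * g_k and collect all iterates."""
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for gamma in gammas:
        g = grad(x)          # any subgradient g_k of f at x_k
        x = x - gamma * g    # subgradient step
        iterates.append(x.copy())
    return iterates

# Toy problem: f(x) = |x| (so G = 1), started at x0 = 1.0, so D = 1.0.
n = 100
D, G = 1.0, 1.0
gammas = [D / (G * np.sqrt(n))] * n     # constant step size D/(G*sqrt(n))
xs = subgradient_method(np.sign, np.array([1.0]), gammas)
x_hat = np.mean(xs)                     # average iterate
```

With this step size, the average iterate lands within the O(DG/√n) error regime around the minimizer at 0.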
Setting γ_k optimally requires knowledge of the distance to a solution. In particular, denote x∗ to be any minimizer of f, D to be the associated distance D = ‖x_0 − x∗‖, and f∗ to be the optimal value, f∗ = f(x∗). Then, using the step size

γ_k = D/(G√n),
1. https://github.com/facebookresearch/dadaptation
the average iterate x̂n converges in terms of function value at an inverse square-root rate:
f(x̂_n) − f∗ = O(DG/√n).
This rate is worst-case optimal for this complexity class [21]. Knowledge of the constant G can be
removed using AdaGrad-Norm step sizes [10, 31, 33],
γ_k = D / √( Σ_{i=0}^k ‖g_i‖² ),
together with projection onto the D-ball around the origin. In the (typical) case where we don’t
have knowledge of D, we can start with loose lower and upper bounds d0 and dmax , and perform a
hyper-parameter grid search on a log-spaced scale, with the rate:
f(x_n) − f∗ = O( DG log(dmax/d0) / √(n+1) ).
In most machine learning applications this grid search is the current standard practice.
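In code, this standard practice amounts to a log-spaced sweep over candidate base learning rates; the sketch below is our own schematic, with a hypothetical stand-in for the training run:

```python
import numpy as np

def grid_search_lr(train_and_eval, d0, dmax):
    """Sweep log-spaced learning rates between d0 and dmax; return the best.

    The number of grid points grows like log(dmax / d0), which is exactly
    where the extra log factor in the rate above comes from.
    """
    n_points = int(np.ceil(np.log2(dmax / d0))) + 1
    grid = np.logspace(np.log10(d0), np.log10(dmax), n_points)
    losses = [train_and_eval(lr) for lr in grid]
    return grid[int(np.argmin(losses))]

# Hypothetical stand-in: the final loss as a function of the base learning
# rate, minimized near lr = 0.05.
best = grid_search_lr(lambda lr: (lr - 0.05) ** 2, d0=1e-4, dmax=10.0)
```

Each grid point costs one full training run, which is the expense D-Adaptation is designed to avoid.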
In this work we take a different approach. We describe a modification of dual averaging that
achieves the optimal rate, for sufficiently large n, by maintaining and updating a lower bound on D.
Using this lower bound is provably sufficient to achieve the optimal rate of convergence, with no
additional log factors, avoiding the need for hyper-parameter grid searches.
2. Algorithm
The algorithm we propose is Algorithm 1. It is a modification of the AdaGrad step size applied
to weighted dual averaging, together with our key innovation: D lower bounding. At each step, we
construct a lower bound d̂_k on D using empirical quantities. If this bound is better (i.e. larger) than our current best lower bound d_k on D, we use d̂_k as the estimate in subsequent steps.
To construct the lower bound, we show that a weighted sum of the function values is bounded
above as:
Σ_{k=0}^n d_k (f(x_k) − f∗) ≤ D‖s_{n+1}‖ + Σ_{k=0}^n (γ_k/2) d_k² ‖g_k‖² − (γ_{n+1}/2) ‖s_{n+1}‖².
Firstly, we are able to gain an additional negative term −(γ_{n+1}/2)‖s_{n+1}‖². Secondly, we replace the typical D² error term with D‖s_{n+1}‖, following the idea of Carmon and Hinder [2]. This bound is tighter than the classical bound, and equivalent when D = ‖x_0 − x_{n+1}‖, since:
D‖s_{n+1}‖ − (γ_{n+1}/2)‖s_{n+1}‖² = (1/2)γ_{n+1}^{-1}( D² − (D − ‖x_0 − x_{n+1}‖)² ) ≤ (1/2)γ_{n+1}^{-1} D².
From our bound, using the fact that
Σ_{k=0}^n d_k (f(x_k) − f∗) ≥ 0,
we have:
0 ≤ D‖s_{n+1}‖ + Σ_{k=0}^n (γ_k/2) d_k² ‖g_k‖² − (γ_{n+1}/2) ‖s_{n+1}‖²,
which can be rearranged to yield a lower bound on D, involving only known quantities:
D ≥ d̂_{n+1} = ( γ_{n+1}‖s_{n+1}‖² − Σ_{k=0}^n γ_k d_k² ‖g_k‖² ) / ( 2‖s_{n+1}‖ ).
This bound is potentially vacuous if ‖s_{n+1}‖² is small in comparison to Σ_{k=0}^n γ_k d_k² ‖g_k‖². This only occurs once the algorithm is making fast enough progress that adjusting the bound is unnecessary at that time.
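Putting these pieces together, the method can be sketched as follows. This is our own simplified rendering of the idea (dual averaging with an AdaGrad-Norm step size plus the D lower bound), not the reference implementation linked above:

```python
import numpy as np

def d_adaptation(grad, x0, d0=1e-6, n_steps=100):
    """Sketch of D-Adaptation on dual averaging (simplified rendering).

    Maintains the weighted gradient sum s, an AdaGrad-Norm step size
    gamma, and a lower bound d on D that only ever increases, via
    d_{k+1} = max(d_k, d_hat_{k+1}).
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x0)     # weighted gradient sum, s = sum_i d_i g_i
    d = d0                    # current lower bound d_k on D
    grad_sq_sum = 0.0         # running sum of ||g_i||^2
    weighted_sq = 0.0         # running sum of gamma_i d_i^2 ||g_i||^2
    xs, ds = [x.copy()], [d]
    for _ in range(n_steps):
        g = grad(x)
        grad_sq_sum += float(g @ g)          # assumes g != 0 at the start
        gamma = 1.0 / np.sqrt(grad_sq_sum)   # AdaGrad-Norm step size
        s = s + d * g
        weighted_sq += gamma * d**2 * float(g @ g)
        s_norm = float(np.linalg.norm(s))
        if s_norm > 0.0:
            # Empirical lower bound on D from the rearranged error bound.
            d_hat = (gamma * s_norm**2 - weighted_sq) / (2.0 * s_norm)
            d = max(d, d_hat)                # D lower bounding step
        x = x0 - gamma * s                   # dual-averaging iterate
        xs.append(x.copy())
        ds.append(d)
    return xs, ds

# On f(x) = |x| starting from x0 = 1.0 (so D = 1.0), with an under-estimate d0:
xs, ds = d_adaptation(np.sign, np.array([1.0]), d0=0.1, n_steps=100)
```

On this toy problem the estimate d grows from the initial under-estimate without overshooting the true D, while the iterates approach the minimizer at 0.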
Theorem 1 For a convex G-Lipschitz function f , Algorithm 1 returns a point x̂n such that:
f(x̂_n) − f(x∗) = O( DG/√(n+1) ),
as n → ∞, where D = ‖x_0 − x∗‖ for any x∗ in the set of minimizers of f, as long as d0 ≤ D.
The above result is asymptotic because of worst-case constructions: for any fixed choice of n, a function can be constructed such that Algorithm 1 run for n steps has a dependence on d0. In the next theorem, we prove a non-asymptotic bound that is worse only by a factor of log2(D/d0). This guarantee is significantly better than using the subgradient method with step size proportional to d0, which would incur an extra factor of D/d0.
Theorem 2 Consider Algorithm 1 run for n ≥ log2(1 + D/d0) steps with the step size modified to be

γ_{k+1} = 1 / √( G² + Σ_{i=0}^k ‖g_i‖² ).   (1)

If we return the point x̂_t = (1/Σ_{k=0}^t d_k) Σ_{k=0}^t d_k x_k, where t is chosen to be

t = argmin_{k≤n} d_{k+1} / Σ_{i=0}^k d_i,

then

f(x̂_t) − f∗ ≤ 8D (log2(1 + D/d0)/(n+1)) √( Σ_{k=0}^t ‖g_k‖² ) ≤ 8DG log2(1 + D/d0) √(t+1)/(n+1).
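The return index t depends only on the d_k sequence, so it can be computed directly in a single pass; the following small helper is our own illustration, with made-up d_k values:

```python
def choose_return_index(d):
    """Given d_0, ..., d_{n+1}, pick t = argmin_{k<=n} d_{k+1} / sum_{i<=k} d_i."""
    best_t, best_val = 0, float("inf")
    prefix = 0.0
    for k in range(len(d) - 1):
        prefix += d[k]
        val = d[k + 1] / prefix
        if val < best_val:
            best_t, best_val = k, val
    return best_t

# Illustrative sequence: d doubles for a few steps and then plateaus.
t = choose_return_index([0.1, 0.2, 0.4, 0.8, 1.0, 1.0, 1.0, 1.0])
```

Intuitively, the minimum is attained once d_{k+1} has stopped growing relative to the accumulated weight, which is why the worst case involves d_k increasing slowly throughout the run.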
The worst-case behavior occurs when dk grows exponentially from d0 , but slowly, only reaching D
at the last step. For this reason, the worst case construction requires knowledge of the stopping time
n. The modification to the step size can be avoided at the cost of an extra term in the guarantee for the same iterate x̂_t; unlike the bound in the theorem above, that guarantee also depends on the initial gradient norm ‖g_0‖.
3. D-Adapted AdaGrad
The D-Adaptation technique can be applied on top of the coordinate-wise scaling variant of
AdaGrad with appropriate modifications. Algorithm 2 presents this method. This variant estimates
the distance to the solution in the ℓ∞-norm instead of the Euclidean norm, D∞ = ‖x_0 − x∗‖∞.
The theory for AdaGrad without D-Adaptation also uses the same norm to measure the distance
to solution, so this modification is natural, and results in the same adaptive convergence rate as
AdaGrad up to constant factors without requiring knowledge of D∞ .
Theorem 3 For a convex p-dimensional function with G∞ = max_x ‖∇f(x)‖∞, D-Adapted AdaGrad (Algorithm 2) returns a point x̂_n such that

f(x̂_n) − f∗ = O( ‖a_{n+1}‖₁ D∞ / (n+1) ) = O( pG∞D∞ / √(n+1) ).
Similarly to Theorem 2, we could achieve the same result up to higher order terms without using
G∞ in the initialization of a0 .
Figure 1: Toy problem illustrating the estimate of D over time, f(x) = |x|. x_0 = 1.0 is shown as a blue dot on the left plot (f(x) against x), and the following iterates are shown in purple. The right plot shows D, d_k and d̂_k against the step number over 100 steps.
4. Discussion
Figure 1 depicts the behavior of D-Adaptation on a toy problem: minimizing an absolute value function starting at x_0 = 1.0. Here d_0 is started at 0.1, below the known D value of 1.0. This example illustrates the growth of d_k towards D. The value of d_k typically doesn't asymptotically approach D, as this is not guaranteed nor required by our theory. Instead, we show in Theorem 19 that under a mild assumption, d is asymptotically greater than or equal to D/3. The lower bound d̂_k will often start to decrease, and even go negative, once d_k is large enough. Negative values of d̂_k were seen in most of the experiments in Section 7.
The numerator of the D bound is not tight; it can be replaced with a larger inner-product quantity:

Σ_{k=0}^n γ_k d_k ⟨g_k, s_k⟩ ≥ (γ_{n+1}/2)‖s_{n+1}‖² − Σ_{k=0}^n (γ_k/2) d_k² ‖g_k‖².
The inner product between the step direction s and the gradient g is a quantity known as the
(negative) hyper-gradient [1, 4, 7, 11, 25]. In classical applications of the hyper-gradient, the
learning rate is increased when the gradient points in the same direction as the previous step, and
it is decreased otherwise. In essence, the hyper-gradient indicates if the current learning rate is too
large or too small. An additional hyper-learning rate parameter is needed to control the rate of change
of the learning rate, whereas our approach requires no extra parameters beyond the initial d0 .
In our approach, the hyper-gradient quantity is used to provide an actual estimate of the
magnitude of the optimal learning rate (or more precisely a lower bound), which is far more
information than just a directional signal of too-large or too-small. This is important for instance
when a learning rate schedule is being used, as we can anneal the learning rate down over time,
without the hyper-gradient responding by pushing the learning rate back up. This is also useful
during learning rate warmup, as we are able to build an estimate of D during the warmup, which is
not possible when using a classical hyper-gradient approach.
Our analysis applies to a very restricted problem setting of convex Lipschitz functions. In
Carmon and Hinder [2], an approach for the same setting is extended to the stochastic setting in
high probability. The same extension may also be applicable here.
Our algorithm requires an initial lower bound d0 on D. The value of d0 does not appear in the
convergence rate bound for the asymptotic setting as its contribution goes to zero as k → ∞, and
hence is suppressed when big-O notation is used. In practice very small values can be used, as dk
will grow exponentially with k when d0 is extremely small.
5. Related Work
There are a number of techniques for optimizing Lipschitz functions that achieve independence of
problem parameters. We review the major classes of approaches below.
5.3. Bisection
Instead of running subgradient descent at every grid point of a log-spaced grid from d0 to dmax, we can use more sophisticated techniques, running a bisection algorithm on the same grid and resulting in a log-log, rather than log, dependence on dmax/d0 [2]:

f(x_n) − f∗ = O( DG log log(dmax/d0) / √(n+1) ).
This can be further improved by estimating dmax , which allows us to replace dmax with D in this
bound.
5.4. Coin-betting
If we assume knowledge of G but not D, coin-betting approaches can be used. Coin-betting [19, 23, 24, 40] is normally analyzed in the online optimization framework, which is more general than our setting; for that class, coin-betting methods achieve optimal regret among methods without knowledge of D [22]:

Regret_n = O( DG √( (n+1) log(1+D) ) ),
which is worse by a square-root log factor than the best possible regret with knowledge of D. Using online-to-batch conversion gives a rate of convergence in function value of

O( DG √(log(1+D)) / √(n+1) ).
A dependence on √(log(1 + D/d0)) can also be obtained using similar techniques, which is better by a square-root factor than our non-asymptotic result.
[Algorithm listing omitted; its loop ends with the D-Adaptation update d_{k+1} = max(d_k, d̂_{k+1}).]
• No bias correction is included as it doesn’t appear necessary based on our experiments. The
implicit learning rate warm-up of D-Adaptation has a similar effect.
We include an optional γ_k constant sequence as input to the algorithms. This sequence should be set following a learning rate schedule if one is needed for the problem. This schedule should use 1.0 as the base value, increasing towards 1.0 during warm-up (if needed) and decreasing from 1.0 during learning rate annealing. Typically the same schedule can be used as would normally be used without D-Adaptation.
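As a concrete example (our own, not from the paper), a γ_k multiplier sequence with linear warm-up to the base value of 1.0 followed by cosine annealing could be written as:

```python
import math

def gamma_schedule(k, warmup_steps, total_steps):
    """Multiplier gamma_k: linear warm-up to the base value 1.0, then
    cosine annealing decreasing from 1.0 towards zero."""
    if k < warmup_steps:
        return (k + 1) / warmup_steps            # increase towards 1.0
    progress = (k - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # decrease from 1.0

# The multiplier never exceeds the base value of 1.0.
values = [gamma_schedule(k, warmup_steps=10, total_steps=100) for k in range(100)]
```

Because D-Adaptation sets the base learning rate itself, the schedule only shapes the multiplier around that base value.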
7. Experimental Results
We compared our D-Adapted variants of Adam and SGD on a range of machine learning problems
to demonstrate their effectiveness in practice. For the deep learning problems, we varied both the
models and datasets to illustrate the effectiveness of D-Adaptation across a wide range of situations.
In each case we used the standard learning rate schedule typically used for the problem, with the
base learning rate set by D-Adaptation. Full hyper-parameter settings for each problem are included
in the Appendix. We plot the mean of multiple seeds, with the error bars in each plot indicating a
range of 2 standard errors from the mean. The number of seeds used for each problem is listed in
the Appendix.
Figure 2: Logistic regression experiments (training accuracy over 100 epochs). Adam vs. D-Adapt Adam final accuracies: Glass 72.0 (SE 0.16) vs. 72.0 (SE 0.29); Letter 77.8 (SE 0.01) vs. 77.7 (SE 0.01); USPS 98.6 (SE 0.01) vs. 98.6 (SE 0.03); Vehicle 82.1 (SE 0.08) vs. 82.5 (SE 0.10); Vowel 77.4 (SE 0.11) vs. 77.8 (SE 0.10). [Plots omitted.]
[Figure omitted: test accuracy and train loss curves. Final accuracies: D-Adapt SGD 95.56% (SE 0.04) vs. SGD 95.59% (SE 0.03) over 300 epochs; D-Adapt SGD 76.90% (SE 0.06) vs. SGD 76.41% (SE 0.14) over 300 epochs; D-Adapt SGD 76.43% (SE 0.01) vs. SGD 76.09% (SE 0.06) over roughly 90 epochs.]
[Figure omitted: training and test loss curves against step count for three further problems.]
Figure: COCO 2017 object detection (Faster RCNN X-101). D-Adapt SGD reaches 43.07 (SE 0.09) test BBox AP vs. 42.99 (SE 0.08) for SGD, with train losses 0.42 (SE 0.01) vs. 0.41 (SE 0.00). [Plots omitted.]
train on the Book-Wiki corpus (combining books from Zhu et al. [41] and a snapshot of Wikipedia).
D-Adaptation again matches the baseline in test-set perplexity.
[Figure omitted: ILSVRC 2012 ImageNet (Vision Transformer). D-Adapt Adam reaches 72.05% test accuracy (train loss 0.0039) vs. 74.01% (train loss 0.0037) for Adam over 300 epochs.]
[Figure omitted: a further experiment over 50 epochs in which D-Adapt Adam reaches 0.9103 (SE 0.00066) vs. 0.9103 (SE 0.00032) for Adam.]
D-Adaptation chooses a learning rate approximately twice the standard rate used for Adam on this
problem. The cosine schedule decreases the learning rate less aggressively than other schedules
early on, which may explain the performance gap.
7.8. fastMRI
The fastMRI Knee Dataset [39] is a large-scale release of raw MRI data. The reconstruction task
consists of producing a 2-dimensional, grey-scale image of the anatomy from the raw sensor data,
under varying under-sampling regimes. We trained a VarNet 2.0 [30] model, a strong baseline
model on this dataset, using the code and training setup released by Meta [5, 16]. We again match
the highly tuned baseline learning rate with D-Adaptation.
[Figure omitted: fastMRI experiment over 100 epochs. D-Adapt Adam reaches 0.7905 (SE 0.00009) vs. 0.7906 (SE 0.00007) for Adam.]
8. Conclusion
We have presented a simple approach to achieving parameter-free learning of convex Lipschitz functions, by constructing successively better lower bounds on the key unknown quantity: the distance to the solution, ‖x_0 − x∗‖. Our approach for constructing these lower bounds may be of
independent interest. Our method is also highly practical, demonstrating excellent performance
across a range of large and diverse machine learning problems.
Acknowledgements
We would like to thank Ashok Cutkosky for suggesting substantial simplifications to the proof of
Lemma 5.
References
[1] Yoshua Bengio. Gradient-based optimization of hyperparameters. In Neural Computation,
2000.
[2] Yair Carmon and Oliver Hinder. Making SGD parameter-free. arXiv preprint
arXiv:2205.02160, 2022.
[3] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico.
Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In IWSLT, 2014.
[4] Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, and Erik Meijer. Gradient descent: The
ultimate optimizer. In 36th Conference on Neural Information Processing Systems (NeurIPS
2022), 2022.
[5] Aaron Defazio. Offset sampling improves deep learning based accelerated mri reconstructions
by exploiting symmetry. arXiv preprint arXiv:1912.01101, 2019.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training
of deep bidirectional transformers for language understanding. Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics,
2019.
[7] Justin Domke. Generic methods for optimization-based modeling. In Fifteenth International
Conference on Artificial Intelligence and Statistics, 2012.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image
recognition at scale. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=YicbFdNTTy.
[9] Yoel Drori and Adrien B. Taylor. Efficient first-order methods for convex minimization: a
constructive approach. Mathematical Programming, 2020.
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
[11] Matthias Feurer and Frank Hutter. Automated Machine Learning, chapter Hyperparameter
Optimization. Springer International Publishing, 2019.
[12] Baptiste Goujaud, Adrien Taylor, and Aymeric Dieuleveut. Optimal first-order methods for
convex functions with a quadratic upper bound. Technical report, INRIA, 2022.
[13] Elad Hazan and Sham M. Kakade. Revisiting the polyak step size. Technical report, Google
AI Princeton, 2019.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016.
[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely
connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2261–2269, 2017. doi: 10.1109/CVPR.2017.243.
[16] Florian Knoll, Jure Zbontar, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron
Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang,
Michal Drozdzalv, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo
Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K.
Sodickson, and Yvonne W. Lui. fastMRI: A publicly available raw k-space and DICOM
dataset of knee images for accelerated MR image reconstruction using machine learning.
Radiology: Artificial Intelligence, 2(1):e190007, 2020. doi: 10.1148/ryai.2020190007. URL
https://doi.org/10.1148/ryai.2020190007. PMID: 32076662.
[17] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009.
[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT
pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[19] H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in
hilbert spaces: Minimax algorithms and normal approximations. In Maria Florina Balcan,
Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on
Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1020–
1039, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL https://proceedings.mlr.
press/v35/mcmahan14.html.
[20] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan
Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G.
Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman
Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen,
Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep
learning recommendation model for personalization and recommendation systems. CoRR,
abs/1906.00091, 2019. URL https://arxiv.org/abs/1906.00091.
[21] Yurii Nesterov. Lectures on Convex Optimization. Springer Nature, 2018.
[22] Francesco Orabona. A modern introduction to online learning. Technical report, Boston
University, 2019.
[23] Francesco Orabona and Dávid Pál. Parameter-free stochastic optimization of variationally
coherent functions, 2021. URL https://arxiv.org/abs/2102.00236.
[24] Francesco Orabona and Tatiana Tommasi. Training deep networks without learning rates
through coin betting. In Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017.
[25] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International
conference on machine learning, pages 737–746. PMLR, 2016.
[26] Boris T. Polyak. Introduction to optimization. Optimization Software, Inc., 1987.
[27] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. Technical report, OpenAI, 2019.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.
cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf.
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[30] Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa
Yakubova, Florian Knoll, and Patricia Johnson. End-to-end variational networks for
accelerated MRI reconstruction. In International Conference on Medical Image Computing
and Computer-Assisted Intervention, pages 64–73. Springer, 2020.
[31] Matthew Streeter and H. Brendan McMahan. Less regret via online conditioning, 2010. URL
https://arxiv.org/abs/1002.4862.
[32] Matthew Streeter and H. Brendan McMahan. No-regret algorithms for unconstrained online
convex optimization. In Proceedings of the 25th International Conference on Neural
Information Processing Systems - Volume 2, NIPS’12, Red Hook, NY, USA, 2012. Curran
Associates Inc.
[33] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: sharp convergence over
nonconvex landscapes. In International Conference on Machine Learning, pages 6677–6686,
2019.
[36] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2.
https://github.com/facebookresearch/detectron2, 2019.
[37] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 5987–5995, 2017. doi: 10.1109/CVPR.2017.634.
[38] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock
Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision
Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016. ISBN 1-901725-59-6.
doi: 10.5244/C.30.87. URL https://dx.doi.org/10.5244/C.30.87.
[39] Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron
Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI:
An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839,
2018.
[40] Zhiyu Zhang, Ashok Cutkosky, and Ioannis Ch. Paschalidis. Pde-based optimal strategy
for unconstrained online learning. In Proceedings of the 39th International Conference on
Machine Learning (ICML 2022), 2022.
[41] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations
by watching movies and reading books. In Proceedings of the 2015 IEEE International
Conference on Computer Vision (ICCV), 2015. doi: 10.1109/ICCV.2015.11. URL https:
//doi.org/10.1109/ICCV.2015.11.
with λ weights:

−γ_{n+1} Σ_{k=0}^n λ_k ⟨g_k, s_k⟩ = −(γ_{n+1}/2)‖s_{n+1}‖² + (γ_{n+1}/2) Σ_{k=0}^n λ_k² ‖g_k‖².
Theorem 6 The initial distance to solution, D = ‖x_0 − x∗‖, can be lower bounded as follows:

D ≥ ( (γ_{n+1}/2)‖s_{n+1}‖² − Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖² ) / ‖s_{n+1}‖.

This gives some indication as to the magnitude of D in the case when the other terms on the right are negative. To proceed, we use Σ_{k=0}^n λ_k (f(x_k) − f∗) ≥ 0, giving:

0 ≤ D‖s_{n+1}‖ + Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖² − (γ_{n+1}/2)‖s_{n+1}‖²,
Therefore:

D ≥ ( (γ_{n+1}/2)‖s_{n+1}‖² − Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖² ) / ‖s_{n+1}‖.
Proof Using the definition of d̂_{n+1} from Theorem 6, and the property d̂_{n+1} ≤ d_{n+1}, we derive

(γ_{n+1}/2)‖s_{n+1}‖² − Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖² = d̂_{n+1}‖s_{n+1}‖ ≤ d_{n+1}‖s_{n+1}‖.

Using the inequality 2αβ ≤ α² + β² with α² = 2d_{n+1}²/γ_{n+1} and β² = (γ_{n+1}/2)‖s_{n+1}‖², and then the bound above, we establish

2αβ = 2d_{n+1}‖s_{n+1}‖ ≤ 2d_{n+1}²/γ_{n+1} + (γ_{n+1}/2)‖s_{n+1}‖² ≤ 2d_{n+1}²/γ_{n+1} + d_{n+1}‖s_{n+1}‖ + Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖².
Proposition 8 (From Streeter and McMahan [31]) The gradient error term can be bounded as:

Σ_{k=0}^n ‖g_k‖² / √( G² + Σ_{i=0}^{k−1} ‖g_i‖² ) ≤ 2 √( Σ_{k=0}^n ‖g_k‖² ).   (3)

Moreover, if γ_k = 1/√( G² + Σ_{i=0}^{k−1} ‖g_i‖² ), then

Σ_{k=0}^n (γ_k/2) ‖g_k‖² ≤ γ_{n+1} ( G² + Σ_{k=0}^n ‖g_k‖² ).   (4)
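Both inequalities in Proposition 8 are easy to check numerically; the following sanity-check script is our own addition, using random gradient norms bounded by G²:

```python
import math
import random

random.seed(0)
G = 2.0
g_sq = [random.uniform(0.0, G**2) for _ in range(200)]  # ||g_k||^2 <= G^2
total = sum(g_sq)

# Bound (3): sum of ||g_k||^2 / sqrt(G^2 + prefix) <= 2 * sqrt(total).
prefix = 0.0
lhs3 = 0.0
for gk2 in g_sq:
    lhs3 += gk2 / math.sqrt(G**2 + prefix)
    prefix += gk2
assert lhs3 <= 2.0 * math.sqrt(total)

# Bound (4): sum of (gamma_k / 2) * ||g_k||^2 <= gamma_{n+1} * (G^2 + total),
# with gamma_k = 1 / sqrt(G^2 + sum_{i<k} ||g_i||^2).
prefix = 0.0
lhs4 = 0.0
for gk2 in g_sq:
    gamma_k = 1.0 / math.sqrt(G**2 + prefix)
    lhs4 += 0.5 * gamma_k * gk2
    prefix += gk2
gamma_np1 = 1.0 / math.sqrt(G**2 + total)
assert lhs4 <= gamma_np1 * (G**2 + total)
```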
Proof In the case where g_0 = 0, f(x_0) = f(x∗) and the theorem is trivially true, so we assume that ‖g_0‖² > 0. We will show the result holds for some n, where we choose n sufficiently large so that a number of criteria are met:
Criterion 1: since d_k is a non-decreasing sequence upper bounded by D, there must exist some n̂ such that after n̂ steps, d_k ≥ (1/2)d_{n+1} for all k, n ≥ n̂. We take n ≥ 2n̂.
Criterion 2: since we assume the bound ‖g_k‖² ≤ G², there must exist some r such that ‖g_n‖² ≤ Σ_{k=0}^{n−1} ‖g_k‖² for all n ≥ r. Let us choose the smallest r that satisfies this condition, in which case ‖g_{r−1}‖² ≥ Σ_{k=0}^{r−2} ‖g_k‖², otherwise we could have chosen r − 1. Moreover, we have by definition γ_k ≤ 1/‖g_0‖ for all k ≤ r − 1. Combining this with the first bound from Proposition 8, we derive

Σ_{k=0}^n γ_k ‖g_k‖² = Σ_{k=r}^n γ_k ‖g_k‖² + Σ_{k=0}^{r−1} γ_k ‖g_k‖²
  ≤ 2 √( Σ_{k=r}^n ‖g_k‖² ) + (1/‖g_0‖) Σ_{k=0}^{r−1} ‖g_k‖²
  ≤ 2 √( Σ_{k=0}^n ‖g_k‖² ) + (2/‖g_0‖) ‖g_{r−1}‖²
  ≤ 2 √( Σ_{k=0}^n ‖g_k‖² ) + 2G²/‖g_0‖.
hence

1/Σ_{k=0}^n d_k ≤ 4/((n+1)d_{n+1}).

Plugging this back yields

(1/Σ_{k=0}^n d_k) Σ_{k=0}^n d_k (f(x_k) − f∗) ≤ (8D/(n+1)) √( Σ_{k=0}^n ‖g_k‖² ) + (4D/(n+1)) Σ_{k=0}^n γ_k ‖g_k‖².
Using Jensen’s inequality, we can convert this to a bound on the average iterate defined as
n
1 X
x̂n = Pn dk xk ,
k=0 dk k=0
implying
12DG 8DG2
f (x̂n ) − f∗ ≤ √ + .
n + 1 (n + 1)kg0 k
Note that the second term on the right decreases faster than the first term with respect to n, so
DG
f (x̂n ) − f∗ = O √ .
n+1
Let r = ⌈log2(d_N/d_0)⌉. We proceed by an inductive argument on r. In the base case, if r = 0 then the result follows immediately:

min_{n≤N} d_{n+1}/Σ_{k=0}^n d_k = d_{N+1}/Σ_{k=0}^N d_k = d_{N+1}/((N+1)d_{N+1}) = 1/(N+1) ≤ 2 log2(1 + d_{N+1}/d_0)/(N+1).

So assume that r > 0. First we show that no induction is needed, and we may take n = N, if

d_k ≥ (1/2)d_{N+1}, for all k ≥ N + 1 − (N+1)/log2(1 + d_{N+1}/d_0).

In that case,

d_{N+1}/Σ_{k=0}^N d_k ≤ 2 log2(1 + d_{N+1}/d_0)/(N+1),

and therefore

min_{n≤N} d_{n+1}/Σ_{k=0}^n d_k ≤ 2 log2(1 + d_{N+1}/d_0)/(N+1).

So, instead suppose that d_{n′+1} ≤ (1/2)d_{N+1}, for n′ = N + 1 − ⌊(N+1)/log2(1 + d_{N+1}/d_0)⌋. Note that the +1 is due to the fact that the above case includes the edge case where an increase occurs exactly at the beginning of the interval. Assume the inductive hypothesis that:

min_{n≤n′} d_{n+1}/Σ_{k=0}^n d_k ≤ 2 log2(1 + d_{n′+1}/d_0)/(n′+1), for n′ = N + 1 − ⌊(N+1)/log2(1 + d_{N+1}/d_0)⌋.
Theorem 12 Consider Algorithm 1 run for n steps, where n ≥ log2(D/d0). If we return the point x̂_t = (1/Σ_{k=0}^t d_k) Σ_{k=0}^t d_k x_k where t is chosen to be:

t = argmin_{k≤n} d_{k+1}/Σ_{i=0}^k d_i,

then:

f(x̂_t) − f∗ ≤ 8D (log2(1 + D/d0)/(n+1)) √( Σ_{k=0}^t ‖g_k‖² ).
Now using Lemma 11, we can return the point x̂_t at time t = argmin_{k≤n} d_{k+1}/Σ_{i=0}^k d_i, ensuring that

d_{t+1}/Σ_{k=0}^t d_k = min_{k≤n} d_{k+1}/Σ_{i=0}^k d_i ≤ 2 log2(1 + d_{n+1}/d_0)/(n+1),

where the inequality is (5), giving us an upper bound:

f(x̂_t) − f∗ ≤ 8D (log2(1 + D/d0)/(n+1)) √( Σ_{k=0}^t ‖g_k‖² ).
We note that a similar proof can be used to remove the G² term from the numerator of γ_k. To this end, we could reuse the bound obtained in the proof of Theorem 10:

Σ_{k=0}^n γ_k ‖g_k‖² ≤ 2 √( Σ_{k=0}^n ‖g_k‖² ) + 2G²/‖g_0‖,

which holds for γ_k = 1/√( Σ_{i=0}^{k−1} ‖g_i‖² ). In the proof of Theorem 10, this bound was stated for n ≥ r, where r is the smallest number such that ‖g_k‖² ≤ Σ_{i=0}^{k−1} ‖g_i‖² for all k ≥ r. However, the bound itself does not require n ≥ r, since for n < r it holds even without the first term on the right-hand side. The second term in that bound does not increase with n, and it would result in a corresponding bound, with an extra term depending on ‖g_0‖, for the same iterate x̂_t as in Theorem 12.
D∞ = ‖x_0 − x∗‖∞

and:

d̂_{n+1} = ( ‖s_{n+1}‖²_{A_{n+1}^{-1}} − Σ_{k=0}^n λ_k² ‖g_k‖²_{A_k^{-1}} ) / ( 2‖s_{n+1}‖₁ ).
The following lemma applies to Algorithm 2 with general weights λ_k.

Lemma 13 The inner product ⟨λ_k g_k, A_k^{-1} s_k⟩ is a key quantity that occurs in our theory. Suppose that A_{n+1} ⪰ A_n for all n; then we can bound the sum of these inner products as follows:

−Σ_{k=0}^n ⟨λ_k g_k, A_k^{-1} s_k⟩ ≤ −(1/2)‖s_{n+1}‖²_{A_{n+1}^{-1}} + (1/2) Σ_{k=0}^n λ_k² ‖g_k‖²_{A_k^{-1}}.

(1/2)‖s_{n+1}‖²_{A_{n+1}^{-1}} ≤ (1/2)‖s_{n+1}‖²_{A_n^{-1}}
  = (1/2)‖s_n‖²_{A_n^{-1}} + (1/2)λ_n² ‖g_n‖²_{A_n^{-1}} + ⟨λ_n g_n, A_n^{-1} s_n⟩.

Therefore

−⟨λ_n g_n, A_n^{-1} s_n⟩ ≤ (1/2)‖s_n‖²_{A_n^{-1}} − (1/2)‖s_{n+1}‖²_{A_{n+1}^{-1}} + (1/2)λ_n² ‖g_n‖²_{A_n^{-1}}.
and, therefore,

D∞ ≥ ( ‖s_{n+1}‖²_{A_{n+1}^{-1}} − Σ_{k=0}^n λ_k² ‖g_k‖²_{A_k^{-1}} ) / ( 2‖s_{n+1}‖₁ ).
2d_{n+1}‖s_{n+1}‖₁ = 2d_{n+1} Σ_{i=1}^p |s_{(n+1)i}| ≤ Σ_{i=1}^p ( 2d_{n+1}² a_{(n+1)i} + s_{(n+1)i}²/(2a_{(n+1)i}) )
  = 2d_{n+1}² ‖a_{n+1}‖₁ + (1/2)‖s_{n+1}‖²_{A_{n+1}^{-1}}
  ≤ 2d_{n+1}² ‖a_{n+1}‖₁ + d_{n+1}‖s_{n+1}‖₁ + (1/2) Σ_{k=0}^n λ_k² ‖g_k‖²_{A_k^{-1}}
  ≤ 2d_{n+1}² ‖a_{n+1}‖₁ + d_{n+1}‖s_{n+1}‖₁ + d_{n+1}² ‖a_{n+1}‖₁.

Rearranging, we get

d_{n+1}‖s_{n+1}‖₁ ≤ 3d_{n+1}² ‖a_{n+1}‖₁.
Theorem 18 For a convex function with G∞ = max_x ‖∇f(x)‖∞, D-Adapted AdaGrad returns a point x̂_n such that

f(x̂_n) − f∗ = O( ‖a_{n+1}‖₁ D∞/(n+1) ) = O( pG∞D∞/√(n+1) )

as n → ∞, where D∞ = ‖x_0 − x∗‖∞ for any x∗ in the set of minimizers of f, as long as d0 ≤ D∞.
Proof We will show the result holds for some n, where we choose n sufficiently large so that the following condition is satisfied. Since d_k is a non-decreasing sequence upper bounded by D, there must exist some n̂ such that after n̂ steps, d_k ≥ (1/2)d_{n+1} for all k, n ≥ n̂. We take n ≥ 2n̂. Then:

Σ_{k=0}^n d_k ≥ (1/4)(n+1)d_{n+1},

∴ 1/Σ_{k=0}^n d_k ≤ 4/((n+1)d_{n+1}).
So we have from Lemma 15 that:

(1/Σ_{k=0}^n d_k) Σ_{k=0}^n d_k (f(x_k) − f∗) ≤ (4/((n+1)d_{n+1})) ( ‖s_{n+1}‖₁ D∞ + (1/2) Σ_{k=0}^n d_k² ‖g_k‖²_{A_k^{-1}} ).

Proceeding as before, we obtain

f(x̂_n) − f∗ ≤ 16pG∞D∞/√(n+1),

which yields the result.
Table 1: Logistic regression experiment. The problems are part of the LIBSVM repository. Since there are no standard train/test splits, and due to the small sizes of the datasets, we present training accuracy curves only.

Table 2: CIFAR10 experiment. Our data augmentation pipeline followed standard practice: random horizontal flipping, then random cropping to 32×32 (padding 4), then normalization by centering around (0.5, 0.5, 0.5).
Table 5: fastMRI experiment. We used the implementation from https://github.com/facebookresearch/fastMRI.

Architecture: 12 layer VarNet 2.0
Epochs: 50
GPUs: 8×V100
Batch size per GPU: 1
Acceleration factor: 4
Low frequency lines: 16
Mask type: Offset-1
LR schedule: flat
Seeds: 5
Decay: 0.0
Adam LR: 0.0003
β1, β2: 0.9, 0.999

Table 6: IWSLT14 experiment. Our implementation used FairSeq (https://github.com/facebookresearch/fairseq) defaults except for the parameters listed below. Note that the default Adam optimizer uses decoupled weight decay.

Architecture: lstm_wiseman_iwslt_de_en
Max Epoch: 55
GPUs: 1×V100
Max tokens per batch: 4096
Warmup steps: 4000
Dropout: 0.3
Label smoothing: 0.1
Share decoder, input, output embed: True
Float16: True
Update Frequency: 1
LR schedule: Inverse square-root
Seeds: 10
Decay: 0.05
Adam LR: 0.01
β1, β2: 0.9, 0.98
Table 7: RoBERTa BookWiki experiment. Our implementation used FairSeq defaults except for the parameters listed below.

Table 8: GPT BookWiki experiment. Our implementation used FairSeq defaults except for the parameters listed below.
Iterations: 300 000
Batch Size: 128
Schedule: Flat
Emb Dimension: 16
Seeds: 5
Decay: 0.0
Adam LR: 0.0001
β1, β2: 0.9, 0.999
The last term can be simplified using the definition of γ_n to finally produce:

γ_n‖s_n‖ ≤ 2d_n + γ_n² d_n ( G² + Σ_{k=0}^{n−1} ‖g_k‖² ) = 2d_n + d_n = 3d_n.

Now, assume that x_n → x∗ in norm, so ‖x_n − x∗‖ → 0. In that case, the bounds combined yield

D ≤ lim_{n→∞} ( ‖x_n − x∗‖ + γ_n‖s_n‖ ) = lim_{n→∞} γ_n‖s_n‖ ≤ 3 lim_{n→∞} d_n.

Thus, the value of d_n is asymptotically lower bounded by D/3.
Notice that it always holds that d̃_n ≥ d̂_n. The only complication that we can face is with Lemma 7, where we used the definition of d̂_n to obtain the upper bound. Nevertheless, one can prove the same bound with d̂_n replaced by d̃_n by repeating the same argument:

(γ_{n+1}/2)‖s_{n+1}‖² − Σ_{k=0}^n (γ_k/2) λ_k² ‖g_k‖² = d̂_{n+1}‖s_{n+1}‖ ≤ d̃_{n+1}‖s_{n+1}‖ ≤ d_{n+1}‖s_{n+1}‖.

From there, the rest of the proof of Lemma 7 follows in exactly the same way. The other proofs only use the monotonicity of the sequence and its boundedness by D, d_k ≤ d_{n+1} ≤ D, which would remain valid if we replace d̂_n with d̃_n.