Line Search for Convex Minimization

Laurent Orseau, Marcus Hutter


Google DeepMind
{lorseau,mhutter}@google.com
arXiv:2307.16560v1 [math.OC] 31 Jul 2023

Abstract
Golden-section search and bisection search are the two main principled algorithms for 1d minimization of
quasiconvex (unimodal) functions. The first one only uses function queries, while the second one also uses
gradient queries. Other algorithms exist under much stronger assumptions, such as Newton’s method for
twice-differentiable, strongly convex functions with Lipschitz gradients. However, to the best of our knowl-
edge, there is no principled exact line search algorithm for general convex functions — including piecewise-
linear and max-compositions of convex functions — that takes advantage of convexity. We propose two such
algorithms: ∆-Bisection is a variant of bisection search that uses (sub)gradient information and convexity to
speed up convergence, while ∆-Secant is a variant of golden-section search and uses only function queries.
Both algorithms are based on a refined definition of the optimality region ∆ containing the minimum
point, for general convex functions.
While bisection search reduces the x interval by a factor 2 at every iteration, ∆-Bisection reduces the
(sometimes much) smaller x∗ -gap ∆x (the x coordinates of ∆) by at least a factor 2 at every iteration.
Similarly, ∆-Secant also reduces the x∗ -gap by at least a factor 2 every second function query.
Moreover, and possibly more importantly, the y ∗ -gap ∆y (the y coordinates of ∆) also provides a refined
stopping criterion, which can also be used with other algorithms.
Experiments on a few convex functions confirm that our algorithms are always faster than their quasicon-
vex counterparts, often by more than a factor 2.
We further design a quasi-exact line search algorithm based on ∆-Secant. It can be used with gradient
descent as a replacement for backtracking line search, for which some parameters can be finicky to tune —
and we provide examples to this effect, on strongly-convex and smooth functions. We provide convergence
guarantees, and confirm the efficiency of quasi-exact line search on a few single- and multivariate convex
functions.1

1 Introduction
We consider the problem of exact function minimization in one dimension for general convex functions that are
expensive to evaluate (Boyd and Vandenberghe, 2004). This can occur for example when looking for a step size
for gradient descent or Frank-Wolfe (Frank and Wolfe, 1956) for logistic regression with millions of parameters.
While a backtracking line search is often used for this problem (Armijo, 1966; Boyd and Vandenberghe,
2004), such algorithms are quite sensitive to their parameters, and there is no universal good choice. In the past,
exact line searches have been used, but they were deemed too computationally costly. In this work we propose
a quasi-exact line search for convex functions, which is both computationally efficient and has tight guarantees
while being easy to tune — that is, the default value of its single parameter appears to be a universally good
choice.
When gradient information is available and computationally cheap enough, a classic algorithm for minimizing
a 1-d quasi-convex function is bisection search. It ensures that the x interval is halved at each iteration of the
search. We improve bisection search for 1-d convex functions by taking advantage of convexity to refine the
interval in which the optimum lies, which we call the optimality region ∆. We call this algorithm ∆-Bisection
and we show that it is provably faster than bisection search: the size of the optimality interval |∆x | on the x
axis (which we call the x∗ -gap) reduces by more than a factor 2 at each iteration, and often significantly more.
Moreover, for general convex functions we can use the optimality interval |∆y | (called the y ∗ -gap) to design a
1 One implementation of ∆-Secant can be found at https://github.com/deepmind/levintreesearch_cm/blob/main/lts-cm/delta-secant.rkt
stopping criterion. The y ∗ -gap is almost always bounded for convex functions, while it is almost always infinite
for general quasi-convex functions.
When gradient information is not available or is too expensive to compute, another classic algorithm to
minimize 1-d quasi-convex functions is golden-section search (GSS). It reduces the x interval by a factor
(1 + √5)/2 ≈ 1.62 at each iteration. Again, we take advantage of convexity to refine the optimality region ∆(P ),
for a set P of function evaluations. We design an algorithm called ∆-Secant that uses this information. It
is guaranteed to reduce the x∗ -gap by a factor at least 2 every second iteration. Note that the x∗ -gap can
be significantly smaller than the x interval, so the guarantees of GSS and ∆-Secant are not straightforwardly
comparable.
Experiments on a few convex functions confirm that our algorithms are significantly faster than their quasi-
convex counterparts.
Finally, we use ∆-Secant to design a quasi-exact line search algorithm to be combined with gradient descent-
like algorithms for minimization of multivariate convex functions. We prove a tight guarantee for strongly
convex and smooth functions, which is only a factor 2 worse than the guarantee for exact line search, while being
significantly faster in practice. We validate on some experiments that our quasi-exact line search is competitive
with a well-tuned backtracking line search, while being more robust and does not require to precisely tune a
parameter. Moreover, we demonstrate that backtracking line search has failure cases even for smooth convex
functions, making it non-straightforward to tune its parameters.

2 Notation and Background


For a point pi , we write xi and y i for its x and y coordinates, that is pi = (xi , y i ).
For two points pi and pj , define the secant line f^ij(x) = a^ij (x − x^i) + y^i to be the line going through pi
and pj , where a^ij = (y^i − y^j)/(x^i − x^j).

For four points pi , pj , pk , pl , define pij,kl = (xij,kl , y ij,kl ) to be the intersection point of the line going through
pi and pj with the line going through pk and pl . It can be checked that it can be expressed as
x^{ij,kl} = (f^ij(0) − f^kl(0)) / (a^kl − a^ij) ,   y^{ij,kl} = f^ij(x^{ij,kl}) = f^kl(x^{ij,kl}) = (a^kl f^ij(0) − a^ij f^kl(0)) / (a^kl − a^ij) .   (1)
There are many ways to express this point, but we choose this one as it is simple and is more closely related to
the numerical expression we use for stability (see Appendix B).
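As a concrete illustration, Eq. (1) translates directly into code. The sketch below is our own illustration (not the paper's implementation): it computes the secant slope and the intersection point of two secant lines from their values at 0.

```python
def slope(p, q):
    # Secant slope a^{ij} = (y^i - y^j) / (x^i - x^j).
    return (p[1] - q[1]) / (p[0] - q[0])

def secant(p, q):
    # Secant line f^{ij}(x) = a^{ij} (x - x^i) + y^i through p and q.
    a = slope(p, q)
    return lambda x: a * (x - p[0]) + p[1]

def intersection(pi, pj, pk, pl):
    # Intersection point (x^{ij,kl}, y^{ij,kl}) of the two secant lines, Eq. (1).
    aij, akl = slope(pi, pj), slope(pk, pl)
    fij0, fkl0 = secant(pi, pj)(0.0), secant(pk, pl)(0.0)  # f^{ij}(0), f^{kl}(0)
    x = (fij0 - fkl0) / (akl - aij)
    y = (akl * fij0 - aij * fkl0) / (akl - aij)
    return x, y
```

For example, the lines through (0, 0), (2, 2) and through (0, 2), (2, 0) intersect at (1, 1).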
Let P be a set of points, then we define p^low(P) = argmin_{(x,y)∈P} y, and we also define (x^low(P), y^low(P)) = p^low(P), and similarly for p^high.

We assume that the user provides a convex function f : [xleft , xright ] → R ∪ {∞} and the interval [xleft , xright ].
We write x∗ ∈ argminx∈[xleft ,xright ] f (x) for some arbitrary minimum x∗ of f within the given bounds, and
y ∗ = f (x∗ ).
Given a convex function f : [xleft , xright ] → R, we say that a set of points P is (f, xleft , xright )-convex if for all
(x, y) ∈ P , y = f (x), and argminx∈[xleft ,xright ] f (x) ⊆ [xlow (P ), xhigh (P )]. (The last condition is easy to fulfill by
including the points at xleft and xright in P , but, practically, it may be useful to remove unnecessary points in
P .)
We also assume that the user provides a tolerance y tol and that the algorithms considered later return a point
(x, y) only when it can be proven that y − y ∗ ≤ y tol .
Recall that a function f : R^d → R is m-strongly convex if, for all x, x′ ∈ R^d:

f(x′) ≥ f(x) + ⟨x′ − x, ∇f(x)⟩ + (m/2) ∥x′ − x∥²₂ ,

and f is M-smooth if for all x, x′ ∈ R^d:

f(x′) ≤ f(x) + ⟨x′ − x, ∇f(x)⟩ + (M/2) ∥x′ − x∥²₂ .
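As a quick numerical sanity check of these two definitions, note that f(x) = x² satisfies both inequalities with equality for m = M = 2. A small sketch, using f(x) = x² purely as an illustration:

```python
import random

def f(x): return x * x
def grad(x): return 2.0 * x

m = M = 2.0  # f(x) = x^2 is 2-strongly convex and 2-smooth
random.seed(0)
for _ in range(1000):
    x, xp = random.uniform(-5, 5), random.uniform(-5, 5)
    linear = f(x) + grad(x) * (xp - x)
    # Strong convexity: quadratic lower bound below the function value.
    assert linear + 0.5 * m * (xp - x) ** 2 <= f(xp) + 1e-9
    # Smoothness: quadratic upper bound above the function value.
    assert f(xp) <= linear + 0.5 * M * (xp - x) ** 2 + 1e-9
```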

3 Algorithms for 1-d convex minimization


We first consider the case where gradient information is used, and define the ∆-Bisection algorithm. Thereafter
we consider the case where gradient information is not used, and define the ∆-Secant algorithm.

3.1 ∆-Bisection: Convex line search with gradient information
It is well-known that 1-d convex minimization can be performed using the bisection method to guarantee a
reduction of the x interval by a factor 1/2 at each iteration. Quoting Boyd and Vandenberghe (2004):
The volume reduction factor is the best it can be: it is always exactly 1/2.

For general quasi-convex functions, only the sign of the gradient is informative — and provides exactly 1 bit
of information. However, more information is available for convex functions, namely, the magnitude of the
gradient. Thus, it is almost always possible to shrink the interval by a factor smaller than 1/2, using nothing
more than gradient information and convexity, and without performing more function or gradient queries than
the original bisection algorithm. For this purpose, we propose the ∆-Bisection algorithm. To show how this is
achieved, we need to consider a refinement of the interval where the optimum x∗ can be located. See Fig. 1.

Figure 1: ∆-Bisection: An example where the x∗-gap ∆x = [x^-, x^+] ≈ [0.2, 1.85] is significantly smaller (almost
by a factor 3 here) than the x interval [x^0, x^1] = [0.2, 5], after querying the function f and the gradients at x^0 and
x^1. x^+ is the x value of the intersection of the tangent at x^1 with the horizontal axis at min{y^0, y^1}. Similarly,
x^- is the x value of the intersection of the tangent at x^0 with the horizontal axis at min{y^0, y^1} which, here, is
x^0. Due to convexity, we know that the minimum x∗ cannot be outside [x^-, x^+]. The original bisection search
algorithm would query at the middle of [x^0, x^1] at x^q′ = 2.6 — which is outside [x^-, x^+] — which, while reducing
the [x^0, x^1] interval by a factor 2, would reduce the ∆x interval only marginally (using the tangent information
at x^q′ — not shown). By contrast, the ∆-Bisection algorithm queries at x^q in the middle of ∆x, ensuring a
reduction of ∆x by at least a factor 2: once the gradient at x^q becomes known, we will know that x∗ ∉ [x^q, x^+]
on this figure, which gives a reduction of a factor 2, and the tangent at x^q will be used to calculate the new
values of x^- and x^+.

At each iteration t, the algorithm maintains two points, p^0_t = (x^0_t, y^0_t) and p^1_t = (x^1_t, y^1_t), with x^0_t ≤ x^1_t,
and the gradients of f at these points, f′(x^0_t) and f′(x^1_t). Define x^high_t ∈ {x^0_t, x^1_t} such that y^high_t = f(x^high_t) =
max{f(x^0_t), f(x^1_t)}, and x^low_t ∈ {x^0_t, x^1_t} \ {x^high_t} with y^low_t = f(x^low_t).
Define f^a_b(x) = f(x^a_b) + (x − x^a_b) f′(x^a_b) to be the tangent function of f at x^a_b. Then f^0_t(·) and f^1_t(·) are the
two tangents at the two points x^0_t and x^1_t. Then by convexity we know that f(x∗) ≥ f^0_t(x∗) and f(x∗) ≥ f^1_t(x∗),
and we also know that f(x∗) ≤ y^low_t. We use these constraints to define the optimality region ∆_t at iteration t:

∆_t = {(x, y) ∈ R² | x ∈ [x^0_t, x^1_t] ∧ y ≤ y^low_t ∧ f^0_t(x) ≤ y ∧ f^1_t(x) ≤ y} .

We also define the x∗ -gap ∆xt and the y ∗ -gap ∆yt :

∆xt = {x | (x, ·) ∈ ∆t } , ∆yt = {y | (·, y) ∈ ∆t } .

Then necessarily (x∗ , f (x∗ )) ∈ ∆t for all t, and in particular x∗ ∈ ∆xt and f (x∗ ) ∈ ∆yt .
Define x^-_t = inf ∆x_t and x^+_t = sup ∆x_t. Then x^-_t (resp. x^+_t) is the intersection of f^0_t(·) (resp. f^1_t(·)) with the
horizontal line at y^low_t, that is,

∆x_t = [x^-_t, x^+_t] ,   x^-_t = x^0_t + (y^0_t − y^low_t)/(−f′(x^0_t)) ,   x^+_t = x^1_t − (y^1_t − y^low_t)/f′(x^1_t) .   (2)

Note that either x^0_t = x^-_t or x^1_t = x^+_t. Then the ∆-Bisection algorithm queries in the middle of ∆x_t:

x^q_t = (x^-_t + x^+_t)/2 .
As for bisection search, if f′(x^q_t) < 0, then x∗ > x^q_t and we set x^0_{t+1} = x^q_t and x^1_{t+1} = x^1_t; otherwise we set
x^0_{t+1} = x^0_t and x^1_{t+1} = x^q_t (the case illustrated in Figure 1). The pseudocode is given in Algorithm 1.
The y∗-gap is used to stop the algorithm when |∆y_t| ≤ y^tol for some user-specified tolerance y^tol. ∆y_t is simply
the y interval between the intersection Y_t of the two tangents, f^0_t and f^1_t, and the horizontal line at y^low_t:

∆y_t = [Y_t, y^low_t] ,   Y_t = f^0_t(x) for x s.t. f^0_t(x) = f^1_t(x) ,
Y_t = (y^1_t f′(x^0_t) − y^0_t f′(x^1_t) + f′(x^0_t) f′(x^1_t) (x^0_t − x^1_t)) / (f′(x^0_t) − f′(x^1_t)) .   (3)
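To make Eqs. (2) and (3) concrete, here is a small sketch (our own hypothetical helper, not from the paper) computing the x∗-gap endpoints and the tangent intersection Y_t from two queried points with gradients. For f(x) = x² on [−2, 1] it gives ∆x = [−1.25, 1] and Y = −2, so ∆y = [−2, 1] indeed contains f(x∗) = 0.

```python
def delta_bisection_gaps(x0, y0, g0, x1, y1, g1):
    # x*-gap endpoints (Eq. (2)) and tangent intersection Y (Eq. (3))
    # for two points (x0, y0), (x1, y1) with gradients g0 < 0 < g1.
    y_low = min(y0, y1)
    x_minus = x0 + (y0 - y_low) / (-g0)
    x_plus = x1 - (y1 - y_low) / g1
    Y = (y1 * g0 - y0 * g1 + g0 * g1 * (x0 - x1)) / (g0 - g1)
    return x_minus, x_plus, Y

# f(x) = x^2 on [-2, 1]: f(-2) = 4, f'(-2) = -4, f(1) = 1, f'(1) = 2.
print(delta_bisection_gaps(-2.0, 4.0, -4.0, 1.0, 1.0, 2.0))  # → (-1.25, 1.0, -2.0)
```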
Some experimental results can be found in Table 1. They show that ∆-Bisection always outperforms the
original bisection algorithm, often by a factor 2 or more.
Theorem 1 (∆-Bisection gap-reduction). For Algorithm 1, assuming ∆y_{t+1} > y^tol, at each iteration t ≥ 2,
the x∗-gap ∆x_t = [x^-_t, x^+_t] reduces at least by a factor of 2:

|∆x_{t+1}| / |∆x_t| = (1/2) · (y^low_{t+1} − z_t) / (max{y^q_t, y^low_t} − z_t) ≤ 1/2 ,   z_t = f^high_{t+1}(x^low_{t+1}) ,

where f^high_{t+1} is the tangent of f at x^high_{t+1}. ⋄

Proof. For the inequality, observe that max{y^q_t, y^low_t} ≥ y^low_t ≥ y^low_{t+1}, therefore |∆x_{t+1}|/|∆x_t| ≤ 1/2, since y^low_{t+1} > z_t.
For the equality, first of all, if f′(x^left) ≥ 0 or f′(x^right) ≤ 0, then the algorithm terminates immediately since
the y∗-gap ∆y_t is zero. Hence, in what follows we can safely assume that f′(x^left) < 0 and f′(x^right) > 0, which
implies that f′(x^0_t) ≤ 0 and f′(x^1_t) ≥ 0 for all t.
We focus on the case yt0 ≤ yt1 . The case yt0 > yt1 is handled similarly.
The result is shown by considering 3 separate cases. The first case, f ′ (xqt ) > 0 and ytq < yt0 , is considered in
Fig. 2 (see caption). The second case, f ′ (xqt ) < 0 and ytq < yt0 , is considered in Fig. 3. The last case, f ′ (xqt ) > 0
and yt0 < ytq < yt1 , is considered in Fig. 4. All of these cases follow by using proportionality and each figure shows
that the result holds for each particular case.
Finally, the case y^q_t > y^1_t is impossible by convexity, as is the case f′(x^q_t) < 0 and y^q_t > y^0_t.
Remark 2 (Root finding). ∆-Bisection can be adapted to improve upon bisection search for root finding of
monotone increasing functions, if integral information is available. ⋄

3.2 ∆-Secant: Convex line search without a gradient


When gradient information is not available, at the very least we can mimic the ∆-Bisection algorithm by querying
twice at two very close points, which straightforwardly (disregarding numerical stability issues) gives a reduction
of a factor at least 2 every second query, that is, a reduction of a factor √2 of the x∗-gap at every iteration on
average. This would require exactly twice as many iterations as the gradient-based algorithm, but querying twice

Figure 2: ∆-Bisection Case 1: f′(x^q_t) > 0 and y^q_t < y^0_t, that is, x∗ ∈ [x^0_t, x^q_t] or more precisely x∗ ∈
[x^-_{t+1}, x^q_t]. The query point (x^q_t, y^q_t) is denoted q on the picture. Then x^1_{t+1} = x^q_t = x^+_{t+1} and x^0_t = x^0_{t+1}. By
proportionality, |∆x_{t+1}| / ((1/2)|∆x_t|) = A/B = (y^q_t − z)/(y^0_t − z) = (y^low_{t+1} − z)/(y^low_t − z) with z = f^0_t(x^q_t) = f^high_{t+1}(x^low_{t+1}).

Figure 3: ∆-Bisection Case 2: f′(x^q_t) < 0 and y^q_t < y^0_t, that is, x∗ ∈ [x^q_t, x^1_t] or more precisely x∗ ∈
[x^q_t, x^+_{t+1}]. Then x^0_{t+1} = x^q_t = x^-_{t+1} and x^1_t = x^1_{t+1}. By proportionality, |∆x_{t+1}| / ((1/2)|∆x_t|) = A/B = (y^q_t − z)/(y^0_t − z) = (y^low_{t+1} − z)/(y^low_t − z) with
z = f^1_t(x^q_t) = f^high_{t+1}(x^low_{t+1}).

Figure 4: ∆-Bisection Case 3: f′(x^q_t) > 0 and y^0_t < y^q_t < y^1_t, that is, x∗ ∈ [x^0_t, x^q_t] or more precisely
x∗ ∈ [x^0_t, x^+_{t+1}]. Then x^1_{t+1} = x^q_t and x^0_t = x^0_{t+1} = x^-_{t+1}. Also note that y^low_t = y^low_{t+1} = y^0_t. By proportionality,
|∆x_{t+1}| / ((1/2)|∆x_t|) = A/B = (y^0_t − z)/(y^q_t − z) = (y^low_{t+1} − z)/(y^q_t − z) with z = f^q_t(x^0_t) = f^high_{t+1}(x^low_{t+1}).

f (x)                    xleft    xright   Bisection   ∆-Bisection

−x                       -20      7        0           0
|x|                      -20      7        37          1
max{−x, 2x}              -20      7        37          19
max{−x, 2x}              -0.01    100      37          14
|x|^1.1                  -20      7        34          11
x^2                      -20      7        20          12
√(1 + x^2)               -1000    900      27          19
x log x − x              0.001    20       20          12
max{x^2, (x − 3)^2}      -5       55       38          9
max{x^2, (x/2 − 3)^2}    -5       55       39          21
x^4                      -20      7        11          8
1/x^2 + x^2              0.001    100      23          18

Table 1: Number of iterations (not counting the initial queries at the boundaries) required by
bisection search and ∆-Bisection for several convex functions, until provably reducing the y ∗ -gap ∆yt
to at most a tolerance of y tol = 10−10 . Observe that when the minimum is at a boundary, ∆y0 = 0 and the
algorithms terminate immediately. ∆-Bisection is a little lucky for f (x) = |x| because the x∗ -gap is reduced to
0 on the second iteration, due to the symmetry of the function.

                                           w/ gradient queries       no gradient queries
f (x)                    x0       x1       Bisection   ∆-Bisection   ∆-Secant   GSS

−x                       -20      7        4           4             3          3
|x|                      -20      7        78          6             7          50
max{−x, 2x}              -20      7        78          42            23         50
max{−x, 2x}              -0.01    100      78          32            18         52
|x|^1.1                  -20      7        72          26            28         50
x^2                      -20      7        42          28            27         33
√(1 + x^2)               -1000    900      58          42            23         41
x log x − x              0.001    20       44          28            23         32
max{x^2, (x − 3)^2}      -5       55       80          22            18         60
max{x^2, (x/2 − 3)^2}    -5       55       82          46            26         58
x^4                      -20      7        26          20            18         21
1/x^2 + x^2              0.001    100      50          40            31         36

Table 2: Number of queries, including the initial queries at the boundaries, required by different
algorithms until provably reaching a y∗-gap of |∆y(Pt)| ≤ y^tol = 10^−10, for several convex functions. For
gradient-based algorithms, a function+gradient query counts as two queries. Note that our version of the
golden-section search algorithm (GSS) also uses the 5-point y ∗ -gap ∆y (Pt ) as stopping criterion.

Algorithm 1 (∆-Bisection) Line search for convex functions with gradient information.

def ∆-Bisection(f, f′, xleft, xright, y_tol):
    x0 = xleft;  y0 = f(x0);  g0 = f′(x0)
    x1 = xright; y1 = f(x1);  g1 = f′(x1)
    for t = 1, 2, ...:
        if |∆y_t| ≤ y_tol: return (x^low_t, y^low_t)   # see Eq. (3)
        x− = inf ∆x_t   # see Eq. (2)
        x+ = sup ∆x_t
        xq = (x− + x+)/2
        yq = f(xq);  gq = f′(xq)
        if gq < 0: (x0, y0, g0) = (xq, yq, gq)   # x∗ > xq
        else:      (x1, y1, g1) = (xq, yq, gq)   # x∗ ≤ xq
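The pseudocode above can be turned into a short, self-contained implementation. The sketch below is our own Python rendering of Algorithm 1 with Eqs. (2) and (3) inlined, not the authors' code; it assumes the user-supplied `fprime` returns a (sub)gradient, and handles boundary minima by early return.

```python
def delta_bisection(f, fprime, x_left, x_right, y_tol=1e-10, max_iter=200):
    """Sketch of Delta-Bisection: bisect the x*-gap [x-, x+] (Eq. (2))
    instead of the full interval, stopping when the y*-gap (Eq. (3))
    is at most y_tol. Returns a point (x, y) with y - y* <= y_tol."""
    x0, x1 = x_left, x_right
    y0, g0 = f(x0), fprime(x0)
    y1, g1 = f(x1), fprime(x1)
    for _ in range(max_iter):
        if g0 >= 0:   # minimum at the left boundary: y*-gap is zero
            return x0, y0
        if g1 <= 0:   # minimum at the right boundary
            return x1, y1
        y_low = min(y0, y1)
        # Eq. (2): endpoints of the x*-gap.
        x_minus = x0 + (y0 - y_low) / (-g0)
        x_plus = x1 - (y1 - y_low) / g1
        # Eq. (3): lower end of the y*-gap (intersection of the two tangents).
        Y = (y1 * g0 - y0 * g1 + g0 * g1 * (x0 - x1)) / (g0 - g1)
        if y_low - Y <= y_tol:
            return (x0, y0) if y0 <= y1 else (x1, y1)
        xq = 0.5 * (x_minus + x_plus)   # query the middle of the x*-gap
        yq, gq = f(xq), fprime(xq)
        if gq < 0:
            x0, y0, g0 = xq, yq, gq     # x* > xq
        else:
            x1, y1, g1 = xq, yq, gq     # x* <= xq
    return (x0, y0) if y0 <= y1 else (x1, y1)
```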

at the same point seems wasteful: after the first one of two close queries, some information is already gained and
the x∗ -gap has likely already reduced, so why then query at almost the same point again?
We propose a no-gradient line search algorithm ∆-Secant for general convex functions that is also guaranteed
to reduce the x∗ -gap by a factor at least 2 at every second iteration, but is likely to reduce it by more than this.
This is validated on experiments, which also show that ∆-Secant is often significantly faster than golden-section
search, and than ∆-Bisection when counting a gradient query as a separate query.
First we need to define a version of the optimality region ∆(P ) based only on a set of points P , without
gradient information.

3.2.1 Five-point optimality region


Let f^ab(x) be the line going through the points p^a and p^b (see Section 2). Let f be a convex function on
[x^left, x^right], and let p^a = (x^a, f(x^a)), p^b = (x^b, f(x^b)); then we know by convexity that not only f(x) ≤ f^ab(x)
for all x ∈ [x^a, x^b], but just as importantly we also have f(x) ≥ f^ab(x) for all x ∉ [x^a, x^b]. Let P be a
(f, x^left, x^right)-convex set of points. Recall that x∗ ∈ argmin_{x∈[x^left,x^right]} f(x). Then, we use these convexity
constraints to define the optimality region ∆(P) that contains (x∗, f(x∗)), together with f(x∗) ≤ y^low(P) (see
Section 2):
  
∆(P) = { (x, y) ∈ R² | x ∈ [ min_{(x′,·)∈P} x′ , max_{(x′,·)∈P} x′ ] ∧
                       y ≤ y^low(P) ∧
                       ∀ p^i, p^j ∈ P, x ∉ (x^i, x^j) : f^ij(x) ≤ y } ,   (4)

and we define the x∗ -gap ∆x (P ) and y ∗ -gap ∆y (P ):

∆x (P ) = {x | (x, ·) ∈ ∆(P )} , ∆y (P ) = {y | (·, y) ∈ ∆(P )} . (5)

Example 3. Take f(x) = x, and P = {(0, 0), (1/2, 1/2), (1, 1)}; then ∆(P) = {(0, 0)}. ⋄
Example 4. Take f (x) = |x|, and P = {(−10, 10), (−5, 5), (−1, 1), (1, 1), (5, 5), (10, 10)} then ∆x (P ) = [−1, 1]
and ∆y (P ) = [0, 1]. ⋄
Theorem 5 (Optimality region). Let P be a (f, x^left, x^right)-convex set of points. Then,

(x∗, f(x∗)) ∈ ∆(P) for all x∗ ∈ argmin_{x∈[x^left,x^right]} f(x) . ⋄

Proof. First, by definition of x∗ and P, we must have f(x∗) ≤ y^low(P). Additionally, for all (x^1, y^1) ∈ P and
(x^2, y^2) ∈ P, if x∗ ∉ (x^1, x^2) then by convexity of f we have that f(x∗) ≥ f^12(x∗). Therefore, by definition of ∆
in Eq. (4), it follows that (x∗, f(x∗)) ∈ ∆(P).
If only two points are available, that is, |P | = 2, then ∆x (P ) = [xlow (P ), xhigh (P )], and ∆y (P ) = (−∞, y low (P )].
For |P | ≥ 3, ∆y (P ) is (usually) not infinite anymore, and the optimality region ∆(P ) consists of at most two
triangles, while both ∆x (P ) and ∆y (P ) are bounded intervals — see Fig. 5. Also, it turns out that having more
than 5 points does not help, because the constraints generated by the points further away from plow (P ) provide
only redundant information.
Theorem 6 (Five points are enough). Let P = {p^i}_{i∈[|P|]} be a (f, x^left, x^right)-convex set of points such that
x^1 ≤ x^2 ≤ . . . Let k be such that p^k = p^low(P), and assume² that 3 ≤ k ≤ |P| − 2. Define P′ to be the subset
of P containing five points centered around the minimum, that is,

P′ = {p^{k−2}, p^{k−1}, p^k, p^{k+1}, p^{k+2}} ,

then P′ is (f, x^left, x^right)-convex and

∆(P′) = ∆(P) . ⋄

² This assumption can always be made to hold by adding specific virtual points to P.

Proof. Observe that necessarily ∆x(P) ⊆ [x^{k−1}, x^{k+1}]. Recall that f^{kj} is the line going through p^k and p^j.
First, we show that p^{k−3} (if it exists) is superfluous.
For all i ≥ k + 1, for all x ∈ ∆x(P), necessarily x ∈ [x^{k−3}, x^i], and thus the constraint on f^{k−3,i} in
Eq. (4) is vacuous. Thus, all constraints in Eq. (4) involving f^{k−3,i} can be removed.
For all i ≤ k − 1, i ≠ k − 3, by convexity we have f^{i,k−3}(x^{k−1}) ≤ y^{k−1} and thus for all x ≥ x^{k−1}, we have
f^{i,k−3}(x) ≤ f^{i,k−1}(x). Hence, for all x ≥ x^{k−1}, for all y, f^{i,k−1}(x) ≤ y implies that f^{i,k−3}(x) ≤ y. Thus all
constraints involving f^{i,k−3} in Eq. (4) for all i ≤ k − 1, i ≠ k − 3 are superfluous.
Similarly, the constraint involving f k−3,k is made redundant by the constraint involving f k−1,k .
Therefore, all constraints in Eq. (4) involving pk−3 are superfluous, and thus ∆(P \ {pk−3 }) = ∆(P ).
Recursively, we can thus eliminate from P all points to the left of pk−2 , and similarly we can eliminate all
points to the right of k + 2. Finally, there remain only the points of P ′ .
The fact that P ′ is (f, xleft , xright )-convex is now straightforward.
For simplicity, in what follows we consider the set Pt of all points queried up to step t, but an algorithm
should instead only maintain the 5 points around the minimum — as per Theorem 6.
Example 7 (4-point vs 3-point y ∗ -gap). Take f (x) = |x|, and P = {(−a, a), (ε, ε), (1, 1)} for ε ∈ (0, 1) and
a > 0, and take Q = P ∪ {(−(a + 1), a + 1)}. Note that the additional point in Q is further away from the
minimum 0 of f than the points in P . Then it is easy to check that |∆y (P )| = a + ε, while |∆y (Q)| = ε. Hence,
a 4-point y∗-gap can be arbitrarily more accurate than a 3-point one. Also note that more points only constrain
the optimality region more, that is, ∆(P ∪ {p}) ⊆ ∆(P) for any point p as long as P ∪ {p} satisfies the assumptions
of Theorem 6. ⋄
The optimality region ∆(·) can be used with existing line-search algorithms, such as golden-section search
for example, assuming the function to minimize is convex, stopping the line search when |∆y (Pt )| ≤ y tol .
For a given set P = {p^i}_{i∈{0,1,2,3,4}} of five points, assuming p^low(P) = p^2, similarly to Eqs. (2) and (3) we
have

x^- = inf ∆x(P) = x^1 + (y^2 − y^1)/a^{01} ,   a^{01} = (y^0 − y^1)/(x^0 − x^1) ,
x^+ = sup ∆x(P) = x^3 + (y^2 − y^3)/a^{34} ,   a^{34} = (y^3 − y^4)/(x^3 − x^4) ,
|∆x(P)| = x^+ − x^- ,
|∆y(P)| = y^2 − min{y^{01,23}, y^{12,34}} ,   (6)

where x^- is the intersection of f^{01} with the horizontal line at y^2 = y^low(P), and x^+ is the intersection of f^{34} with
y^2. See Eq. (1) for the definition of y^{ij,kl} and Appendix B for numerical stability considerations.
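These formulas can be checked numerically. The sketch below (our own helper, not from the paper) computes x^-, x^+ and |∆y(P)| from five points. For f(x) = x² with P = {(−3, 9), (−2, 4), (0, 0), (1, 1), (2, 4)} it yields ∆x(P) = [−1.2, 2/3] and |∆y(P)| = 1, and indeed (x∗, y∗) = (0, 0) lies in [−1.2, 2/3] × [−1, 0].

```python
def _slope(p, q):
    return (p[1] - q[1]) / (p[0] - q[0])

def _intersect_y(pa, pb, pc, pd):
    # y-coordinate of the intersection of the secants (pa,pb) and (pc,pd), Eq. (1).
    a1, a2 = _slope(pa, pb), _slope(pc, pd)
    b1, b2 = pa[1] - a1 * pa[0], pc[1] - a2 * pc[0]   # intercepts f(0)
    return (a2 * b1 - a1 * b2) / (a2 - a1)

def five_point_gaps(P):
    # Eq. (6): x*-gap endpoints and y*-gap size for five points sorted by x,
    # with the minimum function value at P[2].
    p0, p1, p2, p3, p4 = P
    x_minus = p1[0] + (p2[1] - p1[1]) / _slope(p0, p1)
    x_plus = p3[0] + (p2[1] - p3[1]) / _slope(p3, p4)
    y_gap = p2[1] - min(_intersect_y(p0, p1, p2, p3),
                        _intersect_y(p1, p2, p3, p4))
    return x_minus, x_plus, y_gap
```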
Remark 8 (Missing points). If there are fewer than 2 points to the left and right of p^low(P), virtual points can
be added to restore this property. Assume P = {p^2, p^3, . . . , p^{n−2}} with x^2 ≤ x^3 ≤ · · · ≤ x^{n−2} is (f, x^left, x^right)-
convex. Define Q = P ∪ {p^0, p^1, p^{n−1}, p^n} where x^0 = x^2 − 2ε, x^1 = x^2 − ε, x^{n−1} = x^{n−2} + ε, x^n = x^{n−2} + 2ε, and
y^0 = y^1 = y^{n−1} = y^n = ∞ for some infinitesimally small ε. Then necessarily p^low(P) is surrounded by 2 points
on each side, while ∆(Q) = ∆(P) since no information regarding the location of the minimum is introduced,
since by assumption (x∗, f(x∗)) ∈ ∆(P) for x∗ ∈ argmin_{x∈[x^left,x^right]} f(x). Technically, Q is not (f, x^left, x^right)-convex, but instead
(f˜, x^low(P), x^high(P))-convex where f˜ = f except on the additional points of Q. ⋄

3.2.2 ∆-Secant
The ∆-Secant algorithm is an adaptation of ∆-Bisection, using ∆(Pt ) instead of ∆t . See Algorithm 2. It may
happen in rather specific situations that xlow (Pt ) is exactly the midpoint of ∆x (Pt ), which means that the query
xqt would coincide with xlow (Pt ). This poses no issue with the theory below as we can simply take the limit
toward the midpoint, in which case the secant of these two infinitesimally close points would give a gradient
information at xlow (Pt ). However, it does pose numerical issues as the algorithm would instead loop forever.
This can be easily circumvented by moving the query slightly away³ from x^low(P_t) by some factor ε of |∆x(P_t)|.
³ Still for numerical stability, this 'repulsion' should even be applied whenever |x^low(P_t) − x^m_t| < ε, where x^m_t is the midpoint of ∆x(P_t). Querying the gradient
at x^low(P_t), if available, would be even better.

Algorithm 2 (∆-Secant line search algorithm for 1-d convex minimization) The improvement of
Remark 9 is not implemented. We advise to take ε = 2^−7. Note that in practice at most 5 points around p^low(P_t)
actually need to be maintained in P_t, as per Theorem 6. We decompose ∆-Secant into two functions because we
will re-use ∆-Secant_core elsewhere.

def ∆-Secant(f, xleft, xright, y_tol):
    def stop_when(P): return |∆y(P)| ≤ y_tol
    P = {(xleft, f(xleft)), (xright, f(xright))}
    return ∆-Secant_core(f, P, stop_when)

def ∆-Secant_core(f, P1, stop_when):
    for t = 1, 2, ...:
        if stop_when(Pt): return Pt
        x− = inf ∆x(Pt)   # see Eq. (6)
        x+ = sup ∆x(Pt)
        xm = (x− + x+)/2
        if (xm, ·) ∈ Pt:
            xq = xm + ε(x+ − x−)   # necessarily, xq ∉ Pt for ε ∈ (0, 1/2)
        else:
            xq = xm
        Pt+1 = Pt ∪ {(xq, f(xq))}   # simple but inefficient: some points can be removed
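For reference, here is a compact, self-contained Python sketch of ∆-Secant (ours, not the authors' Racket implementation). For simplicity it keeps all queried points, falls back to the interval boundaries when fewer than two points lie on one side of the minimum (cf. Remark 8), and treats a collapsed x∗-gap as a boundary minimum.

```python
def _slope(p, q):
    return (p[1] - q[1]) / (p[0] - q[0])

def _intersect_y(pa, pb, pc, pd):
    # y-coordinate of the intersection of the secants (pa,pb) and (pc,pd).
    a1, a2 = _slope(pa, pb), _slope(pc, pd)
    b1, b2 = pa[1] - a1 * pa[0], pc[1] - a2 * pc[0]
    return (a2 * b1 - a1 * b2) / (a2 - a1)

def delta_secant(f, x_left, x_right, y_tol=1e-10, eps=2**-7, max_iter=1000):
    # Sketch of Algorithm 2. P holds all queried points sorted by x
    # (by Theorem 6, 5 points around the minimum would suffice).
    P = [(x_left, f(x_left)), (x_right, f(x_right))]
    for _ in range(max_iter):
        n = len(P)
        k = min(range(n), key=lambda i: P[i][1])   # index of p^low(P)
        y_low = P[k][1]
        # x*-gap endpoints, Eq. (6); boundary fallback with < 2 side points.
        x_minus = (P[k-1][0] + (y_low - P[k-1][1]) / _slope(P[k-2], P[k-1])
                   if k >= 2 else P[0][0])
        x_plus = (P[k+1][0] + (y_low - P[k+1][1]) / _slope(P[k+1], P[k+2])
                  if k <= n - 3 else P[-1][0])
        if x_plus - x_minus < 1e-12:   # x*-gap collapsed (boundary minimum)
            return P[k]
        # y*-gap, Eq. (6), once 2 points surround the minimum on each side.
        if 2 <= k <= n - 3:
            y_lo = min(_intersect_y(P[k-2], P[k-1], P[k], P[k+1]),
                       _intersect_y(P[k-1], P[k], P[k+1], P[k+2]))
            if y_low - y_lo <= y_tol:
                return P[k]
        xm = 0.5 * (x_minus + x_plus)
        if any(abs(x - xm) < 1e-15 for x, _ in P):   # 'repulsion' from known points
            xm += eps * (x_plus - x_minus)
        P.append((xm, f(xm)))
        P.sort()
    return P[k]
```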

Remark 9 (Saving one query). One early function query can sometimes be saved: given the initial bounds
xleft and xright , after obtaining f (xleft ) and f (xright ), a query at xq1 = (xleft + xright )/2 is always required since
the algorithm cannot terminate with only 2 points, since |∆y (·)| = ∞ (by contrast to ∆-Bisection). However, if
we query at xq1 before querying at xright and if it happens that f(xleft) ≤ f(xq1), then by convexity it is not useful
to query at xright . ⋄

Theorem 10 (∆-Secant ∆x reduction). For ∆-Secant in Algorithm 2 where we take ε → 0, for any iteration
t ∈ N1 such that |∆y(P_{t+1})| > y^tol (i.e., the algorithm does not terminate before t + 2), the x∗-gap reduces
by at least a factor 2 every second iteration: |∆x(P_{t+2})| ≤ (1/2)|∆x(P_t)|. ⋄

Proof. First, observe that for all t ∈ N1 , ∆(Pt+1 ) ⊆ ∆(Pt ) since by definition of ∆ in Eq. (4), Pt+1 is more
constrained than Pt . Also observe that plow (Pt ) ∈ ∆(Pt ) (see Fig. 5).
Let [x^-_t, x^+_t] = ∆x(P_t) and x^q_t = (x^-_t + x^+_t)/2 for all t.
Case 1) If y^q_t ≥ y^low(P_t), then necessarily either x∗ ∈ [x^-_t, x^q_t] or x∗ ∈ [x^q_t, x^+_t] — depending on which
segment contains x^low(P_{t+1}) = x^low(P_t) — and this information is reflected in ∆(P_{t+1}). Thus |∆x(P_{t+1})| ≤
max{x^q_t − x^-_t, x^+_t − x^q_t} = (1/2)|∆x(P_t)|. Hence |∆x(P_{t+2})| ≤ |∆x(P_{t+1})| ≤ (1/2)|∆x(P_t)|.
Case 2) Assume instead that y^q_t ≤ y^low(P_t).
Case 2a) At t + 1, if y^q_{t+1} ≥ y^low(P_{t+1}) = y^q_t, then by case 1) we have |∆x(P_{t+2})| ≤ (1/2)|∆x(P_{t+1})|, and thus
|∆x(P_{t+2})| ≤ (1/2)|∆x(P_t)|.
Case 2b) If instead y^q_{t+1} ≤ y^low(P_{t+1}) = y^q_t, then necessarily either x∗ ∈ [x^-_t, x^q_t] or x∗ ∈ [x^q_t, x^+_t] —
depending on which segment contains x^low(P_{t+2}) = x^q_{t+1} — and this information is reflected in ∆(P_{t+2}). Then
again |∆x(P_{t+2})| ≤ max{x^q_t − x^-_t, x^+_t − x^q_t} = (1/2)|∆x(P_t)|.

Remark 11 (Golden-section search). Let P^GSS_t be the set of all points queried by golden-section search
on a convex function f up to step t. Then, as mentioned earlier, golden-section search is only guaranteed to
reduce the x interval x^{k+1}_t − x^{k−1}_t around the current minimum x^k_t = x^low(P^GSS_t) (with x^{k−1}_t, x^k_t, x^{k+1}_t ∈ P^GSS_t)
by a factor (1 + √5)/2 ≈ 1.62 at every iteration, but this does not necessarily lead to a strict reduction of the
x∗-gap ∆x(P^GSS_t) at every (second) iteration. In fact, this may not even reduce ∆x(P^GSS_t) for many iterations.
Consider for example the function f(x) = √(1 + x²) for [x^left, x^right] = [−1000, 900]. This function behaves like
|x| outside of [−1, 1], and like x² inside [−1, 1]. After 4 iterations of ∆-Secant, the x∗-gap is reduced to less than
2, while it takes 13 iterations for GSS to reach the same ∆x precision. ⋄

Figure 5: (∆-Secant) The optimality region ∆ based on 5 points P = {p^0, . . . , p^4} on a convex function f,
without gradient information. The middle of ∆x(P) corresponds to the query performed by ∆-Secant.

3.3 Quasi-exact line search for learning rate tuning


In this section we consider optimizing a multidimensional function f : Rd → R, using a base algorithm such as
Gradient Descent (GD) where the learning rate is tuned at each step using a line search. The typical line search
algorithm for such cases is a backtracking line search with Armijo’s stopping condition (Armijo, 1966), which is
called an inexact line search (Boyd and Vandenberghe, 2004).
Boyd and Vandenberghe (2004) mention that people have stopped using exact line searches in favor of
backtracking line search because the former were deemed too computationally expensive.
Using the y ∗ -gap ∆y , we change the stopping criterion of ∆-Secant so as to be suitable for tuning learning
rates, and prove convergence guarantees when combined with GD or Frank-Wolfe. We call this type of line search
algorithm a quasi-exact line search.
We also show that Armijo’s condition can lead to poor performance even on simple cases such as a quadratic
and an exponential, and that neither large nor small values of the stopping criterion parameter are universally
suitable. By contrast, a quasi-exact line search appears significantly more stable for a wide range of values of its
single parameter.
More precisely, at gradient descent iteration k, the current point being x_k ∈ R^d, define the 1-d convex function
f˜ : [0, ∞) → R:

f˜(α) = f(x_k − α ∇f(x_k)) .

It is known that if the function is smooth (see Section 2), there exists an α′ > 0 such that f˜(α) < f˜(0) for all
α ∈ (0, α′]. The quasi-exact line search runs ∆-Secant until the y∗-gap (which is an upper bound on y^low(P_t) − y∗)
is smaller than a constant fraction of the current improvement from the first queried point, y^left(P_t) − y^low(P_t):
y left (Pt ) − y low (Pt ) ≥ c|∆y (Pt )| (7)
for the algorithm parameter c > 0. See quasi_exact_line_search in Algorithm 3. Moreover, if the point
returned by ∆-Secant is at the end of the original interval, the interval size is increased by a factor 4 and
∆-Secant is run again, and so on until the returned point is not at the boundary. This scheme assumes
that f˜(α) eventually increases with α, which applies in particular to strongly-convex functions (see Section 2).
quasi_exact_line_search can also use an initial query αprev , which defaults to 1. We also propose a version
of GD+Backtracking that increases the interval size as required, see Algorithm 4 — also see Unbounded
Backtracking (Truong and Nguyen, 2021).
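The growth-by-4 scheme can be sketched in isolation. In the snippet below, a plain ternary search stands in for ∆-Secant as the bounded 1-d minimizer, and the function and tolerance names are ours:

```python
def minimize_bounded(f, lo, hi, tol=1e-9):
    # Stand-in bounded 1-d minimizer: plain ternary search on a unimodal f.
    # The paper uses Delta-Secant here instead.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def expanding_line_search(f, alpha_prev=1.0, tol=1e-9):
    # Grow the right end of the interval by a factor 4 until the
    # returned minimizer is interior (not at the right boundary).
    for i in range(64):
        right = alpha_prev * 4**i
        alpha = minimize_bounded(f, 0.0, right, tol)
        if alpha < right - 10 * tol:  # minimum not at the right boundary
            return alpha
    raise RuntimeError("no interior minimum found")
```

As in the text, this assumes f˜(α) eventually increases with α; otherwise the loop keeps growing the interval.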

We show in Theorem 12 that the terminating condition Eq. (7) incurs a slowdown factor of c/(c+1) compared
to an exact line search (c = ∞), in terms of the number of gradient descent steps. Note that the conditions for
Theorem 12 are weaker than smoothness and strong-convexity.
Theorem 12 (Quasi-exact line search). Let f˜ : [0, ∞) → R be a convex function. Run quasi_exact_line_search
(Algorithm 3) on f˜(·) and c > 0 and collect the return value αc . Let α∞ ∈ argminα≥0 f˜(α) and assume α∞ < ∞
and that f˜(∞) > f˜(α∞ ). Then

f˜(0) − f˜(αc ) ≥ c/(c + 1) · (f˜(0) − f˜(α∞ )) . ⋄
Proof. First, since α∞ < ∞ and f˜(∞) > f˜(α∞ ), there exists an iteration i of quasi_exact_line_search
at which ∆-Secant_core returns with xlow (Pi ) < xright_i .
Let P be the points returned by ∆-Secant at the last iteration of quasi_exact_line_search. From Theo-
rem 5 we know that |∆y (P )| ≥ y low (P ) − f˜(α∞ ) = f˜(αc ) − f˜(α∞ ). Thus by Eq. (7),

f˜(0) − f˜(αc ) = y left (P ) − y low (P ) ≥ c|∆y (P )| ≥ c(f˜(αc ) − f˜(α∞ )) .

Adding cf˜(0) − cf˜(αc ) on each side, we obtain

(c + 1)(f˜(0) − f˜(αc )) ≥ c(f˜(0) − f˜(α∞ )) ,

and the result follows by dividing both sides by c + 1.


We now prove convergence guarantees for Gradient Descent with a quasi-exact line search, using a straight-
forward adaptation of the proof by Boyd and Vandenberghe (2004, Section 9.3.1).

Algorithm 3 (Gradient descent with quasi-exact line search) for strongly convex and smooth functions
on unbounded domains. The line search is repeated with an increased range as long as the minimum is at the
right boundary. This allows adapting to the smoothness parameter M without having to know it, even if M < 1.
Note that ∆-Secant is very fast when the minimum is at a boundary, in particular since points of the previous
iteration are reused. The previously found learning rate αprev is also used as a starting point for the line search.
See Appendix B for numerical stability.

def quasi_exact_line_search(f˜, c, αprev ):
    # stopping criterion of Eq. (7)
    def stop_when(P ): return c|∆y (P )| ≤ f˜(0) − y low (P )

    P0 = {(0, f˜(0))}
    for i = 0, 1, 2, . . . :
        xright = αprev · 4^i
        # Re-use previous points and add the new upper bound,
        # which forces ∆x (Pi′ ) > ∆x (Pi ) by Eq. (4)
        Pi′ = Pi ∪ {(xright , f˜(xright ))}
        Pi+1 = ∆-Secant_core(f˜, Pi′ , stop_when)
        if xlow (Pi+1 ) < xright : # minimum not at the right boundary
            return xlow (Pi+1 )

def GD_quasi_exact(f , x0 , c = 1):
    x1 = x0
    αprev = 1 # previous learning rate
    for k = 1, 2, . . . :
        def f˜(α): return f (xk − α∇f (xk ))
        αc = quasi_exact_line_search(f˜, c, αprev )
        xk+1 = xk − αc ∇f (xk )
        αprev ← αc

Algorithm 4 (Gradient Descent with backtracking line search) for strongly convex and smooth functions
on unbounded domains. The line search is repeated with an increased range as long as the minimum is at the
right boundary. This allows adapting to the smoothness parameter M without having to know it, even if M < 1.
The previously found learning rate αprev is also used as a starting point for the line search. See also Truong and
Nguyen (2021).

def backtracking(f˜, G, τ , ε, αprev ):
    α = αprev
    while True:
        i = 0
        while f˜(0) − f˜(α) < εαG:
            α ← ατ
            i += 1
        if i == 0: # α possibly too small
            α ← 4α
        else:
            return α

def GD_backtracking(f , x0 , τ = 1/2, ε):
    x1 = x0
    αprev = 1
    for k = 1, 2, . . . :
        def f˜(α): return f (xk − α∇f (xk ))
        α = backtracking(f˜, ∥∇f (xk )∥22 , τ , ε, αprev )
        xk+1 = xk − α∇f (xk )
        αprev ← α

Theorem 13 (GD with quasi-exact line search). Consider an m-strongly convex and M -smooth function
f : Rd → R with (unique) minimizer x∗ ∈ Rd . Starting at x0 ∈ Rd , after k iterations of gradient descent with
quasi-exact line search (Algorithm 3), we have

f (xk+1 ) − f (x∗ ) ≤ (f (x1 ) − f (x∗ )) (1 − c/(c + 1) · m/M )^k . ⋄

Proof. First, recall that strong convexity implies that the minimizer x∗ is unique. At iteration k, let
αk^∞ ∈ argminα≥0 f (xk − α∇f (xk )), and let xk^∞ = xk − αk^∞ ∇f (xk ). From Boyd and Vandenberghe
(2004, Chapter 9.3), at gradient iteration k:

f (xk^∞ ) ≤ f (xk ) − (1/(2M )) ∥∇f (xk )∥^2 .

Together with Theorem 12,

f (xk ) − f (xk+1 ) ≥ (c/(c + 1)) (f (xk ) − f (xk^∞ )) ≥ (c/(c + 1)) (1/(2M )) ∥∇f (xk )∥^2 .

Rearranging and subtracting f (x∗ ) on both sides,

f (xk+1 ) − f (x∗ ) ≤ f (xk ) − f (x∗ ) − (c/(c + 1)) (1/(2M )) ∥∇f (xk )∥^2 .

Following Boyd and Vandenberghe (2004) again, we combine this with ∥∇f (xk )∥^2 ≥ 2m(f (xk ) − f (x∗ )) to obtain

f (xk+1 ) − f (x∗ ) ≤ (f (xk ) − f (x∗ )) (1 − c/(c + 1) · m/M )
≤ · · ·
≤ (f (x1 ) − f (x∗ )) (1 − c/(c + 1) · m/M )^k .
Note that this implies

f (xk+1 ) − f (x∗ ) ≤ (f (x1 ) − f (x∗ )) exp(−k · c/(c + 1) · m/M ) ,
which suggests that with the default value c = 1, the quasi-exact line search is at most a factor 2 slower (in the
number of iterations k) than using an exact line search, while being significantly more efficient in the number of
function queries.
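To make the factor-2 claim concrete, here is a small back-of-the-envelope computation comparing the number of iterations implied by the two linear rates, for an illustrative conditioning m/M (the names are ours):

```python
import math

def iters(rho, tol=1e-10):
    # GD steps needed to shrink the optimality gap by a factor 1/tol at
    # linear rate rho, i.e. the smallest k with (1 - rho)^k <= tol
    return math.log(1 / tol) / -math.log(1 - rho)

m_over_M = 0.01                  # illustrative conditioning m/M
k_exact = iters(m_over_M)        # exact line search: rho = m/M
k_quasi = iters(0.5 * m_over_M)  # c = 1: rho = (c/(c+1)) m/M = m/(2M)
ratio = k_quasi / k_exact        # close to 2 when m/M is small
```

For small m/M the ratio is essentially (c + 1)/c, i.e. 2 for the default c = 1.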
Remark 14 (Initial gradient). Algorithm 3 can also be modified to make use of the known initial gradient,
by adding two virtual points, at −1 and −2 for example. The gradient information tells us that

f (x − α∇f (x)) ≥ f (x) − α∥∇f (x)∥22 ,

that is, f˜(α) ≥ f˜(0) − α∥∇f (x)∥22 . Hence, at gradient descent iteration k, we can augment the initial set of points
P0 of quasi_exact_line_search in Algorithm 3 with {(−2, f (xk ) + 2∥∇f (xk )∥22 ), (−1, f (xk ) + ∥∇f (xk )∥22 )}.
See Appendix B, however. ⋄

3.4 Comparison with backtracking line search


The backtracking line search with Armijo’s stopping condition (Armijo, 1966) is an appealing algorithm: it is
simple to implement, has good convergence guarantees, and works well in practice. Armijo’s condition stops the
backtracking line search at the first iteration t such that
f (x) − f (x − τ t ∇f (x)) ≥ ετ t ∥∇f (x)∥22 (8)
where τ ∈ (0, 1) is the exponential decrease factor of the learning rate, and ε ∈ (0, 1) is the ‘acceptable linearity’
parameter.
Boyd and Vandenberghe (2004) provide the following bound for GD with backtracking line search. We
slightly modify the traditional backtracking line search to extend the interval and start again whenever the
returned value is at the boundary. See Algorithm 4, but compare Truong and Nguyen (2021) who define similar
algorithms. Then, if f is an m-strongly convex and M -smooth function, then GD+backtracking converges
linearly in unconstrained optimization:
f (xk+1 ) − f (x∗ ) ≤ (f (x1 ) − f (x∗ )) (1 − 2τ ε m/M )^k ≤ (f (x1 ) − f (x∗ )) exp(−2kτ ε m/M ) .
The proof is a simple adaptation of the standard proof (Boyd and Vandenberghe, 2004). In what follows we
always use τ = 1/2, which strikes a good balance between practical speed and theoretical analysis, as it leads to
a learning rate that is always within a factor 2 of the optimal learning rate that satisfies Armijo’s condition. If
αε is the optimal learning rate for Armijo’s condition for a fixed ε, only log(1/αε )/ log(1/τ ) iterations are needed
for the line search to stop within a factor τ of αε . While taking τ ≪ 1/2 may sometimes speed up the line
search by a small log factor, it also degrades the guarantee by a much larger linear factor: e.g., if τ = 1/1000,
this can speed up the search by a factor log 1000/ log 2 < 10, but the convergence speed guarantee degrades by
a factor 500(!) While setting τ = 1/2 is a good universal choice, finding the right value for ε is another matter
altogether, as we show in this section. The closer ε is to 1, the more we require the line between (xk , f (xk )) and
(xk+1 , f (xk+1 )) to be close to the tangent of f at xk , at GD iteration k. This means that many line search
iterations may be required to find a point that is close enough to the tangent, and in practice a small value of ε
is chosen — Boyd and Vandenberghe (2004) mention ε ∈ [0.01, 0.3]. But even taking ε = 0.1 ‘slows down’ the
convergence guarantee by a factor 10.
The following example shows (as is well-known) that small values of ε can make the algorithm sluggish even
for quadratic functions — the ‘nicest’ class of functions for convex minimization, where M = m.
Example 15 (Armijo's flying squirrel). Take f (x) = ax^2 . Then,

f (x) − f (x − τ^t 2ax) = ax^2 − ax^2 (1 − 2aτ^t )^2 = 4a^2 x^2 τ^t (1 − aτ^t ) .

Hence, Armijo's condition Eq. (8) is satisfied for the smallest integer t ≥ 0 such that

4a^2 x^2 τ^t (1 − aτ^t ) ≥ ετ^t ∥∇f (x)∥^2 = 4εa^2 x^2 τ^t ⇔ 1 − aτ^t ≥ ε ,

and the condition must also not be satisfied at t − 1:

4a^2 x^2 τ^{t−1} (1 − aτ^{t−1} ) < ετ^{t−1} ∥∇f (x)∥^2 = 4ετ^{t−1} a^2 x^2 ⇔ 1 − aτ^{t−1} < ε .

Take a = τ^{−t′} (1 − ε), with t′ ≥ 0 integer and ε ∈ (0, 1). Then it can be verified that the two conditions above are
satisfied for t = t′ . Furthermore, x − τ^t ∇f (x) = x − 2aτ^t x = −(1 − 2ε)x for t = t′ , which means that at iteration k
of GD with backtracking, we have xk+1 = −(1 − 2ε)xk . That is, xk constantly jumps between the left-hand side
and right-hand side branches of the quadratic function — like an over-excited flying squirrel — getting closer to
0 only by a factor |1 − 2ε| at each iteration.
For small ε, the number of iterations it takes before |xk | = |x1 | (1 − 2ε)^k ≤ 1 is linear in 1/ε. This means
that GD with backtracking with parameters τ and ε then takes a number of steps linear in
1/ε to get close to 0 on the x-axis. For large ε such as ε = 1/2, this is all fine, but for small ε such as ε = 1/100 this becomes
unacceptable. Note that the problem is not linked to τ , or to the line search being bounded in [0, 1]. See Table 3
for experimental results confirming this analysis. ⋄
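The slowdown can be checked numerically. The sketch below simulates GD with plain Armijo backtracking (restarted from α = 1 at each step, without the interval-growing scheme of Algorithm 4) on the quadratic f (x) = 3.95x^2 used in Table 3; the helper names are ours:

```python
A = 3.95  # the quadratic coefficient from Table 3

def f(x):
    return A * x * x

def grad(x):
    return 2 * A * x

def backtrack(x, eps, tau=0.5):
    # shrink alpha until Armijo's condition Eq. (8) holds
    alpha, g = 1.0, grad(x)
    while f(x) - f(x - alpha * g) < eps * alpha * g * g:
        alpha *= tau
    return alpha

def gd_iters(eps, x0=1000.0, tol=1e-10):
    # number of GD steps until f(x) - f(x*) <= tol (here f(x*) = 0)
    x, k = x0, 0
    while f(x) > tol:
        x -= backtrack(x, eps) * grad(x)
        k += 1
    return k
```

With ε = 0.5 the run takes a handful of GD steps, while ε = 0.01 takes on the order of 750, roughly matching the first column of Table 3.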

The previous example shows that taking a small value for ε can lead to slow progress. The next example
shows exactly the opposite: that large values of ε can also lead to slower progress than necessary — see also the
example of Equation (9.20) of Boyd and Vandenberghe (2004).
Example 16 (Armijo's snowplow skier). Consider f (x) = e^{ax} for a ≥ 1. Define g(x) = ∇f (x) = ae^{ax} ; then
the backtracking line search stops at step t∗ such that (see Eq. (8))

t∗ = min{t ∈ N0 : f (x) − f (x − τ^t g(x)) ≥ ετ^t g(x)^2 }
   = min{t ∈ N0 : f (x)(1 − exp(−aτ^t g(x))) ≥ ετ^t g(x)^2 }
   = min{t ∈ N0 : 1 − exp(−aτ^t g(x)) ≥ aετ^t g(x)}
   ≥ min{t ∈ N0 : 1 ≥ aετ^t g(x)}
   = max{0, ⌈(ax + ln(a^2 ε))/ ln(1/τ )⌉} ,

and since 1 ≥ aετ^{t∗} g(x) (from the last line of the display above), it follows that τ^{t∗} g(x) ≤ 1/(aε), and
so x − τ^{t∗} g(x) ≥ x − 1/(aε). Moreover, each GD iteration k takes at least axk / ln(1/τ ) backtracking iterations,
leading to at least a quadratic number a^2 εx^2 /(2 ln(1/τ )) of queries in total!
On this very ‘curvy’ function, Armijo’s condition forces the backtracking line search to return a point that
descends the slope extremely cautiously, despite the function being queried at much lower points during the line
search.
For example, if x0 = 100, ε = 1/2, a = 4, then it takes GD+backtracking more than 200 GD iterations before
xk < 0, and a total of at least 55 000 function queries. By contrast to Example 15, here a larger ε makes the
matter worse. ⋄
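The same effect can be reproduced at a smaller starting point (x0 = 5 rather than 100, to stay within floating-point range). The sketch below counts GD steps and backtracking queries on f (x) = e^{4x} with ε = 1/2 and τ = 1/2; the names are ours:

```python
import math

A = 4.0  # the 'a' of the example; x0 = 5 instead of 100 to avoid overflow

def f(x):
    return math.exp(A * x)

def grad(x):
    return A * math.exp(A * x)

def backtrack(x, eps=0.5, tau=0.5):
    # Armijo backtracking from alpha = 1, counting the queries it makes
    alpha, g, queries = 1.0, grad(x), 0
    while f(x) - f(x - alpha * g) < eps * alpha * g * g:
        alpha *= tau
        queries += 1
    return alpha, queries

x, steps, total_queries = 5.0, 0, 0
while x > 0:
    alpha, q = backtrack(x)
    x -= alpha * grad(x)
    steps += 1
    total_queries += q
```

Even from x0 = 5, the run takes more than a dozen GD steps and well over a hundred function queries before crossing 0, each step descending by at most 1/(aε) on the x-axis as the analysis predicts.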

See Table 3 for some experimental results, comparing backtracking line search and quasi-exact line search.

Many-dimensional optimization. We also compare the algorithms on an artificial loss function of the form

L(x) = [ ∏_{i=1}^{N} ( exp(a_{i,ni} x_{ni}) / Σ_{j∈[d]} exp(a_{i,j} x_j) ) ]^{−1} (9)

with the parameters x0 initialized uniformly at random in [−20, 20]^d , each a_{i,j} drawn uniformly at random
in [−1, 1], N = 10, d = 100, and each ni ∈ [d] sampled uniformly. We use Gradient Descent with line search and stop
after 2000 (function+gradient) queries. Fig. 6 shows the sensitivity of GD with backtracking to the ε parameter,
while GD with quasi-exact line search performs well for a wide range of values of the parameter c. Indeed, the
exponential curvature makes it difficult for Armijo's condition with a large parameter ε to return a value within
only a few iterations.

Figure 6: (Loss per function+gradient query for gradient descent with quasi-exact line search
(Algorithm 3) and backtracking line search (Algorithm 4) for the loss function of Eq. (9)) The x-
axis is strongly correlated with computation time. This figure illustrates a typical example. Due to exponentials,
ε ≥ 0.1 is too conservative for backtracking line search (see Example 16). By contrast, quasi-exact line search is
robust to various choices of c, and the default c = 1 appears to be a safe choice.

                          f (x) = 3.95x^2     f (x) = e^{3x} + e^{−3x}   f (x, y) = x^4 + y^4
Line search type          GD iter.  Queries   GD iter.  Queries          GD iter.     Queries
quasi-exact c = 100       4         45        6         954              4            57
quasi-exact c = 10        4         35        7         957              4            49
quasi-exact c = 4         4         30        8         959              4            44
quasi-exact c = 2         4         25        8         955              5            44
quasi-exact c = 1         4         25        10        965              5            42
quasi-exact c = 0.5       4         25        10        965              5            38
quasi-exact c = 0.1       4         25        44        1090             7            43
quasi-exact c = 0.01      754       2265      44        1089             7            44
backtracking ε = 0.8      67        476       898       193817           74 (17700)   388 (35662)
backtracking ε = 0.5      4         25        229       48266            24 (17674)   118 (35416)
backtracking ε = 0.3      4         25        147       31486            10 (17669)   68 (35376)
backtracking ε = 0.1      4         25        47        9613             5 (17655)    39 (35327)
backtracking ε = 0.01     754       3020      13        1376             8 (17662)    54 (35344)
backtracking ε = 0.001    754       3020      10        867              8 (17662)    54 (35344)

Table 3: (Experimental results comparing backtracking line search and quasi-exact line search)
Number of GD iterations, and number of function+gradient queries for quasi-exact line search and backtracking
line search (τ = 1/2) for various values of the terminating condition parameters c and ε, until f (xt ) − f (x∗ ) ≤
10−10 for different functions. For f (x) = 3.95x2 with x0 = 1000, small values of ε make the backtracking line
search slow (see Example 15). For f (x) = e3x + e−3x with x0 = 100, large values of ε make the backtracking
line search slow (see Example 16). By contrast, quasi-exact line search is fast for a wide range of values. For
f (x, y) = x4 +y 4 with (x0 , y0 ) = (0.1, 15), for which the global minimum is degenerate (Example 3.7 from Truong
and Nguyen (2021)), standard backtracking line search (numbers in parentheses) is stuck with a far too small
learning rate, impeding progress. Backtracking line search as in Algorithm 4 and quasi-exact line search as in
Algorithm 3 avoid this particular issue. See also Truong and Nguyen (2021) for other variants. In line with
Theorem 12, taking c = 0.01 for quasi-exact line search can incur a slowdown factor of c/(c + 1) ≈ 0.01 compared
to an exact line search, and we can indeed observe on the first function that performance is poor. The table
shows that there is no good default setting for backtracking line search. By contrast, quasi-exact line search is
robust across a wide range of parameter values. We recommend taking c = 1 to balance practical speed and
tight theoretical guarantees.

3.5 Frank Wolfe with quasi-exact line search


A quasi-exact line search fits very well with Frank-Wolfe (FW) (Frank and Wolfe, 1956), for a convex and smooth
function f : X → R over a bounded convex domain X ⊆ Rd . At FW iteration k, define f˜(α) = f ((1−α)xk +αsk ),
where sk = argminx∈X ⟨x, ∇f (xk )⟩. By contrast to Algorithm 3 (with GD), the range [xleft , xright ] should be
fixed at [0, 1] and not dynamically increased.
By making sure that the first line search query of iteration k is made at α = 2/(k +2), the O(1/k) convergence
proof of Frank-Wolfe with a quasi-exact line search goes through straightforwardly (Jaggi, 2013), since the
returned point will necessarily have a lower y-value than f˜(2/(k + 2)).
Note that this is not the case for the learning rate that satisfies Armijo’s condition in Eq. (8).
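The role of the initial query at α = 2/(k + 2) can be illustrated on a toy instance; here a closed-form minimizer of the 1-d quadratic stands in for the ∆-Secant line search, and the instance itself is ours:

```python
# Hypothetical instance: f(x) = ||x - target||^2 over the box X = [0, 1]^2.
target = (0.3, 0.6)

def f(x):
    return (x[0] - target[0]) ** 2 + (x[1] - target[1]) ** 2

def fw_step(x, k):
    g = (2 * (x[0] - target[0]), 2 * (x[1] - target[1]))  # gradient
    # linear minimization oracle over the box: a vertex per coordinate sign
    s = tuple(0.0 if gi > 0 else 1.0 for gi in g)
    d = (s[0] - x[0], s[1] - x[1])
    denom = d[0] ** 2 + d[1] ** 2
    # exact minimizer of the quadratic a -> f(x + a d), clipped to [0, 1];
    # this stands in for the Delta-Secant line search on [0, 1]
    a_star = 0.0 if denom == 0 else min(1.0, max(0.0,
        -(d[0] * (x[0] - target[0]) + d[1] * (x[1] - target[1])) / denom))
    # also query a = 2/(k+2) and keep the better of the two points, so the
    # accepted point is never worse than f((1-a)x + a s) at a = 2/(k+2)
    candidates = [(x[0] + a * d[0], x[1] + a * d[1])
                  for a in (a_star, 2.0 / (k + 2))]
    return min(candidates, key=f)

x = (1.0, 0.0)
for k in range(1, 101):
    x = fw_step(x, k)
```

Since the exact minimizer is among the candidates, the objective is non-increasing, and the 2/(k + 2) guard is the piece that the O(1/k) proof of Jaggi (2013) relies on.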

4 Conclusion
∆-Bisection and ∆-Secant are two algorithms that perform exact convex line search, the former using gradients
while the latter uses only function queries. These algorithms come with a guarantee on the optimality/duality
gap. We use this guarantee as a stopping criterion to develop a quasi-exact line search, and prove convergence
guarantees that are within a factor c/(c + 1) of the convergence speed of an exact line search (for which c = ∞),
while the slack given by c allows the search to terminate well before reaching numerical precision.
We recommend using ∆-Secant for quasi-exact line search with, for example, c = 1 to balance practical speed
and theoretical guarantees.
Our algorithms compare favorably with bisection search, golden-section search and backtracking line search
in their respective domains.

References
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, Eng-
land, 2004.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quar-
terly, 3(1-2):95–110, 1956.
Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of
Mathematics, 16(1):1–3, 1966.
Tuyen Truong and Hang-Tuan Nguyen. Backtracking gradient descent method and some applications in large
scale optimisation. Part 2: Algorithms and experiments. Applied Mathematics & Optimization, 84:1–30,
2021.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th
International Conference on Machine Learning, volume 28(1) of Proceedings of Machine Learning Research,
pages 427–435, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

Appendix
A A bad case for tangent intersection selection
It is tempting to consider a variant of Algorithm 1 where at each iteration we query at the intersection of the
tangents of the two external points, that is, xqt = argmin(x,y)∈∆t y. This algorithm can be very slow, however.
Consider for example the function f (x) = max{−x/100, ebx } for b > 0, with x01 = −100, x11 = 100. It can be
verified that, for this algorithm, for approximately 100/b steps t, x0t = x01 while x1t ≈ 100 − t/b. That is, only
linear progress is made on the x axis.
By contrast, ∆-Secant takes fewer than 15 iterations for b = 10, for example.

B Numerical stability
B.1 Line-line intersection
Some care needs to be taken with numerical stability when implementing the line-line intersection function from 4
points to calculate the y-gap.
Empirically, we found that the following formula is more stable than the determinant method. For 4 points
p1 , p2 , p3 , p4 , with y 1 ≥ y 2 and y 3 ≤ y 4 , define a12 = (y 2 − y 1 )/(x2 − x1 ) and a34 = (y 3 − y 4 )/(x3 − x4 ),
and f 12 (x) = a12 (x − x2 ) + y 2 and f 34 (x) = a34 (x − x3 ) + y 3 .
Then, for any x̂ ∈ R, the intersection of f 12 and f 34 is at

xc = x̂ + (f 12 (x̂) − f 34 (x̂))/(a34 − a12 ) .

First we pick x̂ = x2 , then we pick x̂ = xc and repeat the latter one more time to obtain a more accurate value.
Finally, we return the last xc , and we take y c = min{f 12 (xc ), f 34 (xc )} conservatively (for function minimization).
The quantity |f 12 (xc ) − f 34 (xc )| is the approximation error.
Additionally, (possibly numerical) zeros and infinities must be taken care of in a conservative manner (always
underestimating y c ).
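A direct transcription of this procedure (without the conservative zero/infinity handling just mentioned) might look as follows:

```python
def stable_intersection(p1, p2, p3, p4, refinements=2):
    # Line-line intersection via the shifted form described above.
    # Assumes y1 >= y2, y3 <= y4, and non-parallel lines (a34 != a12).
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    a12 = (y2 - y1) / (x2 - x1)
    a34 = (y3 - y4) / (x3 - x4)
    f12 = lambda x: a12 * (x - x2) + y2
    f34 = lambda x: a34 * (x - x3) + y3
    xc = x2
    for _ in range(1 + refinements):  # x_hat = x2, then re-center at xc
        xc = xc + (f12(xc) - f34(xc)) / (a34 - a12)
    yc = min(f12(xc), f34(xc))        # conservative for minimization
    error = abs(f12(xc) - f34(xc))    # approximation error
    return xc, yc, error
```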

B.2 Initial gradient for quasi-exact line search


When using an initial gradient for quasi-exact line search, it is important to keep the line constraint as a function
α 7→ f (x) − α∥∇f (x)∥22 , as is done in backtracking line search, rather than the more tempting solution of adding
virtual points and letting the 5-point algorithm recover the tangent.

C Table of notation

f                       a convex function to minimize on [xleft , xright ]
f′                      derivative of f
pat = (xat , yta )      a point labelled a at iteration t
fta (x)                 tangent of f at xat
ftab (x)                line going through pat and pbt
[xleft , xright ]       initial interval in which to start the search
y left , y right        f (xleft ), f (xright )
xlow , y low            coordinates of the queried point with the lowest function value
xhigh , y high          coordinates of the queried point with the highest function value
y tol                   user-specified tolerance used to terminate the line search
[x0t , x1t ]            interval maintained by the gradient-based algorithms
xqt                     query coordinate of an algorithm at iteration t
∆t                      optimality region for ∆-Bisection at step t; contains (x∗ , f (x∗ ))
∆xt , ∆yt               x and y coordinates of ∆t
Pt                      (f, xleft , xright )-convex set of points maintained by the ∆-Secant algorithm
∆(P )                   optimality region for a (f, ·, ·)-convex set of points P
∆x (Pt ), ∆y (Pt )      sets of x and y coordinates of ∆(Pt )
x− , x+                 inf ∆(P ), sup ∆(P )
pij,kl = (xij,kl , y ij,kl )   intersection point of f ij and f kl
