Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

Optimizing quantum optimization algorithms

via faster quantum gradient computation

András Gilyén∗ Srinivasan Arunachalam∗ Nathan Wiebe†


Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

Abstract ments for several optimization problems [24, 18, 28, 26,
We consider a generic framework of optimization algorithms 14, 10, 3, 9, 2, 31, 4, 13]. However, applying non-
based on gradient descent. We develop a quantum algo- Grover techniques to real-word optimization problems
rithm that computes the gradient of a multi-variate real- has proven challenging, because generic problems of-
valued function f : Rd → R by evaluating it at only a
logarithmic number of times in superposition. Our algo-
ten fail to satisfy the delicate requirements of advanced
rithm is an improved version of Jordan’s gradient compu- quantum techniques.
tation algorithm [28], providing an approximation of the A different paradigm of quantum optimization is
gradient ∇f with quadratically better dependence on the based on variational quantum circuits. These circuits
evaluation accuracy of f , for an important class of smooth are usually based on heuristics, nevertheless they are
functions. Furthermore, we show that objective functions
arising from variational quantum circuits usually satisfy the
promising candidates for providing quantum advan-
necessary smoothness conditions, hence our algorithm pro- tage in real-word problems, including quantum simu-
vides a quadratic improvement in the complexity of com- lation [39, 47], optimization [19], quantum neural net-
puting their gradient. We also show that in a continuous works [20] and machine learning [43]. These variational
phase-query model, our gradient computation algorithm has circuits usually try to optimize some objective function,
optimal query complexity up to poly-logarithmic factors, for
a particular class of smooth functions. Moreover, we show
which could correspond to, e.g., the energy of a quantum
that for low-degree multivariate polynomials our algorithm state in a molecule or the prediction loss in a learning
can provide exponential speedups compared to Jordan’s al- model. These quantum circuits have the appealing fea-
gorithm in terms of the dimension d. ture that they often have low depth, therefore can po-
One of the technical challenges in applying our gradient tentially be implemented on NISQ (noisy intermediate-
computation procedure for quantum optimization problems scale quantum) hardware. The training is usually per-
is the need to convert between a probability oracle (which formed by running the circuit several times and using
is common in quantum optimization procedures) and a classical gradient descent on the parameter space.
phase oracle (which is common in quantum algorithms) of In the long term it is expected that training could
the objective function f . We provide efficient subroutines become a bottleneck as the size of the variational quan-
to perform this delicate interconversion between the two tum circuits grow, similarly to for example the training
types of oracles incurring only a logarithmic overhead, of classical deep neural networks. However, the ulti-
which might be of independent interest. Finally, using mate limitations of variational training algorithms are
these tools we improve the runtime of prior approaches not well understood. In particular it has been unclear
for training quantum auto-encoders, variational quantum whether it is possible to achieve improvements beyond
eigensolvers (VQE), and quantum approximate optimization the simple quantum speedups provided by amplitude es-
algorithms (QAOA). timation techniques. This underscores the importance
of understanding the performance of training algorithms
1 Introduction as we begin to push beyond NISQ-era devices. Our
Quantum optimization. Optimization is a funda- main contribution is showing that non-trivial speedups
mentally important task that touches on virtually ev- can be achieved via quantum gradient computation of
ery area of science. Recently, there have been many the objective function. We remark that in this varia-
quantum algorithms that provide substantial improve- tional setting the prior works on quantum gradient de-
scent [40, 30] are not applicable.
It is usually difficult to reduce the number of itera-
∗ QuSoft/CWI, Amsterdam, Netherlands. Supported tions using quantum computers in iterative algorithms
by ERC Consolidator Grant QPROGRESS and partially
such as gradient descent. We therefore focus on the
supported by QuantERA project QuantAlgo 680-91-034.
{gilyen,arunacha}@cwi.nl cost per iteration, and speed up gradient computation.
† Station Q QuArC, Microsoft Research, USA. Since in general very little is known about the landscape
nawiebe@microsoft.com

Copyright © 2019 by SIAM


1425 Unauthorized reproduction of this article is prohibited
of objective functions arising from variational quantum Our improved gradient computation algorithm.
circuits, we study the problem in a black-box setting Suppose f : Rd → R is given by an oracle, which on in-
where we can only learn about the objective function put x , outputs f (x x) binarily with some finite accuracy.
by evaluating it at arbitrary points. This setting sim- Jordan [28] constructed a quantum algorithm that out-
plifies the problem and allows us to reason about upper puts an ε-coordinate-wise approximation of the gradient
and lower bounds on the cost of gradient computation. ∇f using a single evaluation of the binary oracle. This
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

demonstrates a striking quantum advantage, since clas-


Classical gradient-based optimization algo- sically one needs Ω(d) queries. His algorithm prepares a
rithms. We first give a high-level description of a uniform superposition of evaluation points over a finite
basic classical gradient-based optimization technique. grid, then approximately implements a phase unitary
The problem is, given p : Rd → R, compute √
2πi ε2d f (x
x)
O f : |x
x i →
7 e |x
xi,
(1.1) OPT = min{p(x x) : x ∈ Rd }. √ 
using a single O ε2 / d -accurate evaluation of f and
A heuristic solution of the optimization problem (1.1)
then applies an inverse Fourier transformation to obtain
can be obtained by computing the gradient of p:
  an approximation of the gradient. Although this algo-
∂p ∂p ∂p rithm only uses a single query, the required precision of
(1.2) ∇p = , ,...,
∂x1 ∂x2 ∂xd the function evaluation can be prohibitive. Moreover,
It is a well-known fact in calculus that p decreases the original analysis of Jordan [28] implicitly assumes
the fastest in the direction of −(∇p(x x)). This sim- that the function is essentially quadratic, by neglecting
ple observation is the basis of gradient-based optimiza- third and higher-order contributions. Using Jordan’s al-
tion algorithms. Given the generality of the optimiza- gorithm in our framework, where we assume access to a
tion problem (1.1) and the simplicity of the algorithm, probability oracle, evaluating the function with the pre-
gradient-based techniques are widely used in mathemat- scribed precision √ using amplitude amplification would
ics, physics and engineering. require Ω( d/ε2 ) queries.
Our improvements are √ two-fold: our quantum algo-
Probability oracles. Since quantum algorithms such rithm requires only O(ε/ e d)-accurate evaluations of the
as QAOA or VQE work with an objective function that function; on the other hand it also works for functions
needs to be learned by sampling, we assume the function with non-negligible higher-order terms, such as the ob-
is given by a probability oracle, which for every x ∈ Rd jective functions arising from variational circuits. The
acts as: main new ingredient is the use of higher-degree central-
(1.3) difference formulas, a technique borrowed from calcu-
p p  lus. The heart of our proof is showing that if f is suf-
Op : |~0i|x
xi 7→ p(xx)|1i|ψx(1) i + 1 − p(x x)|0i|ψx(0) i |x
xi ficiently smooth, then by using central-difference for-
where the continuous variable x is represented binarily mulas we can approximately linearize the function √ on
with some finite precision. 1 a diameter-1 hypercube by performing only log( d/ε)
The classical analogue of this model is when we get function evaluations. We prove this by bounding the
samples from the distribution (p(x x), 1 − p(xx)). Using “second moment” of higher-order bounded tensors using
empirical estimation, O(1/ε2 ) samples suffice for esti- Lemma 5.3, which might be of independent interest.
mating p(x x) with precision O(ε). If the function is suf- Our algorithm works by evaluating the approxi-
ficiently smooth, using standard techniques we can com- mately linearized function over a uniform superposi-
pute an ε-approximation of ∇i p(x x) using a logarithmic tion of grid points in the hypercube, followed by a d-
number of such function estimations. Computing the dimensional quantum Fourier transform, providing a
gradient this way uses O(d/ε e 2
) samples. classical description of an approximate gradient, sim-
Our main result shows that such an ε-approximate ilarly to Jordan’s algorithm.
gradient estimate (in the `∞ -norm) can be computed Theorem 1.1. (See Theorem 5.4 & Theorem 6.1)
with quadratically fewer quantum queries to a proba- Let ε > 0, d ∈ N, and c = O(1). Suppose we are given
bility oracle. We remark that our quadratic speedup is probability (or phase) oracle access to a function

not yet-another Grover speed-up, instead it comes from f : Rd → R, such that |∂i1 ,i2 ,...,ik f (00)| ≤ ck k! for
applying the Fourier transformation together with some every k ∈ N and (i1 , i2 , . . . , ik ) ∈ [d]k . The quantum
optimized interpolation techniques. query complexity of computing (with high prob.) an
ε-coordinatewise-approximation of ∇f (00) is
1 Note that this is a much weaker input model than the oracle √ 
model used by Jordan [28]. Θe d/ε .

Copyright © 2019 by SIAM


1426 Unauthorized reproduction of this article is prohibited
Our algorithm is also gate-efficient; the gate complexity objective function then we wish to find x such that
is O(Q
e + d), where Q is the query complexity. hψ(x x)|H|ψ(xx)i is minimized. For example, in order
to minimize the number of violated constraints of a
Our lower bound techniques. In order to prove constraint PM satisfaction problem, we can choose H =
the optimality of our algorithm we prove an extended m=1 mC to represent the number of violations: Cm is
version of the so-called hybrid method [7]. 1 if and only if the mth constraint is violated, else Cm =
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

0 [19]. After normalization and using some standard


Theorem 1.2. (Continuous-phase lower bound) techniques we can map this expectation value to some
Let F be a finite set of functions X → R. Suppose we measurement probability. Thus, from the perspective of
have access to f ∈ F via a phase oracle such that our algorithm, QAOA looks exactly like VQE.
The classical auto-encoder paradigm [5] is an impor-
Of : |xi 7→ eif (x) |xi for every x ∈ X. tant technique in machine learning, which is widely used
for data compression. An auto-encoder is essentially a
Let f∗ ∈ F be fixed. Given Of for some unknown f ∈ F, neural network architecture which is tuned for the fol-
determining whether f = f∗ has query complexity lowing task: given a set of high-dimensional vectors, we
 ,s  would like to learn a low-dimensional representation of
p X 2 the vectors, so that computations on the original data
Ω |F| − 1 max |f (x) − f∗ (x)| .
x∈X set can be “approximately” carried out by working only
f ∈F
with the low-dimensional representations. What makes
In order to prove the lower bound in Theorem 1.1, we auto-encoding powerful is that it does not assume any
exhibit a family of functions F for which the functions prior knowledge about the data set. This makes it a vi-
can be well distinguished by calculating their gradients able technique in machine learning, with various appli-
with accuracy ε in the `∞ -norm. Then, with the√ help cations in natural language processing, training neural
of Theorem 1.2 we show that this requires Ω( d/ε) networks, object classification, prediction or extrapo-
queries. lation of information, etc. In this paper, we consider
a natural quantum analogue (which was also consid-
ered before in the works of [45, 41]) of the auto-encoder
Applications. We consider three problems to which
paradigm, and show how to use our quantum gradient
we apply our quantum gradient descent algorithm. We
computation algorithm to quadratically speed up the
briefly describe below the problem of quantum varia-
training of quantum autoencoders.
tional eigensolvers (VQE) [39, 47], quantum approx-
imate optimization algorithms (QAOA) [19], and the
2 Organization of the paper and preliminaries
quantum auto-encoding problem [45, 41]. In each case
we show how our gradient computation algorithm can In Section 3, we give a generic model of quantum
provide a quadratic speedup in terms of the dimension optimization algorithms and a detailed description of
d of the associated problem. the classical gradient descent algorithm. In Section 4,
VQE is widely used to estimate the eigenvalue we describe how to convert a probability oracle to a
corresponding to some eigenstate of a Hamiltonian. phase oracle. In Section 5 we present our quantum
The main idea in VQE is to begin with an efficiently gradient computation algorithm and prove our main
parameterizable ansatz to the eigenstate. For the Theorem 5.4 regarding its complexity. In Section 6,
example of ground state energy estimation, the ansatz we present query lower bounds for algorithms that
state is often taken to be a unitary coupled cluster (approximately) compute the gradient of a function. In
expansion. The terms in that unitary coupled cluster Section 7 we describe some applications. We conclude
expansion are varied to provide the lowest energy for the with some directions for future research in Section 8.
groundstate, and the expected energy of the quantum
state is mapped to the probability of some measurement Notation. Let e 1 , e 2 , . . . , e d ∈ Rd denote the standard
outcome, making it accessible to our methods. basis vectors. We use bold letters for vectors x ∈ Rd ,
QAOA has a similar approach, the basic idea of in particular we use the notation 0 for the 0 vector, and
the algorithm is to consider 1 for the all-1 vector (ee1 + e 2 + · · · + e d ). By writing
Q a parametrized family of
states such as |ψ(x x)i = dj=1 e−ixj Hj |0i. The aim is y +rS we mean {yy +rvv : v ∈ S} for vectors S ⊆ Rd , and
to tune the parameters of |ψ(x x)i in order to minimize use the same notation for sets of numbers. Pd For x ∈ Rd ,
2 1/2
some objective function, which can, e.g., represent some let kx xk∞ = maxi∈[d] |xi | and kx xk = ( i=1 xi ) . For
combinatorial optimization problem. In particular, M ∈ Rd×d , let kM k denote the operator norm of M .
if H is a Hermitian operator corresponding to the Let [d] = {1, 2, . . . , d}. We use the convention

Copyright © 2019 by SIAM


1427 Unauthorized reproduction of this article is prohibited
00 = 1 throughout, and use the notation N0 = N ∪ {0}. We denote the corresponding central-difference coeffi-
When we state the complexity of an algorithm, we cients for ` ∈ {−m, . . . , m}\{0} by
use the notation O(C)
e to hide poly-logarithmic factors `−1 m

in the complexity C. In general, we use H to denote a (2m) (−1) |`| (2m)
a` := m+|`|
 and a0 := 0
finite dimensional Hilbert space. For the n-qubit all-0 `
|`|
⊗n
basis state we use the notation |0i , or simply write
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

In the full version of this paper [22] we prove some


|~0i when we do not want to focus on the value of n.
bounds on the approximation error of the above formu-
las2 for generic m. Usually such error bounds are only
Higher-order calculus. Many technical lemmas in
derived for some finite values of m, because that is suf-
this paper will revolve around the use of higher-order
ficient in practice, but in order to prove our asymptotic
calculus. We briefly introduce some notation here and
results we need to derive more general results.
give some basic definitions.
Definition 2.1. (Index-sequences) For k ∈ N0 we 3 A generic model of quantum optimization
call α ∈ [d]k a d-dimensional length-kQindex-sequence. algorithms
For a vector r ∈ Rd we define r α := j∈[k] rαj . Also, Variational quantum algorithms designed for quantum
for a k-times differentiable function, we define ∂α f := optimization and machine learning procedures have
∂α1 ∂α2 · · · ∂αk f . Finally, we use the notation |α| = k the following core idea: they approximate an optimal
for denoting the length of the index-sequence. solution to a problem by tuning some parameters in
a quantum circuit. The circuit usually consists of
Definition 2.2. (Analytic function) We say that
several simple gates, some of which have tunable real
the function f : Rd → R is analytic if for all x ∈ Rd
parameters, e.g., the angle of single qubit (controlled)
∞ X rotation gates. Often, if there are enough tunable
X ∂α f (00)
(2.4) x) =
f (x xα . gates arranged in a nice topology, then there exists
k!
k=0 α∈[d]k parameters that induce a unitary capable of achieving
Definition 2.3. (Directional derivative) a close to optimal solution.
Suppose f : Rd → R is k-times differentiable at x ∈ Rd . In such variational approaches, one can decompose
We define the k-th order directional derivative in the the circuit into three parts each having a different role
direction r ∈ Rd using the derivative of a one-parameter (see Figure 1). The circuit starts with a state prepara-
function parametrized by τ ∈ R along the ray in the tion part which prepares the initial quantum state rel-
direction of r : evant for the problem. We call this part ‘Prep.’ in
Figure 1. The middle part consists of tunable param-
dk eters x and fixed gates, which are together referred to
∂rk f (x
x) = x + τrr ).
f (x
(dτ )k as ‘Tuned’ in Figure 1. Finally, there is a verification
circuit that evaluates the output state, and marks suc-
Observe that, using the definitions above, one has cess if the auxiliary qubit is |1i. We call the verification
(2.5) ∂rk f =
X
r α · ∂α f. process V in Figure 1. The quality of the circuit (for
parameter x ) is assessed by the probability of measuring
α∈[d]k
the auxiliary qubit and obtaining 1.
In particular for every i ∈ [d], we have that ∂eki f = ∂ik f . One can think of the tunable circuit as being
Central difference formulas (see, e.g. [34]) are often tuned in a classical way as shown in Figure 1a or a
used to give precise approximations of derivatives of a quantum way as in Figure 1b. In the classical case, the
function h : R → R. These formulas are coming from parameters can be thought of as being manually set.
polynomial interpolation, and yield precise approxima- Alternatively, the parameters can be quantum variables
tions of directional derivatives too. Thus, we can use represented by qubits. The advantage of the latter is
them to approximate the gradient of a high-dimensional that it allows us to use quantum techniques to speedup
function as shown in the following definition. optimization algorithms. However the drawback is that
it requires more qubits to represent the parameters and
Definition 2.4. The degree-2m approximate central- requires implementation of additional controlled-gates,
difference linearization of a function f : Rd → R is: see for example, Fig. 1c.
(2.6)
m
m

X (−1)`−1 |`| 2 One can read out the coefficients described in Definition 2.4
x) :=
f(2m)(x m+|`|
x) ≈ ∇f (00) · x.
 f (`x
` from the second row of the inverse of the Vandermonde matrix,
`=−m |`| as Arjan Cornelissen pointed out to us.
`6=0

Copyright © 2019 by SIAM


1428 Unauthorized reproduction of this article is prohibited
x R(x)
z }| {
(1)
|~0i Prep. Tuned 
 |bx i • ...
V

 (0) ...
|bx i •


|0i |1i? 



(−1) ...
|xi |bx i •
(a) A classically tunable circuit 

 .. ..
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

. .




 (−n)

... •
|x
xi |bx i

|~0i Prep. Tuned |ψi R(2) R(1) R(2−1 ) ... R(2−n )


V
|0i |1i?
(c) A 2−n precisely tunable rotation gate R(x) for the fixed point binary
(b) A quantumly tunable circuit parameter x = b1 b0 .b−1 · · · b−n .

Figure 1: Two different approaches to tunable quantum optimization. The circuit on the top left has classically
set parameters |x
xi (represented as a vector of fixed point binary numbers), whereas the circuit on the bottom left
has parameters x described by an array of qubits |x xi. The black squares connected to the ‘Tuned’ circuit indicate
non-trivial control structure for which an example is presented on the right figure, showing how to implement a
quantumly tunable rotation gate built from simple controlled rotation gates.

(0) (0)
Let us denote by U (x x) the circuit in Figure 1a as follows: pick N random points {x1 , . . . , xN }. For
and the corresponding circuit in Figure 1b as U := (0)
P each i ∈ [N ], compute ∇p(xi ), and take a δ-step in the
|x
x ihx
x | ⊗ U x
(x ). The goal in these optimization (0) (1) (0) (0)
x direction of ∇p(xi ) leading to xi = xi + δ∇p(xi )
problems is to find the optimal parameters (i.e., x )
(for some step size δ > 0). Repeat this procedure for T
which maximizes the probability of obtaining 1 after (T )
the final measurement, thereby solving the problem steps, obtaining xi which has hopefully approached
some local maxima of (1.1). Finally, take the maximum
(T ) (T )
(3.7) of {p(x1 ), . . . , p(xN )} as an approximation to (1.1).
2
argmax p(x x), where p(x x) = (|1ih1| ⊗ I)U (x

x)|~0i .
By using empirical estimation to evaluate the func-
x tion to precision ε, under some mild smoothness as-
sumption, we can compute the gradient to ε-precision
A well-known technique to solve continuous- e d2 samples. This way the al-
in the `∞ norm using O ε
variable optimization problems like the one above is
gorithm uses the quantum circuit U (x x) a total number
gradient-ascent. In practice, gradient-based methods e 
O N T d/ε2 times.
represent one of the most commonly used paradigms
for solving continuous optimization problems.
3.2 Quantum speedups to the classical algo-
rithm Now, let us consider the possible quantum
3.1 Classical gradient ascent algorithm As we
speedups to this naïve gradient ascent algorithm dis-
discussed earlier, finding globally optimal parameters
cussed in the previous section. The most basic improve-
for optimization problems (3.7) is often hard. There-
ment which works even for classically controlled circuits
fore in practice one usually relies on heuristics to find
x
an approximate solution x a such that p(x xa ) is close to (Figure 1a) is to estimate the probability p(x ) in Step 5
using quantum amplitude estimation rather than doing
optimal. There are several heuristic optimization tech-
repeated measurements and taking the average. If one
niques that are often applied to handle such problems.
wants to determine the value p(x x) up to error ε for some
One of the most common techniques is gradient ascent,
fixed x , the quantum approach uses the circuit O(1/ε)
which follows a greedy strategy to obtain the optimal
times, whereas the classical statistical method would re-
solution. It simply follows the path of steepest ascent
quire Ω(1/ε2 ) repetitions, due to the additive property
on the landscape of the objective function to find a so-
of variances of uncorrelated random variables. Although
lution that is, at least up to small local perturbations,
this is a natural improvement, which does not require
optimal. Such solutions are called locally optimal.
much additional quantum resources many papers that
A naïve gradient-based algorithm3 can be described
describe a similar procedure do not mention it.
3 There are more sophisticated versions of gradient ascent, but Another quantum improvement can be achieved [11,
for simplicity here we discuss a very basic version. 33] by using Grover search, which requires a quantumly

Copyright © 2019 by SIAM


1429 Unauthorized reproduction of this article is prohibited
Method: Simple algorithm

+Amp. est. +Grover
 √ search
  √ paper
+This 
Complexity: Oe T N d/ε2 O(T
e N d/ε) O
e T N d/ε O
e T N d/ε

Table 1: Quantum speedups for the naïve gradient-ascent algorithm

 p 
controlled circuit like in Figure 1b. Let P (zz ) denote
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

obtain an overall complexity of O


e T N d/ε . For a
the probability that for a randomly chosen starting summary of the speedups see Table 1.
point x 0 we get x T = z , i.e., we end up with z after
performing T gradient steps. Let p̃ be a value such that 4 The three natural input oracle models
P (p(zz ) ≥ p̃) ≥ 1/N . If we use N randomly chosen
As we discussed in the previous section, the optimiza-
initial points then with high probability at least one
xT ) ≥ p̃.4 If tion problem associated with Figure 1 was to maximize
initial point will yield a point x T with p(x
we use the quantum maximum finding algorithm [17] or 2
x)|~0i .
x) = max (|1ih1| ⊗ I)U (x
max p(x

more precisely one if its generalizations √ [37, 3], we can x x
reduce the number of repetitions to O( N ) and still find
a point x T having p(x xT ) ≥ p̃ with high probability. Due Typically, one should think of U as the unitary cor-
to reversability, we need to maintain all points visited responding to some variational quantum circuit or
during the gradient ascent algorithm, thereby possibly parametrized quantum algorithm. In more abstract
introducing a significant overhead in the number of terms, we can view U as a probability oracle that maps

|~0i|x x)|1i|ψx(1) i + 1 − p(x
x)|0i|ψx(0) i |x
p p
qubits used. xi to p(x xi,
However, there is a drawback using Grover search- such that the probability of obtaining 1 by measuring
based techniques. The disadvantage of quantum max- the first qubit is p(x). This measurement probability
imum finding approach over classical methods is, that serves as a “benchmark score” for the corresponding
the amount of time it takes to reach a local maximum unitary U with respect to the vector of parameters x .
using the gradient ascent might vary a lot. The reason
is that classically, once we reached a local maximum we Definition 4.1. (Probability oracle) We say that
can start examining the next starting point, whereas Up : Haux. ⊗ H → Haux. ⊗ H is a probability oracle for
if we use Grover search we do the gradient updates in the function p : X → [0, 1], if {|xi : x ∈ X} is an
superposition so we need to run the procedure for the orthonormal basis of the Hilbert space H, and Up for all
largest possible number of gradient steps. To reduce this x ∈ X acts as
disadvantage one could use variable time amplitude am- p p 
plification techniques introduced by Ambainis [1], how- |~0i|xi 7→ p(x)|1i|ψx(1) i + 1 − p(x)|0i|ψx(0) i |xi,
ever, we leave such investigations for future work.
(1) (0)
where |ψx i and |ψx i are arbitrary (normalized) quan-
Our contribution. We show a quadratic speedup tum states.
in d – the number of control parameters. For this This oracle model is not commonly used in quantum
we also need to use a quantumly controlled circuit, algorithms, therefore we need to convert it to a different
but the overhead in the number of qubits is much format, for example to a phase oracle, in order to use
smaller than in the previous Grover type speedup. The standard quantum techniques.
underlying quantum technique crucially relies on the
quantum Fourier transform as it is based on an improved Definition 4.2. (Phase oracle) We say that Of :
version of Jordan’s gradient computation [28] algorithm. Haux. ⊗ H → Haux. ⊗ H is a phase oracle for f : X →
We can optionally combine this speedup with the above [−1, 1], if {|xi : x ∈ X} is an orthonormal basis of the
mentioned maximum finding improvement, which then Hilbert space H, and Of for all x ∈ X acts as
gives a quantum algorithm that  uses the quantumly

controlled circuit (Figure 1b) O
e T N d/ε times, and Of : |~0i|xi 7→ eif (x) |~0i|xi.
achieves essentially the same guarantees as the classical There is a third type of oracle that is commonly
Algorithm. Therefore we can achieve a quadratic used in (quantum) algorithms, namely the binary oracle
speedup in terms of all parameters except in T and that outputs a finite precision binary representation of
the function [48]. This is the input model used in the
4 I.e.,
with high probability, we will find a point from the top work of Jordan [28], but in fact the first step of his
1/N percentile of the points regarding the objective function p(zz ). algorithm converts this oracle to a phase oracle.

Copyright © 2019 by SIAM


1430 Unauthorized reproduction of this article is prohibited
Definition 4.3. (Binary oracle) For η ∈ R+ , we We implement (fractional) phase oracles by first
say Bfη : H ⊗ Haux. → H ⊗ Haux. is an η-accurate binary turning a probability oracle to a block-encoding of the
oracle for f : X → R, if {|xi : x ∈ X} is an orthonormal diagonal matrix containing the probabilities, then using
basis of the Hilbert space H, and for all x ∈ X it acts the above Hamiltonian simulation result.
as Let Up be a probability oracle, observe that
Bpη : |xi|~0i 7→ |xi|p0 (x)i,
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

(h~0| ⊗ I) Up† (Z ⊗ I)Up (|~0i ⊗ I) = diag(1 − 2p(x)).



0
where |p (x)i is a fixed-point binary number satisfying
|p0 (x) − p(x)| ≤ η. We define the cost of one query to
Therefore Up† (Z ⊗I)Up is a block-encoding of a diagonal
Bfη as C(η).5,6
matrix with diagonal entries (1 − 2p(x)). Moreover, we
Note that using amplitude estimation7 it is possible can implement a controlled a version of this unitary by
to turn a probability oracle to a binary oracle and then simply replacing Z with a controlled-Z.
convert a binary oracle to a phase oracle. However, it
is more efficient to avoid this analogue-digital-analogue Corollary 4.1. (Probability to phase oracle)
conversion, and directly convert a probability oracle to Suppose that we are given a probability oracle for
a phase oracle. p(x) : X → [0, 1]. Let t ∈ R and f (x) = tp(x) for all
x ∈ X. We can implement an ε-approximate phase
4.1 Converting a probability oracle to a (frac- oracle Of with query complexity O(|t| + log(1/ε)),
tional) phase oracle We use block-encoding and i.e., this many uses of Up and its inverse. Moreover,
if |~0i = |0i⊗a is an a-qubit state, then this query
Hamiltonian simulation techniques to implement the
conversion efficiently. In order to describe our result complexity bound times O(a) upper bounds the gate
we use the following definition, motivated by [23]. complexity.

Definition 4.4. (Block-encoding) Suppose that A Form this we see that a probability oracle can
is an operator on H, then we say that the unitary U be efficiently converted to a (fractional) phase oracle.
acting on Haux. ⊗ H is a block-encoding of A, if Recently it was also shown, that if |f | ≤ 1, then with
a similar logarithmic overhead a phase oracle can be
A = (h~0| ⊗ I)U (|~0i ⊗ I).
converted to fractional phase oracle [23]. For these
Intuitively a block-encoding is a unitary, whose top-left reasons, and for simplicity, throughout the paper we
block contains the desired matrix A: assume that a fractional phase query has the same cost
  as a non-fractional/full phase query.
A . ~ ~
U= ⇐⇒ A = (h0| ⊗ I)U (|0i ⊗ I).
. . Definition 4.5. (Fractional phase oracle) Let
r ∈ [−1, 1], we say that Orf : Haux. ⊗ H → Haux. ⊗ H
Low and Chuang [35] showed how to implement an
is a fractional query8 phase oracle for f : X → [−1, 1],
optimal Hamiltonian simulation circuit given a block-
if {|xi : x ∈ X} is an orthonormal basis of the Hilbert
encoding of the Hamiltonian.
space H, and Orf for all x ∈ X acts as
Theorem 4.1. (Hamiltonian simulation [35])
Suppose that U is a block-encoding of the Hamiltonian Orf : |~0i|xi 7→ eirf (x) |~0i|xi.
H, and Haux. consists of a qubits. Then one can
implement an ε-precise approximation of the Hamilto- 5 Improved quantum gradient computation
nian simulation unitary eitH using 2 additional ancilla algorithm
qubits, with O(|t| + log(1/ε)) uses of controlled-U or its 5.1 Overview of Jordan’s algorithm Stephen
inverse and with O(a|t| + a log(1/ε)) two-qubit gates. Jordan constructed a surprisingly simple quantum al-
gorithm [28, 12] that can approximately calculate the
5 One could also consider a more general definition allowing the d-dimensional gradient of a function f : Rd → R with
oracle to output a superposition of η-accurate answers. a single evaluation of f . In contrast, using standard
6 The cost function would typically be polylog(1/η) for func-

tions that can be calculated using a classical circuit, however,


8 Note that this fractional query is more general than the
when the binary oracle is obtained via quantum phase estimation
this cost is typically 1/η. fractional query introduced by Cleve et al. [15], because we have
7 In some cases people use sampling and classical statistics to a continuous phase rather than discrete. Thus, the results of
learn this probability, however amplitude estimation is quadrat- [15] do not give a way to implement a generic fractional query
ically more efficient. Typically one can improve sampling proce- using a simple phase oracle Of , however one can use qubitization
dures quadratically using quantum techniques [36, 25]. techniques in order to implement fractional queries [23].

Copyright © 2019 by SIAM


1431 Unauthorized reproduction of this article is prohibited
classical techniques, one would use d + 1 function eval- Complexity of the algorithm. For Jordan’s algo-
uations to calculate the gradient at a point x ∈ Rd : one rithm, it remains to pick the parameters of the grid and
can first evaluate f (x x) and then, for every i ∈ [d], evalu- the constant S in Eq. 5.8. For simplicity, assume that
x + δeei ) (for some δ > 0) to get an approximation
ate f (x k∇f (x
x)k∞ ≤ 1, and suppose we want to approximate
of the gradient in direction i using the standard formula ∇f (x
x) coordinate-wise up to ε accuracy, with high suc-
cess probability. Under the assumption that “the 2nd
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

x + δeei ) − f (x
f (x x) partial derivatives of f have a magnitude of approxi-
∇i f (x
x) ≈ .
δ mately D2 ”, Jordan argues9 that choosing Gxd to be a d-
The basic idea of Jordan’s quantum algorithm [28] dimensional hypercube with edge length ` ≈ D ε√d and
2
is simple. First make two observations. Observe that if with N ≈ 1ε equally spaced grid points in each dimen-
f is twice differentiable
 at x , then f (x x + δ ) = f (x x) + sion, the quantum algorithm yields √
an ε-approximate
2
∇f · + O kδ k , which in particular implies that for
δ δ gradient by setting S = N` ≈ D2ε2 d . Moreover, since the
small kδδ k, the function f is very close to being affine Fourier transform is relatively insensitive to local phase
linear. The second observation is that, using the value errors it is sufficient to implement the phase Sf (x x + δ)
x + δ ), one can implement a phase oracle:
of f (x upto some constant, say 1% accuracy.
(5.8) During the derivation of the above parameters Jor-
O2πSf : |δδ i 7→ e2πiSf (xx+δδ ) |δδ i ≈ e2πiSf (xx) e2πiS∇f ·δδ |δδ i dan makes the assumption, that the third and higher-
order terms of the Taylor expansion of f around x are
for a scaling factor S > 0, where the approximation negligible, however it is not clear from his work [28], how
x + δ ) ≈ f (x
uses f (x x) + ∇f · δ for small kδδ k. The role of
to actually handle the case when they are non-negligible.
S is to make make the phases appropriate for the final This could be a cause of concern for the runtime analy-
quantum Fourier transform. sis, since these higher-order terms potentially introduce
a dependence on the dimension d.
Sketch of the algorithm. Assume that all real vectors Finally, in order to assess the complexity of his
are expressed upto finite amount of precision. In order algorithm, Jordan considers the Binary oracle input
to compute the gradient at x , the algorithm starts with model of Definiton 4.3. This input model captures
a uniform superposition |ψi = √ 1 d
P
δ ∈Gxd δ i over the

|Gx | functions that are evaluated numerically using, say, an
points of a sufficiently small discretized d-dimensional arithmetic circuit. Typically, the number of one and
grid Gxd around x , and applies the phase oracle O2πSf two-qubit gates needed to evaluate such functions up to
(in Eq. 5.8) to |ψi. Next, the inverse quantum Fourier n digits precision is polynomial in n and d. However,
transform is applied to the resulting state and each this input model does not fit the quantum optimization
register is measured to obtain the gradient of f at x framework that we introduced in Section 3.
approximately. Due to approximate linearity of the
phase (as in Eq. (5.8)), applying the Fourier transform Our improvements. We improve on the results of
will approximately give us the gradient. This algorithm Jordan [28] in a number of ways. Jordan [28] argued
uses O2πSf once and Jordan showed how to implement that evaluating the function on a superposition of grid-
O2πSf using one sufficiently precise function evaluation. points symmetrically arranged around 0 is analogous to
In order to improve the accuracy of the simple using a simple central-difference formula. We also place
algorithm above, one could use some natural tricks. If f the grid symmetrically, but we realized that it is possible
is twice continuously differentiable, it is easy to see that to directly use central-difference formulas, which is the
the smaller the grid Gxd becomes, the closer the function main idea behind our modified algorithm.
gets to being linear. This gives control over the precision We realized that in applications of the gradient
of the algorithm, however if we “zoom-in” to the function descent algorithm for optimization, it is natural to
using a smaller grid, the difference between nearby assume access to a phase oracle Of : |x xi 7→ eif (xx) |x
xi
function values becomes smaller, making it harder the (allowing fractional queries as well – see Definition 4.5)
distinguish them and thus increasing the complexity of instead of the binary access oracle Bfη . If we wanted
the algorithm proportionally. to use Jordan’s original algorithm in order to obtain
Also, it is well known that if the derivative is the gradient with accuracy ε, we need to implement the
calculated based on the differences between the points
(f (x − δ/2), f (x + δ/2)) rather than (f (x), f (x + δ)), 9 We specifically refer to equation (4) in [28] (equation (3) in
then one gets a better approximation since the quadratic the arXiv version), and the discussion afterwards. Note that our
correction term cancels. To mimic this trick, Jordan precision parameter ε corresponds to the uncertainty parameter
chose a symmetric grid Gxd around 0. σ in [28].

Copyright © 2019 by SIAM


1432 Unauthorized reproduction of this article is prohibited

query oracle OfS by setting S ≈ D2 d/ε2 , which can Algorithm 1 Jordan’s quantum gradient computation
be achieved using dSe consecutive (fractional) queries. algorithm
Although it gives  a square-root dependence on d, it Registers: Use n-qubit input registers
scales as O 1/ε2 with the precision. Here, we employ |x1 i|x2 i · · · |xd i with each qubit set to |0i.
the phase oracle model and improve the quadratic Labels: Label the n-qubit states of each register
dependence on 1/ε to essentially linear. Additionally, with elements of Gn as in Definition 5.1.
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

we rigorously prove the square-root scaling with d Input: A function f : Gdn → R with phase-oracle
under reasonable assumptions on the derivatives of f . Of access such that
We also√ show
 that, for a class of smooth functions, n+1 n
the O e d/ε -query complexity is optimal up to poly- Ofπ2 |x1 i· · ·|xd i = e2πi2 f(x1 ,x2 ,...,xd ) |x1 i· · ·|xd i.
logarithmic factors. We describe the algorithm in the
next section, but first present our main result, whose 1: Init Apply a Hadamard transform to each qubit of
proof is deferred to the end of this section. the input registers.
2: Oracle call Apply the modified phase oracle
n+1

5.2 Analysis of Jordan’s algorithm In this sec- Ofπ2 on the input registers.
−1
tion we describe Jordan’s algorithm and provide a 3: QFTGn Fourier transform each register:
generic analysis of its behaviour. In the next subsec-
tion we combine these results with our finite-difference 1 X −2πi2n xk
|xi 7→ √ e |ki.
methods. Before describing the algorithm, we introduce 2n k∈G
n
appropriate representation of our qubit strings suitable
for fixed-point arithmetics. 4: Measure each input register j and denote the
measurement outcome by kj .
Definition 5.1. For every b ∈ {0, 1}n , let j (b) ∈ 5: Output (k1 , k2 , . . . , kd ) as the gradient estimate
{0, . . . , 2n −1} be the integer corresponding to the binary
string b = (b1 , . . . , bn ). We label the n-qubit basis state
|b1 i|b2 i · · · |bn i by |x(b) i, where Lemma 5.1. Let N = 2n , c ∈ R and g ∈ Rd such that
kgg k∞ ≤ 1/3. If f : Gdn → R is such that
(b)
j 1
x(b) = n − + 2−n−1 . 1
2 2 (5.9) |f (x
x) − g · x − c| ≤ ,
42πN
We denoten the set of corresponding labels o as
j (b) d
Gn := 1
2n − 2 + 2
−n−1
: j ∈ {0, . . . , 2 − 1} . for all but a 1/1000 fraction of the points x ∈ Gn , then
(b) n
the output of Algorithm 1 satisfies:
Note that there is a bijection between {j (b) }b∈{0,1}n
and {x(b) }b∈{0,1}n , so we use |x(b) i and |j (b) i inter- Pr[|ki − gi | > 4/N ] ≤ 1/3 for every i ∈ [d].
changeably in Remark 5.1. In the rest of this section
we always label n-qubit basis states by elements of Gn . Proof. First, note that |G | = N from Definition 5.1.
n
Consider the following quantum states
Definition 5.2. For x ∈ Gn we define the Fourier
transform of a state |xi as 1 X
|φi = √ e2πiN f (xx) |x
xi,
Nd x∈Gd
1 X 2πi2n xk n
QF TGn : |xi 7→ √ e |ki. 1 X
2n k∈G
n
|ψi = √ e2πiN (gg ·xx+c) |x
xi.
Nd x ∈Gd
n

We prove the following in the full version of this


paper [22]: Note that |φi is the state we obtain in Algorithm 1
after line 2 and |ψi is its “ideal version” that we try
Claim 5.1. This unitary is the same as the usual quan- to approximate with |φi. Observe that the “ideal” |ψi
tum Fourier transform up to conjugation with a tensor is actually a product state:
product of n single-qubit unitaries.
d 
O 1 X 2πiN g` ·x` 
Now we are ready to precisely describe Jordan’s |ψi = √ e |x` i
N x` ∈Gn
quantum gradient computation algorithm. `=1

Copyright © 2019 by SIAM


1433 Unauthorized reproduction of this article is prohibited
It is easy to see that after applying the inverse where the first inequality used |eiz − eiy | ≤ |z − y|, and
Fourier transform to each register in |ψi, we obtain the second inequality is by the assumptions on f .
d 
O 1 X  In the following theorem we assume that we have
e2πiN x` (g` −k` ) |k` i access to (a high power of) a phase oracle of a function
N
`=1 x` ,k` ∈G2n f that is very well approximated by an affine linear
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

Suppose we make a measurement and observe function g · z + c on adhypergrid with edge-length r ∈ R


(k1 , . . . , kd ). As shown in the analysis of phase esti- around some y ∈ R . We show that if the relative
mation [38], we have10 : for every i ∈ [d] (for a fixed precision of the approximation is precise enough, then
accuracy parameter κ > 1), the following holds: Algorithm 1 can compute an approximation of g (the
“gradient”) with small query and gate complexity.
h κi 1
Pr |ki − gi | > ≤ for every i ∈ [d]. Theorem 5.1. Let c ∈ R, r, ρ, ε < M ∈ R+ , and y , g ∈
N 2(κ − 1)
Rd such that kgk∞ ≤ M . Let nε := dlog2 (4/(rε)))e,
By fixing κ = 4, we obtain the desired conclusion of nM := dlog (3rM )e and n := nε + nM . Suppose
2
the theorem, i.e., if we had access to |ψi (instead of f : y + rGd → R is such that
n
|φi), then we would get a 4/N -approximation of each
εr
coordinate of the gradient with probability at least |f (yy + rx x) − g · rxx − c| ≤
8 · 42π
5/6. It remains to show that this probability does not
change more than 1/3 − 1/6 = 1/6 if we apply the for all but a 1/1000 fraction of the points x ∈ Gdn . If we

2πi2 f y +rx
(y x)
Fourier transform to |φi instead of |ψi. Observe that have access to a phase oracle O : |x xi → e |x
xi
d
the difference in the probability of any measurement acting on H = Span{|x xi : x ∈ Gn }, then using
d
outcome on these states is bounded by twice the trace Algorithm 1 we can calculate a vector g̃g ∈ R such that
distance between |ψi and |φi which is Pr[ kg̃g − g k∞ > ε] ≤ ρ,
q  
with O log dρ

2 queries to O and with gate complexity
k|ψihψ| − |φihφ|k1 = 2 1 − |hψ|φi| ≤ 2k|ψi − |φik.
 
d M d M
  
Since the Fourier transform is unitary and does not O d log ρ log ε log log ρ log log ε .
change the Euclidean distance, it is sufficient to show
that k|ψi − |φik ≤ 1/12 in order to conclude the Proof. Let NM := 2 , N := 2 , and h(x
nM n x) := f (yyN+rx
M
x)
,
r c εr 1
theorem. Let S ⊆ Gdn denote the set of points satisfying then h(x)−gg NM x − NM ≤ 8·42πNM ≤ 42πN . Note that
Eq. (5.9). We conclude the proof of the theorem O = O2πN h , therefore Algorithm 1 yields and output
2 2
by showing k|ψi − |φik ≤ (1/12) : we upper bound g̃g , which, as shown by Lemma 5.1, is such that that
2
k|φi−|ψik as follows for
each i ∈ [d] with probability at least 2/3 we have
g̃i − r gi ≤ 4 . Thus also NM g̃i −gi ≤ 4NM ≤ ε. By

NM N r rN
1 X 2πiN f (xx) 2

d e − e2πiN (gg ·xx+c) repeating the procedure O(log(d/ρ)) times and taking
N the median coordinate-wise we get a vector g̃g med , such
x ∈Gd n

1 X 2 that kg̃g med − g k∞ ≤ ε with probability at least (1 − ρ).


− e2πiN (gg ·xx+c)
2πiN f (xx)
= d e

N The gate complexity statement follows from the fact
x ∈S that the complexity of Algorithm 1 is dominated by
1 X 2πiN f (xx) 2
+ d − e2πiN (gg ·xx+c) that of the d independent quantum Fourier transforms,
e

N each of which can be approximately implemented using
x ∈Gd
n \S
O(n log n) gates. We repeat the procedure O(log(d/ρ))
1 X 1 X
≤ d |2πN f (x x) − 2πN (gg · x + c)|2+ d 4 times, which amounts to O(d log(d/ρ)n log n) gates.
N N At the end we get d groups of numbers each con-
x ∈S x ∈Gd
n \S
d taining O(log(d/ρ)) numbers with n bits of preci-
1 X |G \ S|
= d (2πN )2 |f (x x) − (gg · x + c)|2+ 4 n d sion. We can sort each group with a circuit hav-
N N ing O(log(d/ρ) log log(d/ρ)n log n) gates.11 So the fi-
x ∈S

1 X 1
 2
4 1 1

1
2 nal gate complexity is O(d log(d/ρ) log log(d/ρ)n log n),
≤ d + ≤ + < , which gives the stated gate complexity by observing that
N 21 1000 441 250 12
x ∈S n = O(log(M/ε)).
10 Note that our Fourier transform is slightly altered, but the 11 Note that using the median of medians algorithm [8] we could

same proof applies as in [38, (5.34)]. In fact this result can do this step with O(log(d/ρ)n) time complexity, but this result
be directly translated to our case by considering the unitary probably does not apply to the circuit model, which is somewhat
conjugations proven in Remark 5.1. weaker than e.g. a Turing machine.

Copyright © 2019 by SIAM


1434 Unauthorized reproduction of this article is prohibited
5.3 Improved quantum gradient algorithm us- In the full version of this paper [22] we prove the
ing higher-degree methods As Theorem 5.1 shows following lemma, showing that if f : R → R is (2m +
Jordan’s algorithm works well if the function is very 1)-times continuously differentiable, then the central-
close to linear function over a large hyprecube. How- difference formula in Eq. (2.6) is a good approximation
ever, in general even highly regular functions tend to to f 0 (0). Eventually we also generalize this to the
quickly diverge from their linear approximations. To setting where f : Rd → R is higher-dimensional.
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

tackle this problem we borrow ideas from numerical


analysis and use higher-degree finite-difference formu- Lemma 5.2. Let δ ∈ R+ , m ∈ N and suppose f :
las to extend the range of approximate linearity. [−mδ, mδ] → R is (2m + 1)-times differentiable. Then
We will apply Jordan’s algorithm to the approxi-
(5.10)
mate finite-difference linearization of the function rather m

than the function itself. We illustrate the main idea on 0 0 X (2m)

f (0)δ − f(2m)(δ) = f (0)δ − a` f (`δ)

a simple example: suppose we want to calculate the gra-
`=−m

dient at 0 , then we could use the 2-point approximation m

≤ e− 2 f (2m+1) |δ|2m+1 ,

x) − f (−x
(f (x x))/2 instead of f , which has the advantage

that it cancels out even order contributions. The corre-
n
xi → e2 πi(f (xx)−f (−xx)) |x

sponding phase oracle |x xi is also where f (2m+1) ∞ := supξ∈[−mδ,mδ] |f (2m+1) (ξ)| and
easy to implement as the product of the oracles: a`
(2m)
is defined in Definition 2.4. Moreover
n
x)
x i = e2
+ phase oracle: |x πif (x
|x
xi and m
X m
(2m) X 1

n (5.11) a` < ≤ ln(m) + 1.
xi = e−2
− phase oracle: |x x)
πif (−x
|x
xi. `
`=0 `=1

We use high-order central-difference approximation This lemma shows that for small enough δ the approxi-
formulas. There are a variety of other related formulas mation error in (2.6) is upper bounded
by a factor pro-
[34], but we stick to the central difference because the portional to δ 2m+1 . If f (2m+1) ∞ ≤ cm for all m and
absolute values of the coefficients in this formula scale we choose δ ≤ 1/c, then the approximation error be-
favorably with the approximation degree. Since we only comes exponentially small in m, motivating the use of
consider central differences, all our approximations have higher-degree methods in our modified gradient com-
even degree, which is sufficient for our purposes as we putation algorithm. We generalize this statement to
are interested in the asymptotic scaling. Nevertheless, higher dimensions in the full version of this paper [22],
it is not difficult to generalize our approach using other which leads to our first result regarding our improved
formulas [34] that can provide odd-degree approxima- algorithm: (for the definition of the directional partial
tions as well. derivative ∂r2m+1 f see Definition 2.3)

Algorithm 2 Improved quantum gradient computation Theorem 5.2. Let m ∈ Z+ , D ∈ R+ and B ≥ 0.


Suppose f : [−D, D]d → R is given with phase oracle
Input: A point x ∈ Rd and a function f : Rd → R access. If f is (2m + 1)-times differentiable and for all
with probability or (fractional) phase oracle access x ∈ [−D, D]d we have that
Parameters: m : interpolation order; R : rescaling
Function transformation: g(yy ) := R · f (x x +yy /R) |∂r2m+1 f (x
x)| ≤ B for r = x /kx
xk,
Oracle implementation:
then using Algorithm 2 with setting
m   (2m)  
tRa
I ⊗ S`† · Of `
Y
Ogt(2m) :=
s √
· I ⊗ S` ,
  
√ 2m B d m
`=−m R = Θmax d , 
ε D
(2m)
tRa
where Of ` is implemented by Corollary 4.1
or as a product of (fractional) phase oracles; we can compute an approximate gradient g such that
tRa
(2m) kgg− ∇f (00)k∞ ≤ ε with probability at least (1−ρ), using
I acts on the ancilla qubits used by Of ` , and R
  d 
x + `yy /Ri for all y ∈ G evaluation points O ε log(2m) + m log ρ
S : |yy i 7→ |x
`
phase queries.
used by Algorithm 1
Use Jordan’s Algorithm 1: compute ∇g(2m) (00) Suppose that D = Θ(1), and f is a multi-variate
Output: Approximation of ∇g(2m) (00) = ∇f (x
x) polynomial of degree k. Then for m = dk/2e we get
that B = 0, as can be seen by using (2.5), therefore the

Copyright © 2019 by SIAM


1435 Unauthorized reproduction of this article is prohibited
  
above result gives an O
e k
ε log d
ρ query algorithm. If for all but a 1/1000 fraction of points y ∈ r · Gdn .
2 ≤ k = O(log(d)), then this result gives an exponential We can use this result to analyze the complexity
speedup in terms of the dimension d compared to of Algorithm 1 when applied to functions evaluated
Jordan’s algorithm. For comparison note that other with using a central-difference formula. In particular
recent results concerning quantum gradient descent also it makes it easy to prove the following theorem, which
work under the assumption that the objective function
Downloaded 04/10/23 to 203.110.242.40 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

is one of our main results.


is a polynomial [40, 30].
However, we argue in the full version of this pa- Theorem 5.4. Let x ∈ Rd , ε ≤ c ∈ R+ be fixed
per [22] that for non-polynomial functions we can have constants and suppose f : Rd → R is analytic13 and
B ≈ dm even under strong regularity conditions. This satisfies the following: for every k ∈ N and α ∈ [d]k
e d query algorithm, achieving the

then results in an O k
x)| ≤ ck k 2 .
ε √ |∂α f (x
desired scaling in ε but failing to capture the sought d √
scaling. In order to tackle the non-polynomial case we Using Algorithm 2 with setting m = log(c d/ε) and
need to introduce some smoothness conditions.  √ 
R = Θ cm d
5.4 Smoothness conditions and approximation
error bounds In this subsection we show how to we can compute an ε-approximate gradient ∇f ˜ (x
x ) ∈ Rd
improve the result of Theorem 5.2, assuming some such that
smoothness conditions. The calculus gets a bit involved, ∇f (x
˜ (x
x) − ∇f x) ≤ ε,


because we need to handle higher-dimensional analysis.  √ 
In order to focus on the main results, we state the with probability at least 1 − δ, using O e c d log d
ε δ
main result here and refer an interested reader to the queries to a probability or (fractional) phase oracle of f .
full version of this paper [22]. We show that under
reasonable smoothness assumptions, the complexity of Proof. Let g(yy ) := f (x x + y ). By Theorem 5.3 we

our quantum algorithm is O( e d/ε) and in the next know that for a uniformly random y ∈ r · Gdn we
P∞  √ k
section show that for a specific class of smooth functions have |∇g(00)yy − g(2m) (yy )| ≤ k=2m+1 8rcm d with
this is in fact optimal up to polylogarithmic factors.
probability at least 999/1000. Now we choose r such
A key ingredient of our proof is the following lemma, εr
that this becomes smaller that 8·42π . Now let us define
bounding the “second moment” of higher-order bounded √  √ 1/(2m)
tensors, proven in the full version of this paper [22]. R := r−1 := 9cm d 81 · 8 · 42πcm d/ε , then

Lemma 5.3. Let d, k ∈ N+ , and suppose H ∈ we get 8rcm d = 8 81 · 8 · 42πcm d/ε
 √ −1/(2m)
⊗k 9 and so
Rd is an order k tensor of dimension d, hav-
ing all elements bounded by 1 in absolute value, i.e., X∞  √ k
kHk∞ ≤ 1. Suppose {x1 , . . . , xd } are i.i.d. symmet- 8rcm d
ric random variable bounded in [−1/2, 1/2] and sat- k=2m+1
2k−1 √ 2m+1 X ∞  √ k
isfying E[(x i ) ] = 0 for every k ∈ N + . Then 
"
P √  q k
# = 8rcm d 8rcm d
P α∈[d]k Hα xα ≥ 2 r dk ≤ r12k .
k=0
2
−1 ∞  
ε  √ 2m X 8 k
≤ √ 81 · 8 · 42πcm d/ε
Using this lemma, in the full version of the paper [22] we prove the following result about analytic functions:¹²

Theorem 5.3. If r ∈ R₊, f : R^d → R is analytic and for all k ∈ N, α ∈ [d]^k we have

    |∂_α f(0)| ≤ c^k k^{k/2},

then for a uniformly random y ∈ r·G^d_n we have

    |∇f(0)·y − f_{(2m)}(y)| ≤ Σ_{k=2m+1}^∞ (8rcm√d)^k

with probability at least 999/1000.

¹² The functions examined in the following two theorems are essentially Gevrey class G^{1/2} functions [21].

The following theorem is one of our main results.

Theorem 5.4. Let x ∈ R^d, let ε ≤ c ∈ R₊ be fixed constants and suppose f : R^d → R is analytic¹³ and satisfies the following: for every k ∈ N and α ∈ [d]^k

    |∂_α f(x)| ≤ c^k k^{k/2}.

Using Algorithm 2 with the setting m = log(c√d/ε) and R = Θ(cm√d), we can compute an ε-approximate gradient ∇̃f(x) ∈ R^d such that

    ‖∇f(x) − ∇̃f(x)‖_∞ ≤ ε,

with probability at least 1 − δ, using Õ((c√d/ε)·log(d/δ)) queries to a probability or (fractional) phase oracle of f.

¹³ For convenience we assume in the statement that f can be evaluated at any point of R^d, but in fact we only evaluate it inside a finite ball around x. It is straightforward to translate the result when the function is only accessible on an open ball around x. However, a finite domain imposes restrictions on the evaluation points of the function. If x lies too close to the boundary, this might impose additional scaling requirements and thus potentially increases the complexity of the derived algorithm. Fortunately, in our applications it is natural to assume that f can be evaluated at distant points too, so we need not worry about this detail.

Proof. Let g(y) := f(x + y). By Theorem 5.3 we know that for a uniformly random y ∈ r·G^d_n we have |∇g(0)·y − g_{(2m)}(y)| ≤ Σ_{k=2m+1}^∞ (8rcm√d)^k with probability at least 999/1000. Now we choose r such that this bound becomes smaller than εr/(8·42π). Let us define

    R := r^{−1} := 9cm√d · (81·8·42π·cm√d/ε)^{1/(2m)};

then we get 8rcm√d = (8/9)·(81·8·42π·cm√d/ε)^{−1/(2m)}, and so

    Σ_{k=2m+1}^∞ (8rcm√d)^k = (8rcm√d)^{2m+1} · Σ_{k=0}^∞ (8rcm√d)^k
        ≤ (ε/(81·8·42π·cm√d)) · (81·8·42π·cm√d/ε)^{−1/(2m)} · Σ_{k=0}^∞ (8/9)^k
        = (ε/(9cm√d·8·42π)) · (81·8·42π·cm√d/ε)^{−1/(2m)} = εr/(8·42π),
where the inequality is by the choice of r, and the following equality uses Σ_{k=0}^∞ (8/9)^k = 9. By Theorem 5.1 we can compute an ε-approximation of the gradient of g_{(2m)} with O(log(d/δ)) queries to O^S_{g_{(2m)}}, where S = O(1/(εr)). Observe that

    O^S_{g_{(2m)}}|y⟩ = e^{iS·g_{(2m)}(y)}|y⟩ = e^{iS·Σ_{ℓ=−m}^{m} a_ℓ^{(2m)} g(ℓy)}|y⟩.

Using the relation between f and g, it is easy to see that the number of (fractional) phase queries to O_f that we need in order to implement a modified oracle call O^S_{g_{(2m)}} is

(5.11)    Σ_{ℓ=−m}^{m} ⌈|a_ℓ^{(2m)}|·S⌉ ≤ 2m + S·Σ_{ℓ=−m}^{m} |a_ℓ^{(2m)}|
(5.12)        ≤ 2m + S·(2 log(m) + 2).

Thus O^S_{g_{(2m)}} can be implemented using O(S·log(m) + m) (fractional) queries to O_f. Picking m = log(c√d/ε), the query complexity becomes O((c√d/ε)·m·log(m)),¹⁴ which is

(5.13)    O( (c√d/ε) · log(c√d/ε) · log log(c√d/ε) ).

¹⁴ If we strengthen the c^k k^{k/2} upper bound assumption on the derivatives to c^k, then we could improve the bound of Theorem 5.3 by a factor of k^{−k/2}. Therefore in the definition of R^{−1} we could replace m by √m, which would quadratically improve the log factor in (5.13).

The above achieves, up to logarithmic factors, the desired 1/ε scaling in the precision parameter and also the desired √d scaling with the dimension. This improves the results of [28] both quantitatively and qualitatively. We also show that the query complexity for this problem is almost optimal, by proving a lower bound in Section 6 which matches the above upper bound up to log factors.

5.5 Most quantum optimization problems are "smooth"  We now show that the condition on the derivatives in Theorem 5.4 is fairly reasonable, i.e., a wide range of probability oracles that arise from quantum optimization problems satisfy this condition. In particular, consider the function p : R^d → R that we looked at (see Eq. (3.7)) during the discussion of a generic model of quantum optimization algorithms:

    p(x) = ⟨0̄|U(x)†(|1⟩⟨1| ⊗ I)U(x)|0̄⟩.

We will now show that for every k ∈ N and index-sequence α ∈ [d]^k we have¹⁵ |∂_α p(x)| ≤ 2^k when U(x) is a product of d (controlled) rotations given by

    Rot(x_j) = exp( i x_j [[0, −i], [i, 0]] ) = e^{i x_j σ_y}

and other fixed unitaries. In order to prove this, we first use Lemma 5.4 to show that ‖∂_α U(x)‖ ≤ 1, which by Lemma 5.5 implies that ‖∂_α(U(x)†(|1⟩⟨1| ⊗ I)U(x))‖ ≤ 2^k, hence proving the claim. In fact, we prove slightly stronger statements, so that these lemmas can be used later in greater generality.

¹⁵ This essentially means that the function p(x) in the probability oracle is in the Gevrey class G⁰.

Lemma 5.4. Suppose γ ≥ 0 and

    U(x) = U₀ ∏_{j=1}^{d} ( P_j ⊗ e^{i x_j H_j} + (I − P_j) ⊗ I ) U_j,

where ‖U₀‖ ≤ 1 and for every j ∈ [d] we have that ‖U_j‖ ≤ 1, P_j is an orthogonal projection and H_j is Hermitian with ‖H_j‖ ≤ γ. Then for every k ∈ N and α ∈ [d]^k, we have that ‖∂_α U(x)‖ ≤ γ^k.

Proof. We have that ∂_α U(x) equals

    U₀ ∏_{j=1}^{d} ( P_j ⊗ (iH_j)^{|{ℓ∈[k] : α_ℓ=j}|} e^{i x_j H_j} + 0^{|{ℓ∈[k] : α_ℓ=j}|} (I − P_j) ⊗ I ) U_j

(with the convention 0⁰ = 1). Therefore ‖∂_α U(x)‖ can be upper bounded by the product of the norms of the individual terms in the product, hence

    ‖∂_α U(x)‖ ≤ ∏_{j=1}^{d} γ^{|{ℓ∈[k] : α_ℓ=j}|} = γ^k.

Lemma 5.5. Suppose that A(x), B(x) are linear operators parametrized by x ∈ R^d. If for all k ∈ N₀ and α ∈ [d]^k we have that ‖∂_α A‖ ≤ γ^k and ‖∂_α B‖ ≤ γ^k, then for all k ∈ N₀ and α ∈ [d]^k we get that ‖∂_α(AB)‖ ≤ (2γ)^k.

Proof. For an index-sequence α = (α₁, α₂, ..., α_k) ∈ [d]^k and a set S = {i₁ < i₂ < ... < i_ℓ} ⊆ [k] consisting of positions of the index-sequence, we define α_S := (α_{i₁}, α_{i₂}, ..., α_{i_ℓ}) ∈ [d]^{|S|} to be the index-sequence where we only keep indexes corresponding to positions in S; also let S̄ := [k] \ S. It can be seen that

    ∂_α(AB) = Σ_{S⊆[k]} (∂_{α_S} A)(∂_{α_S̄} B),

hence ‖∂_α(AB)‖ can be upper bounded by

    Σ_{S⊆[k]} ‖∂_{α_S} A‖·‖∂_{α_S̄} B‖ ≤ Σ_{S⊆[k]} γ^{|S|} γ^{k−|S|} = (2γ)^k.

Remark 5.1. Finally, note that the unitary in Lemma 5.4 is an analytic function of its parameters, therefore the probability that we get by taking products of such unitaries and some fixed matrices/vectors is also an analytic function.
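As a quick numerical illustration of these bounds (ours, not part of the paper's analysis), one can finite-difference the probability p(x) for a product of σ_y-rotations on a single qubit and observe that first-order partials indeed stay below 2:

    import numpy as np

    SY = np.array([[0, -1j], [1j, 0]])              # Pauli sigma_y

    def U(x):
        # Product of rotations e^{i x_j sigma_y}; each factor is cos(x_j) I + i sin(x_j) sigma_y.
        out = np.eye(2, dtype=complex)
        for xj in x:
            out = (np.cos(xj) * np.eye(2) + 1j * np.sin(xj) * SY) @ out
        return out

    def p(x):
        # p(x) = <0| U(x)^dag |1><1| U(x) |0>: probability of outcome 1 on U(x)|0>.
        psi = U(x) @ np.array([1.0, 0.0])
        return abs(psi[1]) ** 2

    def partial(f, x, j, h=1e-5):                    # central finite difference in direction j
        e = np.zeros(len(x)); e[j] = h
        return (f(x + e) - f(x - e)) / (2 * h)

    x = np.random.randn(4)
    assert abs(partial(p, x, 2)) <= 2.0              # |d_j p| <= 2^1, as proven above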

6 Lower bounds on gradient computation

In this section we prove that the number of phase oracle queries required to compute the gradient for some of the smooth functions satisfying the requirement of Theorem 5.4 is Ω(√d/ε), showing that Theorem 5.4 is optimal up to log factors. As we show in the full version [22], a probability oracle can be simulated using a logarithmic number of phase oracle queries, therefore this lower bound also translates to probability oracles.¹⁶

In the full version of this paper we prove Theorem 1.2, providing a lower bound on the number of queries needed in order to distinguish an unknown phase oracle from a family of phase oracles. This result is of independent interest, and has found an application for proving lower bounds on quantum SDP-solving [2]. Using Theorem 1.2 we prove our lower bound on gradient computation by constructing a family of functions which can be distinguished from the trivial phase oracle I by ε-precise gradient computation, but for which our hybrid-method based lower bound shows that distinguishing the phase oracles from I requires Ω(√d/ε) queries.

¹⁶ A probability oracle can only represent functions which map to [0, 1], whereas the range of the function f we use in the lower bound proof is a subinterval of [−1, 1]. However, by using the transformed function g := (2 + f)/4 we get a function whose range is contained in [1/4, 3/4], so it can in principle be represented by a probability oracle. Moreover, for a function with range contained in [1/4, 3/4] we can efficiently convert between phase and probability oracles, as shown in the full version [22].

6.1 A family of functions for proving the gradient computation lower bound  Now we prove our lower bound on gradient computation by constructing a family of functions F for which the corresponding phase oracles {O_f : f ∈ F} require Ω(√d/ε) queries to distinguish them from the constant 0 function (as shown by Theorem 1.2), but whose members can be uniquely identified by calculating their gradient at 0 with accuracy ε. In particular, this implies that calculating an approximation of the gradient vector for these functions must be at least as hard as distinguishing the phase oracles corresponding to functions in F.

Lemma 6.1. Let d ∈ N, ε, c ∈ R₊ and let us define the following R^d → R functions: f_*(x) := 0 and f_j(x) := 2εx_j e^{−c²‖x‖²/2} for all j ∈ [d]. Consider the family of functions F := ∪_{j∈[d]} {f_j(x)}; then for all x ∈ R^d we have that

    Σ_{j∈[d]} |f_j(x) − f_*(x)|² ≤ 4ε²/(ec²).

Proof.

    Σ_{j∈[d]} |f_j(x) − f_*(x)|² = Σ_{j∈[d]} |2εx_j e^{−c²‖x‖²/2}|²
        = 4ε²‖x‖² e^{−c²‖x‖²} ≤ 4ε²/(ec²),

where we used z·e^{−z} ≤ 1/e with z := c²‖x‖².

In the full version of this paper [22] we prove bounds on the partial derivatives of the above functions to determine their smoothness, resulting in the following lemma.

Lemma 6.2. Let d, k be positive integers, c ∈ R₊ and x ∈ R^d. Then the function f_j(x) := c·x_j e^{−c²‖x‖²/2} satisfies the following: for every index-sequence α ∈ [d]^k, the derivative of f_j is bounded by |∂_α f_j(0)| ≤ c^k k^{k/2}. Moreover, ∇f_j(0) = c·e_j.

Now we use the above lemmas, combined with the hybrid method (Theorem 1.2), to prove our general lower bound result, which implies the earlier stated informal lower bound of Theorem 1.1.

Theorem 6.1. Let ε, c, d > 0 be such that 2ε ≤ c, and for an arbitrary finite set G ⊆ R^d let

    H = Span{|x⟩ : x ∈ G}.

Suppose A is a T-query quantum algorithm (assuming query access to the phase oracle O_f : |x⟩ ↦ e^{if(x)}|x⟩, acting on H) for analytic functions f : R^d → R satisfying

    |∂_α f(0)| ≤ c^k k^{k/2}  for all k ∈ N, α ∈ [d]^k,

such that A computes an ε-approximation ∇̃f(0) of the gradient at 0 (i.e., ‖∇̃f(0) − ∇f(0)‖_∞ < ε), succeeding with probability at least 2/3. Then T > c√d/(4ε).

Proof. Inspired by Lemma 6.1, we first define a set of "hard" functions, which we will use to prove our lower bound. Let f_* := f₀ := 0 and f_j(x) := 2εx_j e^{−c²‖x‖²/2} for all j ∈ [d]. Consider the family of functions F := ∪_{j∈[d]}{f_j(x)}. By Lemma 6.2, every f ∈ F satisfies |∂_α f(0)| ≤ c^k k^{k/2} for all k ∈ N and α ∈ [d]^k.

Suppose we are given a phase oracle O_f acting on H, such that O_f = O_{f_j} : |x⟩ ↦ e^{if_j(x)}|x⟩ for some j ∈ {0, ..., d}. Since ∇f₀(0) = 0 and ∇f_j(0) = 2ε·e_j, using the T-query algorithm A in the theorem statement, one can determine the j ∈ {0, ..., d} for which f_j = f with success probability at least 2/3. In particular, we can distinguish the case f = f_* from f ∈ F, and thus by Theorem 1.2 and Lemma 6.1 we get that

    T ≥ (c/ε)·√(ed/36) > c√d/(4ε).
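The hard family above is simple enough to explore directly. This sketch (ours) checks the ℓ₂-deviation bound of Lemma 6.1 and illustrates why an ε-accurate gradient identifies the hidden index j:

    import numpy as np

    def f(j, x, eps, c):
        # f_j(x) = 2*eps*x_j*exp(-c^2 * ||x||^2 / 2), the "hard" instances of Lemma 6.1.
        return 2 * eps * x[j] * np.exp(-c**2 * np.dot(x, x) / 2)

    d, eps, c = 50, 1e-2, 1.0
    x = np.random.randn(d)
    deviation = sum(f(j, x, eps, c) ** 2 for j in range(d))
    assert deviation <= 4 * eps**2 / (np.e * c**2)          # Lemma 6.1

    # grad f_j(0) = 2*eps*e_j while grad f_0(0) = 0, so the coordinates of an
    # eps-accurate gradient estimate single out the unique entry of size 2*eps.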

6.2 Lower bound for more regular functions  Note that the functions to which we apply our results in this paper tend to satisfy a stronger condition than our lower bound example in Theorem 6.1. They usually satisfy¹⁷ |∂_α f(x₀)| ≤ c^k. We conjectured that the same lower bound holds for this subclass of functions as well.

Very recently, Cornelissen [16] managed to prove this conjecture (also building on our version of the hybrid method, Theorem 1.2). Moreover, he showed an Ω(d^{1/2 + 1/p}/ε) lower bound for ε-precise gradient computation in p-norm for every p ∈ [1, ∞]. This lower bound matches our upper bound¹⁸ up to logarithmic factors, showing that our algorithm is essentially optimal for a large class of gradient computation problems.

¹⁷ Without the k^{k/2} factor – i.e., they are of Gevrey class G⁰ instead of G^{1/2}.

¹⁸ If we want an ε-approximate gradient in p-norm, it suffices to compute an (εd^{−1/p})-approximate gradient in the ∞-norm due to Hölder's inequality.
most common) example of groundstate P estimation.
large class of gradient computation problems.
Consider a Hamiltonian of the form H P = j aj Uj where
Uj is a unitary matrix, aj > 0 and j aj = 1. This
7 Applications
assumption can be made w.l.o.g by renormalizing the
In this section, we first consider variational quan- Hamiltonian and absorbing signs into the unitary ma-
tum eigensolvers and QAOA algorithms, which can trix. Let the state |ψ(x x)i for x ∈ Rd be the varia-
be treated essentially identically using our techniques. tional state prepared by the Prep. and Tuned circuits
Then we consider the training of quantum autoencoders, in Fig. 1b. Our objective function is then to estimate
which requires a slightly different formalism. We show  
that our gradient descent algorithms can be applied to X
these problems by reducing such problems to a prob- (7.14) x opt = argminhψ(x x)| aj Uj |ψ(x
x)i,
x
ability maximization problem. For each application j

our quantum gradient computation algorithm yields a which is real valued because H is Hermitian.
quadratic speedup in terms of the dimension. In order to translate this problem to one that we can
handle using our gradient descent algorithm, we con-
7.1 Variational quantum eigensolvers In recent struct a verifier circuit that given |ψ(x x)i sets an ancilla
years, variational quantum eigensolvers and QAOA [39, qubit to 1 with probability p = (1 + hψ(x x)|H|ψ(x x)i)/2.
47, 19] are favored methods for providing low-depth This is possible since kHk ≤ 1 due to the assumption
quantum algorithms for solving important problems in that P a = 1. This motivates the definition of new
j j
quantum simulation and optimization. Current quan- input oracles used for implementing the Hamiltonian.
tum computers are limited by decoherence, hence the
option to solve optimization problems using very short X√
circuits can be enticing even if such algorithms are poly- (7.15) prepareW : |0i 7→ aj |ji,
nomially more expensive than alternative strategies that j
could possibly require long gate sequences. Since these (7.16)
X
selectH := |jihj| ⊗ Uj .
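For concreteness, here is a small numpy sketch (ours; a toy three-term, single-qubit Hamiltonian, with illustrative function names) of the two oracles, together with a sanity check that conjugating selectH by prepareW reproduces H, an identity derived in the proof of Corollary 7.1 below:

    import numpy as np

    I2 = np.eye(2); X = np.array([[0., 1.], [1., 0.]]); Z = np.diag([1., -1.])
    a = np.array([0.5, 0.3, 0.2])             # a_j > 0 with sum_j a_j = 1
    U_terms = [I2, X, Z]                       # unitary terms of H = sum_j a_j U_j

    def prepare_w():
        # Any unitary whose first column is (sqrt(a_j))_j implements prepareW.
        M = np.column_stack([np.sqrt(a), np.eye(3)[:, 1:]])
        q, _ = np.linalg.qr(M)
        return q * np.sign(q[0, 0])            # fix the sign so column 0 equals +sqrt(a)

    def select_h():
        # selectH = sum_j |j><j| (x) U_j: a block-diagonal, "multiplexed" unitary.
        return np.block([[U_terms[j] if j == k else np.zeros((2, 2))
                          for k in range(3)] for j in range(3)])

    W = np.kron(prepare_w(), I2)
    H_rec = (W.conj().T @ select_h() @ W)[:2, :2]   # (<0| (x) I) W^dag selectH W (|0> (x) I)
    assert np.allclose(H_rec, sum(aj * Uj for aj, Uj in zip(a, U_terms)))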
We can then use the unitaries (7.15)–(7.16) to define and compute the query complexity of performing a single variational step in a VQE algorithm.

Corollary 7.1. Let Tuned(x) = ∏_{j=1}^{d} e^{−iH_j x_j} for ‖H_j‖ = 1 and x ∈ R^d, and let Prep = I. If H = Σ_{j=1}^{M} a_j H_j for unitary H_j with a_j ≥ 0 and Σ_j a_j = 1, then the number of queries to prepareW, selectH and Tuned needed to output a qubit string |y⟩ such that ‖∇⟨ψ(x)|H|ψ(x)⟩ − y‖_∞ ≤ ε with probability at least 2/3 is Õ(√d/ε).

Proof. First we argue that the circuit in Fig. 2 outputs the claimed probability. We then convert this into

a phase oracle, use our improvement over Jordan's algorithm (Theorem 5.4), and prove that c ∈ Õ(1) for this problem, to show the claimed complexity.

Figure 2: [Circuit diagram: the registers |x⟩ and |0⟩^⊗n pass through the Prep. and Tuned gates; a register |0⟩^⊗n′ is acted on by prepareW followed by selectH, which is controlled by an ancilla prepared in |+⟩; finally the ancilla is measured in the |±⟩ basis (outcome |−⟩?).] Circuit for converting the groundstate energy to a probability for VQE. The dashed box denotes the verifier circuit, V, in Fig. 1b, which is implemented here using the Hadamard test. (To get the actual Hadamard test circuit, one should add a prepareW† gate. This gate is in fact unnecessary, therefore we omitted it.) The probability of measuring 1 is 1/2 − ⟨ψ(x)|H|ψ(x)⟩/2. In our description the Prep. circuit is just the identity gate, but in general it might be useful to apply some preprocessing.
First, if we examine the gates in Fig. 2, we note that the Prep. and Tuned oracles by definition prepare the state |ψ(x)⟩. In this context the Prep. circuit is the identity. While it could be trivially removed from the circuit, we retain it to match the formal description of the model that we gave earlier. Under the assumption that Σ_j a_j = 1, note that

(7.17)    ⟨0|⟨ψ(x)| prepareW†(selectH)prepareW |0⟩|ψ(x)⟩ = Σ_j Σ_k √(a_j a_k)·⟨k|j⟩ ⊗ ⟨ψ(x)|U_j|ψ(x)⟩
(7.18)        = ⟨ψ(x)|( Σ_j a_j U_j )|ψ(x)⟩ = ⟨ψ(x)|H|ψ(x)⟩.

Then, because prepareW is unitary, it follows that controlling the selectH operation enacts the controlled prepareW†(selectH)prepareW operation.

The claim regarding the probability then follows directly from the Hadamard test, which we prove below for completeness. Let Λ(U) be a controlled unitary operation. Then HΛ(U)H|0⟩|ψ(x)⟩ equals

    H(|0⟩|ψ(x)⟩ + |1⟩U|ψ(x)⟩)/√2
(7.19)    = |0⟩·(I + U)|ψ(x)⟩/2 + |1⟩·(I − U)|ψ(x)⟩/2.

Thus it follows from Born's rule that the probability of measuring the first register to be 1 is (1 − Re(⟨ψ|U|ψ⟩))/2. Combining this result with (7.18), and recalling that H is Hermitian, gives us that the probability of measuring 1 in the output of the circuit in Fig. 2 is 1/2 − ⟨ψ|H|ψ⟩/2, as claimed. Thus we have an appropriate probability oracle for the approximate groundstate energy expectation.
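The Hadamard-test bookkeeping above can be checked numerically; a minimal sketch (ours; plain numpy, one system qubit) follows:

    import numpy as np

    H_gate = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    def hadamard_test_prob1(U, psi):
        # Apply (H (x) I) . controlled-U . (H (x) I) to |0>|psi>; return P(ancilla = 1).
        dim = U.shape[0]
        CU = np.block([[np.eye(dim), np.zeros((dim, dim))],
                       [np.zeros((dim, dim)), U]])        # |0><0| (x) I + |1><1| (x) U
        HI = np.kron(H_gate, np.eye(dim))
        state = HI @ CU @ HI @ np.kron([1.0, 0.0], psi)
        return np.linalg.norm(state[dim:]) ** 2           # weight of the |1> branch

    U = np.array([[0.0, 1.0], [1.0, 0.0]])                # example: U = X
    psi = np.array([1.0, 0.0])
    assert np.isclose(hadamard_test_prob1(U, psi),
                      (1 - np.real(psi.conj() @ U @ psi)) / 2)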
Each query to the circuit of Fig. 2 requires O(1) queries to prepareW and selectH. Thus the probability oracle can be simulated at cost O(1) fundamental queries. Now if we remove the measurement from the circuit, we see that we can view the transformation as a circuit of the form

(7.20)    U|0̄⟩ = √(1/2 − ⟨ψ(x)|H|ψ(x)⟩/2)·|ψ_good⟩|1⟩ + √(1/2 + ⟨ψ(x)|H|ψ(x)⟩/2)·|ψ_bad⟩|0⟩.

As we show in the full version of this paper [22], for any δ ∈ (0, 1/3) we can simulate a δ-approximate query to the phase oracle analogue of U using O(log(1/δ)) applications of U. Since U requires O(1) fundamental queries, the phase oracle can be implemented using O(log(1/δ)) fundamental queries to selectH, prepareW, Tuned and Prep. From Theorem 5.4 it then follows that we can compute

    ∇(1/2 − ⟨ψ(x)|H|ψ(x)⟩/2) = −∇⟨ψ(x)|H|ψ(x)⟩/2

within error ε/2 and with error probability bounded above by 1/3, using Õ(c√d/ε) applications of selectH and prepareW. Our result then immediately follows if c ∈ Õ(1). This is equivalent to proving that for some c ∈ Õ(1) we have that |∂_{α₁}···∂_{α_k}⟨ψ|H|ψ⟩| ≤ c^k holds for all k ∈ N and α ∈ [d]^k.

We prove that for this application c ≤ 2. To see this, note that for any index sequence α ∈ [d]^k we can upper bound |∂_α⟨ψ(x)|H|ψ(x)⟩| by

(7.21)    ‖∂_α( ∏_{j=d}^{1} e^{iH_j x_j} · H · ∏_{j=1}^{d} e^{−iH_j x_j} )‖ ≤ Σ_{ℓ=1}^{M} |a_ℓ|·‖∂_α( ∏_{j=d}^{1} e^{iH_j x_j} · H_ℓ · ∏_{j=1}^{d} e^{−iH_j x_j} )‖.

Lemma 5.4 directly implies that

(7.22)    ‖∂_α ∏_{j=d}^{1} e^{iH_j x_j}‖ = 1,

and since H_ℓ is unitary and Hermitian for all ℓ we have

(7.23)    ‖H_ℓ · ∂_α ∏_{j=1}^{d} e^{−iH_j x_j}‖ = 1.

Finally, Lemma 5.5 and Eqs. (7.22)–(7.23) imply

    Σ_{ℓ=1}^{M} |a_ℓ|·‖∂_α( ∏_{j=d}^{1} e^{iH_j x_j} · H_ℓ · ∏_{j=1}^{d} e^{−iH_j x_j} )‖ ≤ Σ_{ℓ=1}^{M} |a_ℓ|·2^k = 2^k.

To compare the complexity of this procedure in an unambiguous way to that of existing methods, we need to consider a concrete alternative model for the cost. For classical methods, we typically assume that the a_j are known classically, as are the U_j that appear from queries to selectH. For this reason, the most relevant aspect to compare is the number of queries needed to the Tuned oracle. The number of state preparations needed to estimate the gradient is clearly Õ(d/ε²) using high-degree gradient methods on the empirically estimated gradients [47], if we assume that the gradient needs to be computed with constant error in ‖·‖_∞. In this sense, we provide a quadratic improvement over such methods when the selectH and prepareW oracles are sub-dominant to the cost of the state preparation algorithm.
The application of this method to QAOA directly follows from the analysis given above. There are many flavors of the quantum approximate optimization algorithm (QAOA) [19]. The core idea of the algorithm is to consider a parametrized family of states such as |ψ(x)⟩ = ∏_{j=1}^{d} e^{−ix_j H_j}|0⟩. The aim is to modify the state in such a way as to maximize an objective function. In particular, if we let O be a Hermitian operator corresponding to the objective function, then we wish to find x such that ⟨ψ(x)|O|ψ(x)⟩ is maximized. For example, in the case of combinatorial optimization problems the objective function is usually expressed as the number of satisfied clauses: O = Σ_{α=1}^{m} C_α, where C_α is 1 if and only if the αth clause is satisfied, and 0 otherwise [19]. Such clauses can be expressed as sums of tensor products of Pauli operators, which allows us to express them as Hermitian operators. Thus, from the perspective of our algorithm, QAOA looks exactly like variational quantum eigensolvers, except that the parameterization chosen for the state may be significantly different from that chosen for variational quantum eigensolvers.
quantum eigensolvers. E,D m
i∈[N ]

7.2 Quantum auto-encoders Classically, one ap-


19 There are other natural choices of dissimilarity functions that
plication of neural networks is auto-encoders, which are
networks that encode information about a data set into one might want to minimize, for a comprehensive overview of the
classical literature see [6].
a low-dimensional representation. Auto-encoding was 20 Note that some authors (including [41]) call F 0 = F 2 the
first introduced by Rumelhart et al. [42]. Informally, fidelity. The distortion measure we use here is P (ρ, σ) =
p
the goal of an auto-encoding circuit is the following: 1 − F 2 (ρ, σ), which is called the purified (trace) distance [44].

Observe that ⟨ψ|[(D∘E)(|ψ⟩⟨ψ|)]|ψ⟩ is the probability of finding the output state (D∘E)(|ψ⟩⟨ψ|) in state |ψ⟩ after performing the projective measurement {|ψ⟩⟨ψ|, I − |ψ⟩⟨ψ|}. Thus we can think about this as maximizing the probability of recovering the initial pure state after encoding and decoding, which is a natural measure of the quality of the quantum auto-encoding procedure.

7.2.1 Training quantum auto-encoders  Similarly to [45, 41], we describe a way to perform this optimization in the special case when the input states are n-qubit pure states and they are mapped to k-qubit states, i.e., H is the Hilbert space of n qubits and H′ is the Hilbert space of k < n qubits. We also show how our gradient computation algorithm can speed up solving the described optimization problem.

We observe that by adding a linear number of ancilla qubits we can represent the encoding and decoding channels by unitaries, which makes the minimization conceptually simpler. Indeed, by Stinespring's dilation theorem [46, Corollary 2.27], [32] we know that any quantum channel E that maps n-qubit states to k-qubit states can be constructed by adding 2k qubits initialized in the |0̄⟩ state, then acting with a unitary U_E on the extended space, and finally tracing out k + n qubits. Applying this result to both E and D results in a unitary circuit representing the generic encoding/decoding procedure, see Figure 3. (For D the corresponding upper bound on the required number of ancilla qubits becomes 2n.)
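Numerically, the dilation just described amounts to adjoining ancillae, applying a unitary, and tracing them out. A minimal sketch (ours; for simplicity the input and output dimensions are kept equal, whereas the channels above additionally discard system qubits):

    import numpy as np

    def apply_dilated_channel(U, rho, n_env):
        # E(rho) = Tr_env[ U (rho (x) |0..0><0..0|) U^dag ]; environment = last n_env qubits.
        d_env = 2 ** n_env
        env0 = np.zeros((d_env, d_env)); env0[0, 0] = 1.0
        big = U @ np.kron(rho, env0) @ U.conj().T
        d_sys = big.shape[0] // d_env
        big = big.reshape(d_sys, d_env, d_sys, d_env)
        return np.einsum('ikjk->ij', big)       # partial trace over the environment

    # Example: a SWAP dilation implements the reset channel rho -> |0><0|.
    SWAP = np.array([[1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1]], dtype=float)
    rho = np.array([[0.25, 0.1], [0.1, 0.75]])
    print(apply_dilated_channel(SWAP, rho, 1))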
Figure 3: [Circuit diagram: the input |ψ⟩ is prepared by Prep_ψ on n qubits; the encoder U_E acts on the input together with ancilla registers |0⟩^⊗2k and |0⟩^⊗(n−k) (ancillae for E); the decoder U_D acts together with ancillae |0⟩^⊗(n+k) (ancillae for D); finally Prep_ψ^{−1} is applied, and measuring |0⟩^⊗n on the result indicates success.] A unitary quantum auto-encoding circuit: for the input |ψ⟩, the circuit prepares |ψ⟩, applies a purified version of the channels E, D and finally checks by a measurement whether the decoded state is |ψ⟩.

In order to solve the maximization problem (7.26), we could just introduce a parametrization of the unitaries U_E, U_D and search for the optimal parameters using gradient descent. Unfortunately, a complete parametrization of the unitaries requires exponentially many parameters, which is prohibitive. However, analogously to, e.g., classical machine learning practice, one could hope that a well-structured circuit can achieve close to optimal performance using only a polynomial number of parameters. If the circuits U_E, U_D are parametrized nicely, so that Lemma 5.4 can be applied, then we can use our gradient computation algorithm to speed up the optimization.

Figure 4: [Circuit diagram: an index register is prepared in the uniform superposition (1/√m)·Σ_{i=1}^m |i⟩ and controls the state preparation Prep_ψ; the parametrized encoder U_E(x) and decoder U_D(x) act together with ancilla registers initialized in |0̄⟩; a final verification qubit is measured, where outcome |1⟩ indicates success. The blocks correspond to the Prep., Tuned and V stages of the generic model.] Quantum circuit which outputs 1 with probability equal to the objective function (7.26). The structure of the circuit fits the generic model of quantum optimization circuits (Figure 1), therefore we can use our gradient computation methods to speed up its optimization.

We can do the whole optimization using stochastic gradient descent [27], so that in each step we only need to consider the effect of the circuit on a single pure state. Or, if we have more quantum resources available, we can directly evaluate the full gradient by preparing a uniform superposition over all input vectors. In this case the state preparation unitary Prep = Σ_{i=1}^{m} |i⟩⟨i| ⊗ Prep_{ψ_i} is a controlled unitary, which controlled on index i prepares |ψ_i⟩. Graphically we represent this type of control by a small black square, in contrast to the small black circle used for denoting simple controlled unitaries. See the full quantum circuit in Figure 4.

Finally, note that in some applications it might be desirable to ask for a coherent encoding/decoding procedure, where all the ancilla qubits are returned to the |0̄⟩ state. In this case, similarly to [45, 41], one could define U_D = U_E^{−1} and optimize the probability of measuring |0̄⟩ on the ancilla qubits after applying U_E.
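To make the training loop concrete, here is a toy classical simulation (ours; a two-qubit ansatz compressing to one qubit, with finite-difference gradients standing in for the quantum gradient algorithm) of the objective (7.26):

    import numpy as np

    def ptrace_last(rho):
        # Trace out the last qubit of a 2-qubit density matrix.
        return np.einsum('ikjk->ij', rho.reshape(2, 2, 2, 2))

    def ry(theta):
        c, s = np.cos(theta / 2), np.sin(theta / 2)
        return np.array([[c, -s], [s, c]])

    def ansatz(x):
        # Toy two-qubit encoder/decoder: local Ry rotations followed by a CNOT.
        CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
        return CNOT @ np.kron(ry(x[0]), ry(x[1]))

    def objective(x, psis):
        # Average recovery fidelity <psi|(D o E)(|psi><psi|)|psi>, as in (7.26).
        UE, UD = ansatz(x[:2]), ansatz(x[2:])
        zero = np.array([[1.0, 0.0], [0.0, 0.0]])
        total = 0.0
        for psi in psis:
            rho_k = ptrace_last(UE @ np.outer(psi, psi.conj()) @ UE.conj().T)  # encode: 2 -> 1 qubit
            rho_out = UD @ np.kron(rho_k, zero) @ UD.conj().T                   # decode: 1 -> 2 qubits
            total += np.real(psi.conj() @ rho_out @ psi)
        return total / len(psis)

    psis = [np.eye(4)[0], np.eye(4)[1]]
    x, h = np.random.randn(4), 1e-4
    grad = np.array([(objective(x + h * e, psis) - objective(x - h * e, psis)) / (2 * h)
                     for e in np.eye(4)])
    x += 0.5 * grad   # one step of gradient ascent on the fidelity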

8 Conclusion and future research

We gave a new approach to quantum gradient computation that is asymptotically optimal (up to logarithmic factors) for a class of smooth functions, in terms of the number of queries needed to estimate the gradient within fixed error with respect to the max-norm. This is based on several new ideas, including the use of differentiation formulæ originating from high-degree interpolatory polynomials. These high-degree methods quadratically improve the scaling of the query complexity with respect to the approximation quality compared to what one would see if the results from Jordan's work were used. In the case of low-degree multivariate polynomials we showed that our algorithm can yield an exponential speedup compared to Jordan's algorithm or classical algorithms. We also provided lower bounds on the query complexity of the problem for certain smooth functions, revealing that our algorithm is essentially optimal for a class of functions.

While it has proven difficult to find natural applications for Jordan's original algorithm, we provide in this paper several applications of our gradient descent algorithm to areas ranging from machine learning to quantum chemistry simulation. These applications are built upon a method we provide for interconverting between phase and probability oracles. The polynomial speedups that we see for these applications are made possible by our improved quantum gradient algorithm via the use of this interconversion process. It would be interesting to find applications where we can apply the results for low-degree multivariate polynomials, providing an exponential speedup.

More work remains to be done in developing quantum techniques that speed up more sophisticated, higher-level methods, e.g., stochastic gradient descent. Another interesting question is whether quantum techniques can provide further speedups for calculating higher-order derivatives, such as the Hessian, using ideas related to Jordan's algorithm, see e.g. [29, Appendix D]. Such improvements might open the door to improved quantum analogues of Newton's method, and in turn substantially improve the scaling of the number of epochs needed to converge to a local optimum.

Acknowledgements

The authors thank Ronald de Wolf, Arjan Cornelissen, Yimin Ge and Robin Kothari for helpful suggestions and discussions. A.G. and N.W. thank Sebastien Bubeck, Michael Cohen and Yuanzhi Li for insightful discussions about accelerated gradient methods. A.G. thanks Joran van Apeldoorn for drawing his attention to the "cheap gradient principle".

References

[1] A. Ambainis, Variable time amplitude amplification and quantum algorithms for linear algebra problems, in STACS, 2012, pp. 636–647. arXiv: 1010.4458
[2] J. v. Apeldoorn and A. Gilyén, Improvements in quantum SDP-solving with applications. arXiv: 1804.05058, 2018.
[3] J. v. Apeldoorn, A. Gilyén, S. Gribling, and R. d. Wolf, Quantum SDP-solvers: Better upper and lower bounds, in FOCS, 2017, pp. 403–414. arXiv: 1705.01843
[4] J. v. Apeldoorn, A. Gilyén, S. Gribling, and R. d. Wolf, Convex optimization using quantum oracles. arXiv: 1809.00643, 2018.
[5] E. M. Azoff, Neural Network Time Series Forecasting of Financial Markets, John Wiley & Sons, New York, NY, USA, 1st ed., 1994.
[6] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in Unsupervised and Transfer Learning - Workshop held at ICML 2011, 2012, pp. 37–50.
[7] C. H. Bennett, E. Bernstein, G. Brassard, and U. Vazirani, Strengths and weaknesses of quantum computing, SIAM J. Comp., 26 (1997), pp. 1510–1523. arXiv: quant-ph/9701001
[8] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan, Time bounds for selection, J. Comp. Sys. Sci., 7 (1973), pp. 448–461.
[9] F. G. S. L. Brandão, A. Kalev, T. Li, C. Y.-Y. Lin, K. M. Svore, and X. Wu, Quantum SDP solvers: Large speed-ups, optimality, and applications to quantum learning. arXiv: 1710.02581, 2017.
[10] F. G. S. L. Brandão and K. M. Svore, Quantum speed-ups for solving semidefinite programs, in FOCS, 2017, pp. 415–426. arXiv: 1609.05537
[11] D. Bulger, Quantum basin hopping with gradient-based local optimisation. Unpublished. arXiv: quant-ph/0507193, 2005.
[12] D. Bulger, Quantum computational gradient estimation. Unpublished. arXiv: quant-ph/0507109, 2005.
[13] S. Chakrabarti, A. M. Childs, T. Li, and X. Wu, Quantum algorithms and lower bounds for convex optimization. arXiv: 1809.01731, 2018.
[14] A. M. Childs, R. Kothari, and R. D. Somma, Quantum algorithm for systems of linear equations with exponentially improved dependence on precision, SIAM J. Comp., 46 (2017), pp. 1920–1950. arXiv: 1511.02306
[15] R. Cleve, D. Gottesman, M. Mosca, R. D. Somma, and D. L. Yonge-Mallo, Efficient discrete-time simulations of continuous-time quantum query algorithms, in STOC, 2009, pp. 409–416. arXiv: 0811.4428
[16] A. Cornelissen, Quantum gradient estimation and its application to quantum reinforcement learning, Master's thesis, Technische Universiteit Delft, 2018.
[17] C. Dürr and P. Høyer, A quantum algorithm for finding the minimum. arXiv: quant-ph/9607014, 1996.
[18] C. Dürr, M. Heiligman, P. Høyer, and M. Mhalla, Quantum query complexity of some graph problems, SIAM J. Comp., 35 (2006), pp. 1310–1328. arXiv: quant-ph/0401091
[19] E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm. arXiv: 1411.4028, 2014.
[20] E. Farhi and H. Neven, Classification with quantum neural networks on near term processors. arXiv: 1802.06002, 2018.
[21] M. Gevrey, Sur la nature analytique des solutions des équations aux dérivées partielles. Premier mémoire., Annales Scientifiques de l'École Normale Supérieure, 3 (1918), pp. 129–190.

[22] A. Gilyén, S. Arunachalam, and N. Wiebe, Optimizing quantum optimization algorithms via faster quantum gradient computation. arXiv: 1711.00465v3. Full version of this paper, 2017.
[23] A. Gilyén, Y. Su, G. H. Low, and N. Wiebe, Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. arXiv: 1806.01838, 2018.
[24] L. K. Grover, A fast quantum mechanical algorithm for database search, in STOC, 1996, pp. 212–219. arXiv: quant-ph/9605043
[25] Y. Hamoudi and F. Magniez, Quantum Chebyshev's inequality and applications. arXiv: 1807.06456, 2018.
[26] A. W. Harrow, A. Hassidim, and S. Lloyd, Quantum algorithm for linear systems of equations, Phys. Rev. Lett., 103 (2009), p. 150502. arXiv: 0811.3171
[27] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, Accelerating stochastic gradient descent. arXiv: 1704.08227, 2017.
[28] S. P. Jordan, Fast quantum algorithm for numerical gradient estimation, Phys. Rev. Lett., 95 (2005), p. 050501. arXiv: quant-ph/0405146
[29] S. P. Jordan, Quantum Computation Beyond the Circuit Model, PhD thesis, Massachusetts Institute of Technology, 2008. arXiv: 0809.2307
[30] I. Kerenidis and A. Prakash, Quantum gradient descent for linear systems and least squares. arXiv: 1704.04992, 2017.
[31] I. Kerenidis and A. Prakash, A quantum interior point method for LPs and SDPs. arXiv: 1808.09266, 2018.
[32] M. Keyl, Fundamentals of quantum information theory, Phys. Rep., 369 (2002), pp. 431–548. arXiv: quant-ph/0202122
[33] P. C. S. Lara, R. Portugal, and C. Lavor, A new hybrid classical-quantum algorithm for continuous global optimization problems, J. Global Optimization, 60 (2014), pp. 317–331. arXiv: 1301.4667
[34] J. Li, General explicit difference formulas for numerical differentiation, Journal of Computational and Applied Mathematics, 183 (2005), pp. 29–52.
[35] G. H. Low and I. L. Chuang, Hamiltonian simulation by qubitization. arXiv: 1610.06546, 2016.
[36] A. Montanaro, Quantum speedup of Monte Carlo methods, Proc. Roy. Soc. A, 471 (2015). arXiv: 1504.06987
[37] A. Nayak and F. Wu, The quantum query complexity of approximating the median and related statistics, in STOC, 1999, pp. 384–393. arXiv: quant-ph/9804066
[38] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum information, Cambridge University Press, 2000.
[39] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, A variational eigenvalue solver on a photonic quantum processor, Nat. Comm., 5 (2014). arXiv: 1304.3061
[40] P. Rebentrost, M. Schuld, F. Petruccione, and S. Lloyd, Quantum gradient descent and Newton's method for constrained polynomial optimization. arXiv: 1612.01789, 2016.
[41] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum autoencoders for efficient compression of quantum data, Quantum Science and Technology, 2 (2017), p. 045001. arXiv: 1612.02806
[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., MIT Press, Cambridge, MA, USA, 1986, pp. 318–362.
[43] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, Circuit-centric quantum classifiers. arXiv: 1804.00633, 2018.
[44] M. Tomamichel, R. Colbeck, and R. Renner, Duality between smooth min- and max-entropies, IEEE Transactions on Information Theory, 56 (2010), pp. 4674–4681. arXiv: 0907.5238
[45] K. H. Wan, O. Dahlsten, H. Kristjánsson, R. Gardner, and M. S. Kim, Quantum generalisation of feedforward neural networks. arXiv: 1612.01045, 2016.
[46] J. Watrous, The Theory of Quantum Information, Cambridge University Press, 2018.
[47] D. Wecker, M. B. Hastings, and M. Troyer, Progress towards practical quantum variational algorithms, Phys. Rev. A, 92 (2015), p. 042303. arXiv: 1507.08969
[48] N. Wiebe, A. Kapoor, and K. M. Svore, Quantum nearest-neighbor algorithms for machine learning, Quant. Inf. & Comp., 15 (2015), pp. 318–358. arXiv: 1401.2142
