
NIPS 2009 Demonstration: Binary Classification using Hardware Implementation of Quantum Annealing


Hartmut Neven
Google
neven@google.com
Vasil S. Denchev
Purdue University
vdenchev@purdue.edu
Marshall Drew-Brook
D-Wave Systems Inc.
marshall@dwavesys.com
Jiayong Zhang
Google
jiayong@google.com
William G. Macready
D-Wave Systems Inc.
wgm@dwavesys.com
Geordie Rose
D-Wave Systems Inc.
rose@dwavesys.com
December 7, 2009
Abstract
Previous work [NDRM08, NDRM09] has sought the development of binary classifiers that exploit the ability to better solve certain discrete optimization problems with quantum annealing. The resultant training algorithm was shown to offer benefits over competing binary classifiers even when the discrete optimization problems were solved with software heuristics. In this progress update we provide first results on training using a physical implementation of quantum annealing for black-box optimization of Ising objectives. We successfully build a classifier for the detection of cars in digital images using quantum annealing in hardware. We describe the learning algorithm and motivate the particular regularization we employ. We provide results on the efficacy of hardware-realized quantum annealing, and compare the final classifier to software-trained variants and a highly tuned version of AdaBoost.
1 Introduction
In a recent series of publications [NDRM08, NDRM09] we have developed a new approach to train binary classifiers. The new training procedure is motivated by the desire to capitalize on improvements in heuristic solvers (both software and hardware solvers) for a class of discrete optimization problems. The discrete optimization problem is used to identify a subset of weak classifiers from within a dictionary that vote on the class label. Explicit regularization favors small voting subsets. In [NDRM08] the general approach was outlined, and in [NDRM09] the algorithm was adapted to allow for scaling to arbitrarily large dictionaries while solving discrete optimization problems of bounded size. Theoretical advantages of the proposed approach have been considered, and when tested using heuristic software solvers the resulting procedure was shown to offer benefits in both sparsity and test set accuracy over competing alternatives like AdaBoost.
In this progress update we report first results on the use of an optimization heuristic embodied in hardware realizing an implementation of quantum annealing (a simulated-annealing-like heuristic that uses quantum instead of thermal fluctuations). The learning objective, which identifies the sparse voting set of weak classifiers, is formulated as a Quadratic Unconstrained Binary Optimization (QUBO), and mapped to an Ising model which models the low-energy physics of hardware in development by D-Wave. Time evolution according to the Ising model with quantum interactions present in the hardware allows the hardware to realize a physical implementation of quantum annealing (QA). The optimizing ability of this hardware heuristic is harnessed to allow training of a large scale classifier for the detection of cars in digital images.

A more detailed study will be published elsewhere, but this report provides details on the algorithm and the experiment on hardware, and compares the results to classical (non-quantum) alternatives. We begin by introducing QUBOs and their relation to the Ising model that is realized in hardware. We show how QUBOs can be optimized by mapping to Ising form and evolving the dynamics to realize a physical implementation of quantum annealing. We note some of the constraints that the hardware implementation imposes, as these motivate the design choices in the construction of a classification algorithm. With this background to quantum annealing on hardware, we apply QUBO-solving to binary classification. We describe how a strong classifier is constructed from a dictionary of weak confidence-weighted classifiers, showing how the primary limitations imposed by the hardware, namely small sparsely-connected QUBOs, can be overcome. We validate the resultant algorithm with an experiment where the classifier is trained on hardware consisting of 52 sparsely connected qubits. We compare the final classifier to one trained with AdaBoost and other software solvers. Preliminary results are encouraging: we obtain a classifier that employs fewer weak classifiers than AdaBoost and whose training set error is only marginally worse. As this work is but the first step at implementing the algorithm in hardware, we conclude with a discussion of next steps.
2 Background
2.1 QUBOs and the Ising model
The QUBO problem seeks the minimizer of a quadratic function of N Boolean variables:[1]

    \mathrm{QUBO}(Q): \quad w^* = \operatorname{argmin}_w \; w^\top Q \, w, \qquad w_i \in \{0, 1\}.

Terms linear in w are captured on the diagonal of the N x N matrix Q. QUBO is NP-hard, and is related to the Ising problem

    \mathrm{ISING}(h, J): \quad s^* = \operatorname{argmin}_s \big[ s^\top J s + h^\top s \big], \qquad s_i \in \{-1, +1\},

through the change of variables s = 2w - 1. The s variables are called spins. D-Wave hardware approximates s^* through a hardware realization of quantum annealing. The hardware is based on a system of interacting qubits realized as loops of superconductor called Josephson junctions. The physical description of the qubits is well captured at low energies by the Ising model augmented with additional terms arising from quantum interactions. These quantum interactions are harnessed in the hardware to realize a quantum annealing algorithm for addressing ISING. Unlike previous studies of quantum annealing [KTM09, SKT+09] in machine learning, the process is not simulated in software but embodied in physical quantum dynamics.

[1] Vectors are indicated in lowercase bold, and matrices in uppercase bold.
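To make the change of variables concrete, the following sketch (using NumPy; the function name qubo_to_ising and the explicit offset bookkeeping are ours, not part of the paper) converts a QUBO matrix Q into Ising parameters (h, J) plus a constant offset under s = 2w - 1.

```python
import numpy as np

def qubo_to_ising(Q):
    """Convert w^T Q w (w_i in {0,1}) into s^T J s + h^T s + offset (s_i in {-1,+1})
    under the change of variables s = 2w - 1, i.e. w = (s + 1)/2."""
    Q = np.asarray(Q, dtype=float)
    J = Q / 4.0
    np.fill_diagonal(J, 0.0)                      # s_i^2 = 1, so diagonal terms become constants
    h = (Q.sum(axis=1) + Q.sum(axis=0)) / 4.0     # row plus column sums
    offset = (Q.sum() + np.trace(Q)) / 4.0        # constant shift, irrelevant for the argmin
    return J, h, offset

# sanity check: both forms give the same objective value on a random instance
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5))
J, h, offset = qubo_to_ising(Q)
w = rng.integers(0, 2, size=5)
s = 2 * w - 1
assert np.isclose(w @ Q @ w, s @ J @ s + h @ s + offset)
```

The offset does not affect the minimizer, so only (h, J) need to be passed to the Ising solver.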
2.2 Quantum annealing
In 2000, Farhi et al. [FGGS00] proposed a model of quantum computation based on adiabatic quantum evolution. The basic idea relies on the transformation of an initial problem (that is easy to solve) into the problem of interest (which may be difficult to solve). If the transformation is sufficiently slow then we are guaranteed to find the solution of the problem of interest with high probability. In the present context we are interested in solving an optimization problem (specifically ISING), and the method in this case is known as quantum annealing. As the D-Wave hardware relies on quantum annealing, we outline the basics of quantum annealing when applied to ISING.
Quantum mechanics requires the representation of spin states s = ±1 as orthogonal vectors in a two-dimensional vector space. We indicate the two possibilities by |+⟩ = (1, 0)^T and |−⟩ = (0, 1)^T. The extension to vectors is necessary as linear combinations of these state vectors exist as real descriptions of nature. An example combination state, called a superposition, is α_− |−⟩ + α_+ |+⟩ = (α_+, α_−)^T. The state can always be normalized so that |α_−|^2 + |α_+|^2 = 1.[2] This quantum description of a bit is called a qubit. The amplitudes α_− and α_+ of the superposition state are given meaning through measurement of the qubit. When measured, the qubit is seen as −1 with probability |α_−|^2, and seen as +1 with probability |α_+|^2. Larger quantum systems are constructed through tensor products of the individual qubit vector spaces, so that for example |−⟩ ⊗ |+⟩ = (0, 0, 1, 0)^T represents the 2-spin state s_1 = −1, s_2 = +1. Superpositions of N-qubit states are also possible, with the associated amplitudes representing the probability of observing the respective N-spin state, e.g. |α_{−,+}|^2 is the probability of measuring s_1 = −1 and s_2 = +1.
To make contact with ISING, define the single-qubit operators

    I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad \text{and} \qquad \sigma^z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.

With this definition σ^z |−⟩ = −|−⟩ and σ^z |+⟩ = +|+⟩, so that the operator σ^z leaves the qubit state unchanged (σ^z is diagonal), and simply multiplies the state by its classical spin value, either −1 or +1. Similarly, on a 2-qubit state the operator σ^z ⊗ I extracts the classical spin of the first qubit, I ⊗ σ^z extracts the classical spin of the second qubit, and σ^z ⊗ σ^z extracts the product s_1 s_2 of the two classical spin variables.
With this notation the ISING objective on N spins is represented as a 2^N x 2^N diagonal operator (called the Hamiltonian)

    H_I = \sum_{i,j} J_{i,j} \, \sigma^z_i \sigma^z_j + \sum_i h_i \, \sigma^z_i ,

where σ^z_i is shorthand for the operator σ^z acting on qubit i alone,[3] and σ^z_i σ^z_j is shorthand for the operator giving the product of spins on qubits i and j. H_I acting on the classical state |s⟩ = |s_1 ··· s_N⟩ gives H_I |s⟩ = E(s) |s⟩, where E(s) = s^T J s + h^T s. The state with the lowest objective value corresponds to the eigenvector of H_I having the lowest eigenvalue.

The quantum effects in the D-Wave hardware manifest themselves with the addition of another single-qubit operator. This operator is given by

    \sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},

and acts to flip the qubit state, i.e. σ^x (α_+ |+⟩ + α_− |−⟩) = α_+ |−⟩ + α_− |+⟩. Note that unlike the σ^z term, this operator is off-diagonal and alters the quantum state when applied. The eigenstates of this operator are (|+⟩ + |−⟩)/√2 (eigenvalue +1) and (|+⟩ − |−⟩)/√2 (eigenvalue −1), with both giving equal probability to a measurement yielding either s = +1 or s = −1. The many-qubit version of this operator acts upon each qubit individually as H_0 = \sum_i \sigma^x_i. The lowest eigenstate of H_0 is ⊗_i (|+⟩_i − |−⟩_i)/√2, which gives all 2^N states the same amplitude. This is a uniform superposition making all joint spin states equally likely to be observed.

[2] Most generally α_− and α_+ are complex valued. However, in what follows α_− and α_+ may be taken to be real.
[3] Explicitly, σ^z_i = I ⊗ ··· ⊗ I (i−1 factors) ⊗ σ^z ⊗ I ⊗ ··· ⊗ I (N−i factors).
The quantum description of the D-Wave hardware is well approximated at low energies by the Hamiltonian operator H = Δ H_0 + H_I, where all parameters Δ, h, and J can be user defined and adjusted dynamically. Given sufficient time, dissipative quantum evolution at zero temperature evolves an initial quantum state into a state in the subspace corresponding to the lowest energy eigenvalue of H. However, just as classical systems may become trapped in metastable higher-energy states, the quantum dynamics may spend long periods of time at higher energy states. Quantum annealing is a technique to ameliorate this problem. Consider a convex combination of the Ising and H_0 portions of the Hamiltonian:

    H(\lambda) = (1 - \lambda) H_0 + \lambda H_I \qquad \text{for } 0 \le \lambda \le 1.

When λ = 0 the lowest energy state is easily discovered and gives all classical configurations equal probability. λ = 1 corresponds to the ISING problem that we want to solve. It is natural then to evolve from the initial λ = 0 lowest energy state to the λ = 1 minimal Ising state. Just as simulated annealing gradually eliminates the entropic exploration effects at higher temperatures to focus on low energy configurations at low temperatures, quantum annealing gradually reduces the exploration effects of the quantum fluctuations that dominate at small λ to focus on the classical Ising problem at λ = 1. The adiabatic theorem indicates how rapidly the quantum term can be annealed to maintain high probability of locating the globally lowest classical configuration. The difference between thermal annealing (simulated annealing) and quantum annealing can be significant. There are problems where quantum annealing conveys no speedups [Rei04, FGGN05], and others where quantum annealing is exponentially faster than simulated annealing [FGG02]. A more complete theoretical characterization of QA awaits.

Previous studies of quantum annealing in machine learning [KTM09, SKT+09] have necessarily relied on classical simulations of QA. While such simulations are more complex than simulated annealing, better minima are often obtained. In the experiments reported here no simulation is necessary; quantum annealing is performed by dynamically adjusting λ in the D-Wave hardware. We rely on the fast physical timescales within the D-Wave chip and the parallel dynamics of all qubits to realize QA natively.
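As a brute-force illustration of the interpolation (not a model of how the hardware operates), one can diagonalize H(λ) for a toy instance; the 3-spin couplings below are made-up numbers, and dense 2^N x 2^N matrices are of course feasible only for tiny N.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
sx = np.array([[0., 1.], [1., 0.]])
sz = np.array([[1., 0.], [0., -1.]])

def op_on(op, i, n):
    """Kronecker product placing the single-qubit operator `op` on qubit i of n."""
    return reduce(np.kron, [op if k == i else I2 for k in range(n)])

# Toy 3-spin Ising instance (hypothetical numbers chosen only for illustration).
n = 3
J = np.array([[0., 1.0, -0.5],
              [0., 0.,  0.7],
              [0., 0.,  0. ]])
h = np.array([0.2, -0.4, 0.1])

H0 = sum(op_on(sx, i, n) for i in range(n))
HI = sum(J[i, j] * op_on(sz, i, n) @ op_on(sz, j, n)
         for i in range(n) for j in range(i + 1, n))
HI += sum(h[i] * op_on(sz, i, n) for i in range(n))

for lam in (0.0, 0.5, 1.0):
    H = (1 - lam) * H0 + lam * HI
    ground_energy = np.linalg.eigvalsh(H)[0]
    print(f"lambda = {lam:.1f}: ground-state energy = {ground_energy:+.3f}")

# At lambda = 1 the ground-state energy matches the classical Ising minimum:
spins = np.array(np.meshgrid(*[[-1, 1]] * n)).T.reshape(-1, n)
print("brute-force Ising minimum:", min(s @ J @ s + h @ s for s in spins))
```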
2.3 Hardware constraints
The idealized sketch of quantum annealing presented above omits complications arising from hardware implementation. We mention the most important of these complications that affect both the types of Ising problems that can be solved and the efficacy of hardware optimization on early-generation hardware.
Figure 1: Connectivity between the 128 qubits in the current D-Wave chip. The pattern of connectivity used is driven by hardware design considerations.
Qubit connectivity To represent an arbitrary Ising problem on N variables requires complete connectivity between qubits. Physically realizing all N(N-1)/2 connections is difficult, and the D-Wave hardware is architected with a sparser pattern of connectivity. The current chip on 128 qubits utilizes the connectivity given in Fig. 1. The classification algorithm we develop accounts for this particular hardware connectivity.
Non-zero temperature While the quantum hardware operates at a rather frigid 20-40 mK, this temperature is significant when compared to the relevant energy scales of the chip. In units where the largest h_i or J_{i,j} magnitude is 1, the temperature is T ≈ 1/4 to 1/3. Temperature effects are well approximated by a Boltzmann distribution, so that states with energy E above the minimum are seen with probability approximately proportional to exp(−E/T). For example, at T ≈ 1/3 a state one unit above the ground state is suppressed only by a factor of exp(−3) ≈ 0.05. Thus, thermally excited states will be seen in many sample runs on the hardware.
Parameter variability Early generations of the hardware are unable to provide highly precise parameter values. Instead, when a particular h_i or J_{i,j} is requested, what is realized in hardware is a sample from a Gaussian centered on the desired value. The standard deviation is large enough that the mis-assigned values can generate candidate solutions of higher energy (according to the correct parameter values).

To mitigate both the temperature and precision difficulties we solve the same ISING(h, J) problem multiple times, and select the lowest energy result by evaluating each returned configuration using the desired problem parameters. Currently, obtaining m candidate minima on the hardware takes time t = 900 + 100m ms. In Section 4 we provide a characterization of the optimization efficacy of this hardware optimization procedure.
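A minimal sketch of this best-of-m post-selection, assuming the hardware returns an array of m spin configurations (the function name and array layout are ours):

```python
import numpy as np

def best_of_m(samples, h, J):
    """Re-evaluate each returned configuration under the intended (h, J) and keep
    the lowest-energy one, mitigating thermal excitations and parameter noise.
    samples: (m, N) array of spins in {-1, +1}."""
    energies = np.einsum('mi,ij,mj->m', samples, J, samples) + samples @ h
    best = int(np.argmin(energies))
    return samples[best], float(energies[best])
```

With the quoted timing t = 900 + 100m ms, m = 32 samples cost roughly 4.1 s per Ising problem.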
3 QBoost
3.1 Baseline System
We want to build a classifier which exploits the ability to effectively solve QUBO. The basic idea is straightforward: given a dictionary of weak classifiers, boost these into a strong classifier by selecting a small subset which vote on the class label. The solution of a sequence of QUBOs identifies the best performing subset of weak classifiers. The resultant algorithm is called QBoost to bring to mind the quantum-inspired QUBO-based boosting procedure.
In [NDRM08] we studied a binary classifier of the form

    y = H(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} w_i h_i(x) \Big) \equiv \operatorname{sign}\big( w^\top h(x) \big)        (1)

where x ∈ R^M are the input patterns to be classified, y ∈ {−1, +1} is the output of the classifier, the h_i : R^M → {−1, +1} are so-called weak classifiers or feature detectors, and the w_i ∈ {0, 1} are a set of weights to be optimized during training. H(x) is known as a strong classifier.
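In code, evaluating Eq. (1) is a one-liner; the tie-breaking at a zero vote sum below is our own convention, since the paper does not specify one.

```python
import numpy as np

def strong_classify(w, h_x):
    """Eq. (1): h_x is the vector of weak-classifier outputs h(x) in {-1,+1}^N,
    w the binary weight vector found during training."""
    return 1 if np.dot(w, h_x) >= 0 else -1   # ties broken toward +1 (our choice)
```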
Training, the process of choosing w, proceeds by minimizing two terms. The loss term L(w) measures the error over a set of S training examples {(x_s, y_s) | s = 1, ..., S}. To give a QUBO objective we use the quadratic loss. The regularization term R(w) ensures that the classifier does not become too complex, and mitigates over-fitting. We employ a regularization term based on the L0-norm, R(w) = ‖w‖_0, which for Boolean-valued variables is simply R(w) = Σ_i w_i. This term encourages the strong classifier to be built with as few weak classifiers as possible while maintaining a low training error. Thus, training is accomplished by solving the following discrete optimization problem:
    \text{Baseline:} \quad w^* = \operatorname{argmin}_w \Big[ \underbrace{\sum_{s=1}^{S} \big( \tfrac{1}{N} w^\top h(x_s) - y_s \big)^2}_{L(w)} \; + \; \underbrace{\lambda \|w\|_0}_{\lambda R(w)} \Big]
                              = \operatorname{argmin}_w \Big[ w^\top \Big( \tfrac{1}{N^2} \sum_{s=1}^{S} h(x_s) h^\top(x_s) \Big) w \; + \; w^\top \Big( \lambda \mathbf{1} - \tfrac{2}{N} \sum_{s=1}^{S} h(x_s) y_s \Big) \Big]        (2)

The quadratic term is proportional to the empirical second moment E_{P_0(x,y)}[h(x) h^\top(x)] and the linear term to E_{P_0(x,y)}[h(x) y], where P_0 denotes the empirical distribution over (x, y); the parameters of this QUBO therefore depend on correlations between the components of h and with the output signal y. Note that the weights w are binary and not positive real numbers as in AdaBoost. Even though discrete optimization could be applied to any bit depth representing the weights, we found that a small bit depth is often sufficient [NDRM08]. Here we deal with the simplest case in which the weights are chosen to be binary.
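The QUBO matrix of Eq. (2) can be assembled directly from the weak-classifier outputs; in the sketch below (our own helper, with illustrative names Hmat, y, lam), linear terms are folded onto the diagonal since w_i^2 = w_i for binary weights.

```python
import numpy as np

def baseline_qubo(Hmat, y, lam):
    """QUBO for the baseline objective of Eq. (2), up to the additive constant S.
    Hmat: (S, N) array of weak-classifier outputs h_i(x_s) in {-1, +1}.
    y:    (S,)  array of labels in {-1, +1}.
    lam:  regularization strength lambda."""
    S, N = Hmat.shape
    Q = (Hmat.T @ Hmat) / N**2                               # correlation (quadratic) terms
    Q[np.diag_indices(N)] += lam - 2.0 * (Hmat.T @ y) / N    # linear terms on the diagonal
    return Q
```

Any QUBO solver (the hardware, tabu search, or an exact solver for small N) can then be applied to the returned matrix; together with a QUBO-to-Ising conversion like the one sketched in Section 2.1, this yields the (h, J) sent to the chip.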
3.2 Comparison of the baseline algorithm to AdaBoost
In the case of a finite dictionary of weak classifiers h(x), AdaBoost can be seen as a greedy algorithm that minimizes the exponential loss [Zha04],

    \alpha^{\mathrm{opt}} = \operatorname{argmin}_{\alpha} \Big[ \sum_{s=1}^{S} \exp\big( -y_s \, \alpha^\top h(x_s) \big) \Big],        (3)

with α_i ∈ R^+. There are two differences between the objective of our algorithm (Eq. (2)) and the one employed by AdaBoost. The first is that we added L0-norm regularization. Second, we employ a quadratic loss function, while AdaBoost works with the exponential loss.
It can easily be shown that including L0-norm regularization in the objective in Eq. (2) leads to improved generalization error as compared to using the quadratic loss only. The argument is as follows. An upper bound for the Vapnik-Chervonenkis dimension of a strong classifier H of the form H(x) = \sum_{t=1}^{T} h_t(x) is given by

    \mathrm{VC}_H = 2\,(\mathrm{VC}_{\{h_t\}} + 1)(T + 1)\,\log_2\big( e (T + 1) \big),        (4)

where VC_{h_t} is the VC dimension of the dictionary of weak classifiers [FS95]. The strong classifier's generalization error Error_test therefore has an upper bound, holding with probability at least 1 − δ, given by [VC71]

    \mathrm{Error}_{\mathrm{test}} \le \mathrm{Error}_{\mathrm{train}} + \sqrt{ \frac{ \mathrm{VC}_H \, \ln(1 + 2S/\mathrm{VC}_H) + \ln(9/\delta) }{ S } }.        (5)
It is apparent that a more compact strong classifier, one that achieves a given training error Error_train with a smaller number T of weak classifiers (hence with a smaller VC dimension VC_H), comes with a guarantee for a lower generalization error. Looking at the optimization problem in Eq. (2), one can see that if the regularization strength is chosen weak enough, i.e. λ < 2/N + 1/N^2, then the effect of the regularization is merely to thin out the strong classifier. One arrives at the condition for λ by demanding that the reduction of the regularization term R(w) that can be obtained by switching one w_i to zero is smaller than the smallest associated increase in the loss term L(w) that comes from incorrectly labeling a training example. This condition guarantees that weak classifiers are not eliminated at the expense of a higher training error. Therefore the regularization will only keep a minimal set of components, those which are needed to achieve the minimal training error that was obtained when using the loss term only. In this regime, the VC bound of the resulting strong classifier is lower than or equal to the VC bound of a classifier trained with no regularization.
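One way to arrive at the quoted threshold (a sketch under the assumptions that weak classifiers output ±1 and weights are binary): switching a single w_i from 1 to 0 changes the vote sum w^T h(x_s)/N by 1/N, and the smallest loss increase that corresponds to newly misclassifying an example occurs for an example sitting exactly at the decision boundary.

```latex
% Removing one weak classifier lowers the regularization term \lambda R(w) by \lambda.
% For a boundary example with w^\top h(x_s)/N = 0 and y_s = +1, the quadratic loss
% changes from (0 - 1)^2 = 1 to (-1/N - 1)^2, so the smallest associated increase is
\Delta L \;=\; \Big(1 + \tfrac{1}{N}\Big)^{2} - 1 \;=\; \frac{2}{N} + \frac{1}{N^{2}} ,
% and demanding \lambda < \Delta L recovers the condition \lambda < 2/N + 1/N^2.
```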
AdaBoost contains no explicit regularization term, and it can easily happen that the classifier uses a richer set of weak classifiers than needed to achieve the minimal training error, which in turn leads to degraded generalization. Fig. 2 illustrates this fact. The regions outside the positive cluster bounding box of the final AdaBoost-trained classifier occur in areas in which the training set does not contain negative examples. This problem becomes more severe for higher dimensional data. The optimal configuration is not found despite the fact that the weak classifiers necessary to construct the ideal bounding box are generated. In fact AdaBoost fails to learn higher dimensional versions of this problem altogether, with error rates approaching 50%. See section 6 for a discussion of how global optimization based learning can handle this data set.
Figure 2: AdaBoost applied to a simple classification task. (A) shows the data, a separable set consisting of a two-dimensional cluster of positive examples (blue) surrounded by negative ones (red). (B) shows the random division into training (saturated colors) and test data (light colors). The dictionary of weak classifiers is constructed of axis-parallel one-dimensional hyperplanes. (C) shows the optimal classifier for this situation, which employs four weak classifiers to partition the input space into positive and negative areas. The lower row shows partitions generated by AdaBoost after T = 10, 20, and 640 iterations. The T = 640 configuration does not change with further iterations.
In practice we do not operate in the weak regularization regime, but rather determine the regularization strength λ by using a validation set. We measure the performance of the classifier for different values of λ on a validation set and then choose the one with the minimal validation error. In this regime, the optimization performs a trade-off and accepts a higher empirical loss if the classifier can be kept more compact. In other words, it may choose to misclassify training examples if it can keep the classifier simpler. This leads to increased robustness in the case of noisy data, and indeed we observe the most significant gains over AdaBoost for noisy data sets where the Bayes error is high. The fact that boosting in its standard formulation with convex loss functions and no regularization is not robust against label noise has drawn attention recently [LS08, Fre09].
The second difference between AdaBoost and our baseline system, namely that we employ quadratic loss while AdaBoost works with exponential loss, is of smaller importance. In fact, the discussion above about the role of the regularization term would not have changed if we were to choose exponential rather than quadratic loss. The literature seems to agree that the use of exponential loss in AdaBoost is not essential and that other loss functions could be employed to yield classifiers with similar performance [FHT98, Wyn02]. From a statistical perspective, quadratic loss is satisfying since a classifier that minimizes the quadratic loss is consistent: with an increasing number of training samples it will asymptotically approach a Bayes classifier, i.e. the classifier with the smallest possible error [Zha04].
3.3 Large scale classifiers

The baseline system assumes a fixed dictionary containing a number of weak classifiers small enough that all weight variables can be considered in a single global optimization. This approach needs to be modified if the goal is to train a large scale classifier. Large scale here means that at least one of two conditions is fulfilled:

1. The dictionary contains more weak classifiers than can be considered in a single global optimization.

2. The final strong classifier consists of a number of weak classifiers that exceeds the number of variables that can be handled in a single global optimization.

Let us take a look at typical problem sizes. The state-of-the-art solver ILOG CPLEX can obtain good solutions for QUBOs of up to several hundred variables, depending on the coefficient matrix. The quantum hardware solvers manufactured by D-Wave currently can handle 128 variable problems. In order to train a strong classifier we often sift through millions of features. Moreover, dictionaries of weak learners are often dependent on a set of continuous parameters such as thresholds, which means that their cardinality is infinite. We estimate that typical classifiers employed in vision based products today use thousands of weak learners. Therefore it is not possible to determine all weights in a single global optimization, but rather it is necessary to break the problem into smaller chunks.

Let T designate the size of the final strong classifier and Q the number of variables that we can handle in a single optimization. Q may be determined by the number of available qubits, or, if we employ classical heuristic solvers such as ILOG CPLEX or tabu search [Pal04], then Q designates a problem size for which we can hope to obtain a solution of reasonable quality in reasonable time. We consider two cases: the first with T ≤ Q and the second with T > Q.
We first describe the inner loop algorithm used when the number of variables we can handle exceeds the number of weak learners needed to construct the strong classifier.
Algorithm 1 T ≤ Q (inner loop)
Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
1: Initialize the weight distribution d_inner over training samples as uniform: ∀s: d_inner(s) = 1/S
2: Set T_inner ← 0 and K ← ∅
3: repeat
4:   Select the set M of Q − T_inner weak classifiers h_i from the dictionary that have the smallest weighted training error rates
5:   for λ = λ_min to λ_max do
6:     Run the optimization w^*(λ) = argmin_w [ Σ_{s=1}^S ( Σ_{h_t ∈ K ∪ M} w_t h_t(x_s)/Q − y_s )^2 + λ‖w‖_0 ]
7:     Set T_inner(λ) ← ‖w^*(λ)‖_0
8:     Construct H(x; λ) ← sign( Σ_{t=1}^{T_inner(λ)} w^*_t(λ) h_t(x) )
9:     Measure the validation error Error_val(λ) of H(x; λ) on the unweighted validation set
10:  end for
11:  λ^* ← argmin_λ Error_val(λ)
12:  Update T_inner ← T_inner(λ^*) and K ← { h_t | w^*_t(λ^*) = 1 }
13:  Update weights d_inner(s) ← d_inner(s) ( Σ_{h_t ∈ K} h_t(x_s)/T_inner − y_s )^2
14:  Normalize d_inner(s) ← d_inner(s) / Σ_{s=1}^S d_inner(s)
15: until the validation error Error_val stops decreasing
This algorithm may be seen as an enrichment process. In the first round, the algorithm selects those T_inner weak classifiers out of a subset of Q that produce the optimal validation error. The subset of Q weak classifiers has been preselected from a dictionary with a cardinality possibly much larger than Q. In the next step the algorithm fills the Q − T_inner empty slots in the solver with the best weak classifiers drawn from a modified dictionary that was adapted by taking into account for which samples the strong classifier constructed in the first round is already good and where it still makes errors. This is the boosting idea. Under the assumption that the solver always finds the global minimum, it is guaranteed that for a given λ the solutions found in the subsequent round will have lower or equal objective value, i.e. they achieve a lower loss or they represent a more compact strong classifier. The fact that the algorithm always considers groups of Q weak classifiers simultaneously, rather than incrementing the strong classifier one by one, and then tries to find the smallest subset that still produces a low training error, allows it to find optimal configurations more efficiently.
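A minimal sketch of the λ sweep in steps 5-11 of Algorithm 1, assuming the Q candidate weak classifiers have been precomputed into output matrices on the training and validation sets; solve_qubo stands in for the hardware (or a software heuristic), and all names here are illustrative.

```python
import numpy as np

def sweep_lambda(H_tr, y_tr, H_val, y_val, lambdas, solve_qubo):
    """One lambda sweep of the inner loop.  H_tr (S x Q) and H_val hold weak-classifier
    outputs in {-1, +1}; solve_qubo(Q) returns a binary weight vector.  The QUBO is
    correct up to an additive constant that does not affect the argmin."""
    S, Qsize = H_tr.shape
    best_err, best_w, best_lam = np.inf, None, None
    for lam in lambdas:
        # QUBO of step 6: quadratic correlations plus linear terms on the diagonal
        Qmat = (H_tr.T @ H_tr) / Qsize**2
        Qmat[np.diag_indices(Qsize)] += lam - 2.0 * (H_tr.T @ y_tr) / Qsize
        w = solve_qubo(Qmat)
        val_err = np.mean(np.sign(H_val @ w) != y_val)   # step 9
        if val_err < best_err:
            best_err, best_w, best_lam = val_err, w, lam
    return best_w, best_lam, best_err
```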
If the validation error cannot be decreased any further using the inner loop, one may conclude that more weak classifiers are needed to construct the strong one. In this case the outer loop algorithm freezes the classifier obtained so far and adds another partial classifier trained again by the inner loop.
Algorithm 2 T > Q (outer loop)
Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
1: Initialize the weight distribution d_outer over training samples as uniform: ∀s: d_outer(s) = 1/S
2: Set T_outer ← 0 and c(x) ← 0
3: repeat
4:   Run Algorithm 1 with d_inner initialized from the current d_outer and using an objective function that takes into account the current c(x):
       w^* = argmin_w [ Σ_{s=1}^S ( ( c(x_s) + Σ_{i=1}^Q w_i h_i(x_s) ) / (T_outer + Q) − y_s )^2 + λ‖w‖_0 ]
5:   Set T_outer ← T_outer + ‖w^*‖_0 and c(x) ← c(x) + Σ_{i=1}^Q w^*_i h_i(x)
6:   Construct a strong classifier H(x) = sign( c(x) )
7:   Update weights d_outer(s) ← d_outer(s) ( Σ_{t=1}^{T_outer} h_t(x_s)/T_outer − y_s )^2
8:   Normalize d_outer(s) ← d_outer(s) / Σ_{s=1}^S d_outer(s)
9: until the validation error Error_val stops decreasing
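For the outer loop, only the objective in step 4 changes: the frozen partial classifier c(x) enters as a residual. A sketch of one outer-loop step (again with illustrative names, and solve_qubo standing in for the solver):

```python
import numpy as np

def outer_step(c_tr, H_tr, y_tr, T_outer, lam, solve_qubo):
    """One pass of Algorithm 2, steps 4-5: optimize the next Q binary weights while
    accounting for the frozen partial classifier.  c_tr[s] holds the accumulated sum of
    already-selected weak classifiers on training sample s; H_tr is (S, Q) in {-1,+1}."""
    S, Qsize = H_tr.shape
    scale = T_outer + Qsize
    resid = y_tr - c_tr / scale                      # part of the label left to explain
    Qmat = (H_tr.T @ H_tr) / scale**2
    Qmat[np.diag_indices(Qsize)] += lam - 2.0 * (H_tr.T @ resid) / scale
    w = solve_qubo(Qmat)
    return w, c_tr + H_tr @ w, T_outer + int(w.sum())
```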
3.4 Mapping to hardware
The QBoost algorithm circumvents the need to solve QUBOs with many variables, and thus is suitable for execution on D-Wave hardware. However, one important obstacle remains before the learning algorithm can be executed on hardware. As described, QBoost requires complete coupling between all pairs of qubits (optimization variables). The coupling between qubits i and j representing variables w_i and w_j is proportional to E_{P_0(x)}[ h_i(x) h_j(x) ], which typically will not be zero. Thus, when realized in hardware with sparse connectivity, many connections must be discarded. How can this be done in a way which minimizes the information lost?
Let G = (V, E) indicate the graph of connectivity between qubits. As the QUBOs are translated to Ising problems, we embed an Ising objective of N variables into the qubit connectivity graph (with |V| ≥ N) with a mapping φ : {1, ..., N} → V such that (φ(i), φ(j)) ∈ E whenever J_{i,j} ≠ 0. The mapping can be encoded with a set of binary variables φ_{i,q} which indicate that problem variable i is mapped to qubit node q. For a valid mapping we require Σ_q φ_{i,q} = 1 for all problem variables i, and Σ_i φ_{i,q} ≤ 1 for all qubit nodes q. To minimize the information lost we maximize the total magnitude of the J_{i,j} mapped to qubit edges, i.e. we seek

    \varphi^* = \operatorname{argmax}_{\varphi} \sum_{i<j} \; \sum_{(q,q') \in E} |J_{i,j}| \, \varphi_{i,q} \, \varphi_{j,q'} ,

where φ is subject to the mapping constraints.[4] This problem is a variant of the NP-hard quadratic assignment problem. Since this embedding problem must be solved with each invocation of the quantum hardware, we seek a fast O(N) heuristic to approximately maximize the objective.
As a starting point, let i_1 = argmax_i [ Σ_{j<i} |J_{j,i}| + Σ_{j>i} |J_{i,j}| ], so that i_1 is the row/column index of J with the highest sum of coupling magnitudes (J is upper-triangular). We map i_1 to one of the qubit vertices of highest degree. For the remaining variables, assume that we have defined φ on {i_1, ..., i_k} such that φ(i_j) = q_j ∈ V. Then assign φ(i_{k+1}) = q_{k+1}, where i_{k+1} ∉ {i_1, ..., i_k} and q_{k+1} ∉ {q_1, ..., q_k}, to maximize the sum of all |J_{i_{k+1},i_j}| and |J_{i_j,i_{k+1}}| over all j ∈ {1, ..., k} for which (q_j, q_{k+1}) ∈ E. This greedy heuristic performs well. As an example, the greedy heuristic manages to map about 11% of the total absolute edge weight Σ_{i,j} |J_{i,j}| of a fully connected random Ising model into the hardware connectivity of Fig. 1 in a few milliseconds. For comparison, a tabu heuristic designed for this matching quadratic assignment problem performs marginally better, attaining about 11.6% total edge weight after running for 5 minutes.
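A sketch of the greedy mapping (our own, unoptimized rendering: the pair search below is quadratic per step, whereas the paper's heuristic is engineered to run in O(N); adjacency is assumed to map each qubit to the set of qubits it is coupled to):

```python
import numpy as np

def greedy_embed(J, adjacency):
    """Greedy heuristic sketch for mapping Ising variables to hardware qubits.
    J: (N, N) upper-triangular coupling matrix.  adjacency: dict qubit -> set of
    neighbouring qubits.  Returns phi: variable -> qubit."""
    N = J.shape[0]
    A = np.abs(J) + np.abs(J).T                   # symmetrized coupling magnitudes
    unplaced = set(range(N))
    free_qubits = set(adjacency)

    # seed: the heaviest variable goes onto a highest-degree qubit
    i0 = int(np.argmax(A.sum(axis=1)))
    q0 = max(free_qubits, key=lambda q: len(adjacency[q]))
    phi = {i0: q0}
    unplaced.discard(i0)
    free_qubits.discard(q0)

    while unplaced and free_qubits:
        # choose the (variable, qubit) pair preserving the most coupling
        # magnitude to already-placed variables
        best, best_gain = None, -1.0
        for q in free_qubits:
            placed_neighbours = [i for i, qi in phi.items() if qi in adjacency[q]]
            for i in unplaced:
                gain = sum(A[i, j] for j in placed_neighbours)
                if gain > best_gain:
                    best, best_gain = (i, q), gain
        i, q = best
        phi[i] = q
        unplaced.discard(i)
        free_qubits.discard(q)
    return phi
```

Couplings between variables that do not end up on adjacent qubits are simply dropped; that loss is what the 11% figure above quantifies.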
4 Experimental results
We test QBoost by developing a detector for cars in digital images. The training and test sets consist of 20 000 images, with roughly half the images in each set containing cars and the other half containing city streets and buildings without cars. The images containing cars are human-labeled ground truth data with tight bounding boxes drawn around each car and grouped by position (e.g. front, back, side, etc.). For demonstration purposes, we trained a single detector channel using only the side-view ground truth images. The actual data seen by the training system is obtained by randomly sampling subregions of the input training and test data to obtain 100 000 patches for each data set. Before presenting results on the performance of the strong classifier we characterize the optimization performance of the hardware.
4.1 Optimization efficacy

The hardware used in the experiment is an early generation Chimera chip in which 52 of the 128 qubits were functioning. The connectivity of the functioning qubits is shown in Fig. 3. All Ising problems defined on even the fully functional 128 qubit array can be solved exactly using dynamic programming, as the treewidth of the fully functional 128 qubit graph is 16. Thus, we can easily obtain global minima for problems defined on the functioning 52 qubit subset. Fig. 4(a) shows the results on a typical problem generated at late stages of QBoost. We see that although there is certainly room for improvement, good results can be obtained with as few as m = 32 samples. Fig. 4(b) shows the distribution of the hardware-obtained energy difference from the global minimum in units of the standard deviation of the Ising objective value across all configurations. We see clear improvement with increasing numbers of samples m. For the runs of QBoost we used m = 32 and found satisfactory performance. Larger values of m were tried but the performance gains did not warrant the time cost.
[4] The linear terms of the original ISING problem pose no problem, and can be mapped without loss into the hardware.
Figure 3: Functioning qubits used in the car detector training. Compare with Fig. 1.
Figure 4: (a) Comparison of the hardware solution (obtained using m = 32 samples) to the known globally optimal result for an inner loop optimization across a set of regularization weights λ. We plot the objective value after the problem is translated to Ising format. Differences of objective values are in units of the standard deviation of objective values across all configurations. (b) Histogram of differences of hardware objective values and the globally optimal result (in units of the problem's standard deviation) across a random sampling of 400 Ising problems taken at different stages of QBoost, and for randomly selected λ. Results are shown for m = 32, 64, and 128 samples.
Figure 5: Error rates and sparsity levels for Tabu search as compared to the quantum hardware solver. The training consisted of 24 cascaded stages that employ three different feature types. Each stage is trained on the false negatives of the previous stage as its positive samples, complemented with true negative samples unseen so far [VJ04]. The error rates of the hardware solver are superior to the ones obtained by Tabu search across all stages. Please note that the error rates shown are per-pixel errors. The cumulative number of weak classifiers employed at a certain stage does not suggest a clear advantage for either method.
4.2 Classifier results

We use three types of weak classifiers. Each weak classifier takes a rectangular subregion of the scanned image patch as input, and outputs a continuous value encoding the log likelihood ratio of the positive and negative classes. The three types of weak classifiers work similarly in the sense that they all divide the input rectangular subregion into grid cells, map multiple raw grid cell values linearly into a single scalar response, quantize this scalar value into multiple bins, and output the confidence score associated with that bin. They differ only in grid sizes and mapping weights, leading to different complexity and discrimination power. We group weak classifiers into multiple stages. New bootstrap negative samplers are created at the end of each stage. For speed considerations, more complex yet more powerful weak classifiers are only considered in later stages.
We compare three different variants of the baseline algorithm, all of which use a window of size Q = 52. The first variant allows for complete connectivity between all 52 optimization variables, and is solved using a tabu heuristic designed for QUBOs [Pal04]. The two remaining variants utilize the subgraph of K_52 given in Fig. 3. The only difference between these two variants is that one solves all QUBOs using an exact software solver while the other solves them approximately using the quantum hardware. The exact solver performs equivalently to the hardware version if additive zero-mean Gaussian noise is added to the Ising parameters before solving. All three variants are compared to a highly tuned state-of-the-art system that blends a variety of boosting techniques on the same dictionary of weak learners [SS98, VPZ06].
Starting from the highly tuned state-of-the-art system developed for Google's MapReduce infrastructure, we replaced its feature-selection routine by QBoost. For each stage, the original system calls the feature selection routine, requesting a fixed number T of features for that stage. The numbers of features for different stages and different cascades are determined by experience, and generally involve smaller numbers for earlier stages and larger numbers for later stages in order to achieve an optimal tradeoff between accuracy and processing time during evaluation of the final detector.
Consequently, QBoost is invoked with a request for selecting T features out of the dictionary (of size up to several million) of the current cascade. We keep the window size Q fixed at 52, which is the maximum problem size that the current quantum hardware can take. Since T can be both smaller and larger than 52, early stages utilize only the inner loop of QBoost, whereas later stages take advantage of the outer loop as well. However, regardless of the particular value of T for a given stage, we obtain the top 52 features using the current weight distribution on the training data and deliver to the quantum hardware an optimization problem of that size. In order to ensure our ability to make a feature selection of size at most T when T < 52, we send to the hardware multiple optimization problems with varying regularization strengths. After that, we use cross-validation to evaluate all of the returned solutions of size at most T and select the one that gives the lowest validation error. One of the advantages of QBoost is that it could very well be the case that the best validation error is obtained for a solution of size smaller than T, thereby achieving better sparsity in the final detector.

QBoost iterates in this manner until it determines that it cannot lower the validation error any further, at which point it returns the selected features for a single stage to the original system. The original system then finalizes the current stage with them and proceeds to the next one.
We found that training using the quantum hardware led to a classifier that produces useful classification accuracy, as visitors to our demonstration stand will be able to experience first hand. More importantly, as shown in more detail in Fig. 5, the accuracy obtained by the hardware solver is better than the accuracy of the classifier trained with Tabu search. However, the classifier obtained with the hardware solver, even though it is sparser by about 10%, has a false negative rate approximately 10% higher than the boosting framework we employed as a point of departure. There are several possible reasons for this. One possibility is that the weak learners used here have 16 different output values as opposed to just two as in our previous studies [NDRM08, NDRM09]. This could imply that we need more than a single bit to represent the weights. Another reason may be the drastically changed ratio of Q relative to the number of weak learners, as here we employ several million weak classifiers.
5 Discussion and future work
Building on earlier work [NDRM08] we continued our exploration of QUBO-based optimization as a tool to train binary classifiers. The present focus was on developing algorithms that run on an existing hardware QUBO solver developed at D-Wave. The proposed QBoost algorithm accommodates the two most important constraints imposed by the hardware:

- QUBOs with a great many variables cannot be directly addressed in hardware; consequently, large scale QUBOs must be decomposed into a sequence of smaller QUBOs.

- The hardware only allows for sparse but known connectivity between optimization variables (qubits); thus each problem must discard some interactions before solving.

The first constraint is overcome by using a boosting-inspired greedy algorithm which grows the set of voting classifiers. The second constraint is solved by dynamically mapping QUBOs into the fixed connectivity by minimizing the loss of pairwise interactions. The resultant algorithm was found to generate a strong classifier that, while not as accurate as a highly-tuned AdaBoost-based classifier, did perform well with fewer weak classifiers.
Being but the first run on hardware QUBO solvers, there are a great many questions that are unanswered, and control experiments still need to be performed. We highlight a number of important issues to explore.

The classifier trained using a tabu heuristic on the fully connected graph of all pairwise interactions performed noticeably worse than both the boosting benchmark and the variants that threw away edges in the mapping down to hardware connectivity. Is this poor performance the result of poor optimization of the fully connected 52 variable problem, or does inclusion of all pairwise interactions actually lead the algorithm away from classifiers with the lowest generalization error? If we were able to exactly solve the 52 variable QUBOs, would performance have been better or worse than on the solvable sparsely connected problems? On a related note, there is good reason to suspect that even with exact optimization, the fully connected problem arising from quadratic loss may not be the best choice. Our choice of quadratic loss is pragmatic: quadratic loss leads directly to a QUBO. However, zero-one loss or a margin-based loss may be much more effective. These other loss functions can be represented as QUBOs, but require the introduction of additional variables to represent higher-than-quadratic interactions.

The embedding of the fully connected quadratic objective into the sparsely connected hardware throws away a great many interactions. The current scheme, which seeks to minimize the magnitude
of discarded interactions, is but one choice. Are there other embedding criteria that can be greedily optimized, and what is the effect of these alternate embeddings on generalization error?

This first experiment provides no insight into the importance of the size Q of the QUBOs that can be solved. Certainly as Q increases we expect that we can better approximate the globally best sparse voting subset from the large dictionary. How rapidly does the identification of this optimal subset improve as Q increases? Does a closer approximation to the regularized sparse set translate into lower generalization error? Initial insights into these questions are easily obtained by experiments with differing Q. Improved versions of the current hardware will allow for explorations up to Q = 128.

Lastly, previous experiments on other data sets using software heuristics for QUBO solving have shown that performance beyond AdaBoost can typically be obtained. Presumably the improvements are due to the explicit regularization that QBoost employs. We would hope that QBoost can be made to outperform even the highly tuned boosting benchmark we have employed here.

Finally, we mention that the experiments presented here were not designed to test the quantumness of the hardware. Results of such tests will be reported elsewhere.
Acknowledgments
We would like to thank Johannes Steffens for his contributions to the baseline detector and Bo Wu
for preparing the training data.
References
[FGG02] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. Quantum adiabatic evolution algorithms versus simulated annealing. 2002. Available at http://arxiv.org/abs/quant-ph/0201031.

[FGGN05] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, and Daniel Nagaj. How to make the quantum adiabatic algorithm fail. 2005. Available at http://arxiv.org/abs/quant-ph/0512159.

[FGGS00] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, and Michael Sipser. Quantum computation by adiabatic evolution. 2000. arXiv: quant-ph/0001106v1.

[FHT98] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

[Fre09] Yoav Freund. A more robust boosting algorithm. 2009. arXiv: stat.ML/0905.2138v1.

[FS95] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. AT&T Bell Laboratories Technical Report, 1995.

[KTM09] K. Kurihara, S. Tanaka, and S. Miyashita. Quantum annealing for clustering. In The 25th Conference on Uncertainty in Artificial Intelligence, 2009. Available at http://www.cs.mcgill.ca/~uai2009/papers/UAI2009_0019_71a78b4a22a4d622ab48f2e556359e6c.pdf.

[LS08] Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. In 25th International Conference on Machine Learning (ICML), 2008.

[NDRM08] Hartmut Neven, Vasil S. Denchev, Geordie Rose, and William G. Macready. Training a binary classifier with the quantum adiabatic algorithm. 2008. arXiv: quant-ph/0811.0416v1.

[NDRM09] H. Neven, V. Denchev, G. Rose, and W. Macready. Training a large scale classifier with the quantum adiabatic algorithm. 2009. Available at http://arxiv.org/abs/0912.0779.

[Pal04] Gintaras Palubeckis. Multistart tabu search strategies for the unconstrained binary quadratic optimization problem. Ann. Oper. Res., 131:259-282, 2004.

[Rei04] Ben Reichardt. The quantum adiabatic optimization algorithm and local minima. In Annual ACM Symposium on Theory of Computing, 2004.

[SKT+09] I. Sato, K. Kurihara, S. Tanaka, S. Miyashita, and H. Nakagawa. Quantum annealing for variational Bayes inference. In The 25th Conference on Uncertainty in Artificial Intelligence, 2009. Available at http://www.cs.mcgill.ca/~uai2009/papers/UAI2009_0061_3cf8f564c6b5b6a22bd783669917c4c6.pdf.

[SS98] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In COLT, pages 80-91, 1998.

[VC71] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

[VJ04] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.

[VPZ06] Paul Viola, John Platt, and Cha Zhang. Multiple instance boosting for object detection. In Yair Weiss, Bernhard Schölkopf, and John C. Platt, editors, Advances in Neural Information Processing Systems, volume 18, pages 1417-1424. MIT Press, 2006.

[Wyn02] Abraham J. Wyner. Boosting and the exponential loss. In Proceedings of the Ninth Annual Conference on AI and Statistics, 2002.

[Zha04] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56-85, 2004.