A Fast Machine Learning Algorithm for the MaxCut Problem

Allison Jin
Stuyvesant High School
New York, U.S.

Xiao-Yang Liu
School of Engineering, Columbia University
New York, U.S.
xl2427@columbia.edu

Abstract—The graph maxcut (MaxCut) problem is NP-hard. There are many local optima, and it is computationally difficult to find a high-quality solution. In this paper, we propose a fast machine learning algorithm inspired by the variational principle. First, we formulate the MaxCut problem as an Ising system. Second, we propose a novel learn-to-transition (L2T) algorithm that uses the Hamiltonian of this Ising system as the loss function to train an RNN-style neural network, approximating the transition distribution to the ground state. Finally, we evaluate our algorithm on well-known datasets and obtain good performance. Compared to the best-known solution for each graph instance, our L2T algorithm has a gap from −1.14% to −3.50%. For graphs of large size, our algorithm compares favorably with Gurobi [11], a state-of-the-art mathematical programming solver for optimization problems.

Index Terms—Graph-based combinatorial optimization, NP-hard problem, recurrent neural network, LSTM
I. INTRODUCTION

Given an undirected graph G = (V, E), the unweighted max-cut problem seeks to determine a bipartition of the nodes V into two sets V1 and V2 such that the number of edges across the two sets, also known as the cut-size, is maximized. For example, Fig. 1 shows a max-cut of a simple graph.

[Fig. 1. Example: A graph with six nodes and eight edges. The dotted line shows a maxcut of this graph: the filled nodes belong in set V1, the empty nodes belong in set V2. The maxcut size is 6. Note there are other possible maxcuts.]
Equivalently, if we associate each node v_i with a binary value σ_i ∈ {−1, +1} and assign the value +1 to the nodes in set V1 and −1 to those in set V2, then the cut-size of the graph, Cut(G), can be written as

    Cut(G) = \frac{1}{2} \sum_{(i,j) \in E} (1 - \sigma_i \sigma_j).
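To make the cut-size formula concrete, the short Python sketch below evaluates Cut(G) for a spin assignment given an edge list; the graph and assignment are small made-up examples, not the instance of Fig. 1.

```python
# Minimal sketch: cut size of a graph under a +/-1 node assignment, using
# Cut(G) = 1/2 * sum_{(i,j) in E} (1 - sigma_i * sigma_j).
# The edge list and assignment below are made-up illustrations.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
sigma = [+1, -1, -1, +1, -1, +1]          # sigma_i in {-1, +1}, one per node

cut = sum((1 - sigma[i] * sigma[j]) / 2 for i, j in edges)
print(int(cut))   # number of edges whose endpoints fall in different sets
```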
The MaxCut problem is to find an assignment σ = (σ_1, σ_2, ..., σ_N), where N is the number of nodes in G, that maximizes Cut(G). Clearly, there are 2^N possible configurations of σ, so a naive search for the maximal configurations requires exponential complexity. If we insist on finding the maximal configuration, no polynomial-time method is known: the MaxCut problem is a classical NP-hard problem.

The MaxCut problem has many important applications, such as the efficient design of electric circuits or communication networks [1]. Given its importance, and despite the fact that its optimal solutions are intractable for large graphs, people are interested in practical approaches that approximate the optimal solutions.

There are classical approximation methods, like the greedy algorithm, the simulated annealing algorithm, and other heuristic methods (see [2]). Recently, there has also been rising interest in applying machine learning techniques, especially deep learning, to the MaxCut problem [3]–[5].

In this paper, we propose a novel machine-learning algorithm that balances optimality and speed. In particular, we propose a novel learn-to-transition (L2T) algorithm that uses the Hamiltonian of a corresponding Ising system as the loss function to train an RNN-style neural network, approximating the transition distribution to the ground state. Unlike many classical methods, our algorithm avoids premature local optima and often finds solutions comparable to the best solutions possible, and at the same time it proves to be far faster than previously known machine learning methods like variational classical annealing, so it can be used to solve much larger graphs.

Our proposed algorithm is inspired by the variational principle. First, we formulate the MaxCut problem as an Ising system; second, we use the Hamiltonian of this Ising system as the loss function to train an RNN-style neural network, approximating the transition distribution to the ground state. Finally, we evaluate our algorithm on well-known datasets and obtain good performance. Compared to the best-known solution for each graph instance, our L2T algorithm has a gap from −1.14% to −3.50%. For graphs of large size, our algorithm compares favorably with Gurobi [11], a state-of-the-art mathematical programming solver for optimization problems.
The remainder of the paper is organized as follows. Section II describes the Ising spin-glass model and a specific instance equivalent to the max-cut problem. Section III reviews known approaches to the max-cut problem. Section IV proposes our new machine-learning approach to the problem. Section V presents performance evaluations.

A. Notations

We use lowercase letters for scalars, e.g., x ∈ R; boldface lowercase letters for vectors, e.g., x ∈ R^n; and |S| to denote the cardinality of a set S.
II. THE ISING SPIN-GLASS MODEL

A. Definition

We consider an Ising spin system [6], [9], where the Hamiltonian of the spin system is (ignoring the field components)

    H(\sigma) = - \sum_{i<j} w_{ij} \sigma_i \sigma_j,                      (1)

where the spin vector σ contains N spins (σ_1, σ_2, ..., σ_N) ∈ {−1, +1}^N and w_ij is the interaction strength between spins σ_i and σ_j.

At any temperature T, a spin system will eventually reach the equilibrium state. At equilibrium, the relative probability of the system being in a state σ is directly proportional to e^{−H(σ)/T}, where T absorbs the Boltzmann constant k_B. Consequently, the probability of finding the system in state σ is

    p(\sigma, T) = \frac{e^{-H(\sigma)/T}}{Z(T)},                           (2)

where Z(T) = \sum_{\sigma'} e^{-H(\sigma')/T} is the normalization factor.

If the equilibrium probability p(σ, T = 0) can be found and easily sampled, then we can obtain the ground state with the minimum Hamiltonian by simply sampling from this probability.
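As an illustration of Eqs. (1)-(2), the sketch below enumerates all configurations of a tiny spin system and computes their Boltzmann probabilities by brute force; the 3-spin coupling matrix is a made-up example, and exhaustive enumeration is of course only feasible for very small N.

```python
# Minimal sketch of Eqs. (1)-(2): brute-force Boltzmann distribution of a
# tiny Ising system. The couplings below are a made-up 3-spin example.
import itertools
import math

w = {(0, 1): -1.0, (0, 2): -1.0, (1, 2): -1.0}   # w_ij for i < j (antiferromagnetic)
T = 1.0                                          # temperature (absorbing k_B)

def hamiltonian(sigma):
    # H(sigma) = - sum_{i<j} w_ij * sigma_i * sigma_j
    return -sum(wij * sigma[i] * sigma[j] for (i, j), wij in w.items())

states = list(itertools.product([-1, +1], repeat=3))
weights = [math.exp(-hamiltonian(s) / T) for s in states]
Z = sum(weights)                                 # normalization factor Z(T)
for s, wgt in zip(states, weights):
    print(s, hamiltonian(s), wgt / Z)            # p(sigma, T) = e^{-H/T} / Z
```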
B. The Equivalence of the Hamiltonian of a Spin System and the MaxCut Problem

Here we construct an Ising spin system whose Hamiltonian, as in Eq. (1), corresponds to the MaxCut problem.¹ Given a graph G(V, E), we create a spin system with |V| spins corresponding to the |V| nodes, and associate σ_i with node i. We let w_ij be −1 if and only if nodes i and j are connected in graph G; otherwise w_ij equals 0, i.e., there is no interaction between spins i and j if the corresponding nodes are not connected in graph G. As neighboring nodes prefer to have different signs, such a system is known as an antiferromagnetic spin system [6]. The Hamiltonian of this spin system is

    H(\sigma) = \sum_{(i,j) \in E} \sigma_i \sigma_j.                       (3)

¹ This equivalence has long been established. See, e.g., [7].

We also have that

    Cut(G) = \frac{1}{2} \sum_{(i,j) \in E} (1 - \sigma_i \sigma_j)
           = \frac{1}{2} \sum_{(i,j) \in E} 1 - \frac{1}{2} \sum_{(i,j) \in E} \sigma_i \sigma_j
           = \frac{1}{2} |E| - \frac{1}{2} H(\sigma).

Therefore, maximizing the cut size of a graph G is equivalent to minimizing the Hamiltonian of the spin system in Eq. (3). Since MaxCut is an NP-hard problem, this equivalence implies that finding the minimum Hamiltonian of a spin system is an NP-hard problem as well. However, the advantage of the spin-system formulation rests on insights from statistical mechanics. In particular, we know that in the equilibrium state at any temperature, the probability distribution of the spin system is the Boltzmann distribution. This provides a different approach to finding the ground state.

C. Ising System Using Binary Values

The Ising system has spin values {−1, +1}; we can map these spin values to binary values {0, 1} by the linear transformation x_i = (σ_i + 1)/2. Using x = (x_1, x_2, ..., x_N) ∈ {0, 1}^N, the Hamiltonian of the system becomes

    H(x) = \sum_{(i,j) \in E} (2 x_i - 1)(2 x_j - 1).                       (4)

Representing the Hamiltonian in the spin vector σ as in Eq. (3) is clearly equivalent to representing it in the binary vector x as in Eq. (4). For convenience, we henceforth use the binary-value representation of the system.
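The following sketch evaluates the binary-valued Hamiltonian of Eq. (4) on a small made-up instance and checks the identity Cut(G) = (|E| − H)/2 that follows from the derivation above.

```python
# Minimal sketch of Eq. (4) and the identity Cut(G) = (|E| - H)/2 for the
# antiferromagnetic system with w_ij = -1 on edges. Edge list is a made-up example.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
x = [1, 0, 1, 0]                                  # binary configuration, x_i = (sigma_i + 1)/2

H = sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in edges)   # Eq. (4)
cut = sum(1 for i, j in edges if x[i] != x[j])               # edges across the partition
assert cut == (len(edges) - H) // 2
print(H, cut)
```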
III. CLASSICAL METHODS

A. Greedy Algorithm

The greedy algorithm for the MaxCut of a graph G with N nodes is a greedy Markov decision process (MDP). We denote the space of all possible configurations as the state space Ω. Each configuration is a binary-valued vector of length N, where each node of the graph G is assigned 0 or 1 based on the set it belongs to after the cut. At step k, the algorithm determines the next state x^(k+1) solely based on the current state x^(k), as illustrated in Fig. 2.

To traverse the space Ω, we define the neighbors of a configuration x as the set of configurations x′ that differ from x at exactly one position in their length-N vector. Clearly every x has N neighbors, and any configuration can be reached from any other configuration in at most N steps.

We start at a random configuration x and consider its neighbors. The greedy algorithm first determines the neighbor configuration x′ with the lowest Hamiltonian value among all neighbors. Then, if and only if that Hamiltonian value is no higher than the current one, i.e., H(x) ≥ H(x′), the greedy algorithm moves to configuration x′ as its next state. This process repeats until a configuration has no neighbor with a lower Hamiltonian than itself and we become stuck, signaling that we have arrived at a local maximum of the cut size for the MaxCut problem.

Intuitively, the greedy algorithm is vulnerable to local optima.
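A minimal sketch of this greedy single-flip search is given below; it uses the binary Hamiltonian of Eq. (4) and, unlike the rule above, accepts only strictly improving flips so that it is guaranteed to terminate. The instance in the usage lines is a made-up example.

```python
# Minimal sketch of the greedy single-bit-flip search: move to the neighboring
# configuration with the lowest Hamiltonian; stop at a local optimum.
import random

def hamiltonian(x, edges):
    return sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in edges)

def greedy_maxcut(n, edges, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]          # random initial configuration
    while True:
        h = hamiltonian(x, edges)
        best_h, best_i = h, None                       # best single-flip neighbor so far
        for i in range(n):
            x[i] ^= 1
            hi = hamiltonian(x, edges)
            if hi < best_h:
                best_h, best_i = hi, i
            x[i] ^= 1                                  # undo the trial flip
        if best_i is None:                             # no neighbor improves: local optimum
            return x, (len(edges) - h) // 2            # configuration and its cut size
        x[best_i] ^= 1

x, cut = greedy_maxcut(6, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
print(cut)
```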
B. Simulated Annealing Algorithm

The simulated annealing algorithm works in a similar way to the greedy algorithm, but with a modified rule to avoid getting stuck at a local minimum prematurely. The idea is to introduce a so-called Metropolis factor [10] that is temperature-dependent. Given a configuration x, we move to a random neighbor x′, defined as in the greedy algorithm, with probability

    P = \min\left(1, e^{-(H(x') - H(x))/T}\right),

where the temperature T decreases at each step following some annealing schedule. We see that when H(x) ≥ H(x′), we must move to x′, consistent with the greedy algorithm. But now simulated annealing incorporates an element of regulated chaos, much as the annealing of metal itself does: we allow ourselves to move to neighbor x′ with probability P even if it does not offer a better result, in the hope of avoiding getting stuck early and finding not just a local minimum but something closer to the absolute minimum. As T decreases to 0 over time, however, such moves become increasingly unlikely, allowing us to close in on an answer.

Simulated annealing is a significant improvement over the greedy algorithm. However, it is still vulnerable to the many local minima of the MaxCut problem.
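The sketch below implements this Metropolis rule for random single-bit flips; the geometric cooling schedule and step budget are illustrative choices, not parameters taken from the paper. For clarity the Hamiltonian is recomputed from scratch at every step; an incremental update would be used in practice.

```python
# Minimal sketch of simulated annealing: propose a random single-bit flip and
# accept it with probability min(1, exp(-(H' - H)/T)), cooling T over time.
import math
import random

def simulated_annealing(n, edges, steps=10_000, t0=2.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    h = sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in edges)
    T = t0
    for _ in range(steps):
        i = rng.randrange(n)                    # random neighbor: flip one bit
        x[i] ^= 1
        h_new = sum((2 * x[a] - 1) * (2 * x[b] - 1) for a, b in edges)
        if h_new <= h or rng.random() < math.exp(-(h_new - h) / T):
            h = h_new                           # accept the move (Metropolis rule)
        else:
            x[i] ^= 1                           # reject: undo the flip
        T *= cooling                            # annealing schedule
    return x, (len(edges) - h) // 2
```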
C. The Variational Classical Annealing Algorithm

Some recent machine learning approaches try to learn the probability distribution p(x, T) of Eq. (2) with deep neural networks. The idea is to approximate p(x, T) by a parameterized density function q_θ(x), where θ is the model parameter. One promising idea is the variational classical annealing (VCA) approach proposed in [3] and further explored in [4], [5]. In [4], [5], q_θ(x) employs an upper-triangular structure that captures the auto-correlation among the spins,

    q_\theta(x) = \prod_{k=1}^{N} p_\theta(x_k \mid x_{i<k}).               (5)

[3], [4] use the KL divergence [6] between q_θ(x) and p(x, T) as the loss function to optimize the model parameter θ of q_θ(x) in approximating p(x, T). Surprisingly, and this is the key element for the efficacy of the approach, the loss function is independent of p(x, T); it is called the variational free energy:

    F(q_\theta) = \sum_x q_\theta(x) H(x) + T \sum_x q_\theta(x) \ln q_\theta(x)
                = \langle H(x) \rangle + T \langle \ln q_\theta(x) \rangle.     (6)

The variational free energy is the mean energy minus a term directly proportional to the Shannon entropy; both terms can be evaluated by sampling over q_θ(x).

When the annealing ends, q_θ(x) is an approximation of the probability distribution p(x, T = 0) of the ground state of the spin system. Sampling from this probability distribution, we are likely to obtain ground-state configurations.
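For illustration, the sketch below estimates the variational free energy of Eq. (6) by sampling. To keep it self-contained, q_θ is a factorized Bernoulli distribution rather than the autoregressive network of [3]-[5]; only the estimator ⟨H(x)⟩ + T⟨ln q_θ(x)⟩ is being demonstrated, on a made-up instance.

```python
# Minimal sketch of estimating the variational free energy of Eq. (6) by
# sampling: F(q) = <H(x)> + T <ln q(x)>, both terms averaged over x ~ q.
import math
import random

def free_energy_estimate(p, edges, T, num_samples=2_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = [1 if rng.random() < pi else 0 for pi in p]          # sample x ~ q_theta
        H = sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in edges)
        log_q = sum(math.log(pi if xi else 1.0 - pi) for pi, xi in zip(p, x))
        total += H + T * log_q                                   # energy + T * ln q(x)
    return total / num_samples

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]                 # made-up instance
p = [0.9, 0.1, 0.8, 0.2]                                         # Bernoulli parameters of q_theta
print(free_energy_estimate(p, edges, T=1.0))
```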
IV. LEARN THE TRANSITION PROBABILITY DISTRIBUTION TO THE GROUND STATE

A. Formulation

In the Ising system formulation described in Section II, we assume that from any initial state, as time elapses, the system will continuously evolve and converge to the equilibrium, where states are distributed in accordance with the Boltzmann distribution. So in the limit as the temperature approaches zero, the system will settle into the ground state, which has the lowest Hamiltonian. That means that given an initial state x^(0), there is a transition probability distribution p(x | x^(0)) that maps x^(0) to a ground state x. The discrete nature of the configuration x dictates that this transition probability distribution is non-linear and complex. Here we aim to deploy a deep neural network to learn this transition probability distribution.

Learning the transition probability also fits in the Markov decision process framework depicted in Fig. 2. We allow a state to transit to any state in the next step, according to the probability distribution, as opposed to only its neighbors as in simulated annealing.

[Fig. 2. A Markov decision process (MDP) representation of a greedy search algorithm for the max-cut problem. x^(k) is the configuration at step k. There are various ways to determine how to transit to the next configuration x^(k+1). A greedy algorithm determines the neighbor of the current state with the lowest Hamiltonian and moves to that configuration as the next state if and only if the Hamiltonian decreases.]

As the ground state has the minimum Hamiltonian, our learning network uses the expected Hamiltonian of the system as the loss function. Given the initial state x^(0), we can write down the conditional Hamiltonian based on the transition probability distribution,

    H(x \mid x^{(0)}) = \sum_x p(x \mid x^{(0)}) H(x).                      (7)

Therefore, if we can efficiently sample from p(x | x^(0)), we can efficiently evaluate Eq. (7) by Monte Carlo simulation.
Let us connect the expected Hamiltonian with the conditional Hamiltonian:

    \langle H(x) \rangle = \sum_x p(x) H(x)
                         = \sum_{x^{(0)}} \sum_x p(x, x^{(0)}) H(x)
                         = \sum_{x^{(0)}} p(x^{(0)}) \sum_x p(x \mid x^{(0)}) H(x)
                         = \big\langle \textstyle\sum_x p(x \mid x^{(0)}) H(x) \big\rangle_{p(x^{(0)})}
                         = \langle H(x \mid x^{(0)}) \rangle_{p(x^{(0)})}.

In the last step of the above equation, we replaced \sum_x p(x | x^(0)) H(x) by our definition of the conditional Hamiltonian H(x | x^(0)). As x^(0) is random, we sample it uniformly and approximate the expectation ⟨H(x)⟩. Thus, an efficient way to compute the expected Hamiltonian boils down to whether the transition probability can be efficiently sampled.

Note that the transition probability distribution can be factored into a chain of conditional probabilities:

    p(x \mid x^{(0)}) = p(x_1, x_2, \ldots, x_N \mid x^{(0)}) = \prod_{k=1}^{N} p(x_k \mid x_{i<k}, x^{(0)}).

Modeling the auto-regressive nature of p(x_k | x_{i<k}, x^{(0)}) often results in slow computation; that is the main reason the variational classical annealing method of Section III-C falls short of handling graphs with more than a few hundred nodes. We instead use a deep neural network to capture this auto-correlation and let p(x_k | x_{i<k}, x^{(0)}) ≃ q_θ(x_k | x^{(0)}), i.e., we drop the dependency on x_{i<k}. We emphasize that this assumption is heuristic and meant to facilitate a fast learning algorithm.

Specifically, we factor the transition probability distribution into a product of independent binary distributions,

    p(x \mid x^{(0)}) = \prod_{k=1}^{N} p(x_k \mid x_{i<k}, x^{(0)}) \simeq \prod_{k=1}^{N} q_\theta(x_k \mid x^{(0)}),   (8)

where q_θ(x_k | x^(0)) models the binary distribution of p(x_k = 0, 1 | x^(0)). Clearly, this approximation as a product of N independent distributions enables efficient sampling of the distribution.
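Under the factorization of Eq. (8), sampling a configuration amounts to drawing N independent Bernoulli variables, and the conditional Hamiltonian of Eq. (7) can be estimated by plain Monte Carlo, as the sketch below shows. The probabilities stand in for the network output and are made-up values.

```python
# Minimal sketch of Eqs. (7)-(8): estimate H(x | x^(0)) by Monte Carlo, sampling
# each x_k independently from the factorized distribution q_theta(x_k | x^(0)).
import random

def conditional_hamiltonian(q, edges, num_samples=1_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = [1 if rng.random() < qk else 0 for qk in q]     # x_k ~ Bernoulli(q_k), independently
        total += sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in edges)
    return total / num_samples                              # Monte Carlo estimate of Eq. (7)

q = [0.95, 0.05, 0.90, 0.10]                                # stand-in for q_theta(x_k | x^(0))
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(conditional_hamiltonian(q, edges))
```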
B. Loss Function

Given that the transition probability in Eq. (8) is modeled by a neural network parameterized by θ, we can write our loss function as J(θ) = ⟨H(x)⟩. The adopted Long Short-Term Memory (LSTM) network described below has standard ways to compute the stochastic gradient; thus, efficient computation of J(θ) means efficient updating of θ and efficient optimization.

C. Neural Network Structure

We use an LSTM network as our policy network. To approximate the correlation among the individual spins x_k of x, we use a K-step LSTM from x^(0) to the final outputs q_θ(x_k | x^(0)).

[Fig. 3. LSTM network to learn the transition probability distribution.]

In Fig. 3, the one-step LSTM takes as input x^(k), the cell state c_k, and the hidden state h_k. The initial input x^(0) is binary-valued in {0, 1}^N; the subsequent x^(k) are not binary configurations but real-valued in [0, 1]^N. However, the output function sigmoid(·) generating x^(k) can be made to transition sharply between the binary values 0 and 1, so each x^(k) approximates a valid configuration, and calculating H(x^(k)) for these intermediate states x^(k) actually makes sense.

We use the K-step output x^(K) as the transition probability distribution p(x | x^(0)), which is the product of binary-valued probability distributions Π_i p(x_i | x^(0)); more specifically, p(x_i = 1 | x^(0)) is set to be the i-th component of x^(K). The decomposition allows efficient sampling, and thus we can easily evaluate H(x | x^(0)) by sampling.

D. Putting It Together: Our Algorithm

Our algorithm is listed in Alg. 1. In the training stage, line 4 generates x^(0) randomly; lines 5-6 are the K-step LSTM, producing the binary distribution q_θ(x_k | x^(0)); line 7 computes H(x | x^(0)) by sampling the distribution p(x | x^(0)); line 8 computes ⟨H(x)⟩ by averaging over x^(0). This is the loss function, whose gradient can be computed from the LSTM structure.

Algorithm 1: Learning-to-Transition (L2T) Algorithm
Input: L (number of epochs), x (decision variables), K (number of steps), M (number of environments), η (learning rate)
Training stage: a neural network (LSTM) π_θ with parameter θ
01: Initialize the neural network π_θ with random θ;
02: for epoch l = 1, ..., L
03:   for environment m = 1, ..., M
04:     Initialize x^(0) randomly;
05:     for k = 1, ..., K
06:       x^(k) = π_θ(x^(k−1));
07:     Compute H(x | x^(0)) by sampling from the probabilities given by x^(K);
08:     J(θ) = ⟨H(x)⟩ = ⟨H(x | x^(0))⟩;
09:     θ = θ − η ∇_θ J(θ);
Testing stage: use policy network π_θ
10: Initialize x^(0) randomly;
11: for k = 1, ..., K
12:   x^(k) = π_θ(x^(k−1));
Output: x^(K).

Note that the described algorithm does not allow for more exploration at the beginning of training. To facilitate learning using an annealing process, we can introduce a time-dependent parameter in the sigmoid function. We are still experimenting with this idea to see whether it improves performance.
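A heavily simplified PyTorch sketch of the training stage of Algorithm 1 is given below. The paper does not fix the hidden size, how x^(k) is fed back into the LSTM, or how the gradient is propagated through the sampling step; to keep the sketch differentiable end to end, it replaces the sampled estimate of line 7 with the closed-form expectation of the Hamiltonian under the independent-Bernoulli factorization, E[(2x_i − 1)(2x_j − 1)] = (2p_i − 1)(2p_j − 1). It is an assumption-laden illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class L2TPolicy(nn.Module):
    def __init__(self, n_nodes, hidden):
        super().__init__()
        self.cell = nn.LSTMCell(n_nodes, hidden)
        self.head = nn.Linear(hidden, n_nodes)

    def forward(self, x, state):
        h, c = self.cell(x, state)
        p = torch.sigmoid(self.head(h))     # q_theta(x_k = 1 | x^(0)), one probability per node
        return p, (h, c)

def expected_hamiltonian(p, edge_index):
    # Closed-form <H(x)> under independent Bernoulli components (our simplification).
    s = 2.0 * p - 1.0                       # E[2 x_i - 1] for each node
    i, j = edge_index                       # two LongTensors with the edge endpoints
    return (s[:, i] * s[:, j]).sum(dim=1)   # one expected Hamiltonian per environment

def train(edge_index, n_nodes, epochs=1000, K=10, M=64, hidden=128, lr=1e-3):
    policy = L2TPolicy(n_nodes, hidden)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        x = torch.randint(0, 2, (M, n_nodes)).float()            # random x^(0), one per environment
        state = (torch.zeros(M, hidden), torch.zeros(M, hidden))
        for _ in range(K):                                        # K-step LSTM rollout (lines 5-6)
            x, state = policy(x, state)
        loss = expected_hamiltonian(x, edge_index).mean()         # J(theta) = <H(x)> (lines 7-8)
        opt.zero_grad()
        loss.backward()
        opt.step()                                                # gradient step on theta (line 9)
    return policy
```

Here edge_index is assumed to be a pair of integer tensors listing the endpoints of each edge; at test time one would roll the trained policy forward for K steps from a random x^(0) and round the final probabilities to obtain a binary configuration.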
E. Connection with the VCA Algorithm

Our current approach directly seeks a probability distribution p(x) that minimizes the expected energy H(x). This should be contrasted with the variational classical annealing described in Section III. There, to approximate the equilibrium distribution p(x), the KL divergence is used as the loss function, and Eq. (6) says we need to minimize the difference between the expected energy and the temperature-scaled Shannon entropy. Our approach here drops the Shannon entropy term of Eq. (6) in approximating the true probability distribution. The Shannon entropy can be regarded as a regularization term that prefers distributions with higher entropy when the temperature is high, a regularization that slowly fades as the temperature drops. It is also notable that our parameters aim to determine the next configuration from the previous one, while the parameters in VCA determine the value of each spin sequentially, conditioning on the values of the spins at all lower indices. By thus de-emphasizing the auto-correlation between individual spins in order to capture transition behavior from one whole configuration to another, our approach proves to be much faster.
V. EXPERIMENT RESULTS

A. Experiment Setup

We implemented the proposed L2T algorithm of Alg. 1 on a Linux server with one A100 GPU. We used different parameters in different cases. The learning rate η is 3 × 10^{−5} ∼ 2 × 10^{−3}, the number of epochs L is 1,000,000, and the number of steps K is 1,000. When the number of nodes of a graph increases, the required GPU memory also increases; therefore, we reduce the number of environments accordingly. The number of environments M is 20 ∼ 2048.
B. Performance on Synthetic Graphs

Table I shows the results on synthetic data with N = 20, 30, 100, 1000, 5000. We compare our L2T algorithm with Gurobi [11], a state-of-the-art mathematical programming solver that can be used to solve optimization problems. For the instance with 5000 nodes, Gurobi cannot return a solution within 8 hours. L2T finds better solutions than Gurobi for all test instances. When N ≥ 100, L2T achieves a speedup of 36.97× ∼ 60.6× compared with Gurobi.

TABLE I
RESULTS FOR GRAPH MAXCUT ON SYNTHETIC INSTANCES

Nodes   Gurobi             L2T                Improvement   Speedup
20      67 (5 s)           71 (36 s)          +5.97%        0.139×
30      132 (10 s)         135 (93 s)         +2.27%        0.108×
100     1408 (2000 s)      1415 (33 s)        +0.49%        60.6×
1000    128508 (4400 s)    129714 (119 s)     +0.94%        36.97×
5000    - (> 8 h)          3175813 (202 s)    -             -
C. Performance on Benchmark Graphs

Table II compares our L2T method with five baselines on seven instances from the benchmark dataset Gset [12]. The baselines are SDP (DSDP) [13], breakout local search (BLS) [2], Tabu search (KHLWG) [14], recurrent GNN (RUN-CSP) [15], and physics-inspired GNN (PI-GNN) [16]. We evaluate each solution via the relative gap (H(x) − H(x*))/H(x*), where x* is the best-known solution. Compared to the best-known solution for each graph instance, our L2T method has a gap from −1.14% to −3.50%.

TABLE II
RESULTS ON THE GSET DATASET. THE COMPARED METHODS ARE BLS, DSDP, KHLWG, RUN-CSP, AND PI-GNN.

Graph   Nodes   Edges   BLS     DSDP    KHLWG   RUN-CSP   PI-GNN   L2T (Ours)
G14     800     4694    3064    -       2922    3061      2943     3029
G15     800     4661    3050    2938    3050    2928      2990     2995
G22     2000    19990   13359   12960   13359   13028     13181    13167
G49     3000    6000    6000    6000    6000    6000      5918     5790
G50     3000    6000    5880    5880    5880    5880      5820     5720
G55     5000    12468   10294   9960    10236   10116     10138    10017
G70     10000   9999    9541    9456    9458    -         9421     9358
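For concreteness, the relative gap can be reproduced from Table II by taking, for each instance, the best value among the compared methods as the best-known solution (an assumption on our part) and comparing it with the L2T cut value; the two instances below give the endpoints of the reported range.

```python
# Sketch of the relative-gap metric, gap = (obj - best) / best, applied to the
# cut values reported in Table II; "best" is the largest value among the
# compared methods for each instance (our assumption).
best_known = {"G14": 3064, "G49": 6000}
l2t        = {"G14": 3029, "G49": 5790}

for g in best_known:
    gap = (l2t[g] - best_known[g]) / best_known[g]
    print(f"{g}: {gap:+.2%}")      # G14: -1.14%, G49: -3.50% — the ends of the reported range
```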

REFERENCES

[1] F. Barahona, M. Grötschel, M. Jünger, and G. Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493-513, 1988.
[2] U. Benlic and J. Hao. Breakout local search for the max-cut problem. Engineering Applications of Artificial Intelligence, 26(3):1162-1173, 2013.
[3] Dian Wu, Lei Wang, and Pan Zhang. Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602, 2019.
[4] Mohamed Hibat-Allah, Estelle M. Inack, Roeland Wiersema, Roger G. Melko, and Juan Carrasquilla. Variational neural annealing. Nature Machine Intelligence, 3(11):952-961, Nov 2021.
[5] Shoummo Ahsan Khandoker, Jawaril Munshad Abedin, and Mohamed Hibat-Allah. Supplementing recurrent neural networks with annealing to solve combinatorial optimization problems. Machine Learning: Science and Technology, 4, 2023.
[6] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, England, 2003.
[7] Andrew Lucas. Ising formulations of many NP problems. Frontiers in Physics, page 5, 2014.
[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA, 2016.
[9] M. Mézard, G. Parisi, and M. Virasoro. Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, Vol. 9. World Scientific, Singapore, 1987.
[10] J. D. Nulton and P. Salamon. Statistical mechanics of combinatorial optimization. Physical Review A, 37(4), Feb 1988.
[11] Homepage of Gurobi (2023). https://www.gurobi.com.
[12] Gset dataset at Stanford (2023). https://web.stanford.edu/~yyye/yyye/Gset/
[13] C. Choi and Y. Ye. Solving sparse semidefinite programs using the dual scaling algorithm with an iterative solver. Manuscript, Department of Management Sciences, University of Iowa, Iowa City, IA, 2000.
[14] G. A. Kochenberger, J.-K. Hao, Z. Lü, H. Wang, and F. Glover. Solving large scale max cut problems via tabu search. Journal of Heuristics, Springer, 2013.
[15] J. Toenshoff, M. Ritzert, H. Wolf, and M. Grohe. Graph neural networks for maximum constraint satisfaction. Frontiers in Artificial Intelligence, 2021.
[16] M. Schuetz, J. K. Brubaker, and H. G. Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 2022.