
MODRL/D-EL: Multiobjective Deep Reinforcement Learning with Evolutionary Learning for Multiobjective Optimization

arXiv:2107.07961v1 [cs.NE] 16 Jul 2021

1st Yongxin Zhang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, P. R. China
zhangyx266@mail2.sysu.edu.cn

2nd Jiahai Wang (corresponding author)
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, P. R. China
wangjiah@mail.sysu.edu.cn

3rd Zizhen Zhang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, P. R. China
zhangzzh7@mail.sysu.edu.cn

4th Yalan Zhou
College of Information, Guangdong University of Finance and Economics, Guangzhou, P. R. China
zhouylan@163.com

Abstract—Learning-based heuristics for solving combinatorial optimization problems have recently attracted much academic attention. While most of the existing works only consider the single-objective problem with simple constraints, many real-world problems have a multiobjective perspective and contain a rich set of constraints. This paper proposes a multiobjective deep reinforcement learning with evolutionary learning algorithm for a typical complex problem called the multiobjective vehicle routing problem with time windows (MO-VRPTW). In the proposed algorithm, the decomposition strategy is applied to generate subproblems for a set of attention models. Comprehensive context information is introduced to further enhance the attention models. Evolutionary learning is also employed to fine-tune the parameters of the models. The experimental results on MO-VRPTW instances demonstrate the superiority of the proposed algorithm over other learning-based and iterative-based approaches.

(This work is supported by the National Key R&D Program of China (2018AAA0101203), the National Natural Science Foundation of China (62072483), and the Natural Science Foundation of Guangdong Province (2018A030313703, 2021A1515012298). Corresponding author: Jiahai Wang.)

I. INTRODUCTION

A multiobjective optimization problem (MOP) can be generally defined as follows,

    min F(x) = (f_1(x), f_2(x), ..., f_m(x))^T,   s.t. x ∈ X,    (1)

where X is the solution space and F(x) is an m-dimensional vector in which f_i is the i-th objective function of the MOP. Since these m objective functions potentially conflict with each other, one is keen on finding a set of trade-off solutions among which the most desirable solution can be chosen.

Formally, let x, y ∈ R^m be two objective vectors of the MOP. x is said to dominate y, denoted as x ≺ y, if x_i ≤ y_i, ∀i ∈ {1, 2, ..., m} and x_i < y_i, ∃i ∈ {1, 2, ..., m}. A solution x* is called a Pareto optimal solution if F(x) ⊀ F(x*), ∀x ∈ X. The set containing all the Pareto optimal solutions is called the Pareto set (PS), and the set {F(x) | x ∈ PS} is called the Pareto frontier (PF) [1].
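To make the dominance relation and the weighted-sum scalarization used later (Section III-A) concrete, here is a minimal Python sketch; the function names are our own, chosen for illustration:

```python
from typing import Sequence

def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    """Return True if objective vector x dominates y (minimization)."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def weighted_sum(objs: Sequence[float], weights: Sequence[float]) -> float:
    """Scalarize an objective vector with a weight vector (decomposition-style)."""
    return sum(w * f for w, f in zip(weights, objs))

# Example with two objective vectors of a bi-objective problem.
f_a, f_b = (10.0, 3.0), (12.0, 3.0)
print(dominates(f_a, f_b))            # True: no worse in all objectives, better in one
print(weighted_sum(f_a, (0.7, 0.3)))  # scalarized value for weight vector (0.7, 0.3)
```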
In the literature, many iterative-based algorithms have been proposed to solve MOPs. These algorithms can be roughly divided into two categories: multiobjective evolutionary algorithms (MOEAs) [2] (e.g., MOEA/D [1] and NSGA-II [3]) and multiobjective local search based algorithms (e.g., MOGLS [4]). MOEAs search for the PF by evolving a population of solutions simultaneously. MOGLS uses local search operators to improve solutions iteratively. These algorithms perform well on many multiobjective benchmark problems. However, as presented in [5], the iterative-based algorithms have some drawbacks. First, to find near-optimal solutions, a large number of iterations are required, which is very time-consuming. Second, the iteration process needs to be re-performed once the problem changes slightly or a similar instance is encountered. Third, most heuristic operators used in these algorithms rely heavily on experts' empirical knowledge, which may affect the solution quality.

Recently, learning-based heuristics for solving optimization problems have attracted more and more attention [6]–[8]. In particular, deep reinforcement learning (DRL) has made remarkable achievements in solving single-objective problems [9]–[13]. In addition, based on the decomposition strategy [1] and a DRL algorithm, a multiobjective deep reinforcement learning (MODRL) framework [5] has been proposed for MOPs. Following this framework, [14] proposes a multiobjective deep reinforcement learning algorithm with attention model (MODRL/D-AM), which achieves good performance on the multiobjective traveling salesman problem.
The characteristics of these works are summarized in Table I.

In this paper, we initiate the study on tackling an MOP with complex constraints, the multiobjective vehicle routing problem with time windows (MO-VRPTW), which is more applicable in real-world scenarios, and we try to apply learning-based heuristics to this problem. We have found that the attention model [13] in MODRL/D-AM constructs a solution by sequentially adding nodes to a partial solution. The selection of the next node is based on very limited context information, which may lead to inefficiencies in MOPs such as MO-VRPTW. Besides, the diversity of the solutions obtained by MODRL/D-AM is not as good as expected [5]. To address these issues, this paper proposes a multiobjective deep reinforcement learning and evolutionary learning algorithm (MODRL/D-EL). In the proposed algorithm, two mechanisms are employed. First, the comprehensive context information (CCI) of the MOP is taken into consideration; the attention model with CCI can make more reasonable decisions in constructing solutions. Second, a hybrid training method composed of deep reinforcement learning and evolutionary learning (EL) is employed to train the models. The comparisons between our method and the related works are shown in Table I.

To summarize, the contributions of this paper are threefold:
• The attention model with comprehensive context information is introduced to solve MO-VRPTW. It is shown to be effective in improving the solution quality.
• A two-stage hybrid learning strategy is presented. In the first stage, DRL is employed to train the models. In the second stage, evolutionary learning is used to fine-tune the parameters of the models. The results indicate that both the convergence and the diversity of the solutions can be improved.
• Computational experiments are conducted on both random instances and classical benchmark instances. It is demonstrated that the proposed method outperforms several existing methods.

The remainder of this paper is structured as follows. In Section II, the problem formulation of MO-VRPTW is introduced. MODRL/D-AM is briefly reviewed in Section III. Thereafter, Section IV provides a detailed description of the proposed MODRL/D-EL. Experimental results are shown and analyzed in Section V. Conclusions are drawn in Section VI.

II. PROBLEM FORMULATION

MO-VRPTW is a classical MOP with complex constraints and has a high relevance for many practical applications. It can be defined on a complete undirected graph G = {V, E}. The node set is V = {v_i | i = 0, 1, ..., N}, where node i = 0 is the depot and i ∈ {1, ..., N} are the customers. Each node v_i has attributes including Euclidean coordinates c_i ∈ R^2, a time window constraint t_i ∈ R^2, and a demand q_i ∈ R^+ that needs to be satisfied. The time window constraint t_i is a tuple [a_i, b_i], a_i ≤ b_i, where a_i and b_i are the lower and upper bounds of the possible service time. Moreover, the service at each customer v_i ∈ V requires an identical service duration w. For the depot, q_0 = 0 and t_0 = [a_0, b_0], where a_0 and b_0 denote the earliest possible departure from and latest return to the depot, respectively. E = {e_ij | i, j ∈ V} is the edge set of graph G. The traveling distance of e_ij is defined as the Euclidean distance between v_i and v_j, and the traveling time of e_ij is numerically equal to the traveling distance. There is a fleet of K uniform vehicles with identical capacity Q > 0. Vehicle k departs from the depot, serves customers with respect to the capacity and time window constraints, and finally returns to the depot. The sequence of customers that the vehicle serves constitutes a route r_k. A solution is a set of routes, denoted as s = {r_1, r_2, ..., r_K}. A solution s needs to meet the following constraints: (1) all customers are visited exactly once, i.e., ∪_{r∈s} r = V and r_a ∩ r_b = ∅, ∀a, b ∈ {1, 2, ..., K}, a ≠ b; (2) all routes satisfy the capacity constraint, i.e., Σ_{i∈r_k} q_i ≤ Q; (3) all customers are served within their specific time windows.

This paper focuses on two conflicting objectives, which are commonly adopted in previous work [15]: the total traveling distance and the makespan, formally described as follows,

    min F(s) = (f_1(s), f_2(s))^T,
    f_1(s) = Σ_{k=1}^K c_k,    (2)
    f_2(s) = max(c_1, c_2, ..., c_K),
    s.t. s ∈ S,

where c_k is the traveling distance of the k-th route in solution s, and S is the valid solution space under the capacity and time window constraints.
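As a concrete illustration of (2), the following sketch evaluates the two objectives of a candidate solution. The data layout (a list of routes of customer indices, depot at index 0) is an assumption made for illustration, and the feasibility checks of the constraints above are omitted:

```python
import math

def route_length(route, coords, depot=0):
    """Traveling distance of one route: depot -> customers -> depot."""
    tour = [depot] + list(route) + [depot]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))

def mo_vrptw_objectives(solution, coords):
    """solution: list of routes, each a list of customer indices."""
    lengths = [route_length(r, coords) for r in solution]
    f1 = sum(lengths)   # total traveling distance
    f2 = max(lengths)   # makespan (travel time equals distance numerically)
    return f1, f2

coords = {0: (0.5, 0.5), 1: (0.1, 0.2), 2: (0.9, 0.8), 3: (0.3, 0.7)}
print(mo_vrptw_objectives([[1, 3], [2]], coords))
```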
III. BRIEF REVIEW OF MODRL/D-AM

A. Decomposition Strategy

A well-known decomposition strategy [1] is employed in MODRL/D-AM for MOPs. Specifically, given a set of uniformly spread weight vectors Λ = {λ_1, λ_2, ..., λ_M}, λ_i = (λ_i^1, ..., λ_i^m)^T, MODRL/D-AM adopts the weighted sum approach [16] to convert the MOP into M single objective subproblems. Then, a set of neural network models Π = {π_1, π_2, ..., π_M} is trained for tackling these subproblems, where π_i corresponds to the subproblem with weight vector λ_i.

MO-VRPTW is decomposed into M single objective subproblems through the decomposition strategy. For each subproblem, reinforcement learning is employed to train a neural network model to solve it. The states, actions, transition, rewards and policy in the reinforcement learning framework are defined as follows:
• States: a state consists of two parts. The first part includes the partial solution of the subproblem and the candidate nodes. The second part is the position, remaining capacity, travel distance and travel time of each vehicle;
• Actions: an action is the next node to visit;
• Transition: a transition corresponds to adding the selected action (node) to the partial solution, and updating the state of the corresponding vehicle;
• Rewards: the reward function of the i-th subproblem is defined as the weighted sum of the two objectives, i.e., r_i = λ_i^T F(s);
• Policy: the policy is parameterized with a deep neural network model π_i that selects a candidate node as the next node to visit according to the current state. The details of the neural network model are introduced in the following subsection.

TABLE I
THE CHARACTERISTICS OF RELATED WORKS AND OUR PAPER

Objective                     | Literature | Problem  | Encoder               | Context information                                                                | Training method
Single objective optimization | [9]        | TSP      | LSTM                  | last visited node                                                                  | DRL
                              | [10]       | TSP      | GNN                   | last visited node                                                                  | DRL
                              | [11]       | VRP      | linear embedding      | last visited node                                                                  | DRL
                              | [12]       | TSP      | attention model       | last k visited nodes                                                               | DRL
                              | [13]       | TSP, VRP | attention model       | graph information, last visited node, first visited node (depot for VRP)           | DRL
Multiobjective optimization   | [5]        | MO-TSP   | LSTM                  | last visited node                                                                  | DRL
                              | [14]       | MO-TSP   | attention model       | graph information, last visited node, first visited node                           | DRL
                              | Our paper  | MO-VRPTW | attention model + MLP | graph information, last visited node, depot, fleet information, route information  | DRL + EL

B. Model of Subproblem: Attention Model

In MODRL/D-AM, an attention model [13] is employed for a subproblem. The attention model, say π, is based on the encoder-decoder architecture and is parametrized by θ.

1) Encoder: The encoder first takes the node-wise features x_i = (c_i, t_i, q_i), i ∈ V, as input. It computes the initial embedding h_i^(0) ∈ R^{d_emb} (d_emb is set to 128 empirically) of v_i through a linear projection, i.e., h_i^(0) = W x_i + B. Then the initial embedding h_i^(0) is updated through a stack of three self-attention (SA) blocks, i.e.,

    h_i = SA(SA(SA(h_i^(0), H^(0)))),   i ∈ V,    (3)

where H^(0) = (h_0^(0), h_1^(0), ..., h_N^(0)) is the sequence of initial embeddings. An SA block consists of two sublayers: a multi-head attention layer (MHA) and a node-wise fully connected feed-forward layer (FF). Each sublayer also includes a skip-connection [17] and batch normalization (BN) [18], i.e.,

    ĥ_i = BN_l(h_i^(l−1) + MHA_l(h_i^(l−1), H^(l−1))),    (4)
    h_i^(l) = BN_l(ĥ_i + FF_l(ĥ_i)),    (5)

where h_i^(l) denotes the node embedding produced by block l. The multi-head attention layer is a linear combination of multiple (e.g., eight) single-head attention (SHA) layers, each of which takes a slice of all the input elements, formally defined as follows,

    MHA(h, H; W) = Σ_{i=1}^Z W_i^head SHA(h_slice(i), H_{·,slice(i)}; W),    (6)

    SHA(h, H; W) = Σ_{j=1}^{|H|} attn(h, H; W^query, W^key)_j W^value H_j,    (7)

    attn(h, H; W) = softmax( h^T (W^query)^T W^key H / √d_key ).    (8)

The fully connected layer is defined as follows,

    FF(h_i; W, B) = max(0, W h_i + B).    (9)

2) Decoder: The decoder sequentially constructs a solution in T time steps. In each step, the decoder outputs a node for the next visit. If the next visited node is the depot, the current vehicle route is closed and a new vehicle is dispatched, until all the customers have been served. Specifically, at each time step t ∈ {1, 2, ..., T}, the decoder outputs node v^t based on the sequence of node embeddings H = (h_0, h_1, ..., h_N) from the encoder and the context embedding C^t. In MODRL/D-AM, the context embedding C^t consists of the graph embedding h^graph = (1/(N+1)) Σ_{i=0}^N h_i, the last served node embedding h^{t−1}, the depot embedding h^depot, the remaining capacity Q^t of the current vehicle, and the current traveling time-point P^t of the vehicle. C^t is formally defined as follows:

    C^t = [ h^graph ; h^{t−1} ; h^depot ; Q^t ; P^t ],    (10)

where "[ ; ]" denotes the concatenation operator.

The decoder consists of an MHA layer and a subsequent layer with just attention weights operating on a masked input sequence. A valid solution under the capacity and time window constraints is guaranteed by the mask function. The mask function rules out the nodes that would violate the constraints and the nodes that are already served by setting the corresponding attention weights to −∞, which thus yield a probability of zero, i.e.,

    p_θ(v^t = i | s^{t−1}) = π_θ(v^t = i | H, C^{t−1}) = attn(MHA(C^{t−1}, H), mask(H)),    (11)

where s^{t−1} denotes the partial solution at time step t − 1. The probability of a complete solution s for instance G can be modeled by the chain rule as follows,

    p_θ(s | G) = Π_{t=1}^T π_θ(v^t | H, C^{t−1}).    (12)
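A minimal NumPy sketch of the masked attention step behind (8) and (11): the logits of infeasible or already-served nodes are set to −∞ before the softmax, so their selection probability is exactly zero. Shapes and variable names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def attn_probs(context, H, W_query, W_key, mask):
    """context: (d,), H: (N+1, d) node embeddings, mask: (N+1,) bool, True = forbidden."""
    d_key = W_key.shape[1]
    q = context @ W_query                     # query projection of the context
    k = H @ W_key                             # key projections of the nodes
    logits = (k @ q) / np.sqrt(d_key)         # compatibility of each node with the context
    logits = np.where(mask, -np.inf, logits)  # rule out infeasible / served nodes
    exp = np.exp(logits - logits[~mask].max())
    return exp / exp.sum()                    # softmax over the feasible nodes only

rng = np.random.default_rng(0)
d, N = 8, 5
H = rng.normal(size=(N + 1, d))
mask = np.array([False, True, False, False, True, False])  # nodes 1 and 4 forbidden
p = attn_probs(rng.normal(size=d), H, rng.normal(size=(d, d)), rng.normal(size=(d, d)), mask)
print(p.round(3), p[mask].sum())  # forbidden nodes receive probability 0
```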
C. Training Algorithm: Reinforcement Learning

For a specific subproblem with its model π_θ, the well-known policy gradient based REINFORCE algorithm [19] is employed to train it.
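In an automatic-differentiation framework, this REINFORCE update (made precise in (13)-(14) below) reduces to a surrogate loss whose gradient equals the Monte Carlo estimate. A hedged PyTorch-style sketch; the tensor names and the baseline values are placeholders, not the authors' code:

```python
import torch

def reinforce_loss(log_probs, costs, baseline_costs):
    """
    log_probs:      (B,) sum of log pi_theta(v_t | ...) over the decoding steps
    costs:          (B,) scalarized cost lambda^T F(s_j | G_j) of the sampled solutions
    baseline_costs: (B,) cost of a baseline solution (e.g., a greedy rollout)
    """
    advantage = (costs - baseline_costs).detach()   # no gradient through the advantage
    return (advantage * log_probs).mean()           # gradient descent on this follows (14)

# Illustrative usage with dummy values; real values come from decoding a batch.
log_probs = torch.randn(128, requires_grad=True)
loss = reinforce_loss(log_probs, torch.rand(128) * 20, torch.rand(128) * 20)
loss.backward()
```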
The gradient of the parameters θ is formulated as follows,

    ∇_θ J(θ | G) = E_{s∼p_θ(·|G)} [ (λ^T F(s | G) − b(G)) ∇_θ log p_θ(s | G) ],    (13)

where G is an input instance, s is a complete solution, λ is the weight vector, and b(G) is a value function used to reduce the gradient variance. In the training process, the gradient of the parameters θ can be approximated by Monte Carlo sampling as follows,

    ∇_θ J(θ | G) ≈ (1/B) Σ_{j=1}^B (λ^T F(s_j | G_j) − b(G_j)) ∇_θ log p_θ(s_j | G_j),    (14)

where B is the batch size. For the training of the M subproblem models, the neighborhood-based parameter-transfer strategy [5] is employed to accelerate the training process.

IV. THE PROPOSED ALGORITHM: MODRL/D-EL

A. Motivation

MODRL/D-AM performs well on simple MOPs (e.g., MO-TSP). In more complicated problems such as MO-VRPTW, the context embedding C in the attention model only contains information about the vehicle status and the last visited node in the current route, while the nodes and vehicles in other routes are not taken into consideration. The limited context information usually leads to sub-optimal decisions. Besides, the diversity of the solutions obtained by MODRL/D-AM is not as good as expected [5]. Therefore, this paper proposes to consider more comprehensive context information in constructing solutions, and uses evolutionary learning to fine-tune the parameters of the models to improve the diversity of the solutions.

B. Model of Subproblem: Attention Model with Comprehensive Context Information

The proposed model of a subproblem is also based on the attention mechanism [13]. The node encoder is consistent with the encoder in MODRL/D-AM. To provide comprehensive context information for the decoder to make more reasonable decisions in constructing solutions, a vehicle encoder and a route encoder are introduced, and the context embedding is extended as in [20]. The overall architecture of our model is shown in Fig. 1.

Both the vehicle encoder and the route encoder are multi-layer perceptrons (MLPs), denoted as g_v and g_l, respectively; g_v has 3 layers and g_l has 2 layers. g_v encodes the vehicle feature d_k, which consists of the vehicle index k, the current return distance to the depot, the current position (given by the coordinates of the last served node) and the current time of vehicle k. g_l encodes the route feature (given by the nodes served so far). Finally, the embedding of vehicle k, denoted as h_k^vehicle, is computed by aggregating the information of g_l and g_v:

    h_k^vehicle = [ g_v(d_k) ; (1/|r_k|) Σ_{i∈r_k} g_l(h_i^node) ].    (15)

The new context embedding also contains information about the entire fleet and the last served nodes of the vehicles, defined as follows,

    h^fleet = (1/K) Σ_{k=1}^K h_k^vehicle,   h^last = (1/K) Σ_{k=1}^K h^node_{last(r_k)},    (16)

where last(r_k) denotes the last node in route r_k. Finally, the context embedding is defined as follows,

    C = [ h^graph ; h^fleet ; h^cur ; h^depot ; h^last ],    (17)

where h^cur denotes the embedding of the vehicle in the route currently being constructed, and h^depot denotes the embedding of the depot. To extract the features of nodes and vehicles more effectively, the node embeddings h^node and the vehicle embedding h^vehicle are combined to create a new sequence of node embeddings H, called the joint combinatorial space [20] of all nodes and the current vehicle, formally defined as follows,

    H = { g_a(h^cur, h_i^node) | i ∈ V },    (18)

where g_a is a neural network that computes the compatibility between a vehicle and a node, formally defined as follows,

    g_a(h^cur, h_i^node; W_1, W_2, W_3) = W_1 h_i^node + W_2 h^cur + W_3 [ h^cur ⊙ h_i^node ; (h^cur)^T h_i^node ],    (19)

where ⊙ denotes the element-wise product.

Algorithm 1 Evolutionary Learning
Input: the initial population Π = {π_1, π_2, ..., π_M}, where π_i has parameters θ_i; training batch size B; maximum generation D
Output: optimized model set Π* = {π_1*, π_2*, ..., π_M*}
 1: for i = 1, ..., M do
 2:   fitness_{π_i} = Evaluate(π_i)
 3: end for
 4: for g = 1, ..., D do
 5:   for i = 1, ..., M do
 6:     G_j = RandomInstance(), ∀j ∈ {1, ..., B}   ▷ randomly sample training data
 7:     ε = sqrt( Σ_{k=0}^N ( Σ_{j=1}^B ∇_θ p(v^{t_j} = k | s^{t_j−1}, G_j) )^2 )
 8:     x ∼ N(0, µI)
 9:     θ̂_i = θ_i + x/ε
10:     generate offspring π̂_i with parameters θ̂_i and fitness_{π̂_i} = Evaluate(π̂_i)
11:   end for
12:   sort individuals (policies/models) by non-domination level and crowding distance, and select M individuals for the next generation
13: end for
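Line 12 of Algorithm 1 is the NSGA-II-style survivor selection. A compact, hedged sketch of non-dominated sorting and crowding-distance ranking for bi-objective (minimization) fitness vectors, written for illustration rather than taken from the authors' code:

```python
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_fronts(fits):
    """Split indices into fronts: front 0 is the non-dominated set, and so on."""
    remaining, fronts = set(range(len(fits))), []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(fits[j], fits[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(front, fits):
    dist = {i: 0.0 for i in front}
    for k in range(len(fits[front[0]])):
        ordered = sorted(front, key=lambda i: fits[i][k])
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")       # keep boundary points
        span = (fits[ordered[-1]][k] - fits[ordered[0]][k]) or 1.0
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (fits[nxt][k] - fits[prev][k]) / span
    return dist

def select(fits, M):
    """Keep M individuals by non-domination level, then by crowding distance."""
    chosen = []
    for front in nondominated_fronts(fits):
        if len(chosen) + len(front) <= M:
            chosen.extend(front)
        else:
            cd = crowding_distance(front, fits)
            chosen.extend(sorted(front, key=lambda i: -cd[i])[:M - len(chosen)])
            break
    return chosen

print(select([(10, 4), (9, 5), (11, 3), (12, 6), (10, 6)], M=3))
```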
Fig. 1. The overall architecture of our model. In the encoder part, the node encoder takes node features as input to compute node embeddings. The route encoder takes node embeddings as input and computes route embeddings. The vehicle encoder takes vehicle features and route embeddings as input to compute vehicle embeddings. In the decoder part, the node embeddings and vehicle embeddings that contain comprehensive context information are used to compute the context embedding C. The node embeddings and vehicle embeddings are also used to compute the joint combinatorial space H. Then, the context embedding C and the joint combinatorial space H are used to compute the next visited node.
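To make the combination described in the caption (and in (15)-(19)) concrete, a small NumPy sketch that assembles the context embedding C and the joint combinatorial space H from toy embeddings. The dimensions and the exact handling of the compatibility terms of (19) are illustrative assumptions:

```python
import numpy as np

def context_embedding(h_graph, h_vehicles, h_cur, h_depot, h_last_nodes):
    """C = [h_graph; h_fleet; h_cur; h_depot; h_last], cf. (16)-(17)."""
    h_fleet = h_vehicles.mean(axis=0)    # average over the K vehicle embeddings
    h_last = h_last_nodes.mean(axis=0)   # average over the last served node of each route
    return np.concatenate([h_graph, h_fleet, h_cur, h_depot, h_last])

def joint_space(h_cur, h_nodes, W1, W2, W3):
    """H = {g_a(h_cur, h_i)}, cf. (18)-(19), evaluated for all nodes at once."""
    prod = h_cur * h_nodes                            # element-wise product, (N+1, d)
    dot = h_nodes @ h_cur                             # scalar compatibility, (N+1,)
    extra = np.concatenate([prod, dot[:, None]], axis=1)
    return h_nodes @ W1.T + h_cur @ W2.T + extra @ W3.T

rng = np.random.default_rng(0)
d, N, K = 8, 5, 2
h_nodes = rng.normal(size=(N + 1, d))
C = context_embedding(h_nodes.mean(0), rng.normal(size=(K, d)),
                      rng.normal(size=d), h_nodes[0], rng.normal(size=(K, d)))
H = joint_space(rng.normal(size=d), h_nodes, rng.normal(size=(d, d)),
                rng.normal(size=(d, d)), rng.normal(size=(d, d + 1)))
print(C.shape, H.shape)   # (5*d,) and (N+1, d)
```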

Fig. 2. The overview of the hybrid training algorithm with evolutionary learning. (a) Hybrid training algorithm with evolutionary learning. (b) Visualization of the multiobjective based selection (objective space: total travel distance vs. makespan). The non-domination level (NDL) is calculated by recursively selecting non-dominated policies from the policy set. The policies are then sorted into the corresponding non-domination levels. Within the same non-domination level, policies are ranked by crowding distance (CD).

C. Hybrid Training Algorithm with Evolutionary Learning

The hybrid training algorithm consists of two stages. In the first stage, the models of the subproblems are trained by the well-known REINFORCE algorithm [19] as described in Section III-C. The reward of the i-th subproblem is the weighted sum of the two objectives as defined in (2) with weight vector λ_i.

In the second stage, the models trained in the first stage are taken as the initial population. They are then evolved by variation operators. At each generation, each individual model generates an offspring by the proximal mutation [21], which uses the SM-G-SUM mutation operator [22] to add a scaled Gaussian perturbation to the model parameters θ, i.e.,

    θ ← θ + x/ε,   x ∼ N(0, µI),    (20)

where µ is a mutation magnitude hyperparameter. The SM-G-SUM operator [22] uses the gradient of the attention weight of each node in a randomly sampled decoding step over a batch of instances to compute the sensitivity ε, formally defined as follows,

    ε = sqrt( Σ_{k=0}^N ( Σ_{j=1}^B ∇_θ p_θ(v_{t,j} = k | s_{t−1,j}, G_j) )^2 ),    (21)

where N is the number of nodes, B is the batch size of the training instances G, and t is the time step in the decoding process, which is randomly selected from {1, 2, ..., T}. s_{t−1,j} is a partial solution at time step t − 1 for the instance G_j, so it gives a random sample state of G_j.

The parent and offspring models are evaluated with a randomly sampled MO-VRPTW instance. Specifically, each model takes the instance as input and constructs a solution for it. A fitness vector (objective vector) is then computed for the solution, and the model is marked with this fitness vector. After that, a multiobjective based selection is performed: all the models are ranked according to non-domination level and crowding distance of the fitness vectors as in NSGA-II [3], and after ranking, M individuals (policies/models) are reserved. Algorithm 1 shows the procedure of evolutionary learning. The overview of the hybrid training algorithm and the multiobjective based selection are shown in Fig. 2.
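A hedged NumPy sketch of the proximal mutation in (20)-(21): the Gaussian perturbation is divided element-wise by the per-parameter sensitivity, so parameters to which the policy output is most sensitive are perturbed least. The gradient bookkeeping is abstracted into an array argument, and the small constant added to the denominator is our own numerical-safety addition:

```python
import numpy as np

def proximal_mutation(theta, output_grads, mu=0.01, rng=None):
    """
    theta:        (P,) flattened model parameters
    output_grads: (N+1, B, P) gradient of node k's selection probability w.r.t. theta,
                  for each instance j of the batch, at one random decoding step
    Returns theta + x / eps with x ~ N(0, mu * I), cf. (20)-(21).
    """
    rng = rng or np.random.default_rng()
    summed = output_grads.sum(axis=1)                     # sum over the batch, (N+1, P)
    eps = np.sqrt((summed ** 2).sum(axis=0)) + 1e-8       # per-parameter sensitivity, (P,)
    x = rng.normal(scale=np.sqrt(mu), size=theta.shape)   # scaled Gaussian perturbation
    return theta + x / eps

rng = np.random.default_rng(0)
theta = rng.normal(size=100)
grads = rng.normal(size=(6, 4, 100))       # toy gradients: 6 nodes, batch of 4
print(np.abs(proximal_mutation(theta, grads, rng=rng) - theta).mean())
```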
D. The Procedure of MODRL/D-EL

Algorithm 2 shows the procedure of MODRL/D-EL. MODRL/D-EL first decomposes MO-VRPTW into M single objective subproblems by the weighted sum approach. For each subproblem, an attention model with comprehensive context information is trained by reinforcement learning with the parameter-transfer strategy [5]. Then, the trained models are taken as the initial population for evolutionary learning. The evolutionary learning uses the proximal mutation [21] to generate offspring, and selects individuals (policies/models) according to the ranks of non-domination level and crowding distance as in NSGA-II [3].

Algorithm 2 MODRL/D-EL
Input: the model set of subproblems Π = {π_1, π_2, ..., π_M}; weight vectors Λ = {λ_1, λ_2, ..., λ_M}; maximum iteration in evolutionary learning D
Output: optimized model set Π* = {π_1*, π_2*, ..., π_M*}
 1: randomly initialize the parameters of the models in Π
 2: for i = 1, ..., M do
 3:   train policy π_i by DRL as described in Section III-C for the i-th subproblem
 4: end for
 5: take the policy set trained by DRL as the initial population for evolutionary learning
 6: for i = 1, ..., D do
 7:   evolve individuals (policies/models) as described in Algorithm 1
 8: end for

V. EXPERIMENTAL RESULTS

All the experiments are executed on a machine with an Intel Xeon E5-2637 CPU and 16 GB memory. All the related network models are trained on a single 1080Ti GPU.

A. Experiment Settings

1) Data Generation: The vehicle capacity Q is set to 750, 900 and 1000 for problem sizes 50, 80 and 100, respectively. The time window of the depot is set to [a_0 = 0, b_0 = 10], and the service duration w of the customers is set to 0.1 time units. The coordinates of nodes, demands and time windows of customers are defined as follows [20] (a code sketch of this sampling is given at the end of Section V-A):
• Coordinates: the coordinates of the nodes are sampled uniformly from the interval [0, 1].
• Demands: the demand q_i is sampled from a normal distribution q̂_i ∼ N(15, 10), and its absolute value is truncated to an integer between 1 and 42, i.e., q_i' = min(42, max(1, ⌊|q̂_i|⌋)). Then, q_i' is scaled by the vehicle capacity Q, i.e., q_i = q_i'/Q. The vehicle capacity Q is also normalized to 1 in training.
• Time Window: for customer v_i, the interval h_i = [a^sample = ĥ_i, b^sample = b_0 − ĥ_i − w] is defined, where ĥ_i = d_i + 0.01 and d_i is the traveling time from v_i to the depot. Then, the start time a_i is sampled uniformly from h_i, and the end time b_i is calculated as b_i = min(a_i + 3α, b^sample), where α = max(|α̂|, 0.01), α̂ ∼ N(0, 1).

2) Training Details: In deep reinforcement learning, the number of subproblems M is set to 100. The weight vectors for the subproblems are uniformly spread from 0 to 1, i.e., λ_1 = (0, 1), λ_2 = (0.01, 0.99), ..., λ_100 = (0.99, 0.01). The value function in the REINFORCE algorithm [19] is the greedy rollout baseline proposed in [13]. Adam [23] is used as the optimizer, and the learning rate is kept constant at 10^−4. The number of heads in MHA is set to 8. The dimension d_emb of h^node and h^vehicle is set to 128. The hidden dimension in the FF layer is 256. The hidden dimension of g_v and g_l is set to 64.

Based on the neighborhood-based parameter-transfer strategy [5], the first model is trained for 10 epochs, each epoch involving 512,000 instances. The remaining 99 models are trained for only 1 epoch, each containing 256,000 instances. The training batch size in DRL is set to 128.

In evolutionary learning, the hyperparameter µ in the proximal mutation is set to 0.01. The training batch size in evolutionary learning is set to 64. The maximum generation D is set to 30. The training time for our models is about 20 hours.

3) Baselines: Two learning-based algorithms are introduced as competitors to evaluate the performance of MODRL/D-EL. The first one is MODRL/D-AM [14], and the second one is MODRL/D-CCI, which extends MODRL/D-AM by integrating the comprehensive context information. MODRL/D-CCI is compared with MODRL/D-AM to evaluate the contribution of CCI, and MODRL/D-EL is compared with MODRL/D-CCI to evaluate the contribution of evolutionary learning.

MODRL/D-EL is also compared with three classical iterative-based algorithms: NSGA-II, MOEA/D and MOGLS. For NSGA-II and MOEA/D, the number of iterations is set to 20000 and the population size is set to 100. For MOGLS, the size of the temporary population and the number of initial solutions are consistent with the settings in [4]. The local search uses the 2-opt operator and is terminated after 100 iterations. The number of iterations in MOGLS is set to 4000 for a testing time similar to that of NSGA-II.
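Returning to the data generation protocol of Section V-A, a hedged sketch of the instance sampling: the distributions follow the text above, while the code organization and function name are our own:

```python
import math
import random

def sample_instance(n_customers, capacity, b0=10.0, w=0.1):
    depot = (random.random(), random.random())
    nodes = [(random.random(), random.random()) for _ in range(n_customers)]
    demands, windows = [], []
    for c in nodes:
        q = min(42, max(1, int(abs(random.gauss(15, 10)))))
        demands.append(q / capacity)               # demand normalized by vehicle capacity
        d = math.dist(c, depot)                    # travel time to the depot
        lo, hi = d + 0.01, b0 - (d + 0.01) - w     # the interval h_i of the text
        a = random.uniform(lo, hi)                 # start of the time window
        alpha = max(abs(random.gauss(0, 1)), 0.01)
        b = min(a + 3 * alpha, hi)                 # end of the time window
        windows.append((a, b))
    return depot, nodes, demands, windows

depot, nodes, demands, windows = sample_instance(50, capacity=750)
print(len(nodes), min(demands), max(demands))
```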
TABLE II
COMPARATIVE RESULTS (MEAN AND STD) OF HV, |NDS| AND TESTING TIME ON THREE TESTING SETS OBTAINED BY MODRL/D-EL, MODRL/D-CCI AND MODRL/D-AM, RESPECTIVELY. THE HIGHEST VALUES OF HV ARE HIGHLIGHTED IN BOLD.

Each cell lists HV (std), |NDS| (std), Time (s), in the order MODRL/D-EL | MODRL/D-CCI | MODRL/D-AM.

N = 50  | 730.28 (30.48), 14.53 (3.09), 48.09 | 731.24 (30.49), 12.81 (2.69), 47.49 | 728.57 (30.79), 11.72 (2.56), 37.33
N = 80  | 844.45 (37.77), 14.90 (2.96), 68.63 | 843.55 (37.65), 11.96 (2.52), 67.58 | 842.45 (38.29), 12.73 (2.58), 54.37
N = 100 | 1154.76 (49.34), 14.11 (2.83), 83.07 | 1150.94 (49.57), 10.00 (2.29), 81.02 | 1150.25 (50.36), 11.90 (2.44), 68.13

TABLE III
THE HV VALUES, |NDS| AND TESTING TIME ON 11 SOLOMON INSTANCES R201–R211. THE MEAN VALUES AND STANDARD DEVIATION (STD) ARE SHOWN AT THE BOTTOM. TO ELIMINATE THE IMPACT OF RANDOMNESS IN NSGA-II, MOEA/D AND MOGLS, THESE ALGORITHMS ARE RUN 5 TIMES INDEPENDENTLY AND THE MEAN VALUES ARE TAKEN AS FINAL RESULTS. THE HIGHEST VALUES OF HV ARE HIGHLIGHTED IN BOLD.

Each cell lists HV, |NDS|, Time (s), in the order MODRL/D-EL | MODRL/D-CCI | MODRL/D-AM | NSGA-II | MOEA/D | MOGLS.

R201 | 1296.9, 14.0, 63.8 | 1301.3, 7.0, 63.2 | 1309.3, 8.0, 53.4 | 1265.1, 15.0, 202.3 | 1285.6, 8.0, 139.3 | 1236.5, 7.0, 175.2
R202 | 1308.2, 12.0, 60.3 | 1306.0, 8.0, 56.4 | 1306.8, 10.0, 49.9 | 1275.5, 15.0, 205.4 | 1298.1, 9.0, 139.3 | 1266.0, 7.0, 187.8
R203 | 1324.2, 11.0, 59.6 | 1299.9, 9.0, 56.9 | 1210.0, 7.0, 48.3 | 1287.1, 12.0, 205.7 | 1311.9, 15.0, 138.8 | 1294.5, 8.0, 200.3
R204 | 1326.6, 15.0, 59.2 | 1295.6, 8.0, 59.8 | 1197.2, 8.0, 49.0 | 1264.9, 9.0, 200.8 | 1296.2, 7.0, 136.8 | 1302.6, 10.0, 200.2
R205 | 1307.9, 8.0, 60.1 | 1313.2, 10.0, 56.5 | 1319.0, 6.0, 48.6 | 1276.9, 17.0, 206.6 | 1306.4, 18.0, 138.5 | 1271.8, 10.0, 188.4
R206 | 1311.2, 11.0, 59.8 | 1316.6, 8.0, 55.8 | 1303.6, 6.0, 47.9 | 1278.9, 12.0, 202.6 | 1310.9, 10.0, 139.0 | 1291.8, 8.0, 194.7
R207 | 1320.7, 14.0, 62.3 | 1295.6, 8.0, 56.1 | 1199.5, 3.0, 47.8 | 1265.5, 11.0, 203.1 | 1307.7, 15.0, 138.1 | 1301.7, 13.0, 200.9
R208 | 1332.4, 17.0, 59.0 | 1297.9, 9.0, 55.9 | 1227.9, 6.0, 48.1 | 1243.0, 5.0, 202.2 | 1263.9, 7.0, 136.1 | 1292.4, 7.0, 198.4
R209 | 1317.2, 15.0, 59.9 | 1320.8, 11.0, 58.7 | 1323.5, 9.0, 48.5 | 1286.7, 19.0, 203.6 | 1315.3, 16.0, 138.2 | 1294.5, 13.0, 192.0
R210 | 1318.2, 14.0, 59.8 | 1317.6, 14.0, 55.6 | 1322.3, 9.0, 50.6 | 1278.0, 16.0, 204.9 | 1313.1, 18.0, 141.5 | 1293.9, 14.0, 190.2
R211 | 1322.7, 15.0, 59.5 | 1325.6, 12.0, 55.3 | 1337.0, 7.0, 47.3 | 1263.6, 7.0, 203.9 | 1295.8, 8.0, 137.3 | 1304.4, 10.0, 204.5
Mean | 1316.9, 13.3, 60.3 | 1308.2, 9.5, 57.3 | 1277.8, 7.2, 49.0 | 1271.4, 12.5, 203.7 | 1300.4, 11.9, 138.4 | 1286.4, 9.7, 193.9
STD  | 9.7, 2.4, 1.4 | 10.4, 2.0, 2.3 | 53.5, 1.8, 1.7 | 12.1, 4.1, 1.7 | 14.5, 4.3, 1.4 | 19.5, 2.5, 8.0
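For reference, the HV values reported in Tables II and III measure, for this bi-objective minimization problem, the objective-space area dominated by a front with respect to a reference point (larger is better). A hedged 2-D sketch:

```python
def hypervolume_2d(points, ref):
    """Hypervolume of a 2-objective minimization front w.r.t. a reference point."""
    front = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:                  # sweep the front by increasing f1
        if f2 < prev_f2:
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

front = [(20.0, 3.0), (25.0, 2.5), (35.0, 2.0)]   # toy (distance, makespan) points
print(hypervolume_2d(front, ref=(100.0, 10.0)))
```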

B. Results and Discussions

All the attention models used in the learning-based algorithms are trained with instances of size 50. The average performance over a batch of instances evaluates the algorithms more accurately. The learning-based algorithms learn from the training data and can thus efficiently construct solutions for a batch of instances from the same distribution. The iterative-based algorithms do not require the tested instances to follow some distribution; however, the iteration process needs to be re-performed for every run, so a tremendous amount of time is required to solve a batch of instances.

The experiments consist of two parts. For the first part, to compare MODRL/D-EL with the other learning-based algorithms, three testing sets of problem sizes 50, 80 and 100 are sampled from the same distribution as the training data. Each testing set contains 1024 instances. The iterative-based algorithms are not tested in this part, because they would take a tremendous amount of time to solve 1024 instances. For the second part, to compare MODRL/D-EL with the iterative-based algorithms, 11 instances (R201–R211) from the Solomon benchmark [15] are used as testing instances. The hypervolume (HV) and the number of solutions in the non-dominated set (|NDS|) are taken as performance indicators. The reference points for calculating HV are set to (100, 10), (120, 10) and (160, 10) for problem sizes 50, 80 and 100, respectively.

Table II shows the experimental results on the three testing sets. In the comparison between MODRL/D-CCI and MODRL/D-AM, MODRL/D-CCI achieves better performance in terms of HV on all three testing sets, which suggests that the comprehensive context information helps to make more reasonable decisions. In the comparison between MODRL/D-EL and MODRL/D-CCI, MODRL/D-EL achieves better performance in terms of HV on the testing sets of problem sizes 80 and 100. In addition, MODRL/D-EL obtains more non-dominated solutions on all three testing sets. This suggests that evolutionary learning can fine-tune the parameters of the models to improve the quality of the solutions. The testing time of MODRL/D-EL is a little longer than that of MODRL/D-AM, because processing comprehensive context information requires more time than processing simple context information. To further visualize the final solutions found by the different approaches, Fig. 3 shows the approximate Pareto fronts of two instances with different problem sizes. As shown in the figure, MODRL/D-EL has better performance in both convergence and diversity.

Table III reports the experimental results on the 11 benchmark instances. In the comparison between MODRL/D-CCI and MODRL/D-AM, MODRL/D-CCI achieves better mean values in terms of HV and |NDS|, and MODRL/D-CCI has a much more stable performance. In the comparison between MODRL/D-EL and MODRL/D-CCI, MODRL/D-EL achieves a better mean value in terms of HV and obtains more non-dominated solutions in 10 out of 11 instances. In the comparison between MODRL/D-EL and the iterative-based algorithms, MODRL/D-EL is much faster in solving all the instances and gets better mean values. Besides, even though MODRL/D-EL is trained with instances of problem size 50, it still performs better than the iterative-based algorithms on the benchmark instances of problem size 100, which suggests that MODRL/D-EL has a good generalization ability to tackle the problem with different sizes.

VI. CONCLUSION

This paper proposes a multiobjective deep reinforcement learning and evolutionary learning algorithm for MOPs with complex constraints. In the proposed algorithm, the attention model constructs solutions by using more comprehensive context information of the MOP, which is helpful for making reasonable decisions. Evolutionary learning is employed to fine-tune the parameters of the models to achieve better performance. The proposed algorithm is compared with other learning-based and iterative-based algorithms for MO-VRPTW. The experimental results suggest that the proposed algorithm has better performance in both convergence and diversity.
Fig. 3. The PFs obtained by MODRL/D-EL, MODRL/D-CCI and MODRL/D-AM for random instances with (a) 50 customer nodes, (b) 100 customer nodes. (Axes: total travel distance vs. makespan.)

With respect to future studies, the evolutionary learning in this paper only considers a mutation operator; a crossover operator may generate offspring models that construct solutions with higher diversity and convergence. Besides, the MO-VRPTW instances in this paper are symmetric, and how to efficiently deal with asymmetric instances is also worth further study.

REFERENCES

[1] Q. Zhang and H. Li, "MOEA/D: A multiobjective evolutionary algorithm based on decomposition," IEEE Transactions on Evolutionary Computation, vol. 11, no. 6, pp. 712–731, 2007.
[2] A. Zhou, B.-Y. Qu, H. Li, S.-Z. Zhao, P. N. Suganthan, and Q. Zhang, "Multiobjective evolutionary algorithms: A survey of the state of the art," Swarm and Evolutionary Computation, vol. 1, no. 1, pp. 32–49, 2011.
[3] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[4] A. Jaszkiewicz, "Genetic local search for multi-objective combinatorial optimization," European Journal of Operational Research, vol. 137, no. 1, pp. 50–71, 2002.
[5] K. Li, T. Zhang, and R. Wang, "Deep reinforcement learning for multi-objective optimization," IEEE Transactions on Cybernetics, pp. 1–12, 2020.
[6] Y. Bengio, A. Lodi, and A. Prouvost, "Machine learning for combinatorial optimization: A methodological tour d'horizon," European Journal of Operational Research, 2020 (in press).
[7] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev, "Reinforcement learning for combinatorial optimization: A survey," arXiv preprint arXiv:2003.03600, 2020.
[8] N. Vesselinova, R. Steinert, D. F. Perez-Ramirez, and M. Boman, "Learning combinatorial optimization on graphs: A survey with applications to networking," IEEE Access, vol. 8, pp. 120388–120416, 2020.
[9] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," in ICLR Workshop, 2017.
[10] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," in NeurIPS, 2017, pp. 6351–6361.
[11] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác, "Reinforcement learning for solving the vehicle routing problem," in NeurIPS, 2018, pp. 9839–9849.
[12] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau, "Learning heuristics for the TSP by policy gradient," in International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Springer, 2018, pp. 170–181.
[13] W. Kool, H. van Hoof, and M. Welling, "Attention, learn to solve routing problems!" in ICLR, 2019.
[14] H. Wu, J. Wang, and Z. Zhang, "MODRL/D-AM: Multiobjective deep reinforcement learning algorithm using decomposition and attention model for multiobjective optimization," CCIS, vol. 1205, pp. 575–589, 2020.
[15] J. Castro-Gutierrez, D. Landa-Silva, and J. M. Perez, "Nature of real-world multi-objective vehicle routing with evolutionary algorithms," in IEEE International Conference on Systems, Man, and Cybernetics, 2011, pp. 257–264.
[16] K. Miettinen, Nonlinear Multiobjective Optimization. Springer Science & Business Media, 2012, vol. 12.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015, pp. 448–456.
[19] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[20] J. K. Falkner and L. Schmidt-Thieme, "Learning to solve vehicle routing problems with time windows through joint attention," arXiv preprint arXiv:2006.09100, 2020.
[21] C. Bodnar, B. Day, and P. Lió, "Proximal distilled evolutionary reinforcement learning," in AAAI, vol. 34, no. 4, 2020, pp. 3283–3290.
[22] J. Lehman, J. Chen, J. Clune, and K. Stanley, "Safe mutations for deep and recurrent neural networks through output gradients," in GECCO, 2018, pp. 117–124.
[23] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
