Deep Reinforcement Learning For Dynamic Scheduling of A Flexible Job Shop

International Journal of Production Research
ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/tprs20
Deep reinforcement learning for dynamic

scheduling of a flexible job shop
Renke Liu, Rajesh Piplani & Carlos Toro
To cite this article: Renke Liu, Rajesh Piplani & Carlos Toro (2022) Deep reinforcement learning
for dynamic scheduling of a flexible job shop, International Journal of Production Research,
60:13, 4049-4069, DOI: 10.1080/00207543.2022.2058432
To link to this article: https://doi.org/10.1080/00207543.2022.2058432
View supplementary material
Published online: 11 Apr 2022.
Submit your article to this journal
Article views: 7383
View related articles
View Crossmark data
Citing articles: 16 View citing articles
Full Terms & Conditions of access and use can be found at

https://www.tandfonline.com/action/journalInformation?journalCode=tprs20
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH
2022, VOL. 60, NO. 13, 4049–4069
https://doi.org/10.1080/00207543.2022.2058432
Deep reinforcement learning for dynamic scheduling of a flexible job shop

Renke Liu a , Rajesh Piplania and Carlos Torob
a School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore, Singapore; b Vicomtech Research Centre, San
Sebastian, Spain
ABSTRACT ARTICLE HISTORY

The ability to handle unpredictable dynamic events is becoming more important in pursuing agile Received 30 March 2021
and flexible production scheduling. At the same time, the cyber-physical convergence in production Accepted 20 March 2022
system creates massive amounts of industrial data that needs to be mined and analysed in real-time. KEYWORDS
To facilitate such real-time control, this research proposes a hierarchical and distributed architecture Dynamic scheduling;
to solve the dynamic flexible job shop scheduling problem. Double Deep Q-Network algorithm is distributed multi-agent
used to train the scheduling agents, to capture the relationship between production information and systems; flexible job shop;
scheduling objectives, and make real-time scheduling decisions for a flexible job shop with constant hierarchical scheduling; deep
job arrivals. Specialised state and action representations are proposed to handle the variable speci- reinforcement learning
fication of the problem in dynamic scheduling. Additionally, a surrogate reward-shaping technique
to improve learning efficiency and scheduling effectiveness is developed. A simulation study is car-
ried out to validate the performance of the proposed approach under different scenarios. Numerical
results show that not only does the proposed approach deliver superior performance as compared to
existing scheduling strategies, its advantages persist even if the manufacturing system configuration
changes.
1. Introduction
high-volume data analytics, which exceeds the capability
Unpredictable real-time disruptions in manufacturing of traditional shop floor management systems or human
systems lead to changes in effectiveness of scheduled expertise and creates a need for automatic knowledge
plans or activities; even minor disruptions can add up creation and self-adaptive control.
to make the pre-developed schedule suboptimal, even The manufacturing industry paradigm is shifting
infeasible. Typical dynamic events on shop floor such as from mass production to mass customisation, requir-
machine breakdowns and operator unavailability must ing the ability to manage production with diversified
be attended to quickly to maintain high production effi- product configurations and trajectories of operations.
ciency. In addition to these internal disruptions, compa- The scheduling problem under these circumstances is
nies now give customers high flexibility in placing and referred to as the job shop problem (JSP), which aims
revising their orders to gain advantages in a customer- at finding the optimal sequence of jobs to be processed
oriented and highly competitive environment, making at different machines. Extra complexity arises with the
the arrival and cancellation of orders more frequent than deployment of ‘flexible’ machines: scheduling decisions
ever. now consist of machine selection (routing) and job selec-
Accelerating cyber-physical convergence on the shop tion (sequencing); the resulting scheduling problem is
floor is revolutionising the availability and diversity of referred to as the flexible job shop problem (FJSP).
industrial data. The large-scale deployment of connected In this research, we propose a distributed and hierar-
devices significantly improves the accessibility and com- chical approach to solve dynamic FJSP (DFJSP), realised
plexity of real-time industrial data (Messaoud et al. by Deep Reinforcement Learning (DRL). The objective
2020). Besides, advanced computing architectures such is to design an architecture in which intelligent agents
as cloud computing and edge computing are developed to learn the mapping from shop floor information to sound
enhance the capability for and reduce the latency of on- scheduling decisions through reward-driven training,
site data analytics (Chen and Ran 2019). The constantly and then perform real-time control to minimise tardiness
changing manufacturing system requires real-time and in an environment with constant job arrivals.
CONTACT Renke Liu renke001@e.ntu.edu sg 50 Nanyang Ave, Block N3, Nanyang Ave, Singapore 639798, Singapore
Supplemental data for this article can be accessed here. https://doi.org/10.1080/00207543.2022.2058432
© 2022 Informa UK Limited, trading as Taylor & Francis Group
4050 R. LIU ET AL.
The remainder of the paper is organised as follows. scheduling either does not or only includes a simple
Section 2 reviews the development and latest research in rescheduling process as the impact of infrequent machine
dynamic scheduling and the applications of Reinforce- breakdowns can be effectively minimised by incorporat-
ment Learning (RL) to scheduling. Section 3 defines the ing robustness metrics in schedule development.
research problem and objective. Section 4 presents an When rescheduling is needed, the developed pro-
approach for solving the DFJSP. Section 5 then presents duction schedule can be repaired by heuristics or
the results of numerical experiments and the compari- replaced by priority rules. Such a manner is referred
son with benchmark solutions. Finally, Section 6 high- to as the predictive-reactive approach, which maintains
lights our contribution and discusses future directions for the highest-possible efficiency until the occurrence of
research. events and delivers feasible and timely scheduling deci-
sions afterwards. Recent studies also resort to hybrid
approaches to reduce the decisional latency of meta-
2. Literature review heuristics, thus enabling their ability to handle fre-
quent disruptions such as job arrival and cancellation.
2.1. Traditional approaches
Examples include the combination of tabu search and
Static scheduling research has created a large fam- genetic algorithm (Li and Gao 2020); combination of
ily of techniques that are transferable to the dynamic particle swarm optimisation and genetic algorithm (Ren
environment: the most popular techniques are priority et al. 2021).
rules and metaheuristics. Due to the difference in tim-
ing and frequency of engagement, their applications to
2.2. Recent data-driven approaches
dynamic scheduling problems can be categorised as com-
pletely reactive, robust-proactive and predictive-reactive Traditional approaches face the dilemma of efficiency
approaches (Ouelhadj and Petrovic 2008). versus quality. The high computational cost of meta-
Priority rules build a schedule in a reactive and con- heuristics compromises their utility in an environment
structive manner, and are widely applied due to their ease with frequent dynamic events, where the effectiveness
of implementation. The performance and comparison of of optimal schedule may deteriorate quickly. On the
priority rules under different scenarios have been stud- other hand, rule-based approaches produce near real-
ied for a long time, and the consensus is that no single time solutions at the cost of quality, due to the limited
priority rule provides strictly stronger performance along type of information and simple mathematical operation
all objectives in all scenarios (Sels, Gheysen, and Van- involved in decision-making.
houcke 2012; Xiong et al. 2017; Ðurasević and Jakobović Recent research aims at developing scheduling strate-
2018). It is also widely accepted that composite priority gies that use comprehensive shop floor information to
rules’ performance is usually better than their building make high-quality scheduling decisions in near real-
blocks (Sels, Gheysen, and Vanhoucke 2012). time, by shifting the time-consuming development pro-
Schedules can also be produced in advance in consid- cess offline and applying the developed strategy to
eration of dynamic events, metaheuristics such as Tabu online decision-making. Some candidate approaches are
Search, Simulated Annealing, Genetic Algorithm, Parti- Genetic Programming (GP), supervised learning, and
cle Swarm Optimisation, and Ant Colony Optimisation RL.
are popular in developing robust-proactive schedules GP can be used for developing priority rules. Com-
(Ouelhadj and Petrovic 2008). In addition to commonly pared to traditional manually-designed priority rules,
used performance objectives such as flow time, make which utilise a limited range of job or machine infor-
span and tardiness, robustness must be included as an mation, GP provides a way to include a variety of real-
extra performance metric to evaluate a schedule’s abil- time information in decision-making; the development
ity to absorb disruptions (Liu et al. 2017; Zhang, Song, of rules is automated through iterative, population-based,
and Wu 2020). The development of a robust schedule and randomised evolution.
starts from a randomised solution or a population of solu- Early studies simplify the FJSP to JSP by coupling
tions; the quality of schedule is improved through the the sequencing rule with a universal routing rule (Ho,
time-consuming process of iterative evaluation (usually Tay, and Lai 2007; Pickardt et al. 2010). Recently, GP
by simulation of production with dynamic events) and is also used for coevolving routing and sequencing
reproduction. The resultant high cost of schedule devel- rules. Yska, Mei, and Zhang (2018) propose the Coop-
opment compromises metaheuristics’ ability to handle erative Coevolution GP (CCGP) framework to evolve
frequent dynamic events. Most studies based on meta- routing and sequencing rules in two separate popu-
heuristic techniques focus on machine breakdowns; the lations. Zhang, Mei, and Zhang (2018) propose the
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 4051
multi-tree GP (MTGP) architecture to exploit the inter- In addition, dynamic scheduling problem can be for-
action between routing and sequencing rules. There is mulated as a discrete-time stochastic control process that
also research that aims to improve the performance by conforms to the definition of Markov Decision Process
feature selection (Zhang, Mei, and Zhang 2019; Zhang, (MDP) (Zhang, Xie, and Rose 2017). More specifically, a
Mei et al. 2020). GP-based approaches do improve the shop that operates under a reactive policy taken by dis-
effectiveness of developed rules by incorporating a wider tributed scheduling agents can be modelled as a decen-
range of features/information as input; however, their tralised Markov decision process (DEC-MDP) (Bern-
decision-making still relies on simple computations and stein et al. 2002). The Markovian nature of dynamic
logical operations. scheduling problem makes RL, a tool for sequential deci-
Alternatives are techniques with complex parametri- sion making in MDP, suitable for solving the scheduling
sation, such as supervised learning algorithms. The problem.
key assumption in their application to scheduling is Early studies use tabular RL algorithms to perform
that optimal scheduling knowledge can be learned by system-wide scheduling (Wang and Usher 2005; Aissani,
parametrised algorithms such as Artificial Neural Net- Beldjilali, and Trentesaux 2008), in which the state space
work (ANN), Support Vector Machine (SVM), and Deci- contains the abstracted system information; with action
sion Tree. Supervised learning algorithms improve the being the selection among a set of priority rules to be
quality of solution by analytical optimisation or gradient- applied to all machines in the system. It is also feasi-
based approach, leading to faster convergence to a satis- ble to break the scheduling problem into several sub-
factory solution compared to evolutionary approaches. problems with lower complexity, and tackle each with an
ANN is primarily used for priority rule selec- independent RL agent (Wang et al. 2020).
tion. Mouelhi-Chibani and Pierreval (2010) propose a DRL algorithms have been incorporated into the
dynamic rule-selection mechanism that is triggered by system-wide scheduling architecture for better ability to
real-time events to solve the dynamic flow shop schedul- handle high-dimensional input space, such as the use of
ing problem; Guh, Shiue, and Tseng (2011) propose DQN (Lin et al. 2019), DDPG (Liu, Chang, and Tseng
an adaptive rule-selection mechanism that periodically 2020) and PPO (Park et al. 2021) to solve JSP.
assigns different priority rules to various work centres in RL algorithms attached to each machine or work cen-
a flexible manufacturing system. tre can also be distributed decision-makers. Gabel and
There is also research aimed at finding the optimal Riedmiller (2012) describe the JSP as a decentralised
sequence of jobs. Weckman, Ganduri, and Koonce (2008) Markov decision process with changing action sets and
break the JSP into a series of job-oriented classification partially ordered transition dependencies and propose a
problems and use optimal schedules generated by GA to policy-gradient approach to solve it. A similar approach
train the ANN. Zang et al. (2019) also divide JSP into can be found in (Lang et al. 2020), where two types of
several sub-problems to be solved by a hybrid network scheduling agents trained by DQN perform job sequenc-
with parallel fully-connected and convolutional layers. ing and machine selection, independently, to manage a
Gupta, Majumder, and Laha (2019) propose an approach flexible job shop.
to find the relative position of jobs in an optimal sched- Approaches developed to solve fixed-size scheduling
ule instead of developing the optimal sequence directly. problems are hard to generalise to different classes of
Other supervised learning applications include the use of tasks. There are also very few dynamic scheduling stud-
Decision Trees to learn the strategy from optimal sched- ies that do not constrain the specification of problem
ules (Olafsson and Li 2010; Jun, Lee, and Chun 2019), to approximate a practical environment. Wang (2020)
and SVM as the adaptive rule-selector (Shiue 2009; Priore proposes a decentralised Q-learning approach to select
et al. 2010). scheduling rules, to tackle JSP with dynamic job arrivals.
Luo (2020) proposes a centralised double DQN approach
2.3. Reinforcement learning-based scheduling
to solve FJSP with dynamic job arrivals, in which the
Compared to other supervised learning algorithms pri- action is semi-integrated so that agents select coupled
marily developed for classification tasks, DRL algorithms sequencing and machine selection rules.
can build a direct mapping from the observation of envi- Some studies aim at scheduling in a specific context,
ronment to action; they also have the ability to handle thus having specialised state and action representations.
complex data due to the utilisation of artificial neural Waschneck et al. (2018) propose a DQN approach to
network (ANN) as value or policy representation. These schedule the movement of lots between positions in a
two features made DRL a promising candidate approach semiconductor manufacturing facility. Qu, Wang, and
for production control in a data-intensive manufacturing Jasperneite (2019) propose a decentralised actor-critic
system. approach to allocate jobs to servers. Kuhnle, Röhrig, and
4052 R. LIU ET AL.
Lanza (2019) develop an autonomous order dispatching and unique sequence of operations. In a hierarchical and
system based on TRPO in a semiconductor production distributed scheduling system, scheduling decisions are
facility. Shi et al. (2020) propose a DQN approach to made by routing and sequencing agents attached to each
transfer jobs in a discrete manufacturing system. Hubbs work centre and machine, respectively; the framework is
et al. (2020) propose a deep actor-critic approach for presented in Figure 1.
scheduling in a chemical manufacturing system with Routing Agent (RA) and Sequencing Agent (SA) make
non-deterministic processing times; Malus, Kozjek, and the scheduling decision by using priority rules or algo-
Vrabič (2020) propose a multi-agent approach to man- rithms such as DRL. The following assumptions are pro-
age autonomous mobile robots on a shop floor. Yang and posed: (1) a machine can process only one operation at a
Xu (2021) propose an A2C approach to manage a recon- time; (2) preemption is prohibited; (3) no rework is mod-
figurable production system with dynamic job arrival. elled; (4) no yield losses are considered; (5) buffer size is
Luo et al. (2021) propose a PPO approach to solve the unlimited and not explicitly modelled; (6) machine setup
dynamic work shop problem constrained by multiple and job travel time are ignored.
resources such as machine, tool and worker.
Jobs can also be the decision-making entity. Bouazza,
Sallez, and Beldjilali (2017) propose a Q-learning
approach that lets intelligent products choose a machine 3.1. Notations
selection and a job sequencing rule for each work centre
and machine they visit. Baer et al. (2019) model the flex- (1) Jobs: J = {Ji : i = 1, . . . , n} is the set of n jobs.
ible manufacturing system as a Petri net (Zhou 2012), in (2) Machines: M = {Mk : k = 1, . . . , m} is the set of m
which the DQN agents attached to individual products machines.
decide the route to different machines. (3) Work centres: W = {Wl : l = 1, . . . , w} is the set of
DRL-based dynamic scheduling is still a novel research w work centres, each work centre contains several
area with very few publications. This research adopts the machines, Wl = {Ma , . . . , Mz }; a machine can only
most general context – flexible job shop for better trans- belong to one work centre.
ferability and comparability. A formal definition of the (4) Sequence and operations: job Ji requires ni opera-
problem is presented in next section. tions OJi = [O1i , . . . , Oni i ] to be served by machines
by an operation sequence Sqci = [Wi1 , . . . , Wini ].
j j
Wi denotes the work centre where the operation Oi
3. Problem formulation needs to be processed. Completion rate Pi is the ratio
A flexible job shop contains several work centres, each of number of processed operations to ni .
of which may contain multiple machines. All jobs must (5) Processing time: ti,k denotes the processing time of Ji
j
visit each work centre once following a predetermined on machine Mk . ti denotes the expected processing
Figure 1. Hierarchical and distributed scheduling framework.

j Validation scenario and objective

time of Oi , and is calculated as
j 1 There are three factors that affect the production per-
ti = j
ti,k (1) formance. (1) Arrival rate of jobs: a higher arrival rate
|Wi | j
Mk ∈Wi leads to higher utilisation rate of machines and higher
congestion in the system. (2) Heterogeneity of jobs and
j machines: jobs are different with regards to their process-
PTJi = {ti : j = 1, . . . , ni } is the set of expected process-
ing time of all operations of Ji . ing time on different machines and due date tightness.
Additionally, their remaining work, slack time, and time
(6) Queue and available time of machine: J k = till due also vary over time once they start their process-
{Ja , . . . , Jz } is the set of jobs that are currently ing sequence. Machines also vary with regards to their
in queue to be processed or being processed on workload and availability. (3) Due date tightness: relaxed
machine Mk , COk denotes the time since the start TTD of a job results in lower total and average tardiness
of current operation on Mk ; AMk , the available time as jobs have more slack time (on average) to consume in
of Mk and AWl , the average available time of a work case of congestion. In addition, if the criticality of jobs
centre Wl can be calculated as in the system varies, the scheduler can take advantage of
some jobs’ high slack time to protect more critical jobs.
AMk = ti,k − COk (2) Based on these observations, we introduce three sce-
Ji ∈J k narios for training and validation:
1
AWl = AMk (3) (1) Expected arrival rate/ utilisation rate: we adjust
|Wl | M ∈W
k l the job arrival rate to match an expected utilisa-
(7) Time-till-due (TTD), remaining processing time tion rate of the system. Let E(t) be the expected
and slack time: NOW denotes the current time in processing time of all operations on all machines
system; Di denotes the due date of job Ji ; TTD of a and E(interval) denotes the expected time interval
job is the difference between its due date and NOW. between job arrivals. With the assumption that m
Subsequently, the slack time Si of a job is its TTDi machines are evenly distributed across all w work
less Ri , which is the sum of expected processing centres, the expected utilisation rate of the system
time of all remaining operations. These values are can be calculated as
calculated as
TTDi = Di − NOW (4) E(t) ÷ mw
E(utilization rate) = × 100%

ni
j E(interval)
Ri = ti (5)
E(t) × w
j=x = × 100%
E(interval) × m
Si = TTDi − Ri (6) (11)
where x is the index of Ji ’s imminent operation.
After the expected utilisation rate is specified, the
The value of TTDi , Ri and Si change as production
expected time between arrivals can be computed. Let us
advances.
assume that random variable X denotes the time interval
(8) Measurement of performance: Ci is the completion
between arrivals of jobs follows exponential distribution:
time of Ji , the tardiness Ti of a job is calculated as
X ∼ Exp(β), β = E(interval).
Ti = max(Ci − Di , 0) (7) It is worth mentioning that the expected utilisation
rate is calculated based on the assumption of constant
The total tardiness Tsum , maximum tardiness Tmax , and and smooth arrival, and identical jobs. In practice, some
tardy rate are calculated as idle capacity is inevitable due to stochasticity in inter-
arrival time, sequence of operations, and processing time

n
Tsum = Ti (8) on machines. The expected utilisation rate is assumed to
i=1 be 90% in this research to approximate the production in
a busy factory.
Tmax = max({Ti : i = 1, . . . , n}) (9)
|{Ji |Ti > 0}| (2) Heterogeneity of processing time: The processing
Tardy rate = (10)
n time ti,k is drawn from a uniform distribution: ti,k ∼
4054 R. LIU ET AL.
U[Low, High]. We consider two scenarios with iden- to develop a policy that maximises the cumulative reward
tical average processing time: (1) High heterogene- Gt (usually discounted).
ity scenario: ti,k ∼ U[5, 25], where the maximum DQN (Mnih et al. 2015) architecture improves
time is five times the minimum; and (2) Low hetero- the performance of neural-approximated RL in high-
geneity scenario: ti,k ∼ U[10, 20], where the maxi- dimensional state space. The introduction of experience-
mum time is only twice the minimum. replay mechanism removes the correlation between
(3) Due date tightness: upon creation of job Ji , it sequence of observations. The Q-network learns from a
is assigned an original TTD proportional to its minibatch of experience et , where et = (st , at , rt , st+1 ) is
expected total processing time. The original TTD, the record of a transition in MDP. In addition, Q-network
original slack time and due date are calculated as contains action network and target network; the former
is used for producing the action and the latter to gen-
erate a stable target value for parameter optimisation.
The parameter of action network (θ A ) is updated at each
original

ni
j
TTDi = αi × ti , αi ∼ U[1, High] (12) training iteration, while the parameter of target network
j=1 (θ T ) synchronises with θ A at low frequency to stabilise
the learning. The parametrised target value yi for learning

ni
original
Si = (αi − 1) ×
j
ti (13) at ith iteration is calculated as
j=1
yi = rt + γ maxa Q(st+1 , a|θ T ) (15)
original
Di = NOW + TTDi (14)
The agent can be trained by minimising the loss L(θiA )
The ratio αi denotes the due date tightness of job Ji . at ith iteration. For example, the least square error is
We consider two scenarios of tightness: (1) High due date calculated as
tightness: αi ∼ U[1, 2], where the average original slack 2 2
L(θiA ) = (yi − Q(st , at |θiA )) (16)
time of jobs is half their total processing time; and (2) N
Low due date tightness: αi ∼ U[1, 3], where the average
Based on DQN, Double DQN (Van Hasselt, Guez, and
original slack time is equal to the job’s total processing
Silver 2016) replaces the max operation that is used to
time.
generate the estimation of state value by following the
Four scenario combinations are used: (1) High het-
action that the action network rather than target net-
erogeneity and high tightness: HH([5, 25]/[1, 2]); (2)
work would take in the next state. To decouple the action
High heterogeneity and low tightness: HL([5, 25]/[1, 3]);
selection and policy evaluation and eventually reduce
(3) Low heterogeneity and high tightness: LH([10, 20]/
over-estimation, the calculation of target value can be
[1, 2]); and (4) Low heterogeneity and low tightness:
done as
LL([10, 20]/[1, 3]).
Minimising cumulative tardiness of all jobs is the pri- yi = rt + γ Q(st+1 , argmaxa Q(st+1 , a|θiA )|θ T ) (17)
mary objective in this research; additionally, the tardy
rate and maximum tardiness are also presented in the Double DQN is used as the learner of the two types
appendix to provide a comprehensive evaluation of the of agents in the system: (1) Routing agent attached to
approach. each work centre which performs the machine selection
for jobs upon their arrival, corresponding to the blue
dot in Figure 1, and (2) Sequencing agent attached to
4. Proposed approach each machine, selects a job to process when machine
becomes idle and finds more than one job in the queue,
4.1. DRL preliminaries
corresponding to the orange box in Figure 1.
RL is one of the most powerful tools for solving sequen-
tial decision-making problem (van Otterlo and Wiering
4.2. State representation
2012). It can be modelled as an MDP with 5-elements
tuple representation: (S, A, P, γ , R). In MDP, an intelli- As a component of DRL algorithm, the neural network
gent agent interacts with the environment according to a with static architecture does not facilitate adjustment of
specific policy π; At each time step t, the agent observes the size of input/ output space conditioned on a vari-
the current state st ∈ S and takes an action at ∈ A accord- able number of queuing jobs. Many studies simply avoid
ing to policy π(S, A); A reward rt ∈ R is given after the this problem by setting the research scope to a fixed size
agent arrives at a new state st+1 . The goal of RL agent is problem with a known number of jobs. This, however,
Table 1. State features for routing agent. infeasible, hence we select four sequencing rules as the
Category Feature Expression building block of the action space to realise indirect job
1. Information of machine Available time∗ {AMk |Mk ∈ Wl } ⎫
⎧
selection:
⎨ ⎬
Sum of processing times∗ ti,k |Mk ∈ Wl
⎩ ⎭ (1) Shortest Processing Time (SPT): select the job with
Ji ∈Jk
2. Information of job to be Processing time∗ {ti,k |Mk ∈ Wl } shortest processing time of imminent operation.
dispatched (Ji ) (2) Work in Queue (WINQ): select the job with the
Slack time Si
3. Information of arriving job Time till imminent arrival min(AT l )
smallest sum of processing time of queuing jobs in
Slack time min{Si |Ji ∈ AJl } its succeeding stage of production. WINQ rule tends
to even the distribution of jobs in system.
(3) Critical Ratio (CR): select the job with smallest ratio
lacks practicality in a dynamic environment and cannot of TTD to remaining processing time. A disadvan-
be used in this research which focuses on a manufac- tage of CR rule is its inconsistency. When available
turing system with a large and unknown number of job jobs are not yet tardy, CR rule favours jobs with
arrivals spread over time. smaller TTD and longer remaining processing time.
Instead, we use abstracted information instead of job- On the other hand, if available jobs are already tardy,
specific features to build the state space. The number of CR rule prioritises the job with highest overdue time
jobs only affects the computation rather than the dimen- (smallest TTD) and shortest remaining processing
sion of resulting abstracted data. Some related notation is time.
introduced next: (1) J NOW = {Ji |Ri > 0} is the set of jobs (4) Minimum Slack (MS): select the job with smallest
that are currently in the system, and can be used to mea- slack time. This rule behaves like CR rule if jobs are
sure the congestion in the system; (2) AJ l is the set of jobs not yet overdue, favouring jobs with shorter TTD
that are currently being processed by other machines and and longer remaining processing time (shorter slack
will arrive at the work centre Wl , the set of the arrival time time as a result) and remains consistent when the job
of these jobs is AT l ; and (3) for jobs in J k , the set of their becomes tardy.
succeeding work centre is denoted as SW k .
RA’s state space consists of three categories of features:
(1) the information of machines within the work centre; 4.4. Surrogate reward-shaping
(2) the information of the job to be dispatched; and (3)
information of the job that is about to arrive. Details of With the minimisation of tardiness as the scheduling
state features are given in Table 1; three types of features objective, a natural reward mechanism would be to use
(tagged with ‘∗’) are collected for each machine in the the realised tardiness of job as the joint reward signal for
work centre. For RA that controls work centre Wl , the all RAs and SAs that the job has gone through. How-
size of its state space equals |Wl | × 3 + 3. ever, studies of multi-agent RL suggest that even though
Arrival and departure of jobs constantly change the agents can learn from the joint reward to achieve a joint
queue size for the SA. Thus collecting certain categories goal, they may receive spurious reward signals that orig-
of information for all jobs results in a state space with inate from their teammates’ behaviour (Sunehag et al.
variable dimension. In this work, we use abstracted and 2017); this challenge is referred to as the multi-agent
channelled data to summarise the information of queu- credit assignment problem. Instead, a surrogate approach
ing jobs and system to create a stable state space. that rewards agents for their contribution to the queuing
Twenty-five features are divided into six channels of jobs is developed.
according to their type and magnitude; their details are An example production history of a job with three
given in Table 2. operations is shown in Figure 2. Job arrives at the shop
floor with an initial duration of TTD that is larger than
the sum of its operation times, which leaves some slack
4.3. Action representation time as the queuing buffer. The slack time is gradually
Available action for RA simply corresponds to the selec- consumed as the job queues for processing; tardiness
tion of machines within the work centre. For SA, how- results if a job consumes all the slack time before its
ever, the changing queue makes the direct selection of job processing is completed.
4056 R. LIU ET AL.
Table 2. State features for sequencing agent.

Channel Feature Calculation
1. Number of jobs Jobs in system |JNOW |
Jobs in queue |Jk |
|AJl |
Expected number of arriving job |W l |

2. Processing time of imminent operation of queuing Sum PT k
jobs (PT k ): PT k = {ti,k |Ji ∈ Jk } 1 k
Mean PT
|Jk |
Min min(PT ) k

3.1 Remaining processing time of queuing jobs Total remaining processing time RPT k
(RPT k ): RPT k = {Ri |Ji ∈ Jk }
1
3.2 Available time of succeeding work centres Average remaining processing time RPT k
(ASW k ): ASW k = {AWl |Wl ∈ SW k } |Jk |
Maximum remaining processing time max(RPT k )
ASW k
Average available time of succeeding work centre
|SW k |
Minimum available time of succeeding work centre min(ASW k )
1 k
4.1 Time till due: X k = {TTDi |Ji ∈ Jk } Mean of TTD X
Min of TTD |Jk | k
min(X )
4.2 Slack time: Y k = {Si |Ji ∈ Jk } 1 k
Mean of slack time Y
|Jk |
Min of slack time min(Y k )
Min of slack time of arriving job min{Si |Ji ∈ AJl }
AMk
5. System-level information Available time share
My ∈M AMy
1
Completion rate Pi
|JNOW | NOW
Ji ∈J
{Ji |Ji ∈JNOW ∧ TTDi <0}
Realised tardy rate |JNOW |
{Ji |Ji ∈JNOW ∧ Si <0}
Expected tardy rate |JNOW |
SPT k
6. Heterogeneity, represented by the coefficient of CV of processing time k
variation (CV) SPT
RPT k
CV of remaining processing time
RPT k
SX k
CV of TTD
Xk
SY k
CV of slack time
Yk
SASW k
CV of available time of succeeding work centres
ASW k
Figure 2. History of production.
The prevention and reduction of tardiness are equiv- Firstly, the reward function needs to balance con-
alent to the conservation of slack time at each stage of flicting objectives as pursuing different tardiness-related
production. However, rewarding interim target is not objectives can lead to an undesirable outcome. For
always advisable as the agent might learn a strategy that instance, when the system is overwhelmed by a large
maximises the rewards and ignores the actual target quantity of newly arrived jobs with tight due dates, the
(Kuhnle, Schafer et al. 2019). Several aspects must be SPT rule significantly reduces the sum of tardiness and
considered to ensure training’s success. proportion of tardy jobs as compared to other rules but
at the cost of maximum tardiness, as it ‘sacrifices’ jobs the job’s criticality factor.
with high processing time and instead prioritises jobs
with shorter time consumption regardless of their crit- rt = Fi × (St+1
i − E(St+1
i )) (21)
icality. An unbiased reward must be given to improve Surrogate reward function for SA: SA needs to pri-
the performance along primary objective, which in this oritise the queuing jobs and select the one that minimises
research, is cumulative tardiness. the long-term cumulative tardiness. The reward can be
Secondly, the reward must align with system-level interpreted from a queuing exposure perspective: when
performance. To avoid the credit-assignment problem, SA of machine Mk and its queue J k , selects a job Js , it ends
RA and SA are trained to make decisions based on the wait of the selected job at the cost of extending queue-
mostly local information; the reward is also calculated ing time of other jobs, while at the same time expos-
with respect to individual operations instead of eventual ing the selected job to queuing at its succeeding work
tardiness. The key to training success is to ensure the centre Ws .
consistency between the reward and overall perfor- We term the saved or prolonged queueing time as
mance, i.e. improve system-level performance by pursu- the gain or loss of slack time. For the selected job, the
ing short-term rewards. slack time gain/ loss is defined as the average processing
In addition, it is very difficult to preserve the slack time time of other jobs adjusted by the criticality factor of Js ,
of all jobs, especially in a congested system, or when the minus the adjusted available time of the succeeding work
due dates are tight. Considering this, it is more reasonable centre:
to quantify the likelihood of a job becoming tardy and
include this factor in the calculation of reward. For job Ji , {ti,k |Ji ∈ J k ∧ Ji = Js }
S1 = Fs × − δ × AWs (22)
we use a sigmoid function to convert its slack time Si to a |J k | − 1
criticality factor Fi , to eliminate outliers (excessively large The average slack time gain/loss of the jobs that are not
values): selected equals the processing time on Js adjusted by the
Si mean value of their criticality factors, minus the adjusted
Fi = 1 − (18) average available time of their succeeding work centre.
|Si | + β
Fi ∈ (0, 2), where β is a user-specified factor to adjust Ji ∈J k ∧Ji =Js Fi
the curvature of the sigmoid function; as the value of S2 = × ts,k − δ
|J k | − 1
β increases, the sigmoid function becomes flatter and
smoother. {AWi |Wi ∈ SW k ∧ Wi = Ws }
× (23)
Based on the above considerations, we propose two |J k | − 1
slack-driven reward functions. δ is a factor to adjust the magnitude of available time
Surrogate reward function for RA: RA needs to eval- as it is usually few times larger than the processing
uate the availability and suitability of all machines and time; agents would place excessive attention on succeed-
assign the job to a machine that minimises long-term ing work centre’s workload without this adjustment. The
cumulative tardiness. When job Ji arrives at the work eventual reward is calculated as:
centre Wl to complete its xth operation, an estimated
slack upon the completion of operation (time step t + 1), rt = S1 − S2 (24)
denoted as E(St+1
i ), can be calculated as a function of Choosing an appropriate value of β and δ for the
the remaining processing time and the available time of reward calculation is crucial for training’s success. A
machines. cursory search leads to 40 < β < 60 and δ = 0.2; gener-

ni
1 ally, agents trained with smaller β perform better in lower
j
E(St+1
i ) = TTDi − ti − AMk (19) heterogeneity scenarios.
j=x
|Wl | M ∈W
k l
The actual slack time St+1 is realised upon the completion 4.5. Asynchronous transition
i
of the operation: We propose an asynchronous transition approach to
align with the surrogate reward shaping. As shown in

ni
St+1 = TTDi −
j
ti (20) Figure 3, the transition starts from the decision-making
i
j=x+1
point at time step t and ends upon the completion
of operation at time step t + 1. Time step t + 1 does
The reward is calculated as the gain/ loss of actual slack not necessarily coincide with the next decision-making
time compared with the estimated slack time, adjusted by point.
4058 R. LIU ET AL.
Figure 3. Transition for routing and sequencing agent.
4.6. Parameters and training The architectures of ANN for RA and SA are shown
in Figure 5. In the two-branch architecture for SA, only
Distributed scheduling architecture is essentially a multi-
the data in channels 1 to 4 is processed by the instance
agent system. We adopt parameter sharing technique and
normalisation layer as the abstracted data in channels 5
centralised training and decentralised execution train-
and 6 has suitable magnitude.
ing scheme (Gupta, Egorov, and Kochenderfer 2017)
The simulation model of manufacturing system is
in this research to reduce the multi-agent problem to
implemented by open-source Python discrete-event sim-
a single-agent problem in the training stage. Agents
ulation library SimPy (Matloff 2008). ANN is imple-
with similar scope and objective share the neural net-
mented using Python machine learning library PyTorch
work parameters and learn from the public experience
(Paszke et al. 2019). Training is carried out on a personal
pool, leading to similar and cooperative behaviour. Other
computer with Intel Core i5-4210H 2.90 GHz CPU and
DRL-related parameters are listed in Table 3.
12 GB RAM. The records of training loss of RA and SA
Both types of agents are trained in a flexible job
are presented in Figure A-11 and A-12 respectively.
shop consisting of three work centres, each housing
two machines. Training of RA and SA is done sepa-
rately to avoid the non-stationarity. The training process
5. Numerical experiments
simulates production for 100,000 units of time, during
which around 12,400 jobs arrive at the shop floor. The The experiments are divided into three stages:
process of simulation and training is shown in Figure 4.
The value of many features, especially those associ- (1) Independent utility test to test the independent
ated with the number of jobs and time, far exceeds the utility of SA and RA in designated scenarios and
range in which common activation functions can pro- qualify benchmark rules to enter the second stage of
duce smooth and distinguishable gradient. Instance nor- experiment based on their performance on cumula-
malisation technique (Ulyanov, Vedaldi, and Lempitsky tive tardiness (primary objective).
2016) is used to process the data to avoid the vanishing (2) Integrated DRL test that based on the result of the
gradient problem. As for the neurons, tanh (hyperbolic first stage, wherein several sequencing and routing
tangent) function is used as the activation function as rule combinations are used to validate the perfor-
negative values are quite common in state space. Huber mance of integrated SA and RA. Recent DRL-based
loss is used as the loss function. Learning rate decays from and GP-based DFJSP solutions are also compared.
10−2 to 10−3 . Stochastic gradient descent (SGD) is used (3) Scalability test that studies the performance of RA
as the optimiser of ANN, the momentum is set as 0.9. The and SA in larger-scale systems, and verifies their
hyperparameters of networks are listed in Table 3. scalability.
Table 3. Parameters for DRL and ANN.

Category Parameter RA SA
DRL related Discount factor (γ ) 0.8
Exploration rate ( ) 0.3 decays to 0.1
Minibatch size 128 64
Replay memory size 512 256
Hyperparameters of ANN Input size 9 25
Output size 2 4
Channels 1 6
Hidden layers 16 × 16 × 16 × 8 × 8 48 × 36 × 36 × 24 × 24 × 12
Figure 4. Simulation and training.
Figure 5. Architecture of ANN for scheduling agents.
The performance of priority rules and algorithms or algorithm in a run is calculated as:
are tested in 100 runs of simulation (each containing
multiple iterations) under each validation scenario. A
unique production incident is created in each run; the baseline − T
Tsum
rule or algorithm
sum
time interval between job arrivals, the processing time on NP = baseline
(25)
Tsum
each machine, and due date tightness of jobs are drawn
from respective statistical distributions. To make a fair
NP measures the performance gain; positive values
comparison, the production incident is then iteratively
imply better than baseline performance, higher the bet-
applied to a system controlled by DRL agents or bench-
ter. NP is presented in ‘.1’ panels in each figure below;
mark rules. Each iteration lasts for 1,000 units of time,
the mean of proposed DRL approach is drawn as green
with 124 job arrivals (on average).
dot lines.
The performance is evaluated using the following per-
formance metrics:
(2) Win rate: percentage of runs in which a rule or
(1) Normalised performance (NP): with a priority rule algorithm results in the lowest cumulative tardiness.
as the baseline, the normalised performance of rule It measures the adaptability of scheduling strategy
4060 R. LIU ET AL.
to various production incidents, and its superior- 5.1. Independent utility test
ity to others. To better present how proposed DRL
5.1.1. Routing agent
approach performs relative to benchmark rules, in
In this section, the independent utility of RA (DRL-RA) is
figures below we show the win rate with and without
tested, with six well-known routing rules listed in Table 4
the DRL agents in ‘.2’ and ‘.3’ panels, respectively.
as the benchmark. They are all coupled with First-In-
First-Out (FIFO) sequencing rule to minimise the effect
of sequencing decision; Earliest Available (EA) routing
Table 4. Benchmark routing rules.
rule is the baseline.
No. Routing Rule Description The results of numerical experiments are summarised
1 Earliest Completion Time (CT) Smallest sum of available time of in Figure 6. In each scenario, the top two benchmark rules
machine and processing time on the
machine. (except baseline) are marked with star; rules with a lower-
2 Minimum Execution Time (ET) Shortest processing time of job on the than-baseline performance are marked with cross.
machine.
3 Minimum Total Processing Least total processing time of jobs in The difference between routing strategies is decisive,
Time (TT) queue. and proposed DRL-RA exhibits strict superiority over
4 Minimum Utilisation Rate (UT) Least utilised machine, also termed as
‘longest idle time’ or ‘shortest work
all benchmark routing rules. An interesting observation
hour’ rule. is that EA rule, as the baseline with minimal active job
5 Smallest Queue Size (SQ) Fewest jobs in queue. allocation effect, delivers better performance than most
6 EA∗ Earliest available (idle) machine, also
termed as ‘shortest waiting time’ rule. routing rules, making it the second-best benchmark rule
Figure 6. Tardiness performance of RA under four validation scenarios.

in two low heterogeneity scenarios entering the stage of the behaviour of a specific rule, but maintains good per-
the integrated DRL experiment. See Table A-6 for more formance across all validation instances and beats the
detailed results. other rules consistently. See Table A-7 for more detailed
results.
5.1.2. Sequencing agent
In this section, SA (DRL-SA) and other sequencing rules 5.2. Validation of integrated DRL
are coupled with EA routing rule. In addition to the four
rules used for building the action space, we also include In this section, trained RA and SA are integrated to con-
in the benchmark pool nine other well-known sequenc- trol all work centres and machines to solve DFJSP. To
ing rules designed to improve tardiness-related objectives verify the synergistic effect between them, we also let RA
and six composite rules that have strong performance and SA run independently, in parallel.
in minimising tardiness. The complete list is shown in Based on the results of the previous section, we select
Table 5; FIFO rule is the baseline rule. The results of top two routing rules and top five sequencing rules to cre-
numerical experiments are shown in Figure 7. In each ate ten benchmark combinations, with FIFO+EA being
scenario, the top five benchmark rules are marked with the baseline. In addition, two novel GP-generated rule
star; rules with a lower-than-baseline performance are combinations are also included, termed GP1 (Zhang,
marked with cross. Mei, and Zhang 2019) and GP2 (Zhang et al. 2020a),
Ranking of rules differs in four scenarios, indicating respectively.
that the attribute of sound sequencing decision changes As it is unfair to compare with tabular RL-based
in different environments; sequencing strategy utilising approaches given the decisive superiority of DRL algo-
limited shop floor information cannot adapt to all scenar- rithms, the range of benchmarks is limited to the DRL-
ios. On the other hand, the difference in normalised per- based DFJSP solutions, which leaves very few studies to
formance is small; even the best rule or algorithm delivers compare against. Two peer approaches are included as the
less than 30% performance gain over baseline, resulting benchmark, termed Lang 2020 (Lang et al. 2020) and Luo
in a rather even distribution of win rate. A rule with even 2020 (Luo 2020), respectively.
middling performance can win in some of the runs. Metaheuristic-based approaches are not used for com-
Another cross-scenario observation is that the supe- parison as they lack the ability to cope with dynamic job
rior performance of singular rules with more complex arrival and cannot be applied to real-time scheduling.
parameterisation and composite rules such as MOD, The results are shown in Figure 8; see Table A-8 for
PTWINQS and DPTLWKRS is consistent. more detailed results.
Proposed DRL-SA outperforms most of the rules in Routing decision has more of an impact on sys-
all the scenarios; it delivers around 4% higher perfor- tem performance than sequencing decision, the differ-
mance than the best benchmark rule in each scenario. ence between the scheduling strategies is affected more
The comparison of win rate with and without DRL agents by the constituent routing rule than the sequencing
shows that proposed approach does not simply imitate rule.
This phenomenon is, on one hand, a result of how
Table 5. Benchmark sequencing rules. often scheduling agent gets engaged for decision-making:
No. Rule
under hierarchical scheduling architecture, a job must
1 Apparent tardiness cost (ATC)
be assigned to a machine but can be processed with-
2 Average processing time per operation (AVPRO) out a sequencing decision. Additionally, routing deci-
3 Cost over time (COVERT) sions can better exploit the heterogeneity of processing
4 CR
5 Earliest due date (EDD) times and workload. Good routing decisions can avoid
6 Least work remaining (LWKR) excessively long operation and queuing times, whereas
7 Modified due date (MDD)
8 Modified operational due date (MOD) sequencing decisions only rearrange the job sequence
9 Montagne’s heuristic (MON) with the constraints of unchangeable capacity and
10 Slack (MS)
11 Next operation’s processing time (NPT)
demand.
12 SPT In four scenarios, integrated DRL outperforms all
13 WINQ combinations of benchmark rules, peer DRL-based and
14 LWKR + SPT (LWKRSPT)
15 LWKR + MOD (LWKRMOD) GP-based solutions. The performance of independent
16 SPT + WINQ (PTWINQ) RA and SA runs in parallel also validates their utility
17 SPT + WINQ + Slack (PTWINQS)
18 2PT + LWKR + Slack (DPTLWKRS) as a component of the integrated approach; although
19 2PT + WINQ + NPT (DPTWINQNPT) SA does not provide as clear an advantage as RA, they
20 FIFO∗
both contribute to the performance improvement of
4062 R. LIU ET AL.
Figure 7. Tardiness performance of SA under four validation scenarios.

Figure 8. Tardiness performance of integrated DRL under four validation scenarios.

4064 R. LIU ET AL.
integrated DRL; with either one absent performance are shown in Figure 9, panels (a) to (d); the results of 4-
degrades. machine per work centre test (240 job arrivals on average)
are shown in panels (e) to (h).
Flexibility amplifies the difference among the routing
5.3. Scalability rules, indicating that alternatives for job allocation can
either be an advantage or a handicap. CT rules that utilise
In this section, proposed approach’s sensitivity to config-
the job processing time and machine workload informa-
uration change is tested. RA and SA run individually in
tion benefit from flexibility increase and achieve better
two scalability tests.
normalised performance and lower cumulative tardiness
even with more job arrivals. The superiority of DRL-RA
5.3.1. Higher flexibility test and CT rules becomes even more pronounced in the flex-
For the flexibility test, the number of machines at each ible systems; they both reduce the cumulative tardiness
work centre is increased. RA are retrained due to the to an extremely low level with high heterogeneity of pro-
change in input/output size as the configuration of work cessing time, especially when the number of machines
centres has changed; the records of training loss are at a work centre reaches four. Table A-9 shows more
presented in Figure A-13 and A-14, respectively. For information on the results of cumulative tardiness.
simplicity of presentation, only the top two benchmark The performance gain in flexible systems, however,
rules under each validation scenario from section 5.1.1 comes at the expense of a reduced win rate for the pro-
and baseline (EA rule) are compared. The results of 3- posed DRL-RA. When the exploitation of system flexibil-
machine per work centre test (180 job arrivals on average) ity overwhelms the high utilisation rate, the occurrence of
Figure 9. Tardiness performance under higher flexibility.

Figure 10. Tardiness performance under longer sequences.
tardiness is more a result of the fluctuation in job arrival SA still outperforms or at least matches the perfor-
rate rather than the quality of routing decisions. mance of strongest benchmark rules. On the other
hand, the difference between SA and strong bench-
5.3.2. Longer sequence test mark rules is insensitive to the configuration changes;
For the test of longer process sequence, the number of this phenomenon conforms with Holthaus and Rajen-
operations (work centres) is increased to six and nine. dran (1997), who state that the relative performance of
Due to the homogeneity of input/output size, the trained sequencing rules is not significantly affected by the size
models used in section 5.1.2 are imported and used as of the manufacturing system. Table A-10 provides more
the initial parameter (instead of random initialisation) to detailed results.
accelerate the learning. SA is then retrained to adapt to
the new environment; the training loss records are pre-
6. Conclusion
sented in Figures A-15 and A-16, respectively. Only the
top five benchmark rules from section 5.1.2 and base- Scheduling in the factory of the future would rely heav-
line (FIFO rule) are compared under each validation ily on the ability to handle unpredictable dynamic events
scenario. The results of 6-operation test are shown in with real-time responsiveness. This research proposes a
Figure 10, panels (a) to (d), while the result of 9-operation DRL architecture to solve the DFJSP as an attempt in that
test is shown in panels (e) to (h). direction.
The performance gain of all sequencing rules increases Two types of scheduling agents trained with Dou-
slightly when the number of operations increases; ble DQN algorithm outperform existing scheduling
4066 R. LIU ET AL.
strategies either as independent components or when ability to handle high-dimensional input space and
deployed in an integrated manner; the advantage pro- lower sensitivity to increase in feature complexity. Auto-
vided by scheduling agents is maintained even as the mated training also enables fast learning with minimal
system configuration changes. In addition to the per- involvement of human expertise, reducing the cost of
formance improvement, major contributions of this design and development. Most importantly, DRL is a
research are: highly parametrised algorithm, the use of ANN pro-
vides the ability to perceive and manage complex sys-
(1) Specialised representations: Dynamic scheduling tem dynamics. More performance improvement can
faces the problem of ever-changing specification, be expected by incorporating algorithmic innovations
which cannot be handled by the static architecture and taking advantage of rapidly increasing computing
of existing Machine Learning (ML) algorithms. Most power.
current studies make assumptions to constrain the DRL can also achieve high quality and responsiveness
scheduling problem to a fixed size. This research simultaneously in a dynamic environment. Although
develops abstracted state features and indirect action metaheuristics have long been used to produce practical
representations to resolve this issue. The resulting ‘optimal’ schedules (given the NP-hardness of JSP and
approach can manage scheduling problems with FJSP), however, no knowledge is created along with the
long duration and large number of jobs. schedule to facilitate the development of a new sched-
(2) Knowledge-based reward-shaping: Unlike the ule once disruption occurs. Learning-based approaches
usual tasks where DRL has achieved great success, overcome this disadvantage, the parametrised knowledge
such as board games and video games, most pro- can be reused with negligible cost (a forward propagation
duction planning and control problems cannot be in neural network), generalised, and transferred. These
formulated as a task to ‘win’ or ‘survive’, but instead features enable fast prototyping and deployment, and
focus on the constrained optimisation of an objec- are core capabilities of an agile production scheduling
tive that is complex in nature. Reward shaping based system.
on domain knowledge is vital to strike a balance There are, however, challenges and opportunities
between solution quality and implementation cost. in transferring the DRL research into the production
This research develops a surrogate reward-shaping domain:
technique to improve training efficiency and encour- Unlike the tabular or graphical state representation
age cooperation among scheduling agents. It delivers in applications where DRL has achieved great success,
higher stability and performance than the simple industrial data is messy and unstructured. The selection,
joint reward approach, enabling good reproducibil- abstraction, and pre-processing of features still require
ity of our work. human expertise. This process can be facilitated by intro-
(3) Efficient training and application: The lightweight ducing innovative data interpretation techniques in ML
neural network guarantees quick decision-making research. In addition, shop floor is a complex environ-
with millisecond-level latency, making real-time ment, study that considers more types of dynamic events
control possible. Our proposed architecture also such as stochastic processing time and machine break-
exhibits good training efficiency: 100,000 units of downs would be helpful in bringing DRL closer to real-
simulation time (training) take up to 20 minutes to life application.
complete, which allows fast iteration to adapt to Manual reward-shaping relies on domain expertise
the dynamic environment. Scheduling agents can and requires great effort to design and fine-tune. In addi-
rapidly acquire new behaviour by incorporating new tion, actions can have a lasting and widespread influence
experiences to replay memory or tuning parameters. that is hard to evaluate using a scalar reward. Incorporat-
Even though the proposed architecture is demon- ing causal learning and inference with RL to determine
strated in a small manufacturing system, its effi- the relationship between action and its influence is a
ciency is expected to persist even in larger-scale popular research topic and its potential for industrial
systems with the support of stronger computation applications is promising.
capacity. The cooperative behaviour among intelligent agents
in a cyber-physical manufacturing system is not very
Research into real-time dynamic scheduling includes well-studied. Other popular techniques of cooperative
the discovery of features to facilitate decision-making multi-agent RL, such as communication between agents
and the study of operating those features. Compared (Foerster et al. 2016) and automatic value decomposi-
to rule-based approaches (manually designed or auto- tion (Sunehag et al. 2017), have also proven valuable in
matically evolved), DRL algorithms possess stronger improving the performance of the multi-agent system;
their application to dynamic scheduling, though, still Simulation where he leads the activities of Vicomtech as Digital
remains unexplored. Twin Consortium branch coordinator for Spain.
Data availability statement ORCID

The data that support the findings of this study are Renke Liu http://orcid.org/0000-0003-4375-9140
available from the corresponding author, [Liu, R], upon
reasonable request. References
Aissani, N., B. Beldjilali, and D. Trentesaux. 2008. “Efficient and
Disclosure statement effective reactive scheduling of manufacturing system using
Sarsa-multi-objective agents". MOSIM’08: 7th Conference
No potential conflict of interest was reported by the author(s). Internationale de Modelisation et Simulation.
Baer, Schirin, Jupiter Bakakeu, Richard Meyes, and Tobias
Notes on contributors Meisen. 2019. Multi-Agent Reinforcement Learning for
Job Shop Scheduling in Flexible Manufacturing Systems.
Renke Liu received his B.E. degree from Paper presented at the 2019 s International Conference
Hunan University, China, in 2015; and on Artificial Intelligence for Industries (AI4I), 25-27 Sept.
M.S. degree from Nanyang Technolog- 2019.
ical University, Singapore, in 2017. He Bernstein, Daniel S, Robert Givan, Neil Immerman, and
is currently pursuing his Ph.D. degree Shlomo Zilberstein. 2002. “The Complexity of Decentral-
at Nanyang Technological University. His ized Control of Markov Decision Processes.” Mathematics of
research interests span the areas of compu- Operations Research 27 (4): 819–840.
tational intelligence’s application to pro- Bouazza, W., Y. Sallez, and B. Beldjilali. 2017. “A Distributed
duction planning and control, deep reinforcement learning, Approach Solving Partially Flexible job-Shop Scheduling
and dynamic scheduling in complex systems. Problem with a Q-Learning Effect.” IFAC-PapersOnLine 50
(1): 15890–15895. doi:10.1016/j.ifacol.2017.08.2354.
Dr. Rajesh Piplani is an associate professor
Chen, Jiasi, and Xukan Ran. 2019. “Deep Learning With Edge
and the director of the M.Sc. program in
Computing: A Review.” Proceedings of the IEEE 107 (8):
Supply chain Engineering in the school of
1655–1674. doi:10.1109/JPROC.2019.2921977.
Mechanical and Aerospace Engineering at
Ðurasević, Marko, and Domagoj Jakobović. 2018. “A Sur-
Nanyang Technological University, Singa-
vey of Dispatching Rules for the Dynamic Unrelated
pore. He obtained his Ph. D. from Purdue
Machines Environment.” Expert Systems with Applications
University and M.S. from Arizona State
113: 555–569. doi:10.1016/j.eswa.2018.06.053.
University. He ran the Center for Supply
Foerster, Jakob N., Yannis M. Assael, Nando de Freitas, and Shi-
Chain Management (2002-2009) at NTU, and was also the Pro-
mon Whiteson. 2016. “Learning to Communicate to Solve
gram Manager, Integrated Manufacturing and Service Systems
Riddles with Deep Distributed Recurrent Q-Networks.”
(IMSS) for Singapore research funding agency A∗STAR, man-
arXiv preprint arXiv:1602.02672 .
aging the SGD 8 Million research program for four years. Dr.
Gabel, Thomas, and Martin Riedmiller. 2012. “Distributed Pol-
Piplani’s interests are in supply chain management of manu-
icy Search Reinforcement Learning for job-Shop Scheduling
facturing enterprises, data analytics, inventory planning, and
Tasks.” International Journal of Production Research 50 (1):
air traffic control. Dr. Piplani has over seven years of industry
41–61. doi:10.1080/00207543.2011.571443.
experience in India and USA.
Guh, Ruey-Shiang, Yeou-Ren Shiue, and Tsung-Yuan Tseng.
Dr. Carlos Toro received both his Ph.D. 2011. “The Study of Real Time Scheduling by an Intelligent
And M.Sc in Computer Science from the Multi-Controller Approach.” International Journal of Pro-
University of the Basque Country (Spain) duction Research 49 (10): 2977–2997. doi:10.1080/00207541
and his Bachelor degree in Mechanical 003794884.
Engineering (with honors) from EAFIT Gupta, Jayesh K., Maxim Egorov, and Mykel Kochenderfer.
University (Colombia). In 2003 he moved 2017. “Cooperative Multi-agent Control Using Deep Rein-
to Spain and started working in applied forcement Learning.” Autonomous Agents and Multiagent
research focusing on Industrial and Systems. AAMAS 2017. Lecture Notes in Computer Science
Advanced Manufacturing. At the same time he lectured at 10642: 66–83. doi: 10.1007/978-3-319-71682-4_5.
the University of the Basque Country. In 2007 he was invited Gupta, Jatinder N. D., Arindam Majumder, and Dipak Laha.
researcher at the University of Newcastle (Australia), return- 2019. “Flowshop Scheduling with Artificial Neural Net-
ing in 2011 with a Marie Curie research visitor grant funded works.” Journal of the Operational Research Society 70 (10):
by the EU Commission. In 2017 he joined A∗STAR Singapore 1–19. doi:10.1080/01605682.2019.1621220.
at their Advanced Remanufacturing and Technology Centre Ho, Nhu Binh, Joc Cing Tay, and Edmund M. K. Lai. 2007.
(ARTC) as lead architect and research scientist for the Factory “An Effective Architecture for Learning and Evolving Flex-
of the Future initiative which is the effort of Singapore gov- ible job-Shop Schedules.” European Journal of Operational
ernment for implementing a model factory that includes the Research 179 (2): 316–333. doi:10.1016/j.ejor.2006.04.007.
concepts of the 4th industrial revolution. In 2021 he joined Holthaus, Oliver, and Chandrasekharan Rajendran. 1997. “Effi-
Vicomtech as Research Line Head for Virtual Engineering & cient Dispatching Rules for Scheduling in a job Shop.”
4068 R. LIU ET AL.
International Journal of Production Economics 48 (1): Messaoud, Seifeddine, Abbas Bradai, Syed Hashim Raza
87–105. doi: 10.1016/S0925-5273(96)00068-0. Bukhari, Pham Tran Anh Quang, Olfa Ben Ahmed, and
Hubbs, Christian D., Can Li, Nikolaos V. Sahinidis, Igna- Mohamed Atri. 2020. “A Survey on Machine Learning in
cio E. Grossmann, and John M. Wassick. 2020. “A Deep Internet of Things: Algorithms, Strategies, and Applica-
Reinforcement Learning Approach for Chemical Produc- tions.” Internet of Things 12: 100314. doi:10.1016/j.iot.2020.
tion Scheduling.” Computers & Chemical Engineering 141: 100314.
106982. doi:10.1016/j.compchemeng.2020.106982. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei
Jun, Sungbum, Seokcheon Lee, and Hyonho Chun. 2019. A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al.
“Learning Dispatching Rules Using Random Forest in Flexi- 2015. “Human-level Control Through Deep Reinforcement
ble job Shop Scheduling Problems.” International Journal of Learning.” Nature 518 (7540): 529–533. doi:10.1038/nature
Production Research 57 (10): 3290–3310. doi:10.1080/00207 14236.
543.2019.1581954. Mouelhi-Chibani, Wiem, and Henri Pierreval. 2010. “Train-
Kuhnle, Andreas, Nicole Röhrig, and Gisela Lanza. 2019. ing a Neural Network to Select Dispatching Rules in Real
“Autonomous Order Dispatching in the Semiconductor Time.” Computers & Industrial Engineering 58 (2): 249–256.
Industry Using Reinforcement Learning.” Procedia CIRP 79: doi:10.1016/j.cie.2009.03.008.
391–396. doi:10.1016/j.procir.2019.02.101. Olafsson, Sigurdur, and Xiaonan Li. 2010. “Learning Effec-
Kuhnle, Andreas, Louis Schäfer, Nicole Stricker, and Gisela tive new Single Machine Dispatching Rules from Opti-
Lanza. 2019. “Design, Implementation and Evaluation of mal Scheduling Data.” International Journal of Produc-
Reinforcement Learning for an Adaptive Order Dispatch- tion Economics 128 (1): 118–126. doi:10.1016/j.ijpe.2010.
ing in job Shop Manufacturing Systems.” Procedia CIRP 81: 06.004.
234–239. doi:10.1016/j.procir.2019.03.041. Ouelhadj, Djamila, and Sanja Petrovic. 2008. “A Survey of
Lang, Sebastian, Fabian Behrendt, Nico Lanzerath, Tobias Dynamic Scheduling in Manufacturing Systems.” Journal of
Reggelin, and Marcel Müller. 2020. "Integration of Deep Scheduling 12 (4): 417. doi:10.1007/s10951-008-0090-8.
Reinforcement Learning and Discrete-Event Simulation for Park, Junyoung, Jaehyeong Chun, Sang Hun Kim, Youngkook
Real-Time Scheduling of a Flexible Job Shop Produc- Kim, and Jinkyoo Park. 2021. “Learning to Schedule job-
tion." 2020 Winter Simulation Conference (WSC):3057-68. Shop Problems: Representation and Policy Learning Using
doi:10.1109/WSC48552.2020.9383997. Graph Neural Network and Reinforcement Learning.” Inter-
Li, Xinyu, and Liang Gao. 2020. “A Hybrid Genetic Algorithm national Journal of Production Research 59 (11): 3360–3377.
and Tabu Search for Multi-Objective Dynamic jsp.” In Effec- doi:10.1080/00207543.2020.1870013.
tive Methods for Integrated Process Planning and Schedul- Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James
ing. Engineering Applications of Computational Methods. Vol. Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin,
2, edited by Xinyu Li and Liang Gao, 377–403. Berlin, Hei- Natalia Gimelshein, and Luca Antiga. 2019. Pytorch: An
delberg: Springer. imperative style, high-performance deep learning library.
Lin, Chun-Cheng, Der-Jiunn Deng, Yen-Ling Chih, and Hsin- Paper presented at the Advances in neural information pro-
Ting Chiu. 2019. “Smart Manufacturing Scheduling With cessing systems.
Edge Computing Using Multiclass Deep Q Network.” IEEE Pickardt, Christoph, Jurgen Branke, Torsten Hildebrandt, Jens
Transactions on Industrial Informatics 15 (7): 4276–4284. Heger, and Bernd Scholz-Reiter. 2010. Generating dispatch-
doi:10.1109/TII.2019.2908210. ing rules for semiconductor manufacturing to minimize
Liu, Chien-Liang, Chuan-Chin Chang, and Chun-Jan Tseng. weighted tardiness. Paper presented at the 2010 Winter Sim-
2020. “Actor-critic Deep Reinforcement Learning for Solv- ulation Conference.
ing job Shop Scheduling Problems.” IEEE Access 8: 71752– Priore, Paolo, José Parreño, Raúl Pino, Alberto Gómez,
71762. doi:10.1109/ACCESS.2020.2987820. and Javier Puente. 2010. “Learning-based Scheduling of
Liu, Feng, Shengbin Wang, Yong Hong, and Xiaohang Yue. Flexible Manufacturing Systems Using Support Vector
2017. “On the Robust and Stable Flowshop Schedul- Machines.” Applied Artificial Intelligence 24 (3): 194–209.
ing Under Stochastic and Dynamic Disruptions.” IEEE doi:10.1080/08839510903549606.
Transactions on Engineering Management 64 (4): 539–553. Qu, Shuhui, Jie Wang, and Juergen Jasperneite. 2019. "Dynamic
doi:10.1109/TEM.2017.2712611. scheduling in modern processing systems using expert-
Luo, Shu. 2020. “Dynamic Scheduling for Flexible job Shop guided distributed reinforcement learning." In 2019 24th
with new job Insertions by Deep Reinforcement Learning.” IEEE International Conference on Emerging Technologies and
Applied Soft Computing 91: 106208. doi:10.1016/j.asoc.2020. Factory Automation (ETFA), 459-66.
106208. Ren, Weibo, Yan Yan, Yaoguang Hu, and Yu Guan. 2021.
Luo, Peng Cheng, Huan Qian Xiong, Bo Wen Zhang, Jie Yang “Joint Optimisation for Dynamic Flexible job-Shop Schedul-
Peng, and Zhao Feng Xiong. 2021. “Multi-resource Con- ing Problem with Transportation Time and Resource Con-
strained Dynamic Workshop Scheduling Based on Proximal straints.” International Journal of Production Research. doi:
Policy Optimisation.” International Journal of Production 10.1080/00207543.2021.1968526..
Research. doi: 10.1080/00207543.2021.1975057. Sels, Veronique, Nele Gheysen, and Mario Vanhoucke. 2012. “A
Malus, Andreja, Dominik Kozjek, and Rok Vrabič. 2020. “Real- Comparison of Priority Rules for the job Shop Scheduling
time Order Dispatching for a Fleet of Autonomous Mobile Problem Under Different Flow Time-and Tardiness-Related
Robots Using Multi-Agent Reinforcement Learning.” CIRP Objective Functions.” International Journal of Production
Annals 69 (1): 397–400. doi: 10.1016/j.cirp.2020.04.001. Research 50 (15): 4255–4270.
Matloff, Norm. 2008. Introduction to discrete-event simulation Shi, Daming, Wenhui Fan, Yingying Xiao, Tingyu Lin, and Chi
and the simpy language. Xing. 2020. “Intelligent Scheduling of Discrete Automated
Production Line via Deep Reinforcement Learning.” Inter- Xiong, Hegen, Huali Fan, Guozhang Jiang, and Gongfa Li.
national Journal of Production Research 58 (11): 3362–3380. 2017. “A Simulation-Based Study of Dispatching Rules
doi:10.1080/00207543.2020.1717008. in a Dynamic job Shop Scheduling Problem with Batch
Shiue, Yeou-Ren. 2009. “Data-mining-based Dynamic Dis- Release and Extended Technical Precedence Constraints.”
patching Rule Selection Mechanism for Shop Floor Control European Journal of Operational Research 257 (1): 13–24.
Systems Using a Support Vector Machine Approach.” Inter- doi:10.1016/j.ejor.2016.07.030.
national Journal of Production Research 47 (13): 3669–3690. Yang, Shengluo, and Zhigang Xu. 2021. “Intelligent Scheduling
doi:10.1080/00207540701846236. and Reconfiguration via Deep Reinforcement Learning in
Sunehag, Peter, Guy Lever, Audrunas Gruslys, Wojciech Mar- Smart Manufacturing.” International Journal of Production
ian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Research. doi: 10.1080/00207543.2021.1943037.
Lanctot, et al. 2017. “Value-Decomposition Networks For Yska, Daniel, Yi Mei, and Mengjie Zhang. 2018. Genetic pro-
Cooperative Multi-Agent Learning.” arXiv preprint arXiv: gramming hyper-heuristic with cooperative coevolution for
1706.05296. dynamic flexible job shop scheduling. Paper presented at the
Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. 2016. European Conference on Genetic Programming, Cham.
“Instance Normalization: The Missing Ingredient for Fast Zang, Zelin, Wanliang Wang, Yuhang Song, Linyan Lu, Weikun
Stylization.” arXiv preprint arXiv:1607.06450. Li, Yule Wang, and Yanwei Zhao. 2019. “Hybrid Deep
Van Hasselt, Hado, Arthur Guez, and David Silver. 2016. Deep Neural Network Scheduler for job-Shop Problem Based
reinforcement learning with double q-learning. Paper pre- on Convolution two-Dimensional Transformation.” Com-
sented at the Thirtieth AAAI conference on artificial intelli- putational Intelligence and Neuroscience 2019: 7172842.
gence. doi:10.1155/2019/7172842.
van Otterlo, Martijn, and Marco Wiering. 2012. Reinforce- Zhang, Fangfang, Yi Mei, Su Nguyen, and Mengjie Zhang.
ment Learning and Markov Decision Processes, Reinforcement 2020. “Evolving Scheduling Heuristics via Genetic Pro-
Learning: State-of-the-art. Berlin: Springer Berlin Heidel- gramming With Feature Selection in Dynamic Flexible Job-
berg. Shop Scheduling.” IEEE Transactions on Cybernetics 51 (4):
Wang, Yu-Fang. 2020. “Adaptive job Shop Scheduling Strategy 1797–1811. doi:10.1109/TCYB.2020.3024849.
Based on Weighted Q-Learning Algorithm.” Journal of Intel- Zhang, Fangfang, Yi Mei, and Mengjie Zhang. 2018. Genetic
ligent Manufacturing 31 (2): 417–432. doi:10.1007/s10845- programming with multi-tree representation for dynamic
018-1454-3. flexible job shop scheduling. Paper presented at the Aus-
Wang, Haoxiang, Bhaba R. Sarker, Jing Li, and Jian Li. 2020. tralasian Joint Conference on Artificial Intelligence.
“Adaptive Scheduling for Assembly job Shop with Uncer- Zhang, Fangfang, Yi Mei, and Mengjie Zhang. 2019. "A two-
tain Assembly Times Based on Dual Q-Learning.” Interna- stage genetic programming hyper-heuristic approach with
tional Journal of Production Research 59 (19): 5867–5883. feature selection for dynamic flexible job shop scheduling."
doi:10.1080/00207543.2020.1794075. In Proceedings of the Genetic and Evolutionary Computation
Wang, Yi-Chi, and John M. Usher. 2005. “Application of Rein- Conference, 347–55. Prague, Czech Republic: Association for
forcement Learning for Agent-Based Production Schedul- Computing Machinery.
ing.” Engineering Applications of Artificial Intelligence 18 (1): Zhang, Rui, Shiji Song, and Cheng Wu. 2020. “Robust
73–82. doi:10.1016/j.engappai.2004.08.018. Scheduling of hot Rolling Production by Local Search
Waschneck, Bernd, André Reichstaller, Lenz Belzner, Thomas Enhanced ant Colony Optimization Algorithm.” IEEE
Altenmüller, Thomas Bauernhansl, Alexander Knapp, and Transactions on Industrial Informatics 16 (4): 2809–2819.
Andreas Kyek. 2018. “Optimization of Global Production doi:10.1109/TII.2019.2944247.
Scheduling with Deep Reinforcement Learning.” Procedia Zhang, Tao, Shufang Xie, and Oliver Rose. 2017. Real-time job
CIRP 72: 1264–1269. doi:10.1016/j.procir.2018.03.212. shop scheduling based on simulation and Markov decision
Weckman, Gary R., Chandrasekhar V. Ganduri, and David processes. Paper presented at the 2017 Winter Simulation
A. Koonce. 2008. “A Neural Network job-Shop Sched- Conference (WSC), Las Vegas, Nevada.
uler.” Journal of Intelligent Manufacturing 19 (2): 191–201. Zhou, MengChu. 2012. Petri Nets in Flexible and Agile Automa-
doi:10.1007/s10845-008-0073-9. tion. Vol. 310. Springer Science & Business Media.

Deep Reinforcement Learning For Dynamic Scheduling of A Flexible Job Shop

Uploaded by

Copyright:

Available Formats

You might also like

Deep Reinforcement Learning For Dynamic Scheduling of A Flexible Job Shop

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Deep Reinforcement Learning For Dynamic Scheduling of A Flexible Job Shop

Uploaded by

Copyright:

Available Formats

International Journal of Production Research

ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/tprs20

Deep reinforcement learning for dynamic

Renke Liu, Rajesh Piplani & Carlos Toro

To link to this article: https://doi.org/10.1080/00207543.2022.2058432

View supplementary material

Published online: 11 Apr 2022.

Submit your article to this journal

Article views: 7383

View related articles

View Crossmark data

Citing articles: 16 View citing articles

Full Terms & Conditions of access and use can be found at

Deep reinforcement learning for dynamic scheduling of a flexible job shop

ABSTRACT ARTICLE HISTORY

Figure 1. Hierarchical and distributed scheduling framework.

j Validation scenario and objective

Table 2. State features for sequencing agent.

Figure 2. History of production.

Figure 3. Transition for routing and sequencing agent.

Table 3. Parameters for DRL and ANN.

Figure 4. Simulation and training.

Figure 5. Architecture of ANN for scheduling agents.

Figure 6. Tardiness performance of RA under four validation scenarios.

Figure 7. Tardiness performance of SA under four validation scenarios.

Figure 8. Tardiness performance of integrated DRL under four validation scenarios.

Figure 9. Tardiness performance under higher ﬂexibility.

Figure 10. Tardiness performance under longer sequences.

Data availability statement ORCID

You might also like