International Journal of Production Research

ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/tprs20


To cite this article: Benjamin Rolf, Ilya Jackson, Marcel Müller, Sebastian Lang, Tobias Reggelin
& Dmitry Ivanov (2023) A review on reinforcement learning algorithms and applications in
supply chain management, International Journal of Production Research, 61:20, 7151-7179,
DOI: 10.1080/00207543.2022.2140221

To link to this article: https://doi.org/10.1080/00207543.2022.2140221

© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.

Published online: 03 Nov 2022.


A review on reinforcement learning algorithms and applications in supply chain management

Benjamin Rolf (a), Ilya Jackson (b), Marcel Müller (a), Sebastian Lang (c), Tobias Reggelin (a) and Dmitry Ivanov (d)

(a) Otto-von-Guericke-University Magdeburg, Magdeburg, Germany; (b) Massachusetts Institute of Technology, Center for Transportation & Logistics, Cambridge, MA, USA; (c) Fraunhofer Institute for Factory Operation and Automation IFF, Magdeburg, Germany; (d) Berlin School of Economics and Law, Berlin, Germany

ABSTRACT
Decision-making in supply chains is challenged by high complexity, a combination of continuous and discrete processes, integrated and interdependent operations, dynamics, and adaptability. The rapidly increasing data availability, computing power and intelligent algorithms unveil new potentials in adaptive data-driven decision-making. Reinforcement Learning, a class of machine learning algorithms, is one of the data-driven methods. This semi-systematic literature review explores the current state of the art of reinforcement learning in supply chain management (SCM) and proposes a classification framework. The framework classifies academic papers based on supply chain drivers, algorithms, data sources, and industrial sectors. The conducted review revealed a few critical insights. First, the classic Q-learning algorithm is still the most popular one. Second, inventory management is the most common application of reinforcement learning in supply chains, as it is a pivotal element of supply chain synchronisation. Last, most reviewed papers address toy-like SCM problems driven by artificial data. Therefore, shifting to industry-scale problems will be a crucial challenge in the next years. If this shift is successful, the vision of data-driven decision-making in real-time could become a reality.

ARTICLE HISTORY
Received 17 June 2022; Accepted 13 October 2022

KEYWORDS
Reinforcement learning; supply chain management; literature review; inventory management; machine learning; artificial intelligence

1. Introduction

Supply chains operate in increasingly complex and uncertain environments. Adaptive planning and control in such environments is of utmost importance to ensure delivery to end customers with minimal delays and interruptions, avoiding unnecessary costs and maintaining business continuity. For the implementation of adaptation-based management principles, real-time coordination of production scheduling, inventory control, and delivery plans is required, whereas control parameters of the system need to be dynamically adjusted toward minimising costs, maximising revenue, satisfying a target service level, or pursuing any other quantifiable objective with the consideration of the dynamics, nonstationarity, and uncertainty (Kim et al. 2005; Ivanov and Sokolov 2013; Emerson, Zhou, and Piramuthu 2009).

The global interconnectedness and complexity may result in a lack of visibility and risks of devastating disruptions. An explosion at the Evonik Factory is a notable example. In 2012, the explosion, followed by a fire, destroyed the chemical factory in Marl, Germany. The factory produced cyclododecatriene, used by the chemical industry to make laurolactam. Laurolactam is, in its turn, used by plastics manufacturers to derive polyamide-12, a plastic essential for strong, lightweight components. Polyamide-12 is in the bill of materials of any car, scattered across thousands of different parts and manufactured by a multitude of different suppliers. The accident created the ripple effect and threatened to disrupt the entire automobile industry. Only collaboration among competing automakers and dozens of suppliers prevented the catastrophe (Sheffi 2020). Other examples of complexity include the semiconductor supply chains (Khan, Mann, and Peterson 2021) and the supply chains behind vaccine production and distribution (Sheffi 2021). Given high levels of complexity, supply chains are prone to disruptions and suboptimal performance caused by operational failures and information miscoordination. The most notable examples of disruptive phenomena include bullwhip and ripple effects. The bullwhip effect, also widely known as the Forrester effect, can be defined as the amplification of demand variation on production and order quantities as they propagate downstream in supply chains (Xun Wang and Disney 2016). On the other hand, the ripple effect occurs when a disruption, rather than being localised within
one part of the supply chain, cascades downstream and undermines the performance of the entire supply chain (Dolgui, Ivanov, and Rozhkov 2019). Both effects cause significant problems for supply chain managers because they eventually give rise to over- and under-production cycles, leading to excess inventory levels, potential stock-outs, and suboptimal network performance. These problems may be further worsened if structural and operational vulnerabilities in the supply chain are interconnected (Ivanov 2020a). Besides, the ongoing COVID-19 pandemic demonstrated a new kind of disruption, characterised by the long-term presence and unpredictable scale (Ivanov 2020b). Since the severe impact of such disruptions cannot be easily mitigated, supply chain participants require recovery planning and adaptation in the presence of disruption (Ivanov 2021b, 2022a, 2022c). In this regard, the full potential of the supply chain is unlocked if and only if it becomes synchronised, namely, all the critical stakeholders obtain accurate real-time data, identify weaknesses, streamline processes, and mitigate risk.

Decision-making in Supply Chain Management (SCM) is challenged by high complexity, a combination of continuous and discrete processes, integrated and interdependent operations, as well as dynamics, resulting in requirements for adaptability (Ivanov, Dolgui, and Sokolov 2012). Reinforcement Learning (RL), a class of machine learning algorithms specialising in sequential decision-making, is a promising candidate solution to address these challenges. RL determines how to take actions in an environment to maximise the reward over time and, thus, can serve as an adaptive controller for such complex systems. The ultimate goal of such a controller (also known as an RL agent) is to learn the best possible control actions in each possible state of the dynamic system to maximise long-term system objectives (Boute et al. 2022).

The last decade manifested itself in growing data quantities and processing power. This trend eventually resulted in the advent of deep learning, a machine learning approach capable of leveraging immense computational resources and capitalising on massive amounts of data (Bengio 2016). These facts, along with the advent of new algorithmic techniques, mature software packages, and strong interest from business, resulted in deep reinforcement learning (DRL), the promising combination of reinforcement learning and deep learning (Krakovsky 2016). Despite its novelty, at this point, DRL has already demonstrated remarkable performance in the fields of road traffic navigation (Vinitsky et al. 2020), autonomous vehicles (Isele et al. 2018), and robotics (Gu et al. 2017). However, it is essential to emphasise that RL agents cannot learn directly from the physical world. Therefore, all these applications require a virtual environment or simulation, a dynamic testbed for learning through trial and error (MacCarthy and Ivanov 2022a, 2022b). In SCM, the Supply Chain Digital Twin seems to be the most promising candidate for such a virtual testbed environment. The combination of model-based and data-driven approaches by incorporating real-time data from physical and cyber sources can ensure end-to-end visibility and permanent information accessibility for an RL agent (Ivanov and Dolgui 2021; Burgos and Ivanov 2021). Since such emerging digital technologies as the Internet of Things, Blockchain, 5G, cloud and edge computing make the Supply Chain Digital Twin implementable by enabling real-time connectivity and visibility at a highly granular level (Dolgui and Ivanov 2022; Ivanov, Dolgui, and Sokolov 2022; Ivanov 2021a; Choi et al. 2022; Zhang, MacCarthy, and Ivanov 2022), we can expect an increase in popularity and a faster adoption rate of DRL applications in supply chains.

In light of these facts, the aim of this article is to identify the main RL algorithms and their applications to SCM using a structured review methodology. For this purpose, the article addresses three research questions:

(1) What are the main applications of reinforcement learning in SCM?
(2) What are the main algorithms for reinforcement learning in SCM?
(3) How widespread is reinforcement learning for solving industrial use cases?

To answer the research questions, we first introduce the general framework of RL and how this framework would fit into the supply chain context. Then, we provide a detailed breakdown of the existing studies and classify them by the supply chain drivers, algorithms used, data sources, and industrial sectors. Last, we provide critical managerial insights, discuss the current challenges, and shed light on future research avenues that may elevate the current state of the art of RL applications for SCM. The remainder of this article is organised as follows. Section 2 presents the review methodology and the procedure of selecting relevant literature. Section 3 introduces the RL paradigm and its mathematical foundations with reference to SCM. In Section 4 we perform a descriptive analysis to show relevant metrics of the identified publications such as subject areas, publication years and countries of origin. In addition, we show the bibliometric networks of citations and keywords to identify the most relevant publications. Section 5 presents a classification framework that categorises the publications according to four characteristics: supply chain drivers, algorithms, data sources,
and industrial sectors. Section 6 provides a discussion of the findings from the classification. Finally, Section 7 gives a critical analysis with a special focus on research gaps and managerial insights for production managers in the industry.

2. Review methodology

Figure 1. Literature review methodology.

Outlining the state of knowledge for RL in SCM requires a structured review methodology. Snyder (2019) defines different review methodologies that depend on the review's objectives and the research discipline. In this case, the semi-systematic type, a mix of quantitative and qualitative analysis, is most suitable. A semi-systematic literature review is beneficial when the topic is not clearly delimited and has been studied in different research disciplines (Wong et al. 2013). It aims to present current applications and the general state of knowledge to show an agenda for further research (Snyder 2019).

The initial assessment revealed that the considered topic still has very few contributions. Therefore, it is possible to include all relevant literature for the most part and analyse it systematically. Figure 1 shows the methodology for this literature review. After the research questions had been defined, the database for the systematic part of the literature review was selected. We only included Scopus in the literature review as it is the most extensive database for peer-reviewed scientific literature and highly overlaps with other databases such as Web of Science and Google Scholar. In addition, compared to Google Scholar, Scopus ensures a certain scientific standard as it only includes peer-reviewed literature (Mongeon and Paul-Hus 2015). The search term 'reinforcement AND learning AND supply AND chain' yielded 165 publications that were checked for violating at least one of three exclusion criteria. The first exclusion criterion removed empty proceedings documents, as the search yielded both the actual publications and the entire conference proceedings. Second, one publication was removed because Scopus reported a violation of good scientific practice and recommended not to cite the publication. Third, the authors excluded publications not relevant to the considered topic. Removal due to thematic irrelevance required the consent of at least two authors and was conducted after reading the abstracts. After applying the exclusion criteria, 103 publications remained that the authors read in detail. The number of publications (sample size n) varies slightly in the following sections because not all publications fit all
classification frameworks. We mention the sample size in each section and explain why some publications had to be excluded or classified into multiple categories.

3. Reinforcement learning

In contrast to such popular machine learning paradigms as supervised and unsupervised learning, reinforcement learning (RL) is akin to instrumental conditioning in physiology (Sutton and Barto 2018). In instrumental conditioning, animals learn associations between stimulus and response, such that given a stimulus (or environmental state), the animal tries a response (or action). If the response outcome for a given stimulus is positive, the connection between the stimulus and response is reinforced (Thorndike 1898). The RL paradigm is driven by the idea that intelligent systems can learn through a similar process of trial and error (Michie and Chambers 1968). This idea appears to be extremely powerful when applied to the adaptive control of highly complex and dynamic systems such as supply chains (Kegenbekov and Jackson 2021).

Formally, RL can be defined as a Markov Decision Process (MDP). The MDP serves as the flexible framework for goal-directed learning that can be described as a tuple M = (S, A, P(s_{t+1}, r | s, a), R, γ), where:

• s ∈ S denotes a state from the set S of possible states of the environment.
• a ∈ A denotes an action from the set A of possible actions that an agent can execute to interact with the environment.
• P(s_{t+1}, r | s, a) denotes the transition probability of moving to the state s_{t+1} and receiving a reward r, given s_t ∈ S and a_t ∈ A.
• R ∈ ℝ is the expected reward received from the environment after the agent performs action a at state s (Adi et al. 2020).

The sequence of states, actions, and rewards produces a trajectory s_0, a_0, R_1, s_1, a_1, R_2, s_2, a_2, R_3, ..., s_n, where s_n stands for the terminal state. The goal of the RL agent is to find the optimal policy π : S → A that maps states into actions so that the cumulative expected return over the time horizon is maximised. The expected return over a finite time horizon can be defined as follows:

R_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ]    (1)

where γ ∈ [0, 1] is a discount factor determining how far the agent should look into the future.
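As a minimal illustration of Equation (1), the following Python sketch computes the discounted return of a single sampled trajectory; the reward values and discount factors are invented for the example:

    # Discounted return of Equation (1) for one sampled trajectory.

    def discounted_return(rewards, gamma=0.9):
        """Compute sum_k gamma**k * r_{t+k} over a finite reward sequence."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    # Hypothetical per-period rewards of an inventory agent
    # (e.g. revenue minus holding and backlog costs).
    rewards = [4.0, -1.5, 2.0, 3.5, -0.5]

    print(discounted_return(rewards))             # far-sighted: gamma = 0.9
    print(discounted_return(rewards, gamma=0.0))  # myopic: only the immediate reward counts

With γ close to 1, distant rewards weigh almost as much as immediate ones, whereas γ = 0 yields a purely myopic controller.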
It is worth mentioning that the MDP framework is not perfect. For example, it does not address the crucial credit-assignment problem (Minsky 1961). Namely, if there is a long trajectory of states and actions prior to the received reward, it is unclear how to determine which actions should get credit for the eventual reward. Since it is often unclear whether the performed action actually contributed to the gained reward, a standard solution is to apply n-step discounted returns, where the cumulative rewards for the action a_t for n steps are exponentially weighted by a discount factor γ ∈ [0, 1] as in Equation (1). Nevertheless, the MDP is a powerful framework for goal-oriented autonomous learning. An adaptive controller based on this principle can learn how to make optimal or nearly-optimal decisions in dynamic and stochastic supply chain environments (Figure 2).

Figure 2. MDP in the SCM context.

The supply chain environment can become too complex for the classic reinforcement learning approach. This is where deep reinforcement learning (DRL) comes into play. The 'deep' part refers to the application of an artificial neural network to estimate possible sequences of actions and associate them with long-term rewards, increasing the manageability of the solution space. Besides, since agents based on DRL do not store all state and value pairs in a table, they become capable of generalising the value of states that have not been seen before or, rigorously speaking, have not been encountered during the training phase.

4. Analysis and findings

The publications considered are mostly conference papers or journal articles. Figure 3 shows the different proportions of the investigated publications in terms of document type and subject area. Scopus assigns two subject areas to each document, so the analysis includes 206 data points. The most significant proportion of documents, 34.5%, was from computer science, followed by engineering (25.7%) and mathematics (12.1%). Despite its small share (3.4%), the presence of chemical engineering was surprising. Although SCM is an essential topic in economics research, purely economic subject areas are underrepresented.
Figure 3. Classification and proportion of investigated publications in terms of document type and subject area. (a) Type of the publications (n = 103) and (b) Subject area of the publications (n = 206).

Figure 4. Publications by year (n = 101).

The first publication on RL in SCM was published in 2000. Since then, the number of publications has tended to increase. This observation corresponds to the general trend that the number of publications is increasing in almost all research disciplines (Ware and Mabe 2015). Figure 4 shows the considered publications by year. Due to the low number of publications per year, the numbers are highly volatile. The authors can explain neither the local increase in 2008 nor the extreme jump in published documents from 2018 to 2019. Individual circumstances and statistical fluctuations are therefore assumed. The year 2022 was not considered fully, so two publications were excluded from the figure, reducing the total number to 101.

We used VOSviewer to investigate the considered bibliometric network further (van Eck and Waltman 2009). The data basis was a CSV export of the 103 selected literature sources from the Scopus database. There is a strong linkage between occurring keywords for the considered documents. This linkage is probably due to the relatively small number of publications on the literature's specific topic. The analysis considers only keywords that appeared in at least five documents. Of 910 keywords, 46 meet this threshold, and 186 keywords are mentioned at least two times. Figure 5 shows the bibliometric network of the 46 keywords linked with co-occurring keywords. The size of the circles represents the number of occurrences of the keyword. The keywords 'supply chain management', 'supply chains', and 'reinforcement learning' appear most frequently, as they are similar to the search terms of the review. We classify the keywords into three main clusters presented in Figure 5, following the approach of Ivanov et al. (2021):

(1) Supply chain drivers and algorithms (red colour)
    (a) Supply chain drivers (e.g. supply chain management, supply chains, supply chain, inventory management, inventory control)
    (b) Artificial intelligence (e.g. reinforcement learning, machine learning, deep reinforcement learning, deep learning, learning algorithms, reinforcement learning techniques)
(2) Methods and data (green colour), e.g. multi agent systems, heuristic methods, computer simulation, mathematical models, learning systems, production control, education, Markov processes
(3) Applications (blue colour), e.g. commerce, sales, manufacture, costs

These classifications will be used for structuring our further analyses.

5. Classification framework

The review of publications revealed that RL models in SCM significantly differ in their settings, assumptions, aims and technologies. From these differences, we derived four criteria that are possible to map in a classification framework and support answering the research questions. These criteria are supply chain driver, algorithm, data source and industrial sector. The first criterion captures the supply chain setting and the aim of the models because supply chain drivers directly affect the supply chain performance and define the target of the optimisation model. The second criterion assesses the RL
model from a technical point of view and evaluates the underlying algorithms and technologies. The classification of data sources and the industrial sectors is necessary to assess the dissemination of RL in real-world supply chains.

Figure 5. Visualization of the bibliometric network of keywords with VOSviewer (n = 46).

The proposed classification framework is hierarchic with three levels: criteria, classes and subclasses. The classes originate from well-recognised academic and industrial standard classification systems, which are the supply chain decision-making framework (Chopra and Meindl 2013), the OpenAI taxonomy of RL algorithms (Achiam 2018) and the International Standard Industrial Classification (ISIC) (UNO 2008). In the case of the OpenAI taxonomy and the ISIC, the frameworks included subclasses that could be used for this classification framework. We also propose novel subclasses representing the range of reviewed publications for the supply chain drivers and data sources. We only include subclasses to which at least one publication was assigned. However, it is possible to add new subclasses in the future. The following subsections describe the classes and subclasses of each criterion.

5.1. Supply chain drivers

Every supply chain aims to maximise the overall value generated, which is the difference between the value of the final product and the costs incurred in the supply chain (Chopra and Meindl 2013). There is not only one way to achieve this objective but various interconnected supply chain drivers that impact the overall value generated. Chopra and Meindl (2013) define six drivers that influence the supply chain performance: facilities, inventory, transportation, information, sourcing, and pricing. This section derives a classification framework for RL models from these drivers. As RL is commonly used to address optimisation problems, it usually adjusts one or more of the drivers so that their setting leads to improved supply chain performance. However, it is impossible to represent all drivers in a single model. Instead, most identified publications consider firmly delimited supply chain optimisation problems and target selected drivers. Reviewing the targeted driver of each model allows for classifying the models and deriving applications later on. Table 1 maps each considered publication to one of the supply chain drivers. Almost all the publications implemented an RL software model and presented quantitative results. Therefore, assigning most publications to one supply chain driver is possible considering the input variables and performance metrics. In the rare cases where a publication considered multiple supply chain drivers, the publication was categorised according to the most crucial driver.

Table 1. Classification of publications according to supply chain drivers.


Supply chain driver Publications
Inventory
Customer-driven replenishment Ganesan, Sundararaj, and Srinivas (2021), Meisheri et al. (2021), Zwaida, Pham, and
Beauregard (2021), Barat et al. (2020), Singi et al. (2020), Vanvuchelen, Gijsbrechts, and
Boute (2020), Barat et al. (2019), Lee and Sikora (2019), Dogan and Güner (2015), Ghorbel,
Addouche, and El Mhamedi (2015), Li et al. (2015), Yang and Zhang (2015), Mehta and
Yamparala (2014), Chodura, Dominik, and Koźlak (2011), Jiang and Sheng (2009), Xu,
Zhang, and Liu (2009), Kim, Kwon, and Baek (2008), Kim et al. (2005), Ravulapati, Rao, and
Das (2004), and Rao, Ravulapati, and Das (2003)
Supplier-driven replenishment Hachaichi, Chemingui, and Affes (2020), Tariq Afridi et al. (2020), Yang and Zhang (2015),
Zarandi, Moosavi, and Zarinbal (2013), Zhang, Xu, and Zhang (2013), Sui, Gosavi, and
Lin (2010), Xu, Zhang, and Liu (2009), Jiang (2008), Kim, Kwon, and Baek (2008), Kwon
et al. (2008), Li, Guo, and Zuo (2008), Back, Kim, and Kwon (2006), Kim et al. (2005), and Lin
and Pai (2000)
Global order management with complete information Chen et al. (2021), Kegenbekov and Jackson (2021), Perez et al. (2021), F. Wang and
Lin (2021), Alves and Mateus (2020), Bharti, Kurian, and Pillai (2020), Peng et al. (2019),
Zhou and Zhou (2019), Zhou, Purvis, and Muhammad (2016), Sun, Zhao, and Yin (2010),
Chaharsooghi, Heydari, and Zegordi (2008), Zhang and Bhattacharyya (2007), Giannoccaro
and Pontrandolfo (2002), and Pontrandolfo et al. (2002)
Global order management with incomplete information Mortazavi, Khamseh, and Azimi (2015), Saitoh and Utani (2013), Valluri, North, and
MacAl (2009), Sheremetov and Rocha-Mier (2008), Zhang and Bhattacharyya (2007), Van
Tongeren et al. (2007), Sun et al. (2006), Gang and Ruoying (2006), Sheremetov, Rocha-Mier,
and Batyrshin (2005), Sheremetov and Rocha-Mier (2004), and Bukkapatnam and Gao (2000)
Supply chain scheduling Lu et al. (2021), Lee and Sikora (2019), Marandi and Fatemi Ghomi (2019), Aissani et al. (2012),
Dahlem and Harrison (2010), Sheremetov and Rocha-Mier (2008), Li and Zhao (2006),
Sheremetov, Rocha-Mier, and Batyrshin (2005), Simsek, Albayrak, and Korth (2004),
Sheremetov and Rocha-Mier (2004), Cao, Xi, and Smith (2003), and Ren, Chai, and Liu (2002)
Transportation
Vehicle routing Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021), Li et al. (2021), and Habib, Khan, and
Uddin (2017)
Vehicle scheduling Adi, Bae, and Iskandar (2021), and Bretas et al. (2019)
Transportation bidding Zhanguo (2008), and Tang and Kumara (2005)
Platooning configuration Puskás, Budai, and Bohács (2020)
Network flow routing Hwangbo and Yoo (2018)
Information
Forecasting Chien, Lin, and Lin (2020), Makridis et al. (2020), Zhang et al. (2019), Ko et al. (2011), and
Reindorp and Fu (2011)
Collaboration Xiang (2020), Li et al. (2014), Zhao, Jiang, and Feng (2010), Kaihara and Fujii (2008), and
Dangelmaier et al. (2006)
Human behaviour evaluation Ghavamipoor and Golpayegani (2020), Craven and Krejci (2017), Craven and Krejci (2016), De
Maio et al. (2016), and Valluri, North, and MacAl (2009)
Risk management Aboutorab et al. (2022), Zhao et al. (2021a), and Yang et al. (2019)
Sourcing
Supplier selection Kim, Bilsel, and Kumara (2008), and Tae, Bilsel, and Kumara (2007)
Supplier segmentation Liu (2020), Aghaie and Heidary (2019), and Du and Jiang (2019)
Pricing
Trading Hirano et al. (2021), Du and Xiao (2019), Chatzidimitriou, Symeonidis, and Mitkas (2013), and
Reeder et al. (2008)
Market simulation Liu, Howley, and Duggan (2011)
Other
Concept Filatova, El-Nouty, and Fedorenko (2021), Serrano, Mula, and Poler (2021), and Mezouar and El
Afia (2019)
Literature review Sianaki et al. (2019)

The last class includes the remaining publications, which present exclusively conceptual considerations on RL and literature reviews.

5.1.1. Inventory

Inventory comprises decisions that impact raw materials, work in process and finished goods, such as the management of orders and the determination of safety stock and cycle stock (Chopra and Meindl 2013). Inventory management or replenishment is a well-known optimisation problem in SCM. It deals with the questions of when to order and what quantity to order with the objective of finding a balance between material availability and inventory costs. Traditional inventory models such as the economic order quantity (EOQ) model provide quick solutions assuming constant demand and complete information (Hax and Candea 1984). However, in real supply chains, the main challenge is dealing with changing demands, unknown lead times, disruptions and incomplete information. RL approaches have the advantage that they can learn adaptive, situation-dependent
ordering strategies contrary to traditional inventory models, which have fixed order quantities, safety stocks and cycle stocks.
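For reference, the EOQ model mentioned above reduces to the closed-form order quantity Q* = sqrt(2DS/H), where D is the annual demand, S the fixed cost per order, and H the holding cost per unit and year; a minimal sketch with invented figures:

    from math import sqrt

    def eoq(annual_demand, order_cost, holding_cost):
        """Economic order quantity: Q* = sqrt(2 * D * S / H)."""
        return sqrt(2 * annual_demand * order_cost / holding_cost)

    # Hypothetical figures: D = 10,000 units/year, S = 50 per order,
    # H = 2 per unit and year -> a fixed order quantity of ~707 units.
    print(eoq(10_000, 50, 2))

The result is a single fixed order quantity, which is precisely the rigidity that adaptive RL policies aim to overcome.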
Figure 6. Types of inventory problems. Customer-driven replenishment and supplier-driven replenishment take the view of the supplier and have the objective to determine optimal actions for their operations. Global order management takes a global view and either aims to find a global optimum for all participants (complete information) or observe the system behaviour if each participant takes his own actions (incomplete information). Supply chain scheduling considers a global view and tries to find an optimal schedule in a multi-site environment.

The inventory driver of the framework includes all publications that aim at determining order quantities and order times, production quantities and production times, or scheduling task sequences. We identified five major subclasses that differ in view, driving player, number of players and degree of information exchange. Figure 6 illustrates the five types of inventory problems. Publications that compare multiple views are assigned to multiple subclasses. For example, Yang and Zhang (2015), Kim et al. (2005), Kim, Kwon, and Baek (2008) and Xu, Zhang, and Liu (2009) compare centralised and decentralised supplier-customer relationships.

• Customer-driven replenishment includes replenishment models in which a customer orders raw materials from one or more independent suppliers without having information about the supplier's processes. The customer, who is the driving entity of the ordering process, takes a self-centred view and considers merely its own performance metrics. The customer-driven view is primarily located in supply chains based on the pull principle. For instance, Lee and Sikora (2019) present an agent that handles procurement, customer sales and production scheduling simultaneously and test it in the supply chain environment of a competition for trading agents. Zwaida, Pham, and Beauregard (2021) consider a hospital supply chain and train a DRL agent to prevent drug shortages. Ganesan, Sundararaj, and Srinivas (2021) utilise RL differently because they do not determine each order independently but let the agent dynamically choose from five common inventory policies.

• Supplier-driven replenishment considers replenishment models in which the supplier is the driving entity of the ordering process and tries to fulfil a certain service level determined by the customer. Nevertheless, the supplier is independent of the customer and takes a self-centred view. A well-known example of this subclass is vendor-managed inventory models. The supplier-driven view is mainly located in supply chains using the push principle. For instance, Tariq Afridi et al. (2020) consider a vendor-managed inventory setting in the semiconductor industry and solve it with RL and discrete-event simulation. Hachaichi, Chemingui, and Affes (2020) study a similar inventory problem with an agent that learns to place optimal orders.

• Global order management with complete information includes models in which the RL agent takes a global view and has complete information on the status of each supply chain entity. The agent aims to find a global optimum for the entire supply chain. These models can be both single-agent or multi-agent systems. Real-world examples are, for instance, intra-company supply chains with multiple sites that exchange information. Being one of the first and most cited contributions to the topic, the work of Pontrandolfo et al. (2002), and Giannoccaro and Pontrandolfo (2002) belongs to this subclass. They consider a supply chain model with three stages and one agent at each stage. The agents act independently but
have information about the other stages and try to maximise the total reward. Perez et al. (2021) compare deterministic linear programming, multi-stage stochastic linear programming and RL in a supply chain with five stages and nine nodes. They found that RL performs well considering inventory balancing, but the stochastic model yields the highest profit.

• Global order management with incomplete information considers classic supply chains comprising several independent entities that exchange information to a lesser extent. These models are usually multi-agent systems in which each RL agent manages the operations of a single entity in order to find a local optimum for the entity. For instance, Mortazavi, Khamseh, and Azimi (2015) present a simulation-based multi-agent RL model in a four-stage environment where each agent tries to optimise its operations.

• Supply chain scheduling has strong similarities to classic production scheduling problems but extends the scope to multiple sites. These models consider not only the inventory levels of the entities but also include restrictions such as raw material availability, batch sizes, or sequence constraints. For instance, Aissani et al. (2012) solve a multi-site flexible job shop problem with transfer times between the sites. Lu et al. (2021) try to minimise the total supply chain order completion time by assigning orders to suppliers.

5.1.2. Transportation

Transportation encompasses the movement of inventory from one site to another in the supply chain (Chopra and Meindl 2013). Transportation problems such as the vehicle routing problem or the travelling salesman problem are widespread in academia, but they are not only considered in the context of SCM. Hence, the structured literature search yielded only a small excerpt of models that explicitly consider supply chain transportation problems. Transportation problems usually aim to find the shortest or fastest route between multiple nodes. In addition, these problems can have several restrictions such as capacity constraints, time window constraints or pickup and delivery restrictions. Traditionally, vehicle routing problems are solved using computationally intensive constraint optimisation, heuristics, or metaheuristics from operations research (one such construction heuristic is sketched after the list below). RL is expected to provide equal or better results in less computation time. We identified five subclasses of transportation problems that are solved using RL:

• Vehicle routing includes all models that solve the classic vehicle routing problem with RL. It is composed of routing and route segmentation in consideration of constraints. In the vehicle routing problem, vehicles are usually not a scarce resource, but the focus is instead on finding the best route. For instance, Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021) consider last-mile operations in Bogota, Colombia, and train an agent to route vehicles depending on real-time circumstances.

• Vehicle scheduling has similar objectives to vehicle routing but considers problems where the vehicles are scarce resources and have to fulfil multiple tasks during a specific period. This is often the case in small-scale, closed systems such as the railway traffic management problem (Bretas et al. 2019) or the inter-terminal transportation problem (Adi, Bae, and Iskandar 2021).

• Transportation bidding considers transportation problems in which minimising costs is a major part of the problem. In a common scenario, multiple logistics service providers compete for transport orders by trying to place the lowest bid. This scenario is also suitable for multi-agent systems because the service providers are independent entities with individual sets of actions. Tang and Kumara (2005) study a game-theoretic task delegation problem in which independent logistics service providers bid on transportation orders. Zhanguo (2008) also considers a task delegation problem, but contrary to Tang and Kumara (2005), the transportation agents do not compete but cooperate to maximise the overall profit.

• Platooning configuration is a unique transportation problem in which multiple trucks with different destinations must be coupled in the form of platoons so that costs or driven distance are minimal. Puskás, Budai, and Bohács (2020) found that RL is capable of controlling the platooning and provides a better solution than a heuristic.

• Network flow routing deals with transporting indivisible goods such as liquids, gases, energy, or bulk goods that flow through stationary networks. A network of nodes and arcs with different flow rates represents the problem. The objective is to find the maximum flow state of the network or the minimum distance between the sink and the source. Hwangbo and Yoo (2018) consider a hydrogen supply network in the Republic of Korea. They implement an agent that manages the supply of wind energy, solar energy, and wastewater to the hydrogen production plants.
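To make the traditional baseline concrete, the sketch below implements the nearest-neighbour construction heuristic mentioned in the introduction to this subsection; the coordinates are invented, and a realistic solver would additionally handle capacities and time windows:

    from math import dist  # Euclidean distance, Python 3.8+

    def nearest_neighbour_route(depot, stops):
        """Greedy construction heuristic: always visit the closest unvisited stop."""
        route, current, remaining = [depot], depot, set(stops)
        while remaining:
            nxt = min(remaining, key=lambda p: dist(current, p))
            remaining.remove(nxt)
            route.append(nxt)
            current = nxt
        return route + [depot]  # close the tour at the depot

    # Hypothetical delivery locations on a plane.
    print(nearest_neighbour_route((0, 0), [(2, 3), (5, 1), (1, 7), (4, 4)]))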
5.1.3. Information

Information includes measures that turn data into information or improve the ability to make justified decisions.
This class directly affects all the other supply chain drivers because data is the basis for every decision-making process (Chopra and Meindl 2013). The applications of machine learning models for information retrieval and processing are manifold, as this domain is a significant scope of machine learning in general. However, supervised learning algorithms are state-of-the-art in this domain because their learning process is more straightforward whenever clear input-output pairs exist, which is usually the case for applications such as forecasting. Therefore, the prevalence of RL models is relatively low, just like the number of models assigned to this class. Nevertheless, there are valid use cases for the application of RL:

• Forecasting deals with predicting the future from historical data to support decision-making (Hyndman and Athanasopoulos 2018). Most of the identified RL models do not consider forecasting in the classic sense but rather address more complex problems closely related to forecasting. For instance, the RL agent of Chien, Lin, and Lin (2020) selects demand forecasting methods rather than performing the actual forecasting. Ko et al. (2011) address a unique application of RL for predicting the location of missing products based on their planned path and RFID tracking.

• Collaboration includes models that aim to improve the collaboration between supply chain entities. Contrary to global order management with complete information, which has very similar objectives, collaboration refers to models that do not consider inventory decisions but other collaboration issues. For example, Dangelmaier et al. (2006) train an RL agent that learns to solve bilateral conflicts by choosing the best actions in a conflict situation. In addition, collaboration includes models that consider the negotiation between supply chain entities, such as in Kaihara and Fujii (2008) or Li et al. (2014).

• Human behaviour evaluation focuses on the role of the human in supply chain decision-making by evaluating human behaviour or considering the actual decision-making process. Based on the models, RL has two prominent roles in this subclass: learning humans' behaviour in supply chain environments or simulating human behaviour with multi-agent systems. For instance, De Maio et al. (2016) investigate group decision-making processes in SCM and train their RL model to weigh group members' decisions to find the best overall decision. Ghavamipoor and Golpayegani (2020) use RL to learn the expected profit from customers in a webshop and adapt the service level to it. In contrast, the RL agents of Craven and Krejci (2017) and Craven and Krejci (2016) imitate the behaviour of farmers and producers to derive a pricing strategy from it.

• Risk management aims to ensure supply chain operations by taking proactive and reactive measures. According to Chopra and Sodhi (2004), seven groups of risks exist in SCM, which are disruptions, delays, wrong forecasts, price fluctuations, the bankruptcy of partners, wrong production capacities or high inventory levels. RL is, for example, used to rate the importance of risks (Aboutorab et al. 2022) and propose recovery actions after supply chain disruptions (Yang et al. 2019).

5.1.4. Sourcing

Sourcing addresses the organisation of participants in the supply chain, especially supplier selection (Cavalcante et al. 2019) and segmentation, as well as outsourcing and insourcing decisions (Chopra and Meindl 2013). The reviewed publications showed that the sourcing and inventory drivers are closely related when considering real-world supply chain problems. Many models consider sourcing decisions as well as inventory decisions to some extent. Therefore, the transition between both classes is fuzzy, and it is important to distinguish their scope for the classification framework. Models have been assigned to sourcing when they focus on tactical to strategic decision-making, such as selecting suppliers or sourcing strategies. In contrast, inventory models consider operational to tactical decisions such as order management and scheduling. We identified two main applications of RL in sourcing:

• Supplier selection aims to select one or more vendors that supply a customer in the best way possible concerning criteria such as costs, reliability, or quality. As described in Subsection 5.1, the use of RL in this domain is limited because it is a typical strategic decision. Only the publications of Tae, Bilsel, and Kumara (2007) and Kim, Bilsel, and Kumara (2008) have been identified, which study a game-theoretic model of multiple suppliers and one manufacturer that are modelled as agents. The manufacturer requests orders, and the suppliers can bid on the orders and learn from the past behaviour of competing suppliers.

• Supplier segmentation models aim to find a sourcing strategy considering a small number of fixed, substitutable suppliers. Companies commonly have primary and backup suppliers, which help in certain situations. Du and Jiang (2019) model a system in which an agent determines how to split orders of a manufacturer among two substitutable suppliers. Aghaie and Heidary (2019) study a similar environment with a
primary supplier, a backup supplier, and a spot market. Liu (2020) trains agents to take insourcing and outsourcing decisions flexibly.

5.1.5. Pricing

Pricing comprises all decisions regarding costs and prices, such as bidding, contracting, marketing strategies and sales planning (Chopra and Meindl 2013). In SCM, it is an essential topic for determining the delivery costs of semi-finished products between companies and the selling prices to consumers. Pricing is a common topic in economics and marketing, but a few contributions also explicitly focus on SCM and apply reinforcement learning to it. Two applications are most important:

• Trading is the most common application of RL in pricing. Agents are responsible for trading on markets, determining selling prices or bidding on offers to maximise profit. For instance, Reeder et al. (2008) train agents to participate in the market of a multiplayer online game. They conclude that trading agents can participate in virtual economies equally to humans, which can also be helpful for real supply chain environments. Chatzidimitriou, Symeonidis, and Mitkas (2013) use an agent to manage production and customer orders in terms of costs. Du and Xiao (2019) consider an environment with one supplier, two retailers and several customers and let the agents choose between uniform and differentiated pricing strategies. Hirano et al. (2021) investigate how to prevent unintentional collusion when multiple agents act in a supply chain environment.

• Market simulation includes models that use RL to simulate the behaviour of humans in the market, especially to evaluate the price sensitivity of consumers. Liu, Howley, and Duggan (2011) give insights into the relationship between market price and consumer behaviour in a supply chain environment.

5.1.6. Other

The literature search also yielded a few publications that do not fit the classification framework of supply chain drivers because they do not provide quantitative models but rather general considerations. Three publications present concepts on RL in SCM. Mezouar and El Afia (2019) propose to extend the SCOR model with a sixth standard process which they call 'Learn'. The new process shall add RL capabilities to the SCOR model and make supply chains more adaptive. Serrano, Mula, and Poler (2021) present a conceptual framework based on a digital twin which shall reschedule the master production schedule of supply chains in real-time using DRL. Filatova, El-Nouty, and Fedorenko (2021) create a mathematical model for RL in supply chain collaboration and sustainability.

Surprisingly, the search results contain only one literature review. The review of Sianaki et al. (2019) shows the state of the art of machine learning in smart cities, healthcare and transportation. However, no literature review explicitly focuses on RL in SCM.

5.2. Algorithms

We modified the OpenAI taxonomy to classify papers based on the applied RL algorithm (Achiam 2018). All the RL algorithms can be divided into model-free and model-based, depending on whether the agent has direct access to a model of the environment. In this context, an environment model refers to a function that predicts state transitions and rewards. The most famous model-based approach is AlphaZero, the DRL algorithm that could master classic board games such as Shogi and Chess through self-play (Silver et al. 2017). The common challenge is that the agent can discover and exploit bias in the model, resulting in a policy that performs well in the learned model but behaves sub-optimally in the real environment. That is why model-based approaches are hardly applicable if the problem is characterised by stochasticity, long-term planning horizons, partial observability, and imperfect information. Since all the above traits are common in complex supply chains, it is no wonder that all the reviewed studies took advantage of model-free RL. Model-free RL, in its turn, comprises two major categories: policy optimisation and value-based algorithms (see Figure 7). Table 2 contains the academic papers classified by the proposed modification of the OpenAI taxonomy. It is worth mentioning that it was possible to classify RL algorithms unambiguously only for 61 publications. The remaining academic papers do not explicitly mention the algorithm used.

5.2.1. Value-based algorithms

In order to find an optimal policy π(s), value-based methods derive an approximator Q_θ(s, a), parametrised by a set of parameters θ, that maps states to the corresponding actions as if they are stored in a table. This optimisation is conducted off-policy, which means that each update utilises data collected at any point during training. As a result, the policy π* is approximated by the optimal action-value function Q*(s, a), and the agent takes action according to the following equation:

a(s) = arg max_a Q_θ(s, a)    (2)

Figure 7. The proposed taxonomy encompasses the RL algorithms applied to supply chain problems.

Table 2. Classification of publications according to the applied algorithms.


Algorithm Publication
Value-based algorithms
Classic Q-learning Aboutorab et al. (2022), Chen et al. (2021), Hirano et al. (2021), Meisheri et al. (2021), Wang and Lin (2021), Zhao
et al. (2021b), Chien, Lin, and Lin (2020), Liu (2020), Makridis et al. (2020), Puskás, Budai, and Bohács (2020),
Aghaie and Heidary (2019), Du and Jiang (2019), Marandi and Fatemi Ghomi (2019), Mezouar and El
Afia (2019), Hwangbo and Yoo (2018), Craven and Krejci (2017), Craven and Krejci (2016), Zhou, Purvis,
and Muhammad (2016), Dogan and Güner (2015), Mortazavi, Khamseh, and Azimi (2015), Li et al. (2014),
Saitoh and Utani (2013), Sun and Zhao (2012), Reindorp and Fu (2011), Sui, Gosavi, and Lin (2010), Sun,
Zhao, and Yin (2010), Chaharsooghi, Heydari, and Zegordi (2008), Kwon et al. (2008), Reeder et al. (2008),
Zhanguo (2008), Van Tongeren et al. (2007), Zhang and Bhattacharyya (2007), Dangelmaier et al. (2006),
Tang and Kumara (2005), Ravulapati, Rao, and Das (2004), Sheremetov and Rocha-Mier (2004), and Rao,
Ravulapati, and Das (2003)
DQN Li et al. (2021), Zwaida, Pham, and Beauregard (2021), Singi et al. (2020), Tariq Afridi et al. (2020), and Nanduri
and Saavedra-Antolínez (2013)
SARSA Habib, Khan, and Uddin (2017), Aissani et al. (2012), and Dahlem and Harrison (2010)
Policy optimisation algorithms
Policy Gradient Huang and Tan (2021), Peng et al. (2019), and Mehta and Yamparala (2014)
A2C/A3C Zhu, Ke, and Wang (2021), and Barat et al. (2019)
PPO Kegenbekov and Jackson (2021), Perez et al. (2021), Alves and Mateus (2020), Hachaichi, Chemingui, and
Affes (2020), and Vanvuchelen, Gijsbrechts, and Boute (2020)
SMART Giannoccaro and Pontrandolfo (2002), and Pontrandolfo et al. (2002)
Other
Evolutionary Zhao, Jiang, and Feng (2010), and Sheremetov and Rocha-Mier (2008)
EM Govindan and Al-Ansari (2019)

Examples of value-based methods discovered during the literature review include:

• Classic Q-learning finds an optimal policy in the sense of maximising the expected value of the total reward by identifying an optimal action-selection policy for a given MDP (Watkins and Dayan 1992). For example, Chaharsooghi, Heydari, and Zegordi (2008) proposed a Q-learning-based approach to derive ordering policies for multiple supply chain entities in an integrated manner to minimise total inventory costs. Zhou, Purvis, and Muhammad (2016) combined Q-learning with system dynamics in a multi-agent architecture to address a collaborative planning problem in global supply chains. Dogan and Güner (2015) used a Q-learning agent as a base, complementing it with stochastic dynamic programming and agent-based simulations. The study analysed simultaneous ordering and pricing decisions for retailers working in a multi-echelon supply chain, such that retailers compete for the same market where the market demand is uncertain. Aboutorab et al. (2022) demonstrated how Q-learning can assist risk managers in proactively identifying the risks to supply chain operations. Wang and Lin (2021) proposed a multi-agent-based collaborative replenishment model for supply chains. Q-learning was applied to optimise the supply network topology such that the replenishment time is minimised. Meisheri et al. (2021) incorporated Q-learning into the decision-making framework to solve multi-product, multi-period inventory management problems with uncertain demand.
• Deep Q-Networks (DQN), initially invented by DeepMind, are well-known for their ability to master a wide range of computer games at a superhuman level. The algorithm was developed by enhancing the classic Q-learning algorithm with deep neural networks and the experience replay technique (Mnih et al. 2013). For example, Li et al. (2021) proposed a data-driven approach based on DQN to solve industry-scale dynamic pickup and delivery problems in supply chains. Singi et al. (2020) compared several DQN architectures addressing the problem of inventory management in a single-product supply chain. Zwaida, Pham, and Beauregard (2021) faced a healthcare inventory management problem. An RL agent based on DQN was applied to approach manufacturing problems, supply and demand issues, and raw material problems across the entire supply chain. The proposed solution could automatically make a drug refilling decision to prevent a drug shortage.

• State-action-reward-state-action (SARSA) is a modification of the classic Q-learning algorithm. Unlike classic Q-learning, SARSA is an on-policy RL algorithm in the sense that the action is performed by the current policy to learn the Q-value (Rummery and Niranjan 1994); the sketch following this list contrasts the two update rules. A notable example of SARSA application in the supply chain context is the work by Habib, Khan, and Uddin (2017), who applied the algorithm to an MDP that can incorporate the design of a multi-stage supply chain network. Additionally, Dahlem and Harrison (2010) applied SARSA to load-balancing applications and supply chain network design. Aissani et al. (2012) proposed a multi-agent approach based on SARSA to perform adaptive scheduling in a multi-echelon supply chain.
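A minimal tabular sketch contrasting the two update rules; the action set, learning rate and exploration constant are invented for the example, and both updates are assumed to share the same ε-greedy behaviour policy:

    import random
    from collections import defaultdict

    Q = defaultdict(float)        # tabular approximator: Q[(state, action)]
    ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
    ACTIONS = [0, 1, 2]           # e.g. three candidate order quantities

    def epsilon_greedy(state):
        """Behaviour policy: mostly greedy w.r.t. Q, occasionally exploratory."""
        if random.random() < EPS:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def q_learning_update(s, a, r, s_next):
        # Off-policy: bootstrap from the best next action,
        # regardless of the action the behaviour policy will take.
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: bootstrap from the action actually selected
        # by the current (epsilon-greedy) policy in s_next.
        target = r + GAMMA * Q[(s_next, a_next)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

The only difference is the bootstrapping target: Q-learning maximises over the next actions (off-policy), while SARSA uses the action actually taken (on-policy).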
5.2.2. Policy optimisation algorithms model to depict the dynamics in ride-sourcing mar-
Methods based on policy optimisation, on the other kets with A2C-based RL agents aiming to maximise
hand, represent a policy explicitly as πθ (a | s). The their individual income.
parameters θ are usually optimised directly by gradient • Proximal Policy Optimisation (PPO) also utilises the
ascent on the expected return J(πθ ). This optimisation actor-critic method, such that the actor maps the
is conducted on-policy, which means that each policy observation to action, and the critic provides an
update is performed based on data collected while util- expectation of the rewards (Schulman et al. 2017).
ising the most recent version of the policy. PPO is distinguished by simple implementation, gen-
Examples of methods based on policy optimisation erality, and low sample complexity (Kegenbekov and
discovered during the literature review include: Jackson 2021). Examples of PPO applications to sup-
• Policy Gradient aims to gradually increase the probabilities of selecting actions that lead to higher returns until the optimal policy is obtained. The algorithmic implementation of policy gradient performs by updating the policy parameters using stochastic gradient ascent on the policy performance, namely θk+1 = θk + α∇θJ(πθk). In the SCM context, policy gradient has been applied, for instance, to optimise the time for the supply chain order management process. Peng et al. (2019) compared the policy gradient algorithm with the (r, Q) policy, a common approach in classic operations research (Jackson 2020). Both policies were compared on the multi-period capacitated supply chain optimisation problem under demand uncertainty. Mehta and Yamparala (2014) formulated a general stochastic SCM problem as a multi-arm non-contextual bandit problem. A policy gradient directly approached the problem to find a robust policy in the new setting.
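The update θk+1 = θk + α∇θJ(πθk) can be estimated from sampled episodes. The following REINFORCE-style sketch shows this for a linear softmax policy; the feature dimensions, discount factor and episode interface are illustrative assumptions rather than a reconstruction of any cited implementation.

```python
import numpy as np

ALPHA, GAMMA = 0.01, 0.99
N_FEATURES, N_ACTIONS = 4, 3
theta = np.zeros((N_FEATURES, N_ACTIONS))    # policy parameters

def policy(state):
    """Softmax over linear action preferences."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(episode):
    """One gradient-ascent step from a list of (state, action, reward)."""
    global theta
    g, grads = 0.0, np.zeros_like(theta)
    for state, action, reward in reversed(episode):
        g = reward + GAMMA * g               # discounted return G_t
        probs = policy(state)
        one_hot = np.zeros(N_ACTIONS)
        one_hot[action] = 1.0
        # gradient of log pi(a|s) for a linear softmax policy:
        grads += np.outer(state, one_hot - probs) * g
    theta += ALPHA * grads                   # stochastic gradient ascent on J
```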
• Asynchronous Advantage Actor-Critic (A2C/A3C) are conceptually similar to the classical actor-critic methods, where the policy π(s) (actor) and the value estimate Vπ(s) (critic) are trained at the same time (Sutton et al. 1999). However, the core difference is that A2C/A3C take advantage of artificial neural networks as non-linear function approximators for both the value and the policy outputs. It is worth emphasising that the A2C and A3C algorithms are mathematically equivalent; however, they differ significantly in technical implementation. Namely, A2C can be efficiently executed on GPU hardware, which entails a significant computational advantage compared to the typically CPU-only A3C-based agents (Mnih et al. 2016). Examples of A2C/A3C applications for SCM include the work by Barat et al. (2019), who proposed an A2C-based approach to optimise transportation flows and inventory levels across a complex supply chain network to enable the deployment of the RL agent in the real-world system with minimal further tuning. Additionally, Zhu, Ke, and Wang (2021) developed an MDP model to depict the dynamics in ride-sourcing markets with A2C-based RL agents aiming to maximise their individual income.
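As a rough illustration of how the actor and critic are trained at the same time, the sketch below shows a minimal advantage actor-critic loss. The rollout tensors and the value-loss coefficient are assumptions for illustration; production implementations typically add an entropy bonus for exploration.

```python
import torch

def a2c_loss(log_probs, values, returns, value_coef=0.5):
    """Joint loss for the policy head (actor) and value head (critic).

    log_probs: log pi(a_t|s_t) of the actions taken in the rollout
    values:    critic estimates V(s_t)
    returns:   (bootstrapped) discounted returns G_t
    """
    advantages = returns - values.detach()         # A_t = G_t - V(s_t)
    actor_loss = -(log_probs * advantages).mean()  # policy gradient term
    critic_loss = (returns - values).pow(2).mean()
    return actor_loss + value_coef * critic_loss
```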
• Proximal Policy Optimisation (PPO) also utilises the actor-critic method, such that the actor maps the observation to an action, and the critic provides an expectation of the rewards (Schulman et al. 2017). PPO is distinguished by simple implementation, generality, and low sample complexity (Kegenbekov and Jackson 2021). Examples of PPO applications to supply chain problems include Hachaichi, Chemingui, and Affes (2020), who developed an RL agent capable of placing optimal orders in the supply chain, taking the stochastic lead times into account. Alves and Mateus (2020) considered a multi-period four-echelon supply chain as a sequential decision-making problem. The PPO algorithm was applied to find a policy that minimises the total operating costs. In a recent paper, Perez et al. (2021) compared several DRL techniques, including PPO, on a single-product, multi-period centralised supply chain under stochastic stationary consumer demand. Kegenbekov and Jackson (2021) demonstrated how a PPO-based agent could synchronise inbound and outbound flows and support business continuity operating in a stochastic and nonstationary environment if end-to-end visibility across the entire supply chain is ensured.
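At the core of PPO's simplicity is the clipped surrogate objective from Schulman et al. (2017). The sketch below shows that objective in isolation, assuming rollout tensors collected under the previous policy; it is a minimal illustration, not the implementation of any cited study.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective (negated, so it can be minimised)."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The clipping keeps every policy update close to the policy that collected the data, which is one reason for the stable training and low sample complexity noted above.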
• Semi-Markov Average Reward Technique (SMART) is an extension of the temporal difference algorithm that can be applied to a larger class of problems for which the underlying probability structures cannot be characterised solely by Markov chains (Das et al. 1999). Pontrandolfo et al. (2002) demonstrated that the SMART algorithm could be applied to optimise a networked production system that spans several geographic areas. Shortly after that, it was demonstrated that the SMART algorithm could also coordinate inventory policies adopted by different supply chain participants (Giannoccaro and Pontrandolfo 2002).

5.2.3. Other
We also identified papers that take advantage of Bayesian and evolutionary approaches to RL that are less common and do not fit into the original OpenAI taxonomy. Nevertheless, these approaches represent a promising alternative to value-based and policy-optimisation methods.

• Bayesian methods in the RL setting attempt to leverage Bayesian inference to incorporate information into the learning process. More specifically, this class of methods assumes that the system designer can add prior information about the problem in a probabilistic distribution, and the new information can be incorporated by standard Bayesian inference (Ghavamzadeh et al. 2015). The Bayesian methods applied to the supply chain domain are represented by the expectation-maximisation algorithm (EM) initially proposed by Dayan and Hinton (1997). Govindan and Al-Ansari (2019) used the EM algorithm in a computational framework to identify and prevent failures of critical infrastructures when global supply chains are subject to stresses and shocks.
• Evolutionary methods include a broad class of metaheuristics inspired by natural evolution and the 'survival of the fittest' principle (Mahapatra and Patnaik 2018). The combination of RL and evolutionary methods can take different forms. Some authors use evolutionary algorithms to search the space of RL policies and train the agent, as proposed by Moriarty, Schultz, and Grefenstette (1999). On the other hand, it is also possible to use RL for setting appropriate parameters of evolutionary algorithms (Sakurai et al. 2010). Evolutionary methods discovered during the literature review include variations of Genetic Algorithm-based RL (Long Zhao and Liu 1996). For example, Zhao, Jiang, and Feng (2010) combined classic Q-learning with a genetic algorithm to address the multi-echelon supply chain coordination problem.
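As a hedged illustration of the first combination form, policy search by an evolutionary algorithm, the following sketch evolves a population of simple base-stock policies with the episodic return as the fitness signal. The policy encoding and the toy fitness function are assumptions for illustration and do not reproduce any cited method.

```python
import random

POP_SIZE, N_ECHELONS, GENERATIONS = 20, 3, 50

def evaluate(policy):
    """Placeholder fitness: in practice, simulate one episode of the
    supply chain environment and return the accumulated reward."""
    return -sum((level - 10) ** 2 for level in policy)   # toy objective

def mutate(policy):
    return [max(0, level + random.choice([-1, 0, 1])) for level in policy]

def crossover(p1, p2):
    cut = random.randrange(1, N_ECHELONS)
    return p1[:cut] + p2[cut:]

population = [[random.randint(0, 20) for _ in range(N_ECHELONS)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=evaluate, reverse=True)
    parents = population[: POP_SIZE // 2]        # survival of the fittest
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best_policy = max(population, key=evaluate)
```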
5.3. Data sources
Getting suitable data for training the RL agent is often a crucial problem because the performance of the RL model is highly dependent on the data quality. Data in a supply chain environment includes, for instance, the supply chain layout, available suppliers, lead times, demand scenarios, costs or additional restrictions. Based on these data, it is possible to design the supply chain environment and define the actions and rewards of the agent. The reviewed publications use three primary data sources: artificial, public, and institutional data. Most publications created artificial data for their RL model, usually because of the conceptual nature of the papers or the lack of suitable data. Moreover, training agents with artificial data is usually the most straightforward approach because the data can be tailored exactly to the model's needs and generated in arbitrary quantities. Nevertheless, some publications use existing data for training, namely public or institutional data. Public data comes mainly from the web or academic literature and is freely accessible, whereas institutional data comes from companies or the government and is confidential. These subclasses are attractive for further review because training an RL agent on real-world data is usually more challenging. Especially for deploying RL in industry, it is necessary to fit the RL model to existing data. Table 3 lists all publications that either used public or institutional data.
Table 3. Classification of publications according to the data source.

Public
• SCM competition: Lee and Sikora (2019), Mehta and Yamparala (2014), and Chatzidimitriou, Symeonidis, and Mitkas (2013)
• Open data: Aboutorab et al. (2022), Chen et al. (2021), Meisheri et al. (2021), Barat et al. (2020), and Barat et al. (2019)
• Academic literature: Kegenbekov and Jackson (2021), Perez et al. (2021), Aghaie and Heidary (2019), Valluri, North, and MacAl (2009), Chaharsooghi, Heydari, and Zegordi (2008), Van Tongeren et al. (2007), Gang and Ruoying (2006), Sun et al. (2006), and Simsek, Albayrak, and Korth (2004)
• Online game: Reeder et al. (2008)

Institutional
• Industry: Adi, Bae, and Iskandar (2021), Li et al. (2021), Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021), Chien, Lin, and Lin (2020), Makridis et al. (2020), Singi et al. (2020), Tariq Afridi et al. (2020), Bretas et al. (2019), Yang et al. (2019), Craven and Krejci (2017), Craven and Krejci (2016), and Aissani et al. (2012)
• Government: Zhao et al. (2021a) and Hwangbo and Yoo (2018)
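To make the notion of a supply chain environment discussed above concrete, the following minimal sketch wraps such data (a demand scenario, costs and a capacity restriction) into a state-action-reward interface. The single-echelon setting and all numeric parameters are illustrative assumptions rather than a reconstruction of any reviewed model.

```python
import random

class InventoryEnv:
    """Toy single-echelon inventory environment.

    State:  current stock level.
    Action: order quantity for the next period.
    Reward: negative holding and shortage costs.
    """
    HOLDING_COST, SHORTAGE_COST, CAPACITY = 1.0, 5.0, 50

    def reset(self):
        self.inventory = 20
        return self.inventory

    def step(self, order_quantity):
        demand = random.randint(0, 15)                  # demand scenario
        stock = min(self.CAPACITY, self.inventory + order_quantity)
        shortage = max(0, demand - stock)
        self.inventory = max(0, stock - demand)
        reward = -(self.HOLDING_COST * self.inventory
                   + self.SHORTAGE_COST * shortage)
        return self.inventory, reward
```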
5.3.1. Public data
In the following, we consider in detail the publications that used public data:

• SCM competitions simulate representative supply chain environments where researchers can test their algorithms and compete directly with each other. The most well-known competition was the yearly TAC SCM game that took place until 2011 and was organised by Carnegie Mellon University, the University of Minnesota and the Swedish Institute of Computer Science (Collins et al. 2006). Agents had to manage banking, production and warehousing services to maximise their profit. Lee and Sikora (2019) and Chatzidimitriou, Symeonidis, and Mitkas (2013) used the data from the competition to train and test their RL agents. Another competition that received much attention was held by the data science platform Kaggle (Instacart 2017). The dataset used in this competition was provided by the American grocery delivery service Instacart and included 3 million customer orders. The competition had the goal of predicting which products would be in a customer's next order. Barat et al. (2019), Barat et al. (2020), and Meisheri et al. (2021) used this dataset to model an inventory management problem and solved it with RL. Another competition was run by the Indian decision sciences company Mu Sigma. Mehta and Yamparala (2014) used the environment and the dataset from the competition to implement their RL algorithm.
• Open data are other common sources that are often suitable for training RL agents. The data science platform Kaggle was already mentioned as an important competition host; however, it is also a platform where users can share datasets tailored explicitly to data science, including datasets for SCM. Another source is the national bureaus of statistics, which provide official data on economic matters. Chen et al. (2021) use data from the National Bureau of Statistics of China to simulate an agri-food supply chain and train an RL agent. Aboutorab et al. (2022) use online news websites to identify risks and take proactive measures with RL.
• Academic literature is the most obvious source for data and supply chain environments. In many research disciplines, it is common practice to publish data and results so that other researchers can use them to benchmark their algorithms. RL in SCM does not have any well-known benchmark problems; one reason might be the loose connection between publications in this research domain. However, some authors use the scenarios of former publications. For example, Simsek, Albayrak, and Korth (2004), Gang and Ruoying (2006) and Sun et al. (2006) all address the case study of a research project called DispoWeb that was conducted at the Technical University of Berlin (DispoWeb 2004). The beer game is a well-known educational game in SCM that illustrates the bullwhip effect and is widespread in logistics and production education. It was initially developed by Jay Forrester at the MIT Sloan School of Management in the 1960s (Sterman 1992). It is a standard environment that many researchers have tried to approach with algorithms, including attempts to play the beer game with RL agents, for instance Chaharsooghi, Heydari, and Zegordi (2008), Van Tongeren et al. (2007), and Valluri, North, and MacAl (2009). Furthermore, the seminal work of Glasserman and Tayur (1995) inspired the models of Kegenbekov and Jackson (2021) and Perez et al. (2021).
• Online games provide similar environments as SCM competitions, but they are designed for human players. However, it is possible to train RL agents for trading in multiplayer online games, as shown by Reeder et al. (2008). Contrary to SCM competitions, the agent competes not only with other agents but also with human players. Based on the experience gained in online games, agents can also be used in real-world supply chain environments.

5.3.2. Institutional data
Institutional data includes all data sources that are not public, such as data from industry and government:

• Industry is an important source of data because it best represents the current challenges and requirements of industrial RL deployment. The publications in this subclass either used data provided by the industry or even collaborated with a company on the publication. For example, Tariq Afridi et al. (2020) collaborated with the German semiconductor manufacturer Infineon, and Li et al. (2021) tackled the dynamic pickup and delivery problem of the Chinese electronics manufacturer Huawei. Section 5.4 provides a further classification of RL models that target certain industrial sectors.
• Government includes all data sources from official institutions such as the government, state, province, municipality, or the military that are not publicly available. We identified only two publications that used government data. Zhao et al. (2021a) focus on a game with two agents with contrary objectives and use the case study of the United States Marine Corps logistics. Hwangbo and Yoo (2018) used data from a province in the Republic of Korea to model a hydrogen supply network.
5.4. Industrial sectors
Another way to structure the publications is according to industrial sectors. Most publications consider generic supply chains with generic participants such as suppliers, producers, retailers and customers, but some target specific industrial sectors. This is the case when the targeted industry has unique requirements for its supply chain, or the data source indicates the affiliation to an industrial sector. As the publications that focus on a specific industry are the most interesting, we only consider these for further review in this subsection. It is not always possible to assign publications to one industry, especially manufacturing and retail, because supply chains usually include stakeholders in both areas. In this case, we assign them to the industry that represents the focus of the publication. The most common classification system for industrial sectors is the ISIC, maintained by the United Nations (UNO 2008). It uses 21 sections that are further classified into divisions, groups and classes. Table 4 shows the industry affiliations according to the ISIC sections.

Table 4. Classification of publications according to industry.
• Manufacturing: Chen et al. (2021), Li et al. (2021), Chien, Lin, and Lin (2020), Tariq Afridi et al. (2020), Aissani et al. (2012), Zhang and Bhattacharyya (2007), Gang and Ruoying (2006), Sun et al. (2006), Simsek, Albayrak, and Korth (2004), and Bukkapatnam and Gao (2000)
• Wholesale and retail trade: Meisheri et al. (2021), Barat et al. (2020), Makridis et al. (2020), Singi et al. (2020), Barat et al. (2019), Craven and Krejci (2017), Craven and Krejci (2016), and Mehta and Yamparala (2014)
• Transportation and storage: Adi, Bae, and Iskandar (2021) and Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021)
• Electricity, gas, steam and air conditioning supply: Xiang (2020) and Hwangbo and Yoo (2018)
• Mining and quarrying: Bretas et al. (2019)
• Information and communication: Ghavamipoor and Golpayegani (2020)
• Public administration and defence: Zhao et al. (2021a)
• Human health and social work activities: Zwaida, Pham, and Beauregard (2021)

In the following, we describe the relevant industrial sectors and the belonging publications:

• Manufacturing includes the transformation of materials, substances or components into new products (UNO 2008). RL models focus on three manufacturing divisions: electronics, vehicles and food. Especially semiconductors are typical products in electronics supply chains and are often used for case studies in the academic literature. Semiconductor supply chains are characterised by short product life cycles, strong economies of scale, pervasive uncertainty and a global nature (Mönch, Uzsoy, and Fowler 2018). Tariq Afridi et al. (2020) use the example of the German manufacturer Infineon, whereas Chien, Lin, and Lin (2020) address a generic semiconductor supply chain. Li et al. (2021) consider the transportation of raw materials between the factories of Huawei in China, and Bukkapatnam and Gao (2000) use a generic computer supply chain for their inventory management problem. Besides semiconductors, vehicles are another focus in the manufacturing section, mostly because of the tractor production use case already described in Subsection 5.3.1 (Simsek, Albayrak, and Korth 2004; Gang and Ruoying 2006; Sun et al. 2006). Agri-food supply chains have to deal with perishable products and stringent regulations regarding food safety, which makes them fundamentally different from electronics and vehicle supply chains. For example, Chen et al. (2021) model an entire agri-food supply chain from farmer to consumer. Another publication considers the apparel industry and uses the example of the Algerian company ENADITEX (Aissani et al. 2012).
• Wholesale and retail trade is another major industrial sector for RL application in SCM. This section includes trade without transformation, which is usually one of the final steps in the supply chain (UNO 2008). All the models in this industrial sector consider the retail of groceries or agricultural products. Craven and Krejci (2017) and Craven and Krejci (2016) consider the trading between farmers and producers in regional food hubs in Iowa. Makridis et al. (2020) aim at enhancing food safety by predicting the risk of food recalls based on the data of the Greek decision analytics company Agroknow.
Mehta and Yamparala (2014) develop a model for order management in the food service industry. Singi et al. (2020) consider the case of the Indian grocery retailer DMart. Eventually, Barat et al. (2019), Barat et al. (2020), and Meisheri et al. (2021) predict customer orders for the American grocery delivery service Instacart.
• Transportation and storage comprises freight transport, cargo handling and storage (UNO 2008). Two publications explicitly focus on the transportation industry: Adi, Bae, and Iskandar (2021) study the inter-terminal truck routing in the Busan port in South Korea, and Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021) study the last-mile delivery in the Colombian city of Bogota.
• Electricity, gas, steam and air conditioning supply includes the provision of electric power, gas, steam, or air through a permanent infrastructure (UNO 2008). Xiang (2020) proposes a method that handles energy emergencies as a consequence of disruptions. Hwangbo and Yoo (2018) consider a hydrogen supply network in which the total costs must be minimised.
• Mining and quarrying comprises the extraction of minerals, liquids and gases (UNO 2008). Bretas et al. (2019) approach the railway traffic management of the Hunter Valley Coal Chain Coordinator in Australia with RL.
• Information and communication includes all activities belonging to information technology, data processing and communication (UNO 2008). Ghavamipoor and Golpayegani (2020) consider an electronic services supply chain that consists of different network service providers providing infrastructure for e-commerce. In this environment, they train an RL agent to adapt the service quality based on the expected profit of the customers.
• Public administration and defence includes activities carried out by the government and military activities (UNO 2008). Zhao et al. (2021b) simulate a military scenario in which they manage the United States Marine Corps logistics and maintenance.
• Human health and social work activities includes healthcare and social work (UNO 2008). Zwaida, Pham, and Beauregard (2021) consider the drug inventory management in a generic hospital supply chain. Inventory management in hospital supply chains is different from manufacturing inventory management because shortages are more critical, so approaches must have a high level of awareness for inventories (Landry and Beaulieu 2013).

6. Discussion

6.1. Applications
Figure 8. Classification and proportion of investigated publications in terms of supply chain drivers and RL algorithms. (a) Supply chain drivers (n = 103) and (b) Algorithms (n = 61).

Inventory management is the most common application of RL in SCM (see Figure 8). 61.1% of the RL models target inventory management and use the agent for controlling the material flow between multiple sites in the supply chain. The inventory models are almost evenly split between two-tier settings (customer-driven and supplier-driven replenishment) and multi-tier settings (global order management and supply chain scheduling). A proportion of 16.5% aims at the information driver, including forecasting, collaboration, human behaviour evaluation, and risk management. Even though information retrieval and processing are standard domains for machine learning, the number of RL models in this domain is limited, as supervised learning is superior when clear input-output pairs exist. RL prevalence in transportation, sourcing and pricing is lower than in inventory. One reason might be that transportation and pricing problems do not only exist in SCM; separate research communities also focus on these problems. Nevertheless, both drivers are part of SCM and impact performance.
Although sourcing is a key consideration in supply chains, it does not have many contributions, possibly due to its longer decision horizon. Generally, the decision levels and the prevalence of RL seem to cohere.

The standard definition of SCM distinguishes three decision levels that differ in their planning horizons: supply chain design, supply chain planning and supply chain operation, respectively long-term, mid-term and short-term (Chopra and Meindl 2013). Inventory management and transportation planning are typical short-term tasks that require frequent and fast decision-making. In contrast, sourcing and pricing are usually mid- to long-term tasks with a decision horizon of a quarter to a year (Chopra and Meindl 2013; Fleischmann, Meyr, and Wagner 2008). Information has no generally valid decision horizon because the need for information spans across all levels (Chopra and Meindl 2013). Considering the decision levels, most of the identified RL models address short-term tasks, and only some solve mid-term tasks. None consider long-term tasks. A possible explanation is the nature of the decisions on each level. Supply chain design significantly impacts the success of supply chains, and decisions are taken at the managerial level. Furthermore, these decisions are valid for several years, and decision-making time is not an essential factor. Hence, black-box approaches such as RL seem unsuitable, as they do not provide reasoning, and their low decision-making time is irrelevant. Furthermore, supply chain design is usually not a structured problem because it cannot be captured quantitatively, and the decision-maker has much implicit knowledge that influences his decision. For instance, the decision to choose a location for a new production site depends not only on the mathematical location model but also on qualitative factors such as the availability of qualified employees. Contrarily, short-term problems such as inventory management are easier to capture for RL models, as they have a limited scope, a short planning horizon and precise inputs and objectives. The suitability of RL for mid-term tasks is ambiguous. For instance, supplier selection is usually based on clear quantitative measures, whereas policy selection regarding product programmes or sales markets is more difficult to define.

Regarding the research subjects of the identified publications, the classification also revealed two general approaches to applying RL in SCM that are independent of supply chain drivers. Most papers directly utilise agents for controlling operations and decision-making. These papers aim to perform better than humans in the considered environments and to improve decision-making by optimising a performance indicator. However, a small share of papers does not directly aim at decision-making but uses multi-agent systems for simulation purposes. According to the RL paradigm, each agent still tries to find a near-optimal state in the supply chain environment. However, in this case, the aim is not to provide direct decision support but to study the behaviour and interaction of agents. This finding is particularly interesting in environments with multiple players and substantial uncertainties because it is challenging to represent realistic human behaviour with fixed rules in traditional simulation models. Instead, actions, states and rewards are defined, and the system is left to its own devices to observe the agents' competition and collaboration. Deriving recommendations for the supply chain problem from the simulation is only the second step, after reaching an equilibrium state. The findings can, in turn, be used to optimise one of the supply chain drivers. Publications using the second approach mainly target the information or pricing drivers and can be found, for instance, in human behaviour evaluation (Craven and Krejci 2016, 2017) and market simulation (Liu, Howley, and Duggan 2011).

6.2. Algorithms
It was possible to classify the RL algorithms unambiguously only for 61 publications (see Figure 8). The remaining academic papers focus on conceptualisation and managerial insights and do not explicitly mention the algorithm used. Among the classified publications, the vast majority (62%) apply Q-learning. The algorithm's popularity can be explained by the fact that it is model-free, which implies that it does not require a model of the environment, so RL agents can operate directly by sampling and learning the expected rewards for an action taken in a given state. Besides, Q-learning can handle stochastic transitions and reward problems (Baird 1995). Twelve publications (20%) took advantage of DRL-based approaches, including DQN, A2C, A3C, and PPO. By incorporating deep artificial neural networks into the algorithms, DRL allows agents to make decisions from large-scale and potentially unstructured input data without additional data engineering. As a result, DRL agents can operate with substantial data inputs (e.g. annual inventory data at the transactional level) and decide what actions to perform to optimise the objective. We also discovered a few applications of evolutionary and Bayesian approaches, which constitute a promising alternative to classic RL and DRL. It is important to highlight that outside of the RL scope, Bayesian methods are gaining momentum in SCM applications.
For example, in a recent study, a multi-layer Bayesian network was used to simulate and quantify the impact of supply chain disruptions (Hosseini and Ivanov 2021).
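Returning to the dominant algorithm in the sample, the model-free property can be illustrated in a few lines: the tabular Q-learning loop below never touches a transition model and improves purely by sampling transitions from the environment, which is treated as a black box. The reset()/step() interface follows the sketch in Section 5.3 and is an assumption for illustration.

```python
import random
from collections import defaultdict

def train(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        for _ in range(100):                      # fixed episode length
            if random.random() < eps:             # explore
                action = random.choice(actions)
            else:                                 # exploit
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward = env.step(action)
            # no model of the environment appears anywhere in the update:
            target = reward + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```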
Figure 9. Classification and proportion of investigated publications in terms of data sources and industrial sectors. (a) Data sources (n = 94) and (b) Industrial sectors (n = 96).

6.3. Dissemination in industry
The last research question considers the dissemination of RL in industrial SCM. By assessing the data sources and industrial sectors, we can indicate the relevance of RL in industrial applications. Figure 9 shows the proportions of data sources and industrial sectors. The classification of data sources revealed that 66% of the publications use artificial data for their implementation, and 73% consider generic supply chains. This finding indicates that the deployment of RL in SCM is still at an early stage because of the mostly conceptual level of the publications. Artificial data simplifies the implementation because it is possible to generate arbitrary data according to the needs. Even if authors used non-artificial data, these are often taken from the academic literature and represent reality only to a limited extent. However, there are some promising exceptions, published for the most part in the last few years, that consider industrial supply chains, such as Gutierrez-Franco, Mejia-Argueta, and Rabelo (2021), Li et al. (2021) or Adi, Bae, and Iskandar (2021). Most generic settings are simple supply chains with few stages and participants, little uncertainty and no additional restrictions. The extent and uncertainties, for instance regarding demands and lead times, make real-world supply chains far more complex than these examples. Moreover, there are different types of supply chains, some of which have special requirements, for example regarding perishability in agri-food supply chains or increased awareness in healthcare supply chains. If authors considered specific industrial sectors, these are primarily standard manufacturing or retail supply chains with few special requirements.

7. Managerial insights and future research directions
The larger volume and lower latency of digital information produced by sources ranging from social media to the Internet of Things have significantly extended the scope of what can be considered measurable and quantifiable. Worldwide revenue for AI, Big Data, and business analytics solutions is expected to exceed $274 billion in 2022 (Brynjolfsson, Jin, and McElheran 2021). Keeping in mind that deep learning is still a young area of research and that the efficient DRL algorithms have been discovered only recently (in 2013, 2016, and 2017 for DQN, A2C, and PPO, respectively), the popularity of these techniques is expected to increase over time. Besides the recent progress in algorithmic, hardware, and data technologies, the advent of high-level frameworks for DRL development and deployment has the potential to make DRL applications for SCM more affordable, widely applicable, and accessible than ever before (Daniel Zhang et al. 2021).

Although the analysed publications show that RL algorithms exhibit great potential for various SCM tasks, our investigations conclude that RL research for SCM applications is still in its infancy from a practical industrial perspective. The reasons for this are manifold. In the following subsections, we provide managerial insights and derive future research directions.

7.1. Availability of supply chain data and procedure models for the industrial implementation of reinforcement learning
As mentioned in Section 5.1, most papers evaluate RL applications on comparatively simple problems that neglect many aspects of genuine supply chains. One of the main reasons for this circumstance might be that there are hardly any publicly available datasets from genuine
industrial supply chains that researchers could use to develop and pilot RL applications. Consequently, the vast majority of the investigated papers evaluate RL algorithms on very simply structured supply chains with randomly generated data.

In addition, some aspects currently prevent the use of RL in real industrial SCM. The major impediment is data acquisition, especially after the training, when deploying an RL-trained agent for decision-making in a real supply chain. To take advantage of the agent's real-time responsiveness, all necessary information from the supply chain that the agent receives as an input state must be continuously accessible in real time. An obvious problem is the required data from external companies in the supply chain. Often, companies are not eager to make internal data available due to concerns about competitive disadvantage. Even if companies agree to share their data, implementing privacy-compliant interfaces between the companies is necessary. From an OEM's perspective, the more upstream a supplier's location is in the supply chain, the more difficult it becomes to obtain operational data. In practice, OEMs only know their first-, second- and third-tier suppliers. This limited knowledge contrasts with many of the researched RL applications in which the agent has a global view of the supply chain. However, not only the acquisition of external data can be a challenge. Even accessing internal company data sometimes requires customised software solutions, as many closed-source ERP systems do not offer interfaces for data exchange with external software applications. Against this background, we see a need for detailed procedure models that describe the implementation of RL-based decision support systems from an operations research perspective and from a holistic software engineering perspective.

Therefore, we can identify an essential need for publicly available data for future research and for the general reproducibility of experiments. The Open Data paradigm is currently gaining momentum, encouraging the free provision of data (Braunschweig et al. 2012). SCM could also benefit from these new data sources, especially in disruption monitoring, as important geopolitical, ecological, or economic events are quickly accessible on the Internet. Furthermore, benchmark problems and open datasets for training RL agents in supply chain environments are necessary to strengthen future research in this domain.

7.2. Reliability and user-acceptance of reinforcement learning decisions in SCM
Even though industrial applications are still scarce, the benefits of RL algorithms for SCM problems are apparent. First, agent models trained by RL allow data-driven and autonomous decision-making in real time. Thus, RL enables a response to sudden disturbances in the supply chain by determining alternative decisions based on the latest information from the supply chain. Second, RL methods can compute alternative decisions in case of sudden disturbances and adapt control strategies in case of gradual changes in the supply chain data. By this means, RL provides a fundament for developing resilient decision support systems for SCM. Third, RL methods also facilitate human-centred decision-making, as they consider human experience when learning a decision policy. For instance, this can be achieved by requiring that a human first confirm agent-computed decisions before they are applied. The confirmation or rejection of a decision can then be used as an additional reward and thus as a learning signal for the agent.
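A minimal sketch of this feedback mechanism is given below. The bonus and penalty magnitudes, as well as the hypothetical planner_accepts() call, are illustrative assumptions rather than a prescription from the reviewed literature.

```python
CONFIRM_BONUS, REJECT_PENALTY = 1.0, -1.0

def shaped_reward(env_reward, human_confirmed):
    """Fold the planner's confirmation or rejection into the reward."""
    feedback = CONFIRM_BONUS if human_confirmed else REJECT_PENALTY
    return env_reward + feedback

# Inside the agent's training loop, one would call, for example,
#   reward = shaped_reward(env_reward, planner_accepts(proposed_action))
# before applying the usual temporal-difference update, so that accepted
# decisions are reinforced and rejected ones are discouraged.
```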
A fundamental practical problem of machine learning approaches is that the corresponding models (in particular deep neural networks) are black-box methods. Thus, actions suggested by the agent might be difficult to comprehend for human decision-makers. In particular, if the agent's decisions deviate strongly from the intuition of the human planner, the acceptance of the agent's decisions is diminished. A simulative validation of the agent's decision and a meaningful visualisation of the simulation results can help increase acceptance. Another conceptual problem is that the agent's generated solution is a data-based prediction. Assuming constant agent parameters, the same data input will always result in the same output. If the agent's proposed solution is not feasible, the agent natively has no strategies to adapt the solution until it is feasible. One way to address this problem is to implement additional heuristic search strategies that are deployed in case of an infeasible solution. The alternative solution identified in this way can, in turn, be used as a learning signal to adjust the agent's decision policy. However, applying heuristic search strategies can significantly reduce the real-time capability of the agent. For instance, Wang et al. (2022) present a promising example of combining RL and traditional methods of operations research.

Current RL models focus on decision-making either with complete or incomplete information, but they rarely assess the proposed actions regarding resilience and human acceptance. However, these are key characteristics for implementing RL-based decision support systems in industrial environments. In the post-COVID-19 era, many companies will shift to more resilient and transparent decision-making, as they have encountered the consequences of neglecting these factors (Schroeder et al. 2021). The prevalence of these factors as objectives in decision support systems is still low. Hence, we can identify a need for resilient and human-centred RL in SCM.
Another research gap is the application of hybrid approaches such as the combination of RL and metaheuristics. These approaches have been applied in other domains but are currently limited to very few applications in the context of SCM.

8. Conclusion
In this paper, we performed a semi-systematic review to understand the algorithms, applications and practical adoption of RL in SCM. Our results show that RL applications in SCM date back to 2000. The publication dynamics were steady until the number of publications rose drastically in 2019, which can be attributed to improved computational hardware, growing data quantities, and the advent of deep learning. Most publications can be assigned to computer science, engineering, and mathematics. The bibliometric network shows low linkage between the considered publications concerning citations, but there are many similarities in the keywords used.

As one of the major outcomes of our study, we proposed a hierarchic classification framework that categorises RL applications in SCM according to four criteria with several classes and subclasses. The criteria include supply chain drivers, algorithms, data sources, and industrial sectors.

8.1. Research question 1: what are the main applications of reinforcement learning in SCM?
According to the supply chain driver classification, inventory management problems are the most common application in SCM, followed by information and transportation problems. The models that address inventory management problems usually use the RL agent to orchestrate the material flow between multiple sites in the supply chain. Information is a broader class that includes applications that aim to increase information availability, such as forecasting, collaboration, or risk management. In the transportation class, RL is used to address vehicle routing or scheduling problems. The prevalence of RL models in planning tasks with short decision horizons is significantly higher than in long-term decision-making because these tasks mostly have a limited scope, precise inputs and objectives, and a requirement for fast decision-making. On the contrary, long-term decisions taken at the managerial level require valid reasoning; at the same time, the decision-making time is irrelevant, and a quantitative model cannot fully reflect the problem structure and complexity. Therefore, due to this limited suitability, the prevalence of RL in long-term decision-making is much lower than for short-term tasks.

8.2. Research question 2: what are the main algorithms of reinforcement learning in SCM?
The OpenAI taxonomy has been modified to classify papers based on the applied RL algorithm. Among the classified publications, most models take advantage of Q-learning. This fact can be explained by the maturity of this long-standing technique, which is capable of handling stochastic transitions and reward problems. Besides, Q-learning belongs to the model-free class, implying that an RL agent can operate directly by sampling and learning the expected rewards. However, more recent publications use deep learning techniques such as DQN, A2C, A3C, and PPO. By incorporating deep artificial neural networks, RL agents gain the capability to make decisions from large-scale and potentially unstructured input data.

8.3. Research question 3: how widespread is reinforcement learning for solving industrial use cases?
Regarding the data sources and industrial sectors, most RL models consider generic supply chains with artificial data. Only a few publications use public or institutional data from the academic literature or industry. Some authors also address the requirements of supply chains in specific industrial sectors such as healthcare, hydrogen distribution, or mining. Nevertheless, the theoretical models show great potential for various SCM tasks. RL models are capable of data-driven decision-making in real time, making them capable of responding to sudden disturbances in the supply chain. In ever-changing supply chain environments, agents can adapt their control strategies in case of gradual changes in the supply chain data. Furthermore, RL facilitates human-centric decision-making by learning from human experience, but it is still a black-box approach without reasoning capabilities. The main problem in industrial deployment is data streaming: even trained agents require continuous access to supply chain data in real time, a capacity many companies currently do not have. Furthermore, access is usually limited to internal data, but external data from upstream suppliers is also crucial for decision-making. Finally, we note that RL applications can be extended toward the novel SCM context driven by the tremendous disruptions in the years of pandemic, geopolitical shocks, component shortages, and transportation bottlenecks, manifested in new paradigms such as the shortage economy (Ivanov and Dolgui 2022) and supply chain viability (Ivanov 2020c; Ruel et al. 2021). RL applications have high potential in the emerging paradigms of Industry 5.0 (Ivanov 2022b) and the reconfigurable supply chain (Dolgui, Ivanov, and Sokolov 2020).
8.4. Limitations
The applied review methodology has several limitations. The study provides an exhaustive literature review of the Scopus database but does not include other databases. Other databases are expected to overlap with Scopus, but we might have missed some relevant publications. Regarding the search terms and keywords, we tried to make them as general as possible to include many publications, but some relevant contributions might be missing. In particular, research communities that apply RL to supply chain problems with an emphasis on transportation, pricing, or location planning might not mention the term 'supply chain', making them undetectable for our study. Hence, we expect these domains to have more contributions than considered in this literature review.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
The authors confirm that the data supporting the findings of this study are available within the article. The data were derived from the Scopus database, available at https://www.scopus.com.

Notes on contributors
Benjamin Rolf is a researcher and PhD candidate in the Institute of Logistics and Material Handling Systems at Otto-von-Guericke-University Magdeburg. He holds a master's degree in Industrial Engineering with a specialization in Logistics from Otto-von-Guericke-University Magdeburg. His main research interest includes the application of machine learning, simulation and network theory in supply chain management. In his PhD thesis, he currently works on the dynamic reconfiguration of supply chains as a reaction to disruptions using graph-based artificial intelligence and open data.

Dr. Ilya Jackson is a Postdoctoral Associate at the MIT Center for Transportation & Logistics. He earned his PhD in Civil Engineering and Transportation from the Transport and Telecommunication Institute, where he spent one year as an assistant professor shortly after that. The main ideas of his PhD thesis have been summarized in the paper 'Neuroevolutionary approach to metamodel-based optimization in production and logistics', which received the young researcher award in 2020. Dr. Ilya Jackson currently focuses on reinforcement learning for supply chain synchronization and domain-specific automated machine learning for supply chain management and logistics.

Marcel Müller is a research fellow at the Otto-von-Guericke-University Magdeburg. He earned his master degree in Industrial Engineering for Logistics at the Otto-von-Guericke-University Magdeburg. His research interests include modeling and simulation of logistics systems and handling of deadlocks. His email address is marcel1.mueller@ovgu.de. His website is https://www.ilm.ovgu.de/mueller.

Sebastian Lang is a research fellow at the Fraunhofer Institute for Factory Operation and Automation IFF. He holds a master's degree in mechanical engineering with a focus on production technologies and a master's and bachelor's degree in industrial engineering and logistics. His research interests include studying and applying methods of machine learning and artificial intelligence, simulation modeling, and mathematical optimization for problems in production and logistics. His e-mail address is sebastian.lang@iff.fraunhofer.de. His ResearchGate profile is https://www.researchgate.net/profile/Sebastian_Lang5.

Tobias Reggelin is a project manager, researcher, and lecturer at Otto von Guericke University Magdeburg and the Fraunhofer Institute for Factory Operation and Automation IFF Magdeburg. His main research and work interests include modeling and simulation of production and logistics systems and developing and applying logistics management games. Tobias Reggelin received a doctoral degree in engineering from the Otto-von-Guericke-University Magdeburg. Furthermore, he holds a master's degree in Engineering Management from the Rose-Hulman Institute of Technology in Terre Haute, IN, and a diploma degree in Industrial Engineering in Logistics from the Otto-von-Guericke-University Magdeburg. His email address is tobias.reggelin@ovgu.de.

Dmitry Ivanov is a professor of supply chain and operations management at Berlin School of Economics and Law (HWR Berlin). His publication list includes around 380 publications, including over 120 papers in international academic journals and the leading textbooks Global Supply Chain and Operations Management and Introduction to Supply Chain Resilience. His main research interests and results span the resilience and ripple effect in supply chains, risk analytics, and digital twins. He co-edits the International Journal of Integrated Supply Management (IJISM) and is an associate editor of the International Journal of Production Research (IJPR) and OMEGA. He is Chairman of IFAC TC 5.2 'Manufacturing Modelling for Management and Control'.

ORCID
Benjamin Rolf http://orcid.org/0000-0002-5454-8894
Ilya Jackson http://orcid.org/0000-0002-7457-6040
Marcel Müller http://orcid.org/0000-0001-9865-7331
Sebastian Lang http://orcid.org/0000-0003-3397-1551
Tobias Reggelin http://orcid.org/0000-0003-3001-9821

References
Aboutorab, H., O. K. Hussain, M. Saberi, and F. K. Hussain. 2022. "A Reinforcement Learning-Based Framework for Disruption Risk Identification in Supply Chains." Future Generation Computer Systems 126: 110–122.
Achiam, J. 2018. "OpenAI Spinning up. GitHub Repository." Accessed 16 February 2022. https://spinningup.openai.com/.
Adi, Taufik Nur, Hyerim Bae, and Yelita Anggiane Iskandar. 2021. "Interterminal Truck Routing Optimization Using Cooperative Multiagent Deep Reinforcement Learning." Processes 9 (10): 1728.
Adi, T. N., Y. A. Iskandar, H. Bae, and Y. Choi. 2020. "Reduction of Number of Empty-Truck Trips in Inter-Terminal Transportation Using Multi-agent Q-learning." In Interconnected Supply Chains in An Era of Innovation – Proceedings of the 8th International Conference on Information Systems, Logistics and Supply Chain, ILS 2020, Austin, TX, USA, 167–172.
Aghaie, A., and M. H. Heidary. 2019. "Simulation-Based Optimization of a Stochastic Supply Chain Considering Supplier Disruption: Agent-Based Modeling and Reinforcement Learning." Scientia Iranica 26 (6): 3780–3795.
Aissani, N., A. Bekrar, D. Trentesaux, and B. Beldjilali. 2012. "Dynamic Scheduling for Multi-Site Companies: A Decisional Approach Based on Reinforcement Multi-agent Learning." Journal of Intelligent Manufacturing 23 (6): 2513–2529.
Alves, J. C., and G. R. Mateus. 2020. "Deep Reinforcement Learning and Optimization Approach for Multi-echelon Supply Chain with Uncertain Demands." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12433 LNCS, 584–599. Berlin, Heidelberg: Springer.
Back, J.-G., C. O. Kim, and I.-H. Kwon. 2006. "An Adaptive Inventory Control Model for a Supply Chain with Non-stationary Customer Demands." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4099 LNAI, 895–900. Berlin, Heidelberg: Springer.
Baird, Leemon. 1995. "Residual Algorithms: Reinforcement Learning with Function Approximation." In Machine Learning Proceedings 1995, 30–37. Elsevier. doi:10.1016/b978-1-55860-377-6.50013-x.
Barat, S., H. Khadilkar, H. Meisheri, V. Kulkarni, V. Baniwal, P. Kumar, and M. Gajrani. 2019. "Actor Based Simulation for Closed Loop Control of Supply Chain Using Reinforcement Learning." In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, Montreal, QC, Canada, 1802–1804. Vol. 3.
Barat, S., P. Kumar, M. Gajrani, H. Khadilkar, H. Meisheri, V. Baniwal, and V. Kulkarni. 2020. "Reinforcement Learning of Supply Chain Control Policy Using Closed Loop Multi-agent Simulation." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12025 LNAI, 26–38. Cham: Springer.
Bengio, Yoshua. 2016. Deep Learning. Adaptive Computation and Machine Learning Series. London, England: MIT Press.
Bharti, S., D. S. Kurian, and V. M. Pillai. 2020. "Reinforcement Learning for Inventory Management." Lecture Notes in Mechanical Engineering, 877–885. Singapore: Springer.
Boute, Robert N., Joren Gijsbrechts, Willem van Jaarsveld, and Nathalie Vanvuchelen. 2022. "Deep Reinforcement Learning for Inventory Control: A Roadmap." European Journal of Operational Research 298 (2): 401–412.
Braunschweig, Katrin, Julian Eberius, Maik Thiele, and Wolfgang Lehner. 2012. "The State of Open Data – Limits of Current Open Data Platforms." In WWW '12: Proceedings of the 21st International Conference on World Wide Web, Lyon, France.
Bretas, A., A. Mendes, S. Chalup, M. Jackson, R. Clement, and C. Sanhueza. 2019. "Modelling Railway Traffic Management Through Multi-agent Systems and Reinforcement Learning." In 23rd International Congress on Modelling and Simulation – Supporting Evidence-Based Decision Making: The Role of Modelling and Simulation, MODSIM 2019, Canberra, Australia, 291–297.
Brynjolfsson, Erik, Wang Jin, and Kristina McElheran. 2021. "The Power of Prediction: Predictive Analytics, Workplace Complements, and Business Performance." Business Economics 56 (4): 217–239.
Bukkapatnam, S., and G. Gao. 2000. "Effect of Reinforcement Learning on Coordination of Multiagent Systems." Proceedings of SPIE – The International Society for Optical Engineering 4208: 31–41.
Burgos, Diana, and Dmitry Ivanov. 2021. "Food Retail Supply Chain Resilience and the COVID-19 Pandemic: A Digital Twin-Based Impact Analysis and Improvement Directions." Transportation Research Part E: Logistics and Transportation Review 152: 102412.
Cao, H., H. Xi, and S. F. Smith. 2003. "A Reinforcement Learning Approach to Production Planning in the Fabrication/fulfillment Manufacturing Process." In Winter Simulation Conference Proceedings, New Orleans, LA, USA, 1417–1423. Vol. 2.
Cavalcante, Ian M., Enzo M. Frazzon, Fernando A. Forcellini, and Dmitry Ivanov. 2019. "A Supervised Machine Learning Approach to Data-driven Simulation of Resilient Supplier Selection in Digital Manufacturing." International Journal of Information Management 49: 86–97.
Chaharsooghi, S. K., J. Heydari, and S. H. Zegordi. 2008. "A Reinforcement Learning Model for Supply Chain Ordering Management: An Application to the Beer Game." Decision Support Systems 45 (4): 949–959.
Chatzidimitriou, K. C., A. L. Symeonidis, and P. A. Mitkas. 2013. "Policy Search Through Adaptive Function Approximation for Bidding in TAC SCM." Lecture Notes in Business Information Processing 136: 16–29.
Chen, H., Z. Chen, F. Lin, and P. Zhuang. 2021. "Effective Management for Blockchain-Based Agri-Food Supply Chains Using Deep Reinforcement Learning." IEEE Access 9: 36008–36018.
Chien, C.-F., Y.-S. Lin, and S.-K. Lin. 2020. "Deep Reinforcement Learning for Selecting Demand Forecast Models to Empower Industry 3.5 and An Empirical Study for a Semiconductor Component Distributor." International Journal of Production Research 58 (9): 2784–2804.
Chodura, D., P. Dominik, and J. Koźlak. 2011. "Market Strategy Choices Made by Company Using Reinforcement Learning." Advances in Intelligent and Soft Computing 90: 83–90.
Choi, Tsan-Ming, Alexandre Dolgui, Dmitry Ivanov, and Erwin Pesch. 2022. "OR and Analytics for Digital, Resilient, and Sustainable Manufacturing 4.0." Annals of Operations Research 310 (1): 1–6.
Chopra, Sunil, and P. Meindl. 2013. Supply Chain Management: Strategy, Planning, and Operation. 5th ed. Boston: Pearson Education.
Chopra, Sunil, and Man Mohan S. Sodhi. 2004. "Managing Risk to Avoid Supply-Chain Breakdown." MIT Sloan Management Review 46 (1): 53–61.
Collins, John, Raghu Arunachalam, Norman Sadeh, Joakim Eriksson, Niclas Finne, and Sverker Janson. 2006. "The Supply Chain Management Game for the 2007 Trading Agent Competition." http://reports-archive.adm.cs.cmu.edu/anon/isri2007/CMU-ISRI-07-100.pdf.
Craven, T. J., and C. C. Krejci. 2016. "Assessing Management Strategies for Intermediated Regional Food Supply Networks." In 2016 International Annual Conference of the American Society for Engineering Management, ASEM 2016, Charlotte, North Carolina, USA, 21–48.
Craven, T. J., and C. C. Krejci. 2017. "An Agent-Based Model of Regional Food Supply Chain Disintermediation." In ADS '17: Proceedings of the Agent-Directed Simulation Symposium, Virginia Beach, Virginia, USA, 83–92. Vol. 49.
Dahlem, D., and W. Harrison. 2010. "Collaborative Function Approximation in Social Multiagent Systems." In Proceedings – 2010 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2010, Washington, DC, USA, 48–55. Vol. 2.
Dangelmaier, W., T. Rust, A. Döring, and B. Klöpper. 2006. "A Reinforcement Learning Approach for Learning Coordination Rules in Production Networks." In CIMCA 2006: International Conference on Computational Intelligence for Modelling, Control and Automation, Jointly with IAWTIC 2006: International Conference on Intelligent Agents Web Technologies, Sydney, Australia, 84.
Das, Tapas K., Abhijit Gosavi, Sridhar Mahadevan, and Nicholas Marchalleck. 1999. "Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning." Management Science 45 (4): 560–574.
Dayan, Peter, and Geoffrey E. Hinton. 1997. "Using Expectation-Maximization for Reinforcement Learning." Neural Computation 9 (2): 271–278.
De Maio, C., G. Fenza, V. Loia, F. Orciuoli, and E. Herrera-Viedma. 2016. "A Framework for Context-aware Heterogeneous Group Decision Making in Business Processes." Knowledge-Based Systems 102: 39–50.
DispoWeb. 2004. "Dispositive Supply Web Coordination with Multi-Agent Systems." https://www.aot.tu-berlin.de/index.php?id=1764&L=1.
Dogan, I., and A. R. Güner. 2015. "A Reinforcement Learning Approach to Competitive Ordering and Pricing Problem." Expert Systems 32 (1): 39–48.
Dolgui, Alexandre, and Dmitry Ivanov. 2022. "5G in Digital Supply Chain and Operations Management: Fostering Flexibility, End-to-end Connectivity and Real-Time Visibility Through Internet-of-everything." International Journal of Production Research 60 (2): 442–451.
Dolgui, Alexandre, Dmitry Ivanov, and Maxim Rozhkov. 2019. "Does the Ripple Effect Influence the Bullwhip Effect? An Integrated Analysis of Structural and Operational Dynamics in the Supply Chain." International Journal of Production Research 58 (5): 1285–1301.
Dolgui, Alexandre, Dmitry Ivanov, and Boris Sokolov. 2020. "Reconfigurable Supply Chain: The X-network." International Journal of Production Research 58 (13): 4138–4163.
Du, H., and Y. Jiang. 2019. "Backup Or Reliability Improvement Strategy for a Manufacturer Facing Heterogeneous Consumers in a Dynamic Supply Chain." IEEE Access 7: 50419–50430.
Du, H., and T. Xiao. 2019. "Pricing Strategies for Competing Adaptive Retailers Facing Complex Consumer Behavior: Agent-Based Model." International Journal of Information Technology and Decision Making 18 (6): 1909–1939.
Emerson, Denise, Wei Zhou, and Selwyn Piramuthu. 2009. "Goodwill, Inventory Penalty, and Adaptive Supply Chain Management." European Journal of Operational Research 199 (1): 130–138.
Filatova, D., C. El-Nouty, and R. V. Fedorenko. 2021. "Some Theoretical Backgrounds for Reinforcement Learning Model of Supply Chain Management Under Stochastic Demand." In International Conference on Information and Digital Technologies 2021, IDT 2021, Zilina, Slovakia, 24–30.
Fleischmann, Bernhard, Herbert Meyr, and Michael Wagner. 2008. "Advanced Planning." In Supply Chain Management and Advanced Planning, edited by Hartmut Stadtler and Christoph Kilger, 81–106. Berlin, Heidelberg: Springer.
Ganesan, V. K., D. Sundararaj, and A. P. Srinivas. 2021. "Adaptive Inventory Replenishment for Dynamic Supply Chains with Uncertain Market Demand." Lecture Notes in Mechanical Engineering, 325–335. Singapore: Springer.
Gang, Z., and S. Ruoying. 2006. "Policy Transition of Reinforcement Learning for An Agent Based SCM System." In 2006 IEEE International Conference on Industrial Informatics, INDIN'06, Singapore, 793–798.
Ghavamipoor, H., and S. A. Hashemi Golpayegani. 2020. "A Reinforcement Learning Based Model for Adaptive Service Quality Management in E-Commerce Websites." Business and Information Systems Engineering 62 (2): 159–177.
Ghavamzadeh, Mohammad, Shie Mannor, Joelle Pineau, and Aviv Tamar. 2015. "Bayesian Reinforcement Learning: A Survey." Foundations and Trends® in Machine Learning 8 (5-6): 359–483. ArXiv: 1609.04436.
Ghorbel, N., S.-A. Addouche, and A. El Mhamedi. 2015. "Forward Management of Spare Parts Stock Shortages Via Causal Reasoning Using Reinforcement Learning." IFAC-PapersOnLine 48: 1061–1066.
Giannoccaro, I., and P. Pontrandolfo. 2002. "Inventory Management in Supply Chains: A Reinforcement Learning Approach." International Journal of Production Economics 78 (2): 153–161.
Glasserman, Paul, and Sridhar Tayur. 1995. "Sensitivity Analysis for Base-Stock Levels in Multiechelon Production-inventory Systems." Management Science 41 (2): 263–281.
Gu, Shixiang, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. "Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-policy Updates." In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
Gutierrez-Franco, E., C. Mejia-Argueta, and L. Rabelo. 2021. "Data-driven Methodology to Support Long-lasting Logistics and Decision Making for Urban Last-mile Operations." Sustainability (Switzerland) 13 (11): 6230.
Habib, A., M. I. Khan, and J. Uddin. 2017. "Optimal Route Selection in Complex Multi-Stage Supply Chain Networks Using SARSA(λ)." In 19th International Conference on Computer and Information Technology, ICCIT 2016, Dhaka, Bangladesh, 170–175.
Hachaichi, Y., Y. Chemingui, and M. Affes. 2020. "A Policy Gradient Based Reinforcement Learning Method for Supply Chain Management." In Proceedings of the International Conference on Advanced Systems and Emergent Technologies, IC_ASET 2020, Tunis, Tunisia, 135–140.
Hax, Arnoldo C., and Dan Candea. 1984. Production and Inventory Management. Englewood Cliffs: Prentice-Hall.
Hirano, M., H. Matsushima, K. Izumi, and T. Mukai. 2021. "Simulation of Unintentional Collusion Caused by Auto Pricing in Supply Chain Markets." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12568 LNAI, 352–359. Cham: Springer.
Hosseini, Seyedmohsen, and Dmitry Ivanov. 2021. "A Multi-layer Bayesian Network Method for Supply Chain Disruption Modelling in the Wake of the COVID-19 Pandemic." International Journal of Production Research 60 (17): 5258–5276.
Huang, H., and X. Tan. 2021. "Application of Reinforcement Learning Algorithm in Delivery Order System Under Supply Chain Environment." Mobile Information Systems 2021: 1–11.
Hwangbo, S., and C. Yoo. 2018. "A Methodology of a Hybrid Hydrogen Supply Network (HHSN) Under Alternative Energy Resources (AERs) of Hydrogen Footprint Constraint for Sustainable Energy Production (SEP)." Computer Aided Chemical Engineering 43: 343–348.
Hyndman, Rob J., and George Athanasopoulos. 2018. Forecasting: Principles and Practice. 2nd ed. Melbourne: OTexts.
Instacart. 2017. "Instacart Market Basket Analysis." https://www.kaggle.com/competitions/instacart-market-basket-analysis/overview.
Isele, David, Reza Rahimi, Akansel Cosgun, Kaushik Subramanian, and Kikuo Fujimura. 2018. "Navigating Occluded Intersections with Autonomous Vehicles Using Deep Reinforcement Learning." In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
Ivanov, Dmitry. 2020a. "'A Blessing in Disguise' Or 'as If it Wasn't Hard Enough Already': Reciprocal and Aggravate Vulnerabilities in the Supply Chain." International Journal of Production Research 58 (11): 3252–3262.
Ivanov, Dmitry. 2020b. "Predicting the Impacts of Epidemic Outbreaks on Global Supply Chains: A Simulation-Based Analysis on the Coronavirus Outbreak (COVID-19/SARS-CoV-2) Case." Transportation Research Part E: Logistics and Transportation Review 136: 101922.
Ivanov, Dmitry. 2020c. "Viable Supply Chain Model: Integrating Agility, Resilience and Sustainability Perspectives–Lessons From and Thinking Beyond the COVID-19 Pandemic." Annals of Operations Research 1–21.
Ivanov, Dmitry. 2021a. "Digital Supply Chain Management and Technology to Enhance Resilience by Building and Using End-to-end Visibility During the COVID-19 Pandemic." IEEE Transactions on Engineering Management 1–11.
Ivanov, Dmitry. 2021b. Introduction to Supply Chain Resilience: Management, Modelling, Technology. Cham: Springer Nature.
Ivanov, Dmitry. 2022a. "Blackout and Supply Chains: Cross-Structural Ripple Effect, Performance, Resilience and Viability Impact Analysis." Annals of Operations Research 1–17.
Ivanov, Dmitry. 2022b. "The Industry 5.0 Framework: Viability-Based Integration of the Resilience, Sustainability, and Human-centricity Perspectives." International Journal of Production Research.
Ivanov, Dmitry. 2022c. "Probability, Adaptability, and Time: Some Research-Practice Paradoxes in Supply Chain Resilience and Viability Modeling." International Journal of Integrated Supply Management 15 (4): 454–465.
Ivanov, Dmitry, and Alexandre Dolgui. 2021. "A Digital Supply Chain Twin for Managing the Disruption Risks and Resilience in the Era of Industry 4.0." Production Planning & Control 32 (9): 775–788.
Ivanov, Dmitry, and Alexandre Dolgui. 2022. "The Shortage Economy and Its Implications for Supply Chain and Operations Management." International Journal of Production Research.
Ivanov, Dmitry, Alexandre Dolgui, and Boris Sokolov. 2012. "Applicability of Optimal Control Theory to Adaptive Supply Chain Planning and Scheduling." Annual Reviews in Control 36 (1): 73–84.
Ivanov, Dmitry, Alexandre Dolgui, and Boris Sokolov. 2022. "Cloud Supply Chain: Integrating Industry 4.0 and Digital Platforms in the "Supply Chain-as-a-Service"." Transportation Research Part E: Logistics and Transportation Review 160: 102676.
Ivanov, Dmitry, and Boris Sokolov. 2013. "Control and System-Theoretic Identification of the Supply Chain Dynamics Domain for Planning, Analysis and Adaptation of Performance Under Uncertainty." European Journal of Operational Research 224 (2): 313–323.
Ivanov, Dmitry, Christopher S. Tang, Alexandre Dolgui, Daria Battini, and Ajay Das. 2021. "Researchers' Perspectives on Industry 4.0: Multi-disciplinary Analysis and Opportunities for Operations Management." International Journal of Production Research 59 (7): 2055–2078.
Jackson, Ilya. 2020. "Neuroevolutionary Approach to Metamodeling of Production-Inventory Systems with Lost-Sales and Markovian Demand." In Lecture Notes in Networks and Systems, 90–99. Cham: Springer International Publishing.
Jiang, C. 2008. "Two-dimensional Learning Mechanisms for Alliance Members in Multi-agent Supply Chains." In Proceedings of the International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII 2008, Taipei, Taiwan, 524–527. Vol. 2.
Jiang, C., and Z. Sheng. 2009. "Case-Based Reinforcement Learning for Dynamic Inventory Control in a Multi-agent Supply-chain System." Expert Systems with Applications 36 (3 PART 2): 6520–6526.
Kaihara, T., and S. Fujii. 2008. "Supply Chain Management for Virtual Enterprises with Adaptive Multi-agent Mechanism." International Journal of Manufacturing Technology and Management 14 (3-4): 299–310.
Kegenbekov, Z., and I. Jackson. 2021. "Adaptive Supply Chain: Demand-Supply Synchronization Using Deep Reinforcement Learning." Algorithms 14 (8): 240.
Khan, Saif M., Alexander Mann, and Dahlia Peterson. 2021. The Semiconductor Supply Chain: Assessing National Competitiveness. Technical Report. Center for Security and Emerging Technology. Accessed 27 December 2021. https://cset.georgetown.edu/publication/the-semiconductor-supply-chain/.
Kim, T., R. U. Bilsel, and S. Kumara. 2008. "Supplier Selection in Dynamic Competitive Environments." International Journal of Services Operations and Informatics 3 (3-4): 283–293.
Kim, C. O., J. Jun, J. K. Baek, R. L. Smith, and Y. D. Kim. 2005. "Adaptive Inventory Control Models for Supply Chain Management." International Journal of Advanced Manufacturing Technology 26 (9-10): 1184–1192.
Kim, C. O., I.-H. Kwon, and J.-G. Baek. 2008. "Asynchronous Action-reward Learning for Nonstationary Serial Supply Chain Inventory Control." Applied Intelligence 28 (1): 1–16.
Ko, J. M., C. Kwak, Y. Cho, and C. O. Kim. 2011. "Adaptive Product Tracking in RFID-enabled Large-Scale Supply Chain." Expert Systems with Applications 38 (3): 1583–1590.
Krakovsky, Marina. 2016. "Reinforcement Renaissance." Communications of the ACM 59 (8): 12–14.
Kwon, I.-H., C. O. Kim, J. Jun, and J. H. Lee. 2008. "Case-Based Myopic Reinforcement Learning for Satisfying Target Service Level in Supply Chain." Expert Systems with Applications 35 (1-2): 389–397.
Landry, Sylvain, and Martin Beaulieu. 2013. "The Challenges of Hospital Supply Chain Management, From Central Stores to Nursing Units." In Handbook of Healthcare Operations Management: Methods and Applications, edited by Brian T. Denton, 465–482. New York: Springer.
Lee, Y. S., and R. Sikora. 2019. "Application of Adaptive Strategy for Supply Chain Agent." Information Systems and E-Business Management 17 (1): 117–157.
Li, J., P. Guo, and Z. Zuo. 2008. "Inventory Control Model for Mobile Supply Chain Management." In Proceedings – The 2008 International Conference on Embedded Software and Systems Symposia, ICESS Symposia, Chengdu, China, 459–463.
Li, X., W. Luo, M. Yuan, J. Wang, J. Lu, J. Wang, J. Lu, and J. Zeng. 2021. "Learning to Optimize Industry-Scale Dynamic Pickup and Delivery Problems." In Proceedings – International Conference on Data Engineering, Chania, Greece, 2511–2522.
Li, H., T. Pang, Y. Wu, and G. Jiang. 2014. "Conflict Resolution of Production-marketing Collaborative Planning Based on Multi-agent Self-adaptation Negotiation." In ICAART 2014 – Proceedings of the 6th International Conference on Agents and Artificial Intelligence, Angers, France, 209–214. Vol. 2.
Li, Y., and J.-M. Zhao. 2006. "Applying Adaptive Multi-agent Modeling in Agile Supply Chain Simulation." In Proceedings of the 2006 International Conference on Machine Learning and Cybernetics, Dalian, China, 4191–4196.
Li, C. Y., S. H. Zhao, T. W. Zhang, and X. T. Wang. 2015. "Reinforcement Learning of Fuzzy Joint Replenishment Problem in Supply Chain." In Electronic Engineering and Information Science – Proceedings of the 2015 International Conference on Electronic Engineering and Information Science, ICEEIS 2015, Harbin, China, 779–782.
Lin, F.-R., and Y.-H. Pai. 2000. "Using Multi-agent Simulation and Learning to Design New Business Processes." IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans 30 (3): 380–384.
Liu, C. 2020. "Outsourcing Strategies for Manufacturers Facing Reputation Oriented Consumers." In Proceedings – 2020 Chinese Automation Congress, CAC 2020, Shanghai, China, 1980–1985.
Liu, H., E. Howley, and J. Duggan. 2011. "An Agent-Based Simulation of the Effects of Consumer Behavior on Market Price Dynamics." In Proceedings of the IASTED International Conference on Applied Simulation and Modelling, ASM 2011, Crete, Greece, 316–325.
Lu, W., H. Tan, X. Yan, and C. Lv. 2021. "Supply Chain Scheduling Using Double Deep Time-Series Differential Neural Network." In 5th International Workshop on Advances in Energy Science and Environment Engineering (AESEE 2021), E3S Web of Conferences 257, Xiamen, China.
MacCarthy, Bart L., and Dmitry Ivanov. 2022a. The Digital Supply Chain. Amsterdam: Elsevier.
MacCarthy, Bart L., and Dmitry Ivanov. 2022b. "The Digital Supply Chain–emergence, Concepts, Definitions, and Technologies." In The Digital Supply Chain, edited by Bart L. MacCarthy and Dmitry Ivanov, 3–24. Amsterdam: Elsevier.
Mahapatra, Bandana, and Srikanta Patnaik. 2018. "Ant Colony Optimization." In Advances in Swarm Intelligence for Optimizing Problems in Computer Science, 79–114. 1st ed. Boca Raton, FL: Chapman and Hall/CRC.
Makridis, G., P. Mavrepis, D. Kyriazis, I. Polychronou, and S. Kaloudis. 2020. "Enhanced Food Safety Through Deep Learning for Food Recalls Prediction." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12323 LNAI, 566–580. Cham: Springer.
Marandi, F., and S. M. T. Fatemi Ghomi. 2019. "Network Configuration Multi-Factory Scheduling with Batch Delivery: A Learning-oriented Simulated Annealing Approach." Computers and Industrial Engineering 132: 293–310.
Mehta, D., and D. Yamparala. 2014. "Policy Gradient Reinforcement Learning for Solving Supply-chain Management Problems." In I-CARE 2014: Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE), Bangalore, India, 1–4.
Meisheri, H., N. N. Sultana, M. Baranwal, V. Baniwal, S. Nath, S. Verma, B. Ravindran, and H. Khadilkar. 2021. "Scalable Multi-product Inventory Control with Lead Time Constraints Using Reinforcement Learning." Neural Computing and Applications 34: 1735–1757.
Mezouar, H., and A. El Afia. 2019. "A 4-level Reference for Self-adaptive Processes Based on SCOR and Integrating Q-Learning." In BDIoT'19: Proceedings of the 4th International Conference on Big Data and Internet of Things, Rabat, Morocco, 1–5.
Michie, Donald, and R. A. Chambers. 1968. "BOXES: An Experiment in Adaptive Control." In Machine Intelligence, edited by E. Dale and D. Michie. Edinburgh, UK: Oliver and Boyd.
Minsky, Marvin. 1961. "Steps Toward Artificial Intelligence." Proceedings of the IRE 49 (1): 8–30.
Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. "Asynchronous Methods for Deep Reinforcement Learning." In Proceedings of the 33rd International Conference on International Conference on Machine Learning – Volume 48, ICML'16, 1928–1937. JMLR.org.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. "Playing Atari with Deep Reinforcement Learning."
Mönch, Lars, Reha Uzsoy, and John W. Fowler. 2018. "A Survey of Semiconductor Supply Chain Models Part I: Semiconductor Supply Chains, Strategic Network Design, and Supply Chain Simulation." International Journal of Production Research 56 (13): 4524–4545.
Mongeon, Philippe, and Adèle Paul-Hus. 2015. "The Journal Coverage of Web of Science and Scopus: A Comparative Analysis." Scientometrics 106: 213–228.
Moriarty, David E., Alan C. Schultz, and John J. Grefenstette. 1999. "Evolutionary Algorithms for Reinforcement Learning." Journal of Artificial Intelligence Research 11: 241–276.
Mortazavi, A., A. Arshadi Khamseh, and P. Azimi. 2015. "Designing of An Intelligent Self-adaptive Model for Supply Chain Ordering Management System." Engineering Applications of Artificial Intelligence 37: 207–220.
Nanduri, V., and I. Saavedra-Antolínez. 2013. "A Competitive Markov Decision Process Model for the Energy-water-climate Change Nexus." Applied Energy 111: 186–198.
Peng, Z., Y. Zhang, Y. Feng, T. Zhang, Z. Wu, and H. Su. 2019. "Deep Reinforcement Learning Approach for Capacitated Supply Chain Optimization Under Demand Uncertainty." In Proceedings – 2019 Chinese Automation Congress, CAC 2019, Hangzhou, China, 3512–3517.
Perez, H. D., C. D. Hubbs, C. Li, and I. E. Grossmann. 2021. "Algorithmic Approaches to Inventory Management Optimization." Processes 9 (1): 1–17.
Pontrandolfo, P., A. Gosavi, O. G. Okogbaa, and T. K. Das. 2002. "Global Supply Chain Management: A Reinforcement Learning Approach." International Journal of Production Research 40 (6): 1299–1317.
Puskás, E., Á. Budai, and G. Bohács. 2020. "Optimization of a Physical Internet Based Supply Chain Using Reinforcement Learning." European Transport Research Review 12 (1): 47.
Rao, J. J., K. K. Ravulapati, and T. K. Das. 2003. "A Simulation-Based Approach to Study Stochastic Inventory-planning Games." International Journal of Systems Science 34 (12-13): 717–730.
Ravulapati, K. K., J. Rao, and T. K. Das. 2004. "A Reinforcement Learning Approach to Stochastic Business Games." IIE Transactions (Institute of Industrial Engineers) 36 (4): 373–385.
Reeder, J., G. Sukthankar, M. Georgiopoulos, and G. Anagnostopoulos. 2008. "Intelligent Trading Agents for Massively Multi-player Game Economies." In Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference, AIIDE 2008, Stanford, California, 102–107.
Reindorp, M. J., and M. C. Fu. 2011. "Dynamic Lead Time Promising." In IEEE SSCI 2011: Symposium Series on Computational Intelligence – ADPRL 2011: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Paris, France, 176–183.
Ren, C., Y. Chai, and Y. Liu. 2002. "Agile Supply Chain Simulation Using Adaptive Multi-agent Modeling." In Proceedings of Asian Simulation Conference; System Simulation and Scientific Computing, Shanghai, China, 752–756. Vol. 2.
Ruel, Salomée, Jamal El Baz, Dmitry Ivanov, and Ajay Das. 2021. "Supply Chain Viability: Conceptualization, Measurement, and Nomological Validation." Annals of Operations Research 1–30.
Rummery, Gavin Adrian, and Mahesan Niranjan. 1994. "On-line Q-learning Using Connectionist Systems."
Saitoh, F., and A. Utani. 2013. "Coordinated Rule Acquisition of Decision Making on Supply Chain by Exploitation-oriented Reinforcement Learning." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8131 LNCS, 537–544. Berlin, Heidelberg: Springer.
Sakurai, Yoshitaka, Kouhei Takada, Takashi Kawabe, and Setsuo Tsuruta. 2010. "A Method to Control Parameters of Evolutionary Algorithms by Using Reinforcement Learning." In 2010 Sixth International Conference on Signal-image Technology and Internet Based Systems, 74–79. IEEE.
Schroeder, Meike, Birgit von See, Johannes Schnelle, and Wolfgang Kersten. 2021. "Impact of the Covid-19 Pandemic on Supply Chain Management." In Logistik in Wissenschaft und Praxis, edited by Roy Fritzsche, Stefan Winter, and Jacob Lohmer, 3–24. Wiesbaden: Springer Gabler.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. "Proximal Policy Optimization Algorithms."
Serrano, J. C., J. Mula, and R. Poler. 2021. "Digital Twin for Supply Chain Master Planning in Zero-Defect Manufacturing." IFIP Advances in Information and Communication Technology 626: 102–111.
Sheffi, Yossi. 2020. The New (ab)normal: Reshaping Business and Supply Chain Strategy Beyond Covid-19. Cambridge (Mass.): MIT CTL Media. OCLC: 1258319911.
Sheffi, Yossi. 2021. A Shot in the Arm: How Science, Engineering, and Supply Chains Converged to Vaccinate the World. Cambridge: MIT CTL Media.
Sheremetov, L., and L. Rocha-Mier. 2004. "Collective Intelligence As a Framework for Supply Chain Management." In 2004 2nd International IEEE Conference 'Intelligent Systems' – Proceedings, Varna, Bulgaria, 417–422. Vol. 2.
Sheremetov, L., and L. Rocha-Mier. 2008. "Supply Chain Network Optimization Based on Collective Intelligence and Agent Technologies." Human Systems Management 27 (1): 31–47.
Sheremetov, L., L. Rocha-Mier, and I. Batyrshin. 2005. "Towards a Multi-agent Dynamic Supply Chain Simulator for Analysis and Decision Support." In Annual Conference of the North American Fuzzy Information Processing Society – NAFIPS, Detroit, Michigan, USA, 263–268.
Sianaki, O. A., A. Yousefi, A. R. Tabesh, and M. Mahdavi. 2019. "Machine Learning Applications: The Past and Current Research Trend in Diverse Industries." Inventions 4 (1): 8.
Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, and Marc Lanctot, et al. 2017. "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm."
Simsek, B., S. Albayrak, and A. Korth. 2004. "Reinforcement Learning for Procurement Agents of the Factory of the Future." In Proceedings of the 2004 Congress on Evolutionary Computation, CEC2004, Portland, Oregon, USA, 1331–1337. Vol. 2.
Singi, S., S. Gopal, S. Auti, and R. Chaurasia. 2020. "Reinforcement Learning for Inventory Management." In Proceedings of International Conference on Intelligent Manufacturing and Automation, Lecture Notes in Mechanical Engineering, 317–326. Singapore: Springer.
Snyder, Hannah. 2019. "Literature Review As a Research Methodology: An Overview and Guidelines." Journal of Business Research 104: 333–339.
Sterman, John D. 1992. "Teaching Takes Off." OR/MS Today 35 (3): 40–44.
Sui, Z., A. Gosavi, and L. Lin. 2010. "A Reinforcement Learning Approach for Inventory Replenishment in Vendor-managed Inventory Systems with Consignment Inventory." EMJ – Engineering Management Journal 22 (4): 44–53.
Sun, R., and G. Zhao. 2012. "Analyses About Efficiency of Reinforcement Learning to Supply Chain Ordering Management." In IEEE International Conference on Industrial Informatics (INDIN), Beijing, China, 124–127.
Sun, R., G. Zhao, C. Li, and S. Tatsumi. 2006. "The Improvement on Reinforcement Learning for SCM by the Agent Policy Mapping." In IECON Proceedings (Industrial Electronics Conference), Paris, France, 3585–3590.
Sun, R., G. Zhao, and C. Yin. 2010. "A Multi-agent Coordination of a Supply Chain Ordering Management with Multiple Members Using Reinforcement Learning." In IEEE International Conference on Industrial Informatics (INDIN), Osaka, Japan, 612–616.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press.
Sutton, Richard S., David McAllester, Satinder Singh, and Yishay Mansour. 1999. "Policy Gradient Methods for Reinforcement Learning with Function Approximation." In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, 1057–1063. Cambridge, MA, USA: MIT Press.
Tae, I. K., R. U. Bilsel, and S. R. T. Kumara. 2007. "A Reinforcement Learning Approach for Dynamic Supplier Selection." In 2007 IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI, Philadelphia, Pennsylvania, USA, 1–6.
Tang, K., and S. R. T. Kumara. 2005. "Cooperation in a Multi-Stage Game for Modeling Distributed Task Delegation in a Supply Chain Procurement Problem." In Proceedings of the 2005 IEEE Conference on Automation Science and Engineering, IEEE-CASE 2005, Edmonton, Alberta, Canada, 93–98.
Tariq Afridi, M., S. Nieto-Isaza, H. Ehm, T. Ponsignon, and A. Hamed. 2020. "A Deep Reinforcement Learning Approach for Optimal Replenishment Policy in A Vendor Managed Inventory Setting for Semiconductors." In Proceedings – Winter Simulation Conference, Orlando, Florida, USA, 1753–1764.
Thorndike, Edward L. 1898. "Animal Intelligence: An Experimental Study of the Associative Processes in Animals." The Psychological Review: Monograph Supplements 2 (4): i–109.
UNO. 2008. "International Standard Industrial Classification of All Economic Activities (ISIC), Rev.4."
Valluri, A., M. J. North, and C. M. Macal. 2009. "Reinforcement Learning in Supply Chains." International Journal of Neural Systems 19 (5): 331–344.
van Eck, Nees Jan, and Ludo Waltman. 2009. "Software Survey: VOSviewer, a Computer Program for Bibliometric Mapping." Scientometrics 84 (2): 523–538.
Van Tongeren, T., U. Kaymak, D. Naso, and E. Van Asperen. 2007. "Q-learning in a Competitive Supply Chain." In Conference Proceedings – IEEE International Conference on Systems, Man and Cybernetics, Montreal, Quebec, Canada, 1211–1216.
Vanvuchelen, N., J. Gijsbrechts, and R. Boute. 2020. "Use of Proximal Policy Optimization for the Joint Replenishment Problem." Computers in Industry 119: 103239.
Vinitsky, Eugene, Nathan Lichtle, Kanaad Parvate, and Alexandre Bayen. 2020. "Optimizing Mixed Autonomy Traffic Flow with Decentralized Autonomous Vehicles and Multi-agent RL."
Wang, Xun, and Stephen M. Disney. 2016. "The Bullwhip Effect: Progress, Trends and Directions." European Journal of Operational Research 250 (3): 691–701.
Wang, F., and L. Lin. 2021. "Spare Parts Supply Chain Network Modeling Based on a Novel Scale-Free Network and Replenishment Path Optimization with Q Learning." Computers and Industrial Engineering 157: 107312.
Wang, Hao, Jiaqi Tao, Tao Peng, Alexandra Brintrup, Edward Elson Kosasih, Yuqian Lu, Renzhong Tang, and Luoke Hu. 2022. "Dynamic Inventory Replenishment Strategy for Aerospace Manufacturing Supply Chain: Combining Reinforcement Learning and Multi-agent Simulation." International Journal of Production Research 60 (13): 1–20.
Ware, Mark, and Michael Mabe. 2015. The STM Report: An Overview of Scientific and Scholarly Journal Publishing. Technical Report. International Association of Scientific, Technical and Medical Publishers.
Watkins, Christopher J. C. H., and Peter Dayan. 1992. "Q-learning." Machine Learning 8 (3-4): 279–292.
Wong, Geoff, Trisha Greenhalgh, Gill Westhorp, Jeanette Buckingham, and Ray Pawson. 2013. "RAMESES Publication Standards: Meta-narrative Reviews." BMC Medicine 11: 20.
Xiang, L. 2020. "Energy Emergency Supply Chain Collaboration Optimization with Group Consensus Through Reinforcement Learning Considering Non-cooperative Behaviours." Energy 210: 118597.
Xu, J., J. Zhang, and Y. Liu. 2009. "An Adaptive Inventory Control for a Supply Chain." In 2009 Chinese Control and Decision Conference, CCDC 2009, Guilin, China, 5714–5719.
Yang, S., Y. Ogawa, K. Ikeuchi, Y. Akiyama, and R. Shibasaki. 2019. "Firm-level Behavior Control After Large-Scale Urban Flooding Using Multi-agent Deep Reinforcement Learning." In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoSpatial Simulation, GeoSim 2019, Chicago, Illinois, USA, 24–27.
Yang, S., and J. Zhang. 2015. "Adaptive Inventory Control and Bullwhip Effect Analysis for Supply Chains with Non-Stationary Demand." In Proceedings of the 2015 27th Chinese Control and Decision Conference, CCDC 2015, Qingdao, China, 3903–3908.
Zarandi, M. H. F., S. V. Moosavi, and M. Zarinbal. 2013. "A Fuzzy Reinforcement Learning Algorithm for Inventory Control in Supply Chains." International Journal of Advanced Manufacturing Technology 65 (1-4): 557–569.
Zhang, Y., and S. Bhattacharyya. 2007. "Effectiveness of Q-learning As a Tool for Calibrating Agent-Based Supply Network Models." Enterprise Information Systems 1 (2): 217–233.
Zhang, Gongtao, Bart L. MacCarthy, and Dmitry Ivanov. 2022. "The Cloud, Platforms, and Digital Twins–Enablers of the Digital Supply Chain." In The Digital Supply Chain, edited by Bart L. MacCarthy and Dmitry Ivanov, 77–99. Amsterdam: Elsevier.
Zhang, Daniel, Saurabh Mishra, Erik Brynjolfsson, John Etchemendy, Deep Ganguli, Barbara Grosz, and Terah Lyons, et al. 2021. "The AI Index 2021 Annual Report."
Zhang, K., J. Xu, and J. Zhang. 2013. "A New Adaptive Inventory Control Method for Supply Chains with Non-Stationary Demand." In 2013 25th Chinese Control and Decision Conference, CCDC 2013, Guiyang, China, 1034–1038.
Zhang, L., Y. Yin, L. Feng, and H. Fan. 2019. "Unsalable Risk Prediction of Fruit and Vegetable Agricultural Products Based on Markov Decision Process." In Proceedings of the 8th International Conference on Logistics and Systems Engineering 2018, Changsha City, China, 206–217.
Zhanguo, X. 2008. "Research on Refining the Distributed Supply Chain Procurement Plans Based on CRL." In Proceedings – International Symposium on Information Processing, ISIP 2008 and International Pacific Workshop on Web Mining and Web-Based Application, WMWA 2008, Moscow, Russia, 119–122.
Zhao, Y., E. Hemberg, N. Derbinsky, G. Mata, and U.-M. O'Reilly. 2021a. "Simulating a Logistics Enterprise Using An Asymmetrical Wargame Simulation with Soar Reinforcement Learning and Coevolutionary Algorithms." In GECCO 2021 Companion – Proceedings of the 2021 Genetic and Evolutionary Computation Conference Companion, Lille, France, 1907–1915.
Zhao, H.-P., J.-D. Jiang, and Y.-C. Feng. 2010. "Coordinating Supply Chain of Stackelberg Game Model Based on Evolutionary Game with GA-RL." Xitong Gongcheng Lilun yu Shijian/System Engineering Theory and Practice 30 (4): 667–672.
Zhao, Long, and Zemin Liu. 1996. "A Genetic Algorithm for Reinforcement Learning." In Proceedings of International Conference on Neural Networks (ICNN'96), Washington, DC, USA, 1056–1060.
Zhao, F., L. Zhang, J. Cao, and J. Tang. 2021b. "A Cooperative Water Wave Optimization Algorithm with Reinforcement Learning for the Distributed Assembly No-idle Flowshop Scheduling Problem." Computers and Industrial Engineering 153: 107082.
Zhou, J., M. Purvis, and Y. Muhammad. 2016. "A Combined Modelling Approach for Multi-Agent Collaborative Planning in Global Supply Chains." In Proceedings – 2015 8th International Symposium on Computational Intelligence and Design, ISCID 2015, Hangzhou, China, 592–597. Vol. 1.
Zhou, J., and X. Zhou. 2019. "Multi-Echelon Inventory Optimizations for Divergent Networks by Combining Deep Reinforcement Learning and Heuristics Improvement." In Proceedings – 2019 12th International Symposium on Computational Intelligence and Design, ISCID 2019, Hangzhou, China, 69–73. Vol. 1.
Zhu, Z., J. Ke, and H. Wang. 2021. "A Mean-Field Markov Decision Process Model for Spatial-Temporal Subsidies in Ride-Sourcing Markets." Transportation Research Part B: Methodological 150: 540–565.
Zwaida, T. A., C. Pham, and Y. Beauregard. 2021. "Optimization of Inventory Management to Prevent Drug Shortages in the Hospital Supply Chain." Applied Sciences (Switzerland) 11 (6): 2726.