

Publisher: Institute for Operations Research and the Management Sciences (INFORMS)
INFORMS is located in Maryland, USA

Manufacturing & Service Operations Management


Publication details, including instructions for authors and subscription information:
http://pubsonline.informs.org

A Deep Q-Network for the Beer Game: Deep Reinforcement Learning for Inventory Optimization
Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence V. Snyder, Martin Takáč

To cite this article:


Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence V. Snyder, Martin Takáč (2022) A Deep Q-Network for the Beer Game:
Deep Reinforcement Learning for Inventory Optimization. Manufacturing & Service Operations Management 24(1):285-304.
https://doi.org/10.1287/msom.2020.0939

Full terms and conditions of use: https://pubsonline.informs.org/Publications/Librarians-Portal/PubsOnLine-Terms-and-Conditions

This article may be used only for the purposes of research, teaching, and/or private study. Commercial use
or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher
approval, unless otherwise noted. For more information, contact permissions@informs.org.

The Publisher does not warrant or guarantee the article’s accuracy, completeness, merchantability, fitness
for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or
inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or
support of claims made of that product, publication, or service.

Copyright © 2021, INFORMS


With 12,500 members from nearly 90 countries, INFORMS is the largest international association of operations research (O.R.)
and analytics professionals and students. INFORMS provides unique networking and learning opportunities for individual
professionals, and organizations of all types and sizes, to better understand and use O.R. and analytics tools and methods to
transform strategic visions and achieve better outcomes.
For more information on INFORMS, its publications, membership, or meetings visit http://www.informs.org
MANUFACTURING & SERVICE OPERATIONS MANAGEMENT
Vol. 24, No. 1, January–February 2022, pp. 285–304
http://pubsonline.informs.org/journal/msom ISSN 1523-4614 (print), ISSN 1526-5498 (online)

A Deep Q-Network for the Beer Game: Deep Reinforcement Learning for Inventory Optimization

Afshin Oroojlooyjadid,a MohammadReza Nazari,a Lawrence V. Snyder,a Martin Takáča
a Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, Pennsylvania 18015
Contact: oroojlooy@gmail.com, https://orcid.org/0000-0001-7829-6145 (AO); mon314@lehigh.edu, https://orcid.org/0000-0002-7575-6289 (MN); larry.snyder@lehigh.edu, https://orcid.org/0000-0002-2227-7030 (LVS); takac.mt@gmail.com, https://orcid.org/0000-0001-7455-2025 (MT)

Received: February 2, 2020
Revised: June 18, 2020
Accepted: June 25, 2020
Published Online in Articles in Advance: February 23, 2021
https://doi.org/10.1287/msom.2020.0939
Copyright: © 2021 INFORMS

Abstract. Problem definition: The beer game is widely used in supply chain management classes to demonstrate the bullwhip effect and the importance of supply chain coordination. The game is a decentralized, multiagent, cooperative problem that can be modeled as a serial supply chain network in which agents choose order quantities while cooperatively attempting to minimize the network’s total cost, although each agent only observes local information. Academic/practical relevance: Under some conditions, a base-stock replenishment policy is optimal. However, in a decentralized supply chain in which some agents act irrationally, there is no known optimal policy for an agent wishing to act optimally. Methodology: We propose a deep reinforcement learning (RL) algorithm to play the beer game. Our algorithm makes no assumptions about costs or other settings. As with any deep RL algorithm, training is computationally intensive, but once trained, the algorithm executes in real time. We propose a transfer-learning approach so that training performed for one agent can be adapted quickly for other agents and settings. Results: When playing with teammates who follow a base-stock policy, our algorithm obtains near-optimal order quantities. More important, it performs significantly better than a base-stock policy when other agents use a more realistic model of human ordering behavior. We observe similar results using a real-world data set. Sensitivity analysis shows that a trained model is robust to changes in the cost coefficients. Finally, applying transfer learning reduces the training time by one order of magnitude. Managerial implications: This paper shows how artificial intelligence can be applied to inventory optimization. Our approach can be extended to other supply chain optimization problems, especially those in which supply chain partners act in irrational or unpredictable ways. Our RL agent has been integrated into a new online beer game, which has been played more than 17,000 times by more than 4,000 people.

Funding: This work was supported by National Science Foundation Extreme Science and Engineering Discovery Environment (XSEDE) [Grants DDM180004 and IRI180020], Division of Civil, Mechanical and Manufacturing Innovation [Grant 1663256], and Division of Computing and Communication Foundations [Grants 1618717 and 1740796].
Supplemental Material: The online appendices are available at https://doi.org/10.1287/msom.2020.0939.

Keywords: inventory optimization • reinforcement learning • beer game

1. Introduction

The beer game consists of a serial supply chain network with four agents—a retailer, a warehouse, a distributor, and a manufacturer—that must make independent replenishment decisions with limited information. The game is widely used in classroom settings to demonstrate the bullwhip effect, a phenomenon in which order variability increases as one moves upstream in the supply chain, as well as the importance of communication and coordination in the supply chain. The bullwhip effect occurs for a number of reasons, some rational (Lee et al. 1997) and some behavioral (Sterman 1989). It is an inadvertent outcome that emerges when the players try to achieve the stated purpose of the game, which is to minimize costs. In this paper, we are interested not in the bullwhip effect but in the stated purpose, that is, the minimization of supply chain costs, which underlies the decision making in every real-world supply chain. For general discussions of the bullwhip effect, see Lee et al. (2004), Geary et al. (2006), and Snyder and Shen (2019).
The agents in the beer game are arranged sequentially and numbered from one (retailer) to four (manufacturer) (Figure 1). The retailer node faces stochastic demands from its customer, and the manufacturer node has an unlimited source of supply.
There are deterministic transportation lead times (l^tr) imposed on the flow of product from upstream to downstream, although the actual lead time is stochastic because of stockouts upstream; there are also deterministic information lead times (l^in) on the flow of information from downstream to upstream (replenishment orders). Each agent may have nonzero shortage and holding costs.

Figure 1. (Color online) Generic View of the Beer Game Network

In each period of the game, each agent chooses an order quantity q to submit to its predecessor (supplier) in an attempt to minimize the long-run system-wide costs:

\sum_{t=1}^{T} \sum_{i=1}^{4} \left( c_h^i (IL_t^i)^+ + c_p^i (IL_t^i)^- \right),   (1)

where i is the index of the agents, t = 1, . . . , T is the index of the time periods, T is the time horizon of the game (which is often unknown to the players), c_h^i and c_p^i are the holding and shortage cost coefficients, respectively, of agent i, and IL_t^i is the inventory level of agent i in period t. If IL_t^i > 0, then the agent has inventory on hand, and if IL_t^i < 0, then it has backorders. The notation x^+ and x^- denotes max{0, x} and max{0, −x}, respectively.
The standard rules of the beer game dictate that the agents may not communicate in any way and that they do not share any local inventory statistics or cost information with other agents until the end of the game, at which time all agents are made aware of the system-wide cost. As a result, according to the categorization by Claus and Boutilier (1998), the beer game is a decentralized, independent-learners (ILs), multiagent cooperative problem.
The beer game assumes that the agents incur holding and stockout costs but not fixed ordering costs, and therefore, the optimal inventory policy is a base-stock policy in which each stage orders a sufficient quantity to bring its inventory position (on-hand plus on-order inventory minus backorders) equal to a fixed number, called its base-stock level (Clark and Scarf 1960). When there are no stockout costs at the nonretailer stages, that is, c_p^i = 0, i ∈ {2, 3, 4}, the well-known algorithm by Clark and Scarf (1960) provides the optimal base-stock levels. To the best of our knowledge, there is no algorithm to find the optimal base-stock levels for general stockout-cost structures. More significantly, when some agents do not follow a base-stock or other rational policy (as is typical in the beer game), the form and parameters of the optimal policy that a given agent should follow are unknown.
In this paper, we propose a data-driven algorithm that is an extension of deep Q-networks (DQNs; Mnih et al. 2015) to solve this problem. Because reward shaping is a key part of our approach, we refer to our algorithm as the shaped-reward DQN (SRDQN) algorithm. The SRDQN algorithm is customized for the beer game, but we view it also as a proof-of-concept that reinforcement learning (RL) can be used to solve messier, more complicated supply chain problems than those typically analyzed in the literature.
As with most machine learning (ML) algorithms, the training of the SRDQN algorithm can be slow, and our approach needs training data or the probability distribution of the demand. Moreover, our approach offers no theoretical guarantee of global optimality. By contrast, once trained, our algorithm executes nearly instantaneously, attains very good cost performance, and (as we show) can be trained using small data sets. Moreover, our method is quite flexible and can be applied even in settings for which optimal inventory policies are unknown.
Our contributions can be summarized as follows (see Section 2.3 for further discussion):
• We present the first RL-based algorithm for the beer game that observes the game’s typical rules, in particular decentralized information and decision making.
• By introducing reward shaping, we adapt the classical DQN algorithm, which applies to competitive zero-sum games, to enable it to play the beer game, which is a cooperative non-zero-sum game.
• SRDQN is a data-driven algorithm and does not need knowledge of the demand distribution.
• We demonstrate numerically that the SRDQN agent is capable of learning a near-optimal policy when its coplayers (teammates) act rationally, that is, follow a base-stock policy.
• More important, when the coplayers are irrational, that is, do not follow a base-stock policy—which mimics how humans play the beer game and is also common in actual supply chains—we show that the SRDQN agent obtains much lower costs than an agent that uses a base-stock policy, even with optimized base-stock levels.
• This suggests that a base-stock policy is not optimal for a stage in a serial system with non-base-stock coplayers, which is, to the best of our knowledge, a new result in the literature.
• We show that the SRDQN agent can be sufficiently trained based on relatively few (in our case 100) demand observations.
• We show that the trained model is quite robust against changes in the shortage and holding costs.
Although it is outside the scope of this paper, our hope is to learn something about the structure of the optimal inventory policy with irrational coplayers by examining how the SRDQN agent plays, much as human players are learning new chess strategies by studying the way deep RL chess agents such as AlphaZero play the game (Chan 2017).
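To make the objective in Equation (1) concrete, the following is a minimal sketch—our own illustration, not the authors' simulator—of how the per-period holding and shortage costs of the four agents could be computed from their inventory levels. The cost coefficients shown are the example values used later in Section 4.1.

```python
# Hypothetical illustration of the per-period cost inside Equation (1); not the paper's code.
c_h = [2.0, 2.0, 2.0, 2.0]   # holding cost coefficients c_h^i for agents 1..4 (Section 4.1 values)
c_p = [2.0, 0.0, 0.0, 0.0]   # shortage cost coefficients c_p^i for agents 1..4 (Section 4.1 values)

def period_cost(inventory_levels):
    """Cost incurred in one period, summed over the four agents.

    inventory_levels[i] is IL_t^i: positive means on-hand stock, negative means backorders.
    """
    total = 0.0
    for i, il in enumerate(inventory_levels):
        on_hand = max(0.0, il)      # (IL_t^i)^+
        backorder = max(0.0, -il)   # (IL_t^i)^-
        total += c_h[i] * on_hand + c_p[i] * backorder
    return total

# The game's total cost in Equation (1) is the sum of period_cost over t = 1..T:
# total_cost = sum(period_cost(IL[t]) for t in range(T))
```

Note that only the retailer incurs a shortage cost under these example coefficients; the game's objective sums this quantity over all periods and all four agents, which is exactly what no single agent can observe during play.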
The remainder of this paper is as follows. Section 2 provides a brief summary of the relevant literature and our contributions to it. The details of the algorithm are introduced in Section 3. Section 4 provides numerical experiments, and Section 5 concludes the paper.

2. Literature Review

2.1. Current State of Art
The beer game consists of a serial supply chain network. Under the conditions dictated by the game, a base-stock policy is optimal at each stage (Lee et al. 1997). If the demand process and costs are stationary, then so are the optimal base-stock levels, which implies that in each period (except the first), each stage simply orders from its supplier exactly the amount that was demanded from it. If the customer demands are independent and identically distributed (i.i.d.) random and if backorder costs are incurred only at stage 1, then the optimal base-stock levels can be found using the exact algorithm by Clark and Scarf (1960) (see also Chen and Zheng 1994, Gallego and Zipkin 1999). Nevertheless, for more general serial systems, neither the form nor the parameters of the optimal policy are known. Moreover, even in systems for which a base-stock policy is optimal, such a policy may no longer be optimal for a given agent if the other agents do not follow it.
There is substantial literature on the beer game and the bullwhip effect. We review some of that literature here, considering both independent learners (ILs) and joint-action learners (JALs; Claus and Boutilier 1998). (ILs have no information about the other agents, whereas JALs may share some information and take central decisions.) Martinez-Moyano et al. (2014) provide a thorough history of the beer game.
In the category of ILs, Sterman (1989) proposes a formula (which we call the Sterman formula) to determine the order quantity based on the current backlog of orders, on-hand inventory, incoming and outgoing shipments, incoming orders, and expected demand. In a nutshell, the Sterman formula attempts to model the way human players over- or underreact to situations they observe in the supply chain, such as shortages or excess inventory. Sterman’s formula is not an attempt to optimize the order quantities in the beer game; rather, it is intended to model typical human behavior. Sterman (1989) uses a version of the game in which the demand is four for the first four periods and eight after that. Hereafter, we refer to this demand process as C(4, 8), or the classic demand process. Also, Sterman (1989) does not allow the players to be aware of the demand process.
The optimization method described in the first paragraph of this section assumes that every agent follows a base-stock policy. The hallmark of the beer game, however, is that players do not tend to follow such a policy or any policy. Often their behavior is quite irrational, as shown by Sterman (1989). There is comparatively little literature on how a given agent should optimize its inventory decisions when the other agents do not play rationally (Sterman 1989, Strozzi et al. 2007)—that is, how an individual player can best play the beer game when his or her teammates may not be making optimal decisions—and there are, to the best of our knowledge, no theoretical proofs of policy optimality in this setting.
Some of the beer game literature assumes the agents are JALs; that is, information about inventory positions is shared among all agents, a significant difference from classical IL models. For example, Kimbrough et al. (2002) propose a genetic algorithm (GA) that receives a current snapshot of each agent and decides how much to order according to the d + x rule. In the d + x rule, agent i observes d_t^i, the received demand/order in period t; chooses x_t^i; and then places an order of size a_t^i = d_t^i + x_t^i.
In the literature on multiechelon inventory problems, there are only a few papers that use RL. Among them, Jiang and Sheng (2009) propose an RL approach for a two-echelon serial system in which the retailer has to optimize (r, Q) or (T, S) policy parameters. The only decision in the model is for the retailer, so it is not truly a multiechelon problem. Their state consists of the IL and the demand, and they discretize and truncate the state and action values to avoid the curse of dimensionality. Gijsbrechts et al. (2019) solve a one-warehouse, multiretailer (OWMR) problem (in addition to two difficult single-echelon problems) using RL. They use the asynchronous advantage actor-critic (A3C) algorithm (Mnih et al. 2016) to solve the problem with promising results compared with classical methods. Their model considers a different supply chain structure from ours (OWMR versus serial) and uses a different RL methodology.
We are aware of two papers that propose RL algorithms to play versions of the beer game. Giannoccaro and Pontrandolfo (2002) consider a beer game with three agents with stochastic shipment lead times and stochastic demand. They propose a classical tabular RL algorithm to make decisions, in which the state variable is defined as the three inventory positions, each of which is discretized into 10 intervals. The agents may use any actions in the integers on [0, 30]. Chaharsooghi et al. (2008) consider the same game and solution approach except with four agents and a fixed length of 35 periods for each game. In their proposed RL, the state variable is the four inventory positions, which are each discretized into nine intervals. Moreover, their RL algorithm uses the d + x rule to determine the order quantity, with x restricted to be in {0, 1, 2, 3}. Both these papers assume that real-time information is shared among agents and that there is a centralized decision-making agent, which are significant departures from the typical beer game setting.
The setting considered in these two papers can be modeled as a Markov decision process (MDP) that is relatively easy to solve. In contrast, in the typical beer game setting, and the setting we consider, each agent only observes local information, which leads to a much more complex MDP and a much more challenging setting for an RL algorithm. In addition, these papers use a tabular RL algorithm, and they assume (implicitly or explicitly) that the size of the state space is small, which is unrealistic in the beer game because the state variable representing a given agent’s inventory level can be any number in (−∞, +∞). Thus, they map the state variable onto a small number of tiles (Sutton and Barto 1998), which leads to a loss of valuable state information and therefore of accuracy of the solution. We do not use any mapping or cutting of the state variables, which leads to a more challenging problem but is capable of attaining better accuracy.
Another possible approach to tackle this problem might be classical supervised ML algorithms. However, these algorithms also cannot be readily applied to the beer game because there are no historical data in the form of correct input–output pairs. Thus, we cannot use a stand-alone support vector machine or deep neural network with a training data set and train it to learn the best action (as in the approach used by Oroojlooyjadid et al. 2017, 2020; Ban and Rudin 2019; and Ban et al. 2019 to solve simpler supply chain problems such as newsvendor and dynamic procurement). Based on our understanding of the literature, there is a large gap between solving the beer game problem effectively and what the current algorithms can handle. In order to fill this gap, we propose a variant of the DQN algorithm, SRDQN, to choose the order quantities in the beer game.

2.2. Reinforcement Learning
RL (Sutton and Barto 1998) is an area of ML that has been successfully applied to solve complex sequential decision problems. RL is concerned with the question of how an agent should choose an action to maximize a cumulative reward. RL is a popular tool in telecommunications, robot control, and game playing, to name a few (see Li 2017 for more applications and details).
RL considers an agent that interacts with an environment. In each time step t, the agent observes the current state of the system s_t ∈ S (where S is the set of possible states), chooses an action a_t ∈ A(s_t) (where A(s_t) is the set of possible actions when the system is in state s_t), and gets reward r_t ∈ R, and then the system transitions randomly into state s_{t+1} ∈ S. This procedure is known as an MDP (Figure 2), and RL algorithms can be applied to solve this type of problem.

Figure 2. Generic Procedure for RL

The matrix P_a(s, s′), which is called the transition probability matrix, provides the probability of transitioning to state s′ when taking action a in state s; that is, P_a(s, s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a). Similarly, R_a(s, s′) defines the corresponding reward matrix. In each period t, the decision maker takes action a_t = π_t(s_t) according to a given policy, denoted by π_t. The goal of RL is to maximize the expected discounted sum of the rewards r_t when the system runs for an infinite horizon, that is, finding a policy π : S → A that maximizes E[∑_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1})], where 0 ≤ γ < 1 is the discount factor. For given P_a(s, s′) and R_a(s, s′), the optimal policy can be obtained through dynamic programming or linear programming (Sutton and Barto 1998).
Another approach for solving this problem is Q-learning, an RL algorithm that obtains the Q-value for any s ∈ S and a = π(s); that is,

Q(s, a) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a; \pi \right].

The Q-learning approach starts with an initial guess for Q(s, a) for all s and a and then proceeds to update them based on the iterative formula

Q(s_t, a_t) = (1 - \alpha_t) Q(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) \right), \quad \forall t = 1, 2, \ldots,   (2)

where α_t is the learning rate at time step t. In each observed state, the agent chooses an action through an ε-greedy algorithm with constantly decreasing ε to guarantee convergence to the optimal policy (see Online Appendix A for more details). After finding the optimal Q*, one can recover the optimal policy as π*(s) = arg max_a Q*(s, a).
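As an illustration of the update in Equation (2) and the ε-greedy action selection, here is a minimal tabular Q-learning sketch for a generic finite MDP. It is our own illustration, not the paper's algorithm (which replaces the table with a neural network); the env object with reset/step methods and an actions list is hypothetical.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, episodes=1000, gamma=0.99, alpha=0.1,
                       eps_start=0.9, eps_end=0.1):
    """Generic tabular Q-learning with a linearly decaying epsilon-greedy policy.

    `env` is assumed to expose reset() -> state, step(action) -> (next_state,
    reward, done), and a finite list env.actions. This only sketches
    Equation (2); the beer game's state space is far too large for a table.
    """
    Q = defaultdict(float)  # Q[(state, action)], implicitly initialized to 0
    for ep in range(episodes):
        eps = eps_start + (eps_end - eps_start) * ep / max(1, episodes - 1)
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:                      # explore
                action = random.choice(env.actions)
            else:                                          # exploit
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            # Equation (2): convex combination of the old estimate and the new target
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + \
                                 alpha * (reward + gamma * best_next)
            state = next_state
    return Q
```

The maximization convention above follows the generic reward-maximizing statement of Equation (2); in the beer game formulation used later, rewards are costs, so the corresponding operators become minimizations.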
Both of the algorithms discussed thus far (dynamic programming and Q-learning) guarantee that they will obtain the optimal policy. However, because of the curse of dimensionality, these approaches are not able to solve MDPs with large state or action spaces in reasonable amounts of time. Many problems of interest (including the beer game) have large state and/or action spaces. Moreover, in some settings (again, including the beer game), the decision maker cannot observe the full state variable. This case, which is known as a partially observed MDP (POMDP), makes the problem much harder to solve than MDPs.
In order to solve large POMDPs and avoid the curse of dimensionality, it is common to approximate the Q-values in the Q-learning algorithm (Sutton and Barto 1998). Linear regression is often used for this purpose (Li 2017); however, it is not powerful or accurate enough for our application. Nonlinear functions and neural network approximators are able to provide more accurate approximations; by contrast, they are known to provide unstable or even diverging Q-values because of issues related to nonstationarity and correlations in the sequence of observations. The seminal work of Mnih et al. (2015) solved these issues by proposing target networks and using experience replay memory. They proposed a deep Q-network (DQN) algorithm, which uses a deep neural network to obtain an approximation of the Q-function and trains it through the iterations of the Q-learning algorithm. This algorithm has been applied to many competitive games (see Li 2017). Although the DQN algorithm works well in practice, there has been no theoretical proof of convergence. This is mainly because of the nonconvexity of the deep neural network and the fact that the DQN algorithm uses the argmax function, which is nonsmooth. In addition, the effect of the target network and the replay buffer memory on the distribution of the training samples and the gradients is unknown and has not been studied yet. For more details, see Yang et al. (2019).
The beer game exhibits one characteristic that differentiates it from most settings in which the DQN algorithm is commonly applied, namely that there are multiple agents that cooperate in a decentralized manner to achieve a common goal. Such a problem is called a decentralized POMDP (Dec-POMDP). Because of the partial observability and nonstationarity of agents’ local observations, Dec-POMDPs are hard to solve and are categorized as nondeterministic exponential time complete (NEXP-complete) problems (problems that can be solved in 2^{n^{O(1)}} time; Bernstein et al. 2002).
The beer game exhibits all the complicating characteristics described previously—large state and action spaces, partial state observations, and decentralized cooperation. Our algorithm aims to overcome these difficulties by making several adaptations to the standard DQN algorithm.

2.3. Our Contribution
We propose a data-driven Q-learning algorithm for the beer game in which a deep neural network (DNN) approximates the Q-function. Indeed, the general structure of our algorithm is based on the DQN algorithm (Mnih et al. 2015), although we modify it substantially because DQN is designed for single-agent, competitive zero-sum games, and the beer game is a multiagent, decentralized, cooperative non-zero-sum game. In other words, DQN provides actions for one agent that interacts with an environment in a competitive setting, and the beer game is a cooperative game in the sense that all the players aim to minimize the total cost of the system in a random number of periods. Also, beer game agents play independently and do not have any information from other agents until the game ends and the total cost is revealed, whereas DQN usually assumes that the agent fully observes the state of the environment at any time step t of the game. For example, DQN has been successfully applied to Atari games, but in these games, the agent attempts to defeat an opponent and observes full information about the state of the system at each time step t.
Without shared information about rewards, applying DQN in a straightforward way to a given agent would result in a policy that optimizes that agent locally but will not result in an optimal solution for the supply chain as a whole. (For a simple example, see Online Appendix D.) Therefore, we propose a unified framework, SRDQN, in which the agents still play independently from one another, but in the training phase, we use reward shaping via a feedback scheme so that the SRDQN agent learns the total cost for the whole network and can, over time, learn to minimize it. Thus, the SRDQN agent in our model plays smartly in all periods of the game to get a near-optimal cumulative cost for any random horizon length.
In principle, our framework can be applied to multiple SRDQN agents playing the beer game simultaneously on a team. However, to date, we have designed and tested our approach only for a single SRDQN agent whose teammates are not SRDQN agents; for example, they are controlled by simple formulas or by human players. Enhancing the algorithm to train multiple SRDQN agents simultaneously and cooperatively is a topic of ongoing research.
Multiagent RL problems are significantly harder to solve than single-agent RL problems. This is because the policy of each agent changes throughout the training, and as a result, the environment becomes nonstationary from the viewpoint of each individual agent. This violates the underlying assumption of MDP on the existence of a stationary distribution.
Therefore, multiagent problems need specialized algorithms and tricks to handle the nonstationarity. For more details, see Zhang et al. (2019) and OroojlooyJadid and Hajinezhad (2019).
The proposed approach works very well when we tune and train the SRDQN for a given agent and a given set of game parameters (e.g., costs, lead times, action spaces). However, our sensitivity analysis shows that a given trained model is fairly robust to changes in the parameters; that is, the model can be trained for one set of parameters and still perform well when playing under a (somewhat) different set of parameters. Of course, if one needs the best possible performance, in principle, one needs to tune and train a new network, but this approach is time consuming. To avoid this, we propose using a transfer-learning approach (Pan and Yang 2010) in which we transfer the acquired knowledge of one agent under one set of game parameters to another agent with another set of game parameters. In this way, we decrease the required time to train a new agent by roughly one order of magnitude.
To summarize, SRDQN is a variant of the DQN algorithm for choosing actions in the beer game. In order to attain near-optimal cooperative solutions, we develop a feedback scheme as a communication framework. The trained model is fairly robust to changes in the parameters, and finally, to simplify training agents with new settings, we use transfer learning to efficiently make use of the learned knowledge of trained agents.
In addition to playing the beer game well, we believe that our algorithm serves as a proof-of-concept that RL and other ML approaches can be used for real-time decision making in complex supply chain settings. We have integrated our algorithm into a new online beer game, developed by Opex Analytics (http://beergame.opexanalytics.com/) (Figure 3). The Opex beer game allows human players to compete with or play on a team with our SRDQN agent. As of this writing, the game has been played more than 17,000 times by more than 4,300 unique players from academia and industry. More than 1,400 of these game plays have included the SRDQN agent described in this paper.

Figure 3. (Color online) Screenshot of Opex Analytics Online Beer Game Integrated with Our SRDQN Agent

Finally, we note that our open-source SRDQN algorithm and simulator code are available at https://github.com/OptMLGroup/DeepBeerInventory-RL.

3. The SRDQN Algorithm

In this section, we first present the details of our SRDQN algorithm to solve the beer game and then describe the transfer-learning mechanism.

3.1. SRDQN: DQN with Reward Shaping for the Beer Game
In our algorithm, an RL agent runs a Q-learning algorithm with a DNN as the Q-function approximator to learn a semioptimal policy with the aim of minimizing the total cost of the game. Each agent has access only to its local information and considers the other agents as parts of its environment. That is, the RL agent does not know any information about the other agents, including both static parameters such as costs and lead times and dynamic state variables such as inventory levels. We propose a feedback scheme to teach the RL agent to work toward minimizing the system-wide cost rather than its own local cost. The details of the scheme, Q-learning, state and action spaces, reward function, DNN approximator, and algorithm are discussed below.

3.1.1. State Variables. Consider agent i in time step t. Let OO_t^i denote the on-order items at agent i, that is, the items that have been ordered from agent i + 1 but not received yet; let AO_t^i denote the size of the arriving order (i.e., the demand) received from agent i − 1; let AS_t^i denote the size of the arriving shipment from agent i + 1; let a_t^i denote the action agent i takes; and let IL_t^i denote the inventory level as defined in Section 1. We interpret AO_t^1 to represent the end-customer demand and AS_t^4 to represent the shipment received by agent 4 from the external supplier. In each period t of the game, agent i observes IL_t^i, OO_t^i, AO_t^i, and AS_t^i. In other words, in period t, agent i has historical observations o_t^i = [((IL_1^i)^+, (IL_1^i)^-, OO_1^i, AO_1^i, AS_1^i), . . . , ((IL_t^i)^+, (IL_t^i)^-, OO_t^i, AO_t^i, AS_t^i)]. In addition, any beer game will finish in a finite time horizon, so the problem can be modeled as a POMDP in which each historic sequence o_t^i is a distinct state, and the size of the vector o_t^i grows over time, which is difficult for any RL or DNN algorithm to handle. To address this issue, we capture only the last m periods (e.g., m = 3) and use them as the state variable; thus, the state variable of agent i in time t is s_t^i = [((IL_j^i)^+, (IL_j^i)^-, OO_j^i, AO_j^i, AS_j^i)]_{j=t−m+1}^{t}. See Online Appendix C for more details about choosing the right value for m.

3.1.2. Action Space. In each period of the game, each agent can order any amount in [0, ∞). Because Q-learning needs a finite-size action space, we limit the cardinality of the action space by using the d + x rule for selecting the order quantity: the agent determines how much more or less to order than its received order; that is, the order quantity is d + x, where x ∈ [a_l, a_u] (a_l, a_u ∈ Z).
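As an illustration of the state encoding in Section 3.1.1 and the d + x action mapping in Section 3.1.2, the following sketch shows one way an agent's local observations could be turned into a fixed-size state vector and an order quantity. The helper names are our own and hypothetical; this is not the released implementation.

```python
import numpy as np

def build_state(history, m=3):
    """Stack the last m local observations into one state vector.

    `history` is a list of per-period tuples (IL, OO, AO, AS); the inventory
    level is split into its positive and negative parts, as in Section 3.1.1.
    If fewer than m periods have been observed, this sketch pads with zeros.
    """
    padded = [(0, 0, 0, 0)] * max(0, m - len(history)) + history[-m:]
    rows = []
    for il, oo, ao, as_ in padded:
        rows.append([max(0, il), max(0, -il), oo, ao, as_])
    return np.asarray(rows, dtype=np.float32).flatten()   # length 5 * m

def order_quantity(action_index, received_order, a_low=-2, a_high=2):
    """Map a discrete action index 0..(a_high - a_low) to an order via d + x."""
    x = a_low + action_index            # x in [a_low, a_high]
    return max(0, received_order + x)   # clamp at zero: physical orders are nonnegative (our assumption)
```

The flattened state vector has length 5m, which is consistent with the input layer of size 5m reported for the fully connected network in Section 4, and the discrete action index has a_u − a_l + 1 values, matching the network's output dimension.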
3.1.3. DNN. In our algorithm, a DNN plays the role of the Q-function approximator, providing the Q-value as the output for any pair of state s and action a. Given the state and action spaces, our DNN outputs Q(s, a) for every possible action a ∈ A. (In the beer game, A(s) = [a_l, a_u] is the same for all s; therefore, we simply use A to denote the action space.) The DNN is trained using random minibatches taken from the experience replay iteratively until all episodes are complete. (For more details about experience replay, see Online Appendix E.)

3.1.4. Reward Function. In iteration t of the game, agent i observes state variable s_t^i and takes action a_t^i; one needs to know the corresponding reward value r_t^i to measure the quality of action a_t^i. The state variable s_{t+1}^i allows us to calculate IL_{t+1}^i and thus the corresponding shortage or holding costs, and we consider the summation of these costs for r_t^i. However, because there are information and transportation lead times, there is a delay between taking action a_t^i and observing its effect on the reward. Moreover, the reward r_t^i reflects not only the action taken in period t but also the actions taken in previous periods, and it is not possible to decompose r_t^i to isolate the effects of each of these actions. However, defining the state variable to include information from the last m periods resolves this issue to some degree; the reward r_t^i represents the reward of state s_t^i, which includes the observations of the previous m steps.
By contrast, the reward values r_t^i are the intermediate rewards of each agent, and the objective of the beer game is to minimize the total reward of the game, ∑_{i=1}^{4} ∑_{t=1}^{T} r_t^i, which the agents only learn after finishing the game. In order to add this information into the agents’ experience, we use reward shaping through a feedback scheme.

3.1.5. Feedback Scheme. When an episode of the beer game is finished, all agents are made aware of the total reward. In order to share this information among the agents, we propose a penalization procedure in the training phase to provide feedback to the RL agent about the way that it has played. Let ω = ∑_{i=1}^{4} ∑_{t=1}^{T} r_t^i / T and τ_i = ∑_{t=1}^{T} r_t^i / T be the average reward per period and the average reward of agent i per period, respectively. After the end of each episode of the game (i.e., after period T), for each RL agent i we update its observed reward in all T time steps in the experience replay memory using r_t^i = r_t^i + (β_i / 3)(ω − τ_i), ∀t ∈ {1, . . . , T}, where β_i is a regularization coefficient for agent i. With this procedure, agent i gets appropriate feedback about its actions and learns to take actions that result in minimum total cost, not locally optimal solutions.
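A minimal sketch of the reward-shaping step in Section 3.1.5 (our own illustration, with hypothetical array names): after an episode ends, each agent's stored rewards are shifted toward the network-wide per-period average before the experience replay is reused for training.

```python
import numpy as np

def apply_feedback(rewards, beta):
    """Reward shaping after one episode (Section 3.1.5).

    rewards: array of shape (4, T); rewards[i, t] is r_t^i for agent i.
    beta:    the four regularization coefficients beta_i.
    Returns the shaped rewards that would overwrite the replay memory entries.
    """
    rewards = np.asarray(rewards, dtype=float)
    T = rewards.shape[1]
    omega = rewards.sum() / T        # average reward per period over all agents
    tau = rewards.sum(axis=1) / T    # per-agent average reward per period
    shaped = rewards.copy()
    for i in range(4):
        # r_t^i <- r_t^i + (beta_i / 3) * (omega - tau_i), for all t
        shaped[i, :] += (beta[i] / 3.0) * (omega - tau[i])
    return shaped
```

An agent whose own average cost τ_i is far below the network average ω is thus penalized toward the system-wide cost, which is how the otherwise local learner is pushed to minimize the total cost in Equation (1).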
3.1.6. The Algorithm. Our algorithm to get the policy π to solve the beer game is provided in Algorithm 1.

Algorithm 1 (SRDQN for Beer Game)
1: procedure SRDQN
2:   Initialize experience replay memory E_i = [ ], ∀i.
3:   for Episode = 1 : n, do
4:     Reset IL, OO, d, AO, and AS for each agent.
5:     for t = 1 : T, do
6:       for i = 1 : 4, do
7:         With probability ε take random action a_t.
8:         Otherwise, set a_t = arg min_a Q(s_t, a; θ).
9:         Execute action a_t, observe reward r_t and state s_{t+1}.
10:        Add (s_t^i, a_t^i, r_t^i, s_{t+1}^i) into E_i.
11:        Get a minibatch of experiences (s_j, a_j, r_j, s_{j+1}) from E_i.
12:        Set y_j = r_j if s_{j+1} is the terminal state, and y_j = r_j + min_a Q(s_{j+1}, a; θ^-) otherwise.
13:        Run forward and backward step on the DNN with loss function (y_j − Q(s_j, a_j; θ))^2.
14:        Every C iterations, set θ^- = θ.
15:      end for
16:    end for
17:    Run feedback scheme, update experience replay of each agent.
18:  end for
19: end procedure
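Lines 11–14 of Algorithm 1 correspond to a standard DQN-style minibatch update, except that Q-values represent costs to be minimized. The following is an illustrative PyTorch sketch of that inner update, not the authors' implementation; q_net, target_net, and the batch layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def srdqn_update(q_net, target_net, optimizer, batch, gamma=1.0):
    """One minibatch update in the spirit of lines 11-14 of Algorithm 1.

    `batch` holds tensors: states (B, 5m), actions (B,), rewards (B,),
    next_states (B, 5m), and float done flags (B,). Because rewards are costs,
    the bootstrap target takes a min over next-state Q-values (line 12), and
    the greedy action elsewhere is an argmin (line 8).
    """
    states, actions, rewards, next_states, done = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_j, a_j; theta)
    with torch.no_grad():
        next_min = target_net(next_states).min(dim=1).values          # min_a Q(s_{j+1}, a; theta^-)
        y = rewards + gamma * next_min * (1.0 - done)                  # y_j of line 12 (gamma=1: undiscounted)
    loss = nn.functional.mse_loss(q_sa, y)                             # (y_j - Q(s_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Line 14: every C updates, copy the online weights into the target network, e.g.,
# target_net.load_state_dict(q_net.state_dict())
```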

The algorithm, which is based on DQN (Mnih et al. 2015), finds weights θ of the DNN to minimize the Euclidean distance between Q(s, a; θ) and y_j, where y_j is the prediction of the Q-value that is obtained from target network Q^- with weights θ^-. Every C iterations, the weights θ^- are updated by θ. Moreover, the actions in each training step of the algorithm are obtained by an ε-greedy algorithm, which is explained in Online Appendix A.
In the algorithm, in period t, agent i takes action a_t^i, satisfies the arriving demand/order AO_{t−1}^i, observes the new demand AO_t^i, and then receives the shipments AS_t^i. This sequence of events, which is explained in detail in Online Appendix J, results in the new state s_{t+1}. Feeding s_{t+1} into the DNN with weights θ^- provides the corresponding Q-value for state s_{t+1} and all possible actions. The action with the smallest Q-value is our choice. Finally, at the end of each episode, the feedback scheme runs and distributes the total cost among all agents.

3.1.7. Evaluation Procedure. In order to validate our algorithm, we compare the results of SRDQN to those obtained using the optimal base-stock levels (when possible) by Clark and Scarf (1960),¹ as well as models of human beer game behavior by Sterman (1989). (None of these methods attempts to do exactly the same thing as our method. The method by Clark and Scarf (1960) optimizes the base-stock levels assuming that all players follow a base-stock policy—which beer game players do not tend to do—and the formula by Sterman (1989) models human beer game play, but they do not attempt to optimize.) The details of the training procedure and benchmarks are described in Section 4.

3.2. Transfer Learning
Transfer learning (Pan and Yang 2010) has been an active and successful field of research in ML and especially in image processing. In transfer learning, there is a source data set S and a trained neural network to perform a given task, for example, classification, regression, or decisioning through RL. Training such networks may take a few days or even weeks. Therefore, for similar or even slightly different target data sets T, one can avoid training a new network from scratch and instead use the same trained network with a few customizations. The idea is that most of the learned knowledge on data set S can be used in the target data set with a small amount of additional training. This idea works well in image processing (Rajpurkar et al. 2017) and considerably reduces the training time.
In order to use transfer learning in the beer game, assume that there exists a source agent i ∈ {1, 2, 3, 4} with trained network S_i (with a fixed size on all agents), parameters P_i^1 = {|A_1^j|, c_{p_1}^j, c_{h_1}^j}, observed demand distribution D_1, and coplayer (i.e., teammate) policy π_1. The weight matrix W_i contains the learned weights such that W_i^q denotes the weight between layers q and q + 1 of the neural network, where q ∈ {0, . . . , nh}, and nh is the number of hidden layers. The aim is to train a neural network S_j for target agent j ∈ {1, 2, 3, 4}, j ≠ i. We set the structure of the network S_j the same as that of S_i and initialize W_j with W_i, making the first k layers not trainable. Then we train neural network S_j with a small learning rate. As we get closer to the final layer, which provides the Q-values, the weights become less similar to agent i’s and more specific to each agent. Thus, the acquired knowledge in the first k hidden layer(s) of the neural network belonging to agent i is transferred to agent j, in which k is a tunable parameter. Following this procedure, we test the use of transfer learning in six cases to transfer the learned knowledge of source agent i to target agent j with the following:
1. For j ≠ i in the same game.
2. For {|A_1^j|, c_{p_2}^j, c_{h_2}^j}, that is, the same action space but different cost coefficients.
3. For {|A_2^j|, c_{p_1}^j, c_{h_1}^j}, that is, the same cost coefficients but different action space.
4. For {|A_2^j|, c_{p_2}^j, c_{h_2}^j}, that is, different action space and cost coefficients.
5. For {|A_2^j|, c_{p_2}^j, c_{h_2}^j}, that is, different action space and cost coefficients, as well as a different demand distribution D_2.
6. For {|A_2^j|, c_{p_2}^j, c_{h_2}^j}, that is, different action space and cost coefficients, as well as a different demand distribution D_2 and coplayer policy π_2.
Unless stated otherwise, the demand distribution and coplayer policy are the same for the source and target agents. Transfer learning could also be used when other aspects of the problem change, for example, lead times and state representation. This avoids having to tune the parameters of the neural network for each new problem, which considerably reduces the training time. However, we still need to decide how many layers should be trainable and to determine which agent can be a base agent for transferring the learned knowledge. Still, this is computationally much cheaper than finding each network and its hyperparameters from scratch.
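The transfer-learning recipe of Section 3.2 can be sketched as follows: copy the source agent's weights, freeze the first k layers, and fine-tune the remaining layers with a small learning rate. This is illustrative PyTorch-style code with hypothetical names, not the paper's implementation.

```python
import copy
import torch

def transfer(source_net, k, lr=1e-5):
    """Build a target-agent network from a trained source-agent network.

    Assumes both agents use the same architecture and that `source_net` is a
    simple sequential model whose children are its layers; the first k layers
    are frozen and only the remaining layers are fine-tuned.
    """
    target_net = copy.deepcopy(source_net)          # initialize W_j with W_i
    layers = list(target_net.children())
    for layer in layers[:k]:
        for p in layer.parameters():
            p.requires_grad = False                 # first k layers not trainable
    trainable = [p for p in target_net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)  # small learning rate for fine-tuning
    return target_net, optimizer
```

The number of frozen layers k plays the role of the tunable parameter described above: larger k transfers more of the source agent's representation but leaves less capacity to adapt to the target agent's costs, action space, demand distribution, or coplayer policy.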
4. Numerical Experiments

In Section 4.1, we discuss a set of numerical experiments that uses a simple demand distribution and a relatively small action space:
• d_t^0 ∈ U[0, 2], A = {−2, −1, 0, 1, 2}.
After exploring the behavior of SRDQN under different coplayer (teammate) policies, in Section 4.2, we test the algorithm using three well-known cases from the literature, which have larger possible demand values and action spaces:
• d_t^0 ∈ U[0, 8], A = {−8, . . . , 8} (Croson and Donohue 2006).
• d_t^0 ∈ N(10, 2^2), A = {−5, . . . , 5} (adapted from Chen and Samroengraja 2000).
• d_t^0 ∈ C(4, 8), A = {−8, . . . , 8} (Sterman 1989).
Section 4.3.1 presents the results on two real-world data sets, and Section 4.5 provides sensitivity analysis on the trained models. After analyzing these cases, in Section 4.6, we provide the results obtained using transfer learning for each of the six proposed cases. As noted previously, we only consider cases in which a single SRDQN agent plays with non-SRDQN agents, for example, simulated human coplayers. In each of the cases listed previously, we consider three types of policies that the non-SRDQN coplayers follow: (1) base-stock policy, (2) Sterman formula, and (3) random policy. In the random policy, agent i also follows a d + x rule, in which a_t^i ∈ A is selected randomly and with equal probability, for each t.
We test values of m in {5, 10} and C ∈ {5,000, 10,000}. Considering the state and action spaces, our DNN is a fully connected network with shape [5m, 180, 130, 61, a_u − a_l + 1]. We trained each agent on at most 60,000 episodes and used a replay memory E equal to the one million most recently observed experiences. The ε-greedy algorithm starts with ε = 0.9 and linearly reduces it to 0.1 in the first 80% of iterations.
In the feedback mechanism, the appropriate value of the feedback coefficient β_i heavily depends on τ_j, the average reward for agent j, for each j ≠ i. For example, when τ_i is one order of magnitude larger than τ_j, for all j ≠ i, agent i needs a large coefficient to get more feedback from the other agents. Indeed, the feedback coefficient has a similar role as the regularization parameter λ has in the lasso loss function; the value of that parameter depends on the ℓ1-norm of the variables, but there is no universal rule to determine the best value for λ. Similarly, proposing a simple rule or value for each β_i is not possible because it depends on τ_i, ∀i. For example, we found that a very large β_i does not work well because the agent tries to decrease other agents’ costs rather than its own. Similarly, with a very small β_i, the agent learns how to minimize its own cost instead of the total cost. Therefore, we used a cross-validation approach to find good values for each β_i.
There are three types of coplayers: base stock, Sterman, and random. When the single player uses the SRDQN algorithm, we call the three cases BS-SRDQN, Strm-SRDQN, and Rand-SRDQN, respectively. We compare the results of these cases to the results when the agent instead follows a base-stock policy; we call those cases BS-BS, Strm-BS, and Rand-BS. In the -BS cases, the single agent uses the optimal (or near-optimal) base-stock level for that case. In particular, in BS-BS, the agent’s optimal base-stock level is provided by the algorithm by Clark and Scarf (1960), whereas in the Strm-BS and Rand-BS cases, we numerically optimize the base-stock level of the single base-stock player because the literature does not provide guidance on how to set the base-stock level in this case. To do so, we evaluated each integer base-stock level in the range [μ − 10σ, μ + 10σ] (for normal demands) or [−25d_l, 25d_u] (for uniform demands), where μ and σ are the mean and standard deviation of the normal demand distribution, respectively, and d_l and d_u are the lower and upper bounds of the uniform demand distribution, respectively. For each base-stock level, we simulated 50 episodes of the game and calculated the total system-wide cost for each. We then selected the base-stock level that obtains the minimum average cost over the 50 episodes.
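That base-stock search amounts to a simulation-based grid search, which can be sketched as follows. This is our own illustration; simulate_game is a hypothetical stand-in for the beer-game simulator with the coplayers' policies fixed.

```python
import numpy as np

def optimize_base_stock(simulate_game, candidate_levels, n_episodes=50):
    """Grid search over integer base-stock levels for the single base-stock player.

    `simulate_game(level)` is assumed to run one episode with the given
    base-stock level for the agent under study and return the total
    system-wide cost. Returns the level with the lowest average cost over
    `n_episodes` simulated episodes, together with that average cost.
    """
    best_level, best_cost = None, float("inf")
    for level in candidate_levels:
        costs = [simulate_game(level) for _ in range(n_episodes)]
        avg = float(np.mean(costs))
        if avg < best_cost:
            best_level, best_cost = level, avg
    return best_level, best_cost

# Example candidate range for normal demand N(mu, sigma^2):
# candidate_levels = range(int(mu - 10 * sigma), int(mu + 10 * sigma) + 1)
```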
We compare SRDQN with a base-stock policy in order to demonstrate two main things:
• When coplayers use base-stock policies, the SRDQN agent can attain a performance that is close to that of the optimal base-stock policy, even though that policy is found using algorithms (Clark and Scarf 1960) that assume that all information about all players (their policies, cost coefficients, etc.) is available, whereas the SRDQN agent only knows its own local information and learns by experimentation. (We observe a similar effect for random coplayers. Although the optimal policy in this setting is, to the best of our knowledge, unknown, it is plausible that a base-stock policy is optimal.)
• With Sterman coplayers, the SRDQN agent significantly outperforms a base-stock policy, even with an optimized base-stock level. The optimal policy with Sterman coplayers has not, to the best of our knowledge, been investigated in the literature, and our results suggest that a base-stock policy is not optimal in this case.

4.1. Basic Cases
In this section, we test our approach using a beer game setup with the following characteristics. Information and shipment lead times l_j^tr and l_j^in equal two periods at every agent. Holding and stockout costs are given by c_h = [2, 2, 2, 2] and c_p = [2, 0, 0, 0], respectively, where the vectors specify the values for agents 1, . . . , 4. The demand is an integer uniformly drawn from {0, 1, 2}. Additionally, we assume that agent i observes the arriving shipment AS_t^i when it chooses its action for period t. We relax this assumption later. We use a_l = −2 and a_u = 2 so that there are five outputs in the neural network. (Later, we expand these to larger action spaces.)
For the cost coefficients and other settings specified for this beer game, it is optimal for all players to follow a base-stock policy, and we use this policy as a benchmark and a lower bound on the SRDQN cost. The vector of optimal base-stock levels is [8, 8, 0, 0], and the resulting average cost per period is 2.0705, although these levels may be slightly suboptimal because of rounding. This cost is allocated to stages 1–4 as [2.0073, 0.0632, 0.03, 0.00]. We present the results for systems with coplayers that use base-stock and Sterman policies in Sections 4.1.1 and 4.1.2, respectively.

4.1.1. SRDQN Plus Base-Stock Policy (BS-SRDQN). We consider four cases, with the SRDQN playing the role of each of the four players and the coplayers using a base-stock policy. The results of all four cases are shown in Figure 4.

Figure 4. (Color online) Total Cost (Top) and Normalized Cost (Bottom) for BS-SRDQN and BS-BS Cases
Note. Caption indicates the role played by SRDQN.

Each plot shows the training curve, that is, the evolution of the average cost per game as the training progresses. In particular, the horizontal axis indicates the number of training episodes, whereas the vertical axis indicates the total cost per game. The caption indicates the role played by SRDQN. After every 100 episodes of the game and the corresponding training, the costs of 50 validation points (i.e., 50 new games), each with 100 periods, are obtained, and their average and a 95% confidence interval are plotted. (The confidence intervals, which are depicted in light color in the figure, are quite narrow, so they are difficult to see.) The solid horizontal line indicates the cost of the case in which all players follow a base-stock policy, which, recall, is optimal in this case. In each of the subfigures, there are two plots; the upper plot shows the cost, and the lower plot shows the normalized cost, in which each cost is divided by the corresponding BS-BS cost; essentially this is a zoomed-in version of the upper plot. We trained the network using values of β_i ∈ {5, 10, 20, 50, 100, 200}, each for at most 60,000 episodes. Figure 4 plots the results from the best β_i value for each agent; we present the full results using different β_i, m, and C values in Online Appendix H.
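The evaluation protocol just described—50 validation games of 100 periods each, reported as a mean with a 95% confidence interval—can be sketched as follows (illustrative code; play_game is a hypothetical simulator call returning one game's total cost, and the interval uses a normal approximation).

```python
import numpy as np

def validate(play_game, n_games=50, horizon=100, z=1.96):
    """Average cost and an approximate 95% confidence interval over validation games."""
    costs = np.array([play_game(horizon) for _ in range(n_games)], dtype=float)
    mean = costs.mean()
    half_width = z * costs.std(ddof=1) / np.sqrt(n_games)
    return mean, (mean - half_width, mean + half_width)
```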


Figure 4 indicates that SRDQN performs well in all cases and finds policies whose costs are close to those of BS-BS. After convergence (i.e., after 60,000 training episodes), the average gap between the cost of BS-SRDQN and BS-BS, over all four agents, is 2.31%.
Figure 5 shows the trajectories of the retailer’s inventory level IL, on-order quantity OO, order quantity a, reward r, and order-up-to level (OUTL) for a single game when the retailer is played by the SRDQN (BS-SRDQN), as well as when it is played by a base-stock policy (BS-BS) and when all agents follow the Sterman formula (Strm).

Figure 5. (Color online) The IL_t, OO_t, a_t, r_t, and OUTL Values When SRDQN Plays Retailer with β_1 = 50 (BS-SRDQN), When All Players Play Sterman (Strm), and When All Players Play Base-Stock (BS-BS)

The base-stock policy and SRDQN have similar IL and OO trends, and as a result, their rewards are also very close: BS-BS has a cost of [1.42, 0.00, 0.02, 0.05] (total 1.49), and BS-SRDQN has a cost of [1.43, 0.01, 0.02, 0.08] (total 1.54, or 3.4% larger). (BS-BS has a slightly different cost here than reported earlier because those costs are the average costs of 50 samples, whereas this cost is from a single sample.) Similar trends are observed when SRDQN plays the other three roles (see Online Appendix G). This suggests that SRDQN can successfully learn to achieve costs close to BS-BS when the other agents also follow a base-stock policy. (The OUTL plot shows that SRDQN does not quite follow a base-stock policy, although its costs are similar.) Also, see https://youtu.be/ZfrCrmGsDJE for a video animation of the policy that SRDQN learns in this case.

4.1.2. SRDQN Plus Sterman Formula. Figure 6 shows the results of the case in which the three non-SRDQN agents use the formula proposed by Sterman (1989). (See Online Appendix B for the formula and its parameters.) The single agent uses either SRDQN (Strm-SRDQN) or optimized base-stock levels (Strm-BS).

Figure 6. (Color online) Total Cost (Top) and Normalized Cost (Bottom) for Strm-SRDQN and Strm-BS Cases
Note. Caption indicates the role played by SRDQN.

From the figure, it is evident that SRDQN plays much better than the optimized base-stock policy. This is because if the other three agents do not follow a base-stock policy, it is no longer optimal for the fourth agent to follow a base-stock policy. In general, the optimal inventory policy when other agents do not follow a base-stock policy is an open question. This figure suggests that SRDQN is able to learn to play effectively in this setting.
Table 1 gives the cost of all four agents when a given agent plays using either SRDQN (Strm-SRDQN) or a base-stock policy (Strm-BS) and the other agents play using the Sterman formula. From this table, we can see that SRDQN learns how to play to decrease the costs of the other agents and not just its own costs—for example, the warehouse’s and manufacturer’s costs are significantly lower when the distributor uses SRDQN than they are when the distributor uses a base-stock policy. Similar conclusions can be drawn from Figure 6. This shows the power of SRDQN when it plays with coplayer agents that do not play rationally, that is, do not follow a base-stock policy, which is common in real-world supply chains. Also, we note that when all agents follow the Sterman formula, the average cost of the agents is [2.10, 4.85, 11.22, 13.41], for a total of 31.58, which is much higher than when any one agent uses SRDQN. Finally, for details on IL, OO, a, r, and OUTL in this case, see Online Appendix G.



Table 1. Average Cost Under Different Choices of Which Agent Uses SRDQN or Base Stock, with Sterman Coplayers

Cost (Strm-SRDQN, Strm-BS)

SRDQN agent Retailer Warehouse Distributor Manufacturer Total

Retailer (1.75, 3.57) (0.89, 1.04) (1.75, 2.35) (3.02, 3.60) (7.41, 10.56)
Warehouse (1.67, 2.19) (0.26, 0.48) (1.02, 2.62) (1.73, 4.26) (4.68, 9.56)
Distributor (3.24, 1.80) (0.74, 3.66) (0.00, 2.01) (2.03, 4.78) (6.01, 12.25)
Manufacturer (1.89, 1.75) (3.89, 3.53) (9.45, 8.45) (2.03, 4.67) (17.26, 18.40)

Also, we note that when all agents follow the Sterman formula, the average cost of the agents is [2.10, 4.85, 11.22, 13.41], for a total of 31.58, which is much higher than when any one agent uses SRDQN. Finally, for details on IL, OO, a, r, and OUTL in this case, see Online Appendix G.

4.2. Literature Benchmarks
We next test SRDQN on beer game settings from the literature. These have larger demand-distribution domains and larger plausible action spaces and therefore represent harder instances for SRDQN. In all instances in this section, lin = [2, 2, 2, 2] and ltr = [2, 2, 2, 1]. Cost coefficients and the base-stock levels for each instance are presented in Table 2.
The Clark–Scarf algorithm assumes that stage 1 is the only stage with nonzero stockout costs, whereas the U[0, 8] instance has nonzero costs at every stage. Therefore, we used a heuristic approach based on a two-moment approximation, similar to that proposed by Graves (1985), to choose the base-stock levels (Snyder 2018). In addition, the C(4, 8) demand process is nonstationary—four, then eight—but we allow only stationary base-stock levels. Therefore, we chose to set the base-stock levels equal to the values that would be optimal if the demand were eight in every period. Finally, in the experiments in this section, we assume that agent i observes AS_t^i after choosing a_t^i, whereas in Section 4.1, we assumed the opposite. Therefore, the agents in these experiments have one fewer piece of information when choosing actions and are therefore more difficult to train.
Table 3 shows the results of the cases in which the SRDQN agent plays with coplayers who follow base-stock, Sterman, and random policies (as listed in the "Coplayer type" column). In each group of columns, the first column ("-SRDQN") gives the average cost (over 50 instances) when one agent (indicated by the "Agent" column) is played by SRDQN and coplayers are played by the player type indicated in the "Coplayer type" column. The second column in each group ("-BS") gives the corresponding cost when the SRDQN agent is replaced by a base-stock agent (using the optimized base-stock levels) and the coplayers remain as the row set determines. The third column ("Gap") gives the percentage difference between these two costs. We use R, W, D, and M to represent retailer, wholesaler, distributor, and manufacturer, respectively. (For example, when the coplayers are Sterman agents and the SRDQN agent plays the role of the distributor, the average total cost is 8.35, whereas if the distributor is played by a base-stock agent, the average total cost is 10.10.) In addition, Figure 7 presents 90% confidence intervals for all cases, except the classic distribution with BS coplayers, which has zero variance.
As the first row set in Table 3 shows, when the SRDQN agent plays with base-stock coplayers under uniform or normal demand distributions, it obtains costs that are reasonably close to the case when all players use a base-stock policy, with average gaps of 8.15% and 5.80%, respectively. These gaps are not quite as small as those in Section 4.1 because of the larger action spaces in the instances in this section. Nevertheless, these gaps do not necessarily mean that SRDQN performs significantly worse than BS. Although the gap is positive for all cases with BS coplayers, Figure 7, (a) and (b), shows that in five of eight cases, the average costs of SRDQN and BS are statistically equal. In addition, because a base-stock policy is optimal at every stage, the small gaps demonstrate that the SRDQN agent can learn to play the game well for these demand distributions. For the classic demand process, the percentage gaps are larger. To see why, note that if the demand were equal to eight in every period, the optimal base-stock levels for the classic demand process would result in IL_t^i = 0 for all i and t.

Table 2. Cost Parameters and Base-Stock Levels for Instances with Uniform, Normal, and Classic Demand Distributions

Demand       cp            ch                         Base-stock level

U[0, 8]      [1,1,1,1]     [0.50,0.50,0.50,0.50]      [19,20,20,14]
N(10, 2²)    [10,0,0,0]    [1.00,0.75,0.50,0.25]      [48,43,41,30]
C(4, 8)      [1,1,1,1]     [0.50,0.50,0.50,0.50]      [32,32,32,24]

Table 3. Results of Literature Data Sets

                        Uniform U[0, 8]             Normal N(10, 2²)            Classic C(4, 8)

Coplayer type   Agent   -SRDQN   -BS   Gap (%)      -SRDQN   -BS   Gap (%)      -SRDQN   -BS   Gap (%)

BS- R 4.11 4.00 2.76 4.41 4.19 5.19 0.50 0.34 45.86
W 4.69 4.00 17.15 4.66 4.19 11.28 0.47 0.34 36.92
D 4.42 4.00 3.67 4.40 4.19 5.04 0.67 0.34 96.36
M 4.08 4.00 2.02 4.26 4.19 1.69 0.30 0.34 −13.13
Average 8.15 5.80 41.50
Strm- R 6.88 8.74 −21.29 9.98 10.54 −5.27 3.80 5.31 −28.54
W 5.90 8.58 −31.22 7.11 8.72 −18.40 2.85 3.84 −25.76
D 8.35 10.10 −17.32 8.49 11.52 −26.35 3.82 17.70 −78.41
M 12.36 12.75 −3.10 13.86 15.05 −7.87 15.80 17.02 −7.13
Average −18.23 −14.47 −34.96
Rand- R 31.39 27.69 13.34 13.03 15.93 −18.19 19.99 20.30 −1.55
W 29.62 27.59 7.37 27.87 22.33 24.80 23.05 20.03 15.08
D 30.72 27.97 9.82 34.85 26.03 33.90 22.81 21.06 8.31
M 29.03 27.09 7.16 37.68 29.36 28.36 22.36 20.17 10.85
Average 9.42 17.22 8.17
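For clarity on how the "Gap (%)" columns in Table 3 (and the analogous columns in later tables) are read: they report the percentage difference of the -SRDQN cost relative to the corresponding -BS cost, so negative values favor SRDQN. A minimal check of this convention follows; small discrepancies with the tables come from rounding the displayed costs.

    def gap_pct(cost_srdqn, cost_bs):
        # percentage difference of the SRDQN cost relative to the base-stock cost
        return 100.0 * (cost_srdqn - cost_bs) / cost_bs

    # Strm- distributor under uniform demand in Table 3: 8.35 vs. 10.10
    print(round(gap_pct(8.35, 10.10), 2))   # about -17.33; the table reports -17.32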

The four initial periods of demand equal to four disrupt this effect slightly, but the cost of the optimal base-stock policy for the classic demand process is asymptotically zero as the time horizon goes to infinity. The absolute gap attained by the SRDQN agent is quite small—an average of 0.49 versus 0.34 for the base-stock cost—but the percentage difference is large simply because the optimal cost is close to zero. Indeed, if we allow the game to run longer, the cost of both algorithms decreases, and so does the absolute gap. For example, when the SRDQN agent plays the retailer, after 500 periods, the discounted costs are 0.0090 and 0.0062 for BS-SRDQN and BS-BS, respectively, and after 1,000 periods, the costs are 0.0001 and 0.0000 (to four-digit precision).
Similar to the results of Section 4.1.2, when the SRDQN agent plays with coplayers that follow the Sterman formula, it performs far better than Strm-BS. As Table 3 shows, SRDQN performs 23% better than Strm-BS on average. In addition, Figure 7, (c)–(e), shows that in 11 of 12 cases, the cost of Strm-SRDQN is statistically smaller than that of Strm-BS.
By contrast, when the SRDQN agent plays with coplayers that use the random policy, for all demand distributions, Rand-SRDQN learns to play so as to minimize the total cost of the system, although Rand-BS, on average, obtains 12% better solutions.

Figure 7. (Color online) Confidence Intervals for Normal and Uniform Distributions

Nonetheless, as Figure 7, (f)–(h), shows, Rand-BS obtains a statistically smaller cost in only one case, whereas in two cases SRDQN obtains a statistically smaller cost, and in the remaining nine cases, the means of the costs are statistically equal. Our conjecture to explain why the results here are not as strong as those for Strm-BS is that when the coplayers take random actions, the demand process faced by the single agent is approximately an i.i.d. random process, for which a base-stock policy would be approximately optimal, similar to the BS-BS case. (In contrast, Sterman players place orders that are highly correlated over time rather than i.i.d. random.)

4.3. Real-World Data Sets
In order to test the performance of SRDQN on more challenging cases, we used two real-world data sets. The main question we wish to address in this section is whether our method still performs well even if the historical data are scant. The basic idea is to take the historical data, fit a demand distribution to those data, and then train SRDQN on many samples generated randomly from that distribution; in other words, we use a small data set to generate a much larger one, which is used for training. A secondary question in this section is whether the method performs well even when the demand distribution has larger mean and variance, in which case the d + x rule is less effective.
In each data set, we randomly selected three categories/items for our experiment. Using the empirical demand distribution, we generated 60,000 episodes and used the same training procedure as in previous sections. In particular, we assumed the same shortage and holding costs as in the setting with N(10, 2) demand in Section 4.2, obtained the optimal base-stock levels, and trained models for the three coplayer types. We note that in these experiments, we report averages over 200 episodes (i.e., 20,000 periods) instead of 50 as in previous sections because the demands are noisier in this experiment.
In both data sets, the demands are badly scaled, meaning that the demand values can be quite large (e.g., up to 140). This is a difference from the cases in Section 4.2, and it makes the training much more difficult. Therefore, we scaled the demands by dividing each demand observation by the maximum demand across all items, times 10, so that all demands are within [0, 10]. The following results use the resulting scaled data sets. The results are promising, and from them, we can conclude that SRDQN performs well for properly scaled data sets. For data sets with larger demand values, our preliminary results showed that SRDQN attains larger costs than the optimized base-stock policy—usually within 10% and always within 20%, on average. It is possible that one could train a network on a scaled data set and then unscale the results to apply to the original data set; such an approach is a topic for future research, as are improved methods for training directly on nonscaled data sets.
In the following sections, we do not report results for random coplayers. We did not test this case in order to conserve computational resources and because these results tend to be similar to the cases with base-stock coplayers.

4.3.1. Real-World Data Set I. The first data set is a basket data set (Pentaho 2008), which consists of data from a retailer (Foodmart) in 1997 and 1998. This data set provides demands for 24 item categories, from which we randomly chose three. Histograms of the demands for the selected categories are shown in Figure 8, (a)–(c).
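The scaling described above in Section 4.3 (each demand observation divided by the maximum demand across all items and multiplied by 10, so that demands fall in [0, 10]) is a simple preprocessing step. A minimal sketch follows, with array and function names of our own choosing; the inverse transform is included because the unscaling idea mentioned above would need it.

    import numpy as np

    def scale_demands(demand, max_demand=None):
        # map raw demands into [0, 10]: divide by the overall maximum, times 10
        if max_demand is None:
            max_demand = float(demand.max())
        return 10.0 * demand / max_demand, max_demand

    def unscale(quantity, max_demand):
        # inverse transform, e.g., to map recommended orders back to original units
        return quantity * max_demand / 10.0

    raw = np.array([3.0, 47.0, 140.0, 12.0])   # hypothetical demand observations
    scaled, m = scale_demands(raw)             # scaled values lie in [0, 10]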

Figure 8. (Color online) Demand Histograms for the Randomly Selected Categories/Items for Data Sets I ((a)–(c)) and II ((d)–(f))

Table 4. Results for Real-World Data Set I with Scaled Demands



Category 6 Category 13 Category 22

Coplayer type Agent -SRDQN -BS Gap (%) -SRDQN -BS Gap (%) -SRDQN -BS Gap (%)

BS- R 4.25 4.08 4.05 4.16 4.07 2.38 3.10 3.01 3.00
W 4.32 4.08 5.83 4.36 4.07 7.29 3.14 3.01 4.33
D 4.64 4.08 13.68 4.24 4.07 4.19 3.17 3.01 5.11
M 4.32 4.08 5.85 4.66 4.07 14.64 3.14 3.01 4.33
Average 7.35 7.12 4.20
Strm- R 5.70 5.96 −4.28 5.69 6.26 −9.04 4.28 4.40 −2.79
W 6.56 7.01 −6.47 6.89 7.04 −2.12 4.93 5.26 −6.14
D 6.95 7.81 −11.04 7.57 7.70 −1.74 6.24 6.11 2.16
M 8.83 9.16 −3.57 9.74 8.99 8.31 7.10 7.29 −2.56
Average −6.34 −1.15 −2.33

In each category, there are fewer than 500 historical samples, which makes the problem hard to solve.
Table 4 presents the results. As shown, in all cases, BS-SRDQN obtains policies whose costs are close to those of BS-BS, with an average gap of 6.22%. Moreover, when it plays with Sterman coplayers, in all but two cases, Strm-SRDQN performs better than Strm-BS, with an average cost that is 3.27% lower.

4.3.2. Real-World Data Set II. The second real-world data set includes five years of sales data for 50 items (Kaggle.com 2018), with 18,260 records per item. Figure 8, (d)–(f), shows the demand histograms of the three randomly selected categories. We trained SRDQN for similar cases as in Section 4.3.1.
Table 5 presents the results for the three coplayer types. With base-stock coplayers, BS-SRDQN always learns policies whose costs are close to those of BS-BS, with an average gap of 7.44%. In the case of Sterman coplayers, in most cases (11 of 12), Strm-SRDQN attains a smaller cost than Strm-BS, with an average gap of 10.12%.
Finally, given the 90% confidence intervals, Table 6 summarizes the experiments of this section by counting the number of times that SRDQN or BS obtains a statistically smaller cost or that they are statistically equal. As shown, with Strm coplayers, SRDQN obtains a statistically smaller cost in 75% of cases (27 of 36), equal costs in 22% (8 of 36 cases), and a larger cost in 2.7% (1 of 36 cases). This is because the BS player uses a fixed base-stock level for the entire episode, which may not be optimal for playing with Strm agents who play irrationally. By contrast, SRDQN learns the expected cost of the episode—the Q-value—for each state–action pair. Therefore, at each time step of the game, SRDQN looks at its current state, and based on what it has learned from the behavior of the other agents in the same or similar states, it suggests the action that results in the smallest approximate sum of costs.
By contrast, with BS coplayers, the three coplayers follow rational policies, and BS-BS is optimal. Similarly, for each state–action pair, SRDQN learns the sum of the costs for the entire episode. The learning is quite effective; it achieves statistically equal costs in 47.2% of cases (17 of 36), slightly larger costs in 50%, and a smaller cost in one case (see the results for C(4, 8) in Section 4.2). Moreover, for Rand coplayers, SRDQN performs almost as well as BS; in 75% of cases, they are in a statistical tie; in 16.7%, SRDQN outperforms BS; and in 8.3%, BS obtains smaller costs.
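The action-selection logic described above amounts to evaluating, in the current state, the learned Q-value (approximate expected remaining cost) of every candidate order quantity and choosing the smallest. A minimal sketch, assuming a trained network q_net that returns one cost estimate per candidate action; the names and the toy numbers are ours, not the paper's.

    import numpy as np

    def choose_order(q_net, state, actions):
        # pick the action whose estimated expected remaining cost is smallest
        q_values = q_net(state)
        return actions[int(np.argmin(q_values))]

    # toy illustration with a hand-coded stand-in for the trained network
    actions = [-2, -1, 0, 1, 2]   # in the d + x rule, these are offsets added to the observed demand
    fake_q_net = lambda state: np.array([5.1, 3.9, 3.2, 3.4, 4.8])
    print(choose_order(fake_q_net, state=None, actions=actions))   # prints 0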

Table 5. Results of Real-World Data Set II with Scaled Demands

Category 5 Category 34 Category 46

Coplayer type Agent -SRDQN -BS Gap (%) -SRDQN -BS Gap (%) -SRDQN -BS Gap (%)

BS- R 3.23 3.08 4.88 2.87 2.70 6.21 3.28 3.13 4.77
W 3.38 3.08 9.91 2.96 2.70 9.65 3.28 3.13 4.77
D 3.40 3.08 10.58 2.95 2.70 9.28 3.29 3.13 5.07
M 3.30 3.08 7.25 2.95 2.70 9.28 3.37 3.13 7.73
Average 8.15 8.60 5.57
Strm- R 5.70 5.35 −5.26 4.84 4.78 1.33 5.26 5.49 −4.26
W 4.85 5.49 −11.61 4.44 4.74 −6.25 4.42 5.58 −20.80
D 5.29 6.35 −16.68 4.48 5.24 −14.46 5.53 6.87 −19.60
M 7.18 7.74 −7.16 5.73 6.25 −8.30 7.85 8.56 −8.37
Average −10.18 −6.92 −13.26

Table 6. Summary of the Performance for Each Algorithm with Literature and Real-World Cases

Literature cases Real-world data set I Real-world data set II

Coplayers SRDQN BS Tie SRDQN BS Tie SRDQN BS Tie

BS- 1 6 5 0 5 7 0 7 5
Strm- 11 0 1 5 1 6 11 0 1
Rand- 2 1 9

To summarize, SRDQN performs well on well-scaled real-world data sets regardless of the way the other agents play. The SRDQN agent learns to perform nearly as well as a base-stock policy when its coplayers follow a base-stock policy (in which case it is optimal for the remaining agents to follow a base-stock policy). When playing with irrational coplayers, it achieves a much smaller cost than a base-stock policy does.

4.5. Sensitivity Analysis
In this section, we explore the ability of a trained model to perform well even if it plays the game using settings that are different from those used during training.

Figure 9. (Color online) Sensitivity Analysis with Respect to Cost, Perturbing cp and ch, with the N(10, 2) ((a) and (b)), the U[0, 8] ((c) and (d)), and the C(4, 8) ((e) and (f)) Demand Distributions

Note. Each subfigure shows the result for one coplayer type.

In particular, we analyze the sensitivity of a trained model to changes in the cost coefficients (in Section 4.5.1) and in the size of the historical data set (in Section 4.5.2).

4.5.1. Parameters. In order to analyze the sensitivity of the trained model to changes in the shortage and holding cost coefficients, we multiply cp and ch (independently) by (100 + dev_cp) and (100 + dev_ch) percent, where {dev_cp, dev_ch} ∈ {−90, −80, ..., −10, 10, 20, ..., 100, 125, 150, 175, 200, 250, 300, 400}. For each of these new sets of cost coefficients, we use the models that were trained under the original cost coefficients for the demand distributions in Section 4.2, and we use these models to play the game for 50 episodes under the new costs. To generate the BS-BS and Strm-BS benchmarks, we obtained the optimal base-stock levels for each new set of costs by the Clark–Scarf algorithm and had the agent play the game for 50 episodes using a base-stock policy with these base-stock levels. For Sterman coplayers, the Clark–Scarf algorithm does not provide the optimal base-stock level for the remaining player. However, searching for the best possible base-stock levels in Strm-BS is quite expensive, and there are 8,112 cases to run. Therefore, because of a lack of computational resources, we simply used the Clark–Scarf algorithm to find base-stock levels for each case. This choice is justified by the fact that our goal in this section is to study the robustness of SRDQN rather than its performance compared with a base-stock policy.
Figure 9 shows the results for all demand distributions. For each pair of dev_cp and dev_ch values, the figure plots the ratio of the BS-SRDQN or Strm-SRDQN cost to the BS-BS or Strm-BS cost, averaged over the 50 episodes. From the figures, it is clear that our algorithm works well without any additional training for a wide range of cost values. When dev_cp/dev_ch ≈ 1 (i.e., the cost parameters scale roughly in sync), we get almost the same performance that we got from the fully trained model. The reason for this is that dev_cp/dev_ch ≈ 1 results in the same ratio of Q-values for each action, and as a result, the action-selection procedure is not affected by the cost coefficients' deviation. As dev_cp/dev_ch moves away from one, the cost ratio increases, relatively slowly. The sharpest increase in the SRDQN:BS ratio occurs when the stockout cost decreases but the holding cost increases. The reason for this behavior is that the model is trained assuming that the stockout cost is greater than the holding cost; when the stockout cost decreases and the holding cost increases, the trained Q-values become less relevant, and the trained model fails to provide good results.

4.5.2. Size of Historical Data Set. In many real-world supply chain settings, the size of the available historical data set is much smaller than what is required to train an SRDQN model. In Section 4.3.1, we tested our method on a real-world data set and concluded that it still performed well if we train it on simulated data from an empirical distribution that is estimated from the available historical data. In this section, we test this approach more rigorously, explicitly comparing its performance with that of a model that has been trained on the full data set.
To this end, we generated 100 observations from the demand distributions used in Section 4.2, that is, U[0, 8] and N(10, 2). (We excluded the classic demand distribution C(4, 8), because the demand pattern is fixed in that distribution.) Then, for each demand distribution, we obtained a histogram of those 100 data points and generated 60,000 episodes using the empirical distribution implied by the histogram. We trained two models (one for each distribution) on these new data sets. We then tested the performance of these low-observation models under the same test data sets that we used in Section 4.2. By comparing the performance of the low-observation models with that of the full-observation models (i.e., the models trained on the full data sets in Section 4.2), we can explore whether a massive historical data set is truly required for training the model.
Our null hypothesis is that the low- and full-observation models have equal performance. In order to test this null hypothesis, Figure 10 provides box plots for both the full- and low-observation models when the SRDQN plays with base-stock coplayers for both normal and uniform demands.
Although the low-observation model has more outliers, the performance of the two models is quite similar; the confidence intervals overlap, and we cannot reject the null hypothesis that the two models have equal performance. Therefore, even with only 100 historical observations, our model works well, with a performance that is statistically similar to that in Section 4.2.

Figure 10. (Color online) Comparison of Full- and Low-Observation Training Data for Two Demand Distributions
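The low-observation procedure described above (estimate an empirical distribution from 100 demand observations, then simulate 60,000 training episodes from it) can be sketched as follows. The function and variable names are ours, and the 100-period episode length is taken from the "200 episodes (i.e., 20,000 periods)" equivalence noted in Section 4.3.

    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_sampler(observations):
        # build a histogram (empirical) distribution from a small sample
        values, counts = np.unique(observations, return_counts=True)
        probs = counts / counts.sum()
        return lambda size: rng.choice(values, size=size, p=probs)

    # 100 observations standing in for the historical data (here drawn from N(10, 2))
    observations = np.clip(np.round(rng.normal(10, 2, size=100)), 0, None)
    sample = empirical_sampler(observations)

    # 60,000 training episodes of 100 periods each, drawn from the empirical distribution
    training_demands = sample((60000, 100))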

Table 7. Results of Transfer Learning When π1 Is BS-SRDQN and D1 Is U[0, 2]



(Holding, shortage) cost coefficients

Problem R W D M |A| D2 π2 Gap (%) CPU time (s)

Base agent (2,2) (2,0) (2,0) (2,0) 5 U[0, 2] BS-SRDQN 2.31 28,390,987
Case 1 (2,2) (2,0) (2,0) (2,0) 5 U[0, 2] BS-SRDQN 6.06 1,593,455
Case 2 (5,1) (5,0) (5,0) (5,0) 5 U[0, 2] BS-SRDQN 6.16 1,757,103
Case 3 (2,2) (2,0) (2,0) (2,0) 11 U[0, 2] BS-SRDQN 10.66 1,663,857
Case 4 (10,1) (10,0) (10,0) (10,0) 11 U[0, 2] BS-SRDQN 12.58 1,593,455
Case 5 (1,10) (0.75,0) (0.5,0) (0.25,0) 11 N(10, 2²) BS-SRDQN 17.41 1,234,461
Case 6 (1,10) (0.75,0) (0.5,0) (0.25,0) 11 N(10, 2²) Strm-SRDQN −38.20 1,153,571
Case 6 (1,10) (0.75,0) (0.5,0) (0.25,0) 11 N(10, 2²) Rand-SRDQN −0.25 1,292,295

4.6. Faster Training Through Transfer Learning
We trained an SRDQN network with shape [50, 180, 130, 61, 5], m = 10, β = 20, and C = 10,000 for each agent, with the same holding and stockout costs and action spaces as in Section 4.1, using 60,000 training episodes, and used these as the base networks for our transfer-learning experiment. (In transfer learning, all agents should have the same network structure to share the learned network among different agents.) The remaining agents use base-stock levels obtained by the Clark–Scarf algorithm.
Table 7 shows a summary of the results of the six cases discussed in Section 3.2. The first set of columns indicates the holding and shortage cost coefficients, the size of the action space, as well as the demand distribution and the coplayers' policy for the base agent (first row) and the target agent (remaining rows). The "Gap" column indicates the average gap between the cost of BS-SRDQN and the cost of BS-BS; in the first row, it is analogous to the 2.31% average gap reported in Section 4.1.1. The average gap is relatively small in all cases, which shows the effectiveness of the transfer-learning approach. Moreover, this approach is more efficient than training the agents from scratch, as demonstrated in the last column, which reports the average central processing unit (CPU) times for all agents. In order to get the base agents, we did hyperparameter tuning and trained 140 instances to get the best possible set of hyperparameters, which resulted in a total of 28,390,987 seconds of training. However, using the transfer-learning approach, we do not need any hyperparameter tuning; we only need to check which source agent and which k provide the best results. This requires only 12 instances to train and resulted in an average training time (across cases 1–4) of 1,613,711 seconds—17.6 times faster than training the base agent. Additionally, in case 5, in which a normal distribution is used, full hyperparameter tuning took 20,396,459 seconds, with an average gap of 4.76%, which means that transfer learning was 16.6 times faster on average. We did not run the full hyperparameter tuning for the instances of case 6, but it is similar to that of case 5 and should take similar training time and, as a result, a similar improvement from transfer learning. Thus, once we have a trained agent i with a given set P_1^i of parameters, demand D_1, and coplayers' policy π_1, we can efficiently train a new agent j with parameters P_2^j, demand D_2, and coplayers' policy π_2.
In order to get more insights into the transfer-learning process, Figure 11 shows the results of case 4, which is a quite complex transfer-learning case that we test for the beer game. The target agents have holding and shortage costs (10, 1), (10, 0), (10, 0), and (10, 0) for agents 1–4, respectively, and each agent can select any action in {−5, ..., 5}. Each caption reports the base agent (shown by b) and the value of k used. Compared with the original procedure (Figure 4), that is, k = 0, the training is less noisy, and after a few thousand nonfluctuating training episodes, it converges to the final solution. The resulting agents obtain costs that are close to those of BS-BS, with a 12.58% average gap compared with the BS-BS cost. (The details of the other cases are provided in Sections I.1–I.5 of Online Appendix I.)

Figure 11. (Color online) Results of Transfer Learning for Case 4 (Different Agent, Cost Coefficients, and Action Space)

Note. In the four panels, the base agent b and number of layers k are as follows: (a) b = 3, k = 1; (b) b = 1, k = 1; (c) b = 3, k = 2; and (d) b = 4, k = 2.
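To make the role of k concrete, the following is a minimal PyTorch-style sketch under one reading of the procedure: all agents share the [50, 180, 130, 61, 5] network shape, the first k layers are copied from a trained source agent and frozen, and only the remaining layers are retrained for the target agent. This is an illustration of the idea, not the paper's implementation.

    import copy
    import torch.nn as nn

    def build_net():
        # assumed reading of shape [50, 180, 130, 61, 5]: 50 inputs, hidden sizes 180/130/61, 5 actions
        return nn.Sequential(
            nn.Linear(50, 180), nn.ReLU(),
            nn.Linear(180, 130), nn.ReLU(),
            nn.Linear(130, 61), nn.ReLU(),
            nn.Linear(61, 5),
        )

    def transfer(source_net, k):
        # copy the trained source network, then freeze its first k Linear layers
        target = copy.deepcopy(source_net)
        frozen = 0
        for layer in target:
            if isinstance(layer, nn.Linear) and frozen < k:
                for p in layer.parameters():
                    p.requires_grad = False
                frozen += 1
        return target

    source = build_net()             # stands in for a fully trained base agent
    target = transfer(source, k=2)   # k = 2: two layers reused and frozen, the rest retrained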

Table 8. Savings in Computation Time Because of Transfer Learning

                                         k = 0      k = 1      k = 2      k = 3

Training time                            185,679    126,524    118,308    107,711
Decrease in time compared with k = 0     —          37.61%     41.66%     46.89%
Average gap in cases 1–4                 2.31%      4.39%      4.22%      17.66%
Average gap in cases 1–6                 —          −15.95%    −11.65%    0.34%

Notes. First row provides average training time among all instances. Third row provides average of the best obtained gap in cases for which an
optimal solution exists. Fourth row provides average gap among all transfer-learning instances, that is, cases 1–6.

Finally, Table 8 explores the effect of k on the tradeoff between training speed and solution accuracy. As k increases, the number of trainable variables decreases and, not surprisingly, the CPU times are smaller, but the costs are larger. For example, when k = 3, the training time is 46.89% smaller than the training time when k = 0, but the solution cost is 17.66% and 0.34% greater than the BS-BS policy compared with 4.22% and −11.65% for k = 2.
To summarize, transferring the acquired knowledge between agents is much more efficient than training the agents from scratch. The target agents achieve costs that are close to those of BS-BS, and they achieve smaller costs than Strm-BS and Rand-BS, regardless of the dissimilarities between the source and target agents. The training of the target agents starts from relatively small cost values, the training trajectories are stable and fairly nonnoisy, and they quickly converge to a cost value close to that of BS-BS or smaller than Strm-BS and Rand-BS. Even when the action spaces and costs for the source and target agents are different, transfer learning is still quite effective, resulting in a 12.58% gap compared with BS-BS. This is an important result because it means that if the settings change—either within the beer game or in real supply chain settings—we can train new SRDQN agents much more quickly than we could if we had to begin each training from scratch.

5. Conclusion and Future Work
In this paper, we consider the beer game, a decentralized, multiagent, cooperative supply chain problem. A base-stock inventory policy is known to be optimal for special cases, but once some of the agents do not follow a base-stock policy (as is common in real-world supply chains), the optimal policy of the remaining players is unknown. To address this issue, we propose an algorithm based on DQNs. Because we use a nonlinear approximator, there are no theoretical guarantees of attaining the global optimum, although the method works well in practice. It obtains near-optimal solutions when playing alongside agents that follow a base-stock policy and performs much better than a base-stock policy when the other agents use a more realistic model of ordering behavior. The training of the model might be slow, but the execution is very fast. Furthermore, the algorithm does not require knowledge of the demand probability distribution and uses only historical data.
To reduce the computation time required to train new agents with different cost coefficients or action spaces, we propose a transfer-learning method. Training new agents with this approach takes less time because it avoids the need to tune hyperparameters and has a smaller number of trainable variables. Moreover, it is quite powerful, resulting in beer game costs that are close to those of fully trained agents while reducing the training time by an order of magnitude.
A natural extension of this paper is to apply our algorithm to supply chain networks with other topologies, for example, distribution networks. Performing a theoretical analysis of the algorithm would be an interesting research direction to determine the rate of convergence or worst-case error bounds. A more comprehensive numerical study would also shed additional light on the performance of our approach, especially if the study can be performed with real-world data. Another important extension is having multiple learnable agents. Finally, developing algorithms capable of handling continuous action spaces will improve the accuracy of our algorithm.

Endnote
1. Our implementation uses the version of the Clark–Scarf algorithm presented by Chen and Zheng (1994) and Gallego and Zipkin (1999), but we refer to it as the Clark–Scarf algorithm throughout.

References
Ban G-Y, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Oper. Res. 67(1):90–108.
Ban G-Y, Gallien J, Mersereau AJ (2019) Dynamic procurement of new products with covariate information: The residual tree method. Manufacturing Service Oper. Management 21(4):798–815.
Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 27(4):819–840.
Chaharsooghi SK, Heydari J, Zegordi SH (2008) A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support Systems 45(4):949–959.
Chan D (2017) The AI that has nothing to learn from humans. Atlantic 20:354–359.
Chen F, Samroengraja R (2000) The stationary beer game. Production Oper. Management 9(1):19.

Chen F, Zheng Y (1994) Lower bounds for multi-echelon stochastic inventory systems. Management Sci. 40(11):1426–1443.
Clark AJ, Scarf H (1960) Optimal policies for a multi-echelon inventory problem. Management Sci. 6(4):475–490.
Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems (American Association for Artificial Intelligence, Madison, WI), 746–752.
Croson R, Donohue K (2006) Behavioral causes of the bullwhip effect and the observed value of inventory information. Management Sci. 52(3):323–336.
Gallego G, Zipkin P (1999) Stock positioning and performance estimation in serial production-transportation systems. Manufacturing Service Oper. Management 1(1):77–88.
Geary S, Disney SM, Towill DR (2006) On bullwhip in supply chains—Historical review, present practice and expected future impact. Internat. J. Production Econom. 101(1):2–18.
Giannoccaro I, Pontrandolfo P (2002) Inventory management in supply chains: A reinforcement learning approach. Internat. J. Production Econom. 78(2):153–161.
Gijsbrechts J, Boute RN, Van Mieghem JA, Zhang D (2019) Can deep reinforcement learning improve inventory management? Performance on dual sourcing, lost sales and multi-echelon problems. https://ssrn.com/abstract=3302881 or http://dx.doi.org/10.2139/ssrn.3302881.
Graves SC (1985) A multi-echelon inventory model for a repairable item with one-for-one replenishment. Management Sci. 31(10):1247–1256.
Jiang C, Sheng Z (2009) Case-based reinforcement learning for dynamic inventory control in a multi-agent supply chain system. Expert Systems Appl. 36(3):6520–6526.
Kaggle.com (2018) Store item demand forecasting challenge. Accessed August 26, 2019, https://www.kaggle.com/c/demand-forecasting-kernels-only/overview.
Kimbrough SO, Wu D-J, Zhong F (2002) Computers play the beer game: Can artificial agents manage supply chains? Decision Support Systems 33(3):323–333.
Lee HL, Padmanabhan V, Whang S (1997) Information distortion in a supply chain: The bullwhip effect. Management Sci. 43(4):546–558.
Lee HL, Padmanabhan V, Whang S (2004) Comments on "Information distortion in a supply chain: The bullwhip effect." Management Sci. 50(12S):1887–1893.
Li Y (2017) Deep reinforcement learning: An overview. Preprint, submitted XX, https://arxiv.org/abs/1701.07274.
Martinez-Moyano IJ, Rahn J, Spencer R (2014) The beer game: Its history and rule changes. Technical report, University at Albany, Albany, NY.
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, et al (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Int. Conf. Mach. Learn. (PMLR, New York), Vol. 48, 1928–1937.
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
Oroojlooyjadid A, Hajinezhad D (2019) A review of cooperative multi-agent deep reinforcement learning. Preprint, submitted XX, https://arxiv.org/abs/1908.03963.
Oroojlooyjadid A, Snyder L, Takáč M (2017) Stock-out prediction in multi-echelon networks. Preprint, submitted XX, https://arxiv.org/abs/1709.06922.
Oroojlooyjadid A, Snyder LV, Takáč M (2020) Applying Deep Learning to the Newsvendor Problem (Taylor & Francis), 444–463.
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans. Knowledge Data Engrg. 22(10):1345–1359.
Pentaho (2008) Foodmart's database tables. Accessed September 30, 2015, http://pentaho.dlpage.phi-integration.com/mondrian/mysql-foodmart-database.
Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, et al (2017) Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. Preprint, submitted XX, https://arxiv.org/abs/1711.05225.
Snyder LV (2018) Multi-echelon base-stock optimization with upstream stockout costs. Technical report, Lehigh University, Bethlehem, PA.
Snyder LV, Shen Z-JM (2019) Fundamentals of Supply Chain Theory, 2nd ed. (John Wiley & Sons, Hoboken, NJ).
Sterman JD (1989) Modeling managerial behavior: Misperceptions of feedback in a dynamic decision making experiment. Management Sci. 35(3):321–339.
Strozzi F, Bosch J, Zaldivar J (2007) Beer game order policy optimization under changing customer demand. Decision Support Systems 42(4):2153–2163.
Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
Yang Z, Xie Y, Wang Z (2019) A theoretical analysis of deep Q-learning. Proc. Machine Learning Res. 120:486–489, https://arxiv.org/abs/1901.00137.
Zhang K, Yang Z, Başar T (2019) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Preprint, submitted XX, https://arxiv.org/abs/1911.10635.
