FinalTermThesis Prashant Pandey
LEARNING
PRASHANT PANDEY
AUGUST 2020
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
Background
Objectives
Research Questions
Scope of the Study
Significance of the Study
Structure of the Study
ACKNOWLEDGEMENTS
I would like to thank my thesis mentor, Dr Manoj Jayabalan, for his guidance, many fruitful discussions, and for always making time when I needed advice. I would also like to thank my supervisor, Dr Suvajit Mukopadhyay, and my student mentor, Rashmi Soniminide, for all the support they offered.
ABSTRACT
Building multi-agent interactions and long-term goal planning are two of the key challenges in AI and robotics. Long-term goal planning can also be viewed as the challenge of deciphering very long action sequences with numerous possible outcomes. Overcoming these challenges can help us build complex AI behaviours and interactions.
This research applies modern reinforcement learning techniques to train agents in a 3D environment, created within Unity, that simulates a battle royale shooter game. The contribution of this thesis is twofold: first, it evaluates the significance of imitation learning for a complex environment; second, it compares the performance of soft actor-critic (SAC) and proximal policy optimization (PPO) for long-term goal planning, as well as the impact of imitation learning on both techniques.
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Background
In recent years, the increase in computational power enabled a revolution: neural networks could be applied to domains where the state of the art had previously been limited. One such area is reinforcement learning. The advancement began with the amalgamation of neural networks and classical reinforcement learning, when DeepMind (Mnih et al., 2013) showed how an agent can learn to play Atari Breakout from raw pixels, without any prior domain knowledge except the game rules.
In 2016, Lee Sedol, the world champion of Go, was defeated by DeepMind's AlphaGo. This marked a milestone in the advancement of deep reinforcement learning: the game of Go had long been a grand challenge in computer science, and AlphaGo was the first computer program to defeat a human world champion at it. The initial model of AlphaGo was trained using expert replays and later improved using reinforcement learning (Silver et al., 2016). In 2017, DeepMind followed with AlphaZero, a generalized algorithm that can master board games learning tabula rasa (Silver et al., 2017). These events changed the world's perception of RL.
Soon after conquering board games, DeepMind created AlphaStar (Vinyals et al., 2019) for StarCraft, believed to be one of the most difficult games to master even for humans. The action space of StarCraft is enormous and outcomes can diverge widely: actions taken long before, perhaps at the very beginning of the game, can have dire consequences in an end-game fight. StarCraft thus exemplifies a game that requires both micromanagement (managing immediate actions) and long-term goal planning, along with adapting to the opponent's strategy or gameplay.
Games have long served as a testbed for research on reinforcement learning algorithms. Taking inspiration from this previous work, this research evaluates the performance of some recently developed algorithms on a game environment, to gain a deeper understanding of when and how they should be used.
Objectives
This research evaluates the significance of imitation learning in training an RL agent for a multi-agent 3D shooter battle royale game, and compares the performance and effectiveness of proximal policy optimization (PPO) and soft actor-critic (SAC) during this training. It also seeks to identify the impact of generative adversarial imitation learning (GAIL) on the effectiveness of both methods. The research objectives, formulated from this aim, are as follows:
- Design and implement a battle royale shooter game environment; identify and implement a reward structure for training agents; manage spawning, movement, and shooting for agents; and create a system for tracking performance metrics such as kills per round, total wins, and items collected.
- Identify the curriculum for learning the macro and micro strategy of the game.
- Train the RL agent separately using PPO, SAC, a PPO-GAIL combination, and a SAC-GAIL combination, and obtain performance statistics.
- Compare the results obtained for agents trained with each technique and draw inferences from the results.
Research Questions
The research questions are:
- Which of the two methods, PPO (on-policy) and SAC (off-policy), is more suitable for training RL agents in a complex environment?
- Can imitation learning improve training quality and time for RL agents in complex scenarios, or does it instead prevent the agent from fully exploring the action space and degrade its performance?
- Does imitation learning (GAIL) affect PPO and SAC differently, and if so, what are the implications?
which can be modified to a first-person or third-person perspective. The research uses this environment to evaluate the performance of state-of-the-art algorithms (PPO and SAC) and to identify the impact of imitation learning (GAIL) on these algorithms. Training in this research is intentionally restricted to a small number of steps (2 million), although neither the environment nor the Unity engine imposes such a limit. Furthermore, this research does not use LSTM- or CNN-based architectures, in order to simplify the model and speed up training. As future work, this environment can be used to evaluate other algorithms or to validate new model architectures.
Chapter 4 describes the details of the implementation of the RL environment and the agent.
Chapter 5 contains the results obtained from this research.
Chapter 6 draws meaning from the results obtained and details the future scope.
CHAPTER 2
LITERATURE REVIEW
Introduction
This research evaluates and compares reinforcement learning techniques, namely PPO, SAC, GAIL-PPO, and GAIL-SAC. Hence, the following sections walk through important terminology used in RL and DRL, and study some well-known RL and DRL techniques, including their advantages and limitations. This chapter also explores case studies of DRL research on games that are considered important breakthroughs of recent years.
2.2.1.1 Environment
An environment is the play area for the agent, where agents can interact and perform actions resulting in positive or negative feedback. In the real world, the environment is considered a subset of an infinite universe that is observable and of interest. A 3D game environment is a virtual space in which the agent can traverse and perform certain physical actions, based on the simulation or implementation of a virtual environment. Mathematically, an environment is a model that is the key source of rewards on state transitions. Most tasks and problems are sufficiently complex to require some sort of simulation to handle all the necessary parameters of a state model (Juliani et al., 2018). In the real world, on the other hand, exactly defining the boundaries of an environment is much more difficult because of the sheer number of parameters that must be considered.
The discounted return is given by:

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
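In code, the discounted return can be computed with a single backward pass over an episode's rewards. A minimal sketch (function name and sample values are illustrative, not from the thesis):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for the first step
    of a finite episode, accumulating from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, with rewards [1, 1, 1] and γ = 0.5 the return is 1 + 0.5 + 0.25 = 1.75, illustrating how the discount factor down-weights distant rewards.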
A Markov decision process (MDP) rests on the Markov assumption, which states that at any point in time the next state depends only on the current state; how the current state was reached does not matter, rendering the preceding states insignificant.
An episode is defined as the sequence of transitions from the start state to a terminal state. Episodes form a sequence in an MDP, and if the transition probabilities and reward functions can be determined, the problem becomes an optimal control problem (Alagoz et al., n.d.).
Reinforcement learning algorithms can be sorted into two classes, based on whether they deal with the value function or with policy optimization without value function iterations (Kormushev et al., 2013):
- Value function based reinforcement learning
- Policy search based reinforcement learning
2.2.1.4 Policy
A policy is the mapping that determines which action to take in a given state. A policy is deterministic if it outputs a single action for an input state s, and stochastic if it outputs a probability distribution over actions for an input state s. A deterministic policy is given by:

π(s_t) = a_t

A stochastic policy is given by:

π(a_t | s_t) = p(a_t | s_t)
An optimal policy maximizes the expected cumulative reward in a given environment and is denoted π*.
The value function of a state under a policy π is the expected discounted return obtained when starting from that state:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

The value function is thus a measure of the quality of each state.
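The distinction between the two kinds of policy can be sketched in a few lines of code. This is an illustrative example (the states and actions here are invented for the shooter setting, not taken from the thesis implementation):

```python
import random

def deterministic_policy(state):
    # pi(s) = a: always maps a state to exactly one action.
    return "shoot" if state == "enemy_in_range" else "move"

def stochastic_policy(state):
    # pi(a|s): maps a state to a probability distribution over actions.
    if state == "enemy_in_range":
        return {"shoot": 0.8, "move": 0.2}
    return {"shoot": 0.1, "move": 0.9}

def sample_action(distribution):
    # Draw one action according to the distribution pi(a|s).
    actions = list(distribution)
    weights = list(distribution.values())
    return random.choices(actions, weights=weights)[0]
```

A deterministic policy always returns the same action for a state, whereas a stochastic policy must be sampled, which naturally supports exploration.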
2.2.1.6 Quality Function
An optimal policy, or a sub-optimal policy very close to it, can be obtained empirically by multiple methods. One such method uses the quality function, also known as the Q-value (Sutton and Barto, n.d.), to identify the reward accumulated by taking an action in a given state. The Q function tells how good an action is when taken from a given state: it sums the reward obtained from the transition caused by that action and the discounted future rewards.
Mathematically,

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
Bellman optimality equations can be solved iteratively using dynamic programming, given that the transition probabilities and reward functions are pre-determined or can be determined on the fly. Algorithms that assume the probabilities are known, or can be estimated online, fall under the category of model-based algorithms.
Most algorithms used in practice perform rollouts on the system to determine the policy and value function, since in most cases the transition probabilities are not known; such algorithms are known as model-free algorithms.
2.2.1.8 Model-free methods
Model-free methods do not require knowledge of the environment model. These techniques are based on estimating the value function. If the optimal policy is inferred from the approximated value function, the technique is categorized as a value function based method; if the optimal policy is obtained by searching the domain of policy parameters, it is categorized as a policy search method. Model-free methods can thus target and solve most RL problems.
Model-free methods are also commonly classified as on-policy or off-policy, based on how the policy is used. On-policy methods use the current policy both for generating actions and for being updated from the outcomes. Off-policy methods keep separate policies: one generates actions while the other is updated.
error is used to update the value function. This update is made every step, whereas Monte Carlo methods must wait for the episode to end before updating. The frequent updates result in lower variance but higher bias.
Two TD-based algorithms are popularly used for solving RL problems:
Q-Learning (off-policy): the Q-function Q(s, a) gives the maximum future reward, discounted by a factor γ, obtainable under the optimal policy by taking action a in the current state s. In Q-learning, the RL environment is explored until the iteratively updated Q-table values converge. Mathematically,

Q(s, a) = r + γ max_{a'} Q(s', a')

where 0 ≤ γ ≤ 1. For a given iteration i, the current state s transitions to the future state s' by virtue of action a with an immediate reward r; the transition can be interpreted as
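In practice the target above is blended into the current estimate using a learning rate α. A minimal tabular sketch of one Q-learning update (function and argument names are illustrative, not from the thesis):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action); missing entries default to 0."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Starting from an empty table, a transition with reward 1.0 and α = 0.5 moves Q(s, a) halfway towards the target, illustrating the incremental convergence of the Q-table.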
methods are also known as “black box” methods. If the utility function is known, or if the policy search method uses some structure from the RL problem, it is categorised among “white box” methods (Kober et al., 2013). Policy search algorithms may use gradient descent for policy optimization; the gradient can be derived in many different ways, leading to a complete research domain of gradient-based policy search (Deisenroth et al., 2011).
Value-based approaches usually fail to learn stochastic policies, due to their deterministic nature, whereas policy search algorithms have no such limitation. In practice, however, these algorithms learn slowly and require a large number of training steps, and large variance is observed during the policy evaluation step. Hence “white box” methods are usually preferred when training for a large number of steps is not possible (Kober and Peters, n.d.). Imitation learning can be used to speed up training, leading to policy convergence in fewer steps.
a combination of the value function and policy search approaches called the actor-critic structure (Barto et al., 1983) was proposed to fuse the advantages of both. The “actor” is the control policy; the “critic” is the value function. Action selection is controlled by the actor, while the critic transmits value estimates to the actor to decide when the policy should be updated and whether the chosen action should be preferred. Although RL offers several methods fitting different kinds of problems, these methods share the same intractability issues, for example memory complexity. Finding a suitable and powerful function approximator becomes the imminent issue. In reinforcement learning, the value function is approximated from observations sampled while interacting with the environment (Kober et al., 2013). Function approximation has been investigated extensively, and with the fast development of deep learning, a powerful function approximator, the deep neural network, can address these complex issues; the next section focuses on deep learning and artificial neural networks. Traditional RL has two primary issues to resolve: first, reducing the learning time; second, using RL methods when real-world applications do not follow a Markov decision process.
performing multiple epochs of mini-batch updates, whereas the standard methods are limited to one gradient update per data sample.
For a neural network architecture, the optimization uses a combined loss involving the policy surrogate, the value function error, and an entropy bonus to ensure proper exploration, as described by (Mnih et al., 2016).
The estimator used by (Mnih et al., 2016), which is also suitable for recurrent networks provided the policy runs for T time steps where T is much less than the episode length, can be used to arrive at a generalized estimation equation. A PPO algorithm using a fixed trajectory length (number of time steps) is shown in Figure 2.3.5.
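The heart of PPO is its clipped surrogate objective, which limits how far a single update can move the policy. A minimal per-sample sketch (this scalar form is illustrative; real implementations vectorize it over a batch):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate:
    L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s) and A is the advantage estimate.
    Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + ε, and for a negative advantage it stops shrinking below 1 − ε, which is what keeps the policy update conservative.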
where s_t and a_t are the state and the action, and the expectation is taken over the policy and the true dynamics of the system. The policy optimization tries to maximize both the expected reward (first summand) and the entropy (second summand). The non-negative temperature parameter α controls the balance and importance of the entropy term; setting α = 0 makes the entropy term insignificant. Tuning the temperature parameter used to be a manual task; SAC automates the process by treating entropy as a constraint instead of a constant weight. Entropy is allowed to vary within certain limits while its average is constrained, an approach very similar to (Abdolmaleki et al., 2018), where a constraint based on the previous policy prevents immediate large deviations.
The practical implementation of SAC uses two soft Q functions, which independently attempt to minimize the soft Bellman residual. The minimum of the two estimates is used for the stochastic gradient and policy gradient steps (Fujimoto et al., 2018).
The algorithm can be summarised (Haarnoja et al., 2018) as shown in Figure 2.3.6.
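The twin-Q trick and the entropy bonus can be condensed into the computation of the soft Bellman target. A minimal scalar sketch (function name and default values are assumptions for illustration; real SAC implementations compute this over batches with target networks):

```python
def soft_q_target(r, q1_next, q2_next, log_prob_next,
                  gamma=0.99, alpha=0.2, done=False):
    """Soft Bellman target for one transition:
    y = r + gamma * (min(Q1(s',a'), Q2(s',a')) - alpha * log pi(a'|s')).
    Taking the minimum of the two Q estimates counters over-optimism;
    the -alpha * log pi term is the entropy bonus."""
    if done:
        return r
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * soft_value
```

Note how a likely action (high log-probability) reduces the target while an unlikely one increases it, which is exactly the entropy-maximizing pressure described above.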
Contingency, and also with a modified version of online Q-learning using experience replay and stochastic mini-batch updates. The application covered seven different Atari 2600 games, and the modified version outperformed SARSA and Contingency on six of the seven.
The apparent limitations are the use of low-resolution images with 2D convolutions, and the fact that the approach was not validated on any 3D game, so its performance in more complex situations is unclear; the approach can only be generalised to Atari 2600 games.
(van Hasselt et al., 2015) demonstrated how common the problem of over-optimism is in DQN-based models, and showed that the resulting errors can be reduced using Double Q-Learning. Validation was performed on 57 Atari 2600 games.
Since the comparison was made only against DQN, the performance of Double Q-Learning relative to other algorithms is unclear; validation was also limited to 2D environments, so behaviour might differ in a 3D environment.
(Hosu and Rebedea, 2016) demonstrated that human replays can be used for games with very sparse reward structures, which otherwise lead to poor greedy exploration. The approach used the capability of ALE to store checkpoints: 100 human-generated checkpoints were used to train a Q-learning based method. This approach is very similar to modern generative adversarial imitation learning and has numerous real-life applications.
was able to achieve master-level proficiency, a rating of about 2200. The drawback is that this approach was not well suited to self-play, although it achieved good results by training against humans and other computer AI.
(Silver et al., 2016) created AlphaGo, trained using supervised learning from human experts; it was the first computer program in history to defeat a human professional player. AlphaGo uses tree search to evaluate positions and a DRL-trained network for move selection. The approach is highly specific to the game and domain.
(Silver et al., 2017) proposed AlphaGo Zero, an algorithm based entirely on self-play and RL. Starting tabula rasa, the model achieved superhuman performance through self-play alone. Self-play was used to improve the tree search, consequently resulting in better decisions.
(Silver et al., 2018) introduced AlphaZero, a general-purpose algorithm for mastering board games. The algorithm was validated on Go, chess, and shogi, attaining superhuman ratings without any game-specific modification, and even outperformed computer AI based on alpha-beta search.
(Shao et al., 2018a) demonstrated that actor-critic with Kronecker-factored trust region (ACKTR) outperformed Asynchronous Advantage Actor-Critic, which had previously been used as the baseline algorithm. The new approach is relatively inexpensive and sample efficient. However, validation compared only kill counts and rewards attained in battles; a better comparison would pit the agents against each other in a deathmatch scenario.
(Kolbe et al., 2019) introduced a new approach coupling case-based reasoning with RL, validated on an FPS environment created within the Unity game engine. The agent was able to learn from experience, with a consistent increase in kill-death ratio. The environment was a considerably simplified version of actual FPS games, however, and the model might not hold up as complexity increases.
performance, but its training was highly dependent on imitation learning; the approach is thus only applicable to domains with such a vast pool of known strategies and demonstrations.
(Andersen et al., 2018) introduced Deep RTS, an artificial intelligence research platform built as a high-performance RTS game supporting accelerated learning. Deep RTS provides access to partially observable state spaces and configurable map complexity.
(Hu and Foerster, 2019) presented a new algorithm, the Simplified Action Decoder (SAD), for multi-agent training. During the training phase, SAD agents observe the greedy actions of their teammates along with the actions actually chosen. SAD requires less computational power and is much simpler than earlier multi-agent approaches. Validation was performed on the Hanabi challenge. There is scope for improvement: the algorithm outperformed some techniques but is not the best, and the approach is limited to five agents.
(OpenAI et al., 2019) built the first AI system to defeat the human world champion in Dota 2. The agent uses a central shared LSTM network feeding separate fully connected networks that generate the value function outputs and the policy. Proximal Policy Optimization (Schulman et al., 2017b) was used to train the policy, and Generalized Advantage Estimation (Schulman et al., 2016) was used to stabilize and speed up training. A game of Dota can last several thousand time steps, which poses a long-term goal planning problem; this was tackled by optimizing for accurate credit assignment over time. The techniques used are highly generalizable, but training used a huge batch size and required a large amount of time.
(Baker et al., 2020) demonstrated that when agents are exposed to several different situations of strategic play, they tend to build a self-supervised autocurriculum with distinct emergent strategies. The research uses a “hide and seek” simulator, where one team of agents must find the agents of another team. The hiders adapt to the strategies chosen by the seekers based on the items available in the environment. The environment comes with six built-in modes (strategies).
The environment has plenty of scope for improvement: at later phases of training, the agents start to exploit inaccuracies and the loose implementation of the physics engine; such behaviours should be avoided.
CHAPTER 3
RESEARCH METHODOLOGY
Introduction
Reinforcement learning simulations (especially for deep reinforcement learning) involve a huge number of parameters. Creating 3D simulations requires developing a large number of systems, which is a challenge for rapid RL research. The simulation environment should be easy to use and scalable to support the various needs of research. Some popular simulation platforms are ViZDoom (Kempka et al., 2016), MuJoCo (Todorov et al., 2012), and the Arcade Learning Environment (Bellemare et al., n.d.). These platforms help refine training algorithms; in certain scenarios they can also be used to train the actual model and deploy it for real-world usage.
This research aims to create AI models using PPO and SAC for an isometric shooter game and, in turn, evaluate the comparative performance of PPO and SAC. The other objective is to evaluate the impact of GAIL on the optimization performed by both algorithms. From these objectives, it is clear that the RL techniques require a platform for implementing and simulating the environment, and that the platform should either support these algorithms out of the box or allow them to be implemented; preferably, it should support the baseline algorithms required for the research.
Unity game engine has been selected as the choice of platform, subsequent sections justify the
criteria for the selection and detail the environment properties.
able to export the created content to a number of platforms, such as Windows and Android, is noteworthy.
Unlike most research platforms built on top of games, or those created to provide the experimental setup for one specific study, Unity as an engine places no restriction on the kind of simulation one wants to create. Unity also comes with an intuitive graphical interface, the Unity Editor, which greatly speeds up development.
3.2.2 ML-agents SDK
The ML-Agents toolkit is the primary way to use ML algorithms in Unity. The ML-Agents SDK provides the means of communication, via the Python API, between the agent and the environment created within the Unity Editor, for optimization and evaluation. The toolkit can efficiently exploit the features of the Unity Engine, such as processing camera output, raycast-based detection, and more.
The toolkit comes with a set of abstract and base classes with a predefined communication protocol to help developers implement and extend functionality. Using this core functionality, developers and researchers create environments with the Unity Editor and the relevant classes that drive a given behaviour. The environment and agent interact directly with the Python API when implemented according to the core classes and the predefined communication protocol.
A learning environment comprises Agents, Brains, and an Academy. Agents collect observations and manage the actions that can be taken. The Brain performs decision making: it uses the policy to determine the best action in a given situation. The Academy coordinates the various constituents of the environment and each agent during a simulation episode. This is represented in Figure 3.2.2.
Algorithms and Supporting Techniques
The ML-Agents toolkit has evolved since its first release and now supports an increasing number of baseline algorithms, different types of neural networks, CNN presets, and modern techniques such as asynchronous parallel training. The following sections describe some of the important algorithms implemented in the ML-Agents toolkit.
that lead to a higher cumulative reward than the GAIL demonstrations. The workings of the discriminator and GAIL are depicted in Figure 3.3.1.
3.3.4 Self-Play
Self-play is a technique commonly used in reinforcement learning in which the agent is trained against a copy of itself. The current policy is compared against a former policy while performing greedy exploration; the better one is retained and the process continues. This approach can implement scenarios where multiple agents of the same type compete, improving the policy. Unity ML-Agents supports self-play by assigning a different team ID to each competing member.
CHAPTER 4
IMPLEMENTATION
Vector observations: the agent receives observations as vectors containing the direction of the centre, the forward direction, the last rotation about the y-axis, the last movement direction, an observation of whether an object is in melee range, in shooting range, or out of range, the time elapsed since the game started, current health, and ammo held.
Action Set
For this research, all actions the agent can take are discrete, and the agent's decision polling rate is 20 decisions (sets of actions) per second. Discretizing the action domain simplifies the problem, leading to faster convergence. The action set consists of five segments, described below:
- Action segment 1 controls vertical movement. The values can be 0, 1, and 2, where “0” represents no movement, “1” movement in the negative direction, and “2” movement in the positive direction.
- Action segment 2 controls horizontal movement. The values can be 0, 1, and 2, where “0” represents no movement, “1” movement in the negative direction, and “2” movement in the positive direction.
- Action segment 3 provides the direction of rotational movement. The values can be 0, 1, and 2, where “0” represents no rotation, “1” clockwise rotation, and “2” anticlockwise rotation.
- Action segment 4 decides the attack; the agent may choose not to attack if the enemy is not in range. The values can be 0, 1, and 2, where “0” represents no attack, “1” a shooting action, and “2” a melee action.
- Action segment 5 controls the magnitude of rotation. The values can be 0, 1, 2, 3, and 4, where “0” represents 0 degrees, “1” 1 degree, “2” 2 degrees, “3” 4 degrees, and “4” 8 degrees.
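The five-segment encoding above can be decoded into concrete movement and attack parameters. A hypothetical Python decoder (the thesis implements this in Unity C#; names, sign conventions, and the returned structure here are assumptions for illustration):

```python
# Lookup tables mirroring the five action segments described above.
VERTICAL = {0: 0, 1: -1, 2: +1}
HORIZONTAL = {0: 0, 1: -1, 2: +1}
ROTATION_DIR = {0: 0, 1: +1, 2: -1}            # clockwise taken as positive (assumption)
ATTACK = {0: "none", 1: "shoot", 2: "melee"}
ROTATION_MAG = {0: 0, 1: 1, 2: 2, 3: 4, 4: 8}  # degrees

def decode_action(segments):
    """Map a 5-element discrete action vector [v, h, rot_dir, attack, rot_mag]
    to movement, rotation, and attack parameters."""
    v, h, rot_dir, attack, rot_mag = segments
    return {
        "move": (HORIZONTAL[h], VERTICAL[v]),
        "rotate_degrees": ROTATION_DIR[rot_dir] * ROTATION_MAG[rot_mag],
        "attack": ATTACK[attack],
    }
```

For example, the vector [2, 0, 1, 1, 4] decodes to forward movement, an 8-degree clockwise turn, and a shooting action.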
Reward Structure
The reward structure for the agent is as follows:
- Small positive reward for every time step survived.
- Positive reward for shooting, if the enemy is in gun range.
- Positive reward for damaging an enemy.
- Positive reward for killing an enemy.
- Positive reward for a melee attack, if the enemy is in melee range.
- Positive reward for item collection. Each item has a characteristic reward based on its usefulness in the scenario; for example, collecting a health item at full health yields no positive reward.
- Large positive reward for winning the game.
- Negative reward for shooting, if the enemy is not in gun range.
- Negative reward for a melee attack, if the enemy is not in melee range.
- Negative reward for receiving damage.
- Large negative reward on death.
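This reward structure can be condensed into a single per-event function. A minimal sketch: the thesis does not publish the exact reward magnitudes, so every number below is an assumption chosen only to respect the signs and relative sizes described above.

```python
def step_reward(event, enemy_in_range=False):
    """Return the reward for one game event. Magnitudes are illustrative
    placeholders; only their signs and relative ordering follow the text."""
    fixed_rewards = {
        "survive_step": 0.001,     # small positive reward per step survived
        "damage_enemy": 0.1,
        "kill_enemy": 0.5,
        "win": 1.0,                # large positive reward
        "receive_damage": -0.1,
        "death": -1.0,             # large negative reward
    }
    # Shooting and melee are rewarded only when the enemy is in range.
    if event in ("shoot", "melee"):
        return 0.05 if enemy_in_range else -0.05
    return fixed_rewards.get(event, 0.0)
```

Keeping the whole structure in one function makes the reward scaling discussed in the next section (e.g. multiplying the damage and kill rewards during training) a one-line change.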
Training Enhancement
Training can take a very large number of steps, and a long time, if the agent is exposed
directly to the raw optimization problem, especially when the reward received is sparse.
Complex RL problems are therefore usually broken down into smaller ones to aid learning:
the agent is first trained on an easier version of the problem, and the complexity is then
increased gradually. This approach is known as curriculum learning. The implementation of
the environment in this research is similar to curriculum learning but does not have a
distinct implementation for each sub-problem; instead, the agent is exposed to a different
set of problems sequentially within the very same episode. For instance, to learn the
behaviour “the agent should shoot at the enemy, and only at the enemy”, the following
events need to take place together:
- The enemy must be in the agent’s range. Since the enemy is also one of the self-play agents, the agents first need to learn to navigate the map, or the range condition will never be fulfilled.
- The agent must orient itself towards the enemy.
- The agent must select the shooting action.
- The agent must have ammo loaded in the gun.
These steps then need to repeat throughout training.
It can be observed that these events do not occur very frequently. To counter the
resulting sparse and infrequent rewards, and to speed up training, the following
measures have been implemented:
- Every time a new episode begins, the agents spawn in a location surrounded by four walls, one of which is destructible: it can be destroyed either by shooting it or by performing a melee attack, and it yields the same reward as damaging an enemy. The destructible wall is also perceived as an enemy by the agent, since the same identification tag is assigned to both enemies and the destructible wall. This initial setup greatly helps the agent understand that shooting the enemy is a favourable action. Figure 4.5.1 shows the agent spawn location with a green outline and the destructible wall with a black outline in one of the corners of the map.
- The zone naturally supports training: it shrinks after a definite interval of time, and agents take damage if they stay outside it. The zone therefore forces agents to move towards the centre, increasing the probability that favourable events for shooting and combat occur.
- Agents are rewarded for the duration of survival; the survival reward scales arithmetically with time, pushing the agents to survive longer. The win reward is adjusted such that agents prefer victory over mere survival duration. Furthermore, to make the agents more aggressive and encourage attacking strategies, the rewards for damaging and killing enemies are scaled up several fold during training.
- For observation collection, raycasting is used instead of CNNs, to reduce training time and model complexity.
- Apart from the destructible walls, dummies have been placed at multiple locations on the map to increase the frequency of favourable shooting events. Dummies yield the same reward as damaging an enemy, can be destroyed with a single hit, and are likewise identified as enemies by the agents. Figure 4.5.2 shows the locations of dummies and agents in the environment: dummies are outlined in yellow and agents in pink.
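The arithmetically scaled survival reward, and the dominance of the win reward over it, can be sketched as follows; all constants are assumed placeholder values, not the thesis's:

```python
# Sketch of the arithmetically scaled survival reward (assumed constants).
BASE_SURVIVAL = 0.002
SURVIVAL_GROWTH = 1e-5    # arithmetic increment per time step
WIN_REWARD = 10.0
MAX_EPISODE_STEPS = 1000

def survival_reward(step: int) -> float:
    """Per-step survival reward, growing linearly with time."""
    return BASE_SURVIVAL + SURVIVAL_GROWTH * step

# The win reward is tuned so that even surviving a full episode
# pays less than a single victory:
total_survival = sum(survival_reward(t) for t in range(MAX_EPISODE_STEPS))
# total_survival comes to roughly 6.995, below WIN_REWARD
```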
Figure 4.5.2 Showing agents and dummies
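The raycast-based observation mentioned above can be illustrated with a minimal 2D sketch. The actual implementation relies on Unity's ray perception sensing in C#; the standalone version below only illustrates the idea, and every name and parameter in it is an assumption:

```python
import math

# Hypothetical 2D sketch of raycast observation: cast a fan of rays from
# the agent and record, per ray, the normalized distance to the first hit.
# This yields a compact fixed-size vector, far cheaper than feeding camera
# frames through a CNN.
def ray_observations(agent_pos, agent_angle, obstacles, n_rays=9,
                     fov=90.0, max_dist=20.0, step=0.1):
    """Return one normalized hit distance in [0, 1] per ray.

    obstacles is a list of (x, y, radius) circles standing in for
    walls, enemies, and items."""
    obs = []
    for i in range(n_rays):
        # spread the rays evenly across the field of view
        offset = -fov / 2 + fov * i / (n_rays - 1)
        theta = math.radians(agent_angle + offset)
        dx, dy = math.cos(theta), math.sin(theta)
        dist = max_dist
        d = 0.0
        while d < max_dist:            # march along the ray
            x = agent_pos[0] + dx * d
            y = agent_pos[1] + dy * d
            if any(math.hypot(x - ox, y - oy) <= r for ox, oy, r in obstacles):
                dist = d
                break
            d += step
        obs.append(dist / max_dist)    # normalize for the network
    return obs
```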
CHAPTER 5
Performance Evaluation
This research focuses on four training algorithms: PPO, SAC, a PPO-GAIL hybrid, and a
SAC-GAIL hybrid. The agents are trained separately with each of these techniques, and
their performance is then evaluated on the same battleground over a series of 100
deathmatches. In each deathmatch, every agent (after training is complete) is randomly
spawned at one of the four default positions, and the agents then face off against each
other. Note that the agents were trained in isolation using self-play, each with its own
algorithm, for a predefined number of steps. The performance parameters considered are
as follows:
- Survival duration – A higher survival time indicates that the agent has favoured evasion strategies over attacking ones.
- Win ratio – Since winning is the primary objective of the game, a higher win ratio suggests that the AI learned the macro objectives firmly and can generalize well.
- Kills obtained per game – Killing is the most direct path to victory, but it also exposes the player and carries risk. High kills per game with a low win ratio would suggest that the model failed to learn the macro objective of how to win; high kills together with a high win ratio would suggest a well-generalized model with a good grasp of both macro and micro strategies, i.e. overall superior training.
- Items collected per game – This reflects micro strategy, which can be vital in the endgame. Its significance is somewhat lower than kills obtained, since exploring the environment also exposes the agent to risky situations, and the importance of item collection varies through the game; for example, health items are only worth collecting when current health is below maximum.
- Damage dealt per game – High damage combined with losses would suggest that the model is outmatched, or that the AI exposes itself to risky positions, such as frequently coming into the range of two or more agents at once: the agent has learned the micro strategy of shooting well but failed to learn the macro strategy of positioning itself on the battleground.
High damage combined with a high win ratio and low kills would suggest that the AI has
opted for an alternate strategy: damaging others by taking small risks and then waiting
for them to be killed by the zone or other players, while evading imminent danger itself.
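The performance parameters above can be aggregated from per-game logs along these lines; this is a minimal sketch, and the record format and field names are assumptions:

```python
# Sketch of aggregating the evaluation parameters above from a list of
# per-game records (field names are illustrative assumptions).
def performance_stats(games, agent):
    """Compute the per-agent performance parameters over a match series."""
    n = len(games)
    wins = sum(1 for g in games if g["winner"] == agent)
    return {
        "win_ratio": wins / n,
        "mean_survival": sum(g["survival"][agent] for g in games) / n,
        "kills_per_game": sum(g["kills"][agent] for g in games) / n,
        "items_per_game": sum(g["items"][agent] for g in games) / n,
        "damage_per_game": sum(g["damage"][agent] for g in games) / n,
    }
```

With 100 deathmatch records per algorithm, calling this once per trained agent would reproduce a table of the five parameters discussed above.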
Training Statistics
During training, the agents learn via self-play. Initially, they learn strategies for
survival, and the cumulative reward obtained improves with each iteration. Over time,
the agent learns to devise better strategies than the previous model; the faster the
agents can counter previously learned strategies, the lower the cumulative reward. Note
that cumulative reward here refers to the reward collected by all four competing agents
combined. It can be observed from Figure 5.2.1 that, over the period of training, PPO
achieved the highest cumulative reward.
Elo rating is a way of measuring the skill level of competing agents in self-play. An
agent with a higher Elo rating is likely to win more often than an agent with a lower
one; however, since training was performed in isolation per algorithm using self-play,
the ratings are not directly comparable across algorithms. The training Elo ratings are
presented in Figure 5.2.2.
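The Elo mechanism can be summarized with the standard update rule. This is a generic sketch: the K-factor below is a common default and an assumption, not the value used in training:

```python
# Standard Elo update used to track relative skill during self-play.
# K controls rating volatility (K = 32 is a common assumed default).
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

With equal starting ratings, a win transfers half of K (here 16 points) from loser to winner; upsets against higher-rated opponents transfer more, which is why Elo is only meaningful within one self-play population.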
Entropy in RL refers to the degree of uncertainty in the agent’s current policy about
which actions yield the highest cumulative reward; lower entropy indicates a more
decided policy. In this research, training via SAC achieved the lowest entropy, as
presented in Figure 5.2.3. This can be attributed to the nature of the algorithm: SAC
explicitly optimizes and regulates policy entropy through its maximum-entropy
framework, so it is likely to settle at a lower entropy than PPO. The same pattern can
be observed with the GAIL hybrids of PPO and SAC.
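For a discrete action distribution, the entropy being tracked can be computed as follows; this is a generic illustration of the quantity, not the training code:

```python
import math

# Entropy of a discrete policy's action distribution: high when the policy
# is still undecided (near uniform), low once it commits to specific actions.
def policy_entropy(probs):
    """Shannon entropy H(p) = -sum p * ln(p) over the action probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # undecided policy, maximal entropy
peaked = [0.97, 0.01, 0.01, 0.01]    # confident policy, low entropy
assert policy_entropy(peaked) < policy_entropy(uniform)
```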
As the agents are trained via self-play and the policy improves, they are likely to
learn to kill each other at a faster rate; this is reflected in a shortening episode
length. From Figure 5.2.4, it can be observed that training via PPO leads to the
largest reduction in episode length. This possibly signifies a more aggressive policy
learnt by the PPO agents, who may have started preferring direct combat over a
strategic approach.
Performance Statistics
It can be observed that the training statistics provide no reliable means of identifying
how well each algorithm worked, especially since the agents were trained with the
different techniques in isolation using self-play. To assess the quality of the learned
policies, the agents trained with the different techniques (PPO, SAC, PPO-GAIL, and
SAC-GAIL) were set against each other in a series of 100 deathmatches with randomized
spawn positions and equal starting resources; the observed performance characteristics
are presented in Table 5.3.1.
From Table 5.3.1, it can be observed that the agent trained with the PPO-GAIL hybrid
achieved the best outcome on almost all of the performance characteristics. The same
results are also shown graphically in Figure 5.3.1.
Figure 5.3.1 Per-algorithm comparison charts of mean survival duration, win ratio, and
the other performance parameters for PPO, SAC, PPO-GAIL, and SAC-GAIL
Result Summary
It was found that the PPO-GAIL hybrid agent achieved the highest number of wins in the
deathmatch validation. Beyond wins, the PPO-GAIL agent showed unmatched performance on
three further parameters (kills, damage dealt, and items collected) and performed very
close to the SAC-GAIL agent in survival duration. The agents using the vanilla
implementations of PPO and SAC fell far below their GAIL hybrids. On closer inspection,
PPO opted for a more aggressive policy than SAC, given its higher damage and kill
counts, whereas SAC collected more items and survived longer, suggesting that SAC opted
for a more evasive strategy.
CHAPTER 6
Introduction
This research introduces a novel game environment on which the algorithms PPO, SAC,
PPO-GAIL, and SAC-GAIL were evaluated. From the results, it can be inferred that, for
this research, the training techniques performed in the following decreasing order:
PPO-GAIL, SAC-GAIL, SAC, and PPO. The following sections cover the interpretation of
the results and the findings.
Contribution
This research introduces a novel environment for RL research, created using the Unity game
engine. This environment can be used for validation of new model architectures and evaluation
of other algorithms.
The major findings and the answers to the research questions are given below:
- Which of the two methods, PPO (on-policy) and SAC (off-policy), is more suitable for training RL agents in a complex environment?
Answer – For training agents in a complex environment with discrete observation and action sets, PPO works better than SAC.
- Can imitation learning enhance the training quality and time for RL agents in complex scenarios, or does it rather prevent the agent from fully exploring the action space and degrade its performance?
Answer – Imitation learning, when used with a small number of training steps (<10M), especially when demonstrations are available in ample quantity, can lead to a very significant boost in training performance.
- Does imitation learning (GAIL) affect PPO and SAC differently, and if so, what are the implications?
Answer – GAIL has a positive effect on both SAC- and PPO-based training, though the magnitude of the impact is higher in the case of PPO than SAC.
Future Recommendations
As the research was performed with a relatively small number of steps (<10M) and a
simple neural network model, the following could be considered within the scope of more
elaborate future research:
- Verify the results with the number of training steps increased to 10M–100M and beyond 100M.
- Craft a mathematical model for the level of complexity, considering the observation set and the action set.
- Correlate the number of demonstrations required for GAIL with model complexity.
- Identify agent performance when a recurrent network is used with PPO and SAC.
- Identify the impact of GAIL when an LSTM network is used in conjunction and the model is trained via PPO and SAC.
REFERENCES
Abdolmaleki, A., Springenberg, J.T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M.,
(2018) Maximum a Posteriori Policy Optimisation. 6th International Conference on Learning
Representations, ICLR 2018 - Conference Track Proceedings. [online] Available at:
http://arxiv.org/abs/1806.06920 [Accessed 7 Aug. 2020].
Alagoz, O., Hsu, H., Schaefer, A.J. and Roberts, M.S., (n.d.) Markov Decision Processes: A
Tool for Sequential Decision Making under Uncertainty.
Andersen, P.A., Goodwin, M. and Granmo, O.C., (2018) Deep RTS: A Game Environment
for Deep Reinforcement Learning in Real-Time Strategy Games. IEEE Conference on
Computational Intelligence and Games, CIG, 2018-August, pp.1–8.
Anon (2020) Soft Actor Critic—Deep Reinforcement Learning with Real-World Robots – The
Berkeley Artificial Intelligence Research Blog. [online] Available at:
https://bair.berkeley.edu/blog/2018/12/14/sac/ [Accessed 11 Jun. 2020].
Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., (2017) Deep
reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), pp.26–38.
Barto, A.G., Sutton, R.S. and Anderson, C.W., (1983) Neuronlike Adaptive Elements That
Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man and
Cybernetics, SMC-13(5), pp.834–846.
Baxter, J., Tridgell, A. and Weaver, L., (2000) Learning to play chess using temporal
differences. Machine Learning, 40(3), pp.243–263.
Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M., (2012) The Arcade Learning
Environment: An Evaluation Platform for General Agents. IJCAI International Joint
Conference on Artificial Intelligence, [online] 2015-January, pp.4148–4152. Available at:
http://arxiv.org/abs/1207.4708 [Accessed 15 Jun. 2020].
Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M., (n.d.) The Arcade Learning
Environment: An Evaluation Platform For General Agents (Extended Abstract) *. [online]
Available at: http://arcadelearningenvironment.org [Accessed 14 Jun. 2020].
Chang, Y., Erera, A.L. and White, C.C., (2015) Value of information for a leader–follower
partially observed Markov game. Annals of Operations Research, 235(1), pp.129–153.
Crites, R.H. and Barto, A.G., (1994) An Actor/Critic Algorithm that is Equivalent to
Q-Learning.
Deane, (2018) Bullet Physics For Unity | Physics | Unity Asset Store. [online] Available at:
https://assetstore.unity.com/packages/tools/physics/bullet-physics-for-unity-62991 [Accessed
15 Jun. 2020].
Degris, T., White, M. and Sutton, R.S., (2012) Off-Policy Actor-Critic.
Deisenroth, M.P., Neumann, G. and Peters, J., (2011) A Survey on Policy Search for
Robotics. Foundations and Trends in Robotics, 2(1–2), pp.1–142.
Fujimoto, S., van Hoof, H. and Meger, D., (2018) Addressing Function Approximation Error
in Actor-Critic Methods. 35th International Conference on Machine Learning, ICML 2018,
[online] 4, pp.2587–2601. Available at: http://arxiv.org/abs/1802.09477 [Accessed 7 Aug.
2020].
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H.,
Gupta, A., Abbeel, P. and Levine, S., (2018) Soft Actor-Critic Algorithms and Applications.
van Hasselt, H., Guez, A. and Silver, D., (2015) Deep Reinforcement Learning with Double
Q-learning. 30th AAAI Conference on Artificial Intelligence, AAAI 2016, [online] pp.2094–
2100. Available at: http://arxiv.org/abs/1509.06461 [Accessed 10 May 2020].
Hosu, I.-A. and Rebedea, T., (2016) Playing Atari Games with Deep Reinforcement Learning
and Human Checkpoint Replay. [online] Available at: http://arxiv.org/abs/1607.05077
[Accessed 10 May 2020].
Hu, H. and Foerster, J.N., (2019) Simplified Action Decoder for Deep Multi-Agent
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1912.02288 [Accessed 10
May 2020].
Juliani, A., Henry, H. and Lange, D., (2018) Unity: A General Platform for Intelligent
Agents. pp.1–18.
Kempka, M., Wydmuch, M., Runc, G., Toczek, J. and Jaskowski, W., (2016) ViZDoom: A
Doom-based AI research platform for visual reinforcement learning. In: IEEE Conference on
Computational Intelligence and Games, CIG. IEEE Computer Society.
Kober, J., Bagnell, J.A. and Peters, J., (2013) Reinforcement learning in robotics: A survey.
International Journal of Robotics Research, 32(11), pp.1238–1274.
Kober, J. and Peters, J., (n.d.) Policy Search for Motor Primitives in Robotics.
Kolbe, M., Reuss, P., Schoenborn, J.M. and Althoff, K.D., (2019) Conceptualization and
implementation of a reinforcement learning approach using a case-based reasoning agent in a
FPS scenario. CEUR Workshop Proceedings, 2454.
Kormushev, P., Calinon, S. and Caldwell, D.G., (2013) Reinforcement Learning in Robotics:
Applications and Real-World Challenges †. Robotics, [online] 2, pp.122–148. Available at:
www.mdpi.com/journal/roboticsArticle [Accessed 7 Jun. 2020].
Lample, G. and Chaplot, D.S., (2017) Playing FPS games with deep reinforcement learning.
31st AAAI Conference on Artificial Intelligence, AAAI 2017, 2015, pp.2140–2146.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra,
D., (2016) Continuous control with deep reinforcement learning. In: 4th International
Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra,
D., (n.d.) CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING. [online]
Available at: https://goo.gl/J4PIAz [Accessed 14 Jun. 2020].
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D. and
Kavukcuoglu, K., (2016) Asynchronous Methods for Deep Reinforcement Learning. 33rd
International Conference on Machine Learning, ICML 2016, [online] 4, pp.2850–2869.
Available at: http://arxiv.org/abs/1602.01783 [Accessed 10 May 2020].
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and
Riedmiller, M., (2013) Playing Atari with Deep Reinforcement Learning. [online] Available
at: http://arxiv.org/abs/1312.5602 [Accessed 9 May 2020].
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A.,
Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A.,
Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. and Hassabis, D., (2015)
Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529–533.
OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi,
D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J.,
Petrov, M., Pinto, H.P. de O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S.,
Sutskever, I., Tang, J., Wolski, F. and Zhang, S., (2019) Dota 2 with Large Scale Deep
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1912.06680 [Accessed 10
May 2020].
Samsuden, M.A., Diah, N.M. and Rahman, N.A., (2019) A review paper on implementing
reinforcement learning technique in optimising games performance. In: 2019 IEEE 9th
International Conference on System Engineering and Technology, ICSET 2019 - Proceeding.
Institute of Electrical and Electronics Engineers Inc., pp.258–263.
Schulman, J., Levine, S., Moritz, P., Jordan, M.I. and Abbeel, P., (2015) Trust Region Policy
Optimization. 32nd International Conference on Machine Learning, ICML 2015, [online] 3,
pp.1889–1897. Available at: http://arxiv.org/abs/1502.05477 [Accessed 3 Aug. 2020].
Schulman, J., Moritz, P., Levine, S., Jordan, M.I. and Abbeel, P., (2016)
High-Dimensional Continuous Control Using Generalized Advantage Estimation. pp.1–14.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017a) Proximal Policy
Optimization Algorithms. pp.1–12.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017b) Proximal Policy
Optimization Algorithms. [online] Available at: http://arxiv.org/abs/1707.06347 [Accessed 11
Jun. 2020].
Shao, K., Zhao, D., Li, N. and Zhu, Y., (2018a) Learning Battles in ViZDoom via Deep
Reinforcement Learning. In: IEEE Conference on Computational Intelligence and Games,
CIG. IEEE Computer Society.
Shao, K., Zhu, Y. and Zhao, D., (2018b) StarCraft Micromanagement with Reinforcement
Learning and Curriculum Transfer Learning. IEEE Transactions on Emerging Topics in
Computational Intelligence, [online] 3(1), pp.73–84. Available at:
http://arxiv.org/abs/1804.00810 [Accessed 10 May 2020].
Silver, D., Heess, N., Degris, T., Wierstra, D. and Riedmiller, M., (2014) Deterministic Policy
Gradient Algorithms.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G.,
Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D.,
Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel,
T. and Hassabis, D., (2016) Mastering the game of Go with deep neural networks and tree
search. Nature, 529(7587), pp.484–489.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre,
L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. and Hassabis, D., (2018) A general
reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science,
362(6419), pp.1140–1144.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T.,
Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche,
G., Graepel, T. and Hassabis, D., (2017) Mastering the game of Go without human
knowledge. Nature, 550(7676), pp.354–359.
Sutton, R.S., (1988) Learning to Predict by the Methods of Temporal Differences. Machine
Learning, 3(1), pp.9–44.
Sutton, R.S. and Barto, A.G., (n.d.) Reinforcement Learning: An Introduction Second edition,
in progress.
Tesauro, G., (1994) TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-
Level Play. Neural Computation, 6(2), pp.215–219.
Todorov, E., (2018) MuJoCo Unity Plugin. [online] Available at:
http://www.mujoco.org/book/unity.html [Accessed 15 Jun. 2020].
Todorov, E., Erez, T. and Tassa, Y., (2012) MuJoCo: A physics engine for model-based
control. In: IEEE International Conference on Intelligent Robots and Systems. pp.5026–5033.
Tsitsiklis, J.N. and Van Roy, B., (1997) An Analysis of Temporal-Difference Learning
with Function Approximation. IEEE Transactions on Automatic Control.
Utocurricula, A. and Powell, G., (2020) Emergent Tool Use from Multi-Agent
Autocurricula.
Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi,
D.H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I.,
Huang, A., Sifre, L., Cai, T., Agapiou, J.P., Jaderberg, M., Vezhnevets, A.S., Leblond, R.,
Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T.L., Gulcehre, C., Wang,
Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul,
T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C. and Silver, D., (2019) Grandmaster
level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), pp.350–354.
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani,
A., Küttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan,
K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T., Calderone, K., Keet, P., Brunasso, A.,
Lawrence, D., Ekermo, A., Repp, J. and Tsing, R., (2017) StarCraft II: A New Challenge for
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1708.04782 [Accessed 10
May 2020].
Wu, Y. and Tian, Y., (2017) Training Agent for First-Person Shooter Game with
Actor-Critic Curriculum Learning. [online] Available at:
http://vizdoom.cs.put.edu.pl/competition-cig-2016/results [Accessed 9 May 2020].
APPENDIX A: RESEARCH PROPOSAL
PRASHANT PANDEY
MSC IN MACHINE LEARNING AND AI
Research title
Co-operative shooter AI for a battle royal game using deep reinforcement learning
ABSTRACT
Building multi-agent interactions and long-term goal planning are among the key
challenges in the field of AI and robotics. Long-term goal planning can also be seen as
the challenge of deciphering very long sequences with numerous possibilities. Overcoming
these challenges can help us build complex AI behaviors and interactions.
TABLE OF CONTENTS
1. Introduction
2. Background and related research
3. Research Questions
4. Aim and Objectives
5. Research Methodology
6. Expected Outcomes
7. Requirements / resources
8. Research Plan
9. References
1. INTRODUCTION
Deep reinforcement learning has caused a breakthrough in the field of AI research. The
models or agents built in computer games or simulations reveal interesting techniques
that are being utilized in building complex robotics, in automating power systems
through load prediction, in inventory management, and in predicting stock market prices
and the associated risks of buying and selling (Deep Reinforcement Learning and Its
Applications - Inteliment Technologies, 2020). It can also be used to build prediction
models for phenomena governed by multiple factors; one such example would be weather
prediction. This research targets building a co-operative AI for a shooter battle royal
game, involving multi-agent interaction and built using curriculum training. Compared to
single-agent training, learning co-operative behavior poses a different challenge: the
agents need to build an understanding of the final goal and converge their actions in
such a way that each of them contributes towards achieving it, while the actions taken
should in turn create favorable conditions for all other agents on the same team.
The increase in computational power brought a revolution that enabled the use of neural
networks in multiple domains where the state-of-the-art models had been limited; one
such area is reinforcement learning. With the aid of neural networks, the capabilities
of reinforcement learning were greatly enhanced, and this amalgamation was named deep
reinforcement learning. Researchers have produced many different kinds of environments
for deep reinforcement learning, such as ViZDoom (Wydmuch et al., 2018), StarCraft
(Vinyals et al., 2019), CTF, a modified version of Quake 3 by DeepMind (Jaderberg et
al., 2019), multi-agent autocurricula (Utocurricula and Powell, 2020), and Unity 3D
(Juliani et al., 2018). Some of these environments target very specific research areas,
whereas others, like Unity 3D, provide the user with a complete game engine for building
their own research environment. Notable work has been done by OpenAI (OpenAI et al.,
2019) in multi-agent interaction by building OpenAI Five for playing Dota 2; OpenAI
trained their agents using proximal policy optimization (Schulman et al., 2017). OpenAI
Five was the first computer program to defeat an esport world champion. DeepMind has
also done significant work in multi-agent interaction: they created a capture-the-flag
variant of Quake 3 Arena and demonstrated that multi-agent interaction can be fine-tuned
by using a model architecture based on a recurrent latent variable model (Chung et al.,
2015). In this architecture, they utilized two RNNs, one immediately fed the past states
and the other fed a series of past states after a delay, for the evaluation of actions
and rewards. Before DeepMind there were multiple successful attempts at combining RNNs
with deep reinforcement learning. OpenAI's multi-agent autocurricula work (Utocurricula
and Powell, 2020) also utilizes proximal policy optimization and Generalized Advantage
Estimation (Schulman et al., 2016) for training the agents; they demonstrated that new
coordinating behaviors emerge with training. These studies on computer games and
simulations are crucial for investigating and understanding techniques for building
multi-agent co-operative interaction in environments with a large number of unknowns,
or in partially observable environments, that involve long-term objective planning while
the agent must simultaneously perform immediate actions that converge towards the
long-term goal.
Can reinforcement learning be used for the creation of co-operative shooter AI for a battle royal
game?
The main aim of this research is to propose a possible solution for engineering AI with
co-operative behavior using reinforcement learning, capable of understanding macro and
micro strategies in a battle-royal game environment. This study can later be exploited
in the implementation of co-operative behavior and swarm intelligence, which are of
great use in advanced robotics. The research also provides a comparison between
curriculum-based learning and vanilla proximal-policy-based learning for building
co-operative behavior.
The research objectives, formulated based on the aim of this study, are as follows:
To compare the results of hybrid imitation-curriculum-based training with the results
obtained by agents trained directly through proximal policy optimization
5. RESEARCH METHODOLOGY
The training and evaluation of agents will be done in a 3D environment, where agents can
move freely and attack or kill each other either with melee attacks (attacks that can
only be executed when agents are sufficiently close) or with weapons collected from the
3D environment. Agents are supposed to work in teams of 3, and the last standing team,
or remaining part of a team, will be deemed the winner of the session/game. Performance
statistics such as the number of kills and resources collected will also be recorded and
will form part of the reward structure. These statistics will be used for the analysis
of the induced behavior and for curriculum planning and improvement.
The environment will be created within Unity 3D; the platform offers an implementation
of state-of-the-art deep reinforcement learning via proximal-policy-based learning for
ML-Agents (Juliani et al., 2018). ML-Agents is a toolkit offered by Unity for creating
and training reinforcement-learning-based agents. The agent will access the internal
game state along with images through a camera sensor component to perceive the world
around it. Every agent will have a periphery of vision beyond which it cannot detect
enemies; a smaller periphery of vision allows faster training on smaller battlegrounds
or maps.
Once the environment is in place, the next step is to identify and understand the
clauses of the environment in order to formulate a reward structure to be used for
training the ML agents. Since a battle royal game has multiple objectives, such as
resource collection and tactical positioning, the agent needs to be trained using
curriculum learning and imitation learning, both to speed up training and to ensure that
the agent is exposed to scenarios that promote team-play over solo-play behavior. In the
initial stages, the agent will be trained using imitation learning to pick up some of
the micro strategies. Breaking the objectives down into hierarchies of micro and macro
strategies (Zhang et al., 2018) can be more effective than vanilla proximal-policy-based
learning. This will be followed by training the agent stepwise, exposing it to
increasingly complex or larger subsets of the original environment to learn the macro
strategy of the game (ml-agents/ML-Agents-Overview.md at master ·
Unity-Technologies/ml-agents, 2020). Once the AI has learned some basic micro and macro
strategies, the agents will be exposed to an environment where the only possible way of
winning is through teamwork. This is ensured by pitting the already trained AI, equipped
with basic strategies, against a hardcoded AI that has an advantage in health, health
regeneration, weapons, and vision range. Because of these advantages, the hardcoded AI
can be termed a “cheater”, and it is not possible to defeat the cheater without tactical
team-play. Once the team of trained AI is capable of taking out one cheater AI, the team
will be trained in an environment with more cheaters and more teams of trained AI.
Another set of agents, without any curriculum, will be trained by directly exposing the
team agents to other teams and a number of cheater AIs. Once trained, these two sets of
agents shall meet each other on the battleground. Results such as win ratio, kill-death
ratio, and resources acquired will be obtained and compared for the curriculum-trained
AI and the AI trained without a curriculum.
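The stepwise curriculum described above could be driven by simple lesson-advancement logic like the following. This is a hypothetical sketch: the lesson parameters and thresholds are assumptions, and Unity ML-Agents would normally express such a schedule declaratively in its trainer configuration rather than in code:

```python
# Hypothetical sketch of the curriculum progression described above:
# advance to the next lesson once the team's recent win rate against the
# current setup crosses a threshold. All numbers are assumptions.
LESSONS = [
    {"cheaters": 1, "teams": 1, "advance_at_win_rate": 0.6},
    {"cheaters": 2, "teams": 2, "advance_at_win_rate": 0.6},
    {"cheaters": 3, "teams": 3, "advance_at_win_rate": None},  # final lesson
]

def next_lesson(lesson_idx: int, recent_win_rate: float) -> int:
    """Return the lesson index for the next training phase."""
    threshold = LESSONS[lesson_idx]["advance_at_win_rate"]
    if threshold is not None and recent_win_rate >= threshold:
        return min(lesson_idx + 1, len(LESSONS) - 1)
    return lesson_idx
```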
6. EXPECTED OUTCOMES
It is expected that the AI agents will start working in a team to beat the hardcoded AI with
health and resources advantage. Thus, the research should be able to infer that co-operative
behavior can be induced by using reinforcement learning.
The AI trained with the curriculum should achieve better results than the AI trained with vanilla
proximal policy optimization.
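The comparison between the two training regimes could be scored with a small evaluation helper. The sketch below assumes per-match outcomes are logged as simple records; all names here (`MatchResult`, `summarize`) are hypothetical, not part of the actual experiment code:

```python
from dataclasses import dataclass

@dataclass
class MatchResult:
    won: bool       # did the team win the match?
    kills: int      # total kills scored by the team
    deaths: int     # total deaths suffered by the team
    resources: int  # resources acquired during the match

def summarize(results):
    """Aggregate win ratio, kill/death ratio, and mean resources over matches."""
    wins = sum(r.won for r in results)
    kills = sum(r.kills for r in results)
    deaths = sum(r.deaths for r in results)
    return {
        "win_ratio": wins / len(results),
        "kd_ratio": kills / max(deaths, 1),  # guard against division by zero
        "mean_resources": sum(r.resources for r in results) / len(results),
    }

# Compare curriculum-trained agents against agents trained without a curriculum
curriculum = [MatchResult(True, 10, 4, 120), MatchResult(False, 6, 8, 90)]
vanilla = [MatchResult(False, 3, 9, 60), MatchResult(False, 5, 7, 70)]
print(summarize(curriculum))
print(summarize(vanilla))
```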
7. REQUIREMENTS / RESOURCES
- An external GPU might be required at a later stage of the research (to be facilitated by
UpGrad)
8. RESEARCH PLAN
Table (1)
9. REFERENCES
Anon (2020) Deep Reinforcement Learning and Its Applications - Inteliment Technologies. [online]
Available at: https://www.inteliment.com/blog/our-thinking/deep-reinforcement-learning-and-its-
applications/ [Accessed 16 Mar. 2020].
Chung, J., Kastner, K., Dinh, L. and Goel, K., (2015) A Recurrent Latent Variable Model for
Sequential Data. arXiv:1506.02216v3 [cs.LG]. pp.1–9.
Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Castañeda, A.G., Beattie, C.,
Rabinowitz, N.C., Morcos, A.S., Ruderman, A., Sonnerat, N., Green, T., Deason, L., Leibo, J.Z.,
Silver, D., Hassabis, D., Kavukcuoglu, K. and Graepel, T., (2019) Human-level performance in 3D
multiplayer games with population-based reinforcement learning. Science, 364(6443), pp.859–865.
Juliani, A., Henry, H. and Lange, D., (2018) Unity: A General Platform for Intelligent Agents.
pp.1–18.
OpenAI: Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D.,
Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M.,
Pinto, H.P. de O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang,
J., Wolski, F. and Zhang, S., (2019) Dota 2 with Large Scale Deep Reinforcement Learning. [online]
Available at: http://arxiv.org/abs/1912.06680.
Schulman, J., Moritz, P., Levine, S., Jordan, M.I. and Abbeel, P., (2016) High-Dimensional
Continuous Control Using Generalized Advantage Estimation. pp.1–14.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017) Proximal Policy
Optimization Algorithms. pp.1–12.
Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B. and Mordatch, I., (2020)
Emergent Tool Use from Multi-Agent Autocurricula.
Vinyals, O., Vezhnevets, A.S. and Silver, D., (2019) StarCraft II: A New Challenge for
Reinforcement Learning.
Wydmuch, M., Kempka, M. and Jaskowski, W., (2018) ViZDoom Competitions: Playing Doom
from Pixels. IEEE Transactions on Games, 11(3), pp.248–259.
Zhang, Z., Li, H., Zhang, L., Zheng, T., Zhang, T., Hao, X., Chen, X., Chen, M., Xiao, F. and Zhou,
W., (2018) Hierarchical Reinforcement Learning for Multi-agent MOBA Game.
APPENDIX B: TRAINING CONFIGURATION
PPO Configuration
behaviors:
  Shooter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
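For reference, the `epsilon: 0.2` hyperparameter above is the clipping range of PPO's surrogate objective (Schulman et al., 2017). A minimal sketch of the per-sample clipped term, written here in plain Python as an illustration rather than the trainer's actual implementation:

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio     -- pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage -- advantage estimate for the sampled action
    """
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, gains from ratios above 1 + epsilon are clipped:
print(ppo_clipped_term(1.5, 2.0))   # 1.2 * 2.0 = 2.4
# With a negative advantage, the min keeps the worse (more negative) term:
print(ppo_clipped_term(0.5, -1.0))  # min(-0.5, -0.8) = -0.8
```

The clipping keeps each policy update close to the data-collecting policy, which is what makes the large `buffer_size` reusable across the `num_epoch: 3` passes.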
PPO-GAIL Configuration
behaviors:
  Shooter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 64
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      gail:
        gamma: 0.99
        strength: 0.1
        encoding_size: 128
        learning_rate: 0.0003
        use_actions: false
        use_vail: false
        demo_path: E:/Users/prash/OneDrive/Documents/ml_shooterproject/Demos/ShooterX.demo
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
SAC Configuration
behaviors:
  Shooter:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 256
      buffer_size: 500000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 10.0
      save_replay_buffer: false
      init_entcoef: 0.05
      reward_signal_steps_per_update: 10.0
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
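The `tau: 0.005` hyperparameter above controls SAC's soft target-network update, where the target parameters slowly track the online critic. An illustrative sketch of that update rule (not the trainer's actual code), using plain lists in place of network weight tensors:

```python
def soft_update(target_params, online_params, tau=0.005):
    """SAC soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]

# After one update the target moves only a small fraction toward the online net:
print(soft_update([0.0, 1.0], [1.0, 0.0]))  # [0.005, 0.995]
```

The small `tau` keeps the bootstrapping targets stable, which is why SAC can reuse a large replay `buffer_size` of 500000 transitions.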
SAC-GAIL configuration
behaviors:
  Shooter:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 256
      buffer_size: 500000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 10.0
      save_replay_buffer: false
      init_entcoef: 0.05
      reward_signal_steps_per_update: 10.0
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      gail:
        gamma: 0.99
        strength: 0.1
        encoding_size: 128
        learning_rate: 0.0003
        use_actions: false
        use_vail: false
        demo_path: E:/Users/prash/OneDrive/Documents/ml_shooterproject/Demos/ShooterX.demo
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
APPENDIX C: PROJECT DETAILS