
AI FOR BATTLE ROYALE SHOOTER GAME USING DEEP REINFORCEMENT

LEARNING

PRASHANT PANDEY

A THESIS SUBMITTED IN FULFILMENT OF THE


REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
MASTER OF SCIENCE IN ML AND AI

LIVERPOOL JOHN MOORES UNIVERSITY (LJMU)

AUGUST 2020
TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................................................................................ i
ABSTRACT ............................................................................................................................... ii
LIST OF TABLES ....................................................................................................................iii
LIST OF FIGURES .................................................................................................................. iv
LIST OF ABBREVIATIONS .................................................................................................... v
CHAPTER 1: INTRODUCTION .............................................................................................. 1
Background ................................................................................................................. 1
Objectives .................................................................................................................... 2
Research Questions ..................................................................................................... 2
Scope of the Study....................................................................................................... 2
Significance of the Study ............................................................................................ 3
Structure of the Study.................................................................................................. 3

CHAPTER 2: LITERATURE REVIEW ................................................................................... 5


Introduction ................................................................................................................. 5
Classical Reinforcement Learning .............................................................................. 5
2.2.1 Concepts and Terminologies................................................................................ 5
2.2.2 Monte Carlo Methods .......................................................................................... 9
2.2.3 Temporal Difference Methods ............................................................................. 9
2.2.4 Policy Search Methods ...................................................................................... 10
2.2.5 Actor-Critic Methods ......................................................................................... 11
2.2.6 Challenges in reinforcement learning ................................................................ 11
Deep Reinforcement Learning .................................................................................. 12
2.3.1 Deep Q-Learning ............................................................................................... 12
2.3.2 Deep Deterministic Policy Gradient .................................................................. 13
2.3.3 Trust Region Policy Optimization ..................................................................... 13
2.3.4 Proximal Policy Optimization............................................................................ 13
2.3.5 Soft Actor-Critic ................................................................................................ 14
Related Work on Games ........................................................................................... 15
2.4.1 Deep Reinforcement Learning in 2D Games ..................................................... 15
2.4.2 Deep Reinforcement Learning in Board Games ................................................ 16
2.4.3 Deep Reinforcement Learning in FPS Games ................................................... 17
2.4.4 Deep Reinforcement Learning in Strategy Games ............................................ 18

CHAPTER 3: RESEARCH METHODOLOGY ....................................................... 20


Introduction ............................................................................................................... 20
Selected Platform – Unity Game Engine .................................................................. 20
3.2.1 Environment Properties ..................................................................................... 21
3.2.2 ML-agents SDK ................................................................................................. 22
Algorithms and Supporting Techniques .................................................................... 23
3.3.1 Generative Adversarial Imitation Learning ...................................................... 23
3.3.2 Proximal Policy Optimization............................................................................ 24
3.3.3 Soft Actor-Critic ................................................................................................ 24
3.3.4 Self-Play............................................................................................................. 25
3.3.5 Raycast and Ray Perception Sensor ................................................................... 25

CHAPTER 4: IMPLEMENTATION ...................................................................................... 26


Description of the Game Environment...................................................................... 26
Agent State and Observations ................................................................................... 27
Action Set .................................................................................................................. 28
Reward Structure ....................................................................................................... 28
Training Enhancement .............................................................................................. 29

CHAPTER 5: RESULTS AND EVALUATION .................................................................... 32


Performance Evaluation ............................................................................................ 32
Training Statistics ...................................................................................................... 33
Performance Statistics ............................................................................................... 35
Result Summary ........................................................................................................ 36

CHAPTER 6: CONCLUSIONS AND RECOMMENDATIONS ........................................... 37


Introduction ............................................................................................................... 37
Discussion and Conclusion ....................................................................................... 37
Contribution ............................................................................................................. 38
Future Recommendation ........................................................................................... 38
REFERENCES ........................................................................................................................ 39
APPENDIX A: RESEARCH PROPOSAL ............................................................................. 43
APPENDIX B: TRAINING CONFIGURATION ................................................................... 52
APPENDIX C: PROJECT DETAILS ..................................................................................... 56
ACKNOWLEDGEMENTS

I would like to thank my thesis mentor, Dr Manoj Jayabalan, for his guidance, many fruitful discussions, and for always making time when I needed advice. I would also like to thank my supervisor, Dr Suvajit Mukopadhyay, and my student mentor, Rashmi Soniminide, for all the support they offered.

ABSTRACT

Building multi-agent interactions and long-term goal planning are among the key challenges in the field of AI and robotics. Long-term goal planning can also be viewed as the challenge of deciphering very long action sequences with numerous possible outcomes. Overcoming these challenges can help us build complex AI behaviours and interactions.

This research applies modern reinforcement learning techniques to train agents in a 3D environment created within Unity to simulate a battle royale shooter game. The contribution of this thesis is twofold: first, it evaluates the significance of imitation learning for a complex environment. Second, it compares the performance of two techniques, soft actor-critic and proximal policy optimization, for long-term goal planning, and examines the impact of imitation learning on both of them.

LIST OF TABLES

Table 5.3.1 Performance characteristics ……………………………………………………36

LIST OF FIGURES

Figure 2.3.5 PPO algorithm ..................................................................................................... 14


Figure 2.3.6 SAC Algorithm ................................................................................................... 15
Figure 3.2.2 Academy to agent flow ....................................................................................... 23
Figure 3.3.1 GAIL information flow ....................................................................................... 25
Figure 4.1 Collectable items ................................................................................................. 28
Figure 4.2 Showing Raycast from an agent .......................................................................... 28
Figure 4.5.1 Agent spawn location and destructible wall ....................................................... 31
Figure 4.5.2 Showing agents and dummies ............................................................................. 32
Figure 5.2.1 Training Statistics: Cumulative reward ............................................................... 34
Figure 5.2.2 Training Statistics: Elo Rating ............................................................................ 35
Figure 5.2.3 Training Statistics: Entropy ................................................................................ 35
Figure 5.2.4 Training Statistics: Episode Length .................................................................... 36
Figure 5.3.1 Performance characteristics ................................................................................ 37

LIST OF ABBREVIATIONS

CNN………...Convolutional Neural Network


DDPG……….Deep Deterministic Policy Gradient
DL…………...Deep Learning
DNN………...Deep Neural Network
DPG…………Deterministic Policy Gradient
DQN………...Deep Q-Network
DRL…………Deep Reinforcement Learning
GAIL………..Generative Adversarial Imitation Learning
GPI………….Generalized Policy Iteration
LSTM……….Long Short-Term Memory
M….……...…Million
MDP……...…Markov Decision Process
PPO…………Proximal Policy Optimization
RL…………...Reinforcement Learning
SAC...……….Soft Actor-Critic
SARSA...……State Action Reward State Action
TD..…………Temporal Difference
TRPO..……...Trust Region Policy Optimization

CHAPTER 1

INTRODUCTION

Background
In recent years, an increase in computational power enabled a revolution in applying neural networks to domains where the state of the art had previously been limited; one such area is reinforcement learning. The advancement in reinforcement learning began with the amalgamation of neural networks and the classical reinforcement learning domain, when DeepMind (Mnih et al., 2013) showed how an agent can learn to play Atari Breakout from raw pixels without any prior knowledge of the domain except the game rules.
In 2016, Lee Sedol, the world champion of Go, was defeated by DeepMind's AlphaGo. This marked a milestone in the advancement of deep reinforcement learning, as the game of Go had long been a grand challenge in computer science and AlphaGo was the first computer program to defeat a human world champion at it. The initial model of AlphaGo was trained using expert replays and later improved using reinforcement learning (Silver et al., 2016). Later, in 2017, DeepMind introduced a generalized algorithm to master board games, named AlphaZero, which learns tabula rasa (Silver et al., 2017). These events changed the world's perception of RL.
Soon after conquering the space of board games, DeepMind created AlphaStar (Vinyals et al., 2019) for StarCraft, which is believed to be one of the most difficult games to master even for humans. The action space of StarCraft is immense and the resulting outcomes can be vastly different; actions taken long ago, perhaps at the beginning of the game, can have dire impacts on one of the end-game fights. StarCraft is an example of a game that requires both micromanagement (managing immediate actions) and long-term goal planning, while adapting to the opponent's strategy or gameplay.
Games have served as a great domain for testing and research of reinforcement-learning-based algorithms. Taking inspiration from some of this previous work, this research attempts to evaluate the performance of several recently developed algorithms to gain a deeper understanding of when and how they should be utilized in a game environment.

Objectives
The research is intended to evaluate the significance of imitation learning in the training of RL agents for a multi-agent 3D battle royale shooter game, and to compare the performance and effectiveness of proximal policy optimization and soft actor-critic during this training. This research also seeks to identify the impact of generative adversarial imitation learning on the effectiveness of the proximal policy optimization and soft actor-critic methods of training. The research objectives, formulated based on the aim of this study, are as follows:

 To design and implement a battle royale shooter game environment; identify and implement a reward structure for the training of agents; manage spawning, movement, and shooting for agents; and create a system for tracking performance metrics such as kills per round, total number of wins, and number of items collected.
 To identify the curriculum for learning the macro and micro strategies of the game.
 To train the RL agent separately using PPO, SAC, the PPO-GAIL combination, and the SAC-GAIL combination, and obtain the performance statistics.
 To compare the results obtained for agents through each of these techniques and draw inferences based on the results.

Research Questions
The research questions are:
 Which of the two methods, PPO (on-policy) and SAC (off-policy), is more suitable for training RL agents in a complex environment?
 Can imitation learning enhance the training quality and reduce the training time for RL agents in complex scenarios, or does it rather prevent the agent from fully exploring the action space and degrade its performance?
 Does imitation learning (GAIL) affect PPO and SAC differently, and if so, what are the implications?

Scope of the Study


The research introduces a novel game environment created within the Unity game engine that can be used for the evaluation of RL algorithms and further research in the domain. The environment implements the rules of a battle royale shooter with an isometric perspective, which can be modified to a first-person or third-person perspective. The research uses this environment to evaluate the performance of state-of-the-art algorithms (PPO and SAC) and to identify the impact of imitation learning (GAIL) on these algorithms. The training in this research is intentionally restricted to a small number of steps (2 million), but neither the environment nor the Unity engine has any such limitation. Furthermore, this research has not utilised LSTM- or CNN-based architectures, in order to simplify the model and to speed up training. As future scope, one can utilise this environment for the evaluation of other algorithms or for the validation of new model architectures.

Significance of the Study


Deep RL has lately been used extensively for research and development. This thesis is an attempt to unveil some of the mysteries related to the effectiveness of modern RL training techniques. The effectiveness of off-policy and on-policy training techniques is largely dependent on the complexity of the environment-agent interaction. The battle royale shooter game is significantly complex: the agent not only needs to understand the 3D space for movement, including displacement and rotation, but also needs to act while observing the behaviour of other agents, which is essentially multi-agent interaction; in addition, the agent needs to plan item collection while managing the risk of encountering other agents. Thus, the battle royale shooter is a great testbed for understanding the performance characteristics of the RL training methods PPO and SAC. The research also aims at finding the impact of imitation learning (GAIL) on these training methods, revealing the conditions under which GAIL should and should not be utilized. Since GAIL could be very important for training an agent in complex scenarios, information on the applicability of GAIL could be vital. The insights obtained by this research are not limited to the domain of video games but could be applied to other fields such as robotics as well.

Structure of the Study


The structure of the study is as follows:
 Chapter 2 contains the study of previous work and background information on state-of-the-art deep reinforcement learning, and describes the various training techniques in modern RL.
 Chapter 3 describes the methodology, which includes platform details and details about the algorithms and techniques.
 Chapter 4 describes the details of the implementation of the RL environment and the agent.
 Chapter 5 contains the results obtained from this research.
 Chapter 6 draws conclusions from the results obtained and provides details on the future scope.

CHAPTER 2

LITERATURE REVIEW

Introduction
This research intends to evaluate and compare reinforcement learning techniques, namely PPO, SAC, GAIL-PPO and GAIL-SAC. Hence, the following sections consist of a walkthrough of important terminologies used in RL and DRL, and a study of some well-known RL and DRL techniques, including their advantages and limitations. This chapter also explores some case studies involving DRL research on games which are considered important and have driven breakthroughs in recent years.

Classical Reinforcement Learning


Reinforcement learning is inspired by the behavioural psychology of how animals and other living organisms interact with their surrounding environment (Sutton and Barto, n.d.). It is a framework in which the agent is not directly told the proper action to take in a given state; instead, the agent interacts with the environment on a trial-and-error basis and the environment provides a reward based on the actions taken. The agent explores the environment in an attempt to maximize the total reward acquired while trying to achieve the intended goal or objective. This approach is unique when compared to both supervised and unsupervised learning, as the agent has no prior knowledge of the environment and learns to maximize the reward just by exploring it. RL does not depend on a pre-existing dataset; all the necessary information is acquired by exploration of the environment.
The RL agent needs to learn the optimal policy for maximizing the cumulative reward by mapping each state to an action.

2.2.1 Concepts and Terminologies


It is essential to discuss the terminologies commonly used in RL before discussing the techniques of RL and DRL. The following subsections describe some of the important concepts and terminologies.

2.2.1.1 Environment
An environment is the play area for the agent, where agents can interact and perform actions resulting in positive or negative feedback. In the real world, the environment is considered a subset of an infinite universe that is observable and of interest. A 3D game environment is a virtual space in which the agent can traverse and perform certain physical actions based on the simulation or implementation of that virtual environment. Mathematically, an environment is a model that is the key source of rewards on the transition of states. Most tasks and problems are sufficiently complex to require some sort of simulation to handle all the necessary parameters of a state model (Juliani et al., 2018). In the real world, on the other hand, exactly defining the boundaries of an environment is much more difficult because of the sheer number of parameters that need to be considered.

2.2.1.2 Rewards and Goals


In reinforcement learning, an agent gets feedback based on the action taken in the current state; this feedback can be positive or negative. Negative rewards are to be avoided, and positive rewards indicate that the policy is performing better. The agent is not only concerned with immediate rewards but also considers future rewards. Each future reward is discounted by a factor, also treated as a learning parameter, whose value lies between 0 and 1 and is usually denoted by γ. A reward received k + 1 steps after time t, that is r_{t+k+1}, is discounted by γ^k. The lower the value of γ, the less valuable future rewards are and the more valuable the current or immediate reward is. Mathematically, the cumulative reward can be summarised as:
\[
R_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k+1}
\]
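To make the discounting concrete, the following short Python snippet (a minimal illustrative sketch, not taken from the thesis implementation) computes the discounted return for a finite list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite episode.

    `rewards` is the list of rewards received after time step t,
    i.e. [r_{t+1}, r_{t+2}, ..., r_T].
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total

# Example: with gamma = 0.9, later rewards contribute less to the return.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```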

2.2.1.3 Markovian Decision Process


An MDP models a sequential decision-making process and consists of the following:
A set of agent states S and a set of actions A.
A transition probability function T: S × A × S → [0,1], which gives the transition probabilities. T(s, a, s′) represents the probability of transitioning from state s to state s′ when taking action a.
An immediate reward function R: S × A × S → ℝ, the amount of reward (positive or negative) provided by the environment on a state transition. R(s, a, s′) represents the immediate reward received after the transition from state s to state s′ on taking action a.
The Markov decision process comes with the Markov assumption, which states that at any point in time the next state depends only on the current state; it does not matter how the current state was reached, rendering consideration of the preceding states insignificant.
An episode is the sequence of transitions from the initial state to a terminal state. Episodes form a sequence in an MDP, and if the transition probabilities and reward functions can be determined, then the problem becomes an optimal control problem (Alagoz et al., n.d.).
Reinforcement learning algorithms can be sorted into two classes based on whether they deal with the value function or with policy optimization without considering value function iterations (Kormushev et al., 2013); these two classes are:
 Value function based reinforcement learning
 Policy search based reinforcement learning
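Purely as an illustration of the MDP components listed above (not part of the thesis implementation; all names are hypothetical), a small finite MDP can be written down directly as Python data structures:

```python
from dataclasses import dataclass, field

@dataclass
class FiniteMDP:
    """An explicit finite MDP: states, actions, transition probabilities and rewards."""
    states: list
    actions: list
    # transitions[(s, a)] -> list of (next_state, probability) pairs
    transitions: dict = field(default_factory=dict)
    # rewards[(s, a, s_next)] -> immediate reward for that transition
    rewards: dict = field(default_factory=dict)
    gamma: float = 0.99

# A two-state toy example: from "far" the agent can "move" closer to an item.
mdp = FiniteMDP(
    states=["far", "near"],
    actions=["move", "stay"],
    transitions={("far", "move"): [("near", 1.0)],
                 ("far", "stay"): [("far", 1.0)],
                 ("near", "move"): [("near", 1.0)],
                 ("near", "stay"): [("near", 1.0)]},
    rewards={("far", "move", "near"): 1.0},  # reward for reaching the item
)
```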

2.2.1.4 Policy
A policy is the mapping that determines which action should be taken in a given state. A policy is said to be deterministic if it returns a single action for an input state s, and stochastic if it returns a probability distribution over actions for an input state s. A deterministic policy is given by
\[
\pi(s_t) = a_t ,
\]
and a stochastic policy is given by
\[
\pi(a_t \mid s_t) ,
\]
the probability of taking action a_t in state s_t. An optimal policy is one that maximizes the cumulative reward obtainable in a given environment; it is represented by π*.

2.2.1.5 Value Function


A value function evaluates the quality of a policy. It can be estimated by taking random actions and observing the obtained rewards, which can be termed exploration by trial and error. The value function can be calculated iteratively using dynamic programming (Chang et al., 2015).
Mathematically,

\[
V^{\pi}: S \to \mathbb{R}, \qquad V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{ \sum_{i=0}^{T} \gamma^{i} r_{t+i+1} \,\middle|\, s_t = s \right\}
\]

The value function is thus the measure of the quality of each state.

7
2.2.1.6 Quality Function
The optimal policy, or a sub-optimal one (very close to the optimal policy), can be obtained empirically by multiple methods. One such method is the use of the quality function, also known as the Q-value (Sutton and Barto, n.d.), to identify the reward accumulated on taking an action in a given state. The Q-function tells how good an action is when taken from a given state; it does this by summing the reward obtained from the transition caused by that action from the given state to the next state, while also considering the discounted future rewards.
Mathematically,

\[
Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{ \sum_{i=0}^{T} \gamma^{i} r_{t+i+1} \,\middle|\, s_t = s, a_t = a \right\}
\]

2.2.1.7 Bellman Equations


Bellman equations provide a means of solving the problem of maximizing the cumulative reward through an iterative process, by formulating the value function recursively. A policy π that returns a higher cumulative reward than another policy π′ for all states s ∈ S is better than that policy (Arulkumaran et al., 2017). Mathematically, the optimal value function V_*(s) can be written as
\[
V_{*}(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in S
\]

The optimal action-value function Q_*(s, a) can be mathematically defined as
\[
Q_{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \quad \forall s \in S
\]

For the optimal policy, this equation can be written as
\[
V_{*}(s) = \max_{a \in A(s)} Q^{\pi_{*}}(s, a)
\]

Bellman optimality equations can be solved iteratively using dynamic programming, given that
the transition probabilities and the reward functions are pre-determined or could be determined
on the fly. The algorithms which are based on the assumption that the probabilities are known
or could be estimated online fall under the category of model-based algorithms.
Most of the algorithms used in practice perform rollouts on the system to determine the policy and value function, as in most cases the transition probabilities are not known; such algorithms are known as model-free algorithms.
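To illustrate how the Bellman optimality equation can be solved by dynamic programming when the model is known, here is a minimal value iteration sketch in Python (illustrative only; the `transitions` and `rewards` dictionaries are assumed toy structures, not part of the thesis):

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.99,
                    tol=1e-6, max_iters=1000):
    """Iteratively apply the Bellman optimality backup:
    V(s) <- max_a sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V(s')).

    transitions[(s, a)] -> list of (s_next, probability)
    rewards[(s, a, s_next)] -> immediate reward (defaults to 0).
    """
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            q_values = []
            for a in actions:
                q = sum(p * (rewards.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in transitions.get((s, a), []))
                q_values.append(q)
            best = max(q_values) if q_values else 0.0
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once the value function has converged
            break
    return V
```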

2.2.1.8 Model-free methods
Model-free methods do not require knowledge of the model of the environment. These techniques are based on estimating the value function. If the optimal policy is inferred from the approximated value function, then the technique is categorized as a value-function-based method. On the other hand, if the optimal policy is obtained by searching the domain of policy parameters, then it is categorized among policy search methods. Thus, model-free methods can target and solve most RL problems.
Model-free methods are also usually classified as on-policy or off-policy based on policy usage. On-policy methods use the current policy both for generating actions and as the policy being updated from the outcomes. Off-policy methods have separate policies: one is used for generating actions whereas the other is being updated.

2.2.2 Monte Carlo Methods


Monte Carlo methods are sampling-based methods generally employed to solve problems with probabilistic relations. For RL problems there are usually two steps involved in Monte Carlo methods. First, the value function is evaluated using the current policy, followed by improvement of the policy based on the current value function. The first step is termed the policy evaluation step and the second the policy improvement step.
A series of rollouts is performed on the system using the current policy for the approximation of the value function. The value function is estimated by considering the distribution of states encountered and the cumulative reward obtained over the complete episode by performing the rollouts. The policy is then improved by acting greedily with respect to the current value function. Iterative application of these steps leads to convergence of the value function and the policy to their optimal values. Despite the directness of Monte Carlo methods in their implementation, practical problems demand a significantly large number of iterations for convergence, and the estimate of the value function is still prone to high variance.

2.2.3 Temporal Difference Methods


Temporal difference (TD) methods calculate a temporal error to evaluate the value function. The use of the temporal error in the policy evaluation step distinguishes TD from Monte Carlo methods. The temporal error is defined as the difference between the newer and older value function estimates and is determined from the immediate reward at the given time step. The temporal error is used to update the value function; this update is made at every step, whereas Monte Carlo methods need to wait for the episode to end before making an update. The frequent updates result in lower variance but lead to higher bias.
There are two TD-based algorithms popularly utilized for solving RL problems (a tabular sketch of both update rules follows this list):
 Q-Learning (off-policy): the Q-function Q(s, a) gives the maximum future reward, discounted by a factor γ, obtainable under the optimal policy by taking action a in the current state s. In Q-learning, the RL environment is explored until the iteratively updated Q-table values converge. Mathematically,
\[
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
\]
where 0 ≤ γ ≤ 1. For a given iteration i, with the current state s transitioning to the future state s′ by virtue of action a with an immediate reward r, the update can be written as
\[
Q_{i}(s, a) = Q_{i-1}(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q_{i-1}(s, a) \right)
\]
where α is the learning rate, 0 ≤ α ≤ 1. ε-greedy exploration is commonly used in Q-learning to let the agent experience and learn various transitions. During training, ε decays from a higher value to a lower one: for higher values of ε the agent takes actions at random, and as ε decreases over time the agent becomes more likely to take the actions with the highest Q-values. Such an approach leads to proper exploration of the state-action space.
 State Action Reward State Action (on-policy): SARSA is very similar to Q-learning, but the difference lies in the implementation; SARSA is on-policy whereas Q-learning is off-policy (Samsuden et al., 2019). The name SARSA represents State (current), Action (the action leading to the transition), Reward (immediate reward), State (next state), Action (next action). Mathematically, the update step is given by
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)
\]
The environmental interactions lead to Q-value updates, thus making policy updates dependent on the actions taken. SARSA continually improves the policy until it finds the optimal policy.
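The two tabular update rules above can be written compactly in Python; this is a generic sketch under assumed data structures, not the thesis code:

```python
from collections import defaultdict
import random

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in the next state.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually chosen in the next state.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def epsilon_greedy(state, actions, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the current Q-table.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```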

2.2.4 Policy Search Methods


Policy search methods optimize the policy by observing the rewards obtained from the selection of certain actions and behaviours, using a utility function. The actions are described as vectors in a continuous domain (Sutton, 1988). If the utility function is not available analytically, policy search methods perform rollouts on the system to estimate it; such methods are also known as "black box" methods. If the utility function is known, or if the policy search method uses some structure from the RL problem, then it is categorised among "white box" methods (Kober et al., 2013). Policy search algorithms may utilize gradient descent for policy optimization. The gradient can be derived in many different ways, leading to a complete domain of research for gradient-based policy search (Deisenroth et al., 2011).
Value-based approaches usually fail to learn stochastic policies due to their deterministic nature, whereas policy search algorithms have no such limitation. In practice it has been observed that these algorithms learn slowly and require a large number of steps for training; large variance is also observed during the policy evaluation step. Hence it is usually preferable to opt for "white box" methods if training for a large number of steps is not possible (Kober and Peters, n.d.). Imitation learning can be used to speed up training, leading to convergence of the policy in fewer steps.
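As an illustration of gradient-based policy search (not drawn from the thesis), the sketch below performs a REINFORCE-style update for a linear-softmax policy over discrete actions using NumPy; all names are hypothetical:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE (score-function) update for a linear-softmax policy.

    theta: (num_features, num_actions) parameter matrix.
    episode: list of (features, action, reward) tuples from one rollout.
    """
    grad = np.zeros_like(theta)
    G = 0.0
    # Walk the episode backwards to accumulate discounted returns.
    for features, action, reward in reversed(episode):
        G = reward + gamma * G
        probs = softmax(features @ theta)       # pi(a | s) for all actions
        dlog = -np.outer(features, probs)       # d log pi / d theta, all actions
        dlog[:, action] += features             # extra term for the chosen action
        grad += G * dlog
    return theta + lr * grad                    # gradient ascent on expected return
```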

2.2.5 Actor-Critic Methods


Actor-critic methods were first introduced by (Crites and Barto, 1983). Actor-critic methods treat the policy as explicit and independent of the value function. For implementation, the policy and the value function are stored in separate memory structures. The policy is termed the actor and the value function the critic. The actions selected by the actor are consistently criticized by the critic. Actor-critic methods fall under TD methods: the TD error computed by the critic is used to update the policy, and the critic itself is then updated by adjusting the estimated value of the state using the TD error. In early research, actor-critic methods were limited to on-policy learning; later, (Degris et al., 2012) introduced an off-policy actor-critic.

2.2.6 Challenges in Reinforcement Learning


Value function approaches theoretically require total coverage of the state space and the corresponding reinforcement values of all possible actions at each state. Thus, the computational complexity can be very high when dealing with high-dimensional applications, and even a small change in the local reinforcement values may cause a large change in the policy. At the same time, finding an appropriate way to store such huge amounts of data becomes a significant problem. In contrast to value function methods, policy search methods consider the current policy and a next policy close to the current one, and then compute the changes in the policy parameters. The computational complexity is far less than that of value function methods; moreover, these methods are also applicable to continuous features. However, for the same reason, policy search approaches may converge to a local optimum and may never reach the global optimum. A combination of the value function and policy search approaches, called the actor-critic structure (Barto et al., 1983), was proposed to fuse the advantages of both. The "actor" is the control policy and the "critic" is the value function. Action selection is controlled by the actor, while the critic transmits value estimates to the actor, deciding when the policy should be updated and which actions should be preferred.
Although there are several RL methods fitting different kinds of problems, these methods all share the same intractable complexity issues, for example memory complexity. Searching for a suitable and powerful function approximator therefore becomes the imminent issue. In reinforcement learning, the value function is approximated based on the observations sampled while interacting with the environment (Kober et al., 2013). Function approximation has been investigated extensively to date, and with the fast development of deep learning, a powerful function approximator, the deep neural network, can address these complex issues. We discuss this in the next section, with a focus on deep learning and artificial neural networks. Traditional RL thus has two primary issues to be resolved: the first is to reduce the learning time; the second is how to use RL methods when real-world applications do not follow a Markov decision process.

Deep Reinforcement Learning


Deep Reinforcement Learning (DRL) has emerged as a technique to overcome the challenges of classical reinforcement learning. DRL uses neural networks to approximate the internal functions of RL. Having an approximator as capable as a neural network enables the agent to approximate the value function and the policy even with partial exploration. DRL can be used for solving problems with continuous inputs and outputs, as well as problems involving high dimensionality. Control over high-dimensional states has been demonstrated by (Lillicrap et al., n.d.) for locomotion. The following sections provide a brief overview of state-of-the-art DRL algorithms.

2.3.1 Deep Q-Learning


(Tsitsiklis and Roy, 1997) showed that the use of non-linear approximators with Q-learning can be unstable and may lead to divergence. Thus, deep Q-learning was not utilized broadly until (Mnih et al., 2015) demonstrated that it can yield human-level performance on Atari 2600 games, with the Q-function parameterised by the weights of a neural network. This technique was termed the Deep Q-Network (DQN).
DQN works by iteratively minimizing a loss function using stochastic gradient descent. ε-greedy exploration is used to improve exploration and reduce the chance of converging to a poor local optimum. To counter possible instability and overfitting, an experience replay buffer is created and updated at regular intervals. For training, samples are extracted in fixed-size batches at random from the replay buffer, which helps the network to generalise well.
The primary limitation of DQN is that it is unsuitable for solving problems in continuous action spaces, as finding a greedy policy over a continuous space by updating the policy at every time step is too slow for practical problems.
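A minimal experience replay buffer, of the kind used in DQN-style training, could look like the following sketch (illustrative only, not the thesis implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Old transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random minibatch, which breaks temporal correlations between samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```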

2.3.2 Deep Deterministic Policy Gradient


(Lillicrap et al., 2016) presented a model-free algorithm based on the Deep Deterministic Policy Gradient (DDPG). Deterministic policy gradient (DPG) algorithms were originally introduced by (Silver et al., 2014) for policy updates. DDPG uses an off-policy actor-critic scheme and implements both the actor and the critic using neural networks; the DPG algorithm is used for performing the policy updates. The usage of the DPG algorithm for policy optimization makes DDPG suitable for problems involving high dimensionality and continuous action spaces. Though vanilla DPG performed very poorly in many environments, the usage of target networks in DDPG was found to greatly enhance performance.
In terms of limitations, like most model-free methods DDPG also requires a large number of episodes for training.

2.3.3 Trust Region Policy Optimization


Trust Region Policy Optimization (TRPO) solves a constrained optimization problem iteratively, maximizing an objective function subject to given constraints and thereby ensuring monotonic improvement of the current policy (Schulman et al., 2015). The algorithm is effective for updating non-linear policies and, in terms of formulation, is very similar to policy gradient methods. The algorithm can be used with neural networks; (Schulman et al., 2015) demonstrated robust performance of TRPO on several different neural-network-based tasks such as walking, swimming, and playing Atari games.

2.3.4 Proximal Policy Optimization


(Schulman et al., 2017a) proposed a new algorithm known as proximal policy optimization, which works by optimizing a "surrogate" objective function using stochastic gradient ascent. It alternates between sampling data by interacting with the environment and optimization. It distinguishes itself from standard policy gradient methods by using an objective function that allows multiple epochs of mini-batch updates, whereas the standard methods are limited to only one gradient update per data sample.
For neural network architectures, the optimization is done using a combined loss that involves the policy surrogate, the value function error, and an entropy bonus to ensure proper exploration, as described by (Mnih et al., 2016).
The advantage estimator used by (Mnih et al., 2016), which is also suitable for recurrent networks, runs the policy for T time steps, where T is much less than the length of an episode, and can be generalized into a truncated advantage estimate. A PPO algorithm utilizing a fixed trajectory length (number of time steps) is shown in Figure 2.3.5.

Figure 2.3.5 PPO algorithm
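The core of PPO as proposed by (Schulman et al., 2017a) is the clipped surrogate objective, L^CLIP(θ) = E[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)], where r_t(θ) is the probability ratio between the new and old policies. A hedged PyTorch-style sketch of this loss is shown below (tensor names are assumptions; in practice it is combined with the value function error and an entropy bonus, as described above):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO over a minibatch of sampled time steps.

    `old_log_probs` come from the policy that collected the data;
    `advantages` are the estimated advantages A_t for those time steps.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize, while PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()
```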

2.3.5 Soft Actor-Critic


(Haarnoja et al., 2018) described how large continuous domains require a practical approximation to soft policy iteration. SAC uses two approximators, one for the soft Q-function and the other for the policy. SAC operates by alternating between collecting experience and performing stochastic gradient descent on both networks. The policy is modelled as a Gaussian whose mean and covariance are given by the neural network. As per (Soft Actor Critic—Deep Reinforcement Learning with Real-World Robots – The Berkeley Artificial Intelligence Research Blog, 2020), soft actor-critic is based on the maximum entropy reinforcement learning framework, considering the entropy-augmented objective given by
\[
J(\pi) = E_{\pi}\left[ \sum_{t=0}^{T} r(s_t, a_t) - \alpha \log\big(\pi(a_t \mid s_t)\big) \right],
\]

where s_t and a_t are the state and the action, and the expectation is taken over the policy and the true dynamics of the system. The policy optimization tries to maximize both the expected reward (first summand) and the entropy (second summand). The non-negative temperature parameter α acts as a control value to balance the effect and importance of entropy; setting α = 0 makes the entropy term insignificant. Tuning the temperature parameter used to be a manual task; SAC automated the process by treating entropy as a constraint instead of a constant weight. The entropy is allowed to vary within certain limits and only its average is constrained; this approach is very similar to (Abdolmaleki et al., 2018), where a constraint on the policy was placed relative to the previous policy to avoid large immediate deviations.
The practical implementation of SAC utilizes two soft Q-functions, which independently attempt to minimize the soft Bellman residual. The lower of the two Q estimates is used for the value and policy gradient steps (Fujimoto et al., 2018).
The algorithm can be summarised (Haarnoja et al., 2018) as shown in Figure 2.3.6.

Figure 2.3.6 SAC Algorithm
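To make the two ideas above concrete, the snippet below sketches how the clipped double-Q target with an entropy term is typically formed in an SAC implementation (a hedged PyTorch-style sketch under assumed tensor names, not the thesis code):

```python
import torch

def soft_q_target(rewards, next_q1, next_q2, next_log_probs, dones,
                  gamma=0.99, alpha=0.2):
    """Bellman target for the soft Q-functions.

    next_q1 / next_q2: estimates of the two target critics at (s', a'),
    where a' is sampled from the current policy; next_log_probs = log pi(a' | s').
    """
    # Take the minimum of the two critics to reduce overestimation bias.
    min_next_q = torch.min(next_q1, next_q2)
    # Soft value: expected Q minus alpha times the log-probability (entropy bonus).
    soft_value = min_next_q - alpha * next_log_probs
    # Standard discounted Bellman backup, masked at episode termination.
    return rewards + gamma * (1.0 - dones) * soft_value
```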

Related Work on Games


Games have consistently been used as a testbed for the testing and validation of RL algorithms. The following sections cover some of the important work performed in DRL that is relevant to games.

2.4.1 Deep Reinforcement Learning in 2D Games


(Bellemare et al., 2012) introduced “The Arcade Learning Environment” (ALE) which provides
an interface to Atari 2600 games. Since then this platform has been used extensively for AI
research and validation of algorithms in 2D games.
(Mnih et al., 2013) utilized down-sampled raw pixels from the Atari 2600 simulation platform to compare some well-known state-of-the-art methods such as SARSA and Contingency against a modified version of online Q-learning using experience replay and stochastic mini-batch updates. The evaluation was performed on seven different Atari 2600 games, and the modified version outperformed SARSA and Contingency on six of the seven.
The apparent limitations are the usage of low-resolution images with 2D convolutions, and the fact that the approach was not validated against any 3D game, so its performance in more complex situations is unclear. Hence the approach can only be generalised to Atari 2600 games.
(van Hasselt et al., 2015) demonstrated how common the problem of over-optimism is in DQN-based models, and showed that the errors caused by over-optimism can be reduced by using Double Q-Learning. The validation was performed on 57 Atari 2600 games.
Since the comparison was made only with DQN, the performance of Double Q-Learning relative to other algorithms is unclear; also, the validation was performed only in 2D environments, so the behaviour might differ in a 3D environment.
(Hosu and Rebedea, 2016) demonstrated that human replays can be utilised for games that have a very sparse reward structure, which otherwise leads to poor greedy exploration. The approach used ALE's ability to store checkpoints and utilized 100 human-generated checkpoints, which were ultimately used for training a Q-learning-based method. This approach is very similar to modern generative adversarial imitation learning and has numerous real-life applications.

2.4.2 Deep Reinforcement Learning in Board Games


Turn-based board games that do not involve dice or a similar source of randomization are considered deterministic environments, as you are likely to get the same result if you perform exactly the same move and your opponent also repeats the same. Thus, board games are treated as a separate domain for research in RL.
(Tesauro, 1994) used a backgammon-playing program and trained an RL agent purely by self-play through a model-free approach very similar to Q-learning. The implementation involved a multi-layer perceptron with one hidden layer for the approximation of the value function. The models were able to achieve superhuman performance. Soon the same approach was applied to other games like chess and Go, but the models failed to achieve the same degree of proficiency, suggesting that this approach was suitable only for some specific cases.
(Baxter et al., 2000) introduced a variant of TD(λ) which can be utilized for general-purpose training of an evaluation function for min-max search. The algorithm was validated on chess, and the model was able to achieve master-level proficiency, about a 2200 rating. The drawback is that this approach was not very suitable for self-play, but it was able to achieve good results by training against humans and other computer AI.
(Silver et al., 2016) created AlphaGo, which was initially trained using supervised learning from human experts and became the first computer program in history to defeat a human professional player. AlphaGo uses a tree search for evaluating positions and a DRL-trained network for move selection. The approach used is highly specific to the game and domain.
(Silver et al., 2017) came up with AlphaGo Zero, an algorithm based entirely on self-play and RL: starting from a tabula rasa, the model was able to achieve superhuman performance through self-play alone. Self-play was used to improve the tree search, consequently resulting in better decisions.
(Silver et al., 2018) introduced the AlphaZero algorithm, a general-purpose algorithm for mastering board games. The algorithm was validated on games like Go, chess and shogi, and was able to attain superhuman ratings without any game-specific modification. It was even able to outperform computer AI based on alpha-beta search.

2.4.3 Deep Reinforcement Learning in FPS Games


(Kempka et al., 2016) introduced a new 3D platform for RL research based on the game Doom. The platform supports convolution- and Q-learning-like algorithms, and it is also possible to utilize algorithms other than the baselines. The platform is lightweight, supports advanced features such as depth pixels, and has been utilized in several subsequent studies.
On the ViZDoom platform, (Lample and Chaplot, 2017) illustrated an implementation utilizing partially observable in-game states along with raw pixel data for training the agent. They utilized a Deep Recurrent Q-Network (DRQN), augmented with in-game features, to support long-term action planning. The model outperformed the in-game AI and humans in deathmatch scenarios. This implementation uses Q-learning-based training and is susceptible to over-optimism errors; using an LSTM also limits the technique to the discrete action domain, while raising the hardware requirements and the number of steps required for training.
(Wu and Tian, 2017) crafted a model on the ViZDoom platform using Asynchronous Advantage Actor-Critic and employed curriculum-based learning. The model won the ViZDoom AI competition in 2016. Though the model performed well, its performance was limited to the known map; also, the model was only able to recall the last four frames when making the next decision, hence it was unable to map out the environment for a global strategy.

(Shao et al., 2018a) demonstrated that Actor-Critic with Kronecker-factored Trust Region (ACKTR) outperformed Asynchronous Advantage Actor-Critic, which had previously been used as the baseline algorithm. This new approach is relatively inexpensive and sample efficient. Though the validation was performed by comparing kill counts and rewards attained in battles, a better comparison would have been to put the agents in a deathmatch scenario.
(Kolbe et al., 2019) introduced a new approach coupling case-based reasoning with RL; the validation was performed on an FPS environment created within the Unity game engine. The agent was able to learn from experience with a consistent increase in its kill-death ratio. The environment was a relatively simplified version of actual FPS games, and with an increase in complexity the model might not hold up.

2.4.4 Deep Reinforcement Learning in Strategy Games


A new reinforcement learning platform based on StarCraft II was introduced by (Vinyals et al., 2017). Along with the platform, which interfaces the game engine with the RL logic, they also provided baseline implementations of agents that can be considered equivalent to novice players.
(Shao et al., 2018b) demonstrated that micromanagement within StarCraft II can be achieved using reinforcement learning and curriculum transfer learning; the research also simplified the problem by reducing complexity through an efficient state space, and the units were trained using multi-agent gradient-descent SARSA(λ). The agent was able to learn the strategies of StarCraft II and achieved a 100% win rate in small-scale scenarios against the baseline AI.
The model was compared against the baseline AI, and its performance was still not close to that of professional human players. Cooperative behaviour was achieved by sharing the policy network; though this was an efficient means of sharing other units' information, the research was limited to training only ranged ground troops, and the approach could not be used for training melee ground troops.
StarCraft II, which is considered one of the most difficult games, was conquered by AlphaStar (Vinyals et al., 2019) at GrandMaster level, rating above 99.8% of ranked human players, without putting any restriction on the original game. AlphaStar used an LSTM network backed by a reinforcement learning network to maintain memory between steps. Initially, AlphaStar was trained using supervised learning on replays of top human players. To address multi-agent learning, once the model had achieved a significant level the agents were pushed into league play, to avoid the cycles commonly encountered when using plain self-play; agents were matched using an algorithm designed to aid their learning. AlphaStar was able to achieve superhuman performance, but its training was highly dependent on imitation learning, so the approach is only applicable to domains with such a vast body of known strategies and demonstrations.

(Andersen et al., 2018) introduced Deep RTS, an artificial intelligence research platform built on a high-performance RTS game that supports accelerated learning. Deep RTS provides access to partially observable state spaces and varying map complexity.

(Hu and Foerster, 2019) presented a new algorithm, the Simplified Action Decoder (SAD), for multi-agent training. During the training phase, agents in SAD observe the greedy actions of their teammates along with the actions actually chosen. SAD requires less computational power, and the approach is much simpler than earlier attempts at multi-agent training. The validation was performed on the Hanabi challenge. There is scope for improvement, as the algorithm outperformed some techniques but is still not the best; the approach is also limited to five agents.

(OpenAI et al., 2019) built the first AI system to defeat the human world champions in Dota 2. The agent was built using a central shared LSTM network feeding separate fully connected networks that generate the value function outputs and the policy. Proximal Policy Optimization (Schulman et al., 2017b) was used for training the policy, and Generalized Advantage Estimation (Schulman et al., 2016) was further used to stabilize and speed up the training process. A game of Dota can last for several thousand time steps, which poses the problem of long-term goal planning; this was tackled by optimizing for accurate credit assignment over time. The techniques used are highly generalizable, but they required a huge batch size and quite a large amount of training time.

(Utocurricula and Powell, 2020) demonstrated that when agents are exposed to several different situations of strategic play, they tend to build up a self-supervised autocurriculum with distinct emergent strategies. The research uses a "hide and seek" simulator, where one team of agents needs to find the agents of the other team. The hiders tend to adapt to the strategies chosen by the seekers based on the items available in the environment. The environment comes with six built-in modes (strategies).
The environment has a lot of scope for improvement: at later phases of training the agents start to exploit inaccuracies and loose implementation details of the physics engine, and such behaviours should be avoided.

CHAPTER 3

RESEARCH METHODOLOGY

Introduction
Reinforcement learning research simulations (especially for deep reinforcement learning) involve a huge number of parameters to be addressed. To create 3D simulations, a large number of systems need to be developed, which creates a challenge for increasingly advanced, rapid RL research. The simulation environment should be easy to use and scalable to support the various needs of research. Some of the popular simulation platforms are ViZDoom (Kempka et al., 2016), MuJoCo (Todorov et al., 2012), and the Arcade Learning Environment (Bellemare et al., n.d.). These simulation platforms help in refining the algorithms used for training; in certain scenarios they can also be used for training the actual model and deploying it for real-world usage.
The research aims to create AI models using PPO and SAC for an isometric shooter game and, in turn, to evaluate the comparative difference in the performance of PPO and SAC. The other objective of the research is to evaluate the impact of GAIL on the optimization performed by both of these algorithms, PPO and SAC. From the objectives, it is fairly clear that the RL techniques require a platform for the implementation and simulation of the environment, and that the platform should be able to support these algorithms. Preferably, the platform should support the baseline algorithms required for the research.
The Unity game engine has been selected as the platform; subsequent sections justify the criteria for this selection and detail the environment properties.

Selected Platform – Unity Game Engine


Most of the well-known simulation platforms used for research are based on games and game engines, as mentioned by (Juliani et al., 2018). Games have catered to the needs of AI researchers by providing an easy way of creating life-like simulations that expose the model to situations close to the real world.
Unity is a feature-rich game development platform consisting of a graphical user interface shell to facilitate users and a powerful core written in C++, while letting users program in C#, a high-level language, for ease of use. Over the years, Unity has matured into a general-purpose game engine, allowing developers to create content of varied nature such as 3D games, 2D games, VR content, and even movies. Unity's capability to export the created content to a number of platforms, such as Windows, Android, and many more, is noteworthy.
Unlike most research platforms built on top of games, or those created to provide the experimental setup for some specific research, Unity as an engine places no restrictions on the kind of simulation one wants to create. Unity also comes with an intuitive graphical interface, the Unity Editor, that greatly helps to speed up the development process.

3.2.1 Environment Properties


Sensory Complexity - Unity supports high-fidelity graphical simulation that can be used to
create a photo-realistic rendering. The photo-realistic renders can then be utilized for training
models for real life-like situations. Unity also allows the creation of custom shaders to morph
the rendering characteristics. One can also capture the depth value for each pixel if some model
could possibly utilize it. Apart from images that can be utilized as a source of information
observation other techniques like ray-cast based detection systems and directly feeding vector
observations to the model is also possible.
Physical Complexity - The Unity engine comes with a built-in physics implementation provided by Nvidia PhysX. One can also write a custom physics engine within Unity if required. The presence of a physics engine in the package is highly useful, as it lets developers and researchers freely create physics-based simulations without having to worry about every detail of implementing physical laws. Lately, other third-party physics engines have also started supporting Unity; for example, plugins are available that provide the Bullet (Deane, 2018) and MuJoCo (Todorov, 2018) physics engines as alternatives to PhysX.
Cognitive Complexity – C# is the primary programming language within Unity; being a high-level language with access to the extensive .NET library, Unity's C# support is highly desirable. Unity uses a component-based system that enables new behaviours to be added onto an instantiable class known as GameObject, allowing users to easily modify the parameters and behaviours of the environment or the agent.
Social Complexity – Multi-agent scenarios can be implemented easily, given that Unity supports multiplayer and networking out of the box. Several popular third-party plugins are also available for multiplayer support in Unity.

3.2.2 ML-agents SDK
The ML-Agents toolkit is the primary way of using ML algorithms in Unity. The ML-Agents SDK provides the means of communication between environments and agents created within the Unity Editor and the Python API used for optimization and evaluation. The toolkit can efficiently exploit all the features of the Unity Engine, such as processing camera output, raycast-based detection, and many more.
The toolkit comes with a set of abstract and base classes and a predefined communication protocol to help developers implement and extend its functionality. Using this core functionality, developers and researchers create the environment setup with the Unity Editor and the relevant classes that drive a certain behaviour. The environment and the agent get direct access to the Python API if they are implemented according to the core classes and the predefined communication protocol.
The Learning Environment comprises Agents, Brains, and an Academy. Agents collect observations and manage the actions that can be taken. The Brain does the job of decision making: it utilizes the policy to determine the best action to take in a given situation. The Academy handles the coordination of the various constituents of the environment and of each agent over the simulation episode. This is represented in Figure 3.2.2.

Figure 3.2.2 Academy to agent flow
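To make the predefined communication concrete, the following is a minimal, hypothetical sketch of an agent behaviour written against the toolkit's C# Agent base class; the class and field names are illustrative rather than the project's actual code, and the exact method signatures differ between toolkit releases.

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Minimal sketch of an ML-Agents behaviour attached to a GameObject.
public class ShooterAgentSketch : Agent
{
    public Transform zoneCentre;   // illustrative reference, set in the Unity Editor

    public override void OnEpisodeBegin()
    {
        // Reset agent state (position, health, ammo) at the start of each episode.
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Vector observations are appended one value (or vector) at a time.
        sensor.AddObservation(transform.forward);
        sensor.AddObservation((zoneCentre.position - transform.position).normalized);
    }

    public override void OnActionReceived(float[] vectorAction)
    {
        // Apply the chosen actions here, then shape the reward signal.
        AddReward(0.001f);   // illustrative small per-step survival reward
    }
}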

Algorithms and Supporting Techniques
The ML-Agents toolkit has evolved since its first release and now supports an increasing number of baseline algorithms, different types of neural networks, CNN presets, and modern techniques such as asynchronous parallel training. The following sections describe some of the important algorithms and techniques implemented in the ML-Agents toolkit.

3.3.1 Generative Adversarial Imitation Learning


For training, the agent's neural network is initialized with random weights. In the initial stages the model picks random actions, and each action taken is evaluated. Based on how good the action was in a given situation, the policy function is updated to maximize the rewards, and the agent slowly starts picking correct actions as the policy gets optimized. Complex situations arise where the observation space and action space are large and the environment is likely to have multiple local optima, especially when the agent has to perform a series of actions, any of which can lead to an entirely different reward in the future. In such scenarios the chances of the agent getting stuck in a local optimum during training are relatively high. Increasing the total number of iterations performed for policy optimization is one way of dealing with this, but it can be computationally heavy and hence may not always be feasible.
Generative Adversarial Imitation Learning (GAIL) is an approach where a discriminator is used to aid the agent's learning: through the discriminator, the action taken by the agent is evaluated against a set of high-quality examples. This is a form of imitation learning, as the agent learns from expert demonstrations of observations and actions. To use imitation learning through the ML-Agents toolkit, one needs to record demonstrations generated either by a human or by a bot. During training the agent collects observations from the environment while following a greedy exploration strategy, and GAIL uses the expert demonstrations to drive its discriminator. At each step the agent is rewarded based on how closely its observation-action pair matches the demonstrations, so the agent slowly starts following the demonstrations, somewhat like supervised learning. The discriminator is also updated over time so that it provides harsher feedback on deviations from the demonstrations. Iteratively both the discriminator and the agent improve, and the agent increasingly mimics the demonstrations.
GAIL can be effectively combined with any DRL algorithm, as it simply adds an additional feedback loop and leaves the rest of the training process untouched. The balance between the GAIL reward and the environment reward needs to be tuned so that the agent ultimately prefers actions that lead to a higher cumulative environment reward over actions that merely match the demonstrations. The working of the discriminator and GAIL is depicted diagrammatically in Figure 3.3.1.

Figure 3.3.1 GAIL information flow
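As a rough sketch of how the two feedback signals are balanced (not the toolkit's exact internal formula), the per-step reward the agent optimizes can be written as a weighted sum of the extrinsic (environment) reward and the discriminator-based GAIL reward, where the weights correspond to the strength values under reward_signals in Appendix B and D_\phi(s_t, a_t) is the discriminator's estimate that the observation-action pair came from the expert demonstrations (sign conventions for the discriminator term differ between implementations):

r_t = s_{\mathrm{extrinsic}}\, r_t^{\mathrm{env}} + s_{\mathrm{GAIL}}\, r_t^{\mathrm{GAIL}}, \qquad r_t^{\mathrm{GAIL}} \approx -\log\bigl(1 - D_\phi(s_t, a_t)\bigr)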

3.3.2 Proximal Policy Optimization


Initial releases of the ML-Agents toolkit were primarily based on PPO. ML-Agents comes with an on-policy implementation of PPO that performs multiple epochs of mini-batch updates on its surrogate objective, which in turn improves the policy. To make PPO more effective, one can stack several environments for training; ML-Agents can handle the observations collected from stacked environments even if the episode length is not fixed. The algorithm selection and hyperparameters for training in Unity are defined in a configuration file; the details can be found in Appendix B.
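For reference, the clipped surrogate objective maximized by PPO (Schulman et al., 2017a) is

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Bigl[\min\bigl(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\bigl(r_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_t\bigr)\Bigr], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat{A}_t is the advantage estimate; the clipping parameter corresponds to epsilon (0.2) and the GAE parameter to lambd (0.95) in the configuration files of Appendix B.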

3.3.3 Soft Actor-Critic


The newer releases (v0.9+) of ML-Agents come with a SAC implementation that supports both discrete and continuous action spaces. SAC is an off-policy algorithm: to improve the policy it samples from a collection of past experiences known as the experience (replay) buffer, whose size and other parameters are set in the configuration file. SAC is particularly well suited to tasks that require continuous control.
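For reference, SAC optimizes the maximum entropy objective (Haarnoja et al., 2018)

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\bigl[r(s_t, a_t) + \alpha\,\mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr)\bigr],

which augments the expected return with the entropy of the policy; the temperature \alpha is adjusted during training, and its starting value corresponds to init_entcoef in the configuration files of Appendix B.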

3.3.4 Self-Play
Self-play is a technique used in reinforcement learning where the agent is trained against copies of itself. The current policy is played against former policies while performing greedy exploration; the better one is retained and the process continues. This approach can be used to implement scenarios where multiple agents of the same type compete, improving the policy in the process. Unity ML-Agents supports self-play by assigning a different team ID to each competing member.

3.3.5 Raycast and Ray Perception Sensor


Raycast is a technique used in Unity for detecting physics-based bounds, also known as colliders. In a raycast, an invisible ray is shot from a specified origin in a specified direction to detect a collider within a specified range; the raycast returns a "RaycastHit" structure (as an out variable) that can be used to identify the features of the collider obstructing the ray.
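The following is a minimal, hypothetical sketch of a single raycast in Unity; the origin, direction, and range values are illustrative and not the project's actual settings.

using UnityEngine;

public class RaycastSketch : MonoBehaviour
{
    void Update()
    {
        // Shoot an invisible ray from this object's position along its forward
        // direction, up to 20 units; 'hit' is filled in through the out parameter.
        RaycastHit hit;
        if (Physics.Raycast(transform.position, transform.forward, out hit, 20f))
        {
            string hitTag = hit.collider.tag;   // e.g. "wall", "agent", "ammo"
            float distance = hit.distance;      // distance from origin to the collider
            Debug.Log("Hit " + hitTag + " at " + distance + " units");
        }
    }
}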
Ray Perception Sensor is a component class that comes with the ML-Agents SDK and can be used to detect the surroundings using raycasts. It provides a graphical interface with configurable parameters, such as the number of rays, the angle of the rays, and the object tags to detect, which can be set as required. Compared with image-based observation, the Ray Perception Sensor offers a significant advantage in simplifying the network architecture, though it might not always be the preferable source of observations.

CHAPTER 4

IMPLEMENTATION

Description of the Game Environment


The game environment created for this research is an isometric shooter; it is rendered in an isometric view to provide better visibility for spectators. Four players play the game at a time. All four players belong to different teams and compete against each other, and the last survivor is deemed the winner. Players are allowed to collect items from the environment, which can provide an advantage over other players. Players can hurt each other either by shooting bullets from long range or by using a melee attack at close range. The amount of ammo a player receives at the start of the game is limited, and the agent needs to collect more as the game progresses; if all the ammo is exhausted, players can still use melee attacks to hurt and kill each other.
The game environment consists of a plane bounded by walls and surrounded by a spherical dome of electromagnetic field. Players can traverse freely where they are not obstructed by walls. Players control rotation and movement separately, which means it is possible to move in a direction different from where the gun is aimed. The electromagnetic field is termed the zone. The zone shrinks with time; if players get outside of it they receive damage, which can result in death and losing the game. It is therefore vital to stay within the zone, but after a certain time the zone vanishes entirely and players have to either fight for victory or die from zone damage. The bounded area has sub-partitions within it, creating sections within the environment. The agents can collect three types of objects (a sketch of the pickup rules is given after Figure 4.1):
 Health - On collection increases current health, though current health is always capped by an upper limit that this item cannot raise. Collecting this item when current health is already at the maximum destroys the item without any effect.
 Booster - On collection increases the maximum health value (the capped threshold) and also increases current health so as to retain the former ratio of current health to maximum health.
 Ammo – On collection increases the amount of ammo held by the agent. Each agent has a limited capacity for holding ammo. Collecting this item when the capacity is already full destroys the item without any effect. This can be used as a strategy: an agent can destroy resources so that others cannot utilize them.
Figure 4.1 shows the three collectable items.

Figure 4.1 Collectable items
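The pickup rules described above can be sketched as follows; this is an illustrative, hypothetical sketch in which the field names, starting values, and booster scaling constant are placeholders rather than the project's actual code.

using UnityEngine;

public class PickupRulesSketch
{
    public float currentHealth = 60f;
    public float maxHealth = 100f;     // the capped threshold
    public int ammo = 10;
    public int ammoCapacity = 30;

    // Health: raise current health, but never above the current cap.
    public void CollectHealth(float amount)
    {
        currentHealth = Mathf.Min(currentHealth + amount, maxHealth);
    }

    // Booster: raise the cap and scale current health to keep the former ratio.
    public void CollectBooster(float capIncrease)
    {
        float ratio = currentHealth / maxHealth;
        maxHealth += capIncrease;
        currentHealth = ratio * maxHealth;
    }

    // Ammo: add to the stock, clamped to capacity; the item is consumed either way.
    public void CollectAmmo(int amount)
    {
        ammo = Mathf.Min(ammo + amount, ammoCapacity);
    }
}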

Agent State and Observations


All the agents collect observations in two ways:
 Raycast-based observations: The agent casts 10 rays in the anticlockwise direction and 10 in the clockwise direction, spanning +175 and -175 degrees respectively. Detectable tags (objects) include ammo, agent, wall, health, and booster. Each ray hitting a target returns information such as the distance and the object's tag. The zone and the bullet projectiles are exceptions to raycast-based detection; the zone is set to a different physics layer and does not obstruct any of the rays. Figure 4.2 shows the raycasts (the red lines) performed by the Ray Perception Sensor on an agent.

Figure 4.2 Showing Raycast from an agent

 Vector observations: The agent receives observations in the form of vectors for the direction of the centre, the forward direction, the last rotation about the y-axis, the last movement direction, whether an object is in melee range, shooting range, or no range, the time elapsed since the game started, the current health, and the ammo held.

Action Set
For this research, all actions that the agent can take are discrete in nature, and the decision polling rate of the agent is 20 decisions (sets of actions) per second. Discretizing the action domain simplifies the problem and leads to faster convergence. The action set consists of five segments, described below (a sketch of how these branches can be decoded is given after the list):
 Action segment 1 controls vertical movement. The values can be 0, 1, and 2 where “0”
represents no movement, “1” represents negative direction, and “2” represents positive
movement.
 Action segment 2 controls horizontal movement. The values can be 0, 1, and 2 where
“0” represents no movement, “1” represents negative direction, and “2” represents
positive movement.
 Action segment 3 provides the direction of rotational movement. The values can be 0, 1, and 2 where "0" represents no rotation, "1" represents the clockwise direction, and "2" represents the anticlockwise direction.
 Action segment 4 decides the attack, the agent may choose not to attack if the enemy is
not in range. The values can be 0, 1, and 2 where “0” represents no attack, “1” represents
shooting action, and “2” represents melee action.
 Action segment 5 controls the magnitude of rotation. The values can be 0, 1, 2, 3, and 4
where “0” represents 0 degrees, “1” represents 1 degree, “2” represents 2 degrees, “3”
represents 4 degrees, and “4” represents 8 degrees.
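A hypothetical sketch of how these five branches could be decoded inside the agent's action callback is shown below; the helper methods (Move, Rotate, Shoot, Melee) are illustrative placeholders, not the project's actual code.

using Unity.MLAgents;

public class ActionDecodeSketch : Agent
{
    static readonly float[] Degrees = { 0f, 1f, 2f, 4f, 8f };

    public override void OnActionReceived(float[] vectorAction)
    {
        int vertical   = (int)vectorAction[0];  // 0 none, 1 negative, 2 positive
        int horizontal = (int)vectorAction[1];  // 0 none, 1 negative, 2 positive
        int rotateDir  = (int)vectorAction[2];  // 0 none, 1 clockwise, 2 anticlockwise
        int attack     = (int)vectorAction[3];  // 0 none, 1 shoot, 2 melee
        int rotateMag  = (int)vectorAction[4];  // index into the Degrees table

        Move(vertical, horizontal);
        Rotate(rotateDir, Degrees[rotateMag]);
        if (attack == 1) Shoot();
        else if (attack == 2) Melee();
    }

    void Move(int vertical, int horizontal) { /* apply movement */ }
    void Rotate(int direction, float degrees) { /* apply rotation */ }
    void Shoot() { /* fire a bullet if ammo is available */ }
    void Melee() { /* perform a melee attack */ }
}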

Reward Structure
The reward structure for the agent is as follows (a sketch of how a few of these terms are applied appears after the list):
 Positive small reward for every time step for surviving.
 Positive reward on shooting, if the enemy is in gun range.
 Positive reward on damaging an enemy.
 Positive reward on killing an enemy.
 Positive reward on melee attack, if the enemy is in range.

 Positive reward on item collection. Each item has a characteristic reward based on usage
as per scenario. For example, taking health items when your health is full won’t result
in any positive reward.
 Large positive reward on winning the game.
 Negative reward on shooting, if the enemy is not in gun range.
 Negative reward on melee attack, if the enemy is not in melee range.
 Negative reward on receiving damage.
 Large negative reward on death.
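A few of the listed terms, wired through the toolkit's AddReward and EndEpisode calls, are sketched below; all reward magnitudes here are illustrative placeholders and not the values used for training.

using Unity.MLAgents;

public class RewardShapingSketch : Agent
{
    public void OnShoot(bool enemyInGunRange)   { AddReward(enemyInGunRange ? 0.1f : -0.05f); }
    public void OnMelee(bool enemyInMeleeRange) { AddReward(enemyInMeleeRange ? 0.1f : -0.05f); }
    public void OnEnemyDamaged()                { AddReward(0.2f); }
    public void OnEnemyKilled()                 { AddReward(0.5f); }
    public void OnDamaged()                     { AddReward(-0.1f); }
    public void OnDeath() { AddReward(-1.0f); EndEpisode(); }  // large negative reward, episode ends
    public void OnWin()   { AddReward(1.0f);  EndEpisode(); }  // large positive reward, episode ends
}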

Training Enhancement
Training may take a large number of steps and a long time if the agent is directly exposed to the raw optimization problem, especially if the reward received is sparse. Usually, complex RL problems are broken down into smaller ones to aid the learning process: initially the agent is trained on a somewhat easier problem and the complexity is then slowly increased. This approach is known as curriculum learning. The environment implemented in this research is similar in spirit to curriculum learning but does not have a distinct implementation for each smaller problem; rather, the agent is exposed to a different set of problems sequentially within the very same episode. For instance, to learn the behaviour "shoot at the enemy, and only at the enemy", the following events need to take place together:
 The enemy should be in the range of the agent. Since the enemy is also one of the self-play agents, the agents first need to learn to navigate the map, or else the range condition will never be fulfilled.
 The agent needs to orient itself towards the enemy.
 The agent needs to select the shooting action.
 The agent needs to have ammo loaded onto the gun.
 The above steps need to repeat during training.
It can be observed that these events do not occur very frequently. To counter the resulting sparse and infrequent rewards and to speed up training, the following has been implemented:
 Every time a new episode begins, the agents spawn in a location surrounded by four walls, one of which is destructible: it can be destroyed either by shooting a bullet or by performing a melee attack and provides the same reward as damaging an enemy. The destructible wall is also perceived as an enemy by the agent, given that the same identification tag is assigned to both enemies and the destructible wall. This initial setup greatly helps the agent understand that shooting the enemy is a favourable action. Figure 4.5.1 shows the agent spawn location with a green outline and the destructible wall with a black outline in one of the corners of the map.

Figure 4.5.1 Agent spawn location and destructible wall

 The zone naturally supports training, as it shrinks after a definite interval of time and agents receive damage if they stay outside of it. The zone thus forces agents to move towards the centre, increasing the probability of favourable events for shooting and combat.
 Agents are rewarded for the duration of survival; the survival reward scales arithmetically with time, pushing the agents to try to survive for longer. The win reward is adjusted such that agents prefer victory over mere survival duration. Furthermore, to increase the aggressiveness of the agents and encourage attacking strategies, the reward received for damaging and killing enemies is scaled up several fold for training.
 For observation collection, raycasting is used instead of CNNs to reduce the training
time and model complexity.
 Apart from the destructible walls, dummies have been placed at multiple locations in the map to increase the frequency of favourable events for shooting. Dummies yield the same reward as damaging enemies, can be destroyed with just one hit, and are also identified as enemies by the agents. Figure 4.5.2 shows the location of dummies and agents in the environment; dummies are shown with a yellow outline and agents with a pinkish outline.

Figure 4.5.2 Showing agents and dummies

CHAPTER 5

RESULTS AND EVALUATION

Performance Evaluation
This research focuses on four different training configurations, namely PPO, SAC, the PPO-GAIL hybrid, and the SAC-GAIL hybrid. The agents are trained separately using these four techniques and their performance parameters are then evaluated. The performance evaluation is done on the same battleground over a series of 100 deathmatches: in each game (after training is complete), each agent is randomly spawned in one of the four default positions and the agents then face off against each other. It should be noted that the agents were trained with the different algorithms in isolation, using self-play, for a predefined number of steps. The performance parameters considered are as follows:
 Survival duration – A higher survival time indicates that the agent has favoured evasive strategies over attacking ones.
 Win ratio – Since winning is the primary objective of the game, a higher win ratio suggests that the AI has firmly learned the macro objectives and can generalize well.
 Kills obtained per game – Killing is the direct path to victory, but it also exposes players and poses risk. High kills per game with a low win ratio would suggest that the model failed to learn the macro objective of "how to win"; high kills with a high win ratio would suggest a very well generalized model with a good understanding of both macro and micro strategies, and overall superior training.
 Items collected per game – This reflects micro strategy and can be vital in the end game; however, its significance is somewhat lower than kills obtained, as exploring the environment also exposes the agent to risky situations and the importance of item collection varies over the course of the game. For example, health items are only worth collecting when current health is lower than maximum health.
 Damage dealt per game – A high damage value combined with losses would suggest that the model's capabilities are outweighed by another agent's, or that the AI is exposing itself to risky positions, such as frequently coming into the range of two or more agents; in other words, the agent has learned the micro strategy of shooting very well but has failed to learn the macro strategy of positioning itself on the battleground. A high damage value combined with a high win ratio and low kills would suggest that the AI has opted for an alternative strategy of damaging others while taking small risks and then waiting for them to be killed by the zone or by other players, meanwhile evading imminent danger.

Training Statistics
During training the agents are trained using self-play. Initially the agents learn strategies for survival, and the cumulative reward obtained improves with each iteration. With time the agents learn to devise better strategies than the previous model, and the faster the agents can counter previously learned strategies, the lower the cumulative reward becomes. It should be noted that here cumulative reward refers to the reward collected by all four competing agents combined. It can be observed from Figure 5.2.1 that over the training period PPO achieved the highest cumulative reward.



Figure 5.2.1 Training Statistics: Cumulative reward

Elo rating is a way of measuring the skill level of the competing agents in self-play (the standard update rule is recalled after Figure 5.2.2). An agent with a higher Elo rating is likely to obtain more wins than an agent with a lower Elo rating, but since each training run was performed using self-play in isolation, the ratings of the different runs are not directly comparable. The training Elo ratings are presented in Figure 5.2.2.



Figure 5.2.2 Training Statistics: Elo Rating
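For reference, the standard Elo update after a match between agents A and B is

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),

where S_A is 1 for a win, 0.5 for a draw, and 0 for a loss, and the K-factor controls how quickly the rating reacts to new results. The exact constants used by the toolkit may differ; the initial rating of 1200 corresponds to initial_elo in Appendix B.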

Entropy in RL measures the randomness of the policy's action distribution; a decreasing entropy generally indicates that the policy is becoming more confident about which action to take (the definition is recalled after Figure 5.2.3). In this research, training via SAC achieved the lowest entropy, as presented in Figure 5.2.3. This can be attributed to the nature of the algorithm: SAC explicitly includes policy entropy in its objective (the maximum entropy framework) and automatically tunes the entropy coefficient during training, so its entropy profile differs from that of PPO; the same can be observed with the GAIL hybrids of PPO and SAC.



Figure 5.2.3 Training Statistics: Entropy
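For the discrete action branches used in this research, the entropy of the policy at a state s is the Shannon entropy of its action distribution,

\mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) = -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s),

so a lower value indicates a more deterministic, more confident policy.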

As the agents are trained via self-play and the policy improves, they are likely to learn to kill each other at a faster rate; this is reflected in a shortening of the episode length. From Figure 5.2.4 it can be observed that training via PPO leads to the largest reduction in episode length. This possibly signifies a more aggressive policy learnt by the PPO agents, who may have started preferring direct combat over a more strategic approach.



Figure 5.2.4 Training Statistics: Episode Length

Performance Statistics
It can be observed that the training statistics provide no reliable means of identifying how well each algorithm has worked, especially when the agents are trained with the different techniques in isolation using self-play. In order to assess the quality of the policies learnt, the agents trained using the different techniques (PPO, SAC, PPO-GAIL, and SAC-GAIL) were set against each other in a series of 100 deathmatches with randomized spawn positions and equal starting resources; the observed performance characteristics are given in Table 5.3.1.

                                        PPO      SAC      PPO-GAIL   SAC-GAIL
Mean survival duration (seconds)        35.73    37.17    51.43      54.48
Win ratio                               0.07     0.09     0.53       0.31
Mean kills obtained per game            0.29     0.19     1.29       0.5
Mean items collected per game           1.2      1.79     4.5        2.9
Mean damage dealt per game              74       46       198        86.5

Table 5.3.1 Performance characteristics

From Table 5.3.1, it can be observed that the agent trained using the PPO-GAIL hybrid achieved the best outcomes on almost all of the performance characteristics. The same results are also presented graphically in Figure 5.3.1.

Figure 5.3.1 Performance characteristics: bar charts of mean survival duration, win ratio, mean kills per game, mean items collected per game, and mean damage dealt per game for PPO, SAC, PPO-GAIL, and SAC-GAIL

Result Summary
It was found that the PPO-GAIL hybrid agent achieved the highest number of wins in the deathmatch validation. Apart from wins, the PPO-GAIL agent shows unmatched performance on the three other performance parameters (kills, damage dealt, and items collected) and performed very close to the SAC-GAIL agent in terms of survival duration. The agents using the vanilla implementations of both PPO and SAC fall far below their GAIL hybrids. On closer inspection, PPO opted for a more aggressive policy compared to SAC, given its higher damage and higher kills, whereas SAC has more items collected and a higher survival duration, suggesting that it opted for a more evasive strategy.

CHAPTER 6

CONCLUSION AND RECOMMENDATIONS

Introduction
This research introduces a novel game environment on which the evaluation of four algorithms – PPO, SAC, PPO-GAIL, and SAC-GAIL – was performed. From the results it can be inferred that, for this research, the performance of the training techniques in decreasing order was: PPO-GAIL, SAC-GAIL, SAC, and PPO. The following sections cover the interpretation of the results and the findings.

Discussion and Conclusion


PPO-GAIL can be said to have learnt the better policy considering all the performance parameters. It can be concluded from this research that agents trained via GAIL learn faster (they require fewer steps to learn the policy) and also generalize well. It should be noted that this research was performed with a limit of 2 million steps; by nature, the SAC algorithm performs better with a larger number of exploration steps and a larger buffer size, so the results could differ if more favourable conditions were offered to SAC. The performance of the algorithms is also likely to change if the action set is changed to a continuous space.
It can be observed that the GAIL hybrids of both PPO and SAC triumphed over their vanilla implementations, which signifies the importance of imitation learning. It is, however, difficult to estimate the exact or minimum number of demonstrations one would require for a specific problem, as there is no precise mathematical measure of problem complexity that can be related to the number of demonstrations. Where demonstrations are available in ample quantity, the use of GAIL should be preferred. Since GAIL significantly boosted the performance of both PPO and SAC, it is safe to conclude that GAIL positively impacts both algorithms, but the improvement seen in the case of PPO is much higher than for SAC. The effectiveness of GAIL is also likely to vary with the number of training steps involved, which opens up the possibility of future research with an increased number of steps.

Contribution
This research introduces a novel environment for RL research, created using the Unity game
engine. This environment can be used for validation of new model architectures and evaluation
of other algorithms.
The major findings and the answers to the research questions are given below:
 Which of the two methods, PPO (on-policy) and SAC (off-policy), is more suitable for training RL agents in a complex environment?
Answer - For training agents in a complex environment with a discrete observation set and action set, PPO works better than SAC.
 Can imitation learning enhance the training quality and time for RL agents in complex scenarios, or does it rather prevent the agent from fully exploring the action space and degrade its performance?
Answer - Imitation learning, when used with a smaller number of training steps (<10M) and especially when demonstrations are available in ample quantity, can lead to a very significant boost in training performance.
 Does imitation learning (GAIL) affect PPO and SAC differently, and if so, what are the implications?
Answer - GAIL seems to have a positive effect on both SAC- and PPO-based training, though the magnitude of the impact is higher for PPO than for SAC.

Future Recommendations
As the research was performed with a relatively small number of steps (<10M) and a simple neural network model, the following could be considered for inclusion in the scope of future, more elaborate research:
 Identify the results by increasing the number of training steps to 10M-100M and beyond
100M.
 Craft a mathematical model for the level of complexity by considering the observation
set and the action set.
 Correlate the number of demonstrations required for GAIL with model complexity.
 Identify the agent performance when a recurrent network is used with PPO and SAC.
 Identify the impact of GAIL when an LSTM network is used in conjunction with PPO and SAC.

REFERENCES

Abdolmaleki, A., Springenberg, J.T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M.,
(2018) Maximum a Posteriori Policy Optimisation. 6th International Conference on Learning
Representations, ICLR 2018 - Conference Track Proceedings. [online] Available at:
http://arxiv.org/abs/1806.06920 [Accessed 7 Aug. 2020].
Alagoz, O., Hsu, H., Schaefer, A.J. and Roberts, M.S., (n.d.) Markov Decision Processes: A
Tool for Sequential Decision Making under Uncertainty.
Andersen, P.A., Goodwin, M. and Granmo, O.C., (2018) Deep RTS: A Game Environment
for Deep Reinforcement Learning in Real-Time Strategy Games. IEEE Conference on
Computatonal Intelligence and Games, CIG, 2018-Augus, pp.1–8.
Anon (2020) Soft Actor Critic—Deep Reinforcement Learning with Real-World Robots – The
Berkeley Artificial Intelligence Research Blog. [online] Available at:
https://bair.berkeley.edu/blog/2018/12/14/sac/ [Accessed 11 Jun. 2020].
Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., (2017) Deep
reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 346, pp.26–38.
Barto, A.G., Sutton, R.S. and Anderson, C.W., (1983) Neuronlike Adaptive Elements That
Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man and
Cybernetics, SMC-135, pp.834–846.
Baxter, J., Tridgell, A. and Weaver, L., (2000) Learning to play chess using temporal
differences. Machine Learning, 403, pp.243–263.
Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M., (2012) The Arcade Learning
Environment: An Evaluation Platform for General Agents. IJCAI International Joint
Conference on Artificial Intelligence, [online] 2015-January, pp.4148–4152. Available at:
http://arxiv.org/abs/1207.4708 [Accessed 15 Jun. 2020].
Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M., (n.d.) The Arcade Learning
Environment: An Evaluation Platform For General Agents (Extended Abstract) *. [online]
Available at: http://arcadelearningenvironment.org [Accessed 14 Jun. 2020].
Chang, Y., Erera, A.L. and White, C.C., (2015) Value of information for a leader–follower
partially observed Markov game. Annals of Operations Research, 2351, pp.129–153.
Crites, R.H. and Barto, A.G., (1983) An Actor/Critic Algorithm that is Equivalent to Q-Learning.
Deane, (2018) Bullet Physics For Unity | Physics | Unity Asset Store. [online] Available at:
https://assetstore.unity.com/packages/tools/physics/bullet-physics-for-unity-62991 [Accessed
15 Jun. 2020].
Degris, T., White, M. and Sutton, R.S., (2012) Off-Policy Actor-Critic.
Deisenroth, M.P., Neumann, G., Peters, J., Deisenroth, M.P., Neumann, G. and Peters, J.,
(2011) A Survey on Policy Search for Robotics. Foundations and Trends R in Robotics, 22,
pp.1–142.
Fujimoto, S., van Hoof, H. and Meger, D., (2018) Addressing Function Approximation Error
in Actor-Critic Methods. 35th International Conference on Machine Learning, ICML 2018,

[online] 4, pp.2587–2601. Available at: http://arxiv.org/abs/1802.09477 [Accessed 7 Aug.
2020].
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H.,
Gupta, A., Abbeel, P. and Levine, S., (2018) Soft Actor-Critic Algorithms and Applications.
van Hasselt, H., Guez, A. and Silver, D., (2015) Deep Reinforcement Learning with Double
Q-learning. 30th AAAI Conference on Artificial Intelligence, AAAI 2016, [online] pp.2094–
2100. Available at: http://arxiv.org/abs/1509.06461 [Accessed 10 May 2020].
Hosu, I.-A. and Rebedea, T., (2016) Playing Atari Games with Deep Reinforcement Learning
and Human Checkpoint Replay. [online] Available at: http://arxiv.org/abs/1607.05077
[Accessed 10 May 2020].
Hu, H. and Foerster, J.N., (2019) Simplified Action Decoder for Deep Multi-Agent
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1912.02288 [Accessed 10
May 2020].
Juliani, A., Henry, H. and Lange, D., (2018) Unity : A General Platform for Intelligent
Agents. pp.1–18.
Kempka, M., Wydmuch, M., Runc, G., Toczek, J. and Jaskowski, W., (2016) ViZDoom: A
Doom-based AI research platform for visual reinforcement learning. In: IEEE Conference on
Computatonal Intelligence and Games, CIG. IEEE Computer Society.
Kober, J., Bagnell, J.A. and Peters, J., (2013) Reinforcement learning in robotics: A survey.
International Journal of Robotics Research, 3211, pp.1238–1274.
Kober, J. and Peters, J., (n.d.) Policy Search for Motor Primitives in Robotics.
Kolbe, M., Reuss, P., Schoenborn, J.M. and Althoff, K.D., (2019) Conceptualization and
implementation of a reinforcement learning approach using a case-based reasoning agent in a
FPS scenario. CEUR Workshop Proceedings, 2454.
Kormushev, P., Calinon, S. and Caldwell, D.G., (2013) Reinforcement Learning in Robotics:
Applications and Real-World Challenges †. Robotics, [online] 2, pp.122–148. Available at:
www.mdpi.com/journal/roboticsArticle [Accessed 7 Jun. 2020].
Lample, G. and Chaplot, D.S., (2017) Playing FPS games with deep reinforcement learning.
31st AAAI Conference on Artificial Intelligence, AAAI 2017, 2015, pp.2140–2146.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra,
D., (2016) Continuous control with deep reinforcement learning. In: 4th International
Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra,
D., (n.d.) Continuous Control with Deep Reinforcement Learning. [online]
Available at: https://goo.gl/J4PIAz [Accessed 14 Jun. 2020].
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D. and
Kavukcuoglu, K., (2016) Asynchronous Methods for Deep Reinforcement Learning. 33rd
International Conference on Machine Learning, ICML 2016, [online] 4, pp.2850–2869.
Available at: http://arxiv.org/abs/1602.01783 [Accessed 10 May 2020].
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and
Riedmiller, M., (2013) Playing Atari with Deep Reinforcement Learning. [online] Available
at: http://arxiv.org/abs/1312.5602 [Accessed 9 May 2020].
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A.,
Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A.,
Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. and Hassabis, D., (2015)
Human-level control through deep reinforcement learning. Nature, 5187540, pp.529–533.
OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi,
D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J.,
Petrov, M., Pinto, H.P. de O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S.,
Sutskever, I., Tang, J., Wolski, F. and Zhang, S., (2019) Dota 2 with Large Scale Deep
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1912.06680 [Accessed 10
May 2020].
Samsuden, M.A., Diah, N.M. and Rahman, N.A., (2019) A review paper on implementing
reinforcement learning technique in optimising games performance. In: 2019 IEEE 9th
International Conference on System Engineering and Technology, ICSET 2019 - Proceeding.
Institute of Electrical and Electronics Engineers Inc., pp.258–263.
Schulman, J., Levine, S., Moritz, P., Jordan, M.I. and Abbeel, P., (2015) Trust Region Policy
Optimization. 32nd International Conference on Machine Learning, ICML 2015, [online] 3,
pp.1889–1897. Available at: http://arxiv.org/abs/1502.05477 [Accessed 3 Aug. 2020].
Schulman, J., Moritz, P., Levine, S., Jordan, M.I. and Abbeel, P., (2016) High-Dimensional Continuous Control Using Generalized Advantage Estimation. pp.1–14.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017a) Proximal Policy
Optimization Algorithms. pp.1–12.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017b) Proximal Policy
Optimization Algorithms. [online] Available at: http://arxiv.org/abs/1707.06347 [Accessed 11
Jun. 2020].
Shao, K., Zhao, D., Li, N. and Zhu, Y., (2018a) Learning Battles in ViZDoom via Deep
Reinforcement Learning. In: IEEE Conference on Computatonal Intelligence and Games,
CIG. IEEE Computer Society.
Shao, K., Zhu, Y. and Zhao, D., (2018b) StarCraft Micromanagement with Reinforcement
Learning and Curriculum Transfer Learning. IEEE Transactions on Emerging Topics in
Computational Intelligence, [online] 31, pp.73–84. Available at:
http://arxiv.org/abs/1804.00810 [Accessed 10 May 2020].
Silver, D., Heess, N., Degris, T., Wierstra, D. and Riedmiller, M., (2014) Deterministic Policy
Gradient Algorithms.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G.,
Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D.,
Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel,
T. and Hassabis, D., (2016) Mastering the game of Go with deep neural networks and tree
search. Nature, 5297587, pp.484–489.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre,
L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. and Hassabis, D., (2018) A general
reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science,
3626419, pp.1140–1144.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T.,
Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche,
G., Graepel, T. and Hassabis, D., (2017) Mastering the game of Go without human
knowledge. Nature, 5507676, pp.354–359.
Sutton, R.S., (1988) Learning to Predict by the Methods of Temporal Differences. Machine
Learning, 31, pp.9–44.
Sutton, R.S. and Barto, A.G., (n.d.) Reinforcement Learning: An Introduction Second edition,
in progress.
Tesauro, G., (1994) TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-
Level Play. Neural Computation, 62, pp.215–219.
Todorov, E., (2018) MuJoCo Unity Plugin. [online] Available at:
http://www.mujoco.org/book/unity.html [Accessed 15 Jun. 2020].
Todorov, E., Erez, T. and Tassa, Y., (2012) MuJoCo: A physics engine for model-based
control. In: IEEE International Conference on Intelligent Robots and Systems. pp.5026–5033.
Tsitsiklis, J.N. and Roy, B. Van, (1997) An Analysis of Temporal-Difference Learning with
Function Approximation. IEEE Transactions on Automatic Control.
Utocurricula, A. and Powell, G., (2020) Emergent Tool Use from Multi-Agent Autocurricula.
Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi,
D.H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I.,
Huang, A., Sifre, L., Cai, T., Agapiou, J.P., Jaderberg, M., Vezhnevets, A.S., Leblond, R.,
Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T.L., Gulcehre, C., Wang,
Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul,
T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C. and Silver, D., (2019) Grandmaster
level in StarCraft II using multi-agent reinforcement learning. Nature, 5757782, pp.350–354.
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani,
A., Küttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan,
K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T., Calderone, K., Keet, P., Brunasso, A.,
Lawrence, D., Ekermo, A., Repp, J. and Tsing, R., (2017) StarCraft II: A New Challenge for
Reinforcement Learning. [online] Available at: http://arxiv.org/abs/1708.04782 [Accessed 10
May 2020].
Wu, Y. and Tian, Y., (2017) Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning. [online] Available at:
http://vizdoom.cs.put.edu.pl/competition-cig-2016/results [Accessed 9 May 2020].

APPENDIX A: RESEARCH PROPOSAL

PRASHANT PANDEY
MSC IN MACHINE LEARNING AND AI

Research title

Co-operative shooter AI for a battle royal game using deep reinforcement learning

ABSTRACT

Building multi-agent interactions and long term goal planning are some of the key challenges
in the field of AI and robotics. Long term goal planning can also be perceived as a challenge
in deciphering very long sequences with numerous possibilities. Overcoming these challenges
can help us build complex AI behaviors and interactions.

TABLE OF CONTENTS

1. Introduction 46
2. Background and related research 46
3. Research Questions (If any) 47
4. Aim and Objectives 47
5. Research Methodology 48
6. Expected Outcomes 49
7. Requirements / resources 49
8. Research Plan 50
9. References 50

1. INTRODUCTION

Deep reinforcement learning has caused a breakthrough in the field of AI research. The models or agents built in computer games or simulations reveal interesting techniques that are being utilized in building complex robotics, in automating power systems through load prediction, in inventory management, and in predicting stock market prices and the associated risks of buying and selling (Deep Reinforcement Learning and Its Applications - Inteliment Technologies, 2020). It can also be utilized in building prediction models for phenomena governed by multiple factors; one such example would be weather prediction. This research targets building a co-operative AI for a shooter battle royal game. This involves multi-agent interaction built using curriculum training. Compared to single-agent training, learning cooperative behavior poses a different challenge, as the agents need to build an understanding of the final goal and converge their actions in such a way that each of them contributes towards achieving the goal, while the actions taken should in turn create favorable conditions for all the other agents in the same team.

2. BACKGROUND AND RELATED RESEARCH

An increase in computational power was followed by a revolution that enabled the application of neural networks to multiple domains where state-of-the-art models had been limited; one such area is reinforcement learning. With the aid of neural networks the capabilities of reinforcement learning were greatly enhanced, and this amalgamation was named deep reinforcement learning. Researchers have come up with many different kinds of environments for deep reinforcement learning, such as ViZDoom (Wydmuch et al., 2018), StarCraft (Vinyals et al., 2019), the capture-the-flag variant of Quake III by DeepMind (Jaderberg et al., 2019), multi-agent autocurricula (Utocurricula and Powell, 2020), and Unity 3D (Juliani et al., 2018). Some of these environments are targeted towards very specific research areas, whereas others, like Unity 3D, provide the user with a complete game engine for building their own environment for research. Notable work has been done by OpenAI (OpenAI et al., 2019) in multi-agent interaction by building OpenAI Five for playing Dota 2; OpenAI trained their agents using proximal policy optimization (Schulman et al., 2017). OpenAI Five was the first computer program to defeat an esport world champion. DeepMind has also done significant work in multi-agent interaction: they created a capture-the-flag variant of Quake III Arena and demonstrated that multi-agent interaction can be fine-tuned by using a model architecture based on a recurrent latent variable model (Chung et al., 2015). In this architecture they utilized two RNNs, one immediately fed with the past states and the other fed a series of past states after a delay, for the evaluation of actions and rewards. Before DeepMind there were multiple successful attempts at utilizing RNNs with deep reinforcement learning. OpenAI's multi-agent autocurricula (Utocurricula and Powell, 2020) also utilizes proximal policy optimization and Generalized Advantage Estimation (Schulman et al., 2016) for the training of agents, and they demonstrated that new coordinating behaviors appear with training. This research on computer games and simulations is crucial for investigating and understanding techniques for building multi-agent co-operative interaction in partially observable environments with a large number of unknowns, which involve long-term objective planning while at the same time the agent has to perform immediate actions that converge with the long-term goal.

3. RESEARCH QUESTIONS (IF ANY)

The research questions are:

Can reinforcement learning be used for the creation of co-operative shooter AI for a battle royal
game?

Is the performance of curriculum-based reinforcement learning for achieving co-operative behavior better than that of vanilla proximal policy optimization-based reinforcement learning?

4. AIM AND OBJECTIVES

The main aim of this research is to propose a possible solution for engineering AI with cooperative behavior using reinforcement learning that is capable of understanding macro and micro strategies in a battle-royal game environment. This study can later be exploited in the
implementation of co-operative behavior and swarm intelligence, which is of great use in
advanced robotics. The research also provides a comparison between curriculum-based learning
and vanilla proximal policy-based learning for building co-operative behavior.

The research objectives, formulated based on the aim of this study, are as follows:

 To identify the reward structure for the training


 To identify the curriculum for learning macro and micro strategy
 To suggest the curriculum to improve team-play behavior
 To evaluate the performance of the model against hardcoded AI that has resource and
health advantage

 To compare the results of hybrid imitation-curriculum-based training with that of results
obtained by agents trained directly through proximal policy optimization

5. RESEARCH METHODOLOGY

The training and evaluation of agents will be done in a 3D environment where agents can move freely and attack or kill each other either with melee attacks (attacks that can only be executed if agents are sufficiently close) or with weapons that can be collected from the 3D environment. Agents will work in teams of 3, and the last standing team, or segment of a team, will be deemed the winner of the session/game. Performance statistics such as the number of kills and resources collected will also be recorded and will form part of the reward structure. These statistics will be used for analysing the induced behavior and for curriculum planning and improvement.

The environment will be created within Unity 3D; the platform offers an implementation of state-of-the-art deep reinforcement learning via proximal-policy-based learning for ML-Agents (Juliani et al., 2018). ML-Agents is a toolkit offered by Unity for creating and training reinforcement-learning-based agents. The agent will access the internal game state along with images from a camera sensor component to perceive the world around it. Every agent will have a periphery of vision beyond which it will not be able to detect enemies; a smaller periphery of vision allows faster training on smaller battlegrounds or maps.

Once the environment is in place, the next step would be the identification and understanding of the rules of the environment in order to formulate a reward structure to be used for training the ML agents. Since a battle royal game has multiple objectives, such as resource collection and tactical positioning, the agent needs to be trained using curriculum learning and imitation learning to speed up training and to ensure that the agent is exposed to scenarios that promote team-play over individualistic behavior. In the initial stages of training the agent will be trained using imitation learning to learn some of the micro strategies. Breaking down the objectives into hierarchies of micro and macro strategies (Zhang et al., 2018) can be more effective compared to vanilla proximal-policy-based learning. This will be followed by training the agent in a stepwise manner, exposing it to different subsets of the original environment, with increasingly complex or bigger subsets, for learning the macro strategy of the game (ml-agents/ML-Agents-Overview.md at master · Unity-Technologies/ml-agents, 2020). Once the AI has learned some basic micro and macro strategies, the agents will be exposed to an environment where the only possible way of winning is through teamwork. This is ensured by pitting the already-trained AI, equipped with basic strategies, against a hardcoded AI that has an advantage in terms of health, health regeneration, weapons, and vision range. Because of these advantages the hardcoded AI can be termed a "cheater", and it is not possible to defeat the cheater without tactical team-play. Once the team of trained AI is capable of taking out one cheater AI, the team should be trained in an environment with more cheaters and more teams of trained AI. Another set of agents, without any curriculum, will be trained by directly exposing the team agents to other teams and a number of cheater AIs.

Once trained, these two sets of agents will meet each other on the battleground. Results such as win ratio, kill-death ratio, and resources acquired by both the curriculum-trained AI and the AI trained without a curriculum will be obtained and compared.

6. EXPECTED OUTCOMES

It is expected that the AI agents will start working as a team to beat the hardcoded AI that has a health and resources advantage. Thus, the research should be able to infer that co-operative behavior can be induced by using reinforcement learning.

The AI trained by the curriculum should be able to achieve better results as compared to the AI
trained by the vanilla proximal policy.

The evaluation metrics will comprise the following parameters:

1. Kill death ratio


2. Win percentage
3. Resources collected

7. REQUIREMENTS / RESOURCES

- Unity 3d game engine (available free of cost for personal use)


- python (available free of cost)
- Keras (available free of cost)
- tensorflow (available free of cost)
- Visual Studio and C# tools (available free of cost for non - commercial use)

- An external GPU might be required at a later stage of the research (to be facilitated by Upgrad)

8. RESEARCH PLAN -

Activity                                                                  Expected duration (in days)
Literature search and review                                              5
Preparing the environment in Unity 3D                                     10
Evaluating reward structure                                               7
Building curriculum                                                       7
Training with curriculum, evaluating and fine-tuning hyperparameters      10
Training without curriculum, evaluating and fine-tuning hyperparameters   10
Report writing                                                            7

Table (1)

9. REFERENCES

Anon (2020) Deep Reinforcement Learning and Its Applications - Inteliment Technologies. [online]
Available at: https://www.inteliment.com/blog/our-thinking/deep-reinforcement-learning-and-its-
applications/ [Accessed 16 Mar. 2020].

Anon (2020) ml-agents/ML-Agents-Overview.md at master · Unity-Technologies/ml-agents. [online]


Available at: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/ML-Agents-
Overview.md [Accessed 15 Mar. 2020].

Chung, J., Kastner, K., Dinh, L. and Goel, K., (2015) A Recurrent Latent Variable Model for Sequential Data. arXiv:1506.02216. pp.1–9.

Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Castañeda, A.G., Beattie, C.,
Rabinowitz, N.C., Morcos, A.S., Ruderman, A., Sonnerat, N., Green, T., Deason, L., Leibo, J.Z.,
Silver, D., Hassabis, D., Kavukcuoglu, K. and Graepel, T., (2019) Human-level performance in 3D
multiplayer games with population-based reinforcement learning. Science, 3646443, pp.859–865.

Juliani, A., Henry, H. and Lange, D., (2018) Unity : A General Platform for Intelligent Agents. pp.1–
18.

OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D.,

Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M.,
Pinto, H.P. de O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang,
J., Wolski, F. and Zhang, S., (2019) Dota 2 with Large Scale Deep Reinforcement Learning. [online]
Available at: http://arxiv.org/abs/1912.06680.

Schulman, J., Moritz, P., Levine, S., Jordan, M.I. and Abbeel, P., (2016) High-Dimensional Continuous Control Using Generalized Advantage Estimation. pp.1–14.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., (2017) Proximal Policy
Optimization Algorithms. pp.1–12.

Utocurricula, A. and Powell, G., (2020) Emergent Tool Use from Multi-Agent Autocurricula.

Vinyals, O., Vezhnevets, A.S. and Silver, D., (2019) StarCraft II : A New Challenge for
Reinforcement Learning.

Wydmuch, M., Kempka, M. and Jaskowski, W., (2018) ViZDoom Competitions: Playing Doom
From Pixels . IEEE Transactions on Games, 113, pp.248–259.

Zhang, Z., Li, H., Zhang, L., Zheng, T., Zhang, T., Hao, X., Chen, X., Chen, M., Xiao, F. and Zhou,
W., (2018) Hierarchical Reinforcement Learning for Multi-agent MOBA Game.

APPENDIX B: TRAINING CONFIGURATION
PPO Configuration
behaviors:
  Shooter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

PPO-GAIL Configuration
behaviors:
  Shooter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 64
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      gail:
        gamma: 0.99
        strength: 0.1
        encoding_size: 128
        learning_rate: 0.0003
        use_actions: false
        use_vail: false
        demo_path: E:/Users/prash/OneDrive/Documents/ml_shooterproject/Demos/ShooterX.demo
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

SAC Configuration
behaviors:
  Shooter:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 256
      buffer_size: 500000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 10.0
      save_replay_buffer: false
      init_entcoef: 0.05
      reward_signal_steps_per_update: 10.0
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

SAC-GAIL configuration
behaviors:
  Shooter:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 256
      buffer_size: 500000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 10.0
      save_replay_buffer: false
      init_entcoef: 0.05
      reward_signal_steps_per_update: 10.0
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      gail:
        gamma: 0.99
        strength: 0.1
        encoding_size: 128
        learning_rate: 0.0003
        use_actions: false
        use_vail: false
        demo_path: E:/Users/prash/OneDrive/Documents/ml_shooterproject/Demos/ShooterX.demo
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 1000
      team_change: 4000
      swap_steps: 40
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

APPENDIX C: PROJECT DETAILS

Project Link: https://gitlab.com/CodedTyphoon/ml_shooterproject


Unity Version: 2018.4.15f1
ML-Agents version: 0.17.0
ML-Agents-Env version: 0.17.0
TensorFlow: 1.15.0-rc2

