
Evaluating the impact of curriculum

learning on the training process for an


intelligent agent in a video game

Rigoberto Sáenz Imbacuán

National University of Colombia


Faculty of Engineering
Dept. of Systems and Industrial Engineering
Bogotá, Colombia
2020
Evaluating the impact of curriculum
learning on the training process for an
intelligent agent in a video game

Rigoberto Sáenz Imbacuán

A final project submitted in partial fulfillment of the requirements for the degree of:
Master in Systems and Computing Engineering

Supervised by:
Ph.D. Jorge Eliécer Camargo Mendoza

Research line:
Reinforcement Learning in Video Games

National University of Colombia


Faculty of Engineering
Dept. of Systems and Industrial Engineering
Bogotá, Colombia
2020
Acknowledgments
I would like to thank Professor Ph.D. Jorge Eliécer Camargo Mendoza for sharing his advice, guidance, and valuable help throughout this project, and the Faculty of Engineering staff for providing a great learning environment and helping me achieve all my educational goals. Most importantly, thanks to my family and all my loved ones for your support throughout this entire unique, life-changing experience.
Abstract
We want to measure the impact of the curriculum learning technique on a reinforcement learning training setup. Several experiments were designed with different training curriculums adapted for the video game chosen as a case study. All of them were then executed on a selected game simulation platform, using two reinforcement learning algorithms and the mean cumulative reward as a performance measure. Results suggest that curriculum learning has a significant impact on the training process, increasing training times in some cases, and decreasing them by up to 40% in other cases.

Resumen
We want to measure the impact of the curriculum learning technique on the training time of an intelligent agent that is learning to play a video game using reinforcement learning. To do so, several experiments were designed with different curriculums adapted for the video game selected as a case study, and they were executed on a selected game simulation platform, using two reinforcement learning algorithms and measuring performance with the mean cumulative reward. Results suggest that using curriculum learning has a significant impact on the training process, in some cases lengthening training times, and in other cases reducing them by up to 40%.

Keywords

Curriculum Learning, Reinforcement Learning, Training Curriculum, Mean Cumulative Reward, Proximal Policy Optimization, Video Games, Game AI, Unity Machine Learning Agents, Unity ML-Agents Toolkit, Unity Engine.
This master's final project was evaluated in October 2020 by the following evaluator:

Germán Jairo Hernández Pérez, Ph.D.

Professor, Dept. of Systems and Industrial Engineering – Faculty of Engineering
National University of Colombia
List of Figures

Figure 1-1: An agent was trained in a video game with an action space consisting of four discrete actions, and then transferred to a robot with a different action space with a small amount of training for the robot [29] Karttunen et al. (2020).

Figure 1-2: (a) A schematic showing 4 affordance variables (lane_LL, lane_L, lane_R, lane_RR). (b) A schematic showing the other 4 affordance variables (angle, car_L, car_M, car_R). (c) An in-game screenshot showing lane_LL, lane_L, lane_R, and lane_RR. (d) An in-game screenshot showing the detection of the number of lanes in a road. (e) An in-game screenshot showing angle. (f) An in-game screenshot showing car_L, car_M, and car_R [30] Martinez et al. (2017).

Figure 2-1: Typical Reinforcement Learning training cycle [19] Juliani (2017).

Figure 2-2: The perception-action-learning loop [38] Arulkumaran et al. (2017).

Figure 2-3: Influence diagram of relevant deep learning techniques applied to commonly used games for game AI research [7] Justesen et al. (2017).

Figure 2-4: Typical network architecture used in deep reinforcement learning for game-playing [7] Justesen et al. (2017).

Figure 2-5: Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last [4] Juliani (2017).

Figure 2-6: A simplified visual representation of how a continuation method works, by defining a sequence of optimization problems of increasing complexity, where the first ones are easy to solve but only the last one corresponds to the actual problem of interest [5] Gulcehre et al. (2019).

Figure 2-7: Different subgames in the game of Quick Chess, which are used to form a curriculum for learning the full game of Chess [39] Narvekar et al. (2020).

Figure 3-1: Typical deep reinforcement learning model [35] Shao et al. (2019).

Figure 3-2: Taxonomy of game simulation platforms based on the flexibility of environment specification according to [18] Juliani et al. (2020).

Figure 3-3: Diagram of high-level components inside the Unity Machine Learning Agents Toolkit [40] Juliani et al. (2020).

Figure 4-1: Pseudocode for PPO-Clip [54] Schulman et al. (2017).

Figure 4-2: Pseudocode for Soft Actor-Critic [42] Haarnoja et al. (2018).

Figure 6-1: Screenshot of learning environment Soccer Twos [40] Juliani et al. (2020).

Figure 6-2: Screenshot of 10 soccer matches running simultaneously in a Unity scene.

Figure 6-3: Rays detecting objects around the blue agent.

Figure 6-4: Objects that can be detected by the agent: top-left: Soccer ball, top-center: Agents of the same team as the agent in training, top-right: Agents of the opposite team, bottom-left: Walls of the soccer field, bottom-center: Agent’s own goal, bottom-right: Opposite team’s goal.

Figure 6-5: Possible movements that the agent can perform: left: lateral motion, center: frontal motion, right: rotation around its Y-axis.

Figure 6-6: Expected behavior of mean cumulative reward values over time for the blue agent in a successful training lesson.

Figure 7-1: Mean cumulative reward for Proximal Policy Optimization over 100 million matches.

Figure 7-2: Mean cumulative reward for Soft Actor-Critic over 100 million soccer matches.

Figure 7-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum A over 100 million soccer matches.

Figure 7-4: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum A over 100 million soccer matches.

Figure 7-5: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum B over 100 million soccer matches.

Figure 7-6: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum B over 100 million soccer matches.

Figure 7-7: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum A+B over 100 million soccer matches.

Figure 7-8: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum A+B over 100 million soccer matches.

Figure 7-9: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum D over 100 million soccer matches.

Figure 7-10: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum G over 100 million soccer matches.

Figure 7-11: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum H over 100 million soccer matches.

Figure 7-12: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum P over 100 million soccer matches.

Figure 7-13: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Q over 100 million soccer matches.

Figure 7-14: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum R over 100 million soccer matches.

Figure 7-15: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum U over 100 million soccer matches.

Figure 7-16: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum W over 100 million soccer matches.

Figure 7-17: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum X over 100 million soccer matches.

Figure 7-18: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Y over 100 million soccer matches.

Figure 7-19: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Z over 100 million soccer matches.

Figure 7-20: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum C over 100 million soccer matches.

Figure 7-21: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum J over 100 million soccer matches.

Figure 7-22: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum M over 100 million soccer matches.

Figure 7-23: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum N over 100 million soccer matches.

Figure 7-24: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum O over 100 million soccer matches.

Figure 7-25: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum V over 100 million soccer matches.

Figure 7-26: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum E over 100 million soccer matches.

Figure 7-27: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum F over 100 million soccer matches.

Figure 7-28: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum K over 100 million soccer matches.

Figure 7-29: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum L over 100 million soccer matches.

Figure 8-1: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had lower performance than the control experiment at the end of 100 million soccer matches.

Figure 8-2: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had the same performance as the control experiment at the end of 100 million soccer matches.

Figure 8-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had higher performance than the control experiment at the end of 100 million soccer matches.

Figure 8-4: Mean cumulative reward for Proximal Policy Optimization control experiment vs the best 4 curriculum experiments compared over 100 million soccer matches.
Contents

1. Introduction
1.1. Problem Statement
1.2. Motivation
1.3. Document Structure
1.4. Contributions

2. Background
2.1. Reinforcement Learning
2.2. Perception-Action-Learning
2.3. Deep Learning in Video Games
2.4. Curriculum Learning

3. Game Simulation Platform
3.1. Review of Game Simulation Platforms
3.2. Game Platform Taxonomy
3.3. Platform Architecture Description

4. Reinforcement Learning Algorithms
4.1. Proximal Policy Optimization (PPO)
4.2. Soft Actor-Critic (SAC)

5. Hyperparameters Description
5.1. Common Parameters
5.2. Proximal Policy Optimization Hyperparameters
5.3. Soft Actor-Critic Hyperparameters
5.4. Hyperparameters Selection

6. Case Study: Toy Soccer Game
6.1. Learning Environment
6.2. Observation Space
6.3. Action Space
6.4. Reward Signals
6.5. Environment Parameters

7. Experimental Results
7.1. Experimental Setup
7.2. Control Experiments
7.3. Preliminary Experiments
7.4. Curriculums with 3 and 4 Lessons
7.5. Curriculums with 5 and 6 Lessons
7.6. Curriculums with 1 and 9 Lessons

8. Discussion
8.1. Curriculums with the Lowest Performance
8.2. Curriculums with Same Performance
8.3. Curriculums with the Highest Performance

9. Conclusions

10. References

1. Introduction
In this document we present the results of several experiments with curriculum learning applied to a game AI learning process, in order to measure its effects on the learning time. Specifically, we trained an agent using a reinforcement learning algorithm to play a video game running on a game simulation platform, then trained another agent under the same conditions but including a training curriculum, which is a set of rules that modify the learning environment at specific times to make it easier for the agent to master at the beginning, and finally we compared both results. Our initial hypothesis is that in some cases using a training curriculum allows the agent to learn faster, reducing the training time required.

We describe in detail all the main elements of our work, including the choice of the game
simulation platform to run the training experiments, the review of the reinforcement learning
algorithms used to train the agent, the description of the video game selected as case study, the
parameters used to design the training curriculums, and the discussion of the results obtained.

1.1. Problem Statement


According to [7] Justesen et al. (2017), there are several challenges that are still open in the domain of artificial intelligence applied to video games, also known as Game AI. Some of them are listed below:

■ General video game playing: This challenge refers to creating general intelligent agents that can play not only a single game, but an arbitrary number of known and unknown games. According to [23] Legg et al. (2007), being able to solve a single problem does not make you intelligent; to learn general intelligent behavior you need to train not just on a single task, but on many different tasks. [24] Schaul et al. (2011) suggest that video games are ideal environments for Artificial General Intelligence (AGI), in part because there are multiple video games that share common interface and reward conventions.
■ Computational resources: Usually, training an agent using deep neural networks to learn how to play an open-world game like Grand Theft Auto V [32] Rockstar Games (2020) requires huge amounts of computational power. This issue becomes even more noticeable if we want to train several agents. According to [7] Justesen et al. (2017), it is not yet feasible to train deep networks in real-time to allow agents to adapt instantly to changes in the game or to a particular playing style, which could be useful in the design of new types of games.
■ Games with very sparse rewards: Some games, such as Montezuma’s Revenge, are characterized by very sparse rewards and still pose a challenge for most current deep reinforcement learning techniques. Some approaches that might be useful for solving this kind of game include hierarchical reinforcement learning [31] Barto et al. (2003) and intrinsically motivated reinforcement learning [31] Singh et al. (2005).
■ Dealing with extremely large decision spaces: For well-known board games such as Chess the average branching factor is around 30, but for video games like Grand Theft Auto V [32] Rockstar Games (2020) or StarCraft the branching factor is several orders of magnitude larger. How to scale deep reinforcement learning to handle such levels of complexity is an important open challenge.

These challenges do not have a straightforward solution; rather, they require a combination of multiple techniques to mitigate their impact. We believe curriculum learning could be one of those techniques, since it could allow agents to learn in less time, which in turn could mean fewer computational resources are required.

In the case of general video game playing, we believe curriculum learning could help to overcome this challenge in most scenarios: if agents are trained on easier 2D games first, they can learn how to explore levels and collect items that can later help them defeat potential enemies and overcome obstacles, and then they can be trained on harder, more complex 3D games, using all the experience gained while learning the simpler games. Setting up the training process this way could make the agent learn faster than a training process in which games are presented to the agent in a random order of complexity.

Using less time for training provides additional side benefits, including less money spent on cloud servers to run the training, lower energy consumption, a smaller carbon footprint, and, in the case of commercial video games, faster product release times.

In this context, the main question we address in this work is:

Does an agent that uses reinforcement learning to learn how to play a video game learn faster if
curriculum learning is used during the training?

1.2. Motivation
Training intelligent agents to play video games has several applications in the real world, since a video game can be seen as a low-cost, low-risk playground for learning complex tasks. In most cases, direct agent interaction with the real world is either expensive or not feasible, since the real world is far too complex for the agent to perceive and understand, so it makes sense to simulate the interaction in a virtual learning environment which receives input and returns feedback on every decision made by the agent; most of the knowledge about the environment can then be transferred to a physical agent, usually a robot. This approach is called Sim-to-Real (Simulation to Real World). In this scenario, faster learning times when training agents could benefit several real-world applications; a few are listed below:

■ [29] Karttunen et al. (2020) performed several Sim-to-Real experiments, training an agent using deep reinforcement learning to learn a navigation task in ViZDoom [12] Kempka et al. (2016); the learned policy was then transferred to a physical robot using transfer learning, freezing most of the pre-trained neural network parameters. This process is depicted in Figure 1-1.

Figure 1-1: An agent was trained in a video game with an action space consisting of four
discrete actions, and then transferred to a robot with a different action space with a small
amount of training for the robot [29] Karttunen et al. (2020).

■ [30] Martinez et al. (2017) used over 480,000 labeled images of highway driving generated in
Grand Theft Auto V, a popular video game released in 2013 by [32] Rockstar Games (2020),
to train a convolutional neural network to calculate the distance to cars and objects ahead,
lane markings, and driving angle (angular heading relative to lane centerline), all variables
required for the development of an autonomous driving system. Figure 1-2 shows several
schematics and game screenshots of the lanes and cars detected.

Figure 1-2: (a) A schematic showing 4 affordance variables (lane_LL, lane_L, lane_R,
lane_RR). (b) A schematic showing the other 4 affordance variables (angle, car_L, car_M,
car_R). (c) An in-game screenshot showing lane_LL, lane_L, lane_R, and lane_RR. (d) An
in-game screenshot showing the detection of the number of lanes in a road. (e) An in-game
screenshot showing angle. (f) An in-game screenshot showing car_L, car_M, and car_R [30]
Martinez et al. (2017).

1.3. Document Structure


This document is structured as follows:

In chapter 2 ‘Background’, we provide a formal definition of reinforcement learning, which is based on the idea of learning by using rewards and penalties, highlighting why it is suitable for video games. We also review the Perception-Action-Learning loop that models the interaction between an agent and a video game, highlighting that the cumulative reward is what agents try to maximize over time. An influence diagram of deep learning algorithms used in video games is shown, including a neural network architecture typically used for game playing. Then curriculum learning is formally defined, a training strategy that uses easier examples first and harder, more complex ones later to increase the learning rate of an agent that is learning a particular task.

In chapter 3 ‘Game Simulation Platform’, several game simulation platforms are listed, describing their main features, advantages, and restrictions. A taxonomy to classify the platforms is presented and used to support the selection of Unity and its machine learning library, the ML-Agents Toolkit, as the preferred platform to run the experiments; its main components are then listed and described, including a high-level architecture diagram.

In chapter 4 ‘Reinforcement Learning Algorithms’, we provide formal definitions for two policy-
based reinforcement learning algorithms chosen for our experiments, both included in the Unity
ML-Agents Toolkit: Proximal Policy Optimization (PPO), an on-policy method that collects a small
batch of experiences to update its decision-making policy, ensuring that the updated policy does
not change too much from the previous one, and Soft Actor-Critic (SAC), an off-policy method
that seeks to maximize the entropy of its stochastic policy and learns using past samples from
experience replay buffers.

In chapter 5 ‘Hyperparameters Description’, we list and describe all hyperparameters and training configurations offered by the Unity ML-Agents Toolkit for the Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) algorithms. The toolkit provides recommended values for most parameters, and we based our training setup on those recommendations.

In chapter 6 ‘Case Study: Toy Soccer Game’, we describe in detail the toy soccer video game chosen as the learning environment for our experiments, a game in which two agents compete against each other to win a soccer match by scoring goals. We review in detail how the agents can perceive their surroundings, what kind of actions they can perform, and how they are rewarded for their actions. We also describe how the mean cumulative reward can be used to measure the agent’s performance and what the expected behavior of this value over time is in a successful training session. Finally, we describe the environment parameters used to control how difficult the game is for our agent, parameters that allowed us to design several training curriculums adapted for the case study.

In chapter 7 ‘Experimental Results’, we describe the hardware and software setup used to run the experiments. Then we describe the experiments we used as controls, to allow comparisons against a training process that uses a curriculum, using the mean cumulative reward as the main metric. We conducted some preliminary experiments with 3 curriculums, describing our initial findings, and then ran 20 additional experiments with curriculums classified by the number of lessons; their results are shown in this chapter.

In chapter 8 ‘Discussion’, we classify the curriculums according to their results when compared against the control experiment. We identify a few common patterns that could explain why some curriculums increase or decrease the agent’s performance, affecting the training time.

In chapter 9 ‘Conclusions’, we highlight the most relevant findings.

1.4. Contributions
As a result of this work, we provide several resources that are available to use for academic
purposes, including further research about curriculum learning applied to video games:

■ The video game used as the case study, which is based on a 3D learning environment
provided by the Unity ML-Agents Toolkit [40] Juliani et al. (2020).
■ The curriculums designed for the case study, specified in the format required by the Unity ML-
Agents Toolkit [40] Juliani et al. (2020).
■ The results of all experiments exported in .csv format, results that include performance metrics
we used in this work such as the mean cumulative reward.
■ The software setup required to run the experiments, specifically the set of Linux commands
required to install all the dependencies and the learning environment.
■ A paper that was submitted to the Iberoamerican Artificial Intelligence journal, presenting the
findings of this work.

All of these resources can be found in this repository: https://github.com/rsaenzi/master-thesis



2. Background
2.1. Reinforcement Learning
Machine learning is traditionally divided into three different types of learning [20] Alpaydin (2010): supervised learning, unsupervised learning, and reinforcement learning. While supervised learning, in which intelligent agents are trained by example, has shown impressive results in a variety of different domains, it requires a large amount of training data that often has to be curated by humans [7] Justesen et al. (2017), a condition that is not always met in the video game domain: sometimes there is no training data available (e.g., when playing an unknown game), or the available training data is insufficient and the process of collecting more data can be very labor-intensive and sometimes infeasible. In these cases, reinforcement learning methods are often applied [7] Justesen et al. (2017).

Reinforcement Learning (RL) [34] Bertsekas et al. (1996), [21] Sutton et al. (2018) is an attempt to formalize the idea of learning based on rewards and penalties [22] Wolfshaar (2017), in which an agent interacts with an environment, and its goal is to learn a behavior policy through this interaction to maximize future rewards; this training cycle is depicted in Figure 2-1. How the environment reacts to a certain action is defined by a model that the agent usually does not know. The agent can be in one of a set of states ($s \in S$) and can take one of many actions ($a \in A$) to change from one state to another. Which state is reached is decided by the transition probabilities between states ($P$); once an action is taken, the environment returns a reward ($r \in R$) as feedback. The model defines the reward function and the transition probabilities [43] Weng (2018).

Figure 2-1: Typical Reinforcement Learning training cycle. [19] Juliani (2017).
The policy $\pi(s)$ indicates the optimal action to take in a particular state to maximize the total rewards; each state is associated with a value function $V(s)$ that predicts the future rewards the agent can receive in that state. In RL the objective is to learn the policy $\pi(s)$ and value $V(s)$ functions. Agent and environment interaction involves a sequence of actions and rewards in time $t = 1, 2, 3, \ldots, T$. Defining $S_t$, $A_t$, and $R_t$ as the state, action, and reward at time step $t$ respectively, an episode is defined as a sequence of states, actions and rewards ending at a terminal state $S_T$, as follows: $S_1, A_1, R_2, S_2, A_2, \ldots, S_T$.

A transition is defined as the action of going from the current state $s$ to the next state $s'$ to get a reward $r$, represented by a tuple $(s, a, s', r)$. The reward function $R$ predicts the next reward triggered by an action:

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in R} r \sum_{s' \in S} P(s', r \mid s, a)$$

The value function $V(s)$ indicates how rewarding a state is by predicting future reward. The return is defined as the future reward, the sum of discounted rewards going forward. The return $G_t$ starting from time $t$ is defined as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discounting factor $\gamma \in [0, 1]$ penalizes future rewards because they may have higher uncertainty or may not provide immediate benefits.
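As a quick illustration of the return defined above, the following sketch accumulates a discounted sum over a hypothetical reward sequence; the rewards and discount factor are made-up values for the example, not taken from the thesis:

```python
# Minimal sketch: computing the discounted return G_t for a short episode.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Iterate backwards so each reward r_{t+k+1} ends up discounted by gamma^k.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.0 + 0.9*0.0 + 0.81*1.0 = 0.81
```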

The state-value of a state $s$ is defined as the expected return when starting in that state at time $t$, $S_t = s$:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value (Q-value) of a state-action pair is defined as:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
9

The difference between the action-value and the state-value is the action advantage function (A-value):

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$

The optimal value functions produce the maximum return:

$$V_*(s) = \max_\pi V_\pi(s), \qquad Q_*(s, a) = \max_\pi Q_\pi(s, a)$$

The optimal policy achieves the optimal value functions:

$$\pi_* = \arg\max_\pi V_\pi(s), \qquad \pi_* = \arg\max_\pi Q_\pi(s, a)$$

Now we have $V_{\pi_*}(s) = V_*(s)$ and $Q_{\pi_*}(s, a) = Q_*(s, a)$. Usually, memorizing $Q_*(s, a)$ values for all state-action pairs is computationally infeasible when the state and action spaces are large; in this case, function approximation (i.e., a machine learning model) is used: $Q(s, a; \theta)$ is a Q-value function with parameters $\theta$ used to approximate the Q-values. Deep Q-Networks (DQN) [44] Mnih et al. (2015) are a common function approximation technique that improves the training results using two mechanisms:

■ Experience Replay: All steps $e_t = (S_t, A_t, R_t, S_{t+1})$ are stored in a replay memory $D_t = \{e_1, e_2, \ldots, e_t\}$. Samples are drawn at random from the replay memory, improving data efficiency, removing correlation in observation sequences, and smoothing changes in the data distribution.
■ Periodically Updated Target: The Q-network is cloned and kept frozen as the optimization target every C steps, with C as a hyperparameter. This makes training more resilient to short-term oscillations.

The loss function for DQN is defined as:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$$

where $U(D)$ is a uniform distribution over the replay memory $D$, and $\theta^-$ are the parameters of the frozen target Q-network [44] Mnih et al. (2015).
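To make the loss above concrete, here is a minimal sketch of one DQN update step using a uniformly sampled minibatch and a frozen target network. The network sizes, batch size, and randomly generated minibatch are illustrative assumptions, not details taken from the thesis:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch_size, gamma = 8, 4, 32, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # frozen copy, refreshed every C steps
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A minibatch (s, a, r, s', done) standing in for samples drawn from the replay memory D.
s = torch.randn(batch_size, obs_dim)
a = torch.randint(0, n_actions, (batch_size, 1))
r = torch.randn(batch_size)
s_next = torch.randn(batch_size, obs_dim)
done = torch.zeros(batch_size)

# TD target r + gamma * max_a' Q(s', a'; theta^-), with no gradient through the target network.
with torch.no_grad():
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

# Q(s, a; theta) for the actions actually taken, and the squared TD error as the loss.
q_sa = q_net(s).gather(1, a).squeeze(1)
loss = nn.functional.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```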

Dynamic programming, Monte Carlo, and temporal-difference methods aim to learn the state/action-value function in order to select the best actions; policy gradient methods instead learn the policy directly with a function parametrized with respect to $\theta$, $\pi(a \mid s; \theta)$. The reward function (the opposite of a loss function) is defined as the expected return. In discrete space:

$$J(\theta) = V_{\pi_\theta}(S_1) = \mathbb{E}_{\pi_\theta}[V_1]$$

where $S_1$ is the initial starting state. In continuous space it is defined as:

$$J(\theta) = \sum_{s \in S} d_{\pi_\theta}(s) V_{\pi_\theta}(s) = \sum_{s \in S} \left( d_{\pi_\theta}(s) \sum_{a \in A} \pi(a \mid s, \theta) Q_\pi(s, a) \right)$$

where $d_{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$.

Using gradient ascent, $\theta$ can be moved toward the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the best $\theta$ for $\pi_\theta$ that produces the highest return.
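The gradient-ascent step described above can be sketched as a REINFORCE-style update; this is a generic illustration of the idea rather than the specific algorithms used in this work, and the episode tensors and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

# REINFORCE-style sketch: maximize J(theta) by ascending its estimated gradient.
obs_dim, n_actions, episode_len = 8, 4, 50
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One collected episode: states, actions taken, and returns G_t (illustrative random tensors).
states = torch.randn(episode_len, obs_dim)
actions = torch.randint(0, n_actions, (episode_len,))
returns = torch.randn(episode_len)

# log pi(a_t | s_t; theta) for the actions taken during the episode.
log_probs = torch.log_softmax(policy(states), dim=1)
log_probs_taken = log_probs[torch.arange(episode_len), actions]

# Gradient ascent on J(theta) is gradient descent on -J(theta).
loss = -(log_probs_taken * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```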

Getting good results with policy gradient methods is challenging because they are sensitive to the choice of step size: progress is slow if a small step size is chosen, and performance can drop when a big step size is used [54] Schulman et al. (2017). Several policy gradient algorithms have been proposed in recent years, including:

■ Advantage Actor-Critic (A2C) [37] Mnih et al. (2016)


■ Asynchronous Advantage Actor-Critic (A3C) [37] Mnih et al. (2016)
■ Deterministic Policy Gradient (DPG) [45] Silver et al. (2014)
■ Deep Deterministic Policy Gradient (DDPG) [46] Lillicrap et al. (2016)
■ Distributed Distributional DDPG (D4PG) [47] Barth-Maron et al. (2018)
■ Trust Region Policy Optimization (TRPO) [48] Schulman et al. (2015)
■ Proximal Policy Optimization (PPO) [41] Schulman et al. (2017)
■ Actor-Critic with Experience Replay (ACER) [49] Wang et al. (2017)
■ Actor-Critic using Kronecker-factored Trust Region (ACTKR) [50] Wu et al. (2017)
■ Soft Actor-Critic (SAC) [42] Haarnoja et al. (2018)
■ Twin Delayed Deep Deterministic (TD3) [51] Fujimoto et al. (2018)
■ Stein Variational Policy Gradient (SVPG) [52] Liu et al. (2017)
■ Importance Weighted Actor-Learner Architecture (IMPALA) [53] Espeholt et al. (2018)

2.2. Perception-Action-Learning
A video game can easily be modeled as an environment in an RL setting, wherein agents have a
finite set of actions that can be taken at each step and their sequence of moves determines their
success [7] Justesen et al. (2017). At any given moment the agent is in a certain state, from that
state it can take one of a set of actions. The value of a given state refers to how ultimately
rewarding it is to be in that state. Taking an action in a state can bring an agent to a new state, provide a reward, or both; this is called the perception-action-learning loop [38] Arulkumaran et al. (2017), and it is depicted in Figure 2-2.

Figure 2-2: The perception-action-learning loop [38] Arulkumaran et al. (2017).

At time $t$, the agent receives state $s_t$ from the environment. The agent uses its policy to choose an action $a_t$. Once the action is executed, the environment transitions a step, providing the next state $s_{t+1}$ as well as feedback in the form of a reward $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(s_t, a_t, s_{t+1}, r_{t+1})$, in order to learn and improve its policy [38] Arulkumaran et al. (2017).

The total cumulative reward is what all RL agents try to maximize over time. Depending on the game mechanics, reward signals can be sent frequently, for instance every time an agent performs actions like killing an enemy, reaching a checkpoint, or picking up a health box, or they can be very sparse if the agent is rewarded only upon finding something valuable inside an extensive terrain or a complex maze.
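The perception-action-learning loop can be summarized in a few lines of code. The sketch below assumes a generic, Gym-style environment and agent interface purely for illustration; it is not the Unity ML-Agents API used later in this work:

```python
# Generic perception-action-learning loop sketch for one episode.
# `env` and `agent` are assumed objects exposing a Gym-style interface.
def run_episode(env, agent):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                       # perception -> action via the policy
        next_state, reward, done = env.step(action)     # environment transition plus feedback
        agent.learn(state, action, next_state, reward)  # use the transition to improve the policy
        total_reward += reward                          # the cumulative reward the agent maximizes
        state = next_state
    return total_reward
```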
2.3. Deep Learning in Video Games
[7] Justesen et al. (2017) list several deep learning techniques used for playing different video game genres in a believable, entertaining, or human-like manner, focusing on games that have been used extensively as game AI research platforms.

Figure 2-3: Influence diagram of relevant deep learning techniques applied to commonly used
games for game AI research. [7] Justesen et al. (2017).

Figure 2-3 shows an influence diagram of these techniques, in which each node is an algorithm
while the color represents the game platform. The distance from the center represents the date
that the original paper was published on arXiv.org. Each node points to all other nodes that used
or modified that technique. The arrows represent how techniques are related; arrows pointing to a particular algorithm show which algorithms influenced its design.

It is worth highlighting the use by [36] Wu et al. (2017) of Asynchronous Advantage Actor-Critic (A3C) [37] Mnih et al. (2016) plus curriculum learning (blue node at the middle-right of the influence diagram) to train a vision-based agent called F1 to play first-person shooters (FPS), in particular Doom, to compete in the ViZDoom AI Competition [12] Kempka et al. (2016) hosted by the IEEE Conference on Computational Intelligence and Games. The agent was first trained on easier tasks (weak opponents, smaller map) to gradually face harder problems (stronger opponents, bigger maps); this agent won the 2016 edition of the competition [57] Wydmuch et al. (2018).

Some of these techniques assume full access to the game’s data during the training phase; this scenario is called a Fully Observable Environment, which is useful for techniques that focus only on the learning process itself [25] Ortega et al. (2015). In other scenarios, the technique tries to infer a valid game model from raw pixel data and then learn how to play the inferred game model [26] Lample et al. (2016), [27] Adil et al. (2017); some techniques even try to predict the current game frame from past frames and the actions performed by the agent [28] Wang et al. (2017). These scenarios are called Partially Observable Environments.

Figure 2-4 shows an example of a typical network architecture used in deep reinforcement
learning for game-playing on partially observable environments. The input typically consists of a
preprocessed screen image, or several concatenated images, which is followed by a couple of
convolutional layers without pooling, and a few fully connected layers [7] Justesen et al. (2017).
Recurrent networks have a recurrent layer after the fully connected layers. The output layer
usually consists of one unit for each unique combination of actions in the game, and for actor-
critic methods such as A3C [37] Mnih et al. (2016), it also has one for the value function 𝑉(𝑠).
Figure 2-4: Typical network architecture used in deep reinforcement learning for game-playing.
[7] Justesen et al. (2017).

2.4. Curriculum Learning


According to [1] Bengio et al. (2009) humans learn better when concepts and examples are taught
in a meaningful order, using previously learned concepts to ease the learning of more abstract,
complex concepts. Our current educational model heavily relies on this to increase the speed at
which learning can occur, by using curriculums that use the “starting small” strategy. [1] Bengio
et al. (2009) state that choosing which examples to present and defining a specific order for
showing them, based on criteria of incremental difficulty, can increase the learning speed.

An easy way to illustrate this strategy is to think about the way arithmetic, algebra, and calculus are taught in a typical education system. Arithmetic is taught before algebra. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide scaffolding for harder tasks in the future, according to [4] Juliani (2017). Figure 2-5 shows a schematic of a mathematics curriculum.

Figure 2-5: Example of a mathematics curriculum. Lessons progress from simpler topics to more
complex ones, with each building on the last. [4] Juliani (2017).

[2] Elman et al. (1993) state that the “starting small” strategy could make it possible for humans to learn what might otherwise prove to be unlearnable. However, according to [3] Harris (1991), some problems are best learned if the whole data set is available to the neural network from the beginning; otherwise, the network often fails to learn the correct generalization and remains stuck in a local minimum.

[1] Bengio et al. (2009) formalize the use of curriculums as a training strategy in the context of machine learning by calling it Curriculum Learning, stating that this strategy can have two effects: for convex criteria, it can increase the speed of convergence of the training process to a minimum; for non-convex criteria, it can increase the quality of the local minima obtained.

[1] Bengio et al. (2009) hypothesize that a well-chosen curriculum strategy can act as a continuation method, which is a general strategy for the global optimization of non-convex functions, especially useful for deep learning methods that attempt to learn feature hierarchies, where higher levels are formed by the composition of lower-level features; training these deep architectures involves potentially intractable non-convex optimization problems.

[6] Allgower (2003) indicates how continuation methods address complex optimization problems by smoothing the original function, turning it into a different problem that is easier to optimize. By gradually reducing the amount of smoothing, it is possible to consider a sequence of optimization problems that converge to the optimization problem of interest. A visual representation of how this process works can be found in Figure 2-6.

Let $C_\lambda(\theta)$ be a single-parameter family of cost functions such that $C_0$ can be optimized easily (maybe convex in $\theta$), while $C_1$ is the criterion we actually want to minimize. $C_0(\theta)$ is minimized first, and then $\lambda$ is gradually increased while keeping $\theta$ at a local minimum of $C_\lambda(\theta)$.

Typically, $C_0$ is a highly smoothed version of $C_1$, so that $\theta$ gradually moves into the basin of attraction of a dominant (if not global) minimum of $C_1$. Applying a continuation method to the problem of minimizing a training criterion involves a sequence of training criteria, starting from one that is easier to optimize, and ending with the training criterion of interest.
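As a simple illustration (not taken from the cited works), one common way to build such a single-parameter family is to blend the easy criterion and the target criterion linearly:

$$C_\lambda(\theta) = (1 - \lambda)\, C_0(\theta) + \lambda\, C_1(\theta), \qquad \lambda \in [0, 1],$$

so that $C_\lambda$ smoothly interpolates from the easy problem $C_0$ at $\lambda = 0$ to the criterion of interest $C_1$ at $\lambda = 1$.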
Figure 2-6: A simplified visual representation of how a continuation method works, by defining a
sequence of optimization problems of increasing complexity, where the first ones are easy to
solve but only the last one corresponds to the actual problem of interest. [5] Gulcehre et al. (2019).

A curriculum can be seen as a sequence of training criteria, in which each criterion is associated with a different set of weights on the training examples or, generally speaking, a reweighting of the training distribution. In the beginning, the weights favor easier examples; each subsequent training criterion involves a slight change in the weighting of examples that increases the probability of sampling more difficult examples.

This idea can be formalized as follows: Let 𝑧 be a random variable representing an example for
the learner (for example, an (𝑥, 𝑦) pair for supervised learning). Let 𝑃(𝑧) be the target training
distribution from which the learner should ultimately learn a function of interest.
17

Let $0 \le W_\lambda(z) \le 1$ be the weight applied to example $z$ at step $\lambda$ in the curriculum sequence, with $0 \le \lambda \le 1$, and $W_1(z) = 1$. The corresponding training distribution at step $\lambda$ is:

$$Q_\lambda(z) \propto W_\lambda(z)\, P(z) \quad \forall z$$

such that $\int Q_\lambda(z)\, dz = 1$. Then we have:

$$Q_1(z) = P(z) \quad \forall z$$

Consider a monotonically increasing sequence of $\lambda$ values, starting from $\lambda = 0$ and ending at $\lambda = 1$. The corresponding sequence of distributions $Q_\lambda$ is called a curriculum if the entropy of these distributions increases:

$$H(Q_\lambda) < H(Q_{\lambda + \epsilon}) \quad \forall \epsilon > 0$$

and $W_\lambda(z)$ is monotonically increasing in $\lambda$, i.e.:

$$W_{\lambda + \epsilon}(z) \ge W_\lambda(z) \quad \forall z, \forall \epsilon > 0$$
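As a toy illustration of this definition (the example weights and target distribution are made up for the sketch), the following code builds reweighted distributions $Q_\lambda$ over four examples and checks that their entropy increases as $\lambda$ grows, reaching $Q_1 = P$:

```python
import numpy as np

# Toy target distribution P(z) over 4 examples, ordered from easy to hard (illustrative values).
P = np.array([0.25, 0.25, 0.25, 0.25])

def Q(lmbda):
    # Weights W_lambda(z): easy examples always kept, harder ones phased in as lambda grows.
    difficulty = np.array([0.0, 0.3, 0.6, 0.9])
    W = np.clip(1.0 - difficulty + lmbda, 0.0, 1.0)  # monotonically increasing in lambda, W_1(z) = 1
    q = W * P
    return q / q.sum()  # normalize so Q_lambda sums to 1

def entropy(q):
    return -(q * np.log(q + 1e-12)).sum()

for lam in (0.0, 0.5, 1.0):
    print(lam, Q(lam), entropy(Q(lam)))  # entropy grows with lambda; Q_1 equals P
```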

[39] Narvekar et al. (2020) present Quick Chess as an example of how a curriculum can be used to teach how to play a game using the “starting small” strategy. Quick Chess is a game designed to introduce children to the full game of chess by using a sequence of progressively more difficult subgames. The first subgame is played on a 5x5 board with pawns only, which is useful for children to learn how pawns move, get promoted, and take other pieces. Figure 2-7 shows several subgames of Quick Chess.

Figure 2-7: Different subgames in the game of Quick Chess, which are used to form a curriculum
for learning the full game of Chess [39] Narvekar et al. (2020).
In the second subgame the King piece is added, which introduces a new objective: keep the king alive. In each successive subgame, new elements are introduced (such as new pieces, a larger board, or different configurations) that require learning new skills and building upon the knowledge learned in previous subgames; the final subgame is the full game of chess [39] Narvekar et al. (2020).

3. Game Simulation Platform


3.1. Review of Game Simulation Platforms
The first step of our work was to choose an adequate game simulation platform that allows us to run all the game AI experiments needed to test our ideas. The chosen platform needs to support experiments with reinforcement learning, should be highly configurable and easy to use, and should not be constrained to a specific type of video game.

According to [7] Justesen et al. (2017), the increasing use of deep learning methods in video games is due to the practice of comparing game-playing algorithm results on publicly available game simulation platforms, in which algorithms are ranked on their ability to score points or win games.

A diagram of a typical deep reinforcement learning model running on a game simulation platform is shown in Figure 3-1. Input is taken from the game environment and meaningful features are extracted automatically; the RL agent produces actions based on these features, making the environment transition to the next state [35] Shao et al. (2019).

Figure 3-1: Typical deep reinforcement learning model [35] Shao et al. (2019).
Several game simulation platforms listed by [18] Juliani et al. (2020), [35] Shao et al. (2019) and
[7] Justesen et al. (2017) are described below:

■ Arcade Learning Environment (ALE) [8] Bellemare et al. (2013) is a free, open-source software framework for interfacing with hundreds of games for the Atari 2600, a second-generation home video game console originally released in 1977 and sold for over a decade [9] Montfort et al. (2009). ALE allows the domain-independent evaluation of AI algorithms by providing an interface to game environments that present research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. ALE lets users interface with the games through joystick motions and RAM data, and provides a game-handling layer that transforms the running game into a standard reinforcement learning problem by identifying the accumulated score and informing whether the game has ended [8] Bellemare et al. (2013).
■ Retro Learning Environment (RLE) [10] Bhonker et al. (2017) is an open-source software environment that allows intelligent agents to be trained to play games for the Super Nintendo Entertainment System (SNES), Nintendo Ultra 64 (N64), Game Boy, Sega Genesis, Sega Saturn, Dreamcast, and PlayStation. RLE supports working with multi-agent reinforcement learning (MARL) [11] Buşoniu et al. (2010) tasks, particularly useful to train and evaluate agents that compete against each other rather than against a pre-configured in-game AI. RLE separates the learning environment from the emulator, incorporating an interface called LibRetro that allows communication between front-end programs and game console emulators, allowing it to support over 15 game consoles, each containing hundreds of games, with an estimated total of 7,000 games [10] Bhonker et al. (2017).
■ OpenNERO [17] Karpov et al. (2008) is a general-purpose open-source game platform designed for research and education in game AI. The project is based on the Neuro-Evolving Robotic Operatives (NERO) game developed by graduate and undergraduate students at the Neural Networks Research Group and the Department of Computer Science at the University of Texas at Austin. OpenNERO has a client-server architecture; features 3D graphics, physics simulation, 3D audio rendering, and networking; and has been used for planning, natural language processing, multi-agent systems, reinforcement learning, and evolutionary computation. OpenNERO has an API to implement machine learning tasks, environments, and agents, and includes an extensible collection of ready-to-use AI algorithms such as value function reinforcement learning, heuristic search, and neuroevolution [17] Karpov et al. (2008).

■ ViZDoom [12] Kempka et al. (2016) is an AI research platform for visual reinforcement learning
based on the classical first-person shooter (FPS) video game Doom, a semi-realistic 3D world
that can be observed from a first-person perspective. VizDoom allows developing agents to
play Doom using the screen buffer as input, agents that have to perceive, interpret and learn
the 3D world in order to make tactical and strategic decisions such as where to go and how
to act. VizDoom allows users to define custom training scenarios that differ by maps,
environment elements, non-player characters, rewards, goals, and actions available to the
agent.
■ DeepMind Lab [13] Beattie et al. (2016) is a first-person 3D game platform designed for research and development of general artificial intelligence and machine learning systems, which can be used to study how pixels-to-actions autonomous intelligent agents learn complex tasks in large, partially observed, and visually diverse worlds. DeepMind Lab was built on top of Quake III Arena, created by ‘id Software’ in 1999, a game that features rich science fiction-style 3D visuals and more naturalistic physics. DeepMind Lab allows agents to learn tasks including navigating mazes, traversing dangerous passages while avoiding falling off cliffs, bouncing through space using launch pads to move between platforms, collecting items, and learning and remembering random procedurally generated environments [13] Beattie et al. (2016).
■ Project Malmo [14] Johnson et al. (2016) is an AI experimentation platform built on top of the
popular video game Minecraft, a platform designed to support fundamental research in
artificial general intelligence (AGI) and related areas including robotics, computer vision,
reinforcement learning, planning, and multi-agent systems, by providing a rich, structured and
dynamic 3D environment with complex dynamics. Project Malmo is heavily focused on the
research of flexible AI that can learn to perform well on a wide range of tasks, similar to the
kind of flexible learning seen in humans and other animals, in contrast to most AI approaches
that are mainly designed to perform narrow tasks [14] Johnson et al. (2016).
■ TorchCraft [15] Synnaeve et al. (2016) is a library that enables deep learning research on
real-time strategy games such as StarCraft: Brood War, a popular real-time strategy (RTS)
video game published in 1998 by Blizzard Entertainment. RTS games have been a domain of
interest for the planning and decision-making research communities [16] Silva et al. (2017),
since they aim to simulate the control of multiple units in a military setting at different scales
and levels of complexity, normally on a fixed-size 2D map. The agents being trained have to collect resources and create buildings and military units to fight and destroy their opponents. TorchCraft serves as a low-level bridge between the game and Torch, a scientific computing framework for LuaJIT, by dynamically injecting code into the game engine that hosts the game [15] Synnaeve et al. (2016).
■ Unity [18] Juliani et al. (2020) is a 3D real-time cross-platform video game development platform created by Unity Technologies, featuring high-quality rendering and physics simulation. Unity is not restricted to any specific genre of gameplay or simulation; this flexibility enables the creation of tasks ranging from simple 2D grid-world problems to complex 3D strategy games, physics-based puzzles, and multi-agent competitive games. Unity provides an open-source framework called the Unity Machine Learning Agents Toolkit (Unity ML-Agents Toolkit) [19] Juliani (2017), [18] Juliani et al. (2020), [40] Juliani et al. (2020), a game AI framework that enables 2D, 3D, and VR/AR games and simulations to serve as learning environments for training intelligent agents. Agents can be trained using reinforcement learning, curriculum learning, imitation learning, neuroevolution, and other machine learning methods [33] Mattar et al. (2020).

3.2. Game Platform Taxonomy


After listing the game simulation platforms, we needed a way to support the selection of one of them to run our experiments, so we decided to rely on a taxonomy proposed by [18] Juliani et al. (2020) that classifies game simulation platforms based on their flexibility to run different game environments. Figure 3-2 shows several platforms classified using this taxonomy.

Figure 3-2: Taxonomy of game simulation platforms based on the flexibility of environment
specification according to [18] Juliani et al. (2020).

Each taxonomy category is described below:

■ Single Environment: Platforms that act as a black box from an agent’s perspective, and usually
run only a specific game.
■ Environment Suite: Consists of several learning environments packaged together to benchmark the performance of a game AI technique across different games.
■ Domain-Specific Platform: Allows the creation of a set of tasks within a specific domain such
as first-person navigation, car racing, or human locomotion.
■ General Platform: Platforms that allow creating custom learning environments with arbitrarily
complex visuals, physical and social interactions, and configurable tasks.

According to [18] Juliani et al. (2020), Unity is the only platform that has the complexity, flexibility, and computational properties expected from a general game simulation platform for game AI. This platform is the only one that provides out-of-the-box support for both reinforcement learning and curriculum learning, and it is not constrained to a specific game or learning environment. Taking all these reasons into consideration, we decided to choose Unity and its Unity ML-Agents Toolkit [18] Juliani et al. (2020) as the simulation platform to run our experiments.

3.3. Platform Architecture Description


The Unity ML-Agents Toolkit [18] Juliani et al. (2020) is made of several components described
below:

■ Learning Environment: Unity scene that contains all game characters and the environment in
which agents can observe, act, and learn. Contains agents and behaviors.
■ Agents: Unity component that is attached to a Unity game object (any character within a Unity
Scene), generates its observations, performs actions it receives and assigns a reward
(positive or negative) when appropriate. Each agent is linked to one behavior.
■ Behaviors: Define specific attributes of the agent, such as the number of actions the agent can take. A behavior can be thought of as a function that receives observations and rewards from the Agent and returns actions. Behaviors can be of one of three types:
■ Learning Behavior: Behavior that is not, yet, defined but about to be trained.
■ Heuristic Behavior: Behavior defined by a hard-coded set of rules written in code.
■ Inference Behavior: Behavior that includes a trained Neural Network file (.nn), after a
learning behavior is trained, it becomes an inference behavior.
■ Python low-level API: Python interface for interacting with and manipulating a learning environment; this is not part of Unity, but lives outside it and communicates with Unity through the External Communicator.
■ External Communicator: Connects the learning environment with the Python low-level API.
■ Python Trainers: Contains all the machine learning algorithms that enable training agents.

Every learning environment has at least one agent, and each agent must be linked to one behavior; it is possible for agents that have similar observations and actions to share the same behavior [40] Juliani et al. (2020). A schematic showing the main components of the Unity ML-Agents Toolkit is shown in Figure 3-3.

Figure 3-3: High-level components diagram of Unity ML-Agents Toolkit [40] Juliani et al. (2020).
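For reference, a minimal sketch of driving a learning environment from the Python low-level API is shown below. The executable path, behavior name, and action size are purely illustrative assumptions, and the exact function names vary between toolkit versions, so treat this as an outline rather than the toolkit's definitive API:

```python
from mlagents_envs.environment import UnityEnvironment
import numpy as np

# Hypothetical paths/names, for illustration only.
ENV_PATH = "builds/soccer_env.x86_64"   # a built Unity learning environment
BEHAVIOR_NAME = "SoccerAgent"           # behavior names are defined in the Unity scene
ACTION_SIZE = 3                         # placeholder; read it from the behavior spec in practice

env = UnityEnvironment(file_name=ENV_PATH)
env.reset()

for _ in range(100):
    # Decision steps: agents requesting an action; terminal steps: agents that just ended an episode.
    decision_steps, terminal_steps = env.get_steps(BEHAVIOR_NAME)
    # Random actions stand in for a trained policy here.
    actions = np.random.uniform(-1, 1, size=(len(decision_steps), ACTION_SIZE)).astype(np.float32)
    env.set_actions(BEHAVIOR_NAME, actions)
    env.step()

env.close()
```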

The Unity ML-Agents Toolkit [18] Juliani et al. (2020) provides different scenarios to train agents
[40] Juliani et al. (2020):

■ Single-Agent: This is the traditional way of training agents, used in most cases; only one reward signal is used.
■ Simultaneous Single-Agent: A parallelized version of the first scenario; it has several independent agents with individual reward signals and the same behavior parameters. This scenario can speed up the training process.
■ Adversarial Self-Play: In this case, there are two interacting agents with inverse reward signals, a scenario especially designed for two-player games.
■ Cooperative Multi-Agent: Multiple agents working together to accomplish a task that cannot be done alone; all of them share the reward signal and could have the same or different behavior parameters. Tower defense video games fit into this scenario.
■ Competitive Multi-Agent: Multiple agents interacting and competing with each other to either win a competition or obtain a set of limited resources. Agents use inverse reward signals and could have the same or different behavior parameters. This scenario works well for team sports video games.
■ Ecosystem: Multiple interacting agents with independent reward signals; this scenario is adequate for open-world video games such as Grand Theft Auto V [32] Rockstar Games (2020), which have several kinds of characters, each one with its own goals. Autonomous driving simulation within an urban environment can be represented by this scenario.

The training scenario we found the most convenient for our experiments is the Simultaneous
Single-Agent scenario since it allows us to train multiple instances of the same agent to reduce
the time of the training process.
4. Reinforcement Learning Algorithms
The Unity ML-Agents Toolkit [18] Juliani et al. (2020) provides an implementation based on
TensorFlow of two state-of-the-art reinforcement learning algorithms that we found appropriate to
train our agents:

■ Proximal Policy Optimization (PPO) [41] Schulman et al. (2017)


■ Soft Actor-Critic (SAC) [42] Haarnoja et al. (2018)

Proximal Policy Optimization (PPO) is an on-policy algorithm that has been shown to be more
general-purpose and stable than many other reinforcement learning algorithms. Soft Actor-Critic
(SAC), on the other hand, is off-policy, which means it can learn from past experiences that are
collected, placed in an experience replay buffer, and randomly drawn during training.
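As a rough illustration of the experience replay idea that distinguishes off-policy SAC from on-policy PPO, the following is a minimal Python sketch, not the toolkit's implementation; the capacity and batch size values are arbitrary assumptions:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=500000):
        # once capacity is reached, the oldest experiences are evicted first
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # every interaction is stored, regardless of which policy produced it
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=512):
        # off-policy updates draw random past experiences, not only the latest rollout
        return random.sample(list(self.buffer), batch_size)

We provide a formal description of each algorithm below: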

4.1. Proximal Policy Optimization (PPO)


This algorithm aims to take the biggest possible improvement step on a policy to improve
performance, without stepping so far that it causes a performance collapse. Proximal Policy
Optimization (PPO) trains a stochastic policy in an on-policy way, which means that it explores
by sampling actions according to the latest version of its stochastic policy. The amount of
randomness in action selection depends on both initial conditions and the training procedure.
Over the course of training the policy typically becomes progressively less random, as the update
rule encourages it to exploit rewards that it has already found.

There are two primary variants:

■ PPO-Penalty: Approximately solves a KL-constrained policy update, with the constraint expressed in
terms of KL-Divergence, a measure of the distance between probability distributions that
controls how close the new and old policies are allowed to be. This variant penalizes the KL-
Divergence in the objective function instead of making it a hard constraint, automatically
adjusting the penalty coefficient over the course of training so that it is scaled appropriately.
■ PPO-Clip: It does not have a KL-Divergence term in the objective function and does not have
a constraint at all; instead it relies on specialized clipping to remove incentives for the new
policy to move far from the old policy. This variant is considered an improvement over the first
one [54] Schulman et al. (2017).

Let $\pi_\theta$ denote a policy with parameters $\theta$. PPO-Clip updates policies via:

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_k}} \left[ L(s, a, \theta_k, \theta) \right]$$

taking multiple steps of (usually minibatch) Stochastic Gradient Descent to maximize the objective
function. $L$ is given by:

$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a), \; \text{clip}\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s, a) \right)$$

where $\epsilon$ is a hyperparameter that controls how far away the new policy is allowed to go from
the old policy, $\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}$ is the ratio between the probabilities of the action under the new and old policies,
and $A^{\pi_{\theta_k}}$ is the estimated advantage. There is a simplified version of the equation above:

$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a), \; g(\epsilon, A^{\pi_{\theta_k}}(s, a)) \right)$$

where:

$$g(\epsilon, A) = \begin{cases} (1+\epsilon)A & A \geq 0 \\ (1-\epsilon)A & A < 0 \end{cases}$$

Clipping serves as a regularizer by removing incentives for the policy to change dramatically, and
the hyperparameter $\epsilon$ corresponds to how far away the new policy can go from the old while still
benefiting the objective.
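The clipped objective above can also be expressed as a short NumPy sketch; this is an illustration of the formula, not the toolkit's implementation, and new_logp, old_logp and advantages are assumed to be produced elsewhere by the policy network and the advantage estimator:

import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, epsilon=0.2):
    # ratio of action probabilities: pi_theta(a|s) / pi_theta_k(a|s)
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # element-wise minimum of the unclipped and clipped terms, averaged over the batch
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))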

An implementation in pseudocode of PPO-Clip variant is shown in Figure 4-1.


Figure 4-1: Pseudocode for PPO-Clip [54] Schulman et al. (2017).

4.2. Soft Actor-Critic (SAC)


This algorithm optimizes a stochastic policy in an off-policy way, using entropy regularization. The
policy is trained to maximize a trade-off between expected return and entropy, a measure of
randomness in the policy, which has a close connection to the exploration-exploitation trade-off:
increasing entropy results in more exploration, which can accelerate learning later on, and it can
also prevent the policy from prematurely converging to a bad local optimum.

SAC learns a policy $\pi_\theta$ and two Q-functions $Q_{\phi_1}$, $Q_{\phi_2}$. There are two main variants of SAC:

■ Using a fixed entropy regularization coefficient 𝛼


■ Enforcing an entropy constraint by varying 𝛼 over the course of training

Let 𝑥 be a random variable with probability mass or density function 𝑃. The entropy 𝐻 of 𝑥 is
computed from its distribution 𝑃 according to:

$$H(P) = \mathbb{E}_{x \sim P}\left[ -\log P(x) \right]$$
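As a small numerical illustration of this definition (an example only, not part of the algorithm), the entropy of a discrete distribution can be computed as follows:

import numpy as np

def entropy(p):
    # H(P) = E[-log P(x)] for a discrete distribution given as a list of probabilities
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

print(entropy([0.5, 0.5]))    # ~0.69, the maximum randomness for two outcomes
print(entropy([0.99, 0.01]))  # ~0.056, an almost deterministic choice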

In entropy-regularized reinforcement learning the agent gets a bonus reward at each time step
proportional to the entropy of the policy at that time step. This changes the RL problem to:


$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t)) \right) \right]$$

where $\alpha > 0$ is the trade-off coefficient, assuming an infinite-horizon discounted setting. For this
new problem a new value function $V^\pi$ is defined to include the entropy bonuses from every time
step:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t)) \right) \,\middle|\, s_0 = s \right]$$

$Q^\pi$ is changed to include the entropy bonuses from every time step except the first:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H(\pi(\cdot \mid s_t)) \,\middle|\, s_0 = s, a_0 = a \right]$$

The loss functions for the Q-networks in SAC are:

$$L(\phi_i, D) = \mathbb{E}_{(s,a,r,s',d) \sim D} \left[ \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2 \right]$$

where the target is given by:

$$y(r, s', d) = r + \gamma (1 - d) \left( \min_{j=1,2} Q_{\phi_{\text{targ},j}}(s', a') - \alpha \log \pi_\theta(a' \mid s') \right), \quad a' \sim \pi_\theta(\cdot \mid s')$$

The entropy regularization coefficient 𝛼 explicitly controls the explore-exploit trade-off, with higher
𝛼 corresponding to more exploration and lower 𝛼 corresponding to more exploitation.
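The target above can be written as a short NumPy sketch; this is an illustration of the formula, not the toolkit's implementation, and the next-state Q-values (q1_next, q2_next) and log-probability next_logp of the sampled next action are assumed to come from the target networks and the policy:

import numpy as np

def sac_target(rewards, dones, q1_next, q2_next, next_logp, alpha=0.2, gamma=0.99):
    # clipped double-Q trick: take the minimum of the two target Q estimates
    min_q = np.minimum(q1_next, q2_next)
    # subtract the entropy penalty alpha * log pi(a'|s')
    return rewards + gamma * (1.0 - dones) * (min_q - alpha * next_logp)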
An implementation in pseudocode of Soft Actor-Critic is shown in Figure 4-2.

Figure 4-2: Pseudocode for Soft Actor-Critic [42] Haarnoja et al. (2018)

5. Hyperparameters Description
To run experiments using the Unity ML-Agents Toolkit [18] Juliani et al. (2020) a
trainer_config.yaml configuration file must be provided containing all the hyperparameters
and training configurations. There are some parameters that are specific to either Proximal Policy
Optimization (PPO) or Soft Actor-Critic (SAC), while some others are common to both algorithms.
The toolkit gives recommended values for most of the parameters of the config file.

All configuration parameters provided by the toolkit are listed and described below:

5.1. Common Parameters


■ trainer: Reinforcement learning algorithm to use on experiments: ppo for Proximal Policy
Optimization (PPO) and sac for Soft Actor-Critic (SAC).
■ summary_freq: Number of training steps that need to be collected to generate training
statistics, determining the granularity of graphs in a graph tool such as Tensorboard.
■ batch_size: Number of training steps in each iteration of gradient descent. Recommended
values: For continuous action space, this value should be in the order of thousands, for
discrete action space should be in the order of tens. For continuous PPO: between 512 and
5,120 is recommended, for continuous SAC: 128 to 1,024, for discrete PPO and SAC: 32 to
512.
■ buffer_size: Number of training steps to collect before updating the policy model,
indicating how many experiences should be collected before doing any learning or updating
the model. Typically, a larger value corresponds to more stable training updates.
Recommended values: This parameter should have a value multiple times larger than
batch_size so that SAC can learn from old as well as new experiences. For PPO: between
2,048 and 409,600 is recommended, for SAC: 50,000 to 1,000,000.
■ hidden_units: Number of units in the hidden layers of the neural network, indicating how
many units are in each fully connected layer. Recommended values: For problems where the
correct action is a straightforward combination of the observation inputs, a small value should
be used, for problems where the action is a very complex interaction between the observed
variables, the value should be large. For PPO and SAC: between 32 and 512 is
recommended.
■ learning_rate: Initial learning rate for gradient descent, corresponding to the strength of
each gradient descent update step. Recommended values: This value should be decreased
if training is unstable and reward does not increase consistently. For PPO and SAC: between
0.00001 and 0.001 is recommended.
■ learning_rate_schedule: Indicates how learning rate changes over time: linear decay
the learning rate linearly, reaching 0 at max_steps, constant indicates a constant learning
rate for the entire experiment run. Recommended values: For PPO linear decay is
recommended so learning converges more stably. For SAC a constant learning rate is
preferred so the agent can continue to learn until its Q function converges naturally.
■ max_steps: Total number of training steps that must be collected from the game simulation
before ending the training process. Recommended values: For PPO and SAC: between
500,000 and 10,000,000 is recommended.
■ normalize: Set to true to apply normalization to the observation space vector.
Recommended values: Normalization can be helpful in complex continuous control problems
but can be harmful to simpler discrete control problems.
■ num_layers: Number of hidden layers in the neural network after the observation input.
Recommended values: A few layers are likely to train faster on simpler problems, for more
complex control problems more layers may be necessary. For PPO and SAC: between 1 and
3 is recommended.
■ time_horizon: Indicates how many training steps must be collected per-agent before
adding it to the experience buffer. When the limit is reached before the end of an episode, to
predict the overall expected reward from the agent’s current state, a value estimate is used.
This parameter represents the trade-off between a less biased but higher variance estimate
(long time horizon) and a more biased but lower variance estimate (short time horizon).
Recommended values: For PPO and SAC: between 32 and 2,048 is recommended.
■ use_recurrent: Set to true to enable agents to use memory using Recurrent Neural
Networks (RNN), specifically Long Short-Term Memory (LSTM) [56] Hochreiter et al. (1997).
LSTM does not work well with continuous action space vectors, so should be used with
discrete vectors.
■ memory_size: Size of the memory the agent must keep. In order to use an LSTM, training
requires a sequence of experiences instead of single experiences. Corresponds to the size of
the array of floating-point numbers used to store the hidden state of the recurrent neural
network of the policy. Recommended values: Must be a multiple of 2 and should scale with
the amount of information the agent needs to remember in order to complete the task
successfully. For PPO and SAC: between 32 and 256 is recommended.

■ sequence_length: Indicates how long the sequences of experiences must be while training.
Recommended values: If this value is too small the agent will not be able to remember things
over longer periods of time, if it is too large the neural network will take longer to train. For
PPO and SAC: between 4 and 128 is recommended.
■ strength: Factor by which to multiply the rewards coming from the environment (extrinsic
rewards). Recommended values: 1.0
■ gamma: Discount factor for future rewards coming from the environment, represents how far
into the future the agent should care about possible rewards. Must be strictly smaller than 1.
Recommended values: In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large, in cases when rewards
are more immediate it should be smaller. For PPO and SAC: between 0.8 and 0.995 is
recommended.
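To illustrate the effect of gamma described in the last item, the following minimal sketch (an illustration only, not part of the toolkit) computes a discounted return and shows how strongly distant rewards are attenuated:

def discounted_return(rewards, gamma=0.99):
    # work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.99 a reward 100 steps away is weighted by 0.99**100 ~ 0.37,
# while with gamma = 0.8 the same reward is weighted by only 0.8**100 ~ 2e-10.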

5.2. Proximal Policy Optimization Hyperparameters


■ beta: Indicates the strength of the entropy regularization, which makes the policy “more
random”, ensuring that agents properly explore the action space during training. It should be
adjusted so the entropy slowly decreases alongside increases in reward. Recommended
values: For PPO and SAC: between 0.0001 and 0.01 is recommended.
■ epsilon: Influences how rapidly the policy can evolve during training, corresponds to the
acceptable threshold of divergence between the old and new policies during gradient descent
updating. Recommended values: Small values will result in more stable updates but also will
slow the training process. For PPO and SAC: between 0.1 and 0.3 is recommended.
■ lambd: Regularization parameter used when calculating the Generalized Advantage Estimate
(GAE). It can be seen as how much the agent relies on its current value estimate when
calculating an updated value estimate. Low values correspond to relying more on the current
value estimate (high bias), and high values correspond to relying more on actual rewards
received in the environment (high variance). Recommended values: between 0.9 and 0.95 is
recommended.
■ num_epoch: Number of passes to make through the experience buffer when performing
gradient descent optimization. Decreasing this parameter will ensure more stable updates at
the cost of slower learning. Recommended values: between 3 and 10 is recommended.
5.3. Soft Actor-Critic Hyperparameters
■ buffer_init_steps: Number of experiences to collect into the buffer before updating the
policy model. Since the untrained policy is fairly random, prefilling the buffer with random
actions is useful for exploration. Recommended values: between 1,000 and 10,000 is
recommended.
■ init_entcoef: Represents how much the agent should explore at the beginning of training.
Corresponds to the initial entropy coefficient, which incentivizes the agent to take entropic
actions to facilitate better exploration. This coefficient is automatically adjusted [55] Haarnoja
et al. (2019) to a preset target entropy. Recommended values: For continuous SAC: between
0.5 and 1.0 is recommended, for discrete SAC: 0.05 to 0.5.
■ save_replay_buffer: If set to true, the experience replay buffer, as well as the model, is saved
and loaded when quitting and restarting training. This can help resuming training go more smoothly,
as the collected experiences will not be wiped.
■ tau: Indicates how aggressively to update the target network used for bootstrapping value
estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC
model update. In SAC there are two neural networks: the target and the policy. The target
network is used to bootstrap the policy’s estimate of the future rewards at a given state and is
fixed while the policy is being updated, this target is then slowly updated according to tau.
Recommended values: Typically, this value should be left at 0.005. For simple problems,
increasing tau to 0.01 could reduce the time it takes to learn, at the cost of stability.
■ steps_per_update: Average ratio of agent actions taken to updates made of the agent’s
policy. In SAC a single update corresponds to grabbing a batch of size batch_size from the
experience replay buffer and using this mini batch to update the models. Typically, this
parameter should be greater than or equal to 1. Setting a lower value will improve sample
efficiency (reducing the number of steps required to train) but will increase the CPU time spent
performing updates. Recommended values: between 1 and 20 is recommended.

5.4. Hyperparameters Selection


Initially we decided to run all our experiments using both Proximal Policy Optimization (PPO) and
Soft Actor-Critic (SAC) algorithms, using the recommended values for each hyperparameter. The
specific hyperparameter values for each algorithm are provided below:

Contents of trainer_config.yaml file for experiments using Proximal Policy Optimization:

# Proximal Policy Optimization


SoccerAcademy:

# Common parameters
trainer: ppo
summary_freq: 10000
batch_size: 5120
buffer_size: 512000
hidden_units: 512
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 100000000
normalize: false
num_layers: 3
time_horizon: 1024
use_recurrent: false
memory_size: 128
sequence_length: 128
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

# PPO hyperparameters
beta: 0.005
epsilon: 0.2
lambd: 0.95
num_epoch: 3
Contents of trainer_config.yaml file for experiments using Soft Actor-Critic:

# Soft Actor-Critic
SoccerAcademy:

# Common parameters
trainer: sac
summary_freq: 10000
batch_size: 512
buffer_size: 512000
hidden_units: 512
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 100000000
normalize: false
num_layers: 3
time_horizon: 1024
use_recurrent: false
memory_size: 128
sequence_length: 128
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

# SAC hyperparameters
buffer_init_steps: 5000
init_entcoef: 0.5
save_replay_buffer: false
tau: 0.005
steps_per_update: 5

6. Case Study: Toy Soccer Game


6.1. Learning Environment
The next step in our work is the selection of a video game to run our experiments. The Unity ML-
Agents Toolkit [40] Juliani et al. (2020) includes a set of 15 preconfigured 3D game environments
specially designed to serve as challenges for game AI researchers, and test novel machine
learning techniques, especially reinforcement learning ones. After inspecting all the learning
environments, we selected one that fulfilled all our requirements: Soccer Twos. This environment
is a toy soccer game that has 2 agents of opposite teams, blue and red, that compete against
each other in a 30 x 15 meter rectangular field called the pitch, which has one goal at each end. The
objective of each agent is to push the ball into the opponent's goal to win the soccer match. The
pitch is surrounded by a wall to prevent the ball from leaving the field. At the beginning of the
match, the ball starts at the center of the pitch, and both agents are placed 2 meters away from
the ball, then the agents can move at a maximum speed of 2 meters per second, and they can
push the ball in any direction but they cannot throw it into the air. A screenshot of Soccer Twos
environment is shown in Figure 6-1.

Figure 6-1: Screenshot of learning environment Soccer Twos [40] Juliani et al. (2020).
Since the environment is a simplification of the real soccer game, some elements are not present:
there are no other players (including goalkeepers), no team manager, and no referee. Also, some
soccer rules do not apply, such as yellow or red cards for fouls committed by the players, or special
ways of restarting the game like penalty kicks or corner kicks.

We constrained the environment to 15,000 iterations, also called frames or simulation steps; if no
agent is able to score a goal before the iteration limit is reached, a draw is declared and the match
finishes. A modern six-core CPU usually executes between 60 and 120 frames per second, which
means a match could last from 125 to 250 seconds if no agent scores a goal. The scenario
includes a trained reinforcement learning model that can be used by both agents; in our case we
used this model on the red agent only. All our training experiments were run only on the blue
agent, with the red agent serving only as the opponent to defeat.

We configured the environment to be able to run up to 10 soccer matches simultaneously, in order


to run our experiments faster, taking advantage of the Simultaneous Single-Agent scenario
provided by the Unity ML-Agents Toolkit [40] Juliani et al. (2020). A screenshot of 10 soccer
matches running simultaneously is shown in Figure 6-2.

Figure 6-2: Screenshot of 10 soccer matches running simultaneously in a Unity scene.



6.2. Observation Space


Each agent, represented in the environment as a cuboid with little decorations resembling eyes
and a mouth, can observe its surroundings using collision rays that end in a collision detection
sphere, each ray is shot in a particular direction to detect the distance to other objects. The agent
uses 3 rays shot backward separated by 45 degrees, and 11 rays shot forward distributed across
120 degrees. Figure 6-3 shows the rays and spheres used by the agent to perceive its
surroundings.

Figure 6-3: Rays detecting objects around the blue agent.

Each ray can detect 6 types of objects:

■ The soccer ball


■ The walls enclosing the soccer field
■ Agents from the same team (blue color)
■ Agents from the opposite team (red color)
■ Its own goal (blue color)
■ The opponent’s goal (red color)

Figure 6-4 shows all the objects that can be detected by the agent.
Figure 6-4: Objects that can be detected by the agent: top-left: Soccer ball, top-center: Agents of
the same team as the agent in training, top-right: Agents of the opposite team, bottom-left: Walls
of the soccer field, bottom-center: Agent’s own goal, bottom-right: Opposite team’s goal.

For each type of object, the x, y, and z coordinates plus the distance to the object are detected. In
total, the agent has an Observation Space vector of 336 variables per iteration:

14 𝑟𝑎𝑦𝑠 ∗ 6 𝑡𝑦𝑝𝑒𝑠 𝑜𝑓 𝑜𝑏𝑗𝑒𝑐𝑡𝑠 ∗ [ (𝑥, 𝑦, 𝑧) + 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 𝑡𝑜 𝑜𝑏𝑗𝑒𝑐𝑡 ] = 336 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠

Since the game can run at 60 to 120 frames per second, each variable is updated 60 to
120 times per second. This is how the agent perceives its surroundings.

6.3. Action Space


The Action Space is a vector composed of 3 branched actions the agent can use to explore
and interact with the environment:

■ Frontal motion: If this value is greater than 0 the agent moves forward, if the value is less than
zero it moves backward, and if the value is equal to zero no frontal motion is done.
■ Lateral motion: If this value is greater than 0 the agent moves to its right, if the value is less
than zero it moves to its left, and if the value is equal to zero no lateral motion is done.
■ Rotation: If this value is greater than 0 the agent rotates around its Y-axis clockwise, if the
value is less than zero it rotates around its Y-axis counterclockwise, if the value is equal to
zero no rotation is done.

The maximum lateral or frontal speed is constrained to 2 meters per second. Figure 6-5 shows the
3 actions the agent can use to move inside the soccer field.

Figure 6-5: Possible movements that the agent can perform: left: lateral motion, center: frontal
motion, right: rotation around its Y-axis.
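As a rough illustration (an assumption about the mechanics, not the environment's actual code) of how these three branched actions could be turned into motion each simulation step:

MAX_SPEED = 2.0  # meters per second, the cap mentioned above

def apply_action(frontal, lateral, rotation, dt):
    # each branch value is interpreted only by its sign: > 0, < 0 or == 0
    forward_velocity = MAX_SPEED * (1 if frontal > 0 else -1 if frontal < 0 else 0)
    lateral_velocity = MAX_SPEED * (1 if lateral > 0 else -1 if lateral < 0 else 0)
    spin_direction = 1 if rotation > 0 else -1 if rotation < 0 else 0  # clockwise around Y
    # displacement along each axis during this step, plus the rotation direction
    return forward_velocity * dt, lateral_velocity * dt, spin_direction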

6.4. Reward Signals


Reward signals are the rules used to reward and punish the agent for its actions in order to guide
learning; they are also the way we measure its performance in the game. The maximum reward value
in a match is 1, and the maximum punishment value is -1. The reward rules applied to our agent are:

■ -1: When the opponent agent scores a goal.


■ (1 - accumulated_time_penalty): When the blue agent scores a goal.
■ No reward or punishment is given to either agent if the match time runs out.

The value accumulated_time_penalty is incremented by (1/MaxSteps) every iteration and
is reset to 0 at the beginning of a new match. The penalty is defined this way to encourage our
agent to score goals faster. MaxSteps is equal to 15,000 simulation steps.
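A minimal sketch of these reward rules, assuming the step counter and goal flags are provided by the environment (this mirrors the rules above, it is not the environment's source code):

MAX_STEPS = 15000

def step_reward(step, blue_scored, red_scored):
    # accumulated_time_penalty grows by 1/MaxSteps each iteration
    time_penalty = step / MAX_STEPS
    if blue_scored:
        return 1.0 - time_penalty  # faster goals earn a larger reward
    if red_scored:
        return -1.0                # punishment when the opponent scores
    return 0.0                     # no reward or punishment on a draw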

The main method of measuring the agent’s performance over multiple soccer matches is the
mean cumulative reward, which tells us on average whether the agent is scoring goals or not. A mean
value close to 1 means that our agent is winning almost every match, outperforming its opponent;
a value close to -1 means it is losing almost every time; a value close to 0 means that most matches
end in a draw, with neither agent scoring goals, which indicates that both have low game performance.
A mean value close to 0.5 means that both agents are able to score goals at similar rates.
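The measure itself is simple; a minimal sketch, assuming we keep the final cumulative reward of each finished match, is shown below:

def mean_cumulative_reward(match_rewards):
    # match_rewards: list of final cumulative rewards, one entry per finished match
    return sum(match_rewards) / len(match_rewards) if match_rewards else 0.0

# e.g. [1.0, -1.0, 0.8, 0.0] -> 0.2; values close to 0.5 indicate both agents
# are scoring goals at similar rates.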

We do not have information about the trained model that the red agent is using, we do not know
which reinforcement learning technique was used, or which learning environment configuration
and hyperparameters were set, but it is safe to assume that it was trained using one of the state-
of-the-art reinforcement learning algorithms included in the toolkit, using an optimal set of
hyperparameters, since the learning environment chosen was designed to serve as a challenge
to test new ML algorithms and techniques.

We wanted to see how the mean cumulative reward behaves if we use the same trained model on both
the blue and red agents and make them compete against each other; it turned out that the mean
cumulative reward stays close to 0.5 over time, meaning both agents perform well and score goals
at similar rates, each one having a 50% chance of winning the game.

In this context it is safe to assume that if one of our experiments ends up with a mean cumulative
reward value tending to 0.5, the training was successful, and the agent is performing at least as
well as the red agent which is using the trained model. Since the blue agent has an opponent that
is already trained to play in an optimal way, it is expected that at the beginning of the experiments
the mean cumulative reward value starts at zero, meaning the blue agent is being outperformed
by the red agent. Figure 6-6 shows the expected behavior of mean cumulative reward values over
time for the blue agent in a successful training lesson.

Figure 6-6: Expected behavior of mean cumulative reward values over time for the blue agent in
a successful training lesson.

What we wanted to evaluate in this work is whether, by using the curriculum learning technique in
the training process, the mean cumulative reward gets closer to 0.5 faster than without a
curriculum; in other words, whether this technique makes the agent learn faster.

6.5. Environment Parameters


To run experiments with curriculum learning using the Unity ML-Agents Toolkit [18] Juliani et al.
(2020), in addition to the trainer_config.yaml configuration file we must provide a
curriculum.yaml file that contains the parameters we want to vary over time to control how
difficult it is for our agent to learn how to play; the idea is to vary those parameters in a way that
allows the agent to learn faster. These variations are discrete and are divided into lessons.

The difficulty our agent faces when learning at the beginning of the training is linked directly to the
performance of the opponent agent: the weaker the opponent is, the easier it is for our agent to learn,
so it makes sense to design training curriculums that alter the opponent's performance to give our
agent an initial advantage over it.
We adapted the case study to support the variation of 2 environment parameters that alter the
opponent’s behavior, and one that alters the reward signals:

■ opponent_exist: At the beginning of the training the blue agent needs to learn how to
move inside the pitch and how to rotate and move towards the ball, but if the red agent is
present, this learning process is interrupted, because the red agent will score a goal in a very
short time, almost immediately, since it is already trained on how to do it efficiently. In this
context, we want to be able to remove the opponent from the field while our agent is mastering
how to move efficiently towards the ball and push it towards the opponent's goal. After the
agent has learned how to play alone on the field, we want to restore the opponent, so our
agent can start learning how to defeat it.
■ opponent_speed: We believe modifying the movement speed of the red agent will give the
blue agent a clear advantage over its opponent: the slower the red agent is, the more
opportunities the blue agent will have to learn how to score goals, since it will be able to reach
the ball faster than the red agent. As soon as our agent learns how to defeat a slow
version of its opponent, we want to increase the opponent’s speed so it can become a harder
player to overcome.
■ ball_touch_reward: We want to give an additional reward to our agent every single time
it touches the ball, because we wonder if this will incentivize our agent to move toward the
ball faster, but this approach has an important risk to consider: This can teach the agent how
to seek the ball very quickly but not necessarily how to score goals. This option has an
additional problem: the mean cumulative reward values will be altered by the additional
reward, which will invalidate any performance comparison we want to make against
another agent's cumulative reward values.

In the case of parameter opponent_exist a value of 1 means the opponent does exist, and a
value of 0 means the opponent does not exist. For opponent_speed parameter, a value of 0
means the opponent cannot move. The maximum movement speed of all agents is 2 meters per
second, so a greater value for the opponent_speed parameter will be rounded down to 2.

For each curriculum we want to experiment with, a curriculum.yaml file must be provided,
defining which of those parameters will vary over time and the points at which these variations
occur, called lesson thresholds.

Below you can find an example of a curriculum file that illustrates how a curriculum is defined:

# Curriculum example
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1]
parameters:
opponent_speed: [0.0, 2.0]

In this example, the curriculum sets the opponent’s movement speed to 0 meters per second at
the beginning of the training, then, when 10% of the total matches have been played, the speed
will be incremented to 2 meters per second. It is worth mentioning that a curriculum can control just
one, or all three, environment parameters at the same time. Also, in the example we have only one
lesson threshold, but we can define an arbitrary number of thresholds if we want.
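Our understanding of how these thresholds select the active lesson can be sketched as follows; this is an assumption about the mechanism for illustration, not the toolkit's code:

def current_lesson(progress, thresholds):
    # progress: fraction of max_steps completed so far, in the range [0, 1]
    lesson = 0
    for t in thresholds:
        if progress >= t:
            lesson += 1
    return lesson

thresholds = [0.1]
opponent_speed = [0.0, 2.0]
print(opponent_speed[current_lesson(0.05, thresholds)])  # 0.0 before 10% of the training
print(opponent_speed[current_lesson(0.25, thresholds)])  # 2.0 afterwards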
7. Experimental Results
7.1. Experimental Setup
We used Google Cloud Platform to execute all our experiments, specifically Google
Compute Engine (GCE), an Infrastructure as a Service (IaaS) solution used by Google's
applications including Gmail and YouTube. GCE enables its users to launch Virtual Machines
(VMs) on demand that can be accessed using Secure Shell (SSH). We activated 2 virtual
machines running Fedora 10, a popular Linux distribution. Each virtual machine had a virtual 8-
core Intel CPU, 16 GB of RAM, and a 10 GB hard drive.

We created a bash file called SetupVirtualMachine.sh containing a set of terminal
commands to set up the machines to run the experiments; this setup includes the installation of
all required third-party libraries, the installation of Unity and the ML-Agents Toolkit, the activation
of the Unity license, and the cloning of the repository containing the learning environment. After
running all these commands, the virtual machine is ready to run our experiments. This file can be
found here: https://github.com/rsaenzi/master-thesis.

To start an experiment in a virtual machine, we run the following terminal command:

mlagents-learn 'trainer_config.yaml' --run-id='EXPERIMENT_NAME'


--curriculum 'curriculum.yaml' --env='LEARNING_ENVIRONMENT_PATH'
--num-envs 1 --no-graphics
The --no-graphics parameter disables rendering, so the training can run faster.

7.2. Control Experiments


Each experiment in this work consisted of 100 million soccer matches. In the first two experiments
we used the Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) algorithms
without setting a curriculum, so these two serve as control experiments.

Scatter plots showing the mean cumulative reward over the 100 million matches for each
algorithm are shown in Figures 7-1 and 7-2. A Gaussian filter was applied to reduce noise, and we
drew a green dotted line at the 0.5 reward value to help visualize where the cumulative reward
should be after running all the soccer matches for the training to be considered successful:

Figure 7-1: Mean cumulative reward for Proximal Policy Optimization over 100 million matches.

Figure 7-2: Mean cumulative reward for Soft Actor-Critic over 100 million soccer matches.
In the experiment with Proximal Policy Optimization (PPO), we got a result similar to our initial
hypothesis about how the mean cumulative reward values should vary over time in a successful
training session. On the other hand, with Soft Actor-Critic (SAC) we got results that indicate a
poor agent performance because the mean values tend to -1.
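The smoothing step mentioned above was applied before plotting; a minimal sketch using SciPy's 1-D Gaussian filter is shown below, where the sigma value is an assumption, not necessarily the one we used:

from scipy.ndimage import gaussian_filter1d
import matplotlib.pyplot as plt

def plot_smoothed(steps, mean_cumulative_rewards, sigma=5):
    # reduce noise in the reward curve before plotting
    smoothed = gaussian_filter1d(mean_cumulative_rewards, sigma=sigma)
    plt.scatter(steps, smoothed, s=2)
    plt.axhline(0.5, color="green", linestyle="dotted")  # target reward level
    plt.xlabel("Soccer matches")
    plt.ylabel("Mean cumulative reward")
    plt.show()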

7.3. Preliminary Experiments


We wanted to try different combinations of environment parameters and thresholds. Initially, we
designed 2 curriculums named A and B: the first uses the opponent_exist and
opponent_speed parameters, the second uses only ball_touch_reward.

With curriculum A we wanted to test whether our agent learns faster by removing the opponent during
the first 10% of the 100 million soccer matches; after that the opponent is present but with
limited movement speed, first having no speed at all, then incrementing its speed linearly in
steps of 0.25 meters per second every 10% of the total matches until reaching its full speed
at 90% of the training.

Contents of the curriculum.yaml file for curriculum A are shown below:

# Curriculum A
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
parameters:
opponent_speed:
[0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
opponent_exist:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

In Figures 7-3 and 7-4 the mean cumulative reward results for curriculum A are plotted in blue,
alongside the PPO and SAC control experiments for comparison. The vertical dotted lines in
blue represent the curriculum thresholds, where one or more environment parameters were
changed; the dotted line is wider where the curriculum no longer has any effect, in this case after
90% of the matches.

Figure 7-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum A over 100 million soccer matches.

Figure 7-4: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum A over 100 million soccer matches.
In the scatter plot for Proximal Policy Optimization only vs Proximal Policy Optimization and
Curriculum A, we can notice that our agent performs well while the opponent is not present or its
movement speed is reduced, but starting from 50% of the matches there are performance
drops every time the opponent increments its movement speed, and after 90% the agent is
slightly outperformed by the control case. When we train the agent using PPO only, we get a
mean cumulative reward of around 0.45 at the end of the training, but using the curriculum we get
around 0.35, which means the curriculum is hurting the performance.

In the case of Soft Actor-Critic only vs Soft Actor-Critic and Curriculum A, it is evident that the
curriculum has a positive impact only on the first 10% of the training; this means the agent learns
very quickly how to score goals when the opponent is not present, but after 20% the agent performs
the same as or even worse than the control case. In the end, from 90% of the training onward, there
is no noticeable difference in the agent's performance between the curriculum and the control; both
perform very badly.

For curriculum B, in addition to the default reward signal established previously, we wanted to give
our agent a bonus reward for touching the ball, a reward that is reduced over time. In the first 10%
of the 100 million soccer matches we give a reward of 0.3, then we reduce the reward by 0.1
every 10% of the total matches until there is no additional reward at 30% of the training.

Contents of the curriculum.yaml file for curriculum B are shown below:

# Curriculum B
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1, 0.2, 0.3]
parameters:
ball_touch_reward: [0.3, 0.2, 0.1, 0.0]
Figure 7-5 and 7-6 show the mean cumulative reward results for curriculum B compared to the
control experiments.

Figure 7-5: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum B over 100 million soccer matches.

Figure 7-6: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum B over 100 million soccer matches.
In the experiment using curriculum B with Soft Actor-Critic, we get results similar to those of
curriculum A: a peak of good performance in the first 10% of the soccer matches, but then
a bad performance overall. On the other hand, for Proximal Policy Optimization we get better
performance all the time, not only when curriculum B is active, but even after 30% of the training
when it is not. The blue line crosses the 0.5 cumulative reward limit around 60% of training, clearly
outperforming the agent that was trained in the control experiment, whose line does not cross the
0.5 limit, although it gets very close to it.

The importance of this experiment is that, in some cases like this one, we can get the same or
even better training results in fewer soccer matches: only 60 million matches are required to reach
good performance if we use curriculum B, but around 120 million would be required to get the same
results using PPO alone.

It is important to emphasize that the cumulative reward values before 30% of the training are
altered by the additional reward we give the agent for touching the ball, so in the
range of 0% to 30% the cumulative reward cannot be considered a valid benchmark of the
agent's performance and cannot be used for comparison.

After getting these results we decided to combine curriculums A and B into a single one, to test
whether their effects on the training process combine in a way that gives even better results than
each one separately. We named this combination curriculum A+B; its definition is shown below
and its results in Figures 7-7 and 7-8:

# Curriculum A+B
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
parameters:
opponent_speed:
[0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
opponent_exist:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
ball_touch_reward:
[0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Figure 7-7: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum A+B over 100 million soccer matches.

Figure 7-8: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum A+B over 100 million soccer matches.
In the case of Proximal Policy Optimization, the results with curriculum A+B are worse than the ones
obtained with curriculums A and B independently: we get a mean cumulative reward below 0 at
the end of the training, which means the agent has not been able to learn how to score goals in a
consistent way. In the case of Soft Actor-Critic, the performance is very low, not only in the control
experiment but also when using a curriculum, so at this point we decided not to continue using
SAC.

7.4. Curriculums with 3 and 4 Lessons


After running the preliminary experiments we realized that each experiment could take from 10 to
30 days to finish, so it is not feasible, at least for us, to test all possible combinations of thresholds
and values for 1, 2, and 3 environment parameters; the costs are prohibitive. What we did instead
was experiment with a small set of relevant curriculums, trying to infer any useful information
from their results.

Since we got good performance in one of the experiments with 3 lessons, we decided to
run more experiments, not only with 3 but also with 4 lessons, each one having a different
combination of environment parameters and thresholds. Later we tested curriculums with 1, 5, 6,
and 9 lessons.

In the following pages the curriculums we designed with 3 and 4 lessons are presented, using
only Proximal Policy Optimization (PPO) as the reinforcement learning algorithm, and the mean
cumulative reward as the performance measure. All of them are compared against the PPO
control experiment. We assigned a letter to each experiment in random order, so the letters have no
special meaning; the order in which we present the experiment results is not relevant either.

Contents of the curriculum.yaml file for curriculum D are shown below:

# Curriculum D
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1, 0.2, 0.3, 0.4]
parameters:
opponent_speed: [0.0, 0.5, 1.0, 1.5, 2.0]

Figure 7-9: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum D over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum G are shown below:

# Curriculum G
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1, 0.2, 0.3, 0.4]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-10: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum G over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum H are shown below:

# Curriculum H
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-11: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum H over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum P are shown below:

# Curriculum P
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.25, 1.25, 1.75, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-12: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum P over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum Q are shown below:

# Curriculum Q
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.06, 0.10, 0.22, 0.34]
parameters:
opponent_speed: [0.0, 0.25, 1.25, 1.75, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-13: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Q over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum R are shown below:

# Curriculum R
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.06, 0.14, 0.26, 0.40]
parameters:
opponent_speed: [0.0, 0.50, 1.25, 1.75, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-14: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum R over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum U are shown below:

# Curriculum U
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.15, 0.30, 0.45]
parameters:
ball_touch_reward: [0.5, 0.25, 0.1, 0.0]

Figure 7-15: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum U over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum W are shown below:

# Curriculum W
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
ball_touch_reward: [0.3, 0.2, 0.1, 0.0, 0.0]

Figure 7-16: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum W over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum X are shown below:

# Curriculum X
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
ball_touch_reward: [0.2, 0.1, 0.0, 0.0, 0.0]

Figure 7-17: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum X over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum Y are shown below:

# Curriculum Y
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
ball_touch_reward: [0.2, 0.1, 0.0, 0.0, 0.0]

Figure 7-18: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Y over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum Z are shown below:

# Curriculum Z
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.07, 0.15, 0.30, 0.45]
parameters:
opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
ball_touch_reward: [0.3, 0.2, 0.1, 0.0, 0.0]

Figure 7-19: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Z over 100 million soccer matches.
7.5. Curriculums with 5 and 6 Lessons
Contents of the curriculum.yaml file for curriculum C are shown below:

# Curriculum C
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1, 0.2, 0.3, 0.4, 0.5]
parameters:
opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-20: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum C over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum J are shown below:

# Curriculum J
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.10, 0.15, 0.25, 0.35, 0.45, 0.55]
parameters:
opponent_speed:
[0.0, 0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
opponent_exist:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-21: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum J over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum M are shown below:

# Curriculum M
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.08, 0.16, 0.24, 0.32, 0.40]
parameters:
opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-22: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum M over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum N are shown below:

# Curriculum N
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.08, 0.12, 0.20, 0.28, 0.36]
parameters:
opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-23: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum N over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum O are shown below:

# Curriculum O
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.06, 0.10, 0.16, 0.24, 0.32]
parameters:
opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-24: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum O over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum V are shown below:

# Curriculum V
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1, 0.2, 0.3, 0.4, 0.5]
parameters:
ball_touch_reward: [0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

Figure 7-25: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum V over 100 million soccer matches.
7.6. Curriculums with 1 and 9 Lessons
Despite the bad performance we got when using a 9-lesson curriculum in our preliminary
experiments, we considered it pertinent to run a few more 9-lesson experiments to see if it was
possible to get better performance. Also, we wanted to test 1-lesson curriculums, an edge case
that was not considered in our preliminary experiments.

Contents of the curriculum.yaml file for curriculum E are shown below:

# Curriculum E
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1]
parameters:
opponent_speed: [0.0, 2.0]
opponent_exist: [0.0, 1.0]

Figure 7-26: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum E over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum F are shown below:

# Curriculum F
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds: [0.1]
parameters:
opponent_speed: [0.0, 2.0]

Figure 7-27: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum F over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum K are shown below:

# Curriculum K
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds:
[0.08, 0.14, 0.20, 0.26, 0.32, 0.38, 0.44, 0.50, 0.56]
parameters:
opponent_speed:
[0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
opponent_exist:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-28: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum K over 100 million soccer matches.

Contents of the curriculum.yaml file for curriculum L are shown below:

# Curriculum L
SoccerAcademy:
measure: progress
min_lesson_length: 100
signal_smoothing: true
thresholds:
[0.08, 0.12, 0.16, 0.20, 0.24, 0.30, 0.36, 0.42, 0.48]
parameters:
opponent_speed:
[0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
opponent_exist:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Figure 7-29: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum L over 100 million soccer matches.
8. Discussion
After completing all experiments, we grouped the curriculums according to their performance
when compared against the Proximal Policy Optimization (PPO) control experiment, so we
could analyze each group and identify common patterns:

8.1. Curriculums with the Lowest Performance


The first group contains all curriculums that underperformed the control experiment; in this group
we have curriculums A, A+B, D, E, F, G, W, and Z. Figure 8-1 shows the mean cumulative reward
results for all these curriculums.

Figure 8-1: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had lower performance than the control experiment at the end of 100
million soccer matches.

We noticed a few patterns when we analyzed these curriculums:

■ Curriculums E and F belong to this group; both have a single threshold at 10% of the training,
at which the opponent's movement speed is incremented from 0 to 2 meters per second. It
seems that this sudden change in velocity does not give the agent enough time to learn
how to play.
■ Curriculums D and G have 4 thresholds, at 10%, 20%, 30%, and 40% of the training; at the
last 3 thresholds the opponent's movement speed is incremented linearly, adding 0.5 meters
per second each time. This result suggests that a linear increment in velocity over the last
thresholds of the curriculum does not provide good performance results.
■ Curriculums W and Z use both the ball_touch_reward and opponent_speed environment
parameters to encourage the agent to learn, but this is not a good strategy because the
agent seems to overfit: it learns how to move towards the ball quickly while the opponent is slow,
but does not necessarily learn how to score goals efficiently.
■ Curriculums A and A+B have a large gap between thresholds, in both cases 10% of the training;
this scenario seems to be bad for the agent because it plays under the initial favorable
environment conditions for too long, making it unable to adapt to new, harder environment
conditions.

8.2. Curriculums with Same Performance


The second group contains all curriculums that performed almost the same as the control experiment
at the end of the training; in this group we have curriculums J, M, Q, and V. Figure 8-2
shows the mean cumulative reward results for all these curriculums.

In this group most of the curriculums do not have a consistent increase rate in the opponent's
movement speed, and the distribution of the thresholds does not follow a pattern. The results indicate
that this is not a good strategy in terms of the agent's performance.

In the case of curriculum V, we give an additional reward to our agent, starting at 0.5 at the beginning
of the training and decreasing by 0.1 every 10% of the training. The bad
performance of this curriculum suggests that the agent learns how to reach the ball quickly, since
the reward for this action is large, but then is unable to learn how to push it towards the opponent's
goal.
Figure 8-2: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had the same performance as the control experiment at the end of
100 million soccer matches.

8.3. Curriculums with the Highest Performance


The third group contains all curriculums that performed better than the control experiment at the
end of the training; in this group we have curriculums B, C, H, K, L, N, O, P, R, U, X, and Y. Figure
8-3 shows the mean cumulative reward results for all these curriculums.

In this group we noticed a few interesting things:

■ Curriculums B and U suggest that giving an additional reward to the agent can be beneficial,
but only if the reward decreases rapidly during the training, to prevent the agent from learning
to rely only on touching the ball to get a reward.
■ Several curriculums in this group suggest that it is important for the agent to have a strong
advantage over its opponent, but only over a short period of time. In most of these curriculums the
opponent is not present for the first 10% of the training time, giving the agent the opportunity to
learn how to move inside the pitch and how to move the soccer ball around; then the
opponent's movement speed is incremented, not in a linear way but rather in a logarithmic
way, which prevents the agent from overfitting to the initial conditions.

Figure 8-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had higher performance than the control experiment at the end of
100 million soccer matches.

Here we list the times at which the mean cumulative reward reaches the optimal value of 0.5:

■ Curriculum B: At 60% of the training time


■ Curriculum H: At 82% of the training time
■ Curriculum K: At 95% of the training time
■ Curriculum N: At 82% of the training time
■ Curriculum O: At 90% of the training time
■ Curriculum P: At 95% of the training time
■ Curriculum U: At 90% of the training time
■ Curriculum X: At 75% of the training time
■ Curriculum Y: At 95% of the training time
We can notice that the best curriculums are B, X, H, and N. Curriculums H and N each require
around 18% less training time to get similar results when compared to the control experiment,
curriculum X requires around 25% less time, and curriculum B requires around 40% less time.
These savings are actually larger, since the control experiment does not reach the 0.5 value after
100 million matches; extrapolating its values, it is safe to say that the control experiment
would reach the 0.5 limit at around 120 million matches. Figure 8-4 shows the mean cumulative
reward results only for curriculums B, X, H, and N.

Figure 8-4: Mean cumulative reward for Proximal Policy Optimization control experiment vs the
best 4 curriculum experiments compared over 100 million soccer matches.

9. Conclusions
Several experiments with curriculum learning were executed to measure its effects on the
training process of an agent learning to play a video game using reinforcement learning.
The initial hypothesis was that, in some cases, using a curriculum would allow the agent to learn
faster. Our results indicate that curriculum learning can have a significant impact on the
learning process, in some cases helping the agent learn faster and outperform the control
experiment, and in other cases having a negative impact, making the agent learn more slowly.

In this work 24 curriculums were designed, each one having a different configuration of learning
environment parameters and thresholds. We were able to infer several patterns from the training
results that could indicate whether a curriculum will improve or hurt the agent's performance,
measured by the mean cumulative reward. In 12 experiments we got better performance than the
control experiment, and the best curriculum saved up to 40% of the training time required to reach
an optimal performance level.

In this work we used only two algorithms in our experiments, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC); it would be interesting to try more reinforcement learning algorithms, or even supervised learning ones, to test our main hypothesis. In the case of Soft Actor-Critic (SAC), a hyperparameter optimization process is needed to determine whether it can work well with the case study we chose.

We used a learning environment included in the Unity ML-Agents Toolkit [18] Juliani et al. (2020) as a case study: Soccer Twos, a toy soccer video game. It would be interesting to test the curriculum learning technique on a wider variety of video games, including 2D and 3D, first-person and third-person games, of multiple genres such as strategy, adventure, role-playing, and puzzle, to see whether this technique has a bigger impact on a particular genre.
10. References
[1] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning.
Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 41–48.
https://dl.acm.org/doi/10.1145/1553374.1553380
[2] Elman, J. L. (1993). Learning and development in neural networks: The importance of
starting small. Cognition, 48, 71–99. https://doi.org/10.1016/S0010-0277(02)00106-3
[3] Harris, C. (1991). Parallel distributed processing models and metaphors for language and
development. Ph.D. dissertation, University of California, San Diego.
https://elibrary.ru/item.asp?id=5839109
[4] Juliani, Arthur. (2017, December 8). Introducing ML-Agents Toolkit v0.2: Curriculum
Learning, new environments, and more. https://blogs.unity3d.com/2017/12/08/introducing-ml-
agents-v0-2-curriculum-learning-new-environments-and-more/
[5] Gulcehre, C., Moczulski, M., Visin, F., & Bengio, Y. (2019). Mollifying networks. 5th
International Conference on Learning Representations, ICLR 2017 - Conference Track
Proceedings. http://arxiv.org/abs/1608.04980
[6] Allgower, E. L., & Georg, K. (2003). Introduction to numerical continuation methods. In
Classics in Applied Mathematics (Vol. 45). Colorado State University.
https://doi.org/10.1137/1.9780898719154
[7] Justesen, N., Bontrager, P., Togelius, J., & Risi, S. (2017). Deep Learning for Video Game
Playing. IEEE Transactions on Games, 12(1), 1–20. https://doi.org/10.1109/tg.2019.2896986
[8] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning
environment: An evaluation platform for general agents. IJCAI International Joint Conference on
Artificial Intelligence, 2013, 4148–4152. https://doi.org/10.1613/jair.3912
[9] Montfort, N., & Bogost, I. (2009). Racing the beam: The Atari video computer system. MIT
Press, Cambridge Massachusetts.
https://pdfs.semanticscholar.org/2e91/086740f228934e05c3de97f01bc58368d313.pdf
[10] Bhonker, N., Rozenberg, S., & Hubara, I. (2017). Playing SNES in the Retro Learning
Environment. https://arxiv.org/pdf/1611.02205.pdf
[11] Buşoniu, L., Babuška, R., & De Schutter, B. (2010). Multi-agent reinforcement learning:
An overview. Studies in Computational Intelligence, 310, 183–221. https://doi.org/10.1007/978-3-
642-14435-6_7
[12] Kempka, M., Wydmuch, M., Runc, G., Toczek, J., & Jaskowski, W. (2016). ViZDoom: A
Doom-based AI research platform for visual reinforcement learning. IEEE Conference on
Computational Intelligence and Games, CIG, 0. https://doi.org/10.1109/CIG.2016.7860433
[13] Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq,
A., Green, S., Valdés, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A.,
Bolton, A., Gaffney, S., King, H., Hassabis, D., … Petersen, S. (2016). DeepMind Lab.
https://arxiv.org/pdf/1612.03801.pdf
[14] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The malmo platform for
artificial intelligence experimentation. Twenty-Fifth International Joint Conference on Artificial
Intelligence (IJCAI-16), 2016-January, 4246–4247. http://stella.sourceforge.net/
[15] Synnaeve, G., Nardelli, N., Auvolat, A., Chintala, S., Lacroix, T., Lin, Z., Richoux, F., &
Usunier, N. (2016). TorchCraft: a Library for Machine Learning Research on Real-Time Strategy
Games. https://arxiv.org/pdf/1611.00625.pdf
[16] Silva, V. do N., & Chaimowicz, L. (2017). MOBA: a New Arena for Game AI.
https://arxiv.org/pdf/1705.10443.pdf
[17] Karpov, I. V., Sheblak, J., & Miikkulainen, R. (2008). OpenNERO: A game platform for AI
research and education. Proceedings of the 4th Artificial Intelligence and Interactive Digital
Entertainment Conference, AIIDE 2008, 220–221.
https://www.aaai.org/Papers/AIIDE/2008/AIIDE08-038.pdf
[18] Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y.,
Henry, H., Mattar, M., & Lange, D. (2020). Unity: A General Platform for Intelligent Agents.
https://arxiv.org/pdf/1809.02627.pdf
[19] Juliani, A. (2017). Introducing: Unity Machine Learning Agents Toolkit.
https://blogs.unity3d.com/2017/09/19/introducing-unity-machine-learning-agents/
[20] Alpaydin, E. (2010). Introduction to Machine Learning. In Massachusetts Institute of
Technology (Second Edition). The MIT Press.
https://kkpatel7.files.wordpress.com/2015/04/alppaydin_machinelearning_2010
[21] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second
Edition). The MIT Press. http://incompleteideas.net/sutton/book/RLbook2018.pdf
[22] Wolfshaar, J. Van De. (2017). Deep Reinforcement Learning of Video Games [University
of Groningen, The Netherlands].
http://fse.studenttheses.ub.rug.nl/15851/1/Artificial_Intelligence_Deep_R_1.pdf
[23] Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence.
Minds and Machines, 17(4), 391–444. https://doi.org/10.1007/s11023-007-9079-x
[24] Schaul, T., Togelius, J., & Schmidhuber, J. (2011). Measuring Intelligence through
Games. https://arxiv.org/pdf/1109.1314.pdf
[25] Ortega, D. B., & Alonso, J. B. (2015). Machine Learning Applied to Pac-Man [Barcelona
School of Informatics]. https://upcommons.upc.edu/bitstream/handle/2099.1/26448/108745.pdf
[26] Lample, G., & Chaplot, D. S. (2016). Playing FPS Games with Deep Reinforcement
Learning. https://arxiv.org/pdf/1609.05521.pdf
[27] Adil, K., Jiang, F., Liu, S., Grigorev, A., Gupta, B. B., & Rho, S. (2017). Training an Agent
for FPS Doom Game using Visual Reinforcement Learning and VizDoom. In (IJACSA)
International Journal of Advanced Computer Science and Applications (Vol. 8, Issue 12).
https://pdfs.semanticscholar.org/74c3/5bb13e71cdd8b5a553a7e65d9ed125ce958e.pdf
[28] Wang, E., Kosson, A., & Mu, T. (2017). Deep Action Conditional Neural Network for Frame
Prediction in Atari Games. http://cs231n.stanford.edu/reports/2017/pdfs/602.pdf
[29] Karttunen, J., Kanervisto, A., Kyrki, V., & Hautamäki, V. (2020). From Video Game to Real
Robot: The Transfer between Action Spaces. 5. https://arxiv.org/pdf/1905.00741.pdf
[30] Martinez, M., Sitawarin, C., Finch, K., Meincke, L., Yablonski, A., & Kornhauser, A. (2017).
Beyond Grand Theft Auto V for Training, Testing and Enhancing Deep Learning in Self Driving
Cars [Princeton University]. https://arxiv.org/pdf/1712.01397.pdf
[31] Singh, S., Barto, A. G., & Chentanez, N. (2005). Intrinsically Motivated Reinforcement
Learning. http://www.cs.cornell.edu/~helou/IMRL.pdf
[32] Rockstar Games. (2020). https://www.rockstargames.com/
[33] Mattar, M., Shih, J., Berges, V.-P., Elion, C., & Goy, C. (2020). Announcing ML-Agents
Unity Package v1.0! Unity Blog. https://blogs.unity3d.com/2020/05/12/announcing-ml-agents-
unity-package-v1-0/
[34] Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. In Encyclopedia of
Optimization. Springer US. https://doi.org/10.1007/978-0-387-74759-0_440
[35] Shao, K., Tang, Z., Zhu, Y., Li, N., & Zhao, D. (2019). A Survey of Deep Reinforcement
Learning in Video Games. https://arxiv.org/pdf/1912.10944.pdf
[36] Wu, Y., & Tian, Y. (2017). Training agent for first-person shooter game with actor-critic
curriculum learning. ICLR 2017, 10. https://openreview.net/pdf?id=Hk3mPK5gg
[37] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., &
Kavukcuoglu, K. (2016, February 4). Asynchronous Methods for Deep Reinforcement Learning.
33rd International Conference on Machine Learning. https://arxiv.org/pdf/1602.01783.pdf
[38] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017, August 19). A
Brief Survey of Deep Reinforcement Learning. IEEE Signal Processing Magazine.
https://doi.org/10.1109/MSP.2017.2743240
[39] Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., & Stone, P. (2020).
Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey.
https://arxiv.org/pdf/2003.04960.pdf
[40] Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y.,
Henry, H., Mattar, M., & Lange, D. (2020). Unity ML-Agents Toolkit. https://github.com/Unity-
Technologies/ml-agents
[41] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy
Optimization Algorithms. https://arxiv.org/pdf/1707.06347.pdf
[42] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy
Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
https://arxiv.org/pdf/1801.01290.pdf
[43] Weng, L. (2018). A (Long) Peek into Reinforcement Learning. Lil Log.
https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
[44] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540), 529–533.
https://doi.org/10.1038/nature14236
[45] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014).
Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on
Machine Learning. https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
[46] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra,
D. (2016, September 9). Continuous control with deep reinforcement learning. ICLR 2016.
https://arxiv.org/pdf/1509.02971.pdf
[47] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Tb, D., Muldal,
A., Heess, N., & Lillicrap, T. (2018). Distributed distributional deterministic policy gradients. ICLR
2018. https://openreview.net/pdf?id=SyZipzbCb
[48] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015, February 19). Trust
Region Policy Optimization. Proceeding of the 31st International Conference on Machine
Learning. https://arxiv.org/pdf/1502.05477.pdf
[49] Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N.
(2017, November 3). Sample Efficient Actor-Critic with Experience Replay. ICLR 2017.
https://arxiv.org/pdf/1611.01224.pdf
[50] Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region method
for deep reinforcement learning using Kronecker-factored approximation.
https://arxiv.org/pdf/1708.05144.pdf
[51] Fujimoto, S., van Hoof, H., & Meger, D. (2018, February 26). Addressing Function
Approximation Error in Actor-Critic Methods. Proceedings of the 35th International Conference on
Machine Learning. https://arxiv.org/pdf/1802.09477.pdf
[52] Liu, Y., Ramachandran, P., Liu, Q., & Peng, J. (2017). Stein Variational Policy Gradient.
https://arxiv.org/pdf/1704.02399.pdf
[53] Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu,
V., Harley, T., Dunning, I., Legg, S., & Kavukcuoglu, K. (2018). IMPALA: Scalable Distributed
Deep-RL with Importance Weighted Actor-Learner Architectures.
https://arxiv.org/pdf/1802.01561.pdf
[54] Schulman, J., Klimov, O., Wolski, F., Dhariwal, P., & Radford, A. (2017). Proximal Policy
Optimization. https://openai.com/blog/openai-baselines-ppo/
[55] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H.,
Gupta, A., Abbeel, P., & Levine, S. (2019). Soft Actor-Critic Algorithms and Applications.
https://arxiv.org/pdf/1812.05905.pdf
[56] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation,
9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[57] Wydmuch, M., Kempka, M., & Jaskowski, W. (2018). ViZDoom Competitions: Playing
Doom from Pixels. IEEE Transactions on Games, 11(3), 248–259.
https://doi.org/10.1109/tg.2018.2877047
