Download as pdf or txt
Download as pdf or txt
You are on page 1of 70

The Art of Reinforcement Learning:

Fundamentals, Mathematics, and


Implementations with Python 1st
Edition Michael Hu
Visit to download the full and correct content document:
https://ebookmeta.com/product/the-art-of-reinforcement-learning-fundamentals-mathe
matics-and-implementations-with-python-1st-edition-michael-hu-2/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

The Art of Reinforcement Learning: Fundamentals,


Mathematics, and Implementations with Python 1st
Edition Michael Hu

https://ebookmeta.com/product/the-art-of-reinforcement-learning-
fundamentals-mathematics-and-implementations-with-python-1st-
edition-michael-hu/

Python AI Programming: Navigating fundamentals of ML,


deep learning, NLP, and reinforcement learning in
practice Patrick J

https://ebookmeta.com/product/python-ai-programming-navigating-
fundamentals-of-ml-deep-learning-nlp-and-reinforcement-learning-
in-practice-patrick-j/

Deep Reinforcement Learning with Python: With PyTorch,


TensorFlow and OpenAI Gym 1st Edition Nimish Sanghi

https://ebookmeta.com/product/deep-reinforcement-learning-with-
python-with-pytorch-tensorflow-and-openai-gym-1st-edition-nimish-
sanghi-3/

Deep Reinforcement Learning with Python With PyTorch


TensorFlow and OpenAI Gym 1st Edition Nimish Sanghi

https://ebookmeta.com/product/deep-reinforcement-learning-with-
python-with-pytorch-tensorflow-and-openai-gym-1st-edition-nimish-
sanghi/
Deep Reinforcement Learning with Python With PyTorch
TensorFlow and OpenAI Gym 1st Edition Nimish Sanghi

https://ebookmeta.com/product/deep-reinforcement-learning-with-
python-with-pytorch-tensorflow-and-openai-gym-1st-edition-nimish-
sanghi-2/

Primary Mathematics 3A Hoerst

https://ebookmeta.com/product/primary-mathematics-3a-hoerst/

Cambridge IGCSE and O Level History Workbook 2C - Depth


Study: the United States, 1919-41 2nd Edition Benjamin
Harrison

https://ebookmeta.com/product/cambridge-igcse-and-o-level-
history-workbook-2c-depth-study-the-united-states-1919-41-2nd-
edition-benjamin-harrison/

Mastering Python Forensics master the art of digital


forensics and analysis with Python 1st Edition Michael
Spreitzenbarth Johann Uhrmann

https://ebookmeta.com/product/mastering-python-forensics-master-
the-art-of-digital-forensics-and-analysis-with-python-1st-
edition-michael-spreitzenbarth-johann-uhrmann/

Reinforcement Learning: Optimal Feedback Control with


Industrial Applications Jinna Li

https://ebookmeta.com/product/reinforcement-learning-optimal-
feedback-control-with-industrial-applications-jinna-li/
Michael Hu

The Art of Reinforcement Learning


Fundamentals, Mathematics, and Implementations
with Python
Michael Hu
Shanghai, Shanghai, China

ISBN 978-1-4842-9605-9 e-ISBN 978-1-4842-9606-6


https://doi.org/10.1007/978-1-4842-9606-6

© Michael Hu 2023

Apress Standard

The use of general descriptive names, registered names, trademarks,


service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors, and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress


Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
To my beloved family,
This book is dedicated to each of you, who have been a constant source of
love and support throughout my writing journey.
To my hardworking parents, whose tireless efforts in raising us have been
truly remarkable. Thank you for nurturing my dreams and instilling in
me a love for knowledge. Your unwavering dedication has played a
pivotal role in my accomplishments.
To my sisters and their children, your presence and love have brought
immense joy and inspiration to my life. I am grateful for the laughter and
shared moments that have sparked my creativity.
And to my loving wife, your consistent support and understanding have
been my guiding light. Thank you for standing by me through the highs
and lows, and for being my biggest cheerleader.
—Michael Hu
Preface
Reinforcement learning (RL) is a highly promising yet challenging
subfield of artificial intelligence (AI) that plays a crucial role in shaping
the future of intelligent systems. From robotics and autonomous agents
to recommendation systems and strategic decision-making, RL enables
machines to learn and adapt through interactions with their
environment. Its remarkable success stories include RL agents
achieving human-level performance in video games and even
surpassing world champions in strategic board games like Go. These
achievements highlight the immense potential of RL in solving complex
problems and pushing the boundaries of AI.
What sets RL apart from other AI subfields is its fundamental
approach: agents learn by interacting with the environment, mirroring
how humans acquire knowledge. However, RL poses challenges that
distinguish it from other AI disciplines. Unlike methods that rely on
precollected training data, RL agents generate their own training
samples. These agents are not explicitly instructed on how to achieve a
goal; instead, they receive state representations of the environment and
a reward signal, forcing them to explore and discover optimal strategies
on their own. Moreover, RL involves complex mathematics that
underpin the formulation and solution of RL problems.
While numerous books on RL exist, they typically fall into two
categories. The first category emphasizes the fundamentals and
mathematics of RL, serving as reference material for researchers and
university students. However, these books often lack implementation
details. The second category focuses on practical hands-on coding of RL
algorithms, neglecting the underlying theory and mathematics. This
apparent gap between theory and implementation prompted us to
create this book, aiming to strike a balance by equally emphasizing
fundamentals, mathematics, and the implementation of successful RL
algorithms.
This book is designed to be accessible and informative for a diverse
audience. It is targeted toward researchers, university students, and
practitioners seeking a comprehensive understanding of RL. By
following a structured approach, the book equips readers with the
necessary knowledge and tools to apply RL techniques effectively in
various domains.
The book is divided into four parts, each building upon the previous
one. Part I focuses on the fundamentals and mathematics of RL, which
form the foundation for almost all discussed algorithms. We begin by
solving simple RL problems using tabular methods. Chapter 2, the
cornerstone of this part, explores Markov decision processes (MDPs)
and the associated value functions, which are recurring concepts
throughout the book. Chapters 3 to 5 delve deeper into these
fundamental concepts by discussing how to use dynamic programming
(DP), Monte Carlo methods, and temporal difference (TD) learning
methods to solve small MDPs.
Part II tackles the challenge of solving large-scale RL problems that
render tabular methods infeasible due to their complexity (e.g., large or
infinite state spaces). Here, we shift our focus to value function
approximation, with particular emphasis on leveraging (deep) neural
networks. Chapter 6 provides a brief introduction to linear value
function approximation, while Chap. 7 delves into the renowned Deep
Q-Network (DQN) algorithm. In Chap. 8, we discuss enhancements to
the DQN algorithm.
Part III explores policy-based methods as an alternative approach to
solving RL problems. While Parts I and II primarily focus on value-
based methods (learning the value function), Part III concentrates on
learning the policy directly. We delve into the theory behind policy
gradient methods and the REINFORCE algorithm in Chap. 9.
Additionally, we explore Actor-Critic algorithms, which combine policy-
based and value-based approaches, in Chap. 10. Furthermore, Chap. 11
covers advanced policy-based algorithms, including surrogate objective
functions and the renowned Proximal Policy Optimization (PPO)
algorithm.
The final part of the book addresses advanced RL topics. Chapter 12
discusses how distributed RL can enhance agent performance, while
Chap. 13 explores the challenges of hard-to-explore RL problems and
presents curiosity-driven exploration as a potential solution. In the
concluding chapter, Chap. 14, we delve into model-based RL by
providing a comprehensive examination of the famous AlphaZero
algorithm.
Unlike a typical hands-on coding handbook, this book does not
primarily focus on coding exercises. Instead, we dedicate our resources
and time to explaining the fundamentals and core ideas behind each
algorithm. Nevertheless, we provide complete source code for all
examples and algorithms discussed in the book. Our code
implementations are done from scratch, without relying on third-party
RL libraries, except for essential tools like Python, OpenAI Gym, Numpy,
and the PyTorch deep learning framework. While third-party RL
libraries expedite the implementation process in real-world scenarios,
we believe coding each algorithm independently is the best approach
for learning RL fundamentals and mastering the various RL algorithms.
Throughout the book, we employ mathematical notations and
equations, which some readers may perceive as heavy. However, we
prioritize intuition over rigorous proofs, making the material accessible
to a broader audience. A foundational understanding of calculus at a
basic college level, minimal familiarity with linear algebra, and
elementary knowledge of probability and statistics are sufficient to
embark on this journey. We strive to ensure that interested readers
from diverse backgrounds can benefit from the book’s content.
We assume that readers have programming experience in Python
since all the source code is written in this language. While we briefly
cover the basics of deep learning in Chap. 7, including neural networks
and their workings, we recommend some prior familiarity with
machine learning, specifically deep learning concepts such as training a
deep neural network. However, beyond the introductory coverage,
readers can explore additional resources and materials to expand their
knowledge of deep learning.
This book draws inspiration from Reinforcement Learning: An
Introduction by Richard S. Sutton and Andrew G. Barto, a renowned RL
publication. Additionally, it is influenced by prestigious university RL
courses, particularly the mathematical style and notation derived from
Professor Emma Brunskill’s RL course at Stanford University. Although
our approach may differ slightly from Sutton and Barto’s work, we
strive to provide simpler explanations. Additionally, we have derived
some examples from Professor David Silver’s RL course at University
College London, which offers a comprehensive resource for
understanding the fundamentals presented in Part I. We would like to
express our gratitude to Professor Dimitri P. Bertsekas for his
invaluable guidance and inspiration in the field of optimal control and
reinforcement learning. Furthermore, the content of this book
incorporates valuable insights from research papers published by
various organizations and individual researchers.
In conclusion, this book aims to bridge the gap between the
fundamental concepts, mathematics, and practical implementation of
RL algorithms. By striking a balance between theory and
implementation, we provide readers with a comprehensive
understanding of RL, empowering them to apply these techniques in
various domains. We present the necessary mathematics and offer
complete source code for implementation to help readers gain a deep
understanding of RL principles. We hope this book serves as a valuable
resource for readers seeking to explore the fundamentals, mathematics,
and practical aspects of RL algorithms. We must acknowledge that
despite careful editing from our editors and multiple round of reviews,
we cannot guarantee the book’s content is error free. Your feedback and
corrections are invaluable to us. Please do not hesitate to contact us
with any concerns or suggestions for improvement.

Source Code
You can download the source code used in this book from github.​com/​
apress/​art-of-reinforcement-lear ning.
Michael Hu
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub (https://​github.​
com/​Apress). For more detailed information, please visit https://​www.​
apress.​com/​gp/​services/​source-code.
Contents
Part I Foundation
1 Introduction
1.​1 AI Breakthrough in Games
1.​2 What Is Reinforcement Learning
1.​3 Agent-Environment in Reinforcement Learning
1.​4 Examples of Reinforcement Learning
1.​5 Common Terms in Reinforcement Learning
1.​6 Why Study Reinforcement Learning
1.​7 The Challenges in Reinforcement Learning
1.​8 Summary
References
2 Markov Decision Processes
2.​1 Overview of MDP
2.​2 Model Reinforcement Learning Problem Using MDP
2.​3 Markov Process or Markov Chain
2.​4 Markov Reward Process
2.​5 Markov Decision Process
2.​6 Alternative Bellman Equations for Value Functions
2.​7 Optimal Policy and Optimal Value Functions
2.​8 Summary
References
3 Dynamic Programming
3.​1 Use DP to Solve MRP Problem
3.​2 Policy Evaluation
3.​3 Policy Improvement
3.​4 Policy Iteration
3.​5 General Policy Iteration
3.​6 Value Iteration
3.​7 Summary
References
4 Monte Carlo Methods
4.​1 Monte Carlo Policy Evaluation
4.​2 Incremental Update
4.​3 Exploration vs.​Exploitation
4.​4 Monte Carlo Control (Policy Improvement)
4.​5 Summary
References
5 Temporal Difference Learning
5.​1 Temporal Difference Learning
5.​2 Temporal Difference Policy Evaluation
5.3 Simplified 𝜖-Greedy Policy for Exploration
5.​4 TD Control—SARSA
5.​5 On-Policy vs.​Off-Policy
5.​6 Q-Learning
5.​7 Double Q-Learning
5.​8 N-Step Bootstrapping
5.​9 Summary
References
Part II Value Function Approximation
6 Linear Value Function Approximation
6.​1 The Challenge of Large-Scale MDPs
6.​2 Value Function Approximation
6.​3 Stochastic Gradient Descent
6.​4 Linear Value Function Approximation
6.​5 Summary
References
7 Nonlinear Value Function Approximation
7.​1 Neural Networks
7.​2 Training Neural Networks
7.​3 Policy Evaluation with Neural Networks
7.​4 Naive Deep Q-Learning
7.​5 Deep Q-Learning with Experience Replay and Target
Network
7.​6 DQN for Atari Games
7.​7 Summary
References
8 Improvements to DQN
8.​1 DQN with Double Q-Learning
8.​2 Prioritized Experience Replay
8.​3 Advantage function and Dueling Network Architecture
8.​4 Summary
References
Part III Policy Approximation
9 Policy Gradient Methods
9.​1 Policy-Based Methods
9.​2 Policy Gradient
9.​3 REINFORCE
9.​4 REINFORCE with Baseline
9.​5 Actor-Critic
9.​6 Using Entropy to Encourage Exploration
9.​7 Summary
References
10 Problems with Continuous Action Space
10.​1 The Challenges of Problems with Continuous Action Space
10.​2 MuJoCo Environments
10.​3 Policy Gradient for Problems with Continuous Action
Space
10.​4 Summary
References
11 Advanced Policy Gradient Methods
11.​1 Problems with the Standard Policy Gradient Methods
11.​2 Policy Performance Bounds
11.​3 Proximal Policy Optimization
11.​4 Summary
References
Part IV Advanced Topics
12 Distributed Reinforcement Learning
12.​1 Why Use Distributed Reinforcement Learning
12.​2 General Distributed Reinforcement Learning Architecture
12.​3 Data Parallelism for Distributed Reinforcement Learning
12.​4 Summary
References
13 Curiosity-Driven Exploration
13.​1 Hard-to-Explore Problems vs.​Sparse Reward Problems
13.​2 Curiosity-Driven Exploration
13.​3 Random Network Distillation
13.​4 Summary
References
14 Planning with a Model:​AlphaZero
14.​1 Why We Need to Plan in Reinforcement Learning
14.​2 Monte Carlo Tree Search
14.​3 AlphaZero
14.​4 Training AlphaZero on a 9 × 9 Go Board
14.​5 Training AlphaZero on a 13 × 13 Gomoku Board
14.​6 Summary
References
Index
About the Author
Michael Hu
is an exceptional software engineer with a wealth of
expertise spanning over a decade, specializing in the
design and implementation of enterprise-level
applications. His current focus revolves around leveraging
the power of machine learning (ML) and artificial
intelligence (AI) to revolutionize operational systems
within enterprises. A true coding enthusiast, Michael finds
solace in the realms of mathematics and continuously
explores cutting-edge technologies, particularly machine learning and
deep learning. His unwavering passion lies in the realm of deep
reinforcement learning, where he constantly seeks to push the
boundaries of knowledge. Demonstrating his commitment to the field,
he has built various numerous open source projects on GitHub that
closely emulate state-of-the-art reinforcement learning algorithms
pioneered by DeepMind, including notable examples like AlphaZero,
MuZero, and Agent57. Through these projects, Michael demonstrates
his commitment to advancing the field and sharing his knowledge with
fellow enthusiasts. He currently resides in the city of Shanghai, China.
About the Technical Reviewer
Shovon Sengupta
has over 14 years of expertise and a deepened
understanding of advanced predictive analytics, machine
learning, deep learning, and reinforcement learning. He
has established a place for himself by creating innovative
financial solutions that have won numerous awards. He is
currently working for one of the leading multinational
financial services corporations in the United States as the
Principal Data Scientist at the AI Center of Excellence. His job entails
leading innovative initiatives that rely on artificial intelligence to
address challenging business problems. He has a US patent (United
States Patent: Sengupta et al.: Automated Predictive Call Routing Using
Reinforcement Learning [US 10,356,244 B1]) to his credit. He is also a
Ph.D. scholar at BITS Pilani. He has reviewed quite a few popular titles
from leading publishers like Packt and Apress and has also authored a
few courses for Packt and CodeRed (EC-Council) in the realm of
machine learning. Apart from that, he has presented at various
international conferences on machine learning, time series forecasting,
and building trustworthy AI. His primary research is concentrated on
deep reinforcement learning, deep learning, natural language
processing (NLP), knowledge graph, causality analysis, and time series
analysis. For more details about Shovon’s work, please check out his
LinkedIn page: www.​linkedin.​com/​in/​shovon-sengupta-272aa917.
Part I
Foundation
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2023
M. Hu, The Art of Reinforcement Learning
https://doi.org/10.1007/978-1-4842-9606-6_1

1. Introduction
Michael Hu1
(1) Shanghai, Shanghai, China

Artificial intelligence has made impressive progress in recent years,


with breakthroughs achieved in areas such as image recognition,
natural language processing, and playing games. In particular,
reinforcement learning, a type of machine learning that focuses on
learning by interacting with an environment, has led to remarkable
achievements in the field.
In this book, we focus on the combination of reinforcement learning
and deep neural networks, which have become central to the success of
agents that can master complex games such as board game Go and Atari
video games.
This first chapter provides an overview of reinforcement learning,
including key concepts such as states, rewards, policies, and the
common terms used in reinforcement learning, like the difference
between episodic and continuing reinforcement learning problems,
model-free vs. model-based methods.
Despite the impressive progress in the field, reinforcement learning
still faces significant challenges. For example, it can be difficult to learn
from sparse rewards, and the methods can suffer from instability.
Additionally, scaling to large state and action spaces can be a challenge.
Throughout this book, we will explore these concepts in greater
detail and discuss state-of-the-art techniques used to address these
challenges. By the end of this book, you will have a comprehensive
understanding of the principles of reinforcement learning and how they
can be applied to real-world problems.
We hope this introduction has sparked your curiosity about the
potential of reinforcement learning, and we invite you to join us on this
journey of discovery.

Fig. 1.1 A DQN agent learning to play Atari’s Breakout. The goal of the game is to
use a paddle to bounce a ball up and break through a wall of bricks. The agent only
takes in the raw pixels from the screen, and it has to figure out what’s the right action
to take in order to maximize the score. Idea adapted from Mnih et al. [1]. Game
owned by Atari Interactive, Inc.

1.1 AI Breakthrough in Games


Atari
The Atari 2600 is a home video game console developed by Atari
Interactive, Inc. in the 1970s. It features a collection of iconic video
games. These games, such as Pong, Breakout, Space Invaders, and Pac-
Man, have become classic examples of early video gaming culture. In
this platform, players can interact with these classic games using a
joystick controller.
The breakthrough in Atari games came in 2015 when Mnih et al. [1]
from DeepMind developed an AI agent called DQN to play a list of Atari
video games, some even better than humans.
What makes the DQN agent so influential is how it was trained to
play the game. Similar to a human player, the agent was only given the
raw pixel image of the screen as inputs, as illustrated in Fig. 1.1, and it
has to figure out the rules of the game all by itself and decide what to do
during the game to maximize the score. No human expert knowledge,
such as predefined rules or sample games of human play, was given to
the agent.
The DQN agent is a type of reinforcement learning agent that learns
by interacting with an environment and receiving a reward signal. In
the case of Atari games, the DQN agent receives a score for each action
it takes.
Mnih et al. [1] trained and tested their DQN agents on 57 Atari video
games. They trained one DQN agent for one Atari game, with each agent
playing only the game it was trained on; the training was over millions
of frames. The DQN agent can play half of the games (30 of 57 games) at
or better than a human player, as shown by Mnih et al. [1]. This means
that the agent was able to learn and develop strategies that were better
than what a human player could come up with.
Since then, various organizations and researchers have made
improvements to the DQN agent, incorporating several new techniques.
The Atari video games have become one of the most used test beds for
evaluating the performance of reinforcement learning agents and
algorithms. The Arcade Learning Environment (ALE) [2], which
provides an interface to hundreds of Atari 2600 game environments, is
commonly used by researchers for training and testing reinforcement
learning agents.
In summary, the Atari video games have become a classic example
of early video gaming culture, and the Atari 2600 platform provides a
rich environment for training agents in the field of reinforcement
learning. The breakthrough of DeepMind’s DQN agent, trained and
tested on 57 Atari video games, demonstrated the capability of an AI
agent to learn and make decisions through trial-and-error interactions
with classic games. This breakthrough has spurred many
improvements and advancements in the field of reinforcement learning,
and the Atari games have become a popular test bed for evaluating the
performance of reinforcement learning algorithms.

Go
Go is an ancient Chinese strategy board game played by two players,
who take turns laying pieces of stones on a 19x19 board with the goal
of surrounding more territory than the opponent. Each player has a set
of black or white stones, and the game begins with an empty board.
Players alternate placing stones on the board, with the black player
going first.

Fig. 1.2 Yoda Norimoto (black) vs. Kiyonari Tetsuya (white), Go game from the 66th
NHK Cup, 2018. White won by 0.5 points. Game record from CWI [4]
The stones are placed on the intersections of the lines on the board,
rather than in the squares. Once a stone is placed on the board, it
cannot be moved, but it can be captured by the opponent if it is
completely surrounded by their stones. Stones that are surrounded and
captured are removed from the board.
The game continues until both players pass, at which point the
territory on the board is counted. A player’s territory is the set of empty
intersections that are completely surrounded by their stones, plus any
captured stones. The player with the larger territory wins the game. In
the case of the final board position shown in Fig. 1.2, the white won by
0.5 points.
Although the rules of the game are relatively simple, the game is
extremely complex. For instance, the number of legal board positions in
Go is enormously large compared to Chess. According to research by
Tromp and Farnebä ck [3], the number of legal board positions in Go is
approximately , which is vastly greater than the number of
atoms in the universe.
This complexity presents a significant challenge for artificial
intelligence (AI) agents that attempt to play Go. In March 2016, an AI
agent called AlphaGo developed by Silver et al. [5] from DeepMind
made history by beating the legendary Korean player Lee Sedol with a
score of 4-1 in Go. Lee Sedol is a winner of 18 world titles and is
considered one of the greatest Go player of the past decade. AlphaGo’s
victory was remarkable because it used a combination of deep neural
networks and tree search algorithms, as well as the technique of
reinforcement learning.
AlphaGo was trained using a combination of supervised learning
from human expert games and reinforcement learning from games of
self-play. This training enabled the agent to develop creative and
innovative moves that surprised both Lee Sedol and the Go community.
The success of AlphaGo has sparked renewed interest in the field of
reinforcement learning and has demonstrated the potential for AI to
solve complex problems that were once thought to be the exclusive
domain of human intelligence. One year later, Silver et al. [6] from
DeepMind introduced a new and more powerful agent, AlphaGo Zero.
AlphaGo Zero was trained using pure self-play, without any human
expert moves in its training, achieving a higher level of play than the
previous AlphaGo agent. They also made other improvements like
simplifying the training processes.
To evaluate the performance of the new agent, they set it to play
games against the exact same AlphaGo agent that beat the world
champion Lee Sedol in 2016, and this time the new AlphaGo Zero beats
AlphaGo with score 100-0.
In the following year, Schrittwieser et al. [7] from DeepMind
generalized the AlphaGo Zero agent to play not only Go but also other
board games like Chess and Shogi (Japanese chess), and they called this
generalized agent AlphaZero. AlphaZero is a more general
reinforcement learning algorithm that can be applied to a variety of
board games, not just Go, Chess, and Shogi.
Reinforcement learning is a type of machine learning in which an
agent learns to make decisions based on the feedback it receives from
its environment. Both DQN and AlphaGo (and its successor) agents use
this technique, and their achievements are very impressive. Although
these agents are designed to play games, this does not mean that
reinforcement learning is only capable of playing games. In fact, there
are many more challenging problems in the real world, such as
navigating a robot, driving an autonomous car, and automating web
advertising. Games are relatively easy to simulate and implement
compared to these other real-world problems, but reinforcement
learning has the potential to be applied to a wide range of complex
challenges beyond game playing.

1.2 What Is Reinforcement Learning


In computer science, reinforcement learning is a subfield of machine
learning that focuses on learning how to act in a world or an
environment. The goal of reinforcement learning is for an agent to learn
from interacting with the environment in order to make a sequence of
decisions that maximize accumulated reward in the long run. This
process is known as goal-directed learning.
Unlike other machine learning approaches like supervised learning,
reinforcement learning does not rely on labeled data to learn from.
Instead, the agent must learn through trial and error, without being
directly told the rules of the environment or what action to take at any
given moment. This makes reinforcement learning a powerful tool for
modeling and solving real-world problems where the rules and optimal
actions may not be known or easily determined.
Reinforcement learning is not limited to computer science, however.
Similar ideas are studied in other fields under different names, such as
operations research and optimal control in engineering. While the
specific methods and details may vary, the underlying principles of
goal-directed learning and decision-making are the same.
Examples of reinforcement learning in the real world are all around
us. Human beings, for example, are naturally good at learning from
interacting with the world around us. From learning to walk as a baby
to learning to speak our native language to learning to drive a car, we
learn through trial and error and by receiving feedback from the
environment. Similarly, animals can also be trained to perform a variety
of tasks through a process similar to reinforcement learning. For
instance, service dogs can be trained to assist individuals in
wheelchairs, while police dogs can be trained to help search for missing
people.
One vivid example that illustrates the idea of reinforcement learning
is a video of a dog with a big stick in its mouth trying to cross a narrow
bridge.1 The video shows the dog attempting to pass the bridge, but
failing multiple times. However, after some trial and error, the dog
eventually discovers that by tilting its head, it can pass the bridge with
its favorite stick. This simple example demonstrates the power of
reinforcement learning in solving complex problems by learning from
the environment through trial and error.

1.3 Agent-Environment in Reinforcement Learning


Reinforcement learning is a type of machine learning that focuses on
how an agent can learn to make optimal decisions by interacting with
an environment. The agent-environment loop is the core of
reinforcement learning, as shown in Fig. 1.3. In this loop, the agent
observes the state of the environment and a reward signal, takes an
action, and receives a new state and reward signal from the
environment. This process continues iteratively, with the agent learning
from the rewards it receives and adjusting its actions to maximize
future rewards.

Environment
The environment is the world in which the agent operates. It can be a
physical system, such as a robot navigating a maze, or a virtual
environment, such as a game or a simulation. The environment
provides the agent with two pieces of information: the state of the
environment and a reward signal. The state describes the relevant
information about the environment that the agent needs to make a
decision, such as the position of the robot or the cards in a poker game.
The reward signal is a scalar value that indicates how well the agent is
doing in its task. The agent’s objective is to maximize its cumulative
reward over time.

Fig. 1.3 Top: Agent-environment in reinforcement learning in a loop. Bottom: The


loop unrolled by time

The environment has its own set of rules, which determine how the
state and reward signal change based on the agent’s actions. These
rules are often called the dynamics of the environment. In many cases,
the agent does not have access to the underlying dynamics of the
environment and must learn them through trial and error. This is
similar to how we humans interact with the physical world every day,
normally we have a pretty good sense of what’s going on around us, but
it’s difficult to fully understand the dynamics of the universe.
Game environments are a popular choice for reinforcement learning
because they provide a clear objective and well-defined rules. For
example, a reinforcement learning agent could learn to play the game of
Pong by observing the screen and receiving a reward signal based on
whether it wins or loses the game.
In a robotic environment, the agent is a robot that must learn to
navigate a physical space or perform a task. For example, a
reinforcement learning agent could learn to navigate a maze by using
sensors to detect its surroundings and receiving a reward signal based
on how quickly it reaches the end of the maze.

State
In reinforcement learning, an environment state or simply state is the
statistical data provided by the environment to represent the current
state of the environment. The state can be discrete or continuous. For
instance, when driving a stick shift car, the speed of the car is a
continuous variable, while the current gear is a discrete variable.
Ideally, the environment state should contain all relevant
information that’s necessary for the agent to make decisions. For
example, in a single-player video game like Breakout, the pixels of
frames of the game contain all the information necessary for the agent
to make a decision. Similarly, in an autonomous driving scenario, the
sensor data from the car’s cameras, lidar, and other sensors provide
relevant information about the surrounding environment.
However, in practice, the available information may depend on the
task and domain. In a two-player board game like Go, for instance,
although we have perfect information about the board position, we
don’t have perfect knowledge about the opponent player, such as what
they are thinking in their head or what their next move will be. This
makes the state representation more challenging in such scenarios.
Furthermore, the environment state might also include noisy data.
For example, a reinforcement learning agent driving an autonomous car
might use multiple cameras at different angles to capture images of the
surrounding area. Suppose the car is driving near a park on a windy
day. In that case, the onboard cameras could also capture images of
some trees in the park that are swaying in the wind. Since the
movement of these trees should not affect the agent’s ability to drive,
because the trees are inside the park and not on the road or near the
road, we can consider these movements of the trees as noise to the self-
driving agent. However, it can be challenging to ignore them from the
captured images. To tackle this problem, researchers might use various
techniques such as filtering and smoothing to eliminate the noisy data
and obtain a cleaner representation of the environment state.

Reward
In reinforcement learning, the reward signal is a numerical value that
the environment provides to the agent after the agent takes some
action. The reward can be any numerical value, positive, negative, or
zero. However, in practice, the reward function often varies from task to
task, and we need to carefully design a reward function that is specific
to our reinforcement learning problem.
Designing an appropriate reward function is crucial for the success
of the agent. The reward function should be designed to encourage the
agent to take actions that will ultimately lead to achieving our desired
goal. For example, in the game of Go, the reward is 0 at every step
before the game is over, and +1 or if the agent wins or loses the
game, respectively. This design incentivizes the agent to win the game,
without explicitly telling it how to win.
Similarly, in the game of Breakout, the reward can be a positive
number if the agent destroys some bricks negative number if the agent
failed to catch the ball, and zero reward otherwise. This design
incentivizes the agent to destroy as many bricks as possible while
avoiding losing the ball, without explicitly telling it how to achieve a
high score.
The reward function plays a crucial role in the reinforcement
learning process. The goal of the agent is to maximize the accumulated
rewards over time. By optimizing the reward function, we can guide the
agent to learn a policy that will achieve our desired goal. Without the
reward signal, the agent would not know what the goal is and would
not be able to learn effectively.
In summary, the reward signal is a key component of reinforcement
learning that incentivizes the agent to take actions that ultimately lead
to achieving the desired goal. By carefully designing the reward
function, we can guide the agent to learn an optimal policy.

Agent
In reinforcement learning, an agent is an entity that interacts with an
environment by making decisions based on the received state and
reward signal from the environment. The agent’s goal is to maximize its
cumulative reward in the long run. The agent must learn to make the
best decisions by trial and error, which involves exploring different
actions and observing the resulting rewards.
In addition to the external interactions with the environment, the
agent may also has its internal state represents its knowledge about the
world. This internal state can include things like memory of past
experiences and learned strategies.
It’s important to distinguish the agent’s internal state from the
environment state. The environment state represents the current state
of the world that the agent is trying to influence through its actions. The
agent, however, has no direct control over the environment state. It can
only affect the environment state by taking actions and observing the
resulting changes in the environment. For example, if the agent is
playing a game, the environment state might include the current
positions of game pieces, while the agent’s internal state might include
the memory of past moves and the strategies it has learned.
In this book, we will typically use the term “state” to refer to the
environment state. However, it’s important to keep in mind the
distinction between the agent’s internal state and the environment
state. By understanding the role of the agent and its interactions with
the environment, we can better understand the principles behind
reinforcement learning algorithms. It is worth noting that the terms
“agent” and “algorithm” are frequently used interchangeably in this
book, particularly in later chapters.

Action
In reinforcement learning, the agent interacts with an environment by
selecting actions that affect the state of the environment. Actions are
chosen from a predefined set of possibilities, which are specific to each
problem. For example, in the game of Breakout, the agent can choose to
move the paddle to the left or right or take no action. It cannot perform
actions like jumping or rolling over. In contrast, in the game of Pong, the
agent can choose to move the paddle up or down but not left or right.
The chosen action affects the future state of the environment. The
agent’s current action may have long-term consequences, meaning that
it will affect the environment’s states and rewards for many future time
steps, not just the next immediate stage of the process.
Actions can be either discrete or continuous. In problems with
discrete actions, the set of possible actions is finite and well defined.
Examples of such problems include Atari and Go board games. In
contrast, problems with continuous actions have an infinite set of
possible actions, often within a continuous range of values. An example
of a problem with continuous actions is robotic control, where the
degree of angle movement of a robot arm is often a continuous action.
Reinforcement learning problems with discrete actions are
generally easier to solve than those with continuous actions. Therefore,
this book will focus on solving reinforcement learning problems with
discrete actions. However, many of the concepts and techniques
discussed in this book can be applied to problems with continuous
actions as well.

Policy
A policy is a key concept in reinforcement learning that defines the
behavior of an agent. In particular, it maps each possible state in the
environment to the probabilities of chose different actions. By
specifying how the agent should behave, a policy guides the agent to
interact with its environment and maximize its cumulative reward. We
will delve into the details of policies and how they interact with the
MDP framework in Chap. 2.
For example, suppose an agent is navigating a grid-world
environment. A simple policy might dictate that the agent should
always move to the right until it reaches the goal location. Alternatively,
a more sophisticated policy could specify that the agent should choose
its actions based on its current position and the probabilities of moving
to different neighboring states.

Model
In reinforcement learning, a model refers to a mathematical description
of the dynamics function and reward function of the environment. The
dynamics function describes how the environment evolves from one
state to another, while the reward function specifies the reward that the
agent receives for taking certain actions in certain states.
In many cases, the agent does not have access to a perfect model of
the environment. This makes learning a good policy challenging, since
the agent must learn from experience how to interact with the
environment to maximize its reward. However, there are some cases
where a perfect model is available. For example, if the agent is playing a
game with fixed rules and known outcomes, the agent can use this
knowledge to select its actions strategically. We will explore this
scenario in detail in Chap. 2.
In reinforcement learning, the agent-environment boundary can be
ambiguous. Despite a house cleaning robot appearing to be a single
agent, the agent’s direct control typically defines its boundary, while the
remaining components comprise the environment. In this case, the
robot’s wheels and other hardwares are considered to be part of the
environment since they aren’t directly controlled by the agent. We can
think of the robot as a complex system composed of several parts, such
as hardware, software, and the reinforcement learning agent, which can
control the robot’s movement by signaling the software interface, which
then communicates with microchips to manage the wheel movement.

1.4 Examples of Reinforcement Learning


Reinforcement learning is a versatile technique that can be applied to a
variety of real-world problems. While its success in playing games is
well known, there are many other areas where it can be used as an
effective solution. In this section, we explore a few examples of how
reinforcement learning can be applied to real-world problems.

Autonomous Driving
Reinforcement learning can be used to train autonomous vehicles to
navigate complex and unpredictable environments. The goal for the
agent is to safely and efficiently drive the vehicle to a desired location
while adhering to traffic rules and regulations. The reward signal could
be a positive number for successful arrival at the destination within a
specified time frame and a negative number for any accidents or
violations of traffic rules. The environment state could contain
information about the vehicle’s location, velocity, and orientation, as
well as sensory data such as camera feeds and radar readings.
Additionally, the state could include the current traffic conditions and
weather, which would help the agent to make better decisions while
driving.

Navigating Robots in a Factory Floor


One practical application of reinforcement learning is to train robots to
navigate a factory floor. The goal for the agent is to safely and efficiently
transport goods from one point to another without disrupting the work
of human employees or other robots. In this case, the reward signal
could be a positive number for successful delivery within a specified
time frame and a negative number for any accidents or damages
caused. The environment state could contain information about the
robot’s location, the weight and size of the goods being transported, the
location of other robots, and sensory data such as camera feeds and
battery level. Additionally, the state could include information about the
production schedule, which would help the agent to prioritize its tasks.

Automating Web Advertising


Another application of reinforcement learning is to automate web
advertising. The goal for the agent is to select the most effective type of
ad to display to a user, based on their browsing history and profile. The
reward signal could be a positive number for when the user clicks on
the ad, and zero otherwise. The environment state could contain
information such as the user’s search history, demographics, and
current trends on the Internet. Additionally, the state could include
information about the context of the web page, which would help the
agent to choose the most relevant ad.
Video Compression
Reinforcement learning can also be used to improve video
compression. DeepMind’s MuZero agent has been adapted to optimize
video compression for some YouTube videos. In this case, the goal for
the agent is to compress the video as much as possible without
compromising the quality. The reward signal could be a positive
number for high-quality compression and a negative number for low-
quality compression. The environment state could contain information
such as the video’s resolution, bit rate, frame rate, and the complexity of
the scenes. Additionally, the state could include information about the
viewing device, which would help the agent to optimize the
compression for the specific device.
Overall, reinforcement learning has enormous potential for solving
real-world problems in various industries. The key to successful
implementation is to carefully design the reward signal and the
environment state to reflect the specific goals and constraints of the
problem. Additionally, it is important to continually monitor and
evaluate the performance of the agent to ensure that it is making the
best decisions.

1.5 Common Terms in Reinforcement Learning


Episodic vs. Continuing Tasks
In reinforcement learning, the type of problem or task is categorized as
episodic or continuing depending on whether it has a natural ending.
Natural ending refers to a point in a task or problem where it is
reasonable to consider the task or problem is completed.
Episodic problems have a natural termination point or terminal
state, at which point the task is over, and a new episode starts. A new
episode is independent of previous episodes. Examples of episodic
problems include playing an Atari video game, where the game is over
when the agent loses all lives or won the game, and a new episode
always starts when we reset the environment, regardless of whether
the agent won or lost the previous game. Other examples of episodic
problems include Tic-Tac-Toe, chess, or Go games, where each game is
independent of the previous game.
On the other hand, continuing problems do not have a natural
endpoint, and the process could go on indefinitely. Examples of
continuing problems include personalized advertising or
recommendation systems, where the agent’s goal is to maximize a
user’s satisfaction or click-through rate over an indefinite period.
Another example of a continuing problem is automated stock trading,
where the agent wants to maximize their profits in the stock market by
buying and selling stocks. In this scenario, the agent’s actions, such as
the stocks they buy and the timing of their trades, can influence the
future prices and thus affect their future profits. The agent’s goal is to
maximize their long-term profits by making trades continuously, and
the stock prices will continue to fluctuate in the future. Thus, the
agent’s past trades will affect their future decisions, and there is no
natural termination point for the problem.
It is possible to design some continuing reinforcement learning
problems as episodic by using a time-constrained approach. For
example, the episode could be over when the market is closed.
However, in this book, we only consider natural episodic problems that
is, the problems with natural termination.
Understanding the differences between episodic and continuing
problems is crucial for designing effective reinforcement learning
algorithms for various applications. For example, episodic problems
may require a different algorithmic approach than continuing problems
due to the differences in their termination conditions. Furthermore, in
real-world scenarios, distinguishing between episodic and continuing
problems can help identify the most appropriate reinforcement
learning approach to use for a particular task or problem.

Deterministic vs. Stochastic Tasks


In reinforcement learning, it is important to distinguish between
deterministic and stochastic problems. A problem is deterministic if the
outcome is always the same when the agent takes the same action in
the same environment state. For example, Atari video game or a game
of Go is a deterministic problem. In these games, the rules of the game
are fixed; when the agent repeatedly takes the same action under the
same environment condition (state), the outcome (reward signal and
next state) is always the same.
The reason that these games are considered deterministic is that
the environment’s dynamics and reward functions do not change over
time. The rules of the game are fixed, and the environment always
behaves in the same way for a given set of actions and states. This
allows the agent to learn a policy that maximizes the expected reward
by simply observing the outcomes of its actions.
On the other hand, a problem is stochastic if the outcome is not
always the same when the agent takes the same action in the same
environment state. One example of a stochastic environment is playing
poker. The outcome of a particular hand is not entirely determined by
the actions of the player. Other players at the table can also take actions
that influence the outcome of the hand. Additionally, the cards dealt to
each player are determined by a shuffled deck, which introduces an
element of chance into the game.
For example, let’s say a player is dealt a pair of aces in a game of
Texas hold’em. The player might decide to raise the bet, hoping to win a
large pot. However, the other players at the table also have their own
sets of cards and can make their own decisions based on the cards they
hold and the actions of the other players.
If another player has a pair of kings, they might also decide to raise
the bet, hoping to win the pot. If a third player has a pair of twos, they
might decide to fold, as their hand is unlikely to win. The outcome of
the hand depends not only on the actions of the player with the pair of
aces but also on the actions of the other players at the table, as well as
the cards dealt to them.
This uncertainty and complexity make poker a stochastic problem.
While it is possible to use various strategies to improve one’s chances
of winning in poker, the outcome of any given hand is never certain, and
a skilled player must be able to adjust their strategy based on the
actions of the other players and the cards dealt.
Another example of stochastic environment is the stock market. The
stock market is a stochastic environment because the outcome of an
investment is not always the same when the same action is taken in the
same environment state. There are many factors that can influence the
price of a stock, such as company performance, economic conditions,
geopolitical events, and investor sentiment. These factors are
constantly changing and can be difficult to predict, making it impossible
to know with certainty what the outcome of an investment will be.
For example, let’s say you decide to invest in a particular stock
because you believe that the company is undervalued and has strong
growth prospects. You buy 100 shares at a price of $145.0 per share.
However, the next day, the company announces that it has lost a major
customer and its revenue projections for the next quarter are lower
than expected. The stock price drops to $135.0 per share, and you have
lost $1000 on your investment. It’s most likely the stochastic nature of
the environment led to the loss outcome other than the action (buying
100 shares).
While it is possible to use statistical analysis and other tools to try
to predict stock price movements, there is always a level of uncertainty
and risk involved in investing in the stock market. This uncertainty and
risk are what make the stock market a stochastic environment, and why
it is important to use appropriate risk management techniques when
making investment decisions.
In this book, we focus on deterministic reinforcement learning
problems. By understanding the fundamentals of deterministic
reinforcement learning, readers will be well equipped to tackle more
complex and challenging problems in the future.

Model-Free vs. Model-Based Reinforcement


Learning
In reinforcement learning, an environment is a system in which an
agent interacts with in order to achieve a goal. A model is a
mathematical representation of the environment’s dynamics and
reward functions. Model-free reinforcement learning means the agent
does not use the model of the environment to help it make decisions.
This may occur because either the agent lacks access to the accurate
model of the environment or the model is too complex to use during
decision-making.
In model-free reinforcement learning, the agent learns to take
actions based on its experiences of the environment without explicitly
simulating future outcomes. Examples of model-free reinforcement
learning methods include Q-learning, SARSA (State-Action-Reward-
State-Action), and deep reinforcement learning algorithms such as
DQN, which we’ll introduce later in the book.
On the other hand, in model-based reinforcement learning, the
agent uses a model of the environment to simulate future outcomes and
plan its actions accordingly. This may involve constructing a complete
model of the environment or using a simplified model that captures
only the most essential aspects of the environment’s dynamics. Model-
based reinforcement learning can be more sample-efficient than model-
free methods in certain scenarios, especially when the environment is
relatively simple and the model is accurate. Examples of model-based
reinforcement learning methods include dynamic programming
algorithms, such as value iteration and policy iteration, and
probabilistic planning methods, such as Monte Carlo Tree Search in
AlphaZero agent, which we’ll introduce later in the book.
In summary, model-free and model-based reinforcement learning
are two different approaches to solving the same problem of
maximizing rewards in an environment. The choice between these
approaches depends on the properties of the environment, the
available data, and the computational resources.

1.6 Why Study Reinforcement Learning


Machine learning is a vast and rapidly evolving field, with many
different approaches and techniques. As such, it can be challenging for
practitioners to know which type of machine learning to use for a given
problem. By discussing the strengths and limitations of different
branches of machine learning, we can better understand which
approach might be best suited to a particular task. This can help us
make more informed decisions when developing machine learning
solutions and ultimately lead to more effective and efficient systems.
There are three branches of machine learning. One of the most
popular and widely adopted in the real world is supervised learning,
which is used in domains like image recognition, speech recognition,
and text classification. The idea of supervised learning is very simple:
given a set of training data and the corresponding labels, the objective
is for the system to generalize and predict the label for data that’s not
present in the training dataset. These training labels are typically
provided by some supervisors (e.g., humans). Hence, we’ve got the
name supervised learning.
Another branch of machine learning is unsupervised learning. In
unsupervised learning, the objective is to discover the hidden
structures or features of the training data without being provided with
any labels. This can be useful in domains such as image clustering,
where we want the system to group similar images together without
knowing ahead of time which images belong to which group. Another
application of unsupervised learning is in dimensionality reduction,
where we want to represent high-dimensional data in a lower-
dimensional space while preserving as much information as possible.
Reinforcement learning is a type of machine learning in which an
agent learns to take actions in an environment in order to maximize a
reward signal. It’s particularly useful in domains where there is no clear
notion of “correct” output, such as in robotics or game playing.
Reinforcement learning has potential applications in areas like robotics,
healthcare, and finance.
Supervised learning has already been widely used in computer
vision and natural language processing. For example, the ImageNet
classification challenge is an annual computer vision competition
where deep convolutional neural networks (CNNs) dominate. The
challenge provides a training dataset with labels for 1.2 million images
across 1000 categories, and the goal is to predict the labels for a
separate evaluation dataset of about 100,000 images. In 2012,
Krizhevsky et al. [8] developed AlexNet, the first deep CNN system used
in this challenge. AlexNet achieved an 18% improvement in accuracy
compared to previous state-of-the-art methods, which marked a major
breakthrough in computer vision.
Since the advent of AlexNet, almost all leading solutions to the
ImageNet challenge have been based on deep CNNs. Another
breakthrough came in 2015 when researchers He et al. [9] from
Microsoft developed ResNet, a new architecture designed to improve
the training of very deep CNNs with hundreds of layers. Training deep
CNNs is challenging due to vanishing gradients, which makes it difficult
to propagate the gradients backward through the network during
backpropagation. ResNet addressed this challenge by introducing skip
connections, which allowed the network to bypass one or more layers
during forward propagation, thereby reducing the depth of the network
that the gradients have to propagate through.
While supervised learning is capable of discovering hidden patterns
and features from data, it is limited in that it merely mimics what it is
told to do during training and cannot interact with the world and learn
from its own experience. One limitation of supervised learning is the
need to label every possible stage of the process. For example, if we
want to use supervised learning to train an agent to play Go, then we
would need to collect the labels for every possible board position,
which is impossible due to the enormous number of possible
combinations. Similarly, in Atari video games, a single pixel change
would require relabeling, making supervised learning inapplicable in
these cases. However, supervised learning has been successful in many
other applications, such as language translation and image
classification.
Unsupervised learning tries to discover hidden patterns or features
without labels, but its objective is completely different from that of RL,
which is to maximize accumulated reward signals. Humans and animals
learn by interacting with their environment, and this is where
reinforcement learning (RL) comes in. In RL, the agent is not told which
action is good or bad, but rather it must discover that for itself through
trial and error. This trial-and-error search process is unique to RL.
However, there are other challenges that are unique to RL, such as
dealing with delayed consequences and balancing exploration and
exploitation.
While RL is a distinct branch of machine learning, it shares some
commonalities with other branches, such as supervised and
unsupervised learning. For example, improvements in supervised
learning and deep convolutional neural networks (CNNs) have been
adapted to DeepMind’s DQN, AlphaGo, and other RL agents. Similarly,
unsupervised learning can be used to pretrain the weights of RL agents
to improve their performance. Furthermore, many of the mathematical
concepts used in RL, such as optimization and how to train a neural
network, are shared with other branches of machine learning.
Therefore, while RL has unique challenges and applications, it also
benefits from and contributes to the development of other branches of
machine learning.
1.7 The Challenges in Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an
agent learns to interact with an environment to maximize some notion
of cumulative reward. While RL has shown great promise in a variety of
applications, it also comes with several common challenges, as
discussed in the following sections:

Exploration vs. Exploitation Dilemma


The exploration-exploitation dilemma refers to the fundamental
challenge in reinforcement learning of balancing the need to explore
the environment to learn more about it with the need to exploit
previous knowledge to maximize cumulative reward. The agent must
continually search for new actions that may yield greater rewards while
also taking advantage of actions that have already proven successful.
In the initial exploration phase, the agent is uncertain about the
environment and must try out a variety of actions to gather information
about how the environment responds. This is similar to how humans
might learn to play a new video game by trying different button
combinations to see what happens. However, as the agent learns more
about the environment, it becomes increasingly important to focus on
exploiting the knowledge it has gained in order to maximize cumulative
reward. This is similar to how a human might learn to play a game more
effectively by focusing on the actions that have already produced high
scores.
There are many strategies for addressing the exploration-
exploitation trade-off, ranging from simple heuristics to more
sophisticated algorithms. One common approach is to use an -greedy
policy, in which the agent selects the action with the highest estimated
value with probability and selects a random action with
probability in order to encourage further exploration. Another
approach is to use a Thompson sampling algorithm, which balances
exploration and exploitation by selecting actions based on a
probabilistic estimate of their expected value.
It is important to note that the exploration-exploitation trade-off is
not a simple problem to solve, and the optimal balance between
exploration and exploitation will depend on many factors, including the
complexity of the environment and the agent’s prior knowledge. As a
result, there is ongoing research in the field of reinforcement learning
aimed at developing more effective strategies for addressing this
challenge.

Credit Assignment Problem


In reinforcement learning (RL), the credit assignment problem refers to
the challenge of determining which actions an agent took that led to a
particular reward. This is a fundamental problem in RL because the
agent must learn from its own experiences in order to improve its
performance.
To illustrate this challenge, let’s consider the game of Tic-Tac-Toe,
where two players take turns placing Xs and Os on a 3 3 grid until
one player gets three in a row. Suppose the agent is trying to learn to
play Tic-Tac-Toe using RL, and the reward is +1 for a win, for a loss,
and 0 for a draw. The agent’s goal is to learn a policy that maximizes its
cumulative reward.
Now, suppose the agent wins a game of Tic-Tac-Toe. How can the
agent assign credit to the actions that led to the win? This can be a
difficult problem to solve, especially if the agent is playing against
another RL agent that is also learning and adapting its strategies.
To tackle the credit assignment problem in RL, there are various
techniques that can be used, such as Monte Carlo methods or temporal
difference learning. These methods use statistical analysis to estimate
the value of each action taken by the agent, based on the rewards
received and the states visited. By using these methods, the agent can
gradually learn to assign credit to the actions that contribute to its
success and adjust its policy accordingly.
In summary, credit assignment is a key challenge in reinforcement
learning, and it is essential to develop effective techniques for solving
this problem in order to achieve optimal performance.

Reward Engineering Problem


The reward engineering problem refers to the process of designing a
good reward function that encourages the desired behavior in a
reinforcement learning (RL) agent. The reward function determines
what the agent is trying to optimize, so it is crucial to make sure it
reflects the desired goal we want the the agent to achieve.
An example of good reward engineering is in the game of Atari
Breakout, where the goal of the agent is to clear all the bricks at the top
of the screen by bouncing a ball off a paddle. One way to design a
reward function for this game is to give the agent a positive reward for
each brick it clears and a negative reward for each time the ball passes
the paddle and goes out of bounds. However, this reward function alone
may not lead to optimal behavior, as the agent may learn to exploit a
loophole by simply bouncing the ball back and forth on the same side of
the screen without actually clearing any bricks.
To address this challenge, the reward function can be designed to
encourage more desirable behavior. For example, the reward function
can be modified to give the agent a larger positive reward for clearing
multiple bricks in a row or for clearing the bricks on the edges of the
screen first. This can encourage the agent to take more strategic shots
and aim for areas of the screen that will clear more bricks at once.
An example of bad reward engineering is the CoastRunners video
game in Atari; it’s a very simple boat racing game. The goal of the game
is to finish the boat race as quickly as possible. But there’s one small
issue with the game, the player can earn higher scores by hitting some
targets laid out along the route. There’s a video that shows that a
reinforcement learning agent plays the game by repeatedly hitting the
targets instead of finishing the race.2 This example should not be
viewed as the failure of the reinforcement learning agent, but rather
humans failed to design and use the correct reward function.
Overall, reward engineering is a crucial part of designing an
effective RL agent. A well-designed reward function can encourage the
desired behavior and lead to optimal performance, while a poorly
designed reward function can lead to suboptimal behavior and may
even encourage undesired behavior.

Generalization Problem
In reinforcement learning (RL), the generalization problem refers to the
ability of an agent to apply what it has learned to new and previously
unseen situations. To understand this concept, consider the example of
a self-driving car. Suppose the agent is trying to learn to navigate a
particular intersection, with a traffic light and crosswalk. The agent
receives rewards for reaching its destination quickly and safely, but it
must also follow traffic laws and avoid collisions with other vehicles
and pedestrians.
During training, the agent is exposed to a variety of situations at the
intersection, such as different traffic patterns and weather conditions.
It learns to associate certain actions with higher rewards, such as
slowing down at the yellow light and stopping at the red light. Over
time, the agent becomes more adept at navigating the intersection and
earns higher cumulative rewards.
However, when the agent is faced with a new intersection, with
different traffic patterns and weather conditions, it may struggle to
apply what it has learned. This is where generalization comes in. If the
agent has successfully generalized its knowledge, it will be able to
navigate the new intersection based on its past experiences, even
though it has not seen this exact intersection before. For example, it
may slow down at a yellow light, even if the timing is slightly different
than what it has seen before, or it may recognize a pedestrian crossing
and come to a stop, even if the appearance of the crosswalk is slightly
different.
If the agent has not successfully generalized its knowledge, it may
struggle to navigate the new intersection and may make mistakes that
lead to lower cumulative rewards. For example, it may miss a red light
or fail to recognize a pedestrian crossing, because it has only learned to
recognize these situations in a particular context.
Therefore, generalization is a crucial aspect of RL, as it allows the
agent to apply its past experiences to new and previously unseen
situations, which can improve its overall performance and make it more
robust to changes in the environment.

Sample Efficiency Problem


The sample efficiency problem in reinforcement learning refers to the
ability of an RL agent to learn an optimal policy with a limited number
of interactions with the environment. This can be challenging,
especially in complex environments where the agent may need to
explore a large state space or take a large number of actions to learn
the optimal policy.
To better understand sample efficiency, let’s consider an example of
an RL agent playing a game of Super Mario Bros. In this game, the agent
must navigate Mario through a series of levels while avoiding enemies
and obstacles, collecting coins, and reaching the flag at the end of each
level.
To learn how to play Super Mario Bros., the agent must interact with
the environment, taking actions such as moving left or right, jumping,
and shooting fireballs. Each action leads to a new state of the
environment, and the agent receives a reward based on its actions and
the resulting state.
For example, the agent may receive a reward for collecting a coin or
reaching the flag and a penalty for colliding with an enemy or falling
into a pit. By learning from these rewards, the agent can update its
policy to choose actions that lead to higher cumulative rewards over
time.
However, learning the optimal policy in Super Mario Bros. can be
challenging due to the large state space and the high dimensionality of
the input data, which includes the position of Mario, the enemies, and
the obstacles on the screen.
To address the challenge of sample efficiency, the agent may use a
variety of techniques to learn from a limited number of interactions
with the environment. For example, the agent may use function
approximation to estimate the value or policy function based on a small
set of training examples. The agent may also use off-policy learning,
which involves learning from data collected by a different policy than
the one being optimized.
Overall, sample efficiency is an important challenge in
reinforcement learning, especially in complex environments.
Techniques such as function approximation, off-policy learning, can
help address this challenge and enable RL agents to learn optimal
policies with a limited number of interactions with the environment.

1.8 Summary
In the first chapter of the book, readers were introduced to the concept
of reinforcement learning (RL) and its applications. The chapter began
by discussing the breakthroughs in AI in games, showcasing the success
of RL in complex games such as Go. The chapter then provided an
overview of the agent-environment interaction that forms the basis of
RL, including key concepts such as environment, agent, reward, state,
action, and policy. Several examples of RL were presented, including
Atari video game playing, board game Go, and robot control tasks.
Additionally, the chapter introduced common terms used in RL,
including episodic vs. continuing tasks, deterministic vs. stochastic
tasks, and model-free vs. model-based reinforcement learning. The
importance of studying RL was then discussed, including its potential to
solve complex problems and its relevance to real-world applications.
The challenges faced in RL, such as the exploration-exploitation
dilemma, the credit assignment problem, and the generalization
problem, were also explored.
The next chapter of the book will focus on Markov decision
processes (MDPs), which is a formal framework used to model RL
problems.

References
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,
Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg
Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen
King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis.
Human-level control through deep reinforcement learning. Nature,
518(7540):529–533, Feb 2015.
[Crossref]
[2]
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning
environment: An evaluation platform for general agents. Journal of Artificial
Intelligence Research, 47:253–279, Jun 2013.
[Crossref]
[3]
John Tromp and Gunnar Farnebäck. Combinatorics of go. In H. Jaap van den
Herik, Paolo Ciancarini, and H. H. L. M. (Jeroen) Donkers, editors, Computers and
Games, pages 84–99, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
[4]
CWI. 66th NHK Cup. https://​homepages.​c wi.​nl/​~aeb/​go/​games/​games/​N HK/​
66/​index.​html, 2018.
[5]
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van
den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner,
Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore
Graepel, and Demis Hassabis. Mastering the game of go with deep neural
networks and tree search. Nature, 529(7587):484–489, Jan 2016.
[Crossref]
[6]
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja
Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton,
Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche,
Thore Graepel, and Demis Hassabis. Mastering the game of go without human
knowledge. Nature, 550(7676):354–359, Oct 2017.
[Crossref]
[7]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew
Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel,
Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and
shogi by self-play with a general reinforcement learning algorithm, 2017.
[8]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and
K.Q. Weinberger, editors, Advances in Neural Information Processing Systems,
volume 25. Curran Associates, Inc., 2012.
[9]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition, 2015.

Footnotes
1 Dog Thinks Through A Problem: www.​youtube.​c om/​watch?​v =​m_​C rIu01SnM.

2 Reinforcement learning agent playing the CoastRunners game: www.​youtube.​c om/​


watch?​v =​tlOIHko8ySg.
Another random document with
no related content on Scribd:
that, if the thief does not restore the stolen property, he will be a
dead man within a month.34.1
Similarly in Nias, an island to the west of
Thieves cursed in Sumatra, when a thief cannot be found he is
Nias.
cursed, and to give weight to the curse a dog is
burned alive. While the animal is expiring in torments, the man who
has been robbed expresses his wish that the thief may likewise die
in agony; and they say that thieves who have been often cursed do
die screaming.34.2 Curses are also employed for
Thieves cursed
among the Sea
the same purpose with excellent effect by the Sea
Dyaks of Borneo. Dyaks of Borneo. On this point a missionary bears
the following testimony. “With an experience of
nearly twenty years in Borneo, during which I came into contact with
thousands of the people, I have known of only two instances of theft
among the Dyaks. One was a theft of rice. The woman who lost the
rice most solemnly and publicly cursed the thief, whoever it might be.
The next night the rice was secretly left at her door. The other was a
theft of money. In this case, too, the thief was cursed. The greater
part of the money was afterwards found returned to the box from
which it had been abstracted. Both these incidents show the great
dread the Dyak has of a curse. Even an undeserved curse is
considered a terrible thing, and, according to Dyak law, to curse a
person for no reason at all is a fineable offence.
“A Dyak curse is a terrible thing to listen to. I have only once heard
a Dyak curse, and I am sure I do not want to do so again. I was
travelling in the Saribas district, and at that time many of the Dyaks
there had gone in for coffee-planting; indeed, several of them had
started coffee plantations on a small scale. A woman told me that
some one had over and over again stolen the ripe coffee-berries
from her plantation. Not only were the ripe berries stolen, but the
thief had carelessly picked many of the young berries and thrown
them on the ground, and many of the branches of the plants had
been broken off. In the evening, when I was seated in the public part
of the house with many Dyak men and women round me, we
happened to talk about coffee-planting. The woman was present,
and told us of her experiences, and how her coffee had been stolen
by some thief, who, she thought, must be one of the inmates of the
house. Then she solemnly cursed the thief. She began in a calm
voice, but worked herself up into a frenzy. We all listened horror-
struck, and no one interrupted her. She began by saying what had
happened, and how these thefts had gone on for some time. She
had said nothing before, hoping that the thief would mend his ways;
but the matter had gone on long enough, and she was going to curse
the thief, as nothing, she felt sure, would make him give up his evil
ways. She called on all the spirits of the waters and the hills and the
air to listen to her words and to aid her. She began quietly, but
became more excited as she went on. She said something of this
kind:
“ ‘If the thief be a man, may he be unfortunate in
Curses on a man all he undertakes! May he suffer from a disease
thief.
that does not kill him, but makes him helpless—
always in pain—and a burden to others. May his wife be unfaithful to
him, and his children become as lazy and dishonest as he is himself.
If he go out on the war-path, may he be killed, and his head smoked
over the enemy’s fire. If he be boating, may his boat be swamped
and may he be drowned. If he be out fishing, may an alligator kill him
suddenly, and may his relatives never find his body. If he be cutting
down a tree in the jungle, may the tree fall on him and crush him to
death. May the gods curse his farm so that he may have no crops,
and have nothing to eat, and when he begs for food, may he be
refused, and die of starvation.
“ ‘If the thief be a woman, may she be childless,
Curses on a woman or if she happen to be with child let her be
thief.
disappointed, and let her child be still-born, or,
better still, let her die in childbirth. May her husband be untrue to her,
and despise her and ill-treat her. May her children all desert her if
she live to grow old. May she suffer from such diseases as are
peculiar to women, and may her eyesight grow dim as the years go
on, and may there be no one to help her or lead her about when she
is blind.’
“I have only given the substance of what she said; but I shall never
forget the silence and the awed faces of those who heard her. I left
the house early next morning, so I do not know what was the result
of her curse—whether the thief confessed or not.”36.1
The ancient Greeks seem to have made a very
Thieves cursed in liberal use of curses as a cheap and effective
ancient Greece.
mode of protecting property, which dispenses the
injured party from resorting to the tedious, expensive, and too often
fruitless formalities of the law. These curses they inscribed on tablets
of lead and other materials and deposited either in the place which
was to be protected from depredation or in the temple of the god to
whose tender mercies the criminal was committed. For example, in a
sacred precinct dedicated to Demeter, Persephone, Pluto and other
deities of a stern and inflexible temper at Cnidus, a number of leaden
tablets were found inscribed with curses which consigned the
malefactors of various sorts to the vengeance of the two Infernal
Goddesses, Demeter and her daughter. “May he or she never find
Persephone propitious!” is the constantly repeated burden of these
prayers; and in some of them the sinner is not only excommunicated
in this world but condemned to eternal torments in the world
hereafter. Often the persons who launched these curses were ladies.
One irate dame consigns to perdition the thief who had stolen her
bracelet or the defaulter who had failed to send back her
underclothes.36.2 Another curse, engraved on a marble slab found at
Smyrna, purports that if any man should steal one of the sacred
vessels of a certain goddess or injure her sacred fish, he may die a
painful death, devoured by the fishes.36.3 Sometimes, apparently,
these Greek imprecations were as effective in reclaiming sinners as
Dyak curses are to this day. Thus we read of a curious dedication to
a lunar deity of Asia Minor, by name Men Aziottenos, which declares
how one Artemidorus, having been reviled by a couple of rude
fellows, cursed them in a votive tablet, and how one of the culprits,
having been punished by the god, made a propitiatory offering and
mended his wicked ways.37.1 To prevent people
Landmarks
protected by gods
from encroaching on their neighbours’ land by
and curses. removing the boundary stones, the Greeks
committed landmarks to the special protection of
the great god Zeus;37.2 and Plato dwells with unction on the double
punishment, divine and human, to which the sinner exposed himself
who dared to tamper with these sacred stones.37.3 The Romans
went even further, for they created a god for the sole purpose of
looking after landmarks, and he must have had his hands very full if
he executed all the curses which were levelled not only at every man
who shifted his neighbour’s boundary stone, but even at the oxen
which he employed to plough up his neighbour’s land.37.4 The
Hebrew code of Deuteronomy pronounced a solemn curse on such
as removed their neighbour’s landmarks;37.5 and Babylonian kings
exhausted their imagination in pouring out a flood of imprecations
against the abandoned wretch who thus set at naught the rights of
property in land.37.6 King Nebuchadnezzar in particular, before he
was turned out to grass, appears to have distinguished himself by
the richness and variety of his execrations, if we may judge by a
specimen of them which has survived. A brief extract from this
masterpiece may serve to illustrate the king’s style of minatory
eloquence. Referring to the bold bad man, “be it shepherd or
governor, or agent or regent, levy master or magistrate,” whosoever
he might be, who “for all days to come, for the future of human
habitations,” should dare to tamper with the land which his Majesty
had just marked out, “Ninib, lord of boundaries and boundary-stones,
tear out his boundary stone. Gula, great lady, put lingering illness
into his body, that dark and light red blood he may pour out like
water. Ishtar, lady of countries, whose fury is a flood, reveal
difficulties to him, that he escape not from misfortune. Nusku, mighty
lord, powerful burner, the god, my creator, be his evil demon and
may he burn his root. Whoever removes this stone, in the dust hides
it, burns it with fire, casts it into water, shuts it up in an enclosure,
causes a fool, a deaf man, an idiot to take it, places it in an invisible
place, may the great gods, who upon this stone are mentioned by
their names, curse him with an evil curse, tear out his foundation and
destroy his seed.”38.1
In Africa also superstition is a powerful ally of
Superstition as an the rights of private property. Thus the Balonda
ally of the rights of
private property in place beehives on high trees in the forest and
Africa. protect them against thieves by tying a charm or
“piece of medicine” round the tree-trunks. This
proves a sufficient protection. “The natives,” says Livingstone,
“seldom rob each other, for all believe that certain medicines can
inflict disease and death; and though they consider that these are
only known to a few, they act on the principle that it is best to let
them all alone. The gloom of these forests strengthens the
superstitious feelings of the people. In other quarters, where they are
not subjected to this influence, I have heard the chiefs issue
proclamations to the effect, that real witchcraft medicines had been
placed at certain gardens from which produce had been stolen; the
thieves having risked the power of the ordinary charms previously
placed there.”38.2
The Wanika of East Africa “believe in the power
The Wanika of East and efficacy of charms and amulets, and they
Africa.
wear them in great variety; legs, arms, neck, waist,
hair, and every part of the body are laden with them, either for the
cure or prevention of disease; for the expulsion or repulsion of evil
spirits; and to keep at bay snakes, wild animals, and every other evil.
They hang painted calabashes from the baobab at their hut doors to
keep away thieves; shells, dolls, eggs scratched over with Arabic
characters by the Wana Chuoni (sons of the book) of the coast, are
placed about their plantations and in their fruit-trees, and they
believe that death would overtake a thief who should disregard them.
A charm bound to the leg of a fowl is ample protection for the village.
There is no doubt that, superstitious as the people are, they dread
running great risks for the sake of small gains, and so these charms
answer their purpose.”39.1 Among the Boloki of the
The Boloki of the
Congo.
Upper Congo, when a woman finds that the
cassava roots, which she keeps soaking in a
water-hole, are being stolen, she takes a piece of gum copal, and
fixing it in the cleft of a split stick she puts it on the side of the hole,
while at the same time she calls down a curse on the thief. If the thief
is a man, he will henceforth have no luck in fishing; if she is a
woman, she will have no more success in farming.39.2 The Ekoi of
Southern Nigeria protect their farms against thieves by bundles of
palm leaves to which they give the name of okpata. Should any one
steal from a farm thus protected, he will fall sick and will not recover
unless he gives a certain dance, to which the name of okpata is also
applied.39.3
In the mountains of Marrah, a district of Darfur,
Guardian spirits houses, goods, and cattle are protected against
(damzogs) of
property in Darfur. thieves by certain fierce and dangerous guardian-
spirits called damzogs, which can be bought like
watch dogs. Under the guardianship of such a spiritual protector the
sheep and cows are left free to wander at will; for if any one were
rash enough to attempt to steal or kill one of the beasts, his hand
with the knife in it would remain sticking fast to the animal’s throat till
the owner came and caught the rascal. An Arab merchant, travelling
in Darfur, received from a friend the following account of the way to
procure one of these useful guardians. “At the time when I first
began to trade, my friend, I often heard that damzogs could be
bought and sold, and that to procure one I must apply to the owner
of a damzog, and discuss the price with him. When the bargain is
concluded, it is necessary to give a large gourd of milk to the seller,
who takes it to his house, where are his damzogs. On entering he
salutes them, and goes and hangs up his vase to a hook, saying,
—‘One of my friends—such a one—very rich, is in fear of robbers,
and asks me to supply him with a guardian. Will one of you go and
live in his house? There is plenty of milk there, for it is a house of
blessing, and the proof thereof is, that I bring you this kara of milk.’
The damzogs at first refuse to comply with the invitation. ‘No, no,’
say they, ‘not one of us will go.’ The master of the hut conjures them
to comply with his desires, saying, ‘Oh! let the one that is willing
descend into the kara.’ He then retires a little, and presently one of
the damzogs is heard to flop into the milk, upon which he hastens
and claps upon the vase a cover made of date-leaves. Thus stopped
up he unhooks the kara, and hands it over to the buyer, who takes it
away and hangs it on the wall of his hut, and confides it to the care
of a slave or of a wife, who every morning comes and takes it,
emptying out the milk, washing it and replenishing it, and hanging it
up again. From that time forward the house is safe from theft or
loss.” The merchant’s informant, the Shereef Ahmed Bedawee, had
himself purchased one of these guardian spirits, who proved most
vigilant and efficient in the discharge of his duties; indeed his zeal
was excessive, for he not only killed several slaves who tried to rob
his master, but did summary execution on the Shereef’s own son,
when the undutiful young man essayed to pilfer from his father’s
shop. This was too much for the Shereef; he invited a party of friends
to assist him in expelling the inflexible guardian. They came armed
with guns and a supply of ammunition, and by raking the shop with
repeated volleys of musketry they at last succeeded in putting the
spirit to flight.40.1
Amongst the Nandi of British East Africa nobody
The curses of dares to steal anything from a smith; for if he did,
smiths and potters.
the smith would heat his furnace, and as he blew
the bellows to make the flames roar he would curse the thief so that
he would die. And in like manner among these people, with whom
the potters are women, nobody dares to filch anything from a potter;
for next time she heated her wares the potter would curse him,
saying, “Burst like a pot, and may thy house become red,” and the
thief so cursed would die.41.1 In Loango, when a
Charms to protect
property in West
man is about to absent himself from home for a
Africa. considerable time he protects his hut by placing a
charm or fetish before it, consisting perhaps of a
branch with some bits of broken pots or trash of that sort; and we are
told that even the most determined robber would not dare to cross a
threshold defended by these mysterious signs.41.2 On the coast of
Guinea fetishes are sometimes inaugurated for the purpose of
detecting and punishing certain kinds of theft; and not only the culprit
himself, but any person who knows of his crime and fails to give
information is liable to be punished by the fetish. When such a fetish
is instituted, the whole community is warned of it, so that he who
transgresses thereafter does so at his peril. For example, a fetish
was set up to prevent sheep-stealing and the people received
warning in the usual way. Shortly afterwards a slave, who had not
heard of the law, stole a sheep and offered to divide it with a friend.
The friend had often before shared with him in similar enterprises,
but the fear of the fetish was now too strong for him; he informed on
the thief, who was brought to justice and died soon after of a
lingering and painful disease. Nobody in the country ever doubted
but that the fetish had killed him.41.3 Among the Ewe-speaking tribes
of the Slave Coast in West Africa houses and household property
are guarded by amulets (võ-sesao), which derive their virtue from
being consecrated or belonging to the gods. The crops, also, in
solitary glades of the forest are left under the protection of such
amulets, generally fastened to long sticks in some conspicuous
position; and so guarded they are quite safe from pillage. By the side
of the paths, too, may be seen food and palm-wine lying exposed for
sale with nothing but a charm to protect them; a few cowries placed
on each article indicate its price. Yet no native would dare to take the
food or the wine without depositing its price; for he dreads the
unknown evil which the god who owns the charm would bring upon
him for thieving.42.1 In Sierra Leone charms, called greegrees, are
often placed in plantations to deter people from stealing, and it is
said that “a few old rags placed upon an orange tree will generally,
though not always, secure the fruit as effectually as if guarded by the
dragons of the Hesperides. When any person falls sick, if, at the
distance of several months, he recollects having stolen fruit, etc., or
having taken it softly as they term it, he immediately supposes
wangka has caught him, and to get cured he must go or send to the
person whose property he had taken, and make to him whatever
recompense he demands.”42.2
Superstitions of the same sort have been
Charms to protect transported by the negroes to the West Indies,
property in the West
Indies. where the name for magic is obi and the magician
is called the obeah man. There also, we are told,
the stoutest-hearted negroes “tremble at the very sight of the ragged
bundle, the bottle or the egg-shells, which are stuck in the thatch or
hung over the door of a hut, or upon the branch of a plantain tree, to
deter marauders.… When a negro is robbed of a fowl or a hog, he
applies directly to the Obeah-man or woman; it is then made known
among his fellow blacks, that obi is set for the thief; and as soon as
the latter hears the dreadful news, his terrified imagination begins to
work, no resource is left but in the superior skill of some more
eminent Obeah-man of the neighbourhood, who may counteract the
magical operations of the other; but if no one can be found of higher
rank and ability; or if, after gaining such an ally, he should still fancy
himself affected, he presently falls into a decline, under the incessant
horror of impending calamities. The slightest painful sensation in the
head, the bowels, or any other part, any casual loss or hurt, confirms
his apprehensions, and he believes himself the devoted victim of an
invisible and irresistible agency. Sleep, appetite and cheerfulness
forsake him; his strength decays, his disturbed imagination is
haunted without respite, his features wear the settled gloom of
despondency: dirt, or any other unwholesome substance, becomes
his only food, he contracts a morbid habit of body, and gradually
sinks into the grave.”43.1 Superstition has killed him.
Similar evidence might doubtless be multiplied,
Conclusion. but the foregoing cases suffice to shew that
among many peoples and in many parts of the
world superstitious fear has operated as a powerful motive to deter
men from stealing. If that is so, then my second proposition may be
regarded as proved, namely, that among certain races and at certain
times superstition has strengthened the respect for private property
and has thereby contributed to the security of its enjoyment.
IV.
MARRIAGE

I pass now to my third proposition, which is, that


Superstition as a among certain races and at certain times
prop of sexual
morality. superstition has strengthened the respect for
marriage, and has thereby contributed to a stricter
observance of the rules of sexual morality both among the married
and the unmarried. That this is true will appear, I think, from the
following instances.
Among the Karens of Burma “adultery, or
Adultery or fornication, is supposed to have a powerful
fornication
supposed by the influence to injure the crops. Hence, if there have
Karens to blight the been bad crops in a village for a year or two, and
crops. the rains fail, the cause is attributed to secret sins
of this character, and they say the God of heaven
and earth is angry with them on this account; and all the villagers
unite in making an offering to appease him.” And
Pig’s blood used to when a case of adultery or fornication has come to
expiate the crime.
light, “the elders decide that the transgressors
must buy a hog, and kill it. Then the woman takes one foot of the
hog, and the man takes another, and they scrape out furrows in the
ground with each foot, which they fill with the blood of the hog. They
next scratch the ground with their hands and pray: ‘God of heaven
and earth, God of the mountains and hills, I have destroyed the
productiveness of the country. Do not be angry with me, do not hate
me; but have mercy on me, and compassionate me. Now I repair the
mountains, now I heal the hills, and the streams and the lands. May
there be no failure of crops, may there be no unsuccessful labours,
or unfortunate efforts in my country. Let them be dissipated to the
foot of the horizon. Make thy paddy fruitful, thy rice abundant. Make
the vegetables to flourish. If we cultivate but little, still grant that we
may obtain a little.’ After each has prayed thus, they return to the
house and say they have repaired the earth.”45.1 Thus, according to
the Karens adultery and fornication are not simply moral offences
which concern no one but the culprits and their families: they
physically affect the course of nature by blighting the earth and
destroying its fertility; hence they are public crimes which threaten
the very existence of the whole community by cutting off its food
supplies at the root. But the physical injury which these offences do
to the soil can be physically repaired by saturating it with pig’s blood.
Some of the tribes of Assam similarly trace a
Disastrous effects connexion between the crops and the behaviour of
ascribed to sexual
crime in Assam, the human sexes; for they believe that so long as
Bengal, and the crops remain ungarnered, the slightest
Annam.
incontinence would ruin all.45.2 Again, the
inhabitants of the hills near Rajamahal in Bengal imagine that
adultery, undetected and unexpiated, causes the inhabitants of the
village to be visited by a plague or destroyed by tigers or other
ravenous beasts. To prevent these evils an adulteress generally
makes a clean breast. Her paramour has then to furnish a hog, and
he and she are sprinkled with its blood, which is supposed to wash
away their sin and avert the divine wrath. When a village suffers from
plague or the ravages of wild beasts, the people religiously believe
that the calamity is a punishment for secret immorality, and they
resort to a curious form of divination to discover the culprits, in order
that the crime may be duly expiated.45.3 The Khasis of Assam are
divided into a number of clans which are exogamous, that is to say,
no man may marry a woman of his own clan. Should a man be found
to cohabit with a woman of his own clan, it is treated as incest and is
believed to cause great disasters; the people will be struck by
lightning or killed by tigers, the women will die in child-bed, and so
forth. The guilty couple are taken by their clansmen to a priest and
obliged to sacrifice a pig and a goat; after that they are made
outcasts, for their offence is inexpiable.46.1 The Orang Glai, a savage
tribe in the mountains of Annam, similarly suppose that illicit love is
punished by tigers, which devour the sinners. If a girl is found with
child, her family offers a feast of pigs, fowls, and wine to appease the
offended spirits.46.2
The Battas of Sumatra in like manner think that if an unmarried
woman is with child, she must be given in marriage at once, even to
a man of lower rank; for otherwise the people will
Similar views held be infested with tigers, and the crops in the fields
by the Battas of
Sumatra. will not be abundant. They also believe that the
adultery of married women causes a plague of
tigers, crocodiles, or other wild beasts. The crime of incest, in their
opinion, would blast the whole harvest, if the wrong were not
speedily repaired. Epidemics and other calamities that affect the
whole people are almost always traced by them to incest, by which is
to be understood any marriage that conflicts with their customs.46.3
The natives of Nias, an island to the west of Sumatra, imagine that
heavy rains are caused by the tears of a god weeping at the
commission of adultery or fornication. The punishment for these
crimes is death. The two delinquents, man and woman, are buried in
a narrow grave with only their heads projecting above ground; then
their throats are stabbed with a spear or cut with a knife, and the
grave is filled up. Sometimes, it is said, they are buried alive.
However, the judges are not always incorruptible and the injured
family not always inaccessible to the allurement of gain; and
pecuniary compensation is sometimes accepted as a sufficient salve
for wounded honour. But if the wronged man is a chief, the culprits
must surely die. As a consequence, perhaps, of this severity, the
crimes of adultery and fornication are said to be far less frequent in
Nias than in Europe.47.1
Similar views prevail among many tribes in
Similar views Borneo. Thus in regard to the Sea Dyaks we are
among the tribes of
Borneo. told by Archdeacon Perham that “immorality
among the unmarried is supposed to bring a
plague of rain upon the earth, as a punishment inflicted by Petara. It
must be atoned for with sacrifice and fine. In a
Excessive rains function which is sometimes held to procure fine
thought by the
Dyaks to be caused weather, the excessive rain is represented as the
by sexual offences. result of the immorality of two young people.
Petara is invoked, the offenders are banished from
their home, and the bad weather is said to cease. Every district
traversed by an adulterer is believed to be accursed of the gods until
the proper sacrifice has been offered.”47.2 When rain pours down day
after day and the crops are rotting in the fields, these Dyaks come to
the conclusion that some people have been secretly indulging in
lusts of the flesh; so the elders lay their heads together and
adjudicate on all cases of incest and bigamy, and purify the earth
with the blood of pigs, which appears to these
Blood of pigs shed savages, as sheep’s blood appeared to the
to expiate incest
and unchastity. ancient Hebrews, to possess the valuable property
of atoning for moral guilt. Not long ago the
offenders, whose lewdness had thus brought the whole country into
danger, would have been punished with death or at least slavery. A
Dyak may not marry his first cousin unless he first performs a special
ceremony called bergaput to avert evil consequences from the land.
The couple repair to the water-side, fill a small pitcher with their
personal ornaments, and sink it in the river; or instead of a jar they
fling a chopper and a plate into the water. A pig is then sacrificed on
the bank, and its carcase, drained of blood, is thrown in after the jar.
Next the pair are pushed into the water by their friends and ordered
to bathe together. Lastly, a joint of bamboo is filled with pig’s blood,
and the couple perambulate the country and the villages round
about, sprinkling the blood on the ground. After that they are free to
marry. This is done, we are told, for the sake of the whole country, in
order that the rice may not be blasted by the marriage of cousins.48.1
Again, we are informed that the Sibuyaus, a Dyak tribe of Sarawak,
are very careful of the honour of their daughters, because they
imagine that if an unmarried girl is found to be with child it is
offensive to the higher powers, who, instead of always chastising the
culprits, punish the tribe by visiting its members with misfortunes.
Hence when such a crime is detected they fine the lovers and
sacrifice a pig to appease the angry powers and to avert the
sickness or other calamities that might follow. Further, they inflict
fines on the families of the couple for any severe accident or death
by drowning that may have happened at any time within a month
before the religious atonement was made; for they regard the
families of the culprits as responsible for these mishaps. The fines
imposed for serious or fatal accidents are heavy; for simple wounds
they are lighter. With the fear of these fines before their eyes parents
keep a watchful eye on the conduct of their daughters. Among the
Dyaks of the Batang Lupar river the chastity of the unmarried girls is
not so strictly guarded; but in respectable families, when a daughter
proves frail, they sacrifice a pig and sprinkle its blood on the doors to
wash away the sin.48.2 The Hill Dyaks of Borneo abhor incest and do
not allow the marriage even of cousins. In 1846 the Baddat Dyaks
complained to Mr. Hugh Low that one of their chiefs had disturbed
the peace and prosperity of the village by marrying his own
granddaughter. Since that disastrous event, they said, no bright day
had blessed their territory; rain and darkness alone prevailed, and
unless the plague-spot were removed, the tribe would soon be
ruined. The old sinner was degraded from office, but apparently
allowed to retain his wife; and the domestic brawls between this ill-
assorted couple gave much pain to the virtuous villagers.49.1
Among the pagan tribes of Borneo in general,
Incest punished but of Sarawak in particular, “almost all offences
with death by the
pagan tribes of are punished by fines only. Of the few offences
Borneo. which are felt to require a heavier punishment, the
one most seriously regarded is incest. For this
offence, which is held to bring grave peril to the whole house,
especially the danger of starvation through failure of the padi crop,
two punishments have been customary. If the guilt of the culprits is
perfectly clear, they are taken to some open spot on the river-bank at
some distance from the house. There they are thrown together upon
the ground and a sharpened bamboo stake is driven through their
bodies, so that they remain pinned to the earth. The bamboo, taking
root and growing luxuriantly on this spot, remains as a warning to all
who pass by; and, needless to say, the spot is looked on with horror
and shunned by all men. The other method of punishment is to shut
up the offenders in a strong wicker cage and to throw them into the
river. This method is resorted to as a substitute for the former one,
owing to the difficulty of getting any one to play the part of
executioner and to drive in the stake, for this involves the shedding
of the blood of the community. The kind of incest most commonly
committed is the connection of a man with an adopted daughter, and
(possibly on account of this frequency) this is the kind which is most
strongly reprobated.… The punishment of the incestuous couple
does not suffice to ward off the danger brought by them upon the
community. The household must be purified with the blood of pigs
and fowls; the animals used are the property of the offenders or of
their family; and in this way a fine is imposed. When any calamity
threatens or falls upon a house, especially a great rising of the river
which threatens to sweep away the house or the tombs of the
household, the Kayans are led to suspect that incestuous
intercourse in their own or in neighbouring houses has taken place;
and they look round for evidences of it, and sometimes detect a case
which otherwise would have remained hidden. It seems probable
that there is some intimate relation between this belief and the
second of the two modes of punishment described above; but we
have no direct evidence of such connection. All the other peoples
also, except the Punans, punish incest with death. Among the Sea
Dyaks the most common form of incest is that between a youth and
his aunt, and this is regarded at least as seriously as any other
form.”50.1
Nor is it the heinous crime of incest alone which
Evil and confusion in the opinion of the Sea Dyaks endangers the
supposed by the
Dyaks to be whole community. The same effect is supposed to
wrought by follow whenever an unmarried woman is found
fornication. with child and cannot or will not name her seducer.
“The greatest disgrace,” we are told, “is attached
to a woman found in a state of pregnancy, without being able to
name her husband; and cases of self-poisoning, to avoid the shame,
are not of unusual occurrence. If one be found in this state, a fine
must be paid of pigs and other things. Few even of the chiefs will
come forward without incurring considerable responsibility. A pig is
killed, which nominally becomes the father, for want, it is supposed,
of another and better one. Then the surrounding neighbours have to
be furnished with a share of the fine to banish the Jabu, which exists
after such an event. If the fine be not forthcoming, the woman dare
not move out of her room, for fear of being molested, as she is
supposed to have brought evil (kudi) and confusion upon the
inhabitants and their belongings.”50.2
The foregoing accounts refer especially to the
Similar beliefs and tribes of Borneo under British rule; but similar
customs among the
tribes of Dutch ideas and customs prevail among the kindred
Borneo. tribes of Dutch Borneo. Thus the Kayans or
Bahaus in the interior of the island believe that adultery is punished
by the spirits, who visit the whole tribe with failure of the crops and
other misfortunes. Hence in order to avert these evil consequences
from the innocent members of the tribe, the two culprits, with all their
possessions, are first placed on a gravel bank in the middle of the
river, in order to isolate or, in electrical language, to insulate them
and so prevent the moral or rather physical infection from spreading.
Then pigs and fowls are killed, and with the blood priestesses smear
the property of the guilty pair in order to disinfect it. Finally, the two
are placed on a raft, with sixteen eggs, and allowed to drift down
stream. They may save themselves by plunging into the water and
swimming ashore; but this is perhaps a mitigation of an older
sentence of death by drowning, for young people still shower long
grass stalks, representing spears, at the shamefaced and dripping
couple.51.1 Certain it is, that some Dyak tribes used to punish incest
by fastening the man and woman in separate baskets laden with
stones and drowning them in the river. By incest they understood the
cohabitation of parents with children, of brothers with sisters, and of
uncles and aunts with nieces and nephews. A Dutch resident had
much difficulty in saving the life of an uncle and niece who had
married each other; finally he procured their banishment to a distant
part of Borneo.51.2 The Blu-u Kayans, another tribe in the interior of
Borneo, believe that an intrigue between an unmarried pair is
punished by the spirits with failure of the harvest, of the fishing, and
of the hunt. Hence the delinquents have to appease the wrath of the
spirits by sacrificing a pig and a certain quantity of rice.51.3 In Pasir, a
district of Eastern Borneo, incest is thought to bring dearth,
epidemics, and all sorts of evils on the land.51.4 In the island of
Ceram a man convicted of unchastity has to smear every house in
the village with the blood of a pig and a fowl: this is supposed to wipe
out his guilt and ward off misfortunes from the village.51.5
When the harvest fails in Southern Celebes, the
Failure of the crops Macassars and Bugineese regard it as a sure sign
and other disasters
thought to be that incest has been committed and that the spirits
caused by incest in are angry. In the years 1877 and 1878 it happened
Celebes. that the west monsoon did not blow and that the
rice crop in consequence came to nothing; moreover many buffaloes
died of a murrain. At the same time there was in the gaol at Takalar a
prisoner, who had been formerly accused of incest. Some of the
people of his district begged the Dutch governor to give the criminal
up to them, for according to the general opinion the plagues would
never cease till the guilty man had received the punishment he
deserved. All the governor’s powers of persuasion were needed to
induce the petitioners to return quietly to their villages; and when the
prisoner, having served his time, was released shortly afterwards, he
was, at his own request, given an opportunity of sailing away to
another land, as he no longer felt safe in his own country.52.1 Even
when the incestuous couple has been brought to
Disastrous effects justice, their blood may not be shed; for the people
supposed to follow
from shedding the think that, were the ground to be polluted by the
blood of incestuous blood of such criminals, the rivers would dry up
couples on the and the supply of fish would run short, the harvest
ground.
and the produce of the gardens would miscarry,
edible fruits would fail, sickness would be rife among cattle and
horses, civil strife would break out, and the country would suffer from
other widespread calamities. Hence the punishment of the guilty is
such as to avoid the spilling of their blood: usually they are tied up in
a sack and thrown into the sea to drown. Yet they get on their
journey to eternity the necessary provisions, consisting of a bag of
rice, salt, dried fish, coco-nuts, and other things, among which three
quids of betel are not forgotten.52.2 We can now perhaps understand
why the Romans used to sew up a parricide in a sack with a dog, a
cock, a viper, and an ape for company, and fling him into the sea.
They probably feared to defile the soil of Italy by spilling upon it the
blood of such a miscreant.52.3 Amongst the Tomori of Central
Celebes a person guilty of incest is throttled; no drop of his blood
may fall on the ground, for if it did, the rice would never grow again.
The union of uncle with niece is regarded by these people as incest,
but it can be expiated by an offering. A garment of the man and one
of the woman are laid on a copper vessel; the blood of a sacrificed
animal, either a goat or a fowl, is allowed to drip on the garments,
and then the vessel with its contents is set floating down the river.53.1
Among the Tololaki, another tribe of Central Celebes, persons who
have defiled themselves with incest are shut up in a basket and
drowned. No drop of their blood may be spilt on the ground, for that
would hinder the earth from ever bearing fruit again.53.2 Among the
Bare’e-speaking Toradjas of Central Celebes in general the penalty
for incest, that is for the sexual intercourse of parents with children or
of brothers with sisters, is death. But whereas the death-sentence for
adultery is executed with a spear or a sword, the death-sentence for
incest is usually executed among the inland tribes by clubbing or
throttling; for were the blood of the culprits to drip on the ground, the
earth would be rendered barren. The people on the coast put the
guilty pair in a basket, weight it with stones, and fling it into the sea.
This prescribed manner of putting the incestuous to death, we are
informed, makes the execution very grievous. However, the writers
who furnish us with these particulars and who have lived among the
people on terms of intimacy for many years, add that “incest seldom
occurs, or rather the cases that come to light are very few.”53.3 In
some districts of Central Celebes, the marriage of cousins, provided
they are children of two sisters, is forbidden under pain of death; the
people think that such an alliance would anger the spirits, and that
the rice and maize harvests would fail. Strictly speaking, two such
cousins who have committed the offence should be tied together,
weighted with stones, and thrown into water to drown. In practice,
however, the culprits are spared and their sin expiated by shedding
the blood of a buffalo or a goat. The blood is mixed with water and
sprinkled on the rice-fields or poured on the maize-fields, no doubt in
order to appease the angry spirits and restore its fertility to the tilled
land. The natives of these districts believe that were a brother and
sister to commit incest, the ground on which the tribe dwells would
be swallowed up. If such a crime takes place, the guilty pair are tied
together, their feet weighted with stones, and thrown into the sea.54.1
When it rains in torrents, the Galelareese of
Excessive rains, Halmahera, another large East Indian island, say
earthquakes, and
volcanic eruptions that brother and sister, or father and daughter, or
supposed to be in short some near kinsfolk are having illicit
produced by incest relations with each other, and that every human
in Halmahera.
being must be informed of it, for then only will the
rain cease to descend. The superstition has repeatedly caused blood
relations to be accused, rightly or wrongly, of incest. Further, the
people think that alarming natural phenomena, such as a violent
earthquake or the eruption of a volcano, are caused by crimes of the
same sort. Persons charged with such offences are brought to
Ternate; it is said that formerly they were often drowned on the way
or, on being haled thither, were condemned to be thrown into the
volcano.54.2 In the Banggai Archipelago, to the east of Celebes,
earthquakes are explained as punishments inflicted by evil spirits for
indulgence in illicit love.54.3
In some parts of Africa, also, it is believed that
Breaches of sexual breaches of sexual morality disturb the course of
morality thought to
blight the fruits of nature, particularly by blighting the fruits of the
the earth and earth; and probably such views are much more
otherwise disturb widely diffused in that continent than the scanty
the course of nature
in Africa. and fragmentary evidence at our disposal might
lead us to suppose. Thus, the negroes of Loango,
in West Africa, imagine that the commerce of a man with an
immature girl is punished by God with drought and consequent
famine until the transgressors expiate their transgression by dancing
naked before the king and an assembly of the people, who throw hot
gravel and bits of glass at the pair as they run the gauntlet. The rains
in that country should fall in September, but in 1898 there was a long
drought, and when the month of December had nearly passed, the
sun-scorched stocks of the fruitless Indian corn shook their rustling
leaves in the wind, the beans lay shrivelled and black on the ruddy
soil, and the shoots of the sweet potato had flowered and withered
long ago. The people cried out against their rulers for neglecting their
duty to the primeval powers of the earth; the priests of the sacred
groves had recourse to divination and discovered that God was
angry with the land on account of the immorality of certain persons
unknown, who were not observing the traditions and laws of their
God and country. The feeble old king had fled, but the slave who
acted as regent in his room sent word to the chiefs that there were
people in their towns who were the cause of God’s wrath. So every
chief called his subjects together and caused enquiries to be made,
and then it was discovered that three girls had broken the customs of
their country; for they were with child before they had passed
through what is called the paint-house, that is, before they had been
painted red and secluded for a season in token that they had
attained to the age of puberty. The people were incensed and
endeavoured to punish or even kill the three girls; and the English
writer who has recorded the case has thought it worth while to add
that on the very morning when the culprits were brought before the
magistrate rain fell.55.1 Amongst the Bavili of Loango, who are
divided into totemic clans, no man is allowed to marry a woman of
his mother’s clan; and God is believed to punish a breach of this
marriage law by withholding the rains in their due season.56.1 Similar
notions of the blighting influence of sexual crime appear to be
entertained by the Nandi of British East Africa; for we are told that
when a warrior has got a girl with child, she “is punished by being put
in Coventry, none of her girl friends being allowed to speak to or look
at her until after the child is born and buried. She
Sexual purity is also regarded with contempt for the rest of her
required of those
who handle corn or life and may never look inside a granary for fear of
enter a granary. spoiling the corn.”56.2 Among the Basutos in like
manner “while the corn is exposed to view, all
defiled persons are carefully kept from it. If the aid of a man in this
state is necessary for carrying home the harvest, he remains at
some distance while the sacks are filled, and only approaches to
place them upon the draught oxen. He withdraws as soon as the
load is deposited at the dwelling, and under no pretext can he assist
in pouring the corn into the basket in which it is preserved.”56.3 The
nature of the defilement which thus disqualifies a man from handling
the corn is not mentioned, but we may conjecture that unchastity
would fall under this general head. For amongst the Basutos after a
child is born a fresh fire has to be kindled in the dwelling by the
friction of wood, and this must be done by a young man of chaste
habits; it is believed that an untimely death awaits him who should
dare to discharge this holy office after having lost his innocence.56.4
In Morocco whoever enters a granary must first remove his slippers
and must be sexually clean. Were an unclean person to enter, the
people believe not only that the grain would lose its blessed
influence (baraka), but that he himself would fall ill. A Berber told Dr.

You might also like