RL-Endterm Report - Mridul Agarwal
What is Reinforcement Learning?
Reinforcement learning, in the context of artificial intelligence, is an area of machine
learning that trains algorithms using a system of reward and punishment.
Reinforcement learning is also different from what machine learning researchers call
unsupervised learning, which is typically about finding structure hidden in collections of
unlabeled data. Although one might be tempted to think of reinforcement learning as a
kind of unsupervised learning because it does not rely on examples of correct behavior,
reinforcement learning is trying to maximize a reward signal instead of trying to find
hidden structure. Uncovering structure in an agent’s experience can certainly be useful in
reinforcement learning, but by itself does not address the reinforcement learning problem
of maximizing a reward signal. We therefore consider reinforcement learning to be a third
machine learning paradigm, alongside supervised learning and unsupervised learning and
perhaps other paradigms.
ELEMENTS OF REINFORCEMENT LEARNING
A reinforcement learning system has an agent and an environment. Apart from these, there
are four main sub-elements:
• Policy-
A policy defines the learning agent’s way of behaving at a given time. Roughly
speaking, a policy is a mapping from perceived states of the environment to
actions to be taken when in those states. In some cases the policy may be a simple
function or lookup table, whereas in others it may involve extensive computation
such as a search process.
• Reward Signal-
A reward signal defines the goal of a reinforcement learning problem. On each time
step, the environment sends to the reinforcement learning agent a single number
called the reward. The agent’s sole objective is to maximize the total reward it
receives over the long run. The reward signal is the primary basis for altering the
policy; if an action selected by the policy is followed by low reward, then the policy
may be changed to select some other action in that situation in the future.
• Value Function-
A value function specifies what is good in the long run. Roughly speaking, the value
of a state is the total amount of reward an agent can expect to accumulate over the
future, starting from that state. Whereas rewards determine the immediate,
intrinsic desirability of environmental states, values indicate the long-term
desirability of states after taking into account the states that are likely to follow
and the rewards available in those states.
• Model of the Environment-
Though not compulsory, a model of the environment can do wonders for an
agent. A model is something that mimics the behavior of the environment, or
more generally, that allows inferences to be made about how the environment will
behave. Models are used for planning, i.e. deciding on a course of action by
considering possible future situations before they are actually experienced.
Methods for solving reinforcement learning problems that use models and
planning are called model-based methods, as opposed to simpler model-free
methods that are explicitly trial-and-error learners—viewed as almost the opposite
of planning.
Multi-Armed Bandits
The most important feature distinguishing reinforcement learning from other types of
learning is that it uses training information that evaluates the actions taken rather than
instructs by giving the correct actions.
In a k-armed bandit problem, each of the k actions has an expected or mean reward given
that that action is selected; this is called the value of that action. We denote the action
selected on time step t as At, and the corresponding reward as Rt. The value of an
arbitrary action a, denoted q*(a), is the expected reward given that a is selected:
q*(a) = E[Rt | At = a].
We assume that the action values are not known with certainty; we only have
estimates. The estimated value of action a at time step t is denoted Qt(a).
Exploration vs Exploitation
At any time step there is at least one action whose estimated value is the greatest. These
are called greedy actions. When you select one of these actions, it is called exploitation. If
instead you select one of the nongreedy actions, then we say you are exploring, because
this enables you to improve your estimate of the nongreedy action’s value. Exploitation is
the right thing to do to maximize the expected reward on the one step, but exploration
may produce the greater total reward in the long run.
Action-value method
These are methods for estimating the values of actions and for using the estimates to
make action selection decisions. One natural way to estimate this is by averaging the
rewards actually received:
Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t).
This is the sample-average method, as it uses the average of the relevant rewards.
The greedy action selection method can be written as
At = argmax_a Qt(a).
Another alternative is to behave greedily most of the time but sometimes, with a small
probability ε, select randomly among all available actions with equal probability.
We call this the ε-greedy method.
Incremental Implementation
To compute the averages of the observed rewards in a computationally efficient manner,
we use incremental formulas that update the averages with small, constant computation
per new reward:
Qn+1 = Qn + (1/n) [Rn − Qn].
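As a concrete illustration of the action-value method (this sketch is mine, not part of the report), a simple k-armed bandit agent that combines ε-greedy action selection with the incremental sample-average update could look like this; the bandit's true action values are drawn at random purely for the example:

import numpy as np

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, k)    # q*(a), unknown to the agent
    Q = np.zeros(k)                          # estimates Qt(a)
    N = np.zeros(k)                          # number of times each action was taken
    total_reward = 0.0
    for t in range(steps):
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore: random action
        else:
            a = int(np.argmax(Q))            # exploit: greedy action
        r = rng.normal(true_values[a], 1.0)  # noisy reward for the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
        total_reward += r
    return Q, total_reward

Q, total = run_bandit()
print(Q, total)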
Non-stationary Problems:
We often encounter RL problems in which the reward probabilities change with time. In such
cases it makes sense to give more weight to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use a constant step-size parameter α:
Qn+1 = Qn + α [Rn − Qn].
The weight given to Ri decreases as the number of intervening rewards increases. In fact it
decays exponentially, so this is sometimes called an exponential recency-weighted average.
Conditions required to assure convergence with probability 1:
Σn αn(a) = ∞ and Σn αn(a)^2 < ∞.
The first condition guarantees that the steps are large enough to eventually overcome any
initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.
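For the non-stationary case, a minimal sketch of the constant step-size update (my own illustration, with α = 0.1 chosen arbitrarily) that replaces the 1/N(a) step used in the bandit example above:

ALPHA = 0.1  # constant step size; recent rewards receive exponentially more weight

def update_constant_alpha(Q, a, r, alpha=ALPHA):
    # Q(a) <- Q(a) + alpha * (r - Q(a)); the weight of older rewards decays by (1 - alpha) each step
    Q[a] += alpha * (r - Q[a])
    return Q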
Upper Confidence Bound Action Selection-
The ε-greedy method explores indiscriminately; it would be more sensible to select among
the non-greedy actions according to their potential for actually being optimal.
One effective way of doing this is
At = argmax_a [ Qt(a) + c * sqrt( ln t / Nt(a) ) ].
Here the square-root term is a measure of the uncertainty, or variance, in the estimate of a's
value, and c determines the confidence level.
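A small sketch (mine, not from the report) of UCB action selection, operating on the estimates Q and action counts N from the bandit example earlier; untried actions are treated as maximally uncertain and selected first:

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Select any action that has not been tried yet
    if np.any(N == 0):
        return int(np.argmin(N))
    # Qt(a) + c * sqrt(ln t / Nt(a)); t is the current time step (t >= 1)
    ucb_values = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(ucb_values))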
Gradient Bandit Algorithms-
Actions can also be selected according to numerical preferences Ht(a). On each step, after
selecting an action At and receiving the reward Rt, the action preferences are updated by:
Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At)), and
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a) for all a ≠ At.
Associative Search-
It involves both trial-and-error learning to search for the best actions, and association of
these actions with the situations in which they are best.
FINITE MARKOV DECISION PROCESSES
The Agent-Environment Interface-
The learner and decision maker is called the agent. Everything outside the agent is the
environment. The agent receives some representation of the environment's state, on that
basis selects an action, and then receives a reward and the state for the next time step.
(The standard agent-environment interaction diagram appears here in the original report.)
Note that the reward and state at a particular time t depend on the state and action at the
previous time step.
State-transition probabilities-
p(s', r | s, a) = Pr{St = s', Rt = r | St-1 = s, At-1 = a}.
The expected rewards for a state-action pair and for a state-action-next-state triplet are
r(s, a) = E[Rt | St-1 = s, At-1 = a] and r(s, a, s') = E[Rt | St-1 = s, At-1 = a, St = s'].
Reward Hypothesis-
All of what we mean by goals and purposes can be well thought of as the maximization of
the expected value of the cumulative sum of a received scalar signal (called the reward).
Discounting- An approach in which the agent tries to select actions so that the sum of the
discounted rewards it receives over the future is maximised:
Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... = Σk γ^k Rt+k+1, where 0 ≤ γ ≤ 1 is the discount rate.
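As a quick illustration (my own), the discounted return Gt can be computed from a list of future rewards as follows:

def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62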
Unified Notation for Episodic & Continuing Tasks-
We have defined the return as a sum over a finite number of terms for episodic tasks and as
a sum over infinitely many terms for continuing tasks. The two can be unified by considering
episode termination to be the entering of a special absorbing state that transitions only to
itself and generates only rewards of zero.
Alternatively, the return can be written as
Gt = Σ (k = t+1 to T) γ^(k-t-1) Rk,
allowing the possibility that T = ∞ or γ = 1 (but not both).
Monte Carlo Methods- In these methods the agent maintains averages of the returns
observed after visiting each state (or after taking each action in each state); these averages
approach the true values as the number of visits tends to infinity.
The value of a state under a policy π satisfies the Bellman equation,
vπ(s) = Σa π(a|s) Σ(s', r) p(s', r | s, a) [ r + γ vπ(s') ],
which expresses a relationship between the value of a state and the values of its successor
states.
There is always at least one policy that is better than or equal to all other policies; this is
called an optimal policy. Even when there are multiple optimal policies, they share the same
state-value function, called the optimal state-value function.
Dynamic programming
The term dynamic programming refers to a collection of algorithms that can be used to
compute optimal policies given a perfect model of the environment as an MDP.
Policy Evaluation (Prediction)
Assuming the environment's dynamics are completely known, iterative solution methods are
most suitable for our purposes. To produce each successive approximation Vk+1 from Vk,
iterative policy evaluation replaces the old value of s with a new value obtained from the old
values of the successor states of s and the expected immediate rewards, along all the
one-step transitions possible under the policy being evaluated. We call this kind of
operation an expected update.
If qπ(s, π'(s)) ≥ vπ(s) for all states s ∈ S, then the policy π' must be as good as, or better
than, π. That is, it must obtain greater or equal expected return from all states s ∈ S.
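A minimal sketch of iterative policy evaluation on a tiny, made-up two-state MDP; the transition table P and the equiprobable policy used here are assumptions for illustration only, not part of the report:

import numpy as np

# Hypothetical dynamics: P[s][a] is a list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.5)]},
}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # pi(a|s), equiprobable
gamma, theta = 0.9, 1e-8

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        v_new = 0.0
        for a, pi_sa in policy[s].items():
            for prob, s_next, r in P[s][a]:
                # Expected update: v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
                v_new += pi_sa * prob * (r + gamma * V[s_next])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)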
Monte Carlo & Temporal Difference Learning-
The goal of MC and TD learning is to learn the value functions from the agent's experience
as the agent follows its policy π.
MC learning updates the value towards the actual return Gt, which is the total discounted
reward from time step t onwards. TD learning (TD(0) to be precise) updates the value
towards the estimated return Rt+1 + γ V(St+1), which can be calculated after every step.
The value-estimate update equations for the two methods are:
Monte Carlo: V(St) ← V(St) + α [ Gt − V(St) ]
TD(0): V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]
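In code form, the two update rules might be sketched as follows (my illustration; V is assumed to be an array or dict of state-value estimates):

def mc_update(V, s, G, alpha=0.1):
    # Monte Carlo: move V(s) towards the actual return G_t (known only at the end of an episode)
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # TD(0): move V(s) towards the estimated return r + gamma * V(s') after every single step
    V[s] += alpha * (r + gamma * V[s_next] - V[s])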
SARSA & Q-Learning
It is very useful for an agent to learn the action-value function, which informs the agent
about the long-term value of taking action 'a' in state 's', so that the agent can take those
actions that will maximize its expected, discounted future reward. The SARSA & Q-learning
algorithms enable an agent to learn that.
Q-learning is a popular off-policy learning algorithm, and it is similar to SARSA, except for
one thing. Instead of using the Q value estimate for the new state and the action that the
agent took in that new state, it uses the Q value estimate that corresponds to the action
that leads to the maximum obtainable Q value from that new state, S'.
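A small sketch (mine) of the two update rules side by side, assuming Q is a dictionary mapping each state to a dictionary of action values:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action A' the agent actually took in the new state S'
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the maximum Q value obtainable from the new state S'
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])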
OpenAI Gym
OpenAI Gym is an open source toolkit that provides a diverse collection of tasks, called
environments, with a common interface for developing and testing your intelligent agent
algorithms. The toolkit introduces a standard Application Programming Interface
(API) for interfacing with environments designed for reinforcement learning. Each
environment has a version attached to it, which ensures meaningful comparisons and
reproducible results with the evolving algorithms and the environments themselves.
The following work has been done on Windows using the Anaconda command prompt, and
the code is written in a Jupyter Notebook.
Install the Gym Environment-
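The installation screenshots from the original report are not reproduced here; a typical installation from the Anaconda command prompt (assuming pip is available in the active environment) is:

pip install gym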
After importing Gym, we create an environment:
env = gym.make("ENVIRONMENT_NAME")
We get the first observation from the environment by calling env.reset(). Let's store the
observation in a variable named obs using the following line of code:
obs = env.reset()
Our task is to design the algorithm that is responsible for choosing the action.
1) Once the action to be taken is decided, we send it to the environment (second arrow
in the diagram) using the env.step() method, which will return four values in this
order: next_state, reward, done, and info:
2) The reward (third arrow in the diagram) is returned by the environment.
3) The done variable is a Boolean (true or false), which gets a value of true if the episode
has terminated/finished (therefore, it is time to reset the environment) and false
otherwise.
4) The info variable returned is an optional variable, which some environments may
return with some additional information. (We won’t be using this)
import gym

env = gym.make("ENVIRONMENT_NAME")
obs = env.reset()  # The first arrow in the picture

# Inner loop (roll out)
action = agent.choose_action(obs)  # The second arrow in the picture
next_state, reward, done, info = env.step(action)  # The third arrow (and more)
obs = next_state
# Repeat Inner loop (roll out)
Spaces in The Gym Environment
Each environment in the Gym is different. Every game environment under the Atari
category is also different from the others.
The mathematical definitions of the allowed observations and actions for a given
environment are known as spaces.
Different Categories of Spaces Include:
- Box
- Discrete
- Multi-binary
- Dict
- Multi Discrete
- Tuple
(The original report summarizes the state and action space type, description, and range of
allowed values in a table here.)
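A minimal sketch (mine) of how the spaces of an environment can be inspected through the classic Gym API, using the MountainCar environment that the rest of this report works with:

import gym

env = gym.make("MountainCar-v0")
print(env.observation_space)       # e.g. Box(2,): car position and velocity
print(env.observation_space.high)  # upper bound of each observation dimension
print(env.observation_space.low)   # lower bound of each observation dimension
print(env.action_space)            # e.g. Discrete(3): push left, no push, push right
print(env.action_space.sample())   # a random valid action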
METHOD 1: TAKING RANDOM STEPS
We will implement the famous Q-learning algorithm using the NumPy library and the
MountainCar-v0 environment from the OpenAI Gym library. First, we simply take random
steps to get familiar with the environment.
Next we set MAX_STEPS_PER_EPISODE. This is the number of steps or actions that the
agent can take before the episode ends. This may be useful in continuing, perpetual, or
looping environments, where the environment itself does not end the episode.
#!/usr/bin/env python
import gym
env = gym.make("MountainCar-v0")
MAX_NUM_EPISODES = 5000
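The listing above omits the interaction loop; a minimal reconstruction of the random-action rollout loop that the text describes (assuming the classic Gym API used throughout this report) is:

for episode in range(MAX_NUM_EPISODES):
    obs = env.reset()
    done = False
    total_reward = 0.0
    step = 0
    while not done:
        env.render()
        action = env.action_space.sample()  # choose a random action
        obs, reward, done, info = env.step(action)
        total_reward += reward
        step += 1
    print("Episode #{} ended in {} steps. total_reward={}".format(episode, step, total_reward))
env.close()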
If we run the preceding script together with the rollout loop above, we see the Mountain
Car environment come up in a new window and the car moving left and right randomly for
MAX_NUM_EPISODES episodes. We also see the episode number, steps taken, and the total
reward obtained printed at the end of every episode.
METHOD 2: IMPLEMENTING Q-LEARNING ALGORITHM
Before our Q_Learner class declaration, we initialize a few useful hyperparameters.
- EPSILON_MIN: This is the minimum value of the epsilon value that we want the
agent to use while following an epsilon-greedy policy.
- MAX_NUM_EPISODES: The maximum number of episodes that we want the
agent to interact with the environment for.
- STEPS_PER_EPISODE: This is the number of steps in each episode. This could
be the maximum number of steps that an environment will allow per episode or a
custom value that we want to limit based on some time budget.
- ALPHA: This is the learning rate that we want the agent to use. This is the alpha in
the Q-learning update equation listed in the previous section.
- GAMMA: This is the discount factor that the agent will use to factor in future
rewards.
- NUM_DISCRETE_BINS: This is the number of bins of values that the state
space will be discretized into. For the Mountain Car environment, we will be
discretizing the state space into 30 bins.
EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps
ALPHA = 0.05  # Learning rate
GAMMA = 0.98  # Discount factor
NUM_DISCRETE_BINS = 30  # Number of bins to discretize each observation dim
The __init__(self, env) function takes the environment instance, env, as an input
argument and initializes the dimensions/shape of the observation space and the action
space, and also determines the parameters to discretize the observation space based on
the NUM_DISCRETE_BINS set. The __init__(self, env) function also initializes the Q
function as a NumPy array, based on the shape of the discretized observation space and
the action space dimensions.
THIS IS HOW THE COMPLETE CODE LOOKS-
import gym
import numpy as np

MAX_NUM_EPISODES = 50000
STEPS_PER_EPISODE = 200  # This is specific to MountainCar. May change with env
EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps
ALPHA = 0.05  # Learning rate
GAMMA = 0.98  # Discount factor
NUM_DISCRETE_BINS = 30  # Number of bins to discretize each observation dim

class Q_Learner(object):
    def __init__(self, env):
        self.obs_shape = env.observation_space.shape
        self.obs_high = env.observation_space.high
        self.obs_low = env.observation_space.low
        self.obs_bins = NUM_DISCRETE_BINS  # Number of bins to discretize each observation dim
        self.bin_width = (self.obs_high - self.obs_low) / self.obs_bins
        self.action_shape = env.action_space.n
        # Create a multi-dimensional array (aka. table) to represent the Q-values
        self.Q = np.zeros((self.obs_bins + 1, self.obs_bins + 1,
                           self.action_shape))  # (31 x 31 x 3)
        self.alpha = ALPHA  # Learning rate
        self.gamma = GAMMA  # Discount factor
        self.epsilon = 1.0
    def get_action(self, obs):
        discretized_obs = self.discretize(obs)
        # Epsilon-Greedy action selection
        if self.epsilon > EPSILON_MIN:
            self.epsilon -= EPSILON_DECAY
        if np.random.random() > self.epsilon:
            return np.argmax(self.Q[discretized_obs])
        else:  # Choose a random action
            return np.random.choice([a for a in range(self.action_shape)])
if __name__ == "__main__":
    env = gym.make('MountainCar-v0')
    agent = Q_Learner(env)
    learned_policy = train(agent, env)
    # Use the Gym Monitor wrapper to evaluate the agent and record video
    gym_monitor_path = "./gym_monitor_output"
    env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
    for _ in range(1000):
        test(agent, env, learned_policy)
    env.close()
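The pages captured above omit the Q_Learner.discretize() and Q_Learner.learn() methods, as well as the train() and test() functions that the main block calls. A minimal sketch consistent with the class above (my reconstruction, not the report's exact code; the first two methods belong inside the Q_Learner class) is:

    def discretize(self, obs):
        # Map a continuous observation to a tuple of bin indices
        return tuple(((obs - self.obs_low) / self.bin_width).astype(int))

    def learn(self, obs, action, reward, next_obs):
        discretized_obs = self.discretize(obs)
        discretized_next_obs = self.discretize(next_obs)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = reward + self.gamma * np.max(self.Q[discretized_next_obs])
        td_error = td_target - self.Q[discretized_obs][action]
        self.Q[discretized_obs][action] += self.alpha * td_error

def train(agent, env):
    best_reward = -float('inf')
    for episode in range(MAX_NUM_EPISODES):
        done = False
        obs = env.reset()
        total_reward = 0.0
        while not done:
            action = agent.get_action(obs)
            next_obs, reward, done, info = env.step(action)
            agent.learn(obs, action, reward, next_obs)
            obs = next_obs
            total_reward += reward
        best_reward = max(best_reward, total_reward)
        print("Episode#:{} reward:{} best_reward:{} eps:{}".format(
            episode, total_reward, best_reward, agent.epsilon))
    # Return the greedy policy learned from the Q table
    return np.argmax(agent.Q, axis=2)

def test(agent, env, policy):
    done = False
    obs = env.reset()
    total_reward = 0.0
    while not done:
        action = policy[agent.discretize(obs)]
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward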
THE SIMULATION UNDERWAY
Bibliography:
• Books
- Artificial Intelligence: A Modern Approach, 3rd ed., by Stuart Russell & Peter Norvig
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
- Hands-On Intelligent Agents with OpenAI Gym by Praveen Palanisamy
• Online Courses-
- https://www.cse.iitb.ac.in/~shivaram/teaching/cs337+335-s2019/
- https://www.coursera.org/learn/neural-networks-deep-learning/home/welcome
- https://www.coursera.org/learn/practical-rl/home/week/3
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html