RL-Endterm Report - Mridul Agarwal


Deep Reinforcement Learning

SUMMER OF SCIENCE, 2019

Mridul Agarwal | 180010034 | 27th July 2019


Under the Mentorship of-
Sanchit Jain (160050043)
"IF INTELLIGENCE WAS A CAKE, UNSUPERVISED LEARNING WOULD BE THE
CAKE, SUPERVISED LEARNING WOULD BE THE ICING ON THE CAKE, AND
REINFORCEMENT LEARNING WOULD BE THE CHERRY ON THE CAKE."
-YANN LECUN

What is Reinforcement Learning?
Reinforcement learning, in the context of artificial intelligence, is a machine learning
approach that trains algorithms using a system of rewards and penalties.

A reinforcement learning algorithm, or agent, learns by interacting with its environment.


The agent receives rewards for performing correctly and penalties for performing
incorrectly. The agent learns without intervention from a human by maximizing its
reward and minimizing its penalty.

HOW IS IT DIFFERENT FROM SUPERVISED AND UNSUPERVISED LEARNING?


Reinforcement learning is different from supervised learning, in the sense that supervised
learning is learning from a training set of labeled examples provided by a knowledgeable
external supervisor. Each example is a description of a situation together with a
specification—the label—of the correct action the system should take to that situation,
which is often to identify a category to which the situation belongs. The object of this kind
of learning is for the system to extrapolate, or generalize, its responses so that it acts
correctly in situations not present in the training set. This is an important kind of
learning, but alone it is not adequate for learning from interaction. In interactive problems
it is often impractical to obtain examples of desired behavior that are both correct and
representative of all the situations in which the agent has to act. In uncharted territory—
where one would expect learning to be most beneficial—an agent must be able to learn
from its own experience.

Reinforcement learning is also different from what machine learning researchers call
unsupervised learning, which is typically about finding structure hidden in collections of
unlabeled data. Although one might be tempted to think of reinforcement learning as a
kind of unsupervised learning because it does not rely on examples of correct behavior,
reinforcement learning is trying to maximize a reward signal instead of trying to find
hidden structure. Uncovering structure in an agent’s experience can certainly be useful in
reinforcement learning, but by itself does not address the reinforcement learning problem
of maximizing a reward signal. We therefore consider reinforcement learning to be a third
machine learning paradigm, alongside supervised learning and unsupervised learning and
perhaps other paradigms.

ELEMENTS OF REINFORCEMENT LEARNING
A reinforcement learning system has an agent and an environment. Apart from these, there
are four main sub-elements:

• Policy-
A policy defines the learning agent’s way of behaving at a given time. Roughly
speaking, a policy is a mapping from perceived states of the environment to
actions to be taken when in those states. In some cases the policy may be a simple
function or lookup table, whereas in others it may involve extensive computation
such as a search process.
• Reward Signal-
A reward signal defines the goal of a reinforcement learning problem. On each time
step, the environment sends to the reinforcement learning agent a single number
called the reward. The agent’s sole objective is to maximize the total reward it
receives over the long run. The reward signal is the primary basis for altering the
policy; if an action selected by the policy is followed by low reward, then the policy
may be changed to select some other action in that situation in the future.
• Value Function-
A value function specifies what is good in the long run. Roughly speaking, the value
of a state is the total amount of reward an agent can expect to accumulate over the
future, starting from that state. Whereas rewards determine the immediate,
intrinsic desirability of environmental states, values indicate the long-term
desirability of states after taking into account the states that are likely to follow
and the rewards available in those states.
• Model of the Environment-
Though not compulsory, a model of the environment can do wonders for an
agent. A model is something that mimics the behavior of the environment, or
more generally, that allows inferences to be made about how the environment will
behave. Models are used for planning, i.e. deciding on a course of action by
considering possible future situations before they are actually experienced.
Methods for solving reinforcement learning problems that use models and
planning are called model-based methods, as opposed to simpler model-free
methods that are explicitly trial-and-error learners—viewed as almost the opposite
of planning.

Multi-Armed Bandits
The most important feature distinguishing reinforcement learning from other types of
learning is that it uses training information that evaluates the actions taken rather than
instructs by giving correct actions.

The K-armed bandit problem


In such a problem we are faced repeatedly with a choice among k different actions. After
each choice, a numerical reward is received which is chosen from a stationary probability
distribution that depends on the action we selected. The objective is to maximize the total
reward over a certain time period.

In a k-armed bandit problem, each of the k actions has an expected or mean reward given
that that action is selected; this is called the value of that action. We denote the action
selected on time step t as At, and the corresponding reward as Rt. The value then of an
arbitrary action a, denoted q*(a), is the expected reward given that a is selected:

q*(a) = E[Rt | At = a]

We assume that the action values are not known with certainty; we only have
estimates. The estimated value of action a at time step t is denoted Qt(a).

Exploration vs Exploitation
At any time step there is at least one action whose estimated value is the greatest. These
are called greedy actions. When you select one of these actions, it is called exploitation. If
instead you select one of the nongreedy actions, then we say you are exploring, because
this enables you to improve your estimate of the nongreedy action’s value. Exploitation is
the right thing to do to maximize the expected reward on the one step, but exploration
may produce the greater total reward in the long run.

Action-value method
These are methods for estimating the values of actions and for using the estimates to
make action selection decisions. One natural way to estimate this is by averaging the
rewards actually received:
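(The equation image is missing here; following Sutton & Barto's notation, the sample-average
estimate is presumably:)

Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)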

This is the sample average method as it uses the average of the relevant rewards.
The greedy action selection method is known as-
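(The equation image is missing; presumably the greedy selection rule meant here is:)

At = argmax_a Qt(a)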

Another alternative is to behave greedily most of the time but, with a small probability ε,
select randomly from among all the available actions with equal probability.
We call this the ε-greedy method.
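As a rough illustrative sketch (my own code, not from the report; the values of k and ε are
arbitrary assumptions), sample-average estimates with ε-greedy selection can be written as:

import numpy as np

k = 10            # number of arms (assumed for illustration)
epsilon = 0.1     # exploration probability (assumed for illustration)
Q = np.zeros(k)   # estimated action values, Qt(a)
N = np.zeros(k)   # number of times each action has been selected

def select_action():
    if np.random.random() < epsilon:
        return np.random.randint(k)   # explore: pick a random action
    return int(np.argmax(Q))          # exploit: pick the greedy action

def update(action, reward):
    # Sample-average update, written incrementally: Q <- Q + (1/N)(R - Q)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]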

Incremental Implementation
This addresses the issue of computing the sample averages of observed rewards in a
computationally efficient manner. For this we use incremental formulas for updating the
averages, with small, constant computation required to process each new reward.
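(The incremental update formula is missing here; in Sutton & Barto's notation it is presumably:)

Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

that is, NewEstimate = OldEstimate + StepSize * (Target − OldEstimate).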

Non-stationary problems:
We often encounter RL problems in which the reward function changes with time. In such
cases it makes sense to give more weight to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use a constant step-size parameter.
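(The constant step-size update referred to here is presumably:)

Q_{n+1} = Q_n + α [R_n − Q_n],  with constant α in (0, 1],

which expands to Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i.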

The weight given to Ri decreases as the number of intervening rewards increases. In fact it
decays exponentially, so this is sometimes called an exponential recency-weighted average.
Conditions for convergence-
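(The two conditions meant here are presumably the standard stochastic-approximation conditions:)

Σ_{n=1}^{∞} α_n(a) = ∞   and   Σ_{n=1}^{∞} α_n(a)^2 < ∞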

The first condition guarantees that the steps are large enough to eventually overcome any
initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.

Optimistic Initial Values-

We set the initial action-value estimates much higher than any reward that can actually be
obtained. Whichever action is selected, the reward it receives pulls its estimate well below
the still-optimistic estimates of the other actions, so the learner keeps switching actions and
a lot of exploration takes place before the estimates settle down.
Note- Any method that focuses on the initial conditions in a special way is not well suited to
the nonstationary case because its drive for exploration is inherently temporary. If the task
changes, creating a renewed need for exploration, this method cannot help.

Upper Confidence Bound Action selection-
In the ε-greedy method, all non-greedy actions are explored with equal probability; it would be
more sensible to give preference to those non-greedy actions that are nearly greedy or whose
estimates are particularly uncertain.
One effective way of doing this is-
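(The selection rule image is missing; presumably it is the UCB rule:)

At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]

where Nt(a) is the number of times action a has been selected prior to time t.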

Here the square root term is a measure of the uncertainty, or variance, in the estimate of a's
value.
The constant c determines the confidence level.

Gradient Bandit Algorithm-


We set a numerical preference for each action a.
We determine the action probabilities using a soft-max distribution as follows…
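(The soft-max distribution meant here is presumably:)

πt(a) = e^{Ht(a)} / Σ_b e^{Ht(b)}

where Ht(a) is the numerical preference for action a.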

So on each step, after selecting an action At and receiving the reward Rt, the action
preferences are updated by:
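(The update equations are missing; presumably they are:)

H_{t+1}(At) = Ht(At) + α (Rt − R̄t) (1 − πt(At)), and
H_{t+1}(a) = Ht(a) − α (Rt − R̄t) πt(a)   for all a ≠ At,

where R̄t is the average of the rewards received up to time t.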

Associative Search-
It involves both trial and error learning to search for the best actions, & association of
these actions with the situations in which they are best.

FINITE MARKOV DECISION PROCESSES
The Agent Environment Interface-
The learner and decision maker is called the agent.
Everything else is the environment. Along with this, the agent also receives some
representation of the environment's state. On that basis the agent then selects an action
and receives a reward and a new state at the next time step.
The basic diagram is-

The agent thereby generates a trajectory of the form S0, A0, R1, S1, A1, R2, S2, A2, R3, ...

Note that the reward and state at a particular time step t depend only on the immediately
preceding state and action.

The function p defines the dynamics of the MDP:
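(The equation image is missing; in Sutton & Barto's notation it is presumably:)

p(s', r | s, a) = Pr{ St = s', Rt = r | St−1 = s, At−1 = a }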


The state must include information about all aspects of the past agent-environment
interaction that make a difference for the future. If it does, then the state is said to have
the Markov property.

State-transition probabilities-

Also, the expected rewards for a state-action pair and for a state-action-next-state triple are given as-
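(The equation images are missing; presumably they are:)

p(s' | s, a) = Σ_r p(s', r | s, a)
r(s, a) = E[Rt | St−1 = s, At−1 = a] = Σ_r r Σ_{s'} p(s', r | s, a)
r(s, a, s') = E[Rt | St−1 = s, At−1 = a, St = s'] = Σ_r r p(s', r | s, a) / p(s' | s, a)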

Reward Hypothesis-
All of what we mean by goals and purposes can be well thought of as the maximization of
the expected value of the cumulative sum of a received scalar signal (called the reward).

So we seek to maximize the Expected Return-
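(The return formula is missing; presumably it is:)

Gt = Rt+1 + Rt+2 + Rt+3 + ... + RT

where T is the final time step of the episode.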

Episodes- when the agent-environment interaction breaks naturally into subsequences, each
such subsequence is called an episode.

Each episode ends in a special state called the terminal state.
Continuing tasks are those where the agent-environment interaction does not break
naturally into identifiable episodes.

Discounting- An approach in which the agent tries to select actions so that the sum of the
discounted rewards it receives over the future is maximized.
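(The discounted return formula is missing; presumably it is:)

Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1,  with discount rate 0 ≤ γ ≤ 1.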

Unified Notation for Episodic & Continuing Tasks-
We have defined the return as a sum over a finite number of terms in the episodic case and
as a sum over an infinite number of terms in the continuing case. These two can be unified by
considering episode termination to be the entering of a special absorbing state that transitions
only to itself and generates only rewards of zero.

Alternatively-
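(The unified formula is presumably:)

Gt = Σ_{k=t+1}^{T} γ^{k−t−1} Rk

allowing either T = ∞ or γ = 1, but not both.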

Policies & Value Functions-


RL involves estimating value functions, i.e. functions which say how good a given state is for
our agent or, to be precise, what the expected future return from that state is. This of course
depends on the actions the agent takes, i.e. on its policy.
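(The defining equations are missing; following Sutton & Barto they are presumably:)

vπ(s) = Eπ[ Gt | St = s ]              (state-value function for policy π)
qπ(s, a) = Eπ[ Gt | St = s, At = a ]   (action-value function for policy π)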

Monte Carlo Methods- These are methods in which the agent maintains averages of the
returns observed after each state (or after each action taken in each state); these averages
approach the true value of the state as the number of visits tends to infinity.

A fundamental property of value functions used throughout reinforcement learning and
dynamic programming is that they satisfy recursive relationships. For any policy π and
any state s, the following consistency condition holds between the value of s and the values
of its possible successor states.
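(The equation image is missing; presumably the condition, the Bellman equation for vπ, is:)

vπ(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ vπ(s') ]   for all s ∈ S.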

The last equation is the Bellman equation; it expresses a relationship between the value of
a state and the values of its successor states.

Optimal Policies & Value Functions


A policy π is defined to be better than or equal to a policy π' if its expected return is
greater than or equal to that of π' for all states.

There is always at least one such policy, and it is called an optimal policy. Even when there
are multiple such policies, they all share the same state-value function, called the optimal
state-value function.
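(The defining equations are presumably:)

v*(s) = max_π vπ(s)   and   q*(s, a) = max_π qπ(s, a)   for all states s and actions a.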

Dynamic programming
The term dynamic programming refers to a collection of algorithms that can be used to
compute optimal policies given a perfect model of the environment as an MDP.

Policy Prediction
Assuming the environment dynamics are known, iterative solution methods are most suitable
for our purposes.

To produce each successive approximation, Vk+1 from Vk, iterative policy evaluation
replaces the old value of s with a new value obtained from the old values of the successor
states of s, and the expected immediate rewards, along with all the one-step transitions
possible under the policy being evaluated. We call this kind of operation an expected
update.
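(The update equation of iterative policy evaluation is presumably:)

v_{k+1}(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ]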

Policy Improvement Theorem:


Let π & π' be any pair of deterministic policies such that, for all s ∈ S:

Then the policy π' must be as good as, or better than, π. That is, it must obtain greater or
equal expected return from all states s ∈ S.

The new greedy policy π’ is given by:
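(The three equation images above are missing; presumably they are, in order, the condition,
the conclusion, and the greedy policy:)

qπ(s, π'(s)) ≥ vπ(s)   for all s ∈ S   (condition)
vπ'(s) ≥ vπ(s)   for all s ∈ S   (conclusion)
π'(s) = argmax_a qπ(s, a) = argmax_a Σ_{s', r} p(s', r | s, a) [ r + γ vπ(s') ]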

Monte Carlo & Temporal Difference Learning-
The goal of MC and TD learning is to learn the value functions from the agent's experience
as the agent follows its policy π.

MC learning updates the value towards the actual return Gt, which is the total discounted
reward from time step t onwards, whereas TD learning (TD(0) to be precise) updates the value
towards an estimated return, which can be calculated after every step.

The following table summarizes the value estimate’s update equation for Monte Carlo &
Temporal Difference learning Methods.
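(The table image is missing; presumably the two update equations summarized are:)

Monte Carlo:  V(St) ← V(St) + α [ Gt − V(St) ]
TD(0):        V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]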

SARSA & Q-Learning
It is very useful for an agent to learn the action value function, which informs the agent
about the long-term value of taking action ‘a’ in state ‘s’ so that the agent can take those
actions that will maximize its expected, discounted future reward. The SARSA & Q
learning algorithms enable an agent to learn that.

SARSA is so named because of the sequence State->Action->Reward->State'->Action' that
the algorithm's update step depends on. The description of the sequence goes like this: the
agent, in state S, takes an action A and gets a reward R, and ends up in the next state S',
after which the agent decides to take an action A' in the new state. Based on this
experience, the agent can update its estimate of Q(S,A).
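(The SARSA update rule, in standard notation, is presumably:)

Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') − Q(S, A) ]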

Q-learning is a popular off-policy learning algorithm, and it is similar to SARSA, except for
one thing. Instead of using the Q value estimate for the new state and the action that the
agent took in that new state, it uses the Q value estimate that corresponds to the action
that leads to the maximum obtainable Q value from that new state, S'.
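(The Q-learning update rule is presumably:)

Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]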

Deep Reinforcement Learning


The state-value function V(s) is a real-valued function that takes the current state s as the
input and outputs a real number. This number is the agent's prediction of how good
it is to be in state s, and the agent keeps updating the value function based on the new
experiences it gains. Likewise, the action-value function Q(S,A) is also a real-valued
function, which takes action A as an input in addition to state S, and outputs a real
number.

Deep reinforcement learning is basically the application of deep learning techniques to
reinforcement learning, such as using deep neural networks to approximate the value
functions or to represent the policy.
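As an illustration of the idea (my own sketch, not code from this report; the layer sizes and
dimensions are arbitrary assumptions), a network approximating the action-value function in
PyTorch could look like this:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates the action-value function Q(s, a) with a neural network."""
    def __init__(self, state_dim, num_actions):
        super(QNetwork, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64),   # hidden sizes are arbitrary choices
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions)  # one Q-value output per action
        )

    def forward(self, state):
        return self.layers(state)

# Example: Q-values for a (hypothetical) 2-dimensional state and 3 actions
q_net = QNetwork(state_dim=2, num_actions=3)
q_values = q_net(torch.rand(1, 2))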

Current applications of this method include:

- Robot Locomotion & Manipulation
- Helping AI create wagering strategies and even beat humans in Jeopardy!
- Learning to play video games better than humans.

OpenAI Gym
OpenAI Gym is an open source toolkit that provides a diverse collection of tasks, called
environments, with a common interface for developing and testing your intelligent agent
algorithms. The toolkit introduces a standard Application Programming Interface
(API) for interfacing with environments designed for reinforcement learning. Each
environment has a version attached to it, which ensures meaningful comparisons and
reproducible results with the evolving algorithms and the environments themselves.

OpenAI Gym provides a simple and common Python interface to environments.


Specifically, it takes an action as input and provides observation, reward, done and an
optional info object, based on the action as the output at each step.

The following work has been done on Windows using the Anaconda command prompt, and
the code is written in a Jupyter Notebook.

Install the Gym Environment-

$ pip install gym

Tools & Libraries Needed


PyTorch
It is a highly optimized deep learning library. There are other libraries too, like TensorFlow,
Caffe, Chainer, MXNet, and CNTK, to name a few. But PyTorch is used here due to its
simplicity of use and dynamic graph definition.

Compute Unified Device Architecture (CUDA)


This is for running the code on the Nvidia GPU of my system. CUDA is a parallel computing
platform and programming model that makes using a GPU for general purpose computing
simple and elegant.

Understanding the Gym Interface

After importing Gym, we make an environment:

env = gym.make("ENVIRONMENT_NAME")
We get the first observation from the environment by calling env.reset(). Let's store the
observation in a variable named obs using the following line of code:

obs = env.reset()
Our task is to design the Algorithm that is responsible for taking the action.

Once the action to be taken is decided, we send it to the environment (second arrow in
the diagram) using the env.step() method, which will return four values in this order:
next_state, reward, done, and info:

1) The next_state is the environment's observation that results from applying the action;
this is the state the agent will act on at the next step.
2) The reward (third arrow in the diagram) is returned by the environment.
3) The done variable is a Boolean (true or false), which gets a value of true if the episode
has terminated/finished (therefore, it is time to reset the environment) and false
otherwise.
4) The info variable returned is an optional variable, which some environments may
return with some additional information. (We won't be using this.)

This is how the algorithm looks, qualitatively speaking-

import gym
env = gym.make("ENVIRONMENT_NAME")
obs = env.reset()  # The first arrow in the picture
# Inner loop (roll out)
action = agent.choose_action(obs)  # The second arrow in the picture
next_state, reward, done, info = env.step(action)  # The third arrow (and more)
obs = next_state
# Repeat Inner loop (roll out)

Spaces in The Gym Environment
Each environment in the Gym is different. Every game environment under the Atari
category is also different from the others.
The mathematical definitions of the allowed observations and actions for a given
environment are known as spaces.
Different Categories of Spaces Include:

- Box
- Discrete
- MultiBinary
- Dict
- MultiDiscrete
- Tuple

The Mountain Car Problem:

In the Mountain Car Gym environment, a car is on a one-dimensional track, positioned
between two mountains. The goal is to drive the car up the mountain on the right;
however, the car's engine is not strong enough to drive up the mountain even at the
maximum speed. Therefore, the only way to succeed is to drive back and forth to build up
momentum. In short, the Mountain Car problem is to get an under-powered car to the top
of a hill.
SOLVING THE MOUNTAIN CAR PROBLEM
After looking at the environment, we get to know that the observation space is a
two-dimensional box and the action space is discrete with three possible actions.

The state and action space type, description, and range of allowed values are summarized
in the following table for your reference:
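The table image is missing here; as a rough substitute (my own snippet, not from the report),
the spaces can be printed directly. The values in the comments are what I recall for
MountainCar-v0 and should be checked against the actual output:

import gym

env = gym.make("MountainCar-v0")
print(env.observation_space)       # Box(2,): [car position, car velocity]
print(env.observation_space.low)   # roughly [-1.2, -0.07]
print(env.observation_space.high)  # roughly [ 0.6,  0.07]
print(env.action_space)            # Discrete(3): 0 = push left, 1 = no push, 2 = push right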

METHOD 1: TAKING RANDOM STEPS
We first interact with the MountainCar-v0 environment from the OpenAI Gym library by
taking random actions at every step; the Q-learning agent, implemented with the NumPy
library, follows in Method 2.

Next we set MAX_STEPS_PER_EPISODE. This is the number of steps or actions that the
agent can take before the episode ends. This may be useful in continuing, perpetual, or
looping environments, where the environment itself does not end the episode.

The code then becomes the following:

#!/usr/bin/env python
import gym
env = gym.make("MountainCar-v0")
MAX_NUM_EPISODES = 5000

for episode in range(MAX_NUM_EPISODES):
    done = False
    obs = env.reset()
    total_reward = 0.0  # To keep track of the total reward obtained in each episode
    step = 0
    while not done:
        env.render()
        action = env.action_space.sample()  # Sample random action. This will be replaced
        # by our agent's action when we start developing the agent algorithms
        next_state, reward, done, info = env.step(action)  # Send the action to the environment
        # and receive the next_state, reward and whether done or not
        total_reward += reward
        step += 1
        obs = next_state
    print("\n Episode #{} ended in {} steps. total_reward={}".format(
        episode, step + 1, total_reward))
env.close()

If you run the preceding script, you will see the Mountain Car environment come up in a new
window and the car moving left and right randomly for 5,000 episodes. You will also see the
episode number, steps taken, and the total reward obtained printed at the end of every
episode.

METHOD 2: IMPLEMENTING Q-LEARNING ALGORITHM
Before our Q_Learner class declaration, we initialize a few useful hyperparameters.

- EPSILON_MIN: This is the minimum value of epsilon that we want the
agent to use while following an epsilon-greedy policy.
- MAX_NUM_EPISODES: The maximum number of episodes that we want the
agent to interact with the environment for.
- STEPS_PER_EPISODE: This is the number of steps in each episode. This could
be the maximum number of steps that an environment will allow per episode or a
custom value that we want to limit based on some time budget.
- ALPHA: This is the learning rate that we want the agent to use. This is the alpha in
the Q-learning update equation listed in the previous section.
- GAMMA: This is the discount factor that the agent will use to factor in future
rewards.
- NUM_DISCRETE_BINS: This is the number of bins of values that the state
space will be discretized into. For the Mountain Car environment, we will be
discretizing the state space into 30 bins.

This is done in the following code snippet:

EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps
ALPHA = 0.05 # Learning rate
GAMMA = 0.98 # Discount factor
NUM_DISCRETE_BINS = 30  # Number of bins to Discretize each observation dim

The __init__(self, env) function takes the environment instance, env, as an input
argument and initializes the dimensions/shape of the observation space and the action
space, and also determines the parameters to discretize the observation space based on
the NUM_DISCRETE_BINS set. The __init__(self, env) function also initializes the Q
function as a NumPy array, based on the shape of the discretized observation space and
the action space dimensions.

THIS IS HOW THE COMPLETE CODE LOOKS-

import gym
import numpy as np

MAX_NUM_EPISODES = 50000
STEPS_PER_EPISODE = 200 # This is specific to MountainCar. May change with env
EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps
ALPHA = 0.05 # Learning rate
GAMMA = 0.98 # Discount factor
NUM_DISCRETE_BINS = 30 # Number of bins to Discretize each observation dim

class Q_Learner(object):
    def __init__(self, env):
        self.obs_shape = env.observation_space.shape
        self.obs_high = env.observation_space.high
        self.obs_low = env.observation_space.low
        self.obs_bins = NUM_DISCRETE_BINS  # Number of bins to Discretize each observation dim
        self.bin_width = (self.obs_high - self.obs_low) / self.obs_bins
        self.action_shape = env.action_space.n
        # Create a multi-dimensional array (aka. Table) to represent the
        # Q-values
        self.Q = np.zeros((self.obs_bins + 1, self.obs_bins + 1,
                           self.action_shape))  # (31 x 31 x 3)
        self.alpha = ALPHA  # Learning rate
        self.gamma = GAMMA  # Discount factor
        self.epsilon = 1.0

    def discretize(self, obs):
        return tuple(((obs - self.obs_low) / self.bin_width).astype(int))

    def get_action(self, obs):
        discretized_obs = self.discretize(obs)
        # Epsilon-Greedy action selection
        if self.epsilon > EPSILON_MIN:
            self.epsilon -= EPSILON_DECAY
        if np.random.random() > self.epsilon:
            return np.argmax(self.Q[discretized_obs])
        else:  # Choose a random action
            return np.random.choice([a for a in range(self.action_shape)])

    def learn(self, obs, action, reward, next_obs):
        discretized_obs = self.discretize(obs)
        discretized_next_obs = self.discretize(next_obs)
        td_target = reward + self.gamma * np.max(self.Q[discretized_next_obs])
        td_error = td_target - self.Q[discretized_obs][action]
        self.Q[discretized_obs][action] += self.alpha * td_error

def train(agent, env):
    best_reward = -float('inf')
    for episode in range(MAX_NUM_EPISODES):
        done = False
        obs = env.reset()
        total_reward = 0.0
        while not done:
            action = agent.get_action(obs)
            next_obs, reward, done, info = env.step(action)
            agent.learn(obs, action, reward, next_obs)
            obs = next_obs
            total_reward += reward
        if total_reward > best_reward:
            best_reward = total_reward
        print("Episode#:{} reward:{} best_reward:{} eps:{}".format(
            episode, total_reward, best_reward, agent.epsilon))
    # Return the trained policy
    return np.argmax(agent.Q, axis=2)

def test(agent, env, policy):
    done = False
    obs = env.reset()
    total_reward = 0.0
    while not done:
        action = policy[agent.discretize(obs)]
        next_obs, reward, done, info = env.step(action)
        obs = next_obs
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    env = gym.make('MountainCar-v0')
    agent = Q_Learner(env)
    learned_policy = train(agent, env)
    # Use the Gym Monitor wrapper to evaluate the agent and record video
    gym_monitor_path = "./gym_monitor_output"
    env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
    for _ in range(1000):
        test(agent, env, learned_policy)
    env.close()

THE SIMULATION UNDERWAY

Bibliography:
• Books
- Artificial Intelligence: A Modern Approach, 3rd ed., by Stuart Russell & Peter Norvig
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G.
Barto
- Hands-On Intelligent Agents with OpenAI Gym by Praveen Palanisamy

• Online Courses-
- https://www.cse.iitb.ac.in/~shivaram/teaching/cs337+335-s2019/
- https://www.coursera.org/learn/neural-networks-deep-learning/home/welcome
- https://www.coursera.org/learn/practical-rl/home/week/3
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
