
B.Tech. Project
on
Multi-agent Reinforcement Learning Algorithm

Project code: I-03

Submitted by
Diptanshu Rajan (Entry No: 2017ME10574)
Jayaditya Gupta (Entry No: 2017ME10579)

Supervised by Prof. Arnob Ghosh, Prof. Varun Ramamohan

Mechanical Engineering Department


Indian Institute of Technology Delhi
January 2021

Acknowledgements

First and foremost, we would like to express our deep and sincere gratitude to our project supervisors, Prof. Arnob Ghosh and Prof. Varun Ramamohan, for providing us this great opportunity and invaluable guidance throughout the project. We cannot express enough thanks to our supervisors for their continued help, support, motivation and encouragement, and we offer our sincere appreciation for the great learning experiences provided by them.

We would also like to thank the committee members of the B.Tech. Project course for always encouraging and motivating us to work on this project and providing valuable suggestions from time to time.

We are also grateful to our parents for their love, care and prayers. We would like to thank each person who helped us to complete this project.

Abstract

Multi-agent reinforcement learning has made tremendous progress in recent years, achieving success on many of the sequential decision-making challenges in machine learning, but it still remains a difficult problem.
Several of RL's well-known and successful applications, such as poker and Go, drones and autonomous driving, require more than one agent to be involved, which places them in the multi-agent RL (MARL) domain, an area that has gained prominence with the latest developments in single-agent RL methods.
In this report, we review the fundamentals of the Markov Decision Process (MDP) formulation of RL and concentrate primarily on policy gradient methods.
The report presents an algorithm for sequential two-agent RL on the CartPole environment and its outcomes. Further, an improvement of this algorithm is also discussed, in which each agent also takes the other agent's action into account when choosing its own action. The environment setting is cooperative, and the algorithm is tested for different values of parameters such as the discount factor and the learning rate of the neural network used for optimization.

Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures
Chapters
    1. Introduction
    2. Literature Survey
    3. Project Objectives and Work Plan
    4. Work Progress
    5. Conclusion
References
Originality Report

List of Tables

Table 4.1 Comparison of different learning rate and discount factor values

List of Figures

Fig 2.1 Canonical Feedback Loop
Fig 4.1 Cartpole Environment
Fig 4.2 Training loop for sequential process code snippet
Fig 4.3 Implementation loop for sequential process
Fig 4.4 Policy gradient for 2 policies in a single call code snippet
Fig 4.5 Final Implementation loop

1. Introduction

In the paradigm of reinforcement learning, an individual agent interacts with an environment, moving through a state space, and tries to learn a policy, which is a mapping from states to probability distributions over actions, that improves the accumulated reward. When the agent chooses a given action in a specific state, a reward is generated and a random transition to a new state occurs according to a probability distribution that depends only on the current state and action, i.e., according to Markovian state transitions.

Multi-agent reinforcement learning (MARL) comprises a system in which many agents act in a shared and unknown setting. MARL is distinct from traditional single-agent RL due to the presence of other agents. When we view the MARL setup from a specific agent's point of view, all other agents are part of the environment. Because all of these agents are typically learning and changing their policies, the environment faced by a specific agent changes over time. Conventional single-agent RL algorithms cannot be used in MARL because of this non-stationarity of the world.

Based on the kinds of environments they handle, multi-agent RL algorithms are divided into three main groups: fully cooperative, fully competitive, and a combination of the two. In the cooperative setting, agents coordinate to maximize a mutual long-term reward; in the competitive setting, the sum of the agents' rewards is generally zero (a zero-sum game); the combined setting, with general-sum returns, involves both cooperative and competitive agents.

As previously mentioned, it is hard to apply value-based reinforcement learning algorithms such as Q-learning to multiple agents, since value-based approaches assume a stationary environment. To address this problem, we use policy gradient methods, as these are quite useful approaches to reinforcement learning in multi-agent systems. Using these techniques, for instance through autonomous decentralized control, a decision problem in a multi-agent system can be split into a collection of independent decision problems, one for each agent. These approaches use stochastic policies that are parameterized: they search the parameterized policy space of interest directly, changing the parameters in the direction of the policy gradient. The parameters are modified stochastically so as to maximize the expected reward.

In Section 4.2, we implement an algorithm for a setting consisting of two agents in a cooperative environment (CartPole), in which both agents take their actions sequentially. In the same section, an improved algorithm is also discussed, which uses a total of four policies so that each of the two agents can also account for the actions of the other. Further, results are presented for different values of the parameters, and Section 5 presents the conclusion and discussion.

2. Literature Survey

In reinforcement learning, the primary goal is to optimize a long-term reward or return. This setup includes an agent which interacts with the environment and receives a reward at discrete time steps for its actions, which ultimately move the agent into a new state. A conventional feedback loop consisting of agent and environment is shown below:

Fig 2.1: Canonical Feedback Loop

Definition: “A Markov Decision Process (MDP) is defined by a tuple (S, A, P, R, γ), where S and A denote the state and action spaces, respectively; P : S × A → ∆(S) denotes the transition probability from any state s ∈ S to any state s′ ∈ S for any given action a ∈ A; R : S × A × S → ℝ is the reward function that determines the immediate reward received by the agent for a transition from (s, a) to s′; γ ∈ [0, 1] is the discount factor that trades off the instantaneous and future rewards.” [1]

Using the discount factor γ, the agent attempts to choose actions such that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses A_t to optimize the expected discounted return:
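In the standard notation of [1], with R_{t+k+1} denoting the reward received k+1 steps after time t:

G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}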

For γ = 0 the agent is only concerned with immediate rewards, while for γ = 1 the agent is farsighted and takes possible future rewards fully into account.

Definition: “A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy π at time t, then π(a|s) is the probability that A_t = a if S_t = s.” [1]

The Reward Hypothesis, which states in summary that a single scalar called the reward can capture all of an agent's goals and purposes, provides a significant amount of the theory behind RL.

The Reward Hypothesis:
All of what we mean by objectives and intentions can be well interpreted as maximizing the expected value of the cumulative sum of a received scalar signal (called reward).

This leads to a sequence of states, actions and rewards known as a trajectory, and the goal is to maximize the total reward collected along it.

Policy Gradients
When following a policy π, the goal of an RL agent is to maximize the "expected" reward. If we define the total reward for a given trajectory τ as r(τ), we arrive at the following objective:
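In standard notation, with trajectories τ sampled by following the parameterized policy π_θ:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\, r(\tau) \,]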

In gradient ascent, we repeatedly update the parameters using the following update rule:
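With α denoting the step size (learning rate), the standard gradient-ascent update is:

\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k)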

Now, we rearrange the gradient by expanding the expectation term:
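Using the standard log-derivative trick, \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau), this expansion takes the form:

\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[\, r(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \,\big]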

The Policy Gradient Theorem: “The derivative of the expected reward is the
expectation of the product of the reward and gradient of the log of the policy π_θ.”

Using the policy π_θ and the other parameters of the environment, an action is taken at each step to determine the next state, and the resulting terms are multiplied over the T time steps that make up the trajectory.
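In standard notation, the probability of a trajectory of length T therefore factorizes as:

\pi_\theta(\tau) = P(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)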

Hence, the REINFORCE algorithm defines the policy gradient as below:
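In the standard form, with G_t denoting the discounted return from time step t:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]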

In essence, we are doing the following:

1. Start out with an arbitrary random policy
2. Sample some actions in the environment
3. If rewards are better than expected, increase the probability of taking those actions
4. If rewards are worse, decrease that probability

3. Project Objectives and Work Plan

3.1 Problem Definition/Motivation

Significant research has been conducted in the area of reinforcement learning (RL). But a major part of the research done in this area, surprisingly, is focused on single-agent RL.

Recent successful innovations in this area include, for example, the deployment of UAVs (Unmanned Aerial Vehicles) to execute a cooperative mission, learning how to play board games such as Go, and video strategy games such as Dota 2. Multi-agent RL thus finds use in various technologies. Because of several challenges in MARL, such as environmental non-stationarity and the issue of scalability, its deployment is complex, and this primarily motivates us to address the problem effectively. For this, we plan to use the non-value-based approaches that are addressed further in this report, such as policy gradient (PG) methods.

3.2 Objectives of the work

● Implement a sequential two-agent reinforcement learning algorithm using policy gradient methods.
● Implement the algorithm for a fully cooperative setting such that each agent also considers the other agent's actions when updating its policy.
● Test the algorithm in different settings for different values of the variable parameters and analyse the convergence of the model.

3.3 Methodology

We began the project by reviewing the RL algorithm literature, from basic single-agent RL algorithms to complex multi-agent RL algorithms, various environment settings, Q-learning methods, PG methods, etc.

The neural network used for the implementation of the sequential two-agent RL algorithm has 2 hidden layers with 30 nodes each, with ReLU as the activation function, and a softmax output layer. We used the Adam optimizer with a learning rate of 0.001 for neural network training. The discount factor γ used to compute the returns is 1.
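For reference, a minimal sketch of such a policy network in TensorFlow/Keras is given below; the function name build_policy_network and the CartPole dimensions (a 4-dimensional state and 2 discrete actions) are illustrative assumptions, not taken from the project code.

import tensorflow as tf

# Policy network as described above: two hidden layers of 30 ReLU units
# and a softmax output over the available actions.
def build_policy_network(state_dim=4, num_actions=2):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(30, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="softmax"),
    ])
    return model

policy = build_policy_network()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)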

In the second algorithm, which uses 4 policies for cross-assessment, all the other settings remain the same, and the model is run for different values of the discount factor and learning rate for both agents. Results are obtained to observe the effect on convergence.

4. Work Progress

4.1 Description of the Problem / Case Study

We first implement this multi-agent reinforcement learning model on the CartPole environment.

This is essentially a typical environment for 2 agents, which use the state of the current environment to take actions one after the other. We keep a common rewards list that accounts for the agents' collective effort.

CartPole:
A pole is attached to a cart by an un-actuated joint, and the cart travels along a frictionless track. The system is controlled by applying a force of +1 or −1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is given for every time step that the pole remains upright. The episode ends if the pole deviates more than 15 degrees from vertical or the cart moves more than 2.4 units from the centre.

Fig 4.1: Cartpole Environment

4.2 Work Done

1. Sequential Multi Agent: 2 Policies - Self Assessment

This was the initial idea, in which both agents had their separate policies, actions were taken one after the other, and both agents shared a common reward list. The implementation consists of two major parts.

a. Training Loop

Fig 4.2: Training loop for sequential process code snippet

The above snippet is an implementation of the literature that we reviewed. We maintain States, Rewards and Actions arrays for the agent. The loop first computes the discounted rewards (G_t) and stores them. Next, it computes the loss as the log-probability log π(a|s) of the chosen action (using 'tfp'), multiplied by G_t. It then computes the gradient of this loss and finally applies this gradient to the model, which is a neural network.
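Since the original code is shown only as a figure, the following is a minimal sketch of such a training step under the assumptions of Section 3.3 (a Keras policy network and the Adam optimizer); the function and variable names are illustrative and the actual snippet in Fig. 4.2 may differ (in particular, the sketch minimizes the negative log-probability weighted by G_t).

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

def train_step(model, optimizer, states, actions, rewards, gamma=1.0):
    # 1. Compute the discounted returns G_t for the stored rewards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)

    # 2. Policy-gradient loss: -log pi(a|s) * G_t, summed over the episode.
    with tf.GradientTape() as tape:
        loss = 0.0
        for state, action, g_t in zip(states, actions, returns):
            probs = model(np.array([state], dtype=np.float32), training=True)
            dist = tfp.distributions.Categorical(probs=probs[0])
            loss += -dist.log_prob(action) * g_t

    # 3. Apply the gradient of the loss to the network parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))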

b. Implementation Loop

Fig 4.3: Implementation loop for sequential process

This is the chain in which the whole process is completed and the training loop is called; it implements the sequential calling of actions. We create the 2 agents at the start. For a fixed number of steps we maintain a cumulative reward list and separate action and state lists for each agent. We start by getting an action for agent 1 for the given state, pass it to the environment, and get the next state and reward. We use this next state to find the action of agent 2, obtain the next state and reward for the 2nd agent, accumulate the rewards, train the agents based on the reward list, and repeat.
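A minimal sketch of such a loop is given below, assuming the classic gym API and agents that expose act(state) and train(states, actions, rewards) methods built around the training step above; the handling of the shared reward list is one possible interpretation of the description, and the actual loop in Fig. 4.3 may differ.

import gym

# agent1 and agent2 are assumed to be policy-gradient agents wrapping the
# network and training step sketched earlier.
env = gym.make("CartPole-v0")

for episode in range(1000):
    state = env.reset()
    states1, actions1, states2, actions2, rewards = [], [], [], [], []
    done = False
    while not done:
        # Agent 1 acts first on the current state.
        action1 = agent1.act(state)
        mid_state, reward1, done, _ = env.step(action1)
        states1.append(state)
        actions1.append(action1)
        if done:
            rewards.append(reward1)
            break
        # Agent 2 acts on the state produced by agent 1's action.
        action2 = agent2.act(mid_state)
        state, reward2, done, _ = env.step(action2)
        states2.append(mid_state)
        actions2.append(action2)
        # One entry in the shared (cumulative) reward list per joint step.
        rewards.append(reward1 + reward2)
    # Both agents are trained on the common reward list.
    agent1.train(states1, actions1, rewards)
    agent2.train(states2, actions2, rewards)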

Result: We found that both models seem to converge after ~200 steps for a learning rate of 0.001 and γ = 1, but the reward still dips to lower values at some points.

2. Sequential Multi Agent: 4 Policies - Cross Assessment

This is an improvement on the initial idea: earlier, each agent's policy was formulated using only its own states and actions, whereas now the state-action pairs of both agents are used to formulate the policies for each agent.

a. Training Loop

Let π_ij be the policy generated for agent i using the state-action pairs of agent j.

For instance, to get an action for a state for agent 1, we first create π_12 for agent 1 using the state-action pairs and the reward list of agent 2. Next, using this π_12, we sample an action given the state of agent 1; call it temp_action. We then formulate π_11 using this temp_action and the state of agent 1 and apply the policy gradient. This completes the training loop.

The snippet below (Fig. 4.4) is an improvement over the initial one; in essence, it implements the policy gradient for 2 policies in a single training call: for agent 1 it updates π_12 and π_11, and also generates a sample action from π_12.

\nabla \, \mathbb{E}_{\pi_{12}}[\, r(\tau) \,] = \mathbb{E}_{\pi_{12}}\Big[ \sum_t G_{12,t}\, \nabla \log \pi_{12}(a_{2,t} \mid s_{2,t}) \Big]

In the above equation, t indexes the time steps; G_{12,t} is the discounted reward of Agent 2 computed using the γ of Agent 1, a_{2,t} is the action of Agent 2, and s_{2,t} is the state of Agent 2. For π_21, swap Agents 1 and 2.

\nabla \, \mathbb{E}_{\pi_{11}}[\, r(\tau) \,] = \mathbb{E}_{\pi_{11}}\Big[ \sum_t G_{11,t}\, \nabla \log \pi_{11}(a_{x,t} \mid s_{1,t}) \Big]

The above equation is similar to the previous one in terms of variables, with the agents swapped where appropriate; a_{x,t} is the action generated when the neural network of π_12 is given the state of Agent 1.

Fig 4.4: Policy gradient for 2 policies in a single call code snippet

There are 5 major steps in the training loop for this model (assume we are training Agent 1); a code sketch is given after the note below:
1. Finding the discounted rewards list: The loop indicated as ‘1’ calculates the discounted rewards for Agent 2 and stores them; for this it uses Agent 1's γ.
2. Forming the policy π_12: The 2nd loop formulates the policy π_12 using the action, state and reward lists of Agent 2 via the standard policy gradient mechanism, and stores it in a separate neural network which uses the γ and learning rate of Agent 1.
3. Sampling an action given policy π_12: In this step Agent 1 essentially estimates what Agent 2 would have done in the given situation. For that it uses the policy π_12 created in step 2, samples an action given the current state of Agent 1, and stores it.
4. Finding the discounted rewards list: The loop indicated as ‘4’ calculates the discounted rewards for Agent 1 and stores them; for this it also uses Agent 1's γ.
5. Forming the policy π_11: This loop works similarly to the 2nd loop, with the only difference being that this time it accounts for the actions that were generated from π_12 and stored in step 3. It uses those actions combined with Agent 1's states and rewards, formulates the policy using the standard policy gradient mechanism, and finally applies these gradients to itself, so that whenever an action is requested from Agent 1 it is sampled from this final policy π_11.

Note: For Agent 2, i.e., for the formation of π_21 and π_22, the implementation above is the same, but Agent 1 and Agent 2 are swapped when calling the function.
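The following is a minimal sketch of this training call for Agent 1, assuming π_12 and π_11 are separate Keras networks built as in Section 3.3 with Agent 1's learning rate; the helper and variable names are illustrative and the actual code in Fig. 4.4 may differ.

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

def discounted_returns(rewards, gamma):
    # Discounted return G_t for every time step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def policy_gradient_step(network, optimizer, states, actions, returns):
    # One REINFORCE-style update: minimize -log pi(a|s) * G_t over the episode.
    with tf.GradientTape() as tape:
        loss = 0.0
        for s, a, g_t in zip(states, actions, returns):
            probs = network(np.array([s], dtype=np.float32), training=True)
            dist = tfp.distributions.Categorical(probs=probs[0])
            loss += -dist.log_prob(a) * g_t
    grads = tape.gradient(loss, network.trainable_variables)
    optimizer.apply_gradients(zip(grads, network.trainable_variables))

def train_agent1(pi12, pi11, opt12, opt11, gamma1,
                 states1, rewards1, states2, actions2, rewards2):
    # Steps 1-2: form pi12 from Agent 2's data, discounted with Agent 1's gamma.
    g12 = discounted_returns(rewards2, gamma1)
    policy_gradient_step(pi12, opt12, states2, actions2, g12)

    # Step 3: sample what Agent 2 "would have done" in each of Agent 1's states.
    sampled_actions = []
    for s in states1:
        probs = pi12(np.array([s], dtype=np.float32))
        dist = tfp.distributions.Categorical(probs=probs[0])
        sampled_actions.append(int(dist.sample().numpy()))

    # Steps 4-5: form pi11 from Agent 1's states/rewards and the sampled actions.
    g11 = discounted_returns(rewards1, gamma1)
    policy_gradient_step(pi11, opt11, states1, sampled_actions, g11)

# For Agent 2 (pi21 and pi22), the same function is called with the roles
# of Agent 1 and Agent 2 swapped.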

b. Implementation Loop

This implementation again expresses the sequence of calls and the way in which the process is executed. In essence, it is the previous implementation with a slight change in the way the functions are called: whenever we call the training loop for an agent, we first separate the reward list and pass the state, action and reward lists of both agents for each loop.

Fig 4.5: Final Implementation loop

We tested this algorithm on various settings which are indicated as follows:

Learning Rate A1 (𝛼1) | Learning Rate A2 (𝛼2) | Discount Factor A1 (ɣ1) | Discount Factor A2 (ɣ2) | Steps to reach 1st max | Steps to reach max consistently | Total steps
0.001 | 0.002 | 1 | 1 | 620 | ~1500 | 2500
0.001 | 0.002 | 0.5 | 0.5 | - | - | 2500
0.001 | 0.001 | 1 | 1 | 750 | ~1200 | 3000
0.001 | 0.001 | 1 | 0.5 | 1000 | ~1300 | 3000
0.002 | 0.002 | 1 | 1 | 800 | ~1700 | 3000
0.01 | 0.01 | 1 | 1 | - | - | 3000
0.001 | 0.001 | 0.5 | 0.5 | - | - | 4000

Table 4.1: Comparison of different learning rate and discount factor values

1. Learning rate (𝛼): Comparing the learning rates across settings, convergence was better overall when both agents had the same learning rate of 0.001 than when both agents had twice that learning rate (0.002), both in the number of steps to reach the first maximum and in the number of steps to obtain the maximum reward consistently.

Next, comparing the setting where both agents have 𝛼 = 0.001 with a new setting in which the 𝛼 of one of the agents is changed to 0.002, the mixed setting performed better than the case where both agents have 𝛼 = 0.002. Compared with the case of both agents having 𝛼 = 0.001, the mixed setting reached the first maximum in fewer steps, but overall it took more steps for the algorithm to converge consistently.

2. Discount factor (ɣ): The behaviour with ɣ is also in line with the trend seen for 𝛼. When both agents have ɣ = 0.5, convergence is very slow compared to both agents having ɣ = 1, and the setting ɣ1 = 1, ɣ2 = 0.5 lies in between these two.

5. Conclusion

We started by implementing a single-agent RL model with policy gradients and extended it to a 2-agent RL model that acts in a sequential manner, which seems to work for the given environment. The first model accounted only for each agent's own actions and states, without accounting for the other agent's behaviour. The next model was an improvement on the first setting: it accounts for the behaviour of the other agent as well and works in a 2-agent RL setting with 4 policies. On varying the parameters of the model, we saw that the factors governing the policies are the learning rate and the γ of the network.

The convergence is better when the learning rate is smaller for both agents, and the model seems to be optimized better.

The convergence is also better when γ is closer to 1 for both agents, and becomes less optimized as γ moves away from 1 towards zero.

References

1. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (2nd edition).
2. R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (2018).
3. K. Zhang, A. Koppel, H. Zhu, T. Basar, Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies.
4. J. Subramanian, A. Mahajan, Reinforcement Learning in Stationary Mean-field Games.
5. M. Prajapat, K. Azizzadenesheli, A. Liniger, Y. Yue, A. Anandkumar, Competitive Policy Optimization.

Originality Report

