AS02
General Instructions:
Max Marks: 30
All questions are compulsory.
1. Each topic carries 30 marks.
This means that given the current state s_t and action a_t, the next
state s_{t+1} is conditionally independent of all previous states and
actions.
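In symbols, this conditional independence says the transition probability depends only on the current state and action:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)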
Implications of the Markov Property
State Representation:
The state s_t must capture all relevant information from the history
of the process up to time t. If s_t is truly Markovian, then no
additional information from the past is needed to predict the future.
Simplified Modeling:
The Markov Property allows for a simplified representation of the
environment, reducing the complexity of modeling the dynamics of
the system.
Transition dynamics can be described by a transition probability
matrix P(s′ | s, a), which specifies the probability of transitioning to
state s′ from state s when action a is taken.
Actions (A): A finite or infinite set of actions the agent can take.
Objective: maximize immediate reward vs. maximize cumulative (discounted) reward.
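For concreteness, the algorithms below can be illustrated on a small MDP stored in arrays. The following sketch is only one convenient, assumed layout (a transition tensor P[s, a, s′], a reward tensor R[s, a, s′], and a discount γ); the state/action counts and numbers are made up for illustration:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (values invented for the sketch).
# P[s, a, s'] = probability of moving to s' from s under action a.
nS, nA = 2, 2
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# R[s, a, s'] = reward received on the transition (s, a) -> s'.
R = np.zeros((nS, nA, nS))
R[:, :, 1] = 1.0   # e.g., entering state 1 pays a reward of 1

gamma = 0.9        # discount factor
```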
Start with an arbitrary initial value function V(s) for all states s. This
can be zero or any other guess.
Iterative Update:
Repeatedly apply the Bellman update for the fixed policy π:
V^π(s) ← Σ_{s′} P(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]
Equivalently, in matrix form the evaluation equation is V^π = R^π + γ P^π V^π,
where R^π is the reward vector and P^π is the transition matrix
under policy π. This can be solved directly as:
V^π = (I − γ P^π)^{-1} R^π
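Assuming the array layout sketched above, this closed-form evaluation amounts to a single linear solve:

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Closed-form policy evaluation: V = (I - gamma * P_pi)^(-1) R_pi.

    P, R: tensors of shape (nS, nA, nS); policy: array of length nS
    giving the action taken in each state. Layout is an assumption.
    """
    nS = P.shape[0]
    # Transition matrix and expected immediate reward under the fixed policy
    P_pi = P[np.arange(nS), policy]                       # shape (nS, nS)
    R_pi = (P_pi * R[np.arange(nS), policy]).sum(axis=1)  # shape (nS,)
    # Solve the linear system (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
```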
For each state s, find the action a that maximizes the expected
return, given the current value function V^π:
π′(s) = argmax_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]
This step generates a new policy π′ that is greedy with respect to
V^π.
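A minimal sketch of this greedy improvement step, under the same assumed array layout:

```python
import numpy as np

def improve_policy(P, R, V, gamma=0.9):
    """Greedy policy improvement with respect to a value function V.

    P, R: tensors of shape (nS, nA, nS); V: state values of shape (nS,).
    """
    # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q.argmax(axis=1)  # pi'(s) = argmax_a Q[s, a]
```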
Greedy Policy:
The new policy π′ is constructed such that for every state s, the
chosen action maximizes the expected return.
Policy Improvement Theorem:
The policy improvement theorem guarantees that the new policy π′
will be at least as good as the current policy π. Formally:
V^π′(s) ≥ V^π(s) for all s ∈ S
Improve the current policy π using the value function V^π to obtain a
new policy π′.
Convergence Check:
Repeat the policy evaluation and improvement steps until the policy
converges (i.e., the policy does not change between iterations).
Convergence and Efficiency
Convergence: Policy iteration is guaranteed to converge to the optimal
policy in a finite number of iterations because each iteration strictly
improves the policy or leaves it unchanged if it is already optimal.
Efficiency: Each iteration requires a full policy evaluation, which can be costly for large state spaces, but the number of iterations needed is typically small.
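Putting evaluation and improvement together, a compact policy iteration loop might look like the following sketch (the array layout and function name are illustrative, not prescribed by the assignment):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iter=1000):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    nS = P.shape[0]
    policy = np.zeros(nS, dtype=int)  # arbitrary initial policy
    for _ in range(max_iter):
        # Policy evaluation: V = (I - gamma * P_pi)^(-1) R_pi
        P_pi = P[np.arange(nS), policy]
        R_pi = (P_pi * R[np.arange(nS), policy]).sum(axis=1)
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to V
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # convergence check
            break
        policy = new_policy
    return policy, V
```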
The return G_t is the total discounted reward received from time step
t onwards:
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + …
Here, γ is the discount factor.
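As an illustration, the returns for every time step of an episode can be computed with a single backward pass over the reward sequence:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... for each step of an episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: three rewards with gamma = 1 gives G_0 = r1 + r2 + r3
print(discounted_returns([1.0, 0.0, 5.0], gamma=1.0))  # [6.0, 5.0, 5.0]
```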
Episodes:
Monte Carlo methods learn from complete episodes of experience, i.e., sequences of states, actions, and rewards that end in a terminal state.
First-Visit MC:
Estimates the value of an action based on the first time the action is
taken in a state within an episode.
For a given state-action pair (s, a), the return is averaged over all
first occurrences of (s, a) across episodes.
Every-Visit MC:
Estimates the value of an action by averaging the returns following every occurrence of (s, a), both within and across episodes.
Initialize the action value function Q(s, a) arbitrarily for all state-
action pairs.
Initialize a counter N(s, a) for the number of times each state-
action pair is visited.
Generating Episodes:
Generate episodes by following the current policy, recording the sequence of states, actions, and rewards until a terminal state is reached.
Calculating Returns:
For each state-action pair (s, a) encountered in the episode,
calculate the return G_t from the first or every occurrence (depending
on the method used) of (s, a) in the episode.
Updating Q-values:
First-Visit MC:
For each state-action pair (s, a), if this is the first occurrence in the
episode:
Q(s, a) ← Q(s, a) + (1 / N(s, a)) (G_t − Q(s, a))
Every-Visit MC:
For each state-action pair (s, a):
Q(s, a) ← Q(s, a) + (1 / N(s, a)) (G_t − Q(s, a))
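The update rule above translates into a short first-visit MC estimator. The sketch below assumes episodes are given as lists of (state, action, reward) tuples, which is an illustrative format rather than anything fixed by the assignment:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimation of Q(s, a) from complete episodes.

    episodes: list of episodes, each a list of (state, action, reward) tuples,
    where the reward is the one received after taking the action.
    """
    Q = defaultdict(float)   # action-value estimates
    N = defaultdict(int)     # visit counts per (state, action) pair
    for episode in episodes:
        # Record the first occurrence index of each (state, action) pair
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Backward pass accumulating returns: G_t = r_{t+1} + gamma * G_{t+1}
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:          # update on first visits only
                N[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q
```

Switching to every-visit MC only requires removing the first-visit check so that every occurrence of (s, a) triggers an update.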
Detailed Example
Consider a simple grid world where an agent starts in a particular
state, takes actions (up, down, left, right), and receives rewards based
on the resulting state. Let's apply MC estimation of action values:
Initialization:
Set Q(s, a) = 0 and N(s, a) = 0 for all state-action pairs.
Generating an Episode:
Suppose the agent generates one episode: s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ending in a terminal state, and take γ = 1 for simplicity.
Calculating Returns:
For (s_0, a_0):
G_0 = r_1 + r_2 + r_3
For (s_1, a_1):
G_1 = r_2 + r_3
For (s_2, a_2):
G_2 = r_3
Updating Q-values:
First-Visit MC:
Update Q(s_0, a_0) using the return from the first occurrence:
Q(s_0, a_0) ← Q(s_0, a_0) + (1 / N(s_0, a_0)) (G_0 − Q(s_0, a_0))
Every-Visit MC:
Update Q(s_1, a_1) and Q(s_2, a_2) similarly.
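Reusing the first_visit_mc sketch from above, the worked episode (with hypothetical state names and reward values) would give Q estimates equal to the returns, since each pair is visited exactly once:

```python
# Hypothetical episode matching the worked example, with gamma = 1
episode = [("s0", "up", 1.0), ("s1", "right", 0.0), ("s2", "right", 5.0)]
Q = first_visit_mc([episode], gamma=1.0)
print(Q[("s0", "up")])  # G_0 = r1 + r2 + r3 = 6.0
```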
Advantages and Limitations
Advantages
Model-Free:
Monte Carlo methods do not require a model of the environment's transition dynamics or rewards; they learn directly from sampled episodes.
Limitations
High Variance:
The estimates can have high variance because they rely on the returns
from sampled episodes, which can vary significantly.
Slow Convergence: