AS02
General Instructions:
Max Marks: 30
All questions are compulsory.
1. Each topic carries 30 marks.
This means that given the current state s_t and action a_t, the next
state s_{t+1} is conditionally independent of all previous states and
actions.
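In symbols, this conditional independence says the transition probability depends only on the current state and action:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)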
Implications of the Markov Property
State Representation:
The state s_t must capture all relevant information from the history
of the process up to time t. If s_t is truly Markovian, then no
additional information from the past is needed to predict the future.
Simplified Modeling:
The Markov Property allows for a simplified representation of the
environment, reducing the complexity of modeling the dynamics of
the system.
Transition dynamics can be described by a transition probability
matrix P(s′ | s, a), which specifies the probability of transitioning to
state s′ from state s when action a is taken.
Actions (A): A finite or infinite set of actions the agent can take.
Objective: maximize immediate reward vs. maximize cumulative (discounted) reward.
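For concreteness, the algorithms below can be illustrated on a small MDP stored in arrays. The following sketch is only one convenient, assumed layout (a transition tensor P[s, a, s′], a reward tensor R[s, a, s′], and a discount γ); the state/action counts and numbers are made up for illustration:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (values invented for the sketch).
# P[s, a, s'] = probability of moving to s' from s under action a.
nS, nA = 2, 2
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# R[s, a, s'] = reward received on the transition (s, a) -> s'.
R = np.zeros((nS, nA, nS))
R[:, :, 1] = 1.0   # e.g., entering state 1 pays a reward of 1

gamma = 0.9        # discount factor
```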
Start with an arbitrary initial value function V(s) for all states s. This
can be zero or any other guess.
Iterative Update:
Repeatedly apply the Bellman update for the fixed policy π:
V^π(s) ← Σ_{s′} P(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]
Equivalently, in matrix form the evaluation equation is V^π = R^π + γ P^π V^π,
where R^π is the reward vector and P^π is the transition matrix
under policy π. This can be solved directly as:
V^π = (I − γ P^π)^{-1} R^π
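Assuming the array layout sketched above, this closed-form evaluation amounts to a single linear solve:

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Closed-form policy evaluation: V = (I - gamma * P_pi)^(-1) R_pi.

    P, R: tensors of shape (nS, nA, nS); policy: array of length nS
    giving the action taken in each state. Layout is an assumption.
    """
    nS = P.shape[0]
    # Transition matrix and expected immediate reward under the fixed policy
    P_pi = P[np.arange(nS), policy]                       # shape (nS, nS)
    R_pi = (P_pi * R[np.arange(nS), policy]).sum(axis=1)  # shape (nS,)
    # Solve the linear system (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
```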
For each state s, find the action a that maximizes the expected
return, given the current value function V^π:
π′(s) = argmax_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]
This step generates a new policy π′ that is greedy with respect to
V^π.
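A minimal sketch of this greedy improvement step, under the same assumed array layout:

```python
import numpy as np

def improve_policy(P, R, V, gamma=0.9):
    """Greedy policy improvement with respect to a value function V.

    P, R: tensors of shape (nS, nA, nS); V: state values of shape (nS,).
    """
    # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q.argmax(axis=1)  # pi'(s) = argmax_a Q[s, a]
```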
Greedy Policy:
The new policy π′ is constructed such that for every state s, the
chosen action maximizes the expected return.
Policy Improvement Theorem:
The policy improvement theorem guarantees that the new policy π′
will be at least as good as the current policy π. Formally:
V^π′(s) ≥ V^π(s) for all s ∈ S
Improve the current policy π using the value function V^π to obtain a
new policy π′.
Convergence Check:
Repeat the policy evaluation and improvement steps until the policy
converges (i.e., the policy does not change between iterations).
Convergence and Efficiency
Convergence: Policy iteration is guaranteed to converge to the optimal
policy in a finite number of iterations because each iteration strictly
improves the policy or leaves it unchanged if it is already optimal.
Efficiency: Each iteration requires a full policy evaluation, which can be costly for large state spaces, but the number of iterations needed is typically small.
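Putting evaluation and improvement together, a compact policy iteration loop might look like the following sketch (the array layout and function name are illustrative, not prescribed by the assignment):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iter=1000):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    nS = P.shape[0]
    policy = np.zeros(nS, dtype=int)  # arbitrary initial policy
    for _ in range(max_iter):
        # Policy evaluation: V = (I - gamma * P_pi)^(-1) R_pi
        P_pi = P[np.arange(nS), policy]
        R_pi = (P_pi * R[np.arange(nS), policy]).sum(axis=1)
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to V
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # convergence check
            break
        policy = new_policy
    return policy, V
```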
The return G_t is the total discounted reward received from time step
t onwards:
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + …
Here, γ is the discount factor.
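As an illustration, the returns for every time step of an episode can be computed with a single backward pass over the reward sequence:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... for each step of an episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: three rewards with gamma = 1 gives G_0 = r1 + r2 + r3
print(discounted_returns([1.0, 0.0, 5.0], gamma=1.0))  # [6.0, 5.0, 5.0]
```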
Episodes:
Monte Carlo methods learn from complete episodes of experience, i.e., sequences of states, actions, and rewards that end in a terminal state.
First-Visit MC:
Estimates the value of an action based on the first time the action is
taken in a state within an episode.
For a given state-action pair (s, a), the return is averaged over all
first occurrences of (s, a) across episodes.
Every-Visit MC:
Estimates the value of an action by averaging the returns following every occurrence of (s, a), both within and across episodes.
Initialize the action value function Q(s, a) arbitrarily for all state-
action pairs.
Initialize a counter N(s, a) for the number of times each state-
action pair is visited.
Generating Episodes:
Generate episodes by following the current policy, recording the sequence of states, actions, and rewards until a terminal state is reached.
Calculating Returns:
For each state-action pair (s, a) encountered in the episode,
calculate the return G_t from the first or every occurrence (depending
on the method used) of (s, a) in the episode.
Updating Q-values:
First-Visit MC:
For each state-action pair (s, a), if this is the first occurrence in the
episode:
Q(s, a) ← Q(s, a) + (1 / N(s, a)) (G_t − Q(s, a))
Every-Visit MC:
For each state-action pair (s, a):
Q(s, a) ← Q(s, a) + (1 / N(s, a)) (G_t − Q(s, a))
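The update rule above translates into a short first-visit MC estimator. The sketch below assumes episodes are given as lists of (state, action, reward) tuples, which is an illustrative format rather than anything fixed by the assignment:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimation of Q(s, a) from complete episodes.

    episodes: list of episodes, each a list of (state, action, reward) tuples,
    where the reward is the one received after taking the action.
    """
    Q = defaultdict(float)   # action-value estimates
    N = defaultdict(int)     # visit counts per (state, action) pair
    for episode in episodes:
        # Record the first occurrence index of each (state, action) pair
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Backward pass accumulating returns: G_t = r_{t+1} + gamma * G_{t+1}
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:          # update on first visits only
                N[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q
```

Switching to every-visit MC only requires removing the first-visit check so that every occurrence of (s, a) triggers an update.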
Detailed Example
Consider a simple grid world where an agent starts in a particular
state, takes actions (up, down, left, right), and receives rewards based
on the resulting state. Let's apply MC estimation of action values:
Initialization:
Set Q(s, a) = 0 and N(s, a) = 0 for all state-action pairs.
Generating an Episode:
Suppose the agent generates one episode: s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ending in a terminal state, and take γ = 1 for simplicity.
Calculating Returns:
For (s_0, a_0):
G_0 = r_1 + r_2 + r_3
For (s_1, a_1):
G_1 = r_2 + r_3
For (s_2, a_2):
G_2 = r_3
Updating Q-values:
First-Visit MC:
Update Q(s_0, a_0) using the return from the first occurrence:
Q(s_0, a_0) ← Q(s_0, a_0) + (1 / N(s_0, a_0)) (G_0 − Q(s_0, a_0))
Every-Visit MC:
Update Q(s_1, a_1) and Q(s_2, a_2) similarly.
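Reusing the first_visit_mc sketch from above, the worked episode (with hypothetical state names and reward values) would give Q estimates equal to the returns, since each pair is visited exactly once:

```python
# Hypothetical episode matching the worked example, with gamma = 1
episode = [("s0", "up", 1.0), ("s1", "right", 0.0), ("s2", "right", 5.0)]
Q = first_visit_mc([episode], gamma=1.0)
print(Q[("s0", "up")])  # G_0 = r1 + r2 + r3 = 6.0
```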
Advantages and Limitations
Advantages
Model-Free:
Monte Carlo methods do not require a model of the environment's transition dynamics or rewards; they learn directly from sampled episodes.
Limitations
High Variance:
The estimates can have high variance because they rely on the returns
from sampled episodes, which can vary significantly.
Slow Convergence: