Markov Decision Process (MDP)


Markov Decision Process

By Theepana Govintharajah
Markov Decision Process

The Markov decision process (MDP) is a mathematical framework used for modeling decision-making problems where the outcomes are partly random and partly controllable.

It is a framework that can address most reinforcement learning (RL) problems.

Typical Reinforcement Learning cycle

Important terms
Agent:
The agent is the learner in RL: a software program that makes intelligent decisions. It interacts with the environment through actions and receives rewards based on those actions.
For example, a robot that is being trained to move around a house without crashing.

Environment:
The surroundings with which the agent interacts. The agent cannot manipulate the environment.
For example, the house where the robot moves.

Important terms
State:
This is the position of the agent at a specific time step in the environment.
For example, this can be the exact position of the robot in the house, the alignment of its two legs, or its current posture. It all depends on how you frame the problem.

Action:
The choice that the agent makes at the current time step. The set of actions (decisions) that the agent can perform is known in advance.
For example, the robot can move its right or left leg, raise its arm, lift an object, or turn right/left.

Important terms
Rewards:
Rewards are the numerical values that the agent receives for performing an action in some state of the environment. The value can be positive or negative, depending on the agent's actions.
In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) rather than only the reward received at the current state (the immediate reward). This total sum of rewards the agent receives from the environment is called the return (G).

Agent-Environment Relationship
So, whenever the agent performs an action, the environment gives the agent a reward and the new state the agent reached by performing that action.

We do not assume that everything in the environment is unknown to the agent. For example, the reward calculation is considered to be part of the environment, even though the agent may know quite a bit about how its reward is computed as a function of its actions and the states in which they are taken.

Example

Imagine a rabbit wandering around a field looking for food. It finds itself in a situation where there is a carrot to its right and broccoli to its left. The rabbit prefers carrots, so eating the carrot generates a reward of +10, while eating the broccoli generates a reward of only +3.

In a different situation, there is a tiger to the right of the carrot. If the rabbit moves right, it gets to eat the carrot, but afterwards it will be killed by the tiger (reward: -100).
If we account for the long-term impact of our actions, the rabbit should go left and settle for the broccoli, giving itself a better chance to escape.
A bandit rabbit would only be concerned with the immediate reward, so it would go for the carrot (reward: +10, ignoring the -100 at the next timestep). A better decision can be made by considering the long-term impact of our decisions.

MDP Application Examples
Robot navigation
Resource allocation

Bandits Application Examples
Recommender systems
Online advertising

Markov Property

S[t] denotes the current state of the agent and S[t+1] denotes the next state.

According to the Markov property, the current state of the robot depends only on its immediately previous state (the previous timestep), not on the earlier timesteps.
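Written as a formula, the standard statement of the Markov property is:

\[ \mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \ldots, S_t] \]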

Transition Probability: The probability that the agent will move from one state to another is called the transition probability.

State Transition Probability

We can define the state transition probability as follows:
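In the standard notation, the probability of transitioning from state s to state s' is:

\[ P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s] \]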

Markov chain Example

Episodic and Continuous Tasks
Episodic Tasks:
These are tasks that have a terminal state (end state), so each episode consists of a finite number of steps.
For example, in a racing game, we start the game (start the race) and play it until the game is over (the race ends). This is called an episode. Once we restart the game, it starts again from an initial state, and hence every episode is independent.

Continuous Tasks:
These are tasks that have no end, i.e. they have no terminal state. These types of tasks never terminate on their own.

Episodic and Continuous Tasks

In an episodic task, since we have a finite number of steps, the return is simply the sum of the rewards received until the terminal timestep, as written below.
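For an episode that terminates at timestep T, the return is:

\[ G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \]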

Now, it is easy to calculate the return for episodic tasks because they eventually end. But what about continuous tasks, which go on forever?

So how can we modify the return of a continuous task so that it is always finite? This is where the discount factor (γ) comes in.

Discount Factor (γ):
It determines how much importance is given to the immediate reward versus future rewards, and it helps us avoid an infinite return in continuous tasks.
It has a value between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards.
In practice, an agent with a discount factor of 0 will never learn, as it only considers the immediate reward, while a discount factor of 1 keeps weighting future rewards fully, which may lead to an infinite return. Therefore, the optimal value for the discount factor lies between 0.2 and 0.8.
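With the discount factor, the return becomes a discounted sum of future rewards, which stays finite for γ < 1 as long as the rewards are bounded:

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]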

Returns using discount factor

Suppose the agent receives a reward of 100 (say, 100 liters of water) at every timestep.

With discount factor (γ) 0.8:
Gt = 100 + 80 + 64 + ...

With discount factor (γ) 0.2:
Gt = 100 + 20 + 4 + ...

The smaller the discount factor, the more we focus on early rewards, since the rewards at future timesteps shrink very quickly.
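As a rough Python sketch of this calculation, assuming a constant reward of 100 at every timestep and truncating the infinite sum after a fixed number of steps:

# Approximate the discounted return G_t for a constant reward of 100 per
# timestep, truncating the infinite sum after a fixed number of steps.
def discounted_return(reward=100.0, gamma=0.8, steps=50):
    return sum(reward * gamma ** k for k in range(steps))

print(round(discounted_return(gamma=0.8), 1))  # ~500.0 = 100 / (1 - 0.8)
print(round(discounted_return(gamma=0.2), 1))  # ~125.0 = 100 / (1 - 0.2)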

Policies
A policy describes how an agent selects its actions (what action to perform in a particular state s).

We will use the symbol π (pi) to denote a policy.

π(s) represents the action selected in state s by the policy π.

In the simplest case, a policy maps each state to a single action. This kind of policy is called a deterministic policy, so that π(s) = a.
Deterministic Policy

In this example, π selects action A1 in state S0 and action A0 in states S1 and S2.

Notice that the agent can select the same action in multiple states, and some actions might not be selected in any state.
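As a small sketch, this deterministic policy could be written as a simple lookup table in Python (the state and action names are taken from the example above):

# Deterministic policy from the example: each state maps to exactly one action.
deterministic_policy = {
    "S0": "A1",
    "S1": "A0",
    "S2": "A0",
}

def pi(state):
    # pi(s) = a: return the single action the policy selects in this state.
    return deterministic_policy[state]

print(pi("S0"))  # A1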
Stochastic Policy
A stochastic policy assigns probabilities to each action in each state.

π(a|s) is the probability that the agent takes action a in state s.

A stochastic policy is one where multiple actions may be selected with non-zero probability.

So π is a probability distribution over actions (a ∈ A) for each state (s ∈ S); it specifies a separate distribution over actions for each state.

The sum of the action probabilities must be one for each state, and each action probability must be non-negative.
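In symbols, these two conditions are:

\[ \sum_{a \in A} \pi(a \mid s) = 1 \quad \text{and} \quad \pi(a \mid s) \ge 0 \quad \text{for every } s \in S \text{ and } a \in A \]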
Stochastic Policy

A stochastic policy might choose up or right with equal probability in the bottom row to reach the house.
Valid and Invalid Policy
It is important that policies depend only on the current state.

We might want to define a policy that chooses the opposite of what it did last, alternating between left and right actions. However, that would not be a valid policy, because it is conditioned on the last action; the action would depend on something other than the current state.
Value Functions
Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

This notion of how good a state or state-action pair is, is given in terms of expected return. The rewards an agent expects to receive depend on what actions it takes in given states. Since the way an agent acts is determined by its policy, value functions are defined with respect to policies.
State-Value Function
The state-value function for policy π tells us how good any given state is for an agent following policy π.

Formally, the value of state s under policy π is the expected return when starting from state s at time t and following policy π thereafter, as written below.
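In standard notation:

\[ v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] \]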

The subscript π indicates that the value function is contingent on the agent selecting actions according to π.

Likewise, the subscript π on the expectation indicates that the expectation is computed with respect to the policy π.
Action-Value Function
Similarly, the action-value function for policy π tells us how good it is for the agent to take any given action from a given state while following policy π.

Formally, the value of action a in state s under policy π is the expected return when starting from state s at time t, taking action a, and following policy π thereafter, as written below.
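In standard notation:

\[ q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,] \]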
Summary
Value functions capture the expected future total reward under a particular policy.

There are two kinds of value functions: state value functions and action value functions.

The state value function gives the expected return from the current state under a policy. The action value function gives the expected return from state s if the agent first selects action a and follows π after that.

Value functions simplify things by aggregating many possible future returns into a single number.

Bellman equations define a relationship between the value of a state or state-action pair and the values of its successor states.
Bellman equation for the state value function
The Bellman equation states that the value function can be decomposed into two parts:
the immediate reward, R[t+1]
the discounted value of the successor state

Mathematically, the Bellman equation for the value function can be written as shown below.
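In the standard form (here p(s', r | s, a) denotes the probability that the environment transitions to state s' with reward r when the agent takes action a in state s):

\[ v_\pi(s) = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big] \]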

Suppose there is a robot in some state s, and it then moves from this state to some other state s'. Now, the question is how good it was for the robot to be in state s. Using the Bellman equation, we can say that this value is the expectation of the reward the robot received on leaving state s plus the discounted value of the state s' it moved to.
Bellman equation for the action value function
The Bellman equation for the action value function gives the value of a particular state-action pair as a sum over the values of all possible next state-action pairs and rewards.

Mathematically, the Bellman equation for the action-value function can be written as shown below.
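In the standard form, with p(s', r | s, a) as above:

\[ q_\pi(s, a) = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \,] = \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Big] \]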


Optimal Policy
An optimal policy is a policy that is as good as or better than all other policies.

An optimal policy has the highest possible value in every state. (It is a set of rules that tells you the best action to take in every state to maximize the expected total reward you can receive over time.)

There is always at least one optimal policy, but there may be more than one. We will use the notation π∗ to denote any optimal policy.

Even if we limit ourselves to deterministic policies, the number of possible policies equals the number of possible actions raised to the power of the number of states. We could use a brute-force search that computes the value function for every policy to find the optimal one, but this is practically infeasible.
BELLMAN OPTIMALITY EQUATION
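In its standard form, the Bellman optimality equation for the state-value and action-value functions (again with p(s', r | s, a) denoting the environment's transition dynamics) is:

\[ v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_*(s') \,\big] \]
\[ q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma \max_{a'} q_*(s', a') \,\big] \]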

It forms the basis for various algorithms, like Dynamic Programming and Policy Iteration, to find the optimal policy in an MDP.
Follow me for more similar posts

By Theepana Govintharajah
