Markov Decision Process (MDP)


Markov Decision Process

By Theepana Govintharajah
Markov Decision Process

The Markov decision process (MDP) is a mathematical framework used for modeling decision-making problems where the outcomes are partly random and partly controllable.

It is a framework that can address most reinforcement learning (RL) problems.

Typical Reinforcement Learning cycle

Important terms
Agent:
The agent is the learner in RL: a software program that makes intelligent decisions. It interacts with the environment through actions and receives rewards based on those actions.
For example, a robot that is being trained to move around a house without crashing.

Environment:
The surroundings with which the agent interacts. The agent cannot manipulate the environment.
For example, the house where the robot moves.

Important terms
State:
This is the position of the agent at a specific time step in the environment.
For example, this can be the exact position of the robot in the house, the alignment of its two legs, or its current posture. It all depends on how you frame the problem.

Action:
The choice that the agent makes at the current time step. The set of actions (decisions) that the agent can perform is known in advance.
For example, the robot can move its right or left leg, raise its arm, lift an object, or turn right/left.

Important terms
Rewards:
Rewards are the numerical values that the agent receives for performing an action in some state of the environment. The value can be positive or negative, depending on the agent's actions.
In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) rather than only the reward received at the current state (the immediate reward). This total sum of rewards the agent receives from the environment is called the return (G).

Agent-Environment Relationship
So, whenever the agent performs an action, the environment gives the agent a reward and the new state the agent reached by performing that action.

We do not assume that everything in the environment is unknown to the agent. For example, the reward calculation is considered to be part of the environment, even though the agent may know quite a bit about how its reward is computed as a function of its actions and the states in which they are taken.

Example

Imagine a rabbit wandering around a field looking for food. It finds itself in a situation where there is a carrot to its right and broccoli to its left. The rabbit prefers carrots, so eating the carrot generates a reward of +10, while eating the broccoli generates a reward of only +3.

In a different situation, there is a tiger to the right of the carrot. If the rabbit moves right, it gets to eat the carrot, but afterwards it will be killed by the tiger (reward: -100).
If we account for the long-term impact of our actions, the rabbit should go left and settle for the broccoli, giving itself a better chance to escape.
A bandit rabbit would only be concerned with the immediate reward, so it would go for the carrot (reward: +10, ignoring the -100 at the next timestep). A better decision can be made by considering the long-term impact of our decisions.

MDP Application Examples
Robot navigation
Resource allocation

Bandits Application Examples
Recommender systems
Online advertising

Markov Property

S[t] denotes the current state of the agent and S[t+1] denotes the next state.

According to the Markov property, the current state of the robot depends only on its immediately previous state (the previous timestep), not on the earlier timesteps.
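Written as a formula, the standard statement of the Markov property is:

\[ \mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \ldots, S_t] \]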

Transition Probability: The probability that the agent will move from one state to another is called the transition probability.

State Transition Probability

We can define the state transition probability as follows:
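In the standard notation, the probability of transitioning from state s to state s' is:

\[ P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s] \]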

Markov chain Example

Episodic and Continuous Tasks
Episodic Tasks:
These are tasks that have a terminal state (end state), so each episode consists of a finite number of steps.
For example, in a racing game, we start the game (start the race) and play it until the game is over (the race ends). This is called an episode. Once we restart the game, it starts again from an initial state, and hence every episode is independent.

Continuous Tasks:
These are tasks that have no end, i.e. they have no terminal state. These types of tasks never terminate on their own.

Episodic and Continuous Tasks

In an episodic task, since we have a finite number of steps, the return is simply the sum of the rewards received until the terminal timestep, as written below.
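For an episode that terminates at timestep T, the return is:

\[ G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \]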

Now, it is easy to calculate the return for episodic tasks because they eventually end. But what about continuous tasks, which go on forever?

So how can we modify the return of a continuous task so that it is always finite? This is where the discount factor (γ) comes in.

Discount Factor (γ):
It determines how much importance is given to the immediate reward versus future rewards, and it helps us avoid an infinite return in continuous tasks.
It has a value between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards.
In practice, an agent with a discount factor of 0 will never learn, as it only considers the immediate reward, while a discount factor of 1 keeps weighting future rewards fully, which may lead to an infinite return. Therefore, the optimal value for the discount factor lies between 0.2 and 0.8.
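With the discount factor, the return becomes a discounted sum of future rewards, which stays finite for γ < 1 as long as the rewards are bounded:

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]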

Returns using discount factor

Suppose the agent receives a reward of 100 (say, 100 liters of water) at every timestep.

With discount factor (γ) 0.8:
Gt = 100 + 80 + 64 + ...

With discount factor (γ) 0.2:
Gt = 100 + 20 + 4 + ...

The smaller the discount factor, the more we focus on early rewards, since the rewards at future timesteps shrink very quickly.
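As a rough Python sketch of this calculation, assuming a constant reward of 100 at every timestep and truncating the infinite sum after a fixed number of steps:

# Approximate the discounted return G_t for a constant reward of 100 per
# timestep, truncating the infinite sum after a fixed number of steps.
def discounted_return(reward=100.0, gamma=0.8, steps=50):
    return sum(reward * gamma ** k for k in range(steps))

print(round(discounted_return(gamma=0.8), 1))  # ~500.0 = 100 / (1 - 0.8)
print(round(discounted_return(gamma=0.2), 1))  # ~125.0 = 100 / (1 - 0.2)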

Policies
A policy describes how an agent selects its actions (what action to perform in a particular state s).

We will use the symbol π (pi) to denote a policy.

π(s) represents the action selected in state s by the policy π.

In the simplest case, a policy maps each state to a single action. This kind of policy is called a deterministic policy, so that π(s) = a.
Deterministic Policy

In this example, π selects action A1 in state S0 and action A0 in states S1 and S2.

Notice that the agent can select the same action in multiple states, and some actions might not be selected in any state.
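As a small sketch, this deterministic policy could be written as a simple lookup table in Python (the state and action names are taken from the example above):

# Deterministic policy from the example: each state maps to exactly one action.
deterministic_policy = {
    "S0": "A1",
    "S1": "A0",
    "S2": "A0",
}

def pi(state):
    # pi(s) = a: return the single action the policy selects in this state.
    return deterministic_policy[state]

print(pi("S0"))  # A1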
Stochastic Policy
A stochastic policy assigns probabilities to each action in each state.

π(a|s) is the probability that the agent takes action a in state s.

A stochastic policy is one where multiple actions may be selected with non-zero probability.

So π is a probability distribution over actions (a ∈ A) for each state (s ∈ S); it specifies a separate distribution over actions for each state.

The sum of the action probabilities must be one for each state, and each action probability must be non-negative.
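In symbols, these two conditions are:

\[ \sum_{a \in A} \pi(a \mid s) = 1 \quad \text{and} \quad \pi(a \mid s) \ge 0 \quad \text{for every } s \in S \text{ and } a \in A \]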
Stochastic Policy

A stochastic policy might choose up or right with equal probability in the bottom row to reach the house.
Valid and Invalid Policy
It is important that policies depend only on the current state.

We might want to define a policy that chooses the opposite of what it did last, alternating between left and right actions. However, that would not be a valid policy, because it is conditioned on the last action; the action would depend on something other than the current state.
Value Functions
Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

This notion of how good a state or state-action pair is, is given in terms of expected return. The rewards an agent expects to receive depend on what actions it takes in given states. Since the way an agent acts is determined by its policy, value functions are defined with respect to policies.
State-Value Function
The state-value function for policy π tells us how good any given state is for an agent following policy π.

Formally, the value of state s under policy π is the expected return when starting from state s at time t and following policy π thereafter, as written below.
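In standard notation:

\[ v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] \]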

The subscript π indicates that the value function is contingent on the agent selecting actions according to π.

Likewise, the subscript π on the expectation indicates that the expectation is computed with respect to the policy π.
Action-Value Function
Similarly, the action-value function for policy π tells us how good it is for the agent to take any given action from a given state while following policy π.

Formally, the value of action a in state s under policy π is the expected return when starting from state s at time t, taking action a, and following policy π thereafter, as written below.
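In standard notation:

\[ q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,] \]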
Summary
Value functions capture the expected future total reward under a particular policy.

There are two kinds of value functions: state value functions and action value functions.

The state value function gives the expected return from the current state under a policy. The action value function gives the expected return from state s if the agent first selects action a and follows π after that.

Value functions simplify things by aggregating many possible future returns into a single number.

Bellman equations define a relationship between the value of a state or state-action pair and the values of its successor states.
Bellman equation for the state value function
The Bellman equation states that the value function can be decomposed into two parts:
the immediate reward, R[t+1]
the discounted value of the successor state

Mathematically, the Bellman equation for the value function can be written as shown below.
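In the standard form (here p(s', r | s, a) denotes the probability that the environment transitions to state s' with reward r when the agent takes action a in state s):

\[ v_\pi(s) = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big] \]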

Suppose there is a robot in some state s, and it then moves from this state to some other state s'. Now, the question is how good it was for the robot to be in state s. Using the Bellman equation, we can say that this value is the expectation of the reward the robot received on leaving state s plus the discounted value of the state s' it moved to.
Bellman equation for the action value function
The Bellman equation for the action value function gives the value of a particular state-action pair as a sum over the values of all possible next state-action pairs and rewards.

Mathematically, the Bellman equation for the action-value function can be written as shown below.
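In the standard form, with p(s', r | s, a) as above:

\[ q_\pi(s, a) = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \,] = \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Big] \]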


Optimal Policy
An optimal policy is a policy that is as good as or better than all other policies.

An optimal policy has the highest possible value in every state. (It is a set of rules that tells you the best action to take in every state to maximize the expected total reward you can receive over time.)

There is always at least one optimal policy, but there may be more than one. We will use the notation π∗ to denote any optimal policy.

Even if we limit ourselves to deterministic policies, the number of possible policies equals the number of possible actions raised to the power of the number of states. We could use a brute-force search that computes the value function for every policy to find the optimal one, but this is practically infeasible.
BELLMAN OPTIMALITY EQUATION
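In its standard form, the Bellman optimality equation for the state-value and action-value functions (again with p(s', r | s, a) denoting the environment's transition dynamics) is:

\[ v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_*(s') \,\big] \]
\[ q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma \max_{a'} q_*(s', a') \,\big] \]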

It forms the basis for various algorithms, like Dynamic Programming and Policy Iteration, to find the optimal policy in an MDP.
Follow me for more similar posts

By Theepana Govintharajah
