Markov Decision Process (MDP)
By Theepana Govintharajah
Typical Reinforcement Learning Cycle

At each time step, the agent observes the current state of the environment and selects an action; the environment responds with a reward and the next state, and the cycle repeats.
Important terms

Agent:
Software programs that make intelligent decisions; they are the learners in RL. Agents interact with the environment through actions and receive rewards based on their actions.
For example, a robot that is being trained to move around a house without crashing.

Environment:
The surroundings with which the agent interacts. The agent cannot manipulate the environment itself.
For example, the house where the robot moves.
Important terms

State:
The position of the agent at a specific time step in the environment.
For example, this can be the exact position of the robot in the house, the alignment of its two legs, or its current posture. It all depends on how you frame the problem.

Action:
The choice that the agent makes at the current time step. The set of actions (decisions) that the agent can perform is known in advance.
For example, the robot can move its right or left leg, raise its arm, lift an object, or turn right/left, etc.
Important terms

Rewards:
Rewards are the numerical values that the agent receives on performing some action at some state(s) in the environment. The numerical value can be positive or negative based on the actions of the agent.
In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) rather than only the reward the agent receives from the current state (also called the immediate reward). This total sum of rewards the agent receives from the environment is called the return (G).
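In the standard notation, for a task that ends at a final time step T, the return from time step t is the sum of the rewards that follow:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T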
Agent-Environment Relationship

So, whenever the agent performs an action, the environment gives the agent a reward and the new state that the agent reached by performing that action.
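As a minimal sketch of this loop (the environment, its states, and its rewards here are hypothetical, purely for illustration; this is not code from the slides):

import random

# A toy 1-D "house": states 0..4; the robot starts in the middle.
# Reaching state 4 is the goal (+10); every other step costs -1.
class HouseEnv:
    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 10 if self.state == 4 else -1
        done = (self.state == 4)  # the episode ends at the goal state
        return self.state, reward, done

env = HouseEnv()
state = env.reset()
G = 0  # cumulative reward collected so far (the return)
done = False
while not done:
    action = random.choice([-1, 1])  # a random policy, just for the demo
    state, reward, done = env.step(action)
    G += reward
print("Return G:", G)

Each call to step() is one turn of the cycle: the agent sends an action in, and the environment sends back a reward and the next state.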
Example

Consider a rabbit that can move left toward a broccoli or right toward a carrot (Reward: 10). Looking only one step ahead, moving right for the carrot is clearly the better choice.
In a different situation, to the right of the carrot there is a tiger. If the rabbit moves right, it gets to eat the carrot, but afterwards it will be killed by the tiger (Reward: -100).
If we account for the long-term impact of our actions, the rabbit should go left and settle for the broccoli, giving itself a better chance to escape.
A bandit rabbit would only be concerned with the immediate reward, so it would go for the carrot (Reward: 10, ignoring the -100 at the next time step). But a better decision can be made by considering the long-term impact of our decisions, as the quick check below shows.
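A quick comparison of the two paths (the +3 reward for the broccoli is an assumed value for illustration; the slides only give the carrot's +10 and the tiger's -100):

# Rewards collected along each path.
right_path = [10, -100]  # carrot, then the tiger (from the example)
left_path = [3]          # broccoli (assumed +3), then a safe escape

print("Return going right:", sum(right_path))  # -90
print("Return going left: ", sum(left_path))   # 3

Maximizing the return rather than the immediate reward is exactly what sends the rabbit left.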
MDP Application Examples
Robot navigation
Resource allocation
Markov Property
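The Markov property says that the future depends only on the present state, not on the full history of states that came before it:

P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]

In other words, the current state captures all of the information from the history that matters for predicting what comes next.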
State Transition Probability
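The probability of moving from a state s to a successor state s' in one step is:

P_{ss'} = P[S_{t+1} = s' \mid S_t = s]

Collecting these probabilities for every pair of states gives the state transition matrix, in which each row sums to 1.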
Markov Chain Example
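As a concrete sketch, consider a hypothetical two-state weather chain (Sunny and Rainy) with assumed transition probabilities, simulated below; the states and numbers are illustrative, not from the slides:

import random

# Transition matrix for a toy Markov chain (values are assumptions).
# P[s][s'] = probability of moving from state s to state s'.
P = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def next_state(state):
    # Sample the successor using only the current state's row of P.
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs)[0]

state = "Sunny"
chain = [state]
for _ in range(10):
    state = next_state(state)
    chain.append(state)
print(" -> ".join(chain))

Note that next_state looks only at the current state, never at the earlier ones; that is the Markov property in action.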
Episodic and Continuous Tasks

Episodic Tasks:
These are the tasks that have a terminal state (end state), so the interaction lasts a finite number of time steps.
For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!). This is called an episode. Once we restart the game, it starts from an initial state, and hence every episode is independent.

Continuous Tasks:
These are the tasks that have no end, i.e. they don't have any terminal state; the interaction goes on forever.
Episodic and Continuous Tasks

For a continuous task there is no final time step T, so the plain sum of rewards

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots

can grow without bound. So how can we modify the return of a continuous task so that it is always finite? This is where we need the discount factor (γ).
Returns Using the Discount Factor
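With a discount factor \gamma \in [0, 1), rewards that arrive further in the future are weighted less:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

As long as the rewards are bounded, this sum is always finite, because \gamma^k shrinks geometrically. A \gamma close to 0 makes the agent short-sighted (like the bandit rabbit), while a \gamma close to 1 makes it value long-term rewards.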
Policies

A policy defines how an agent selects its actions, i.e., which action to perform in a particular state s.
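Formally, a policy π maps states to probabilities of actions:

\pi(a \mid s) = P[A_t = a \mid S_t = s]

A deterministic policy returns a single action for each state, while a stochastic policy assigns a probability to each available action.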