POMDP Tutorial
[Figure: POMDP overview: the WORLD has transition dynamics and emits an Observation; the ACTOR maintains a Belief b and selects an Action via a Policy; a reward component links the two]
- T = P(x'|x, a): transition and reward probabilities
- O: observation function
- b: belief and information state
- π: policy
Agent-environment interaction over time: st, at, rt+1, st+1, at+1, rt+2, st+2, at+2, rt+3, st+3, at+3, ... for t = 0, 1, 2, ...
MDP Model
[Figure: agent-environment loop: the Agent observes a State, selects an Action, and receives a Reward from the Environment, generating the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...]
Process:
Observe state st in S
Choose action at in At
Receive immediate reward rt
State changes to st+1
MDP Model <S, A, T, R>
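To make this loop concrete, here is a minimal Python sketch of an agent interacting with an MDP <S, A, T, R>. The particular states, transition table, rewards, random policy, and episode length are made up purely for illustration.

```python
import random

# A tiny hypothetical MDP <S, A, T, R>: 3 states, 2 actions.
S = [0, 1, 2]
A = [0, 1]
# T[s][a] is a list of (next_state, probability) pairs.
T = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.5), (2, 0.5)], 1: [(2, 1.0)]},
     2: {0: [(2, 1.0)],           1: [(0, 1.0)]}}
# R[s][a] is the immediate reward for taking a in s.
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}, 2: {0: 0.0, 1: 5.0}}

def step(s, a):
    """Sample s_{t+1} from T(s, a, .) and return (reward, next_state)."""
    states, probs = zip(*T[s][a])
    s_next = random.choices(states, weights=probs)[0]
    return R[s][a], s_next

# The loop: observe s_t, choose a_t, receive r_t, state changes to s_{t+1}.
s = 0
for t in range(5):
    a = random.choice(A)          # a (random) policy stands in here
    r, s_next = step(s, a)
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
```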
Policy: s ∈ S → a ∈ A(s) (could be stochastic)
Value of each state (expected discounted future rewards):
Vπ(s) = E{ rt+1 + γ rt+2 + γ² rt+3 + ... | st = s, π }
and each s, a pair:
Qπ(s, a) = E{ rt+1 + γ rt+2 + γ² rt+3 + ... | st = s, at = a, π }
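As a sketch of how these definitions turn into computation, the following iterative policy evaluation computes Vπ (and then Qπ) for a small hypothetical MDP by repeatedly applying the Bellman equation for Vπ; the transition table, rewards, and γ = 0.9 are invented for illustration.

```python
# Iterative policy evaluation for a tiny hypothetical MDP.
# V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
S, A = [0, 1], [0, 1]
T = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0},        # T[(s, a)][s'] = Pr(s' | s, a)
     (1, 0): {0: 1.0}, (1, 1): {1: 1.0}}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
pi = {s: {a: 0.5 for a in A} for s in S}         # uniformly random policy
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(200):                             # sweep until approximately converged
    V_new = {}
    for s in S:
        V_new[s] = sum(pi[s][a] * (R[(s, a)]
                       + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
                       for a in A)
    V = V_new

# Q follows directly from V: Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
Q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
     for s in S for a in A}
print(V)
print(Q)
```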
Optimization (MDPs)
Grid Example
Policy Iteration
Optimal Trajectory
Ball-Balancing
RL applied to a real physical system; illustrates online learning
An easy problem... but it learns as you watch
Smarts Demo
Rewards?
Environment state space?
Action space?
Role of perception?
Policy?
Q-learning (Watkins, 1989): after observing st, at, rt+1, st+1, update a table entry using the TD error:
Q(st, at) ← Q(st, at) + α [ rt+1 + γ max_a Q(st+1, a) − Q(st, at) ]
Under standard conditions, lim_{t→∞} Qt = Q*.
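A minimal tabular Q-learning sketch, assuming a generic environment object with reset() and step(a) methods; that interface and the values of α, γ, and ε are illustrative assumptions, not part of the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (Watkins, 1989): update one table entry per step using the TD error."""
    Q = defaultdict(float)                       # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False             # assumed env interface
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)        # assumed env interface
            # TD error: r + gamma * max_a' Q(s', a') - Q(s, a)
            target = r + (0.0 if done else gamma * max(Q[(s_next, a2)] for a2 in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```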
[Figure: generalized policy iteration: policy evaluation maps the policy π to its value functions V, Q; policy improvement (greedification) maps V, Q back to an improved policy π]
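A sketch of this evaluation/greedification loop as tabular policy iteration; the MDP arguments follow the same conventions as the earlier sketch and are assumptions for illustration.

```python
def policy_iteration(S, A, T, R, gamma=0.9, sweeps=100):
    """Alternate policy evaluation (pi -> V) and greedification (V -> pi) until the policy is stable."""
    pi = {s: A[0] for s in S}                                   # arbitrary initial policy
    while True:
        # policy evaluation: V <- value of the current deterministic policy
        V = {s: 0.0 for s in S}
        for _ in range(sweeps):
            V = {s: R[(s, pi[s])] + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
                 for s in S}
        # policy improvement (greedification): pi <- greedy with respect to V
        new_pi = {s: max(A, key=lambda a: R[(s, a)] + gamma *
                         sum(p * V[s2] for s2, p in T[(s, a)].items()))
                  for s in S}
        if new_pi == pi:                                        # stable policy: done
            return pi, V
        pi = new_pi
```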
Rooms Example
4 rooms, 4 hallways
4 unreliable primitive actions (up, down, left, right), each failing 33% of the time
8 multi-step options (one to each room's 2 hallways)
[Figure: gridworld of rooms and hallway states, with hallway options O1, O2 and candidate goal locations G]
[Figure (two panels): value iteration with V(goal) = 1, shown after Iteration #1, #2, and #3]
POMDP: UNCERTAINTY
A broad perspective
[Figure: the AGENT maintains a Belief state over the STATE of the WORLD; it receives OBSERVATIONS and emits ACTIONS; the true state lives in WORLD + AGENT]
Components:
Set of states: s ∈ S
Set of actions: a ∈ A
Set of observations: o ∈ Ω
[Figure: example POMDP with states S1, S2, S3, actions a1, a2, transition probabilities (0.5 / 0.5), and per-state observation probabilities, e.g. Pr(o1) = 0.5, Pr(o2) = 0.5; Pr(o1) = 0.9, Pr(o2) = 0.1; Pr(o1) = 0.2, Pr(o2) = 0.8]
POMDP parameters:
Initial belief: b0(s) = Pr(S = s)
Belief state updating: b'(s') = Pr(s' | o, a, b)
Observation probabilities: O(s', a, o) = Pr(o | s', a)
Transition probabilities: T(s, a, s') = Pr(s' | s, a)
Rewards: R(s, a)
[Figure: a POMDP agent decomposed into a state estimator (SE) and a policy: the World emits an Observation Zi, the SE updates the Belief state, and a policy over beliefs (an MDP over belief states) selects the Action]
b'(sj) = P(sj | o, a, b)
       = P(o | sj, a) Σ_{si∈S} P(sj | si, a) b(si) / Σ_{sj'∈S} P(o | sj', a) Σ_{si∈S} P(sj' | si, a) b(si)
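A direct transcription of this belief update, sketched in Python. The array conventions (T[si][a][sj] = P(sj | si, a), O[sj][a][o] = P(o | sj, a), states indexed 0..n-1) are my assumptions for illustration.

```python
def belief_update(b, a, o, T, O):
    """b'(s_j) proportional to O[s_j][a][o] * sum_i T[s_i][a][s_j] * b[s_i]; normalise by P(o | a, b)."""
    n = len(b)
    b_next = [O[sj][a][o] * sum(T[si][a][sj] * b[si] for si in range(n)) for sj in range(n)]
    norm = sum(b_next)                      # this denominator is P(o | a, b)
    if norm == 0.0:
        raise ValueError("Observation o has zero probability under (a, b)")
    return [x / norm for x in b_next]
```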
Tiger problem
States: tiger-left, tiger-right
Actions = { 0: listen, 1: open-left, 2: open-right }
Reward function and observations: see the tables below
[Figure: influence diagram unrolled over time with states S0, S1, S2, S3, actions A1, A2, A3, and observations O1, O2, O3]
We can solve this belief MDP, as before, with a value iteration algorithm to find the optimal policy over a continuous space. However, the algorithm needs some adaptations.
Belief MDP
The policy of a POMDP maps the current belief state into an action. Because the belief state holds all relevant information about the past, the optimal policy of the POMDP is the solution of this (continuous-space) belief MDP.
A belief MDP is a tuple <B, A, ρ, P>:
B = infinite set of belief states
A = finite set of actions
ρ(b, a) = Σ_{s∈S} b(s) R(s, a)   (reward function)
P(b' | b, a) = Σ_{o∈Ω} P(b' | b, a, o) P(o | a, b)   (transition function),
where P(b' | b, a, o) is 1 if updating b with (a, o) yields b' and 0 otherwise.
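A sketch of the two belief-MDP quantities defined above, ρ(b, a) and P(o | a, b), reusing the same T/O/R array conventions as the belief-update sketch; those conventions are assumptions made for illustration.

```python
def rho(b, a, R):
    """Belief-MDP reward: rho(b, a) = sum_s b(s) R(s, a)."""
    return sum(b[s] * R[s][a] for s in range(len(b)))

def obs_prob(b, a, o, T, O):
    """P(o | a, b) = sum_{s'} O(s', a, o) * sum_s T(s, a, s') b(s): the belief-update normaliser."""
    n = len(b)
    return sum(O[sj][a][o] * sum(T[si][a][sj] * b[si] for si in range(n)) for sj in range(n))
```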
Tiger problem parameters (Actions = { 0: listen, 1: open-left, 2: open-right }):
Transition probabilities
Prob. (LISTEN) (listening doesn't change the tiger's location):
                 Tiger: left   Tiger: right
  Tiger: left        1.0           0.0
  Tiger: right       0.0           1.0
Prob. (LEFT) (opening a door resets the problem):
                 Tiger: left   Tiger: right
  Tiger: left        0.5           0.5
  Tiger: right       0.5           0.5
Prob. (RIGHT) (opening a door resets the problem):
                 Tiger: left   Tiger: right
  Tiger: left        0.5           0.5
  Tiger: right       0.5           0.5
Observation probabilities
Prob. (LISTEN):
                 O: TL   O: TR
  Tiger: left     0.85    0.15
  Tiger: right    0.15    0.85
Prob. (LEFT) (after opening a door, any observation is uninformative):
                 O: TL   O: TR
  Tiger: left     0.5     0.5
  Tiger: right    0.5     0.5
Prob. (RIGHT) (likewise uninformative):
                 O: TL   O: TR
  Tiger: left     0.5     0.5
  Tiger: right    0.5     0.5
Rewards
Reward (LISTEN):  Tiger: left  -1     Tiger: right  -1
Reward (LEFT):    Tiger: left  -100   Tiger: right  +10
Reward (RIGHT):   Tiger: left  +10    Tiger: right  -100
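The tables above can be written down directly. A minimal sketch follows; the index order (states [tiger-left, tiger-right], actions [listen, open-left, open-right], observations [TL, TR]) is my choice for illustration, and the snippet ends with one belief update after listening and hearing the tiger on the left.

```python
# Tiger problem parameters.
# T[s][a][s'] = Pr(s' | s, a)
T = [[[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]],     # from tiger-left
     [[0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]]     # from tiger-right
# O[s'][a][o] = Pr(o | s', a)
O = [[[0.85, 0.15], [0.5, 0.5], [0.5, 0.5]],   # in tiger-left
     [[0.15, 0.85], [0.5, 0.5], [0.5, 0.5]]]   # in tiger-right
# R[s][a]
R = [[-1.0, -100.0,  10.0],
     [-1.0,   10.0, -100.0]]

b0 = [0.5, 0.5]                      # uniform initial belief
LISTEN, TL = 0, 0

# Belief update after action = listen, obs = TL (same formula as before).
num = [O[sj][LISTEN][TL] * sum(T[si][LISTEN][sj] * b0[si] for si in range(2)) for sj in range(2)]
b1 = [x / sum(num) for x in num]
print(b1)                            # -> [0.85, 0.15]
```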
[Figure: belief vector over S1 = tiger-left and S2 = tiger-right; starting from the initial belief b0, after action = listen and obs = hear-tiger-left the belief is updated to
b1(si) = P(o | si, a) Σ_{sj∈S} P(si | sj, a) b0(sj) / P(o | a, b0) ]
[Figure: belief vector over S1 = tiger-left and S2 = tiger-right after action = listen and obs = growl-left]
Optimal policy for t = 1:
[Figure: the belief interval is partitioned into three regions with actions open-left on [0.00, 0.10], listen on [0.10, 0.90], and open-right on [0.90, 1.00]; alpha vectors shown include (-1.0, -1.0) for listen and (10.0, -100.0) for opening a door]
Optimal policy for t = 2 over the belief space (S1 = tiger-left, S2 = tiger-right):
[Figure: the belief interval is partitioned into the regions [0.00, 0.02], [0.02, 0.39], [0.39, 0.61], [0.61, 0.98], [0.98, 1.00]; the outer regions open a door (left / right) and the inner regions listen, with the follow-up action depending on the observation (TL or TR)]
V*(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} Pr(o | b, a) V*(b_a^o) ],
where b_a^o is the belief that results from taking action a in belief b and observing o.
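A sketch of the bracketed quantity for a single belief b: given the POMDP arrays (same conventions as above) and any estimate of V* on successor beliefs, supplied here as a caller-provided function V, compute the one-step lookahead value. The helper name, signature, and γ value are assumptions for illustration.

```python
def lookahead_value(b, actions, observations, T, O, R, V, gamma=0.95):
    """V(b) = max_a [ sum_s b(s) R(s,a) + gamma * sum_o Pr(o|b,a) V(b_a^o) ]."""
    n = len(b)
    best = float("-inf")
    for a in actions:
        value = sum(b[s] * R[s][a] for s in range(n))            # immediate expected reward
        for o in observations:
            num = [O[sj][a][o] * sum(T[si][a][sj] * b[si] for si in range(n)) for sj in range(n)]
            p_o = sum(num)                                       # Pr(o | b, a)
            if p_o > 0.0:
                b_next = [x / p_o for x in num]                  # updated belief b_a^o
                value += gamma * p_o * V(b_next)
        best = max(best, value)
    return best
```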
Policy Tree
With one step remaining, the agent must take a single action. With two steps to go, it takes an action, makes an observation, and then takes the final action. In general, an agent's t-step policy can be represented as a policy tree.
[Figure: a policy tree with T steps to go: an action node A at the root, observation branches O1, O2, ..., Ok leading to subtrees with T-1 steps to go, down to a single action with 1 step to go; the 2-step tree is the case T = 2]
The value of executing a policy tree p from state s is
Vp(s) = R(s, a(p)) + γ Σ_{s'∈S} T(s, a(p), s') Σ_{oi∈Ω} O(s', a(p), oi) V_{p(oi)}(s'),
where a(p) is the action at the root of p and p(oi) is the subtree followed after observation oi.
Thus, Vp(s) can be thought of as a vector associated with the policy tree p, since its dimension is the number of states. We often use the notation αp to refer to this vector.
As the exact world state cannot be observed, the agent must compute an expectation over world states of executing policy tree p from belief state b:
Vp(b) = Σ_{s∈S} b(s) Vp(s)
If we let αp = < Vp(s1), Vp(s2), ..., Vp(sn) >, then
Vp(b) = b · αp
To construct an optimal t-step policy, we must maximize over all t-step policy trees P:
Vt(b) = max_{p∈P} b · αp
As Vp(b) is linear in b for each p ∈ P, Vt(b) is the upper surface of those functions. That is, Vt(b) is piecewise linear and convex.
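Evaluating a value function represented by a set of alpha vectors is a one-liner; a small sketch with made-up vectors:

```python
def value(b, alpha_vectors):
    """V_t(b) = max_p b . alpha_p over the stored alpha vectors."""
    return max(sum(bi * ai for bi, ai in zip(b, alpha)) for alpha in alpha_vectors)

# e.g. two hypothetical alpha vectors over a 2-state POMDP
print(value([0.3, 0.7], [(-1.0, -1.0), (10.0, -100.0)]))   # -> -1.0
```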
Let Vp1, Vp2, and Vp3 be the value functions induced by policy trees p1, p2, and p3. Each of these value functions is of the form
Vpi(b) = b · αpi
and the t-step value function is their upper surface:
Vt(b) = max_{pi∈P} b · αpi
[Figure (two panels): the linear functions Vp1, Vp2, Vp3 plotted over the belief b(s0), with the ends of the interval corresponding to S = S0 and S = S1; their upper surface partitions the belief interval into regions labelled with the root actions A(p1), A(p2), A(p3)]
The projection of the optimal t-step value function yields a partition into regions. Within each region there is a single policy tree p such that b · αp is maximal over the entire region. The optimal action in that region is a(p), the action at the root of the policy tree p.
One-step policy trees are just actions: a0, a1
To do a DP backup, we evaluate every possible 2-step policy tree
[Figure: all possible 2-step policy trees built from the actions {a0, a1} and observations {O0, O1}: each tree has a root action and, for each observation, a one-step subtree (i.e. an action)]
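A sketch of this exact DP backup in alpha-vector form: for each action, take the cross-sum over observations of the projected previous-stage vectors. The array conventions follow the earlier sketches, observations are assumed to be indexed 0..K-1, and no pruning is done here; all of that is my framing, not the slides'.

```python
from itertools import product

def dp_backup(alphas, actions, observations, T, O, R, gamma=0.95):
    """Enumerate all next-stage alpha vectors: one per (action, choice of previous vector per observation)."""
    n = len(R)                                     # number of states
    new_alphas = []
    for a in actions:
        # g[o][k][s] = gamma * sum_s' T(s,a,s') O(s',a,o) alphas[k][s']
        g = [[[gamma * sum(T[s][a][s2] * O[s2][a][o] * alpha[s2] for s2 in range(n))
               for s in range(n)]
              for alpha in alphas]
             for o in observations]
        # one new vector for every way of picking a previous-stage vector per observation
        for choice in product(range(len(alphas)), repeat=len(observations)):
            vec = tuple(R[s][a] + sum(g[o][choice[o]][s] for o in observations)
                        for s in range(n))
            new_alphas.append(vec)
    return new_alphas
```

Starting from the one-step vectors and repeating this backup (with pruning, below) yields the finite-horizon value functions.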
Pruning
[Figure: the linear functions Vp1 and Vp2 over the belief b(s0); a vector that is nowhere maximal can be pruned without changing the value function]
Keep doing backups until the value function doesn't change much anymore.
In the worst case, you end up considering every possible policy tree.
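A simple pruning sketch: drop vectors that are dominated component-wise by another kept vector. Pruning in the literature typically also uses linear programs to remove vectors dominated only by combinations of others; the cheaper pointwise check below is used here purely for illustration.

```python
def prune_pointwise(alphas):
    """Remove alpha vectors that are dominated component-wise by another kept vector."""
    kept = []
    for a in alphas:
        # skip a if some already-kept vector is at least as good in every state
        if any(all(bk >= ak for ak, bk in zip(a, b)) for b in kept):
            continue
        # drop previously kept vectors that a dominates, then keep a
        kept = [b for b in kept if not all(ak >= bk for ak, bk in zip(a, b))]
        kept.append(a)
    return kept
```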
Meta-level MDP
[Figure: the meta-level (information) state is the pair <s, b> of environment state and belief; the agent chooses an action a to maximize long-term reward, observes the outcome (r, s', b'), and repeats]
Meta-level model: P(r, s', b' | s, b, a), mapping a decision a to an outcome (r, s', b').
Problem difficulty
Policies must specify actions for all possible information states.
Value Function
Optimal value function for learning while acting, with k-step look-ahead
More reading