Greedy Algorithms For Sequential Sensing Decision
Finally, the multi-armed bandit problem [Robbins, 1952; Auer et al., 2000] is used to model exploration decisions in stateless domains. There, an agent tries to maximize its overall reward by playing one of n different slot machines at every time step, with the initial payoffs unknown. The problem is learning the payoffs vs. trying to maximize rewards for machines where the payoffs have been estimated already. In our setting, we assume that the dynamics (reward and transition model) are known and focus on devising optimal sensing schedules in a partially observed dynamic domain.

2 Sequential Sensing Decisions
In this section, we model a general sensing decision problem using a POMDP. Details of the formulation for specific scenarios are given in later sections. In general, we assume a discrete-time environment which evolves stochastically with time. At each time step t the agent decides to sense (or not) a sensing object and receives a reward that depends on the agent's action and the current state of the world. We model this decision problem with a POMDP because it allows modeling the effect of sensing on knowledge and the effect of knowledge on decisions. We use this specialized POMDP because it emphasizes our problem's special structure and makes change explicit. Later, we take advantage of the features of this POMDP and introduce efficient algorithms to find optimal solutions. Thus, part of our contribution here is the POMDP structure introduced for this type of problem.

In general, every POMDP is a tuple M = ⟨X, A, T, O, R⟩, which includes a state space, X, a set of actions, A, a probabilistic transition model, T, an observation model, O, and a reward function, R. At time step t−1, the agent executes a chosen action, a^{t−1}, and goes from state s^{t−1} to s^t according to the transition model. Then, it receives an observation o^t and a reward R(s^{t−1}, a^{t−1}). The agent cannot observe s^t directly, but can maintain a belief state b^t(s^t) = P(s^t | o^{1:t}, a^{1:t−1}, b^0), where b^0 is the belief state at time 0. The agent's goal is to devise a policy that maximizes the discounted sum of expected rewards.

We define the POMDP components that model our problem. The environment of interest is represented by a state space X which is the union of a set of random variables. The variables defining the state space are factored into three disjoint sets Xh, Xo, Xs, where
• Xh is a set of completely hidden variables whose true values can never be observed by the agent. They can represent some properties of the system.
• Xo is a set of completely observable variables whose true values are always observed by the agent.
• Xs is a set of sensing variables whose true value can be observed depending on the action. Every sensing variable C is boolean, representing whether or not a certain property changed since the last time we observed it.
The full set of possible deterministic actions is A = As ∪ {idle}, in which As includes all sensing actions. Action idle does nothing. Each ai ∈ As is associated with a set of variables Xi ⊆ Xs that are sensed by ai (observed in the result of performing ai).

Xo is monitored at all times. The state of the environment evolves by a stationary Markovian transition model P(X^t | X^{t−1}, a^{t−1}). The observation model is deterministic, i.e., for action idle the agent always observes Xo, and for sensing action ai it observes Xi ∪ Xo.

The reward function, R, enforces our goal to detect change as quickly as possible with a minimal number of costly sensing actions. If the agent executes a sensing action, it receives a positive reward when change has occurred and a negative reward otherwise. The agent also receives a penalty (negative reward) for staying idle while the change has occurred.

Usually, a policy is represented indirectly in terms of a value function V(b) which represents how desirable it is for the agent to be in a particular belief state. The optimal action in belief state b(s) is then chosen as the one that maximizes V(b), which is the sum of immediate rewards plus expected future rewards.

Given this formulation, we can solve the sensing decision problem using general POMDP algorithms, e.g., [Cassandra et al., 1994; Spaan and Vlassis, 2005]. Unfortunately, POMDP algorithms are intractable for environments larger than a few dozen states and a large time horizon (see [Littman, 1996] for a summary of complexity results, and [Murphy, 2000] for an overview of approximate solutions). Recently, point-based methods (e.g., [Spaan and Vlassis, 2005; Poupart, 2005]) provide good approximations of the optimal POMDP policy using sampling. In this paper, we look for efficient algorithms that take advantage of the special structure of our problem. In what follows, we present greedy algorithms to find either the optimal policy or an approximation to the optimal policy and list some cases for which these greedy algorithms work.

3 Sequential Decisions for a Single Object
In this section we provide an efficient algorithm for detecting change of a single sensing object. Our algorithm uses a greedy method to determine whether to sense the object or not at each time step. Every sensing action on the sensing object determines whether a change occurred in the value of the object. We prove that the greedy algorithm provides an optimal policy in our POMDP.

The POMDP model constructed for a single sensing object has two variables in its state space X: (1) a binary variable G ∈ Xh (completely hidden variable) whose value indicates whether a change has occurred at that time or not; (2) a binary sensing variable C ∈ Xs (sensing with a cost) whose value indicates whether a change has occurred since the last sensing action. The graphical structure of our POMDP model for a single variable is represented with a Dynamic Decision Network (DDN) [Russell and Norvig, 2003] in Figure 1, left.

The probability of change of the sensing object is modeled with P(G^t). G^t is independent and identically distributed for t ≥ 0. The value of C is a deterministic function of the value of G and the set of actions performed up to time t (the value of C becomes 1 the first time the change occurs, and it becomes 0 when we sense the object). P(C^t) is calculated as follows:
P(C^t = v | a^{t−1} = sense, C^{t−1} = v′, G^t = v) = 1
P(C^t = 1 | a^{t−1} = idle, C^{t−1} = 1, G^t = v) = 1
P(C^t = v | a^{t−1} = idle, C^{t−1} = 0, G^t = v) = 1
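Under these dynamics, the belief that a change is pending can be tracked in closed form between senses. A minimal Python sketch (the function and parameter names are ours, not from the paper; q stands for the i.i.d. change prior P(G = 1)):

```python
def belief_after_idle_steps(q, k):
    """P(C = 1) after a sensing action followed by k idle steps,
    for an i.i.d. change prior q = P(G = 1). Between senses C is
    absorbing at 1: each idle step, C stays 1 or a fresh change occurs."""
    p = q  # right after sensing, C equals the fresh G, so P(C = 1) = q
    for _ in range(k):
        p = p + (1.0 - p) * q  # idle step: change persists or newly occurs
    return p
```

Since each idle step multiplies the no-change probability by (1 − q), this belief equals 1 − (1 − q)^{k+1}, the quantity Pb(C^t = 1) that the greedy test of this section consumes.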
where v and v′ are either 0 or 1.

The reward function R(s, a) of executing action a in state s = ⟨C = c, G = g⟩ is defined as follows. Note that r > 0 represents the reward of sensing the change, e < 0 represents the cost of sensing while there is no change, and p < 0 represents the penalty of sensing late while there is a change.

R(C = c, G = g, a) =
  r  if c = 1, a = sense;
  e  if c = 0, a = sense;
  p  if c = 1, a = idle;
  0  if c = 0, a = idle.   (1)

We calculate the value function of a belief state b by using the Bellman equation in an infinite horizon with discounted rewards. Note that a belief state b(C = c, G = g) is defined as the probability of being in state s = ⟨C = c, G = g⟩.

V(b) = max_a ( Σ_{s∈S} b(s)R(s, a) + γ Σ_{b′} T(b, a, b′)V(b′) )   (2)

where 0 ≤ γ < 1 is the discount factor and T(b, a, b′) is the transition between belief states after executing action a. In this paper, V always refers to the optimal value function.

We use an equivalent formulation for the value function suitable to our problem. We unwind the recursive term of Formula (2) until the first action sense is executed, i.e., we compute the rewards iteratively up to the first action sense (time x). For simplicity of representation, we define an auxiliary variable ch, where P(ch = t + x) = P(C^t = 0, ..., C^{t+x−1} = 0, C^{t+x} = 1) (i.e., ch denotes the time at which C switches from 0 to 1). Also, we define Pb of any belief state b and any event E as: Pb(E) = Σ_s P(E|s)b(s).

Proposition 3.1 The value function is as follows:
V(b) = max_x [ Pb(C^t = 1) (γ^x r + (1 + γ + ... + γ^{x−1})p)
  + Pb(t < ch ≤ t + x) (γ^x r + E(γ^{ch−t} + ... + γ^{x−1} | t < ch ≤ t + x) p)
  + Pb(t + x < ch) (γ^x e)
  + γ^{x+1} V(b*) ] = max_x f(x)   (3)
where b* is the belief state of the system after executing action sense, i.e., b*(C = v, G = v′) is equal to P(G = v′) if v = v′ and 0 otherwise.

We interpret the formula in Proposition 3.1 as follows. The first term of this formula states that a change has occurred before the current time and we get a penalty for sensing it late in addition to the positive reward r. The second term is the expected reward when the change occurs after the current time but before sensing. The third term corresponds to when change occurs after sensing, so we have to pay the cost for making a sensing action in error. Finally, the last term is the recursive value for actions that we perform after time t + x.

The algorithm (Multiple Sense Multiple Change (MSMC)) for finding the optimal policy is sketched in Figure 1. The algorithm checks at each time step if there is a decrease in the function f by evaluating d = f(1) − f(0). It then senses the object if d < 0. Notice that the algorithm replaces expensive evaluation of V(b) with a much cheaper evaluation of d.

PROCEDURE MSMC(Pb(C^t))
  Pb(C^t) part of the current belief state, Pb(C^{t+1}) computed by prediction.
  1. d ← f(1) − f(0) (Formula (4))
  2. if d < 0 then return sense
  3. else return idle

Figure 1: (left) DDN representing the transition model for one sensing variable C. G indicates whether a change occurs at this time. C represents whether a change has occurred since the last sensing action. a is an action node, and R is the reward. (right) Algorithm for deciding when to sense in case of a single sensing object.

PROCEDURE FPBSTAR()
  1. v0 ← 0, v1 ← VALUE(v0)
  2. while (|v1 − v0| > ε)
     (a) v0 ← v1, v1 ← VALUE(v0)
  3. return v1

PROCEDURE VALUE(v0)
  1. f ← −∞, x ← 0
  2. do
     (a) prevF ← f
     (b) f ← P(G = 1) (γ^x r + (1 + γ + ... + γ^{x−1})p)
         + Σ_{j=1}^{x} P(G = 0)^j P(G = 1) (γ^x r + (γ^j + ... + γ^{x−1})p)
         + (1 − Σ_{j=0}^{x} P(G = 0)^j P(G = 1)) γ^x e + γ^{x+1} v0
     (c) x ← x + 1
     while (f > prevF)
  3. return prevF

Figure 2: Algorithm for computing V(b*), the fixed point of the value function for belief b*.

Computing V(b) requires an x-step progression while computing d just requires a 1-step progression, as shown in the following:

f(1) − f(0) = Pb(C^t = 1) ((γ − 1)r + p)
  + Pb(C^t = 0, C^{t+1} = 1) (γr − e)
  + Pb(C^t = 0, C^{t+1} = 0) (γ − 1)e
  + γ(γ − 1) V(b*)   (4)

We compute V(b*) of Formula (4) as a preprocessing step using a simple value iteration algorithm (Figure 2). This way, we compute V(b*) once and use it in making the decision for every time step. Note that b* is a very special belief state (a sensing action has just occurred), and the sensing object changes with prior probability P(G = 1). Essentially, computing V(b*) is like performing value iteration in MDPs, as we do not consider the effect of observations here. Moreover, the inner loop of computing V(b*) in Figure 2 is polynomial.

The rest of this section verifies the correctness of procedure MSMC (Figure 1, right). First, we show that the probability of ch = t + x is a decreasing function of x. Using this, we prove that f(x) has just one local maximum. Finally, we prove that, to obtain the optimal policy, we decide whether to sense or not at each time step based on just one comparison.
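The procedures of Figures 1 and 2 can be transcribed into executable form directly. The sketch below is our own Python rendering under the paper's model (parameter names q = P(G = 1), gamma, r, p, e are ours): VALUE and FPBSTAR compute the fixed point V(b*), and msmc applies the one-step test d = f(1) − f(0) of Formula (4).

```python
def value(v0, q, gamma, r, p, e):
    """One application of VALUE (Figure 2): increase x until f(x) stops
    growing, and return the last (maximal) f given bootstrap value v0."""
    def geo(a, b):  # gamma^a + ... + gamma^(b-1); empty sum is 0
        return sum(gamma**j for j in range(a, b))
    f, x = float("-inf"), 0
    while True:
        prev_f = f
        f = q * (gamma**x * r + geo(0, x) * p)                   # change already pending
        f += sum((1 - q)**j * q * (gamma**x * r + geo(j, x) * p)
                 for j in range(1, x + 1))                       # first change at step j
        f += (1 - sum((1 - q)**j * q for j in range(x + 1))) * gamma**x * e
        f += gamma**(x + 1) * v0                                 # recursive value after sensing
        if f <= prev_f:
            return prev_f
        x += 1

def fpbstar(q, gamma, r, p, e, eps=1e-9):
    """Fixed point V(b*) by value iteration (procedure FPBSTAR)."""
    v0, v1 = 0.0, value(0.0, q, gamma, r, p, e)
    while abs(v1 - v0) > eps:
        v0, v1 = v1, value(v1, q, gamma, r, p, e)
    return v1

def msmc(pb_c1, q, gamma, r, p, e, v_star):
    """Greedy decision (procedure MSMC): sense iff d = f(1) - f(0) < 0,
    with d from Formula (4). pb_c1 is Pb(C^t = 1); the one-step
    prediction Pb(C^t = 0, C^{t+1} = 1) uses the i.i.d. prior q."""
    d = (pb_c1 * ((gamma - 1) * r + p)
         + (1 - pb_c1) * q * (gamma * r - e)
         + (1 - pb_c1) * (1 - q) * (gamma - 1) * e
         + gamma * (gamma - 1) * v_star)
    return "sense" if d < 0 else "idle"
```

As in the paper, v_star is computed once as preprocessing; each subsequent decision is a constant-time evaluation of d.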
PROCEDURE MOCHA(Pb(c^t | o^{0:t}))
  Pb(c^t | o^{0:t}) part of the current belief state, n number of objects,
  1. for all 0 ≤ i ≤ n
     (a) di ← fi(1) − fi(0)
     (b) if di < 0 then return sense variable i

Figure 4: Algorithm for detecting change of multiple objects

cost) and two pages "Barack Obama" and "Hillary Clinton" as sensing objects (sensing with cost).

At each time step, the agent can choose a subset of objects to sense. We refer to this kind of action as a composite action. Each composite action a^{0:n} is a vector which represents that action ai ∈ {sense, idle} has been performed on the ith object. We assume that the reward function of a composite action a^{0:n}, R(c^{0:n}, g^{0:n}, a^{0:n}), is the sum of the rewards of executing single actions on the corresponding sensing variables: R(c^{0:n}, g^{0:n}, a^{0:n}) = Σ_i Ri(ci, gi, ai), where Ri(ci, gi, ai) is the reward function for a single sensing variable with parameters ri, ei and pi, as in Formula (1).

Again, the optimal restricted policy tree is the one that has the highest value among different restricted policy trees. Still, our algorithm finds the optimal restricted policy tree for each of the sensing variables, and then merges the results to find the optimal restricted composite policy. The algorithm is sketched in Figure 4. The rest of the section verifies the correctness of this algorithm.

The following development is presented with two sensing variables C0 and C1 for simplicity and clarity, but we can extend the results to more sensing variables easily. We use auxiliary variables ch0 and ch1 to denote the time of the change for sensing variables C0 and C1 respectively.

Proposition 5.1 The value function for belief state b is:
V(b) = max_{x,i} [ ...

Theorem 5.2 Let bi be the belief state restricted to the distribution over Ci, Gi; let Vi(bi) be the value function for the single optimal policy given belief bi; also let V(b) be the value function for the optimal composite policy (Proposition 5.1). Then, V(b) = V0(b0) + V1(b1).

PROOF See appendix for the proof.

Like in previous sections, if we prove that Pb(chi) is decreasing, then the value function for single policies has just one maximum, which can be found by the greedy algorithm.

Lemma 5.3 Let Pb(chi = t + x | o^{0:t}) be the probability that object i has changed for the first time after the previous sensing at time t + x, while all the observations up to time t have been perceived. Then, Pb(chi = t + x | o^{0:t}) ≤ Pb(chi = t + y | o^{0:t}) for x ≥ y.

PROOF See appendix for the proof.

Theorem 5.4 Let Vi(bi) = max_x fi(x) be the value function for a single policy in the case of multiple objects. The best policy is to sense variable i at time t iff fi(0) ≥ fi(1).

PROOF By Theorem 5.2: V(b) = V0(b0) + V1(b1). This value function achieves its maximum when both of its terms are at their maximum. Therefore, the best policy is to sense them at their unique maximum. Previous results show that if the value function for a single sensing variable decreases, the function is at its maximum. Consequently, the best policy is to sense that variable at that time.

Like before, to avoid the direct computation of the value function, we calculate fi(1) − fi(0) in Formula (11).

fi(1) − fi(0) = Pb(Ci^t = 1 | o^{0:t}) ((γ − 1)r + p)
  + Pb(Ci^t = 0, Ci^{t+1} = 1 | o^{0:t}) (γr − e)
  + Pb(Ci^t = 0, Ci^{t+1} = 0 | o^{0:t}) (γ − 1)e   (11)
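The per-object test in Figure 4 is the same one-step comparison as in the single-object case, applied to each marginal belief independently. A small Python sketch of one MOCHA step (our own rendering, not the paper's code; we read the per-object test as independent and collect every object whose di is negative, and we assume each di carries the same trailing recursive term γ(γ − 1)Vi(bi*) as Formula (4), with the per-object fixed points v_stars precomputed as in the single-object preprocessing):

```python
def mocha(pb_c1, priors, gamma, rewards, v_stars):
    """One MOCHA step: for each object i compute di = fi(1) - fi(0)
    and return the indices of the variables to sense now.
    pb_c1[i] is Pb(Ci^t = 1 | o^{0:t}); priors[i] is P(Gi = 1);
    rewards[i] is the (ri, pi, ei) triple of Formula (1)."""
    to_sense = []
    for i, (pb, q, (r, p, e), v) in enumerate(zip(pb_c1, priors, rewards, v_stars)):
        d = (pb * ((gamma - 1) * r + p)          # change already pending
             + (1 - pb) * q * (gamma * r - e)    # change at the next step
             + (1 - pb) * (1 - q) * (gamma - 1) * e  # no change yet
             + gamma * (gamma - 1) * v)          # recursive term (assumed, as in (4))
        if d < 0:
            to_sense.append(i)
    return to_sense
```

Each decision is linear in the number of objects, which is what makes the composite policy tractable.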
Figure 5: (left) Value (discounted sum of rewards) of the optimal policy vs. number of states. (right) Running time vs. number of states. Conclusion: our algorithm, MOCHA, returns higher value and is faster than Perseus (10000, 30000 samples) and Witness.

10^5 time steps for each experiment and calculate the value of the policy returned by MOCHA. We have 1 to 8 boolean variables (3 to 6561 states), 1 to 8 sensing actions, and 1 idle action per time step.

We compare MOCHA with Witness and Perseus on this simulated data (transforming input DDNs into POMDPs). Witness is a classic POMDP algorithm and does not scale up to large state spaces. Perseus is a point-based algorithm which provides a good approximation for the optimal policy using sampling.

We report the accuracy and running time of our experiments over random examples in Figure 5. This figure shows that in all the experiments with more than 9 states the returned value for our method is always higher than the values returned by Witness and Perseus. For each experiment we compute the value (discounted sum of rewards) of the policy returned (Perseus has experiments with 10000 and 30000 samples). Each bar in Figure 5, left displays the average value over all examples with the same number of states.

Notice that Witness is optimal for our problem when allowed an infinite tree depth. However, in practice we must limit Witness to a tractable number of steps, so it is natural that the outcome is sub-optimal. In our case, we developed better formulas for our specific scenario, so computation of those formulas is more precise and easier to estimate given the same time horizon. Also, in our experiments we run Witness only once per problem, not re-computing the policy tree after observations are presented. For that reason, in effect Witness's policy tree is shorter than ours after making an observation. This is a caveat in our evaluation method, and we can re-run the evaluation with a deeper tree for Witness, repeating the run of Witness after every observation. We limited the number of executions of Witness because it took a very long time to execute Witness even for a few states (1000 sec for 9 states).

Perseus returns the same value as ours for small problems with 3 and 9 states, but it runs out of memory (uses more than 1Gb of memory) for state spaces larger than 243 states and crashes. MOCHA, on the other hand, requires less than 1Mb even for the largest state space (19683 states).

Figure 5, right shows the running time for computing the optimal policy for 10^5 time steps. It shows that MOCHA is faster than Witness and Perseus. Our preprocessing (computing V(b*)) takes 0.01 seconds. Our algorithm takes less time than Witness for the same tree depth because our tree is pruned very aggressively on one side. Many subtrees are merged because they are all rooted in either b* = (G = 1, C = 1) or b* = (C = 0, G = 0). This way we save computation time whereas Witness explores the entire tree. This suggests that our approximation is better and faster than the state-of-the-art POMDP algorithms.

We use our algorithm to detect changes of Wikipedia pages while minimizing sensing effort and the penalty of delayed updates. Our approach is general and can be applied to any set of WWW pages. We compare our algorithm with both Witness and Perseus. Moreover, we compare the accuracy of these algorithms with the ground truth of changes of WWW pages.

In general, we build a set of factored HMMs from a graph of nodes (e.g., WWW pages) if the graph satisfies the following three conditions: (1) a set of nodes (VC) are always sensible with no additional cost; (2) VC is a vertex cover of the graph (vertex cover of a graph: a subset S of vertices such that each edge has at least one endpoint in S); and (3) the conditional probability of each node in VC can be represented by its adjacent nodes which are not in VC. Once the conditions are satisfied, each node in VC becomes a child node, and its adjacent nodes become parents of the node in the factored HMM, as shown in Figure 6, left. Thus, we can still solve the detect-change problem of the set of factored HMMs obtained from a generic graph of WWW pages.

We gathered log data of 700 days for three Wikipedia pages ('Barack Obama',2 'Hillary Clinton'3 and 'Democratic party'4). Since each page may change at any point of time, we need to discretize time first. We choose a one-hour time step. We use the Democratic party page as an observation to estimate changes of the Obama page and the Clinton page. In this case, the Democratic party page is a vertex cover of the graph of three pages. We build a conditional probability table for the Democratic party page given the other two pages. The priors and conditionals for the Wikipedia pages are trained by counting of events.

Figure 6, right compares the value of MOCHA, Witness and Perseus for this domain. We evaluate the discounted sum of exact rewards achieved by these algorithms (r, p, e are 62, −4, −10). We used these parameters for training to test in both cases. However, we believe that changing the parameters does not affect the result much.

Our algorithm outperforms both Witness and Perseus. In addition, we display the discounted sum of rewards of the ground truth of changes of WWW pages, which knows the change of each page (call it oracle). The reward achieved by oracle is the maximum possible reward for the data set. Note that the results are not sensitive to the input parameters (r, p, e). If we give different rewards, the overall rewards easily become negative numbers, though our algorithm still gives a higher number than the others.

2 http://en.wikipedia.org/wiki/Barack Obama
3 http://en.wikipedia.org/wiki/HillaryRodhamClinton
4 http://en.wikipedia.org/wiki/Democratic Party United States
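The vertex-cover condition (condition (2) of the factored-HMM construction) can be checked mechanically, and "training by counting" for the prior P(G = 1) amounts to a change-frequency estimate over the discretized log. A small illustration under those readings (graph as an edge list; function names are ours):

```python
def is_vertex_cover(edges, vc):
    """Condition (2): every edge has at least one endpoint in vc."""
    vc = set(vc)
    return all(u in vc or v in vc for (u, v) in edges)

def estimate_change_prior(snapshots):
    """Estimate P(G = 1) by counting: the fraction of time steps at
    which consecutive snapshots of a page differ."""
    changes = sum(1 for a, b in zip(snapshots, snapshots[1:]) if a != b)
    return changes / max(len(snapshots) - 1, 1)
```

With the three-page graph of the experiment (edges from the Obama and Clinton pages to the Democratic party page), the Democratic party node alone is a vertex cover, which is why it can serve as the free observation node.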
Figure 6: (left) Transforming the graph between WWW pages to factored HMMs for our algorithm. (right) Value vs. time of Obama and Clinton pages. MOCHA outperforms POMDP algorithms.

7 Conclusion & Future Work
We formalized the problem of detecting change in dynamic systems and showed how it can be cast as a POMDP. We suggested greedy algorithms for sensing decisions in one or more variables. These algorithms use the structure of the problem to their advantage to compute a solution in time polynomial in the time horizon.

Our model, which includes 0-cost sensing in addition to costly sensing, is suitable for WWW sensing where there are WWW pages that are sensed automatically by designer decision. Such pages are, for example, news pages (e.g., NY Times) and streaming media from websites and newsgroups. Other examples include pages with RSS feeds whose changes are detectable with very low cost. Our assumption about perfect sensing, while restrictive, is not far from many real-world systems. WWW pages and images captured by stationary cameras are good examples of those. In other cases, the perfect change-observation assumption is a good approximation; e.g., CT imaging (for detecting growth of tumors or a pneumonia condition) outputs a 3D image which can be easily compared to previous images with only little noise.

There are two important limitations in our current approach. First, we do not know how far an approximation our method is from the oracle (sense at the exact same time that change occurs). Second, we cannot include actions that change the world (e.g., moving actions) in the model yet. The immediate direction for future work is to fix these two limitations. Solving the problem of multiple objects when there is no constraint on the transition relation between sensing variables can also be future work. Another direction is to learn the model. The update frequencies (i.e., P(G)) and the observation model need to be learned. Choosing a relevant observation node is also an important part of the learning.

Acknowledgements
We would like to thank Anhai Doan and Benjamin Liebald for the insightful discussions and their help throughout the progress of this project. We also thank David Forsyth and the anonymous reviewers for their helpful comments. This work was supported by DARPA SRI 27-001253 (PLATO project), NSF CAREER 05-46663, and UIUC/NCSA AESIS 251024 grants.

References
[Auer et al., 2000] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. ECCC, 7(68), 2000.
[Cassandra et al., 1994] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In AAAI, pages 1023–1028, 1994.
[Cho and Garcia-Molina, 2003] J. Cho and H. Garcia-Molina. Effective page refresh policies for web crawlers. ACM Trans. Database Syst., 28(4), 2003.
[Cho, 2001] J. Cho. Crawling the Web: Discovery and Maintenance of Large-Scale Web Data. PhD thesis, 2001.
[Even-Dar et al., 2005] E. Even-Dar, S. M. Kakade, and Y. Mansour. Reinforcement learning in POMDPs. In IJCAI '05, 2005.
[Finkelstein and Markovitch, 2001] L. Finkelstein and S. Markovitch. Optimal schedules for monitoring anytime algorithms. Artif. Intell., 126(1-2):63–108, 2001.
[Hansen and Zilberstein, 2001] E. A. Hansen and S. Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126:139–157, 2001.
[Jaakkola et al., 1994] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In NIPS, volume 7, 1994.
[Jensen et al., 1990] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updating in recursive graphical models by local computation. Computational Statistics Quarterly, 4:269–282, 1990.
[Kaelbling et al., 1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. AIJ, 101:99–134, 1998.
[Kearns et al., 2000] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Proc. NIPS'99, 2000.
[Littman, 1996] M. L. Littman. Algorithms for sequential decision making. PhD thesis, Brown U., 1996.
[Murphy, 2000] K. Murphy. A Survey of POMDP Solution Techniques. Unpublished work, 2000.
[Pandey et al., 2004] S. Pandey, K. Dhamdhere, and C. Olston. WIC: A general-purpose algorithm for monitoring web information sources. In VLDB, pages 360–371, 2004.
[Poupart, 2005] Pascal Poupart. Exploiting Structure to Efficiently Solve Large Scale POMDPs. PhD thesis, University of Toronto, 2005.
[Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–285, 1989.
[Robbins, 1952] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 55, 1952.
[Russell and Norvig, 2003] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall, 2003.
[Spaan and Vlassis, 2005] Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. JAIR, 24:195–220, 2005.
8 Appendix
Proof of Theorem 3.3
We prove by induction over x. Inductive assumption: f(i) ≤ f(0) for i ≤ x − 1. So once we prove f(x) ≤ f(x − 1), we are done.

f(x) − f(x − 1) = Pb(ch ≤ t + x − 1) γ^{x−1} ((γ − 1)r + p)
  + Pb(ch = t + x) γ^{x−1} (γr − e)
  + Pb(t + x < ch) γ^{x−1} (γ − 1)e
  + γ^x (γ − 1) V(b*)

By the assumption we know f(1) − f(0) ≤ 0. Because both the first term and f(1) − f(0) are negative numbers, the proof is done.

Proof of Theorem 4.1
We use c^t and ¬c^t as C^t = 1 and C^t = 0 respectively. In the following formulas k^t is the vector of all variables in Xh and Xo except G^t. According to Figure 3:

Pb(ch = t + x, o^{0:t}) = Pb(¬c^t, ..., ¬c^{t+x−1}, c^{t+x}, o^{0:t})
= Σ_{k^t \ o^t, k^{t+1:t+x}} Pb(k^t \ o^t, ¬c^t, o^{0:t})
  Pb(k^{t+1} | ¬c^{t+1}, ¬c^t, k^t \ o^t, o^t) P(¬g^{t+1}) ...
  Pb(k^{t+x} | c^{t+x}, ¬c^{t+x−1}, k^{t+x−1}) P(g^{t+x})

After a sequence of variable eliminations on k^{t+1:t+x}:

Pb(ch = t + x, o^{0:t}) = P(g^{t+x}) P(¬g^{t+x−1}) ... P(¬g^{t+1}) Σ_{k^t \ o^t} Pb(k^t \ o^t, ¬c^t, o^{0:t})

So Pb(ch = t + x | o^{0:t}) ≤ Pb(ch = t + y | o^{0:t}) for x ≥ y.

Proof of Theorem 5.2
Define V^k(b) as the expected sum of future rewards of the optimal k-step policy (i.e., one in which we stop after k steps). V0^k(b0) and V1^k(b1) can be defined in the same way. Now we prove by induction that

V^k(b) = V0^k(b0) + V1^k(b1)

Induction basis: for k = 0 the formula holds because the expected sum of future rewards of the optimal 0-step policy is 0.

Now, assume by induction that V^k(b) = V0^k(b0) + V1^k(b1) for k ≤ n. Without loss of generality, assume that V^n(b) is maximized when we have to sense object 0 first. Using the induction hypothesis (k − x − 1 < n), we rewrite the recursive term as the sum of value functions of single policies:

Proof of Lemma 5.3
We use c^t and ¬c^t in place of C^t = 1 and C^t = 0 respectively. According to Figure 3,

Pb(ch0 = t + x, o^{0:t}) = Pb(¬c0^t, ..., ¬c0^{t+x−1}, c0^{t+x}, o^{0:t})
= Σ_{C1^{t:t+x}, G1^{t+1:t+x}, o^{t+1:t+x}} Pb(C1^t, ¬c0^t, o^{0:t})
  P(o^{t+1} | G1^{t+1}, ¬g0^{t+1}) P(G1^{t+1}) P(¬g0^{t+1}) P(C1^{t+1} | C1^t, G1^{t+1}) ...
  P(o^{t+x} | G1^{t+x}, g0^{t+x}) P(G1^{t+x}) P(g0^{t+x}) P(C1^{t+x} | C1^{t+x−1}, G1^{t+x})

After performing a sequence of variable eliminations:

Pb(ch0 = t + x, o^{0:t}) = P(g0^{t+x}) P(¬g0^{t+x−1}) ... P(¬g0^{t+1}) Σ_{C1^t} Pb(C1^t, ¬c0^t, o^{0:t})

So Pb(ch0 = t + x | o^{0:t}) ≤ Pb(ch0 = t + y | o^{0:t}) for x ≥ y.