1. The document describes discrete time optimal control and the dynamic programming algorithm (DPA) for finding the optimal policy.
2. The DPA works backwards in time from the final time period to solve for the optimal cost-to-go and control inputs for each time period and state.
3. The principle of optimality states that the tail of an optimal policy is itself optimal for the corresponding tail subproblem, which lets the DPA build the fully optimal policy recursively from the final time period backwards.
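The backward recursion summarized in item 2 can be sketched in a few lines of Python. This is a minimal sketch, not the notes' own implementation: it assumes a deterministic problem (no disturbance), a state set that is the same at every stage, and hypothetical helper names (`backward_dp`, `controls`, `step`, `stage_cost`, `terminal_cost`) chosen only for illustration.

```python
def backward_dp(states, controls, step, stage_cost, terminal_cost, horizon):
    """Deterministic finite-horizon DP (assumption: no disturbance w_k).

    Returns J[k][x] (cost-to-go) and mu[k][x] (minimizing input) as lookup tables.
    """
    J = [dict() for _ in range(horizon + 1)]
    mu = [dict() for _ in range(horizon)]
    # Initialization: J_N(x) = g_N(x)
    for x in states:
        J[horizon][x] = terminal_cost(x)
    # Recursion, proceeding backwards in time: k = N-1, ..., 1, 0
    for k in range(horizon - 1, -1, -1):
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x):
                cost = stage_cost(k, x, u) + J[k + 1][step(k, x, u)]
                if cost < best_cost:  # ties keep the first minimizer found
                    best_u, best_cost = u, cost
            J[k][x] = best_cost
            mu[k][x] = best_u
    return J, mu


# Hypothetical toy problem: integer state in {0,...,3}, input moves it by -1, 0 or +1.
toy_states = range(4)
toy_controls = lambda x: [u for u in (-1, 0, 1) if 0 <= x + u <= 3]
J, mu = backward_dp(
    toy_states,
    toy_controls,
    step=lambda k, x, u: x + u,
    stage_cost=lambda k, x, u: x * x + u * u,
    terminal_cost=lambda x: x * x,
    horizon=2,
)
print(J[0][3], mu[0][3])  # cost-to-go and first input when starting at x0 = 3
```

The tables `J` and `mu` play the role of the memorized cost-to-go and the lookup-table policy; a stochastic version would replace the single successor `step(k, x, u)` by an expectation over the disturbance.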
The DPA (Dynamic Programming Algorithm)

System dynamics.
\[ x_{k+1} = f_k(x_k, u_k, w_k), \qquad k = 0, 1, \dots, N-1 \]
where
• $k \in \mathbb{N}$: discrete time index
• $x_k \in S_k$: state
• $u_k \in U_k(x_k)$: control input
• $w_k \in D_k$: disturbance or noise, potentially $w_k \sim P(\cdot \mid x_k, u_k)$
• $N \in \mathbb{N}$: time horizon
• $f_k$: system dynamics

When $x_k$ can only take a finite # of values, express dynamics by transition probabilities:
\[ x_{k+1} = w_k \ \text{(dynamics)}, \qquad p_{ij}(u, k) = P(w_k = j \mid x_k = i, u_k = u) \ \text{(transition probabilities)} \]
To visualize: transition probability graph (Markov chain).

Additive cost function.
\[ J_\pi(x_0) = \mathop{E}_{w_k}\Big[ \underbrace{g_N(x_N)}_{\text{terminal cost}} + \underbrace{\sum_{k=0}^{N-1} \overbrace{g_k(x_k, u_k, w_k)}^{\text{stage cost } k}}_{\text{accumulated cost}} \Big] \]
The above definitions (system dynamics and additive cost function) form the Basic Problem.

Open loop: decide on control inputs $\{u_0, u_1, \dots, u_{N-1}\}$ before $k = 0$. Stick to it no matter what happens!
Closed loop: wait until time $k$ to make the decision. Does not imply online computation! Basically, requires a mechanism to decide what $u_k$ to apply as a function of $x_k$ ⇒ implies $x_k$ measurable.
Control rule $u_k = \mu_k(x_k)$. This is the mechanism that closes the loop: when at $k$, get $x_k$ ⇒ $u_k = \mu_k(x_k)$. $\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ is a policy or control law. $\Pi = \{\pi_1, \pi_2, \dots\}$ is the set of admissible policies. The optimal policy $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ is s.t. $J_{\pi^*}(x_0) \le J_\pi(x_0)\ \forall \pi \in \Pi$. Then $J^*(x_0) \equiv J_{\pi^*}(x_0)$ is the optimal cost.
DPA objective: find $\pi^*$.

Principle of optimality. Suppose $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ is the optimal policy going from time 0 to time $N-1$. Then the truncated policy $\{\mu_i^*, \mu_{i+1}^*, \dots, \mu_{N-1}^*\}$ is the optimal policy going from time $i$ to time $N-1$.

The DPA. The optimal cost $J^*(x_0)$ and the associated optimal policy $\pi^*$ are given by the last step of the following recursive algorithm, proceeding backwards in time:
Initialization: $J_N(x_N) = g_N(x_N)\ \forall x_N \in S_N$
Recursion: for each $x_k \in S_k$, compute the cost-to-go at state $x_k$:
\[ J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathop{E}_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \big] \]
In the above: memorize each $J_k(x_k)$ and the argmin, $u_k^* = \mu_k^*(x_k)$, for each $x_k$, $k = N-1, \dots, 1, 0$. Finally: $J_0(x_0) = J^*(x_0)$, and the memorized $\mu_k^*$ form the optimal policy $\pi^*$ in the form of a lookup table.

Recasting problems into DPA formulation

1. Time lags:
\[ x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k) \]
Recast: define $y_k = x_{k-1}$, $s_k = u_{k-1}$. Let $\tilde{x}_k = (x_k, y_k, s_k)$, $\tilde{x}_{k+1} = \tilde{f}_k(\tilde{x}_k, u_k, w_k)$:
\[ \underbrace{\begin{bmatrix} x_{k+1} \\ y_{k+1} \\ s_{k+1} \end{bmatrix}}_{= \tilde{x}_{k+1}} = \underbrace{\begin{bmatrix} f_k(x_k, y_k, u_k, s_k, w_k) \\ x_k \\ u_k \end{bmatrix}}_{\tilde{f}_k} \]
Consequently: $u_k = \mu_k(x_k, y_k, s_k)$. This is exactly how discrete-time state space works! Sure, we write $x_{k+1} = A x_k + B u_k$, but in fact $x_{k+1} \to \tilde{x}_{k+1}$, $x_k \to \tilde{x}_k$, so the underlying mechanics is that all the "−1s" are replaced with augmented states.

2. Correlated disturbances: disturbances given by a linear system:
\[ w_k = C_k y_{k+1}, \qquad y_{k+1} = A_k y_k + \xi_k \]
Recast:
\[ \underbrace{\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix}}_{= \tilde{x}_{k+1}} = \underbrace{\begin{bmatrix} f_k(x_k, u_k, C_k(A_k y_k + \xi_k)) \\ A_k y_k + \xi_k \end{bmatrix}}_{\tilde{f}_k} \]
Consequently: $u_k = \mu_k(x_k, y_k)$.

3. Forecasts:
Idea: at time $k$, receive info that $w_k$ has probability distribution $Q_k$. At time $k+1$, $w_{k+1}$ can take on prob. distrib. $Q_i$ with probab. $p_i$; we encode this in $y_{k+1} = \xi_k$ where $P(\xi_k = i) = p_i$; when $\xi_k = i$ then the next disturbance $w_{k+1}$ will be evaluated using prob. distrib. $Q_i$. Augmented state:
\[ \underbrace{\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix}}_{= \tilde{x}_{k+1}} = \underbrace{\begin{bmatrix} f_k(x_k, u_k, w_k \sim Q_{y_k}) \\ \xi_k \end{bmatrix}}_{\tilde{f}_k} \]
Consequently: $u_k = \mu_k(x_k, y_k)$, i.e. the input we apply now depends on the forecast we receive right now.
Then the cost-to-go becomes:
\[ J_k(x_k, y_k) = \min_{u_k} \mathop{E}_{w_k}\Big[ g_k(x_k, u_k, w_k) + \mathop{E}_{\xi_k} J_{k+1}(f_k(x_k, u_k, w_k), \xi_k) \,\Big|\, y_k \Big] = \min_{u_k} \mathop{E}_{w_k}\Big[ g_k(x_k, u_k, w_k) + \sum_{i=1}^{m} p_i J_{k+1}(f_k(x_k, u_k, w_k), i) \,\Big|\, y_k \Big] \]

Viterbi algorithm. Situation: Markov chain with transition probabilities $p_{ij} = P(x_{k+1} = j \mid x_k = i)$ and initial state probability $p(x_0)$, but can only indirectly observe $x$ via measurement $r(z; i, j) = P(\text{meas} = z \mid x_k = i, x_{k+1} = j)\ \forall k$. Objective: given measurements $Z_N = \{z_1, \dots, z_N\}$ find $\hat{X}_N = \{\hat{x}_0, \dots, \hat{x}_N\} = \arg\max_{X_N} P(X_N \mid Z_N)$. Idea:
\[ \begin{aligned} P(X_N, Z_N) &= P(x_0, \dots, x_N, z_1, \dots, z_N) \\ &= P(x_2, \dots, x_N, z_2, \dots, z_N \mid x_0, x_1, z_1)\, P(z_1 \mid x_0, x_1)\, P(x_1 \mid x_0)\, P(x_0) \\ &= P(x_2, \dots, x_N, z_2, \dots, z_N \mid x_0, x_1, z_1)\, r(z_1; x_0, x_1)\, p_{x_0, x_1}\, p(x_0) \end{aligned} \]
Continuing by developing $P(x_2, \dots, x_N, z_2, \dots, z_N \mid x_0, x_1, z_1)$ the same way, we get:
\[ P(X_N, Z_N) = p(x_0) \prod_{k=1}^{N} p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k) \]
Logarithm: maximizing $P(X_N, Z_N)$ is the same as
\[ \min_{X_N} \Big( -\log(p(x_0)) + \sum_{k=1}^{N} -\log\big(p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k)\big) \Big) \]
Can apply DPA with 0 terminal cost and stage cost $-\log(p(x_0))$ for $k = 0$.

Deterministic, finite state systems (shortest path problem)
Assume:
1. $x_k \in S_k$ finite set
2. No disturbance $w_k$, i.e. $P(w_k = 0) = 1$.
3. Only one way to go from state $i \in S_k \to j \in S_{k+1}$ (if multiple, pick the one with lowest cost when at stage $k$).
[Figure: nodes arranged in Stage 1, Stage 2, ..., Stage N]

Shortest Path Problem. $a_{ij}^k = g_k(i, u_k^{ij})$ is the cost to go from state $i \in S_k$ to $j \in S_{k+1}$, with $j = f_k(i, u_k^{ij})$. $a_{ij}^k = \infty$ if going from $i$ to $j$ is impossible. Goal: find the path of least cost from $S$ to $T$.

Shortest Path DPA.
Initialization: $J_N(i) = a_{iT}^N\ \forall i \in S_N$
Recursion: for each $i \in S_k$, compute the cost-to-go at state $i$:
\[ J_k(i) = \min_{j \in S_{k+1}} \big[ a_{ij}^k + J_{k+1}(j) \big], \qquad k = N-1, \dots, 1, 0 \]

Forward DPA (only possible when deterministic, no noise).
Idea: going backwards is the same as going forwards, the "optimal" path cannot change! Possible only when deterministic, because with noise the needed information is not available going forwards.
Initialization: $\tilde{J}_N(j) = a_{Sj}^0\ \forall j \in S_1$
Recursion: for each $j \in S_{N-k+1}$, compute the cost-to-arrive at state $j$:
\[ \tilde{J}_k(j) = \min_{i \in S_{N-k}} \big[ a_{ij}^{N-k} + \tilde{J}_{k+1}(i) \big], \qquad k = N-1, \dots, 1, 0 \]
When finished, the optimal path is the same so: $\tilde{J}_0(T) \equiv J_0(S)$.

Recast shortest path as DPA. Assume all cycles have non-negative cost (otherwise infinite loop). Then, going from $S$ to $T$ will only ever require at most $M := N - 1$ moves (visit all nodes, not counting the starting node; here $N$ is the # of nodes). So we can limit the horizon to $M$. Allow degenerate moves ($a_{ii} = 0$): reach $T$ but can stay at some nodes (effectively reach $T$ in $M - (\#\text{ degen.'s})$ moves).
Call $J_k(i)$ the optimal cost to go $i \to T$ in $M - k$ moves. Then can write:
Recast shortest path as DPA algorithm.
Initialization ($k = M-1$): $J_{M-1}(i) = a_{iT}\ \forall i$. Can be infinite if no direct path $i \to T$!
Recursion: for each $i$, calculate the cost to go to $T$ in $M-k$ moves by going to $j$ in one move and from $j$ to $T$ in $M-k-1$ moves:
\[ J_k(i) = \min_{j} \big[ a_{ij} + J_{k+1}(j) \big], \qquad k = M-2, \dots, 1, 0 \]
Termination criterion: if $k = 0$ or $J_k(i) = J_{k+1}(i)\ \forall i$.

How about an alternative to DP for shortest path? Label-correcting algorithm.
0. Place node $S$ into OPEN, set $d_S = 0$, $d_j = \infty\ \forall j \ne S$.
1. Remove $i$ from OPEN, execute Step 2 $\forall$ children $j$ of $i$.
2. If $d_i + a_{ij} < \min(d_j, d_T)$, set $d_j = d_i + a_{ij}$, set $i$ to be the optimal parent of $j$. If $j \ne T$, place $j$ into OPEN if not already there.
3. If OPEN empty, done. Else, go to Step 1.

Proof of termination. Every time $j$ enters OPEN, its cost is reduced to the current shortest path from $S \to j$. The # of distinct paths from $S \to j$ with cost smaller than any given number is finite since the # of nodes is finite and $a_{ij} \ge 0\ \forall i, j$. Hence, there can be only a finite # of cost reductions ⇒ algorithm terminates in a finite # of steps.

Proof that resulting cost is optimal (or $\infty$ if no path exists). Suppose no path $S \to T$ exists. Then no node $i$ such that $i \to T$ exists can enter OPEN. Thus, $d_T$ will never be reduced from $\infty$. Now suppose a path $S \to T$ exists. Then, since the number of paths with cost smaller than any given number is finite, a shortest path exists. Let $(S, j_1, \dots, j_k, T)$ be this path and its cost $d^*$. Let $d_m$ be the cost of path $(S, j_1, \dots, j_m)$, $m = 1, \dots, k$. For the sake of contradiction, suppose at termination $d_T > d^*$. If so, this must have been true throughout the algorithm. Notably, $d_T > d_m\ \forall m$ (since $a_{ij} \ge 0$). It follows that $j_k$ will never enter OPEN with cost $d_k$, otherwise the next iteration will set $d_T = d^*$. Similarly, $j_{k-1}$ will never enter OPEN with $d_{k-1}$, otherwise in the next iteration we'll have $j_k$ enter with distance set to $d_k$, then $d_T = d^*$ in the iteration after. Going back, it means $j_1$ never enters OPEN with $d_1$. But this happens at the first iteration ⇒ contradiction!

Infinite # nodes. Suppose 1) $a_{ij} \ge 1$, 2) # children finite for each node, 3) $\exists$ path $S \to T$, 4) shortest distance $d_T^*\ (\in \mathbb{N}) \le d_{T,\max} < \infty$. Define the set $R = \bigcup_{i=1}^{d_{T,\max}} S_i$ where $S_i$ is the set of nodes for which the min # of arcs to $S$ is $i$; let's show that $R$ is finite. Induction:
• $S_1$ is finite by 2)
• Assume $S_k$ finite (*)
• $S_{k+1} \subseteq \{i \mid i \text{ child of a node in } S_k\}$. Since $S_k$ finite (*) and each $j \in S_k$ has a finite # of children (2)), $S_{k+1}$ is finite.
So $R$ is finite. Also, any $i \notin R$ never enters OPEN since by 1), the shortest path $S \to i$ is $> d_{T,\max}$. So we are back to a finite graph problem, and can apply the above proof to show that the algorithm terminates with label $d_T = d_T^*$!

NB1: $d_i + a_{ij} < d_T$ assumes that the cost always increases along a path. Won't work with negative arc lengths!
NB2: a child can have multiple parents, only one optimal parent, and multiple children of its own.
Different algorithms around how to select the item in OPEN:
1. LIFO (depth-first): pick the youngest node in the bin (idea: explore as far as possible first).
2. FIFO (breadth-first): pick the oldest node in the bin (idea: stockpile many nodes into OPEN).
3. Dijkstra (best-first): pick the node with the lowest current cost (idea: explore the most promising nodes).

A* Algorithm.
Replace the test $d_i + a_{ij} < d_T$ in the label-correcting algo. by $d_i + a_{ij} + h_j < d_T$ where $h_j$ is a lower bound on the cost from $j \to T$, i.e. $d_{jT} \ge h_j$.
Idea: prevent adding $j$ into OPEN if, even in the best-case scenario (lower bound $h_j$), going from $j$ to $T$ will clearly cost more than the already-available $d_T$.

Multi-objective problems
Idea: multiple costs you care about, but don't know yet how to "combine" them into one cost. How to narrow down the # of possible optimal control policies?
[Figure: cost #1 (time) vs. cost #2 (fuel), marking inferior points]
Vector of costs $x = (x_1, x_2, \dots, x_M)$ is inferior if $\exists y$ s.t. $y_l \le x_l$, $l = 1, 2, \dots, M$, with strict inequality for at least one $l$. If no such $y$ exists, $x$ is non-inferior.
Suppose you have $M$ cost functions $f_1(x), f_2(x), \dots, f_M(x)$ with $x \in X$ a vector. $x \in S$ is called a non-inferior solution if $(f_1(x), f_2(x), \dots, f_M(x))$ is a non-inferior vector of the set $\{(f_1(y), f_2(y), \dots, f_M(y)) \mid y \in S\}$. Each cost $f_l$ is computed using deterministic DPA:
\[ x_{k+1} = f_k(x_k, u_k), \qquad f_l(u) = g_N^l(x_N) + \sum_{k=0}^{N-1} g_k^l(x_k, u_k) \]
Idea: find all non-inferior solutions, store them and later decide which one to use (based on some criteria).
Extended principle of optimality: if $\{u_k, u_{k+1}, \dots, u_{N-1}\}$ is a non-inferior control sequence starting at $x_k$ at time $k$, then $\{u_{k+1}, \dots, u_{N-1}\}$ is a non-inferior control sequence starting at $f_k(x_k, u_k)$ at time $k+1$.
Multi-objective problem algorithm (***). Call $F_k(x_k)$ the set of non-inferior $M$-tuples of costs to go at $x_k$. The algorithm is then:
Initialization: $F_N(x_N) = \{(g_N^1(x_N), g_N^2(x_N), \dots, g_N^M(x_N))\}\ \forall x_N \in S_N$.
Recursion: for each $x_k \in S_k$, compute the non-inferior set at state $x_k$:
\[ F_k(x_k) = \mathop{\mathrm{noninf}}_{u_k \in U_k(x_k)} \Big\{ \big(g_k^1(x_k, u_k) + c_1, \dots, g_k^M(x_k, u_k) + c_M\big) \,\Big|\, (c_1, \dots, c_M) \in F_{k+1}(f_k(x_k, u_k)) \Big\} \]
This search is hard: sometimes brute force, explore all $u_k$ combinations to find the non-inferior $M$-tuples!
At the final iteration we'll get $F_0(x_0)$, the set of non-inferior solutions (the associated non-inferior control policies will have had to be memorized when doing the algorithm!). Later use some criterion to pick out the one that we'll use (e.g. linear combination of each $M$-tuple, pick the smallest one)!
The above algorithm degenerates to DPA when $M = 1$, since in this case non-inferior ≡ minimal.
[Figure: for M = 1, the black point is non-inferior (≡ minimal, DPA!) and all red points are inferior]

Handling constraints
Consider the problem:
\[ x_{k+1} = f_k(x_k, u_k) \]
\[ J_\pi = g_N^1(x_N) + \sum_{k=0}^{N-1} g_k^1(x_k, u_k) \quad \text{(cost)} \]
subject to
\[ g_N^l(x_N) + \sum_{k=0}^{N-1} g_k^l(x_k, u_k) \le b_l, \qquad l = 2, \dots, M \quad \text{(constraints)} \]
Solution: calculate $F_0(x_0)$ with algo. (***) above. Then throw away any elements (i.e. control policies) that don't satisfy the constraints and, from what's left, pick the one with the smallest cost (i.e. smallest $J_\pi$).
Can we do better? Yes if the pbm. is deterministic (no noise). Do the following:
1. Compute $\tilde{J}_k^l(x_k)$, the optimal cost to arrive at $x_k$, for $l = 2, \dots, M$ using forward DP on cost $l$, individually. Forms a lower bound for cost $l$, "lower" because it may be violating other costs.
2. Apply algo. (***) but in Recursion, on top of the non-inferior check, also use an "A* algo. in reverse" type check:
\[ \underbrace{\tilde{J}_k^l(x_k)}_{\text{going } 0 \to k} + \underbrace{g_k^l(x_k, u_k) + c^l}_{\text{going } k \to N} \le b_l, \qquad l = 2, \dots, M \]
At time $k = 0$, you'll have only the feasible, non-inferior control sequences.

Infinite-horizon problems
Time-invariant system:
\[ x_{k+1} = f(x_k, u_k, w_k), \qquad x_k \in S,\ u_k \in U,\ w_k \sim P(\cdot \mid x_k, u_k) \]
Notice: no $k$ subscript for $f$, $S$ and $U$!
Cost:
\[ J_\pi(x_0) = \lim_{N \to \infty} E\Big[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k) \Big] \]
No terminal cost! $g$ time-inv.!
Idea: lose the notion of time, obtain a static policy $u = \mu(x)$ using the Bellman equation:
\[ J^*(x) = \min_{u} \mathop{E}_{w}\big[ g(x, u, w) + J^*(f(x, u, w)) \big] \qquad \forall x \in S \]

Stochastic shortest-path problems (SSPP)
Idea: a problem with infinite horizon, no noise but state transitions uncertain (only probability known). Want to control to optimally reach the "terminal state". Dynamics:
\[ x_{k+1} = w_k, \qquad x_k \in S = \{1, 2, \dots, n, t\} \ \text{finite set!} \]
\[ P(w_k = j \mid x_k = i, u_k = u) = p_{ij}(u), \qquad u_k \in U(i) \ \text{finite set!} \]
Special cost-free termination state (think: "destination state"):
1. $p_{tt}(u) = 1\ \forall u \in U(t)$ ("once at $t$, always at $t$").
2. $g(t, u) = 0\ \forall u \in U(t)$ ("no cost being at $t$").
Policy stationary: $\pi = \{\mu, \mu, \dots\} \equiv \mu$. $\mu$ is optimal if $J_\mu(i) = J^*(i) = \min_\pi J_\pi(i)$ (optimal cost) where the cost is computed as:
\[ J_\pi(i) = \lim_{N \to \infty} E\Big[ \sum_{k=0}^{N-1} g(x_k, \mu(x_k)) \,\Big|\, x_0 = i \Big] \tag{1} \]
Main results:
(A). Given any initial conditions $J_0(1), \dots, J_0(n)$ (pick them!) the sequence
\[ J_{k+1}(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \Big] \qquad \forall i \in S \setminus t = \{1, 2, \dots, n\} \]
converges to the optimal cost $J^*(i)$ for each $i$. Note: subscript $k$ in $J_k$ is not "stage $k$", rather "current iteration number" in the sequence!
(B). Optimal cost satisfies Bellman's Equation (version for stochastic shortest path problems):
\[ J^*(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \Big] \qquad \forall i = 1, \dots, n \tag{2} \]
which has a unique solution.
(C). For any static policy $\mu$, the costs $J_\mu(i)$ are the unique solutions of:
\[ J(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J(j), \qquad i \in S \setminus t = \{1, 2, \dots, n\} \]
Furthermore, given any initial conditions $J_0(1), J_0(2), \dots, J_0(n)$ the sequence:
\[ J_{k+1}(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J_k(j) \]
converges to $J_\mu(i)$ (cost starting from state $i$ given policy $\mu$) for each $i$. Proof: take (A) but restrict $U(i) = \{\mu(i)\}$, so no min ⇒ just take $u = \mu(i)$. Convergence to $J_\mu(i)$ is assured by (B), with $J_\mu(i)$ not $J^*(i)$ since we only consider the single policy $\mu$.

How to solve Bellman's equation (2)?
Method 1: Value Iteration (VI). Basically, use (A).
Step 1: Choose initial costs $J_0(i)$, $i = 1, \dots, n$ (could guess, could be smart).
Step 2: Iterate by solving
\[ J_{k+1}(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \Big] \qquad \forall i = 1, 2, \dots, n \tag{3} \]
until it converges: $|J_{k+1}(i) - J_k(i)| < \varepsilon\ \forall i$. Problem: NOT guaranteed that the policy has converged!
Complexity of VI: Step 2 is a minimization over $p$ possibilities of $u$ (that each take $n$ multiplications to compute because of the sum), done $n$ times (for each $i$) ⇒ $O(pn^2)$.

Method 2: Policy Iteration (PI). Basically, use (C). Let $i \in S \setminus t = \{1, 2, \dots, n\}$.
Initialize: Choose initial policy $\mu^0(i)\ \forall i$ (can be anything). Set $k = 0$.
Stage 1: Given $\mu^k$ (static policy at iteration $k$, not "time" $k$; no such thing exists here) obtain $J_{\mu^k}(i)$ by solving:
\[ J(i) = g(i, \mu^k(i)) + \sum_{j=1}^{n} p_{ij}(\mu^k(i)) J(j) \qquad \text{for each } i \tag{4} \]
This is a linear problem with $n$ eq.'s, $n$ unknowns. Solve as $Ax = b \Rightarrow x = A^{-1} b$ where $x = [J_{\mu^k}(1); J_{\mu^k}(2); \dots; J_{\mu^k}(n)]$.
Stage 2: Improve the policy by iterating
\[ \mu^{k+1}(i) = \arg\min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_{\mu^k}(j) \Big] \qquad \forall i \]
and quit when $J_{\mu^{k+1}}(i) = J_{\mu^k}(i)\ \forall i$ (an exact stopping criterion, no tolerance needed).
NB: STOP ⇔ $J_{\mu^{k+1}}(i) = J_{\mu^k}(i)$ or $\mu^{k+1}(i) = \mu^k(i)\ \forall i$. If you get multiple minimizers for $\mu^{k+1}(i)$, including some identical to $\mu^k(i)$, you must continue to iterate! Only guarantee: will converge eventually for whatever choice!
Extra "Method": Exhaustive Search. Enumerate all policy combinations and for each policy find $J(i)$, $i = 1, \dots, n$ using (4). The optimal policy is the one giving the smallest (if taking min) vector $[J(1); \dots; J(n)]$.
Complexity of PI: Stage 2 is like VI, so $O(pn^2)$. Stage 1: solve linear system, so $O(n^3)$ ⇒ $O(n^2(n + p))$. Worst case: search over all policy combinations, $p^n$ combinations.
Rewrite VI (first line) as PI-like steps (steps 1 and 2):
\[ J_{k+1}(i) = \min_{u} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \Big] \ \to\ \begin{cases} 1)\ \mu^k(i) = \arg\min_u \big[ g(i, u) + \sum_j p_{ij}(u) J_k(j) \big] \\ 2)\ J_{k+1}(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i)) J_k(j) \end{cases} \]
PI solves a linear system → same as running the value update $\infty$ # of times. VI: only does it once.

For the next solution method, notice that if we pick as initial condition:
\[ J_0(i) \le \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_0(j) \Big] \qquad \forall i = 1, \dots, n \]
then by (3), $J_0(i) \le J_1(i)\ \forall i$. Can show this means: $J_k(i) \le J_{k+1}(i)\ \forall i, k$. From VI we know that $J_k \to J^*$, so $J_0(i) \le J_k(i) \le J^*(i)\ \forall i, k$. Can do:
Method 3: Linear Programming (LP). Find $J$ by solving the following linear program:
\[ \max \sum_{i=1}^{n} J(i) \quad \text{subject to} \quad J(i) \le g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j) \qquad \forall i,\ \forall u \in U(i) \]
The solution must be $J^*$ since $J(i) \le J^*(i)\ \forall i$ and $J^*$ satisfies the constraints.

Discounted problems. Discounted problem cost function:
\[ J_\pi(i) = \lim_{N \to \infty} E\Big[ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu(x_k)) \,\Big|\, x_0 = i \Big], \qquad 0 < \alpha < 1,\ i \in \{1, \dots, n\} \]
Advantage: no explicit termination state required! Use when your problem "doesn't have an end".
Bellman's Equation for the discounted problem:
\[ J^*(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} (\alpha \cdot p_{ij}(u)) J^*(j) \Big] \qquad \forall i = 1, \dots, n \]
Can then use VI, PI or LP as before using $\alpha \cdot p_{ij}(u)$ as trans. prob.'s!

Markov Decision Processes (MDP). Idea: the stage cost in (1) is $E_{x_k}(g(x_k, \mu(x_k)))$; in an MDP replace this with $E_{x_k, x_{k+1}}(g(x_k, x_{k+1}))$. However:
\[ \mathop{E}_{x_k, x_{k+1}}\big(g(x_k, x_{k+1})\big) = \mathop{E}_{x_k}\Big( \mathop{E}_{x_{k+1}}\big(g(x_k, x_{k+1}) \mid x_k\big) \Big) = \mathop{E}_{x_k}\Big( \sum_{j} g(x_k, j)\, p_{x_k j}(\mu(x_k)) \Big) \]
Then, can use the inner sum $\sum_j g(x_k, j)\, p_{x_k j}(\mu(x_k))$ as $g(x_k, \mu(x_k))$ and apply VI/PI/LP as before!

2: Continuous Time Optimal Control

Notation
• $\dfrac{\partial F(t, x)}{\partial t}$: partial derivative
• $\dfrac{\partial F(t, x(t))}{\partial t} = \dfrac{\partial F(t, x)}{\partial t}\Big|_{x = x(t)}$: partial derivative (shorthand)
• $\dfrac{dF(t, x(t))}{dt} = \dfrac{\partial F(t, x(t))}{\partial t} + \dfrac{\partial F(t, x)}{\partial x}\Big|_{x = x(t)} \dfrac{\partial x(t)}{\partial t}$: total derivative

The Hamilton Jacobi Bellman (HJB) Equation
System dynamics:
\[ \dot{x}(t) = f(x(t), u(t)), \quad 0 \le t \le T, \qquad x(0) = x_0 \quad \text{(no noise!)} \]
where
• $x(t) \in \mathbb{R}^n$: state
• $u(t) \in U \subset \mathbb{R}^m$: control, with $U$ the control constraint set
• $t \in \mathbb{R}$: time, with $T$ the terminal time
Require: $f(\cdot, u(t))$ continuously differentiable, $f(x(t), \cdot)$ continuous, $u(t)$ piecewise continuous, and assume the solution exists and is unique on $0 \le t \le T$ (very heavy assumption!).
Objective: minimize the following cost function:
\[ \text{Cost} = h(x(T)) + \int_0^T g(x(t), u(t))\, dt \]
where $g(\cdot, u(t))$ and $h(\cdot)$ are continuously differentiable, $g(x(t), \cdot)$ continuous.
Hamilton Jacobi Bellman (HJB) equation:
\[ 0 = \min_{u \in U} \Big[ g(x, u) + \frac{\partial V(t, x)}{\partial t} + \frac{\partial V(t, x)}{\partial x} f(x, u) \Big] \quad \forall t, x \qquad \text{(HJB)} \]
\[ V(T, x) = h(x) \qquad \text{(boundary condition)} \]
The $u = \mu(t, x)$ minimizing the RHS of HJB is the optimal policy and the resulting $V(t, x) := J^*(t, x)$, i.e. it is the optimal cost (NB: must verify the HJB and the boundary condition!). NB: HJB is a sufficient but not necessary optimality condition!

LQR controller
Cost function (for the linear system $\dot{x} = Ax + Bu$):
\[ \text{Cost} = x^T(T) Q_T x(T) + \int_0^T \big( x^T(t) Q x(t) + u^T(t) R u(t) \big)\, dt \]
with $Q_T = Q_T^T \ge 0$, $Q = Q^T \ge 0$, $R = R^T > 0$. Assume:
• Cost-to-go is quadratic: $V(t, x) = x^T K(t) x$
• $K(t) = K^T(t)$
Optimal policy: $\mu(t, x) = -R^{-1} B^T K(t) x(t)$, where $K(t)$ is found from the Continuous Time Riccati Differential Equation:
\[ \dot{K}(t) = -K(t) A - A^T K(t) + K(t) B R^{-1} B^T K(t) - Q, \qquad K(T) = Q_T \]
For an infinite horizon problem ($Q_T = 0$, $T = \infty$): $\mu(x) = -R^{-1} B^T K x := -F x$ where $K$ is found from the Algebraic Riccati Equation:
\[ 0 = -KA - A^T K + K B R^{-1} B^T K - Q \]

Pontryagin's Minimum Principle
Lemma. $F(t, x, u)$ continuously differentiable, $U$ convex and $\mu^*(t, x) := \arg\min_{u \in U} F(t, x, u)$ continuously differentiable. Then:
1. $\dfrac{\partial \min_{u \in U} F(t, x, u)}{\partial t} = \dfrac{\partial F(t, x, \mu^*(t, x))}{\partial t} \quad \forall t, x$
2. $\dfrac{\partial \min_{u \in U} F(t, x, u)}{\partial x} = \dfrac{\partial F(t, x, \mu^*(t, x))}{\partial x} \quad \forall t, x$

Pontryagin's Minimum Principle.
Define: $H(x, u, p) := g(x, u) + p^T f(x, u)$, the Hamiltonian.
Co-state equation: $p(t) = \dfrac{\partial J^*(t, x^*(t))}{\partial x}$
$u^*(t)$ := optimal control and $x^*(t)$ := resulting state trajectory.
Statement: $u^*(t)$, $x^*(t)$ and co-state $p(t)$ must minimize $H(x, u, p)$ (maximize if the original problem seeks to maximize the cost!).
Resulting conditions (≡ free term. state with cost $h(x(T))$ pbm.):
\[ \begin{aligned} &1)\ \dot{x}^*(t) = \frac{\partial H}{\partial p}(x^*(t), u^*(t), p(t)), && x^*(0) = x_0 \\ &2)\ \text{Adjoint eq.'s: } \dot{p}(t) = -\frac{\partial H}{\partial x}(x^*(t), u^*(t), p(t)), && p(T) = \frac{\partial h(x^*(T))}{\partial x} \\ &3)\ u^*(t) = \arg\min_{u \in U} H(x^*(t), u, p(t)) \\ &4)\ H(x^*(t), u^*(t), p(t)) = \text{const.}\ \ \forall t \in [0, T] \ \text{(if time-invariant } f, g) \end{aligned} \tag{5} \]
The full set of equations/ICs (5) := Pontryagin's necessary conditions for optimality. When writing in the exam: in condition 3), already show the minimization's result (i.e. $u^* = \{\cdots$)!
NB: when multiple states, write 2) for each state individually, i.e. $\dot{p}_i = -\dfrac{\partial H}{\partial x_i}$.
NB: when more than 1 input, write 3) for each input individually ($u_i^*(t) = \arg\min_{u_i \in U_i} H$).
NB: when solving cond. 1) using Laplace, substitute $x_1(0)$, $\dot{x}_1(0)$, etc. by $A, B, \dots$ and use the available ICs to solve for $A, B, \dots$ Always possible! Don't "assume" e.g. $\dot{x}_i(0) = 0$ if not given! Same for cond. 2) (e.g. don't assume $\dot{p}_j(0) = 0$).

Time-varying $\dot{x} = f(x, u, t)$. Everything the same, but $H \ne$ const. Idea: augment the state, $z = [x; y] = [x; t] \Rightarrow \dot{z} = [f(x, u, y); 1]$. Then, $\tilde{H} = g(x, u, y) + p_1^T f(x, u, y) + p_2 = H(x, u, p, y) + p_2$. Since $\dot{p}_2 = -\partial \tilde{H} / \partial y \ne 0$ generally and since $\tilde{H} = \text{const.}$ ⇒ $H(x, u, p, y) = H(x, u, p, t) \ne \text{const.}$!

Warning: the Minimum Principle is necessary but not sufficient for optimality; it is necessary and sufficient when $f(x, u)$ is linear and $U, h, g$ are convex! Also, since it is necessary, if only one solution satisfies it, then that solution is optimal (provided an optimal solution exists)!

Practical implementation:
• HJB: if you can solve it, you will obtain $u = \mu(x)$, your feedback controller!
• Pontryagin: can obtain one "optimal trajectory" (for given initial/final conditions and cost). Use another closed-loop feedback (e.g. PID) to track the trajectory or solve for the trajectory/optimal input online (e.g. MPC).

Minimum Principle applications (drop the $*$ notation for brevity)

Fixed terminal state $x(T) = x_T$.
\[ \text{Cost} = \int_0^T g(x(t), u(t))\, dt \]
Conditions 1 and 2 (3 and 4 same as in (5)):
\[ 1)\ \dot{x}(t) = f(x(t), u(t)), \quad x(0) = x_0,\ x(T) = x_T \qquad 2)\ \dot{p}(t) = -\frac{\partial H}{\partial x}(x(t), u(t), p(t)) \]

Free initial state, with cost $l(x(0))$.
\[ \text{Cost} = l(x(0)) + \int_0^T g(x(t), u(t))\, dt \]
Conditions 1 and 2 (3 and 4 same as in (5)):
\[ 1)\ \dot{x}(t) = f(x(t), u(t)), \quad x(T) = x_T \qquad 2)\ \dot{p}(t) = -\frac{\partial H}{\partial x}(x(t), u(t), p(t)), \quad p(0) = -\frac{\partial l(x(0))}{\partial x} \]

Free initial and terminal states. Conditions 1 and 2:
\[ 1)\ \dot{x}(t) = f(x(t), u(t)) \qquad 2)\ \dot{p}(t) = -\frac{\partial H}{\partial x}(x(t), u(t), p(t)), \quad p(0) = -\frac{\partial l(x(0))}{\partial x},\ p(T) = \frac{\partial h(x(T))}{\partial x} \]

Free terminal time.
\[ \text{Cost} = h(x(T)) + \int_0^T g(x(t), u(t))\, dt \]
Conditions 1, 2 and 3 same as in (5); 4) $H(x(t), u(t), p(t)) = 0\ \forall t \in [0, T]$.

Singular problems
A problem is singular if the Hamiltonian is not a function of $u$ for a non-trivial time interval ⇒ cannot use $u^* = \arg\min(H)$ to find the optimal input during that time interval!
Example situation:
\[ u(t) = \begin{cases} 1 & \text{if } p > 0 \\ -1 & \text{if } p < 0 \\ \text{???} & \text{if } p = 0 \ \text{("singular arc candidate")} \end{cases} \]
Solution: $p = 0$ for a non-trivial time requires $\dot{p} = 0$ for a non-trivial time. Using condition 2), write what this requires ($\dot{p} = (\text{condition 2}) = 0$; what does this imply for the input $u$?). If this condition can hold for a non-trivial time ⇒ then this is the ??? you are looking for! If it cannot hold for a non-trivial time, then no need to consider the $p = 0$ case!
NB: if multiple inputs $u_i$, in a singular arc you can only control the "current" input $u$ being considered, not the others! Nothing is known about the others ⇒ cannot set non-trivial time conditions on them.

Constraints
Equality constraints. $\min(\text{Cost})$ subject to $\dot{x}(t) = f(x(t), u(t), t)$, $g(t, x(t), u(t)) = c$. Form the Lagrangian:
\[ L = H(x(t), u(t), t) + \lambda(t)\big(g(t, x(t), u(t)) - c\big) \]
where the Lagrange multiplier $\lambda(t)$ is generally time-variant. Continue as before using $L$ instead of $H$ and require additionally $\partial L / \partial \lambda = 0$. NB: preferable to just use substitution to eliminate the constrained variables!
Inequality constraints. $\min(\text{Cost})$ subject to $\dot{x}(t) = f(x(t), u(t), t)$, $g(t, x(t), u(t)) \le c$. Form $L$ as above and consider two cases: constraint inactive ($\lambda = 0$) and constraint active ($\lambda \ne 0$). For the inactive case, solve the unconstrained problem → if the constraint is violated, then the constraint must be active, so you solve a $g(\cdot) = c$ equality constraint problem as above.

3: Extra

Laplace transforms ($\varepsilon$ is the unit step):
• $\varepsilon(t - c) \to \dfrac{e^{-cs}}{s}$ • $\delta(t - c) \to e^{-cs}$ • $e^{at} \to \dfrac{1}{s - a}$ • $t^n \to \dfrac{n!}{s^{n+1}}$
• $\sin(at) \to \dfrac{a}{s^2 + a^2}$ • $\cos(at) \to \dfrac{s}{s^2 + a^2}$ • $t \sin(at) \to \dfrac{2as}{(s^2 + a^2)^2}$ • $t \cos(at) \to \dfrac{s^2 - a^2}{(s^2 + a^2)^2}$
• $\sinh(at) \to \dfrac{a}{s^2 - a^2}$ • $\cosh(at) \to \dfrac{s}{s^2 - a^2}$
• $e^{at} \sin(bt) \to \dfrac{b}{(s - a)^2 + b^2}$ • $e^{at} \cos(bt) \to \dfrac{s - a}{(s - a)^2 + b^2}$
• $e^{at} \sinh(bt) \to \dfrac{b}{(s - a)^2 - b^2}$ • $e^{at} \cosh(bt) \to \dfrac{s - a}{(s - a)^2 - b^2}$
• $t^n e^{at} \to \dfrac{n!}{(s - a)^{n+1}}$, $n = 1, 2, \dots$ • $f^{(n)}(t) \to s^n F(s) - s^{n-1} f(0) - \cdots - f^{(n-1)}(0)$
NB: watch out for singularities, e.g. when decomposing fractions, coefficients with 1/(something that could be 0) may appear! Then these singularities have to be treated separately!

Calculus:
• $\dfrac{d}{dx} \operatorname{asin}(x) = \dfrac{1}{\sqrt{1 - x^2}}$ • $\dfrac{d}{dx} \operatorname{acos}(x) = -\dfrac{1}{\sqrt{1 - x^2}}$ • $\dfrac{d}{dx} \operatorname{atan}(x) = \dfrac{1}{1 + x^2}$

Discrete-time LQR. Infinite horizon, LTI system with quadratic cost:
\[ x_{k+1} = A x_k + B u_k, \qquad k = 0, 1, \dots \]
\[ \text{Cost} = \sum_{k=0}^{\infty} \big( x_k^T Q x_k + u_k^T R u_k \big), \qquad R = R^T > 0,\ Q = Q^T \ge 0 \]
Conjecture: same as for continuous time LQR:
• Cost-to-go is quadratic: $V(x) = x^T K x$ (and time-invariant, like infinite-horizon pbm's!)
• $K = K^T \ge 0$
Results (using HJB, same as for continuous-time LQR):
1. Optimal cost to go: $J(x) = x^T K x$
2. Optimal feedback strategy: $u = F x$
\[ F := -(R + B^T K B)^{-1} B^T K A \]
\[ K := A^T \big( K - K B (R + B^T K B)^{-1} B^T K \big) A + Q, \qquad K \ge 0 \]

Trig identities:
• $\sin(u \pm v) = \sin(u) \cos(v) \pm \cos(u) \sin(v)$
• $\cos(u \pm v) = \cos(u) \cos(v) \mp \sin(u) \sin(v)$
• $\tan(u \pm v) = (\tan(u) \pm \tan(v)) / (1 \mp \tan(u) \tan(v))$
• $\sin(2u) = 2 \sin(u) \cos(u)$
• $\cos(2u) = \cos^2(u) - \sin^2(u) = 2 \cos^2(u) - 1 = 1 - 2 \sin^2(u)$
• $\sinh(x) = (e^x - e^{-x})/2$ • $\cosh(x) = (e^x + e^{-x})/2$ • $\tanh(x) = \sinh(x) / \cosh(x)$
• $\cosh^2(x) - \sinh^2(x) = 1$ • $\operatorname{csch}(x) = 1/\sinh(x)$ • $\operatorname{sech}(x) = 1/\cosh(x)$ • $\coth(x) = 1/\tanh(x)$

Complexity. DPA has a complexity $O(|U| \cdot |S| \cdot N)$ where $|U|$ is the cardinality (# of elements) of the control space $U$, $|S|$ the cardinality of the state space and $N$ the time horizon. A "brute force" approach has a complexity $\equiv \sum_{k=0}^{N} \binom{N}{k} = 2^N$ (cardinality of the power set) where $N$ is the # of choices.
Combin. (order doesn't matter): $\dbinom{n}{k} = \dfrac{n!}{k!(n-k)!}$. Permut. (order matters): $P(n, k) = \dfrac{n!}{(n-k)!}$.
NB: $e \approx 2.7183$, $\ln(2) \approx 0.6931$.
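The discrete-time LQR fixed-point equation for K given in these notes can be solved numerically by simply iterating it. The following is a minimal sketch under stated assumptions, not a reference implementation: it treats the scalar case only (A, B, Q, R are plain numbers), the function name `dlqr_scalar`, the starting guess K = Q, and the tolerance are all choices made here for illustration.

```python
def dlqr_scalar(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Iterate K <- A^T (K - K B (R + B^T K B)^-1 B^T K) A + Q for scalars.

    Returns (K, F) with cost-to-go J(x) = K * x**2 and feedback u = F * x.
    Assumes r > 0 and q >= 0, as required in the notes.
    """
    K = q  # assumed starting guess; the notes do not prescribe one
    for _ in range(max_iter):
        gain = 1.0 / (r + b * K * b)              # (R + B^T K B)^-1
        K_next = a * (K - K * b * gain * b * K) * a + q
        converged = abs(K_next - K) < tol
        K = K_next
        if converged:
            break
    F = -(1.0 / (r + b * K * b)) * b * K * a      # F := -(R + B^T K B)^-1 B^T K A
    return K, F


# Hypothetical example: open-loop unstable plant x_{k+1} = 1.2 x_k + u_k.
K, F = dlqr_scalar(a=1.2, b=1.0, q=1.0, r=1.0)
print(K, F, 1.2 + 1.0 * F)  # last value is the closed-loop pole a + b*F
```

For matrix-valued problems the same recursion applies with matrix products and a linear solve in place of the scalar division; a dedicated Riccati solver from a numerical library would normally be preferred there.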