RL Cheat Sheet

Definitions
• State (S): the current condition of the environment.
• Reward (R): the instant return from the environment, used to appraise the last action.
• Value (V): the expected long-term reward with discount, as opposed to the short-term reward R.
• Action-Value (Q): similar to Value, except that it takes an extra parameter, the action A.
• Policy (π): the approach the agent uses to determine the next action based on the current state.
• Exploitation: using already known information to maximize reward.
• Exploration: trying new actions to capture more information about the environment.

Value Function V(s)
• The long-term value of being in state s.
• The state value function V(s) of an MRP is the expected return from state s.
• V(s) = E[G_t | S_t = s]

Action Value Function q(s, a)
• q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Bellman Equation
• V(s) = R(s) + γ E_{s'}[V(s')]
• V(s) = R(s) + γ Σ_{s'∈S} P_ss' V(s')
• In matrix form: V = R + γPV, so V = (I − γP)⁻¹ R
Discount Factor (γ)
• Varies between 0 and 1.
• Closer to 0 → the agent tends to favour immediate rewards.
• Closer to 1 → the agent tends to weight future rewards more heavily.

Markov Process
• Consists of a <S, P> tuple, where S is the set of states and P is the state transition matrix.
• P_ss' = P(S_{t+1} = s' | S_t = s)
• The state distribution evolves as μ_{t+1} = Pᵀ μ_t, where μ_t = [μ_{t,1} … μ_{t,n}]ᵀ.
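A minimal sketch of the distribution update μ_{t+1} = Pᵀ μ_t, reusing the hypothetical transition matrix from the sketch above:

```python
import numpy as np

# Same hypothetical 3-state transition matrix as above.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

mu = np.array([1.0, 0.0, 0.0])   # start in state 0 with probability 1
for _ in range(50):
    mu = P.T @ mu                # mu_{t+1} = P^T mu_t
print(mu)                        # converges toward the stationary distribution
```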
Markov Reward Process
• Consists of a <S, P, R, γ> tuple, where R is the reward function and γ is the discount factor.
• R(s) = E[R_{t+1} | S_t = s]
• G_t = R_{t+1} + γ R_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1} is the total discounted return.
• Reasons for discounting: uncertainty about the future may not be fully represented, immediate rewards are valued more than delayed ones, and it avoids infinite returns in cyclic processes.
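A short sketch of the discounted return G_t computed backwards over a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, accumulated backwards as G_t = r + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, -1.0, 2.0]))
```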
Markov Decision Process
• Consists of a <S, A, P, R, γ> tuple, where A is the set of actions.
• P_ss'^a = P(S_{t+1} = s' | S_t = s, A_t = a)

Q-Learning
1. Create the reward matrix R, where R_sa is the reward for taking action a in state s, and set the γ parameter.
2. Initialize the Q matrix to 0.
3. Pick a random initial state and assign it to the current state.
4. Select one among all possible actions of the current state:
   a. Use this action to reach the new state s'.
   b. Get the maximum Q value of this new state over all of its actions.
   c. Update the Q matrix using Q_sa = R_sa + γ × max[Q_s'a'], ∀ s' accessible from s, ∀ a' available in s'.
5. Repeat from step 4 until the current state is the goal state.
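A minimal tabular sketch of these steps, assuming a hypothetical 4-state reward matrix R in which the action index is also the next state and −1 marks unavailable actions (the update uses the plain R + γ·max rule above, with no learning rate):

```python
import numpy as np

# Hypothetical reward matrix: R[s, a] = reward for action a in state s; -1 = not allowed.
R = np.array([[ -1,   0,  -1,  -1],
              [  0,  -1,   0, 100],
              [ -1,   0,  -1,  -1],
              [ -1,   0,  -1, 100]], dtype=float)
gamma = 0.8
goal = 3
Q = np.zeros_like(R)                        # step 2: initialize Q to 0

for episode in range(200):
    s = np.random.randint(R.shape[0])       # step 3: random initial state
    while s != goal:                        # step 5: repeat until the goal state
        actions = np.where(R[s] >= 0)[0]    # step 4: actions available in s
        a = np.random.choice(actions)
        s_next = a                          # taking action a moves to state a
        # step 4c: Q_sa = R_sa + gamma * max_a' Q_s'a'
        Q[s, a] = R[s, a] + gamma * Q[s_next].max()
        s = s_next

print(np.round(Q / Q.max() * 100))          # normalized learned Q values
```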

Policy
• π(a|s) = P(A_t = a | S_t = s)
• Either deterministic or stochastic; in the deterministic case, P = 1 for a single action a.
• P_π(s'|s) = Σ_a π(a|s) × P(s'|s, a) for a stochastic policy.
• One-step expected reward: r_π(s) = Σ_a π(a|s) r(s, a)
• When the reward is a function of the transition: r_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) r(s, a, s')

Monte Carlo Policy Evaluation
1. Goal: evaluate the state value V_π(s).
2. At every time step t at which state s is visited in an episode:
   a. Increment the counter N(s) ← N(s) + 1
   b. Increment the total return S(s) ← S(s) + G_t
3. The value estimate is the mean V(s) = S(s)/N(s).
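A minimal every-visit Monte Carlo evaluation sketch; the `sample_episode` argument is a hypothetical helper that runs the policy and returns a list of (state, reward) pairs:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, policy, n_episodes=1000, gamma=0.9):
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # accumulated returns S(s)
    for _ in range(n_episodes):
        episode = sample_episode(policy)       # hypothetical: [(s_t, r_{t+1}), ...]
        g = 0.0
        for s, r in reversed(episode):         # compute G_t backwards
            g = r + gamma * g
            N[s] += 1                          # N(s) <- N(s) + 1
            S[s] += g                          # S(s) <- S(s) + G_t
    return {s: S[s] / N[s] for s in N}         # V(s) = S(s) / N(s)
```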
Relation Between V_π and q_π
• V_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)
• V_π(s) = Σ_{a∈A} π(a|s) { r(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s') }
• If the reward depends only on the state: V_π(s) = r(s) + γ Σ_{a∈A} π(a|s) Σ_{s'∈S} P(s'|s, a) V_π(s')
• q_π(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s')
• q_π(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')

Optimality Condition
• V*(s) = max_π V_π(s) ∀ s ∈ S, and similarly q*(s, a) = max_π q_π(s, a).

Policy Gradients
• Trajectory probability: p_θ(s_1, a_1, s_2, a_2, …) = p(s_1) Π_{t=1}^{T} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)
• Goal: θ* = argmax_θ E_{τ~p_θ(τ)}[Σ_t r(s_t, a_t)] = argmax_θ J(θ)
• J(θ) ≈ (1/N) Σ_i Σ_t r(s_{i,t}, a_{i,t})
• Likelihood-ratio trick: ∇_θ p_θ(τ) = p_θ(τ) ∇_θ log p_θ(τ)
• ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} [Σ_{t=1}^{T} ∇_θ log π_θ(a_{i,t}|s_{i,t})] [Σ_{t=1}^{T} r(s_{i,t}, a_{i,t})]
• Gradient ascent update: θ ← θ + α ∇_θ J(θ)
• log π_θ(a_{i,t}|s_{i,t}) is the log probability of the action: it measures how likely the policy is to choose a_{i,t} in state s_{i,t}.
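A minimal REINFORCE-style sketch of the gradient estimator above for a tabular softmax policy; the learning rate and the shape of the episode batch are assumptions for illustration:

```python
import numpy as np

alpha = 0.01   # learning rate (assumed value)

def softmax_policy(theta, s):
    """pi_theta(.|s) as a softmax over the logits theta[s]."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episodes):
    """One gradient-ascent step; episodes is a list of trajectories of (s, a, r) triples."""
    grad = np.zeros_like(theta)
    for trajectory in episodes:
        total_reward = sum(r for _, _, r in trajectory)   # sum_t r(s_t, a_t)
        for s, a, _ in trajectory:
            p = softmax_policy(theta, s)
            dlog = -p                                     # grad_theta log softmax
            dlog[a] += 1.0                                #   = one_hot(a) - p
            grad[s] += dlog * total_reward                # grad log pi * total return
    grad /= len(episodes)                                 # 1/N average over trajectories
    return theta + alpha * grad                           # theta <- theta + alpha * grad J
```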
Actor Critic
• Advantage = Q − V.
• Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_{π_θ}[r(s_{t'}, a_{t'}) | s_t, a_t]: total expected reward from taking action a_t in state s_t.
• V^π(s_t) = E_{a_t~π_θ(a_t|s_t)}[Q^π(s_t, a_t)]: total expected reward from state s_t.
• A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t): how much better action a_t is than the policy's average.
• ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} ∇_θ log π_θ(a_{i,t}|s_{i,t}) A^π(s_{i,t}, a_{i,t})
• Q^π(s_t, a_t) = r(s_t, a_t) + Σ_{t'=t+1}^{T} E_{π_θ}[r(s_{t'}, a_{t'}) | s_t, a_t] = r(s_t, a_t) + E_{s_{t+1}~p(s_{t+1}|s_t, a_t)}[V^π(s_{t+1})]
• A^π(s_t, a_t) ≈ r(s_t, a_t) + V^π(s_{t+1}) − V^π(s_t)
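A short tabular sketch of the advantage estimate and a TD-style critic update; the (s, a, r, s', done) transition format, the step sizes, and the discount factor in the target are assumptions (the bullets above write the undiscounted finite-horizon form):

```python
gamma, beta = 0.9, 0.1   # discount factor and critic step size (assumed values)

def advantages(transitions, V):
    """A(s_t, a_t) ~= r + gamma * V(s_{t+1}) - V(s_t) for (s, a, r, s_next, done) tuples."""
    out = []
    for s, a, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        out.append((s, a, target - V[s]))       # advantage used to weight grad log pi
    return out

def critic_update(transitions, V):
    """Move the tabular critic V(s) toward the bootstrapped target r + gamma * V(s')."""
    for s, _, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        V[s] += beta * (target - V[s])
    return V
```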
