The document discusses reinforcement learning and describes active reinforcement learning. It notes that with active reinforcement learning, the learner makes choices and must balance exploration versus exploitation. The learner actually takes actions in the world and learns from the outcomes rather than doing offline planning. One method described is Q-learning, where the learner uses sample experiences to iteratively update and improve its estimates of the Q-values as it interacts with the environment. The document emphasizes that with active reinforcement learning, the learner has to explore enough while eventually focusing more on exploitation, and that in the limit, provided exploration is sufficient and the learning rate is decayed appropriately, the exact method for choosing actions does not affect convergence.
• Jana Purukan
• Rico Suot

Outline
1 Supervised Learning and Unsupervised Learning
2 Passive Reinforcement Learning
3 Temporal Difference Learning
4 Active Reinforcement Learning

Active Reinforcement Learning
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You choose the actions now
– Goal: learn the optimal policy / values
– Learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and find out what happens…
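One common way to handle the exploration vs. exploitation tradeoff above is ε-greedy action selection. A minimal sketch (not from the slides), assuming a tabular Q-table stored as a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit).
    Unseen (state, action) pairs default to a Q-value of 0.0."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Decaying epsilon over time shifts the learner from mostly exploring early on to mostly exploiting later, matching the "explore enough, then exploit" idea.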
Q-Learning: sample-based Q-value iteration
Learn Q(s,a) values as you go
– Receive a sample (s,a,s’,r)
– Consider your old estimate: Q(s,a)
– Consider your new sample estimate: sample = r + γ max_a’ Q(s’,a’)
– Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α [sample]
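The running-average update above can be sketched in code. This is an illustration, not the slides' implementation; it assumes a tabular Q-table stored as a dict, with alpha (learning rate) and gamma (discount) as parameter names:

```python
def q_learning_update(Q, s, a, s_prime, r, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step on a sample (s, a, s', r).

    sample estimate:  r + gamma * max_a' Q(s', a')
    running average:  Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * sample
    Unseen (state, action) pairs default to a Q-value of 0.0.
    """
    sample = r + gamma * max(Q.get((s_prime, ap), 0.0) for ap in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q[(s, a)]
```

Repeating this update on samples gathered while acting is what makes Q-learning "learn as you go": no model of T or R is ever built.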
Q-learning is off-policy learning: it converges to the optimal Q-values even while actions are chosen by an exploratory policy. Caveats:
– You have to explore enough
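These convergence caveats can be illustrated numerically. A sketch, not from the slides: the standard (Robbins–Monro) conditions require a learning rate α_t with Σ α_t = ∞ (doesn't shrink too quickly) and Σ α_t² < ∞ (eventually becomes small), and α_t = 1/t satisfies both:

```python
# Compare partial sums of alpha_t and alpha_t**2 for alpha_t = 1/t.
# For the schedule to work, the first sum must diverge while the
# second stays bounded (here it is bounded by pi**2 / 6 ~ 1.645).
def partial_sums(alpha, n):
    """Return (sum of alpha(t), sum of alpha(t)**2) for t = 1..n."""
    s1 = sum(alpha(t) for t in range(1, n + 1))
    s2 = sum(alpha(t) ** 2 for t in range(1, n + 1))
    return s1, s2

s1, s2 = partial_sums(lambda t: 1.0 / t, 100_000)
# s1 keeps growing without bound; s2 stays below pi**2 / 6
```

A constant α never becomes small enough, while α_t = 1/t² shrinks so fast that even Σ α_t stays bounded, so learning stops too early.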
– You have to eventually make the learning rate small enough
– … but not decrease it too quickly
– Basically, in the limit, it doesn’t matter how you select actions (!)

THANK YOU! Any Questions?