
UNIT-4

Bootstrapping
In reinforcement learning (RL), bootstrapping refers to the use of estimated values, often based
on current estimates, to update the value function or make decisions. It is a fundamental concept
that distinguishes RL from other machine learning paradigms.
Here are some key aspects of bootstrapping in reinforcement learning:
1. Temporal Difference (TD) Learning:
• TD learning is a common approach in RL that involves updating the value of a
state based on the difference between the current estimate and a new estimate
that incorporates the immediate reward and the estimated value of the next state.
• The update equation for TD learning is often expressed as V(st)←V(st)+α[rt+1+γV(st+1)−V(st)], where V(st) is the estimated value of state st, rt+1 is the immediate reward, γ is the discount factor, and α is the learning rate.
2. Q-Learning:
• Q-learning is a popular model-free RL algorithm that uses bootstrapping to
update the action-value function Q(s,a).
• The update rule for Q-learning is Q(st,at)←Q(st,at)+α[rt+1+γmaxa Q(st+1,a)−Q(st,at)].
• Q-learning estimates the value of taking action a in state s and then follows the
optimal policy thereafter.
3. Value Iteration:
• In dynamic programming methods like value iteration, bootstrapping is used to
iteratively update the value function for each state based on the values of
successor states.
• The update equation is V(s)←maxa ∑s′ P(s′∣s,a)[R(s,a,s′)+γV(s′)], where P is the transition probability, R is the immediate reward, and γ is the discount factor (a minimal sketch appears at the end of this section).
4. Function Approximation:
• Bootstrapping is also employed when using function approximation methods,
such as neural networks, to estimate value functions.
• Deep Q-Networks (DQN) and other deep reinforcement learning approaches
use bootstrapping to update the neural network's weights based on the temporal
difference error.
5. Exploration-Exploitation Trade-off:
• Bootstrapping is involved in the exploration-exploitation trade-off, where the
agent uses its current estimates to decide whether to explore new actions or
exploit the current best-known actions.
Bootstrapping allows RL algorithms to learn from limited experiences by using estimated
values, enabling the agent to make decisions and improve its policy over time. It is a crucial
aspect that facilitates the adaptation and learning process in dynamic environments.
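To make the bootstrapping idea concrete, here is a minimal value iteration sketch in Python (corresponding to item 3 above). The toy MDP, its state names, and its transition table are illustrative assumptions rather than anything standard; the point is that every update of V(s) is built from the current, bootstrapped estimates V(s′) of the successor states.

```python
GAMMA = 0.9  # discount factor

# Toy, made-up MDP: transitions[state][action] is a list of
# (probability, next_state, reward) outcomes.
transitions = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.5)]},
}

V = {s: 0.0 for s in transitions}  # initial value estimates

for _ in range(100):  # sweep the states until the values settle
    for s, actions in transitions.items():
        # Bootstrapping: each candidate action value is computed from the
        # current estimates V[s_next] of the successor states.
        V[s] = max(
            sum(p * (r + GAMMA * V[s_next]) for p, s_next, r in outcomes)
            for outcomes in actions.values()
        )

print(V)  # approximately {'s0': 5.5, 's1': 5.0}
```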

TD(0) algorithm
Temporal Difference (TD) learning is a fundamental concept in reinforcement learning, and
TD(0) is a specific variant of the TD learning algorithm. The 0 refers to setting the eligibility-trace parameter λ to 0 in TD(λ), which gives a one-step method: the update uses only the immediate reward and the estimated value of the very next state. Let's delve into the details of the TD(0) algorithm:
TD(0) Update Rule:
The TD(0) update rule for updating the value function for a state st is given by:
V(st)←V(st)+α[rt+1+γV(st+1)−V(st)]
where:
• V(st) is the estimated value of state st,
• α is the learning rate, controlling the size of the update,
• rt+1 is the immediate reward after taking action at in state st,
• γ is the discount factor, representing the importance of future rewards,
• V(st+1) is the estimated value of the next state st+1.
Steps of the TD(0) Algorithm:
1. Initialize: Initialize the value function V(s) for all states.
2. Iterate: Repeat the following steps until convergence or a predetermined number of
iterations:
• Observe the current state st.
• Take an action at based on the current policy.
• Observe the immediate reward rt+1 and the next state st+1.

• Update the value of the current state using the TD(0) update rule.
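A minimal TD(0) policy-evaluation sketch in Python is given below. The environment interface (env.reset() returning a state, env.step(action) returning the next state, the reward, and a done flag) and the policy callable are illustrative assumptions, not a specific library's API.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) policy evaluation: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)  # value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()          # observe the initial state
        done = False
        while not done:
            action = policy(state)   # act according to the current policy
            next_state, reward, done = env.step(action)
            # Bootstrapped target: immediate reward plus the discounted
            # current estimate of the next state (0 if the episode ended).
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```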
Convergence of Monte Carlo and batch TD(0) algorithms
In the context of reinforcement learning, a Monte Carlo method estimates values by averaging
the returns obtained from entire episodes. Convergence in the Monte Carlo setting is generally
assured under certain conditions:
1. Law of Large Numbers: The Monte Carlo estimate converges to the true value as the
number of samples (episodes) approaches infinity. This is a consequence of the law of
large numbers, which states that the average of a sequence of independent and
identically distributed random variables converges to the expected value.
2. Exploration: For convergence, it is essential that the agent explores the state-action
space sufficiently. If states or state-action pairs are not visited frequently, the estimates
may be biased or exhibit high variance.

TD(0) is a temporal difference learning algorithm that updates the value function after each time step based on a single-step lookahead. In the batch setting, a fixed batch of experience is presented repeatedly, and the TD(0) updates are accumulated over the whole batch before being applied. The convergence properties of TD(0) are influenced by the choice of step size (learning rate) and the characteristics of the environment, and convergence is not guaranteed in every case.
1. Robbins-Monro Conditions: For TD(0) to converge, the sequence of learning rates must satisfy the Robbins-Monro conditions: the step sizes αt must sum to infinity (∑αt = ∞) while the sum of their squares stays finite (∑αt² < ∞); a common choice is αt = 1/t.
2. Stationarity and Ergodicity: Convergence is influenced by the stationarity and
ergodicity of the Markov decision process (MDP). If the environment is non-stationary
or lacks certain ergodic properties, convergence might be challenging.
3. Exploration-Exploitation Trade-off: Similar to MC methods, TD(0) benefits from
sufficient exploration to converge to the optimal values. Inadequate exploration may
lead to biased or suboptimal estimates.
Batch TD(0) vs. MC:
• TD(0) updates values at each time step, making it suitable for online learning scenarios; in the batch setting it converges to the value function of the maximum-likelihood (certainty-equivalence) model implied by the stored experience.
• MC, on the other hand, updates values only after completing entire episodes, which makes it less suitable for online or non-episodic tasks; batch MC converges to the estimates that minimize mean-squared error on the stored returns.
• The choice between MC and TD(0) often depends on the characteristics of the problem,
available data, and computational considerations.
In practice, both MC and TD(0) algorithms can be effective in reinforcement learning tasks,
and the choice depends on the specific requirements of the problem at hand. Convergence
analysis often involves considering the assumptions and conditions under which these
algorithms are applied.
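The practical difference between the two batch estimators can be seen in a short sketch. The episode format (lists of (state, reward, next_state) transitions) and the tiny two-state example are illustrative assumptions; the point is that, given the same stored experience, batch MC and batch TD(0) generally settle on different value estimates.

```python
GAMMA = 1.0   # episodic task, no discounting
ALPHA = 0.05  # small constant step size

def mc_estimate(episodes, num_passes=500):
    """Every-visit Monte Carlo: move V(s) toward the full observed return."""
    V = {}
    for _ in range(num_passes):
        for episode in episodes:
            G = 0.0
            for s, r, s_next in reversed(episode):  # accumulate the return backwards
                G = r + GAMMA * G
                V[s] = V.get(s, 0.0) + ALPHA * (G - V.get(s, 0.0))
    return V

def batch_td0_estimate(episodes, num_passes=500):
    """Batch TD(0): accumulate one-step TD errors over the whole batch,
    apply them, and repeat until the estimates settle."""
    V = {}
    for _ in range(num_passes):
        deltas = {}
        for episode in episodes:
            for s, r, s_next in episode:
                target = r + GAMMA * V.get(s_next, 0.0)  # terminal states stay 0
                deltas[s] = deltas.get(s, 0.0) + ALPHA * (target - V.get(s, 0.0))
        for s, d in deltas.items():
            V[s] = V.get(s, 0.0) + d
    return V

# Two hand-made episodes from a tiny chain: A -> B -> terminal.
episodes = [
    [("A", 0.0, "B"), ("B", 1.0, "END")],
    [("B", 0.0, "END")],
]
print(mc_estimate(episodes))         # V(A) near 1.0: the only return observed from A
print(batch_td0_estimate(episodes))  # V(A) near 0.5: bootstrapped through V(B)
```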

Model-free control

Model-free control is a category of reinforcement learning (RL) algorithms that focus on learning an optimal policy without explicitly modeling the dynamics of the environment. In
model-free control, the agent directly interacts with the environment, observes the outcomes of
its actions, and adjusts its policy based on the observed rewards. This is in contrast to model-
based methods, which involve constructing an explicit model of the environment and using it
for planning.
There are two main types of model-free control methods: on-policy and off-policy algorithms.
On-Policy Methods:
On-policy methods evaluate and improve the same policy that is used to interact with the environment. The policy is often represented as a mapping from states to actions.
SARSA (State-Action-Reward-State-Action) is an example of an on-policy algorithm that can be used for model-free control. SARSA estimates the action-value function of the policy it is currently following, including its exploratory actions.
Off-Policy Methods:
Off-policy methods, on the other hand, involve learning one policy (the target policy) while interacting with the environment using another policy (the behavior policy).
One popular off-policy algorithm for model-free control is Q-learning. The Q-learning algorithm learns the optimal action-value function, which represents the expected cumulative reward of taking an action in a particular state and following the optimal policy thereafter, independently of the behavior policy used to gather the experience.
Key Components of Model-Free Control:
Policy: Model-free control algorithms aim to learn an optimal policy, which is a strategy that
guides the agent's actions in different states to maximize the expected cumulative reward over
time.
Value Functions:
State-Value Function (V): Represents the expected cumulative reward from a given state
following a certain policy.
Action-Value Function (Q): Represents the expected cumulative reward from a given state,
taking a specific action, and following a certain policy.
Exploration-Exploitation Trade-off:
Model-free control algorithms often face the challenge of exploration. Balancing exploration
and exploitation is crucial for learning an optimal policy in unknown environments.
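A common way to implement this trade-off is an epsilon-greedy policy, sketched below. The Q-table layout (a mapping from states to a list of per-action values) is an illustrative assumption.

```python
import random

def epsilon_greedy(Q, state, num_actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    by taking the action with the highest current Q-value."""
    if random.random() < epsilon:
        return random.randrange(num_actions)        # exploration
    q_values = Q[state]                             # assumed: one value per action
    return max(range(num_actions), key=lambda a: q_values[a])  # exploitation
```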
Temporal Difference Learning:
Many model-free control algorithms, such as Q-learning and SARSA, use temporal difference
(TD) learning to update their value estimates. TD learning involves updating value estimates
based on the difference between current and future estimates.
Policy Improvement:
As the agent interacts with the environment and receives feedback in the form of rewards, the
model-free control algorithm continually updates its policy to improve its decision-making.
Model-free control methods are particularly suitable for situations where the dynamics of the
environment are complex, unknown, or hard to model accurately. These algorithms can adapt
to various environments and are widely used in applications ranging from game playing to
robotic control.

Q-learning
Q-learning is a popular model-free reinforcement learning algorithm that is used for learning
optimal action-selection policies in a Markov decision process (MDP) without requiring a
model of the environment. It is particularly well-suited for environments where the dynamics
are unknown or difficult to model accurately. The algorithm is based on estimating the action-
value function, denoted as Q(s, a), representing the expected cumulative reward of taking
action a in state s and following the optimal policy thereafter.
Here are the key components and steps involved in the Q-learning algorithm:
Q-Learning Algorithm:
1. Initialization:
• Initialize the Q-values for all state-action pairs arbitrarily, often with small
random values.
• Q(s,a) represents the estimated expected cumulative reward of taking action a
in state s.
2. Exploration-Exploitation Trade-off:
• Choose an action in the current state based on the current Q-values and an
exploration strategy (e.g., epsilon-greedy policy).
• With probability 1−ϵ, select the action with the highest Q-value (exploitation).
• With probability ϵ, choose a random action (exploration).
3. Interaction with Environment:
• Execute the chosen action in the environment.
• Observe the immediate reward r and the next state s′.
4. Update Q-value:
• Update the Q-value for the current state-action pair using the Q-learning update rule: Q(s,a)←Q(s,a)+α[r+γmaxa′ Q(s′,a′)−Q(s,a)]
• α is the learning rate, controlling the size of the update.
• γ is the discount factor, representing the importance of future rewards.
5. Repeat:
• Repeat steps 2-4 until convergence or a predefined number of iterations.
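Putting the steps together, a minimal tabular Q-learning sketch might look like the following. The environment interface (reset() returning a state, step(action) returning the next state, the reward, and a done flag) is an illustrative assumption, not a specific library API.

```python
import random
from collections import defaultdict

def q_learning(env, num_actions, num_episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning under an assumed env interface."""
    Q = defaultdict(lambda: [0.0] * num_actions)  # step 1: arbitrary (zero) init
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: Q[state][a])
            # Step 3: interact with the environment.
            next_state, reward, done = env.step(action)
            # Step 4: off-policy target uses the max over next actions.
            best_next = 0.0 if done else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next
                                         - Q[state][action])
            state = next_state  # step 5: continue from the next state
    return Q
```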
SARSA
SARSA is another popular model-free reinforcement learning algorithm, similar to Q-learning.
Like Q-learning, SARSA is used to learn optimal action-selection policies in Markov decision
processes (MDPs) without requiring a model of the environment. The name "SARSA" stands
for State-Action-Reward-State-Action, which represents the information used to update the
action-value estimates.
Here's an overview of the SARSA algorithm:
SARSA Algorithm:
1. Initialization:
• Initialize the Q-values for all state-action pairs arbitrarily.
• Q(s,a) represents the estimated expected cumulative reward of taking action a
in state s.
2. Exploration-Exploitation Trade-off:
• Choose an action a in the current state s based on the current Q-values and an
exploration strategy (e.g., epsilon-greedy policy).
• Execute action a.
3. Interaction with Environment:
• Observe the immediate reward r and the next state s′.
• Choose the next action a′ in state s′ using the same exploration strategy.
4. Update Q-value:
• Update the Q-value for the current state-action pair using the SARSA update rule: Q(s,a)←Q(s,a)+α[r+γQ(s′,a′)−Q(s,a)]
• α is the learning rate, controlling the size of the update.
• γ is the discount factor, representing the importance of future rewards.
5. Transition:
• Set the current state s to s′ and the current action a to a′.
6. Repeat:
• Repeat steps 2-5 until convergence or a predefined number of iterations.
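A minimal tabular SARSA sketch, mirroring the Q-learning sketch above under the same assumed environment interface, is shown below. Note that the target uses Q(s′,a′) for the action actually chosen next, which is what makes SARSA on-policy.

```python
import random
from collections import defaultdict

def sarsa(env, num_actions, num_episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA under an assumed env interface."""
    def choose(Q, state):
        # Epsilon-greedy action selection (step 2).
        if random.random() < epsilon:
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: Q[state][a])

    Q = defaultdict(lambda: [0.0] * num_actions)  # step 1
    for _ in range(num_episodes):
        state = env.reset()
        action = choose(Q, state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)   # step 3
            next_action = choose(Q, next_state)           # step 3: pick a'
            # Step 4: on-policy target uses the action actually chosen next.
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action       # step 5
    return Q
```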

Expected SARSA
Expected SARSA (sometimes written as Expected Sarsa or ESarsa) is another reinforcement
learning algorithm that falls within the family of model-free control methods. It's an extension
of SARSA (State-Action-Reward-State-Action), which also aims to learn optimal policies
without requiring a model of the environment.
The key difference between Expected SARSA and SARSA lies in how they update the Q-
values. In SARSA, the update is based on the actual next action taken, whereas Expected
SARSA takes an expectation over all possible next actions.
Expected SARSA Algorithm:
1. Initialization:
• Initialize the Q-values for all state-action pairs arbitrarily.
• Q(s,a) represents the estimated expected cumulative reward of taking action a
in state s.
2. Exploration-Exploitation Trade-off:
• Choose an action a in the current state s based on the current Q-values and an
exploration strategy (e.g., epsilon-greedy policy).
• Execute action a.
3. Interaction with Environment:
• Observe the immediate reward r and the next state s′.
4. Update Q-value:
• Update the Q-value for the current state-action pair using the Expected SARSA
update rule:
Q(s,a)←Q(s,a)+α[r+γ∑a′π(a′∣s′)Q(s′,a′)−Q(s,a)]
• Here, π(a′∣s′) represents the probability of taking action a′ in the next state s′ under the current policy.
• α is the learning rate, and γ is the discount factor.
5. Repeat:
• Set the current state s to s′ and repeat steps 2-4 until convergence or a predefined number of iterations.
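A minimal Expected SARSA sketch is given below, again under the same assumed environment interface and an epsilon-greedy behavior policy. The only change from SARSA is that the target averages Q(s′,a′) over the policy's action probabilities instead of using the single sampled next action.

```python
import random
from collections import defaultdict

def expected_sarsa(env, num_actions, num_episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Expected SARSA with an epsilon-greedy policy."""
    Q = defaultdict(lambda: [0.0] * num_actions)  # step 1

    def expected_q(q_values):
        # Expectation of Q(s', a') under epsilon-greedy: every action has
        # probability epsilon / num_actions, and the greedy action receives
        # the remaining (1 - epsilon) on top of that.
        return (1 - epsilon) * max(q_values) + (epsilon / num_actions) * sum(q_values)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)   # step 3
            # Step 4: expectation over next actions under the current policy.
            target = reward + (0.0 if done else gamma * expected_q(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```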
