
Efficient Multitask Reinforcement Learning Without Performance Loss


Jongchan Baek, Seungmin Baek, and Soohee Han, Senior Member, IEEE

INTRODUCTION
The paper presents an efficient multitask reinforcement learning (RL) algorithm that does not sacrifice performance. It proposes a method called Iterative Sparse Bayesian Policy Optimization (ISBPO) for sequential multitask RL, addressing the common real-world need to learn multiple tasks efficiently with a single policy network. ISBPO handles a sequence of control tasks by consecutively optimizing policy network weights based on SBPO. For each task, SBPO produces sparse weights w*k and a corresponding binary mask mk that records the valid weight positions of the policy network.

THEORETICAL EXPLANATION
Sparse Bayesian Policy Optimization (SBPO):
- Model the weights w as random variables: q(w) = N(μ, σ²)
- Training via Variational Inference (VI)
- Evidence Lower Bound (ELBO):
L(φ) = E[log p(D|w)] − λ DKL(q(w)||p(w))
- Stochastic Gradient Variational Bayes (SGVB) update:
φ ← φ + η ∇φ L(φ)
- Pruning using the Signal-to-Noise Ratio (SNR):
SNR(wi) = (E[wi])² / Var[wi] = μi² / σi²
(A toy sketch of these steps follows below.)
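
The following is a minimal PyTorch sketch of the SBPO ingredients listed above: a Gaussian variational posterior q(w) = N(μ, σ²), an ELBO objective optimized by stochastic gradient ascent (SGVB), and SNR-based pruning. The dimensionality, the standard-normal prior, the learning rate, the placeholder log_likelihood, and the SNR threshold are illustrative assumptions, not values from the paper.

import torch

d = 256                                             # number of policy weights (illustrative)
mu = torch.zeros(d, requires_grad=True)             # variational mean of q(w)
rho = torch.full((d,), -3.0, requires_grad=True)    # sigma = softplus(rho) keeps sigma > 0
prior = torch.distributions.Normal(0.0, 1.0)        # prior p(w); the paper uses a sparsity-inducing prior
lam = 1e-3                                          # KL weight lambda (assumed)
opt = torch.optim.Adam([mu, rho], lr=1e-3)

def log_likelihood(w):
    # Placeholder for E[log p(D|w)], e.g. the RL policy objective on a batch of data D.
    return -(w ** 2).mean()

for step in range(1000):
    sigma = torch.nn.functional.softplus(rho)
    q = torch.distributions.Normal(mu, sigma)
    w = q.rsample()                                 # reparameterized sample of the weights (SGVB)
    elbo = log_likelihood(w) - lam * torch.distributions.kl_divergence(q, prior).sum()
    opt.zero_grad()
    (-elbo).backward()                              # ascend L(phi) by descending -L(phi)
    opt.step()

with torch.no_grad():                               # SNR-based pruning: SNR(wi) = mu_i^2 / sigma_i^2
    sigma = torch.nn.functional.softplus(rho)
    snr = mu ** 2 / sigma ** 2
    mask = snr > 1.0                                # binary mask m_k of surviving positions (threshold assumed)
    w_star = torch.where(mask, mu, torch.zeros_like(mu))   # sparse weights w*_k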

Iterative Learning:
- Task sequence T1:K = [T1, T2, ..., TK]
- For task Tk+1, after learning T1:k:
- Trainable weight space: Wk+1 = (W1* ⊕ W2* ⊕ ... ⊕ Wk*)^⊥
- Train wk+1 ∈ Wk+1 using SBPO to get w*k+1
- Overall weights: w = w*1 + w*2 + ... + w*k

Proposed Method: Iterative Sparse Bayesian Policy Optimization (ISBPO)
Overview of the proposed ISBPO method:
- Sparse Bayesian Policy Optimization (SBPO) technique
- Randomizing weights using Variational Inference (VI)
- Training objective with sparsity-inducing regularization
- Pruning based on the Signal-to-Noise Ratio (SNR) criterion
(A minimal sketch of this iterative weight-allocation scheme follows the key contributions below.)

Key Contributions:
- Iterative, sequential learning of multiple tasks using ISBPO without catastrophic forgetting
- Reusing surviving weights for knowledge transfer and improved sample efficiency
- Efficient weight allocation and pruning
- Task-specific binary masks for activation of relevant weights
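
As a sketch of the iterative scheme above: each new task is trained only in the still-free weight positions, the surviving positions are then frozen and recorded in a binary mask, and inference for a task activates the weights learned up to that task. The helper train_sbpo, the dimensionality, and the pruning rule are placeholders standing in for SBPO, not the paper's implementation.

import numpy as np

d = 256                                    # size of the shared weight vector (illustrative)
w = np.zeros(d)                            # overall weights, grows as w*1 + w*2 + ...
occupied = np.zeros(d, dtype=bool)         # positions already claimed by earlier tasks
masks = []                                 # task-specific binary masks m1, m2, ...

def train_sbpo(free_positions):
    # Placeholder for SBPO: returns sparse weights living only on the free positions
    # together with the binary mask of surviving positions.
    w_k = np.random.randn(d) * free_positions        # pretend training result
    m_k = free_positions & (np.abs(w_k) > 1.0)       # pretend SNR-based pruning
    return w_k * m_k, m_k

num_tasks = 3
for k in range(num_tasks):
    free = ~occupied                       # trainable space Wk+1, disjoint from earlier tasks
    w_star_k, m_k = train_sbpo(free)
    w += w_star_k                          # earlier weights are reused and never overwritten
    occupied |= m_k                        # freeze the surviving positions
    masks.append(m_k)

def task_weights(j):
    # Inference for task j activates the weights learned up to and including task j,
    # so earlier tasks keep exactly the weights they were learned with (no forgetting).
    active = np.zeros(d, dtype=bool)
    for m in masks[: j + 1]:
        active |= m
    return w * active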

EXPERIMENTAL SETUP

Experimental Setup Overview: The proposed ISBPO is evaluated across three dynamical systems: a simulated robot manipulator, an image-based dexterous manipulator, and a real double inverted pendulum.

Task Selection: Tasks include sequentially learning ten robotic manipulation tasks from the Meta-World benchmark and image-based valve-turning tasks with a D’Claw robot. For real-world testing, three control strategies of a double inverted pendulum system are considered.

Comparative Evaluation: Three existing methods (Fine-tuning, EWC, PackNet) are used for comparison, assessing the efficiency of ISBPO in terms of learning multiple tasks without performance loss, sample efficiency, and policy network weight allocation.

Implementation Details: ISBPO utilizes the SAC algorithm, and sparsity-inducing priors are applied to efficiently allocate weights. Experiments involve sequential learning of each task for 1 million steps, and evaluation metrics include Success Rate, Performance Loss, and Forward Transfer.

EVALUATION METRICS
- 5 trials with different random seeds
- 3 evaluation metrics:
- Success Rate (SR): all previously learned tasks are tested after each new task is learned
- Performance Loss (PL): difference in a task's performance between two consecutive evaluations
- Forward Transfer (FT): normalized area between the learning curve and a reference curve
(A toy computation of these metrics is sketched after the results below.)

RESULTS
The proposed ISBPO enables the policy network to successfully learn all the tasks without PL and, hence, achieves an overall SR of 99.9%. The multitask learning process additionally allocates only a very small percentage (0.23%–0.9%) of the weights to each new task.

This confirms that ISBPO uses limited policy network resources very
efficiently and economically.
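
Below is a toy illustration of how the three metrics might be computed from per-task evaluation data; the numbers and the exact formulas (especially the FT normalization) are assumptions for illustration, not the paper's definitions.

import numpy as np

# sr[k, i]: success rate of task i, measured right after task k is learned (k >= i); toy numbers.
sr = np.array([
    [1.00, np.nan, np.nan],
    [0.99, 0.98,  np.nan],
    [0.99, 0.97,  0.96 ],
])

overall_sr = np.nanmean(sr[-1])                    # SR: average over all tasks after the full sequence

# PL: per-task drop between two consecutive evaluations, averaged (improvements clipped to zero).
pl = np.nanmean(np.maximum(sr[:-1] - sr[1:], 0.0))

# FT: normalized area between the multitask learning curve and a from-scratch reference curve (toy curves).
learning_curve = np.linspace(0.0, 0.95, 100)       # learning curve when reusing earlier weights
reference_curve = np.linspace(0.0, 0.90, 100)      # reference curve when learning the task alone
ft = np.mean(learning_curve - reference_curve)     # positive FT indicates forward transfer

print(overall_sr, pl, ft)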

CONCLUSION
The ISBPO algorithm is designed to handle a sequence of control tasks by consecutively optimizing policy network weights based on the Sparse Bayesian Policy
Optimization (SBPO) algorithm. The algorithm produces sparse weights and corresponding binary masks, which are used to train for new tasks while reusing
previously learned weights. The results of the experiments on robot manipulation tasks and image-based dexterous manipulation tasks demonstrate the
effectiveness of the ISBPO algorithm in handling multiple tasks efficiently. The paper also discusses the use of sparsity-inducing prior distributions to intentionally
obtain a sparse network and the optimization process using a gradient ascent method. Overall, the ISBPO algorithm provides a principled and efficient solution for
multitask RL, which is crucial for real-world applications.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
