
CS181 Project - A New Exploration of the Multi-Armed Bandit Problem

Zhihao Tang
School of Information Science and Technology
ShanghaiTech University
2022533131

Kexin Cui
School of Information Science and Technology
ShanghaiTech University
2021533047

Shangyi Zhang
School of Information Science and Technology
ShanghaiTech University
2021533051

June 21, 2024

Abstract

The multi-armed bandit problem exemplifies the trade-off between exploration and exploitation in decision theory. This report investigates the effectiveness of various algorithms, including Thompson Sampling (TS), Theta-Greedy (TG), Beta, and novel approaches incorporating Graph AI and Sequence AI, through implementation and analysis in simulated environments. Key findings indicate that optimal parameter settings are crucial for balancing exploration and exploitation, with TS demonstrating robustness and adaptability. Pretraining effects vary, enhancing TS performance while potentially reducing flexibility in TG and Beta algorithms due to overfitting. These insights underscore the importance of algorithm-specific parameter tuning and adaptive strategies for maximizing long-term rewards in dynamic contexts.

1 Introduction
The multi-armed bandit problem is fundamental in reinforcement learning and decision theory, illustrating the trade-off
between exploration and exploitation. In scenarios ranging from clinical trials to online advertising, decision-makers
face the challenge of maximizing cumulative rewards over time from a set of arms (options) with unknown reward
distributions. Traditional strategies like epsilon-greedy and UCB methods attempt to balance exploring less-explored
arms to gather more information (exploration) and exploiting the best-known arms based on current knowledge
(exploitation).
Thompson Sampling (TS) stands out as a Bayesian probabilistic approach that leverages prior beliefs about arm
reward probabilities, continuously updating these beliefs with new data to guide decision-making. Unlike deterministic
strategies, TS maintains a distribution over possible reward outcomes for each arm, sampling from these distributions to
probabilistically balance exploration and exploitation. By evaluating how these algorithms perform in a simulated bandit environment, this study aims to provide insights into their practical applications and comparative advantages.

2 Background
The multi-armed bandit problem derives its name from a casino analogy, where a gambler faces multiple slot machines
(arms), each with an unknown probability distribution of yielding rewards. The gambler aims to maximize their total
reward over a series of plays by strategically choosing which machine to play at each round. This dilemma extends
beyond gambling to real-world scenarios such as clinical trials, where researchers balance testing multiple treatments
(arms) to identify the most effective one efficiently.

Early solutions to the bandit problem focused on deterministic strategies like epsilon-greedy, which allocate a fixed
portion of trials to exploration and the remainder to exploitation. These methods can be straightforward to implement
but may struggle to adapt to dynamic or uncertain environments where arm reward probabilities change over time.
Upper Confidence Bound (UCB) algorithms introduced a more sophisticated approach by prioritizing arms with high
upper confidence bounds, balancing exploration by favoring arms with uncertain reward estimates. While effective in
many scenarios, UCB algorithms may require fine-tuning and can be sensitive to initial assumptions and parameter
settings.
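Although this report does not commit to a specific bound, the standard UCB1 criterion is one concrete example of such a rule: at round t, select the arm maximizing

UCB_i(t) = μ̂_i + c · sqrt(ln t / n_i),

where μ̂_i is the empirical mean reward of arm i, n_i is the number of times arm i has been pulled so far, and c is an exploration constant.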

3 Theta-Greedy Algorithm

Theta-Greedy (TG) is an algorithm that balances exploration and exploitation using a combination of greedy selection and random exploration. The algorithm maintains an estimated reward value for each arm, denoted as θ. At each step, the algorithm either explores randomly with probability ϵ or exploits by selecting the arm with the highest estimated reward. Upon initialization, each bandit arm is assigned an initial estimated reward value. As the simulation progresses, the agent chooses arms to pull based on these estimated values and observes rewards accordingly. TG updates these estimates using the following formula:

θ_new = θ_old + (lr / N) · (R − θ_old)

where N is the number of times the arm has been chosen, R is the observed reward, and lr is the learning rate. The steps for arm selection are as follows:

1. With probability ϵ, randomly select an arm.
2. With probability 1 − ϵ, select the arm with the highest estimated reward. If multiple arms have the same estimated reward, randomly select one of them.

The key advantage of TG is its simplicity and effectiveness in balancing exploration and exploitation. It ensures that all arms have a chance of being explored while prioritizing arms with higher estimated rewards.
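To make this concrete, the following is a minimal sketch of the TG update and selection logic described above. The class and parameter names are ours, chosen for illustration, and do not correspond to the project's actual code; rewards are assumed to be numeric (e.g., Bernoulli).

```python
import random

class ThetaGreedyAgent:
    """Minimal Theta-Greedy sketch: epsilon-random exploration plus a
    learning-rate-scaled incremental update of per-arm estimates.
    Illustrative only; not the project's actual implementation."""

    def __init__(self, n_arms, epsilon=0.1, lr=1.0):
        self.epsilon = epsilon        # exploration probability
        self.lr = lr                  # learning rate
        self.theta = [0.0] * n_arms   # estimated reward per arm
        self.counts = [0] * n_arms    # number of pulls per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.theta))      # explore
        best = max(self.theta)
        # break ties uniformly among arms with the highest estimate
        return random.choice([i for i, t in enumerate(self.theta) if t == best])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # theta_new = theta_old + (lr / N) * (R - theta_old)
        self.theta[arm] += (self.lr / n) * (reward - self.theta[arm])
```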

4 Thompson Sampling Algorithm

Thompson Sampling (TS) is a probabilistic algorithm that combines exploration and exploitation in a principled manner
using Bayesian inference. At its core, TS operates by maintaining prior distributions over the reward probabilities of
each bandit arm. These prior distributions encode the agent’s initial beliefs about the arms’ potential rewards before any
data is observed.
As the agent interacts with the bandit environment and receives rewards from chosen arms, TS updates these prior
distributions using Bayesian inference. Specifically, after observing a reward from an arm, TS updates the corresponding
arm’s distribution to reflect this new evidence. Over time, as more data accumulates, these distributions become
increasingly centered around the true underlying reward probabilities of each arm.
To decide which arm to pull in each round, TS samples a reward probability from each arm’s posterior distribution and
selects the arm with the highest sampled value. This sampling approach ensures that arms with higher estimated rewards
(based on observed data) are more likely to be chosen, while still allowing for exploration of arms with uncertain reward
estimates.
The key advantage of TS lies in its ability to balance exploration and exploitation adaptively. By maintaining and
updating probabilistic beliefs about arm rewards, TS naturally incorporates uncertainty into decision-making. This
flexibility is particularly valuable in scenarios where arm reward probabilities are initially unknown or change over
time, as TS can quickly adapt its strategy based on new evidence.
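As one common instantiation of this idea, the sketch below implements Beta-Bernoulli Thompson Sampling, where each arm's reward probability has a Beta posterior. The report does not fix the exact reward model or prior, so these choices (and the names) are assumptions for illustration.

```python
import random

class ThompsonSamplingAgent:
    """Beta-Bernoulli Thompson Sampling sketch. Each arm keeps a Beta(a, b)
    posterior over its success probability; illustrative only."""

    def __init__(self, n_arms, a0=1.0, b0=1.0):
        # a0, b0 encode the prior; a uniform prior is Beta(1, 1)
        self.a = [a0] * n_arms
        self.b = [b0] * n_arms

    def select_arm(self):
        # sample a plausible success probability from each posterior,
        # then play the arm whose sample is highest
        samples = [random.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        # Bernoulli reward in {0, 1}: success updates a, failure updates b
        if reward:
            self.a[arm] += 1
        else:
            self.b[arm] += 1
```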

5 Graph AI

Graph AI introduces a novel approach to solving multi-armed bandit problems by leveraging graph structures to model dependencies among bandit arms. In this paradigm, each bandit arm is represented as a node in a graph, and edges denote relationships or interactions between arms. This methodology extends traditional bandit algorithms by incorporating contextual information and inter-arm influences into decision-making processes.
At its core, Graph AI initializes a graph G with nodes corresponding to bandit arms and potentially edges reflecting
known or inferred relationships. Each node i in G maintains an estimated probability, estprob[i], which denotes the
likelihood of a favorable outcome or reward associated with choosing arm i.
During operation, Graph AI employs strategies that balance exploration and exploitation. It utilizes graph traversal
algorithms to navigate through interconnected arms, thereby optimizing the selection of arms based not only on
individual rewards but also on collective influences within the graph. This approach allows Graph AI to adaptively
adjust its exploration-exploitation trade-off according to the observed rewards and the evolving structure of the graph.
The algorithm updates the estimated probabilities estprob[i] based on the rewards received from chosen arms. By
integrating reinforcement learning techniques or Bayesian inference principles, Graph AI refines these estimates over
time, reflecting updated beliefs about arm performance. This adaptive learning process ensures that Graph AI can
effectively handle scenarios where arm reward probabilities are initially uncertain or change dynamically.
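Because the report describes Graph AI only at this high level, the sketch below shows one plausible way a graph of arm dependencies could be used: a reward observed on one arm also nudges the estimates of its neighbours. The propagation rule, the spill parameter, and the names are assumptions for illustration, not the project's actual algorithm.

```python
import random

class GraphBanditAgent:
    """Sketch of a graph-aware bandit: estimates are updated for the pulled
    arm and, with a dampening factor, for its neighbours in the graph.
    Illustrative assumption, not the project's Graph AI implementation."""

    def __init__(self, n_arms, edges, epsilon=0.1, spill=0.3):
        self.epsilon = epsilon            # exploration probability
        self.spill = spill                # fraction of the update shared with neighbours
        self.est_prob = [0.5] * n_arms    # estprob[i]: estimated reward probability
        self.counts = [0] * n_arms
        self.neighbors = {i: set() for i in range(n_arms)}
        for i, j in edges:                # undirected edges (i, j)
            self.neighbors[i].add(j)
            self.neighbors[j].add(i)

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.est_prob))
        return max(range(len(self.est_prob)), key=lambda i: self.est_prob[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        step = 1.0 / self.counts[arm]
        self.est_prob[arm] += step * (reward - self.est_prob[arm])
        # share a dampened version of the same update with connected arms
        for j in self.neighbors[arm]:
            self.est_prob[j] += self.spill * step * (reward - self.est_prob[j])
```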

6 Sequence AI for Sequential Decision-Making

Sequence AI applies neural network (NN) models to address sequential decision-making in multi-armed bandit problems.
This approach utilizes machine learning techniques to predict optimal actions based on historical data and contextual
inputs, enhancing decision-making accuracy and adaptability in dynamic environments.
The algorithm initializes an NN architecture tailored for bandit problems, configuring layers for input processing, feature
extraction, and output prediction. The NN model is trained using historical data generated from bandit simulations or
real-world interactions, optimizing model parameters through supervised learning or reinforcement learning paradigms.
During operation, Sequence AI predicts optimal actions in each round of the bandit problem based on historical actions
and corresponding rewards. It leverages predictive probabilities outputted by the NN to balance exploration and
exploitation, maximizing cumulative rewards over sequential interactions with bandit arms.
The algorithm continuously updates the NN model with new rewards and actions, reinforcing learned patterns and
adapting strategies based on observed outcomes. This adaptive learning process ensures that Sequence AI can effectively
navigate changing bandit environments and optimize decision-making strategies in real-time scenarios.
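As a rough illustration of this idea (and of the LSTM choice mentioned in Section 7), the sketch below defines a small PyTorch model that maps a history of (action, reward) pairs to a probability distribution over arms. The architecture, layer sizes, and history encoding are our assumptions; the report does not document the actual network.

```python
import torch
import torch.nn as nn

class SequencePolicy(nn.Module):
    """LSTM sketch: encode the (action, reward) history and predict which
    arm to pull next. Illustrative only; not the project's actual model."""

    def __init__(self, n_arms, hidden_size=32):
        super().__init__()
        # each timestep is a one-hot action concatenated with its reward
        self.lstm = nn.LSTM(input_size=n_arms + 1, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, n_arms)

    def forward(self, history):
        # history: tensor of shape (batch, seq_len, n_arms + 1)
        out, _ = self.lstm(history)
        logits = self.head(out[:, -1, :])     # last hidden state summarizes the history
        return torch.softmax(logits, dim=-1)  # predicted action probabilities

# usage sketch: sample the next arm from the predicted distribution
# policy = SequencePolicy(n_arms=10)
# probs = policy(history_tensor)             # history_tensor shaped as above
# next_arm = torch.multinomial(probs, 1)
```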

7 Implementation Details

We implement various bandit algorithms, including Thompson Sampling, within a simulated gaming environment.
Central to the implementation is the Probgen class, responsible for initializing bandit arms with static or dynamic
reward probabilities. The Game class orchestrates simulations using different AI strategies derived from these algorithms,
facilitating comparative evaluations.
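A simulation loop in this style might look like the sketch below. The function is a hypothetical stand-in for the project's Probgen and Game classes, whose actual interfaces are not reproduced in this report; it assumes Bernoulli arms and an agent exposing select_arm and update methods, as in the earlier sketches.

```python
import random

def run_simulation(agent, true_probs, n_steps=100_000, seed=0):
    """Roll out one agent against fixed Bernoulli arms and return its mean
    reward per step. Hypothetical harness, not the project's Game class."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_steps):
        arm = agent.select_arm()                      # agent picks an arm
        reward = 1 if rng.random() < true_probs[arm] else 0
        agent.update(arm, reward)                     # agent learns from the outcome
        total += reward
    return total / n_steps
```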
In practice, the TS algorithm within the codebase operates as follows: upon initialization, each bandit arm is assigned
a prior distribution over reward probabilities. As the simulation progresses, the agent chooses arms to pull based on
current beliefs and observes rewards accordingly. TS updates these beliefs using Bayesian inference, adjusting the
posterior distributions after each observed reward to reflect updated knowledge.
The UCB (Upper Confidence Bound) algorithm operates similarly within our framework. Each bandit arm initializes
with a prior distribution reflecting its reward probabilities. During simulations, the algorithm balances exploration
and exploitation by calculating an upper confidence bound for each arm based on observed rewards and exploration
parameters. UCB selects arms for pulling based on these bounds, adjusting beliefs about arm performance through
iterative updates.
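For concreteness, the sketch below implements the standard UCB1 selection rule as one instance of this idea; the exploration constant, priors, and bookkeeping in the project's code may differ.

```python
import math

class UCBAgent:
    """UCB1 sketch: play the arm with the highest upper confidence bound.
    Illustrative only; details may differ from the project's implementation."""

    def __init__(self, n_arms, c=1.0):
        self.c = c                     # exploration constant
        self.counts = [0] * n_arms     # pulls per arm
        self.means = [0.0] * n_arms    # empirical mean reward per arm
        self.t = 0                     # total number of rounds so far

    def select_arm(self):
        self.t += 1
        # play every arm once before trusting the confidence bounds
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        bounds = [m + self.c * math.sqrt(math.log(self.t) / n)
                  for m, n in zip(self.means, self.counts)]
        return bounds.index(max(bounds))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```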
Graph AI introduces a novel approach to solving multi-armed bandit problems by leveraging graph structures. Each
bandit arm corresponds to a node in a graph, with edges representing dependencies or interactions between arms. The
algorithm uses graph traversal strategies to optimize arm selection, considering both individual rewards and inter-arm
influences captured in the graph. This adaptive approach enables Graph AI to dynamically adjust its exploration-exploitation strategy based on observed rewards and evolving graph structures. For dynamic arm probabilities that depend on the relations between arms, this approach can fit those relationships better than a standard Bayesian treatment of each arm in isolation.
Sequence AI applies neural network models to sequential decision-making in multi-armed bandit problems. It
initializes an NN architecture tailored for bandit problems, training it on historical data to predict optimal actions.
During operation, Sequence AI balances exploration and exploitation using predictive probabilities output by the NN,
continuously updating its model with new rewards and actions to adapt to changing bandit environments. Because this is a sequence-prediction task, recurrent models such as an RNN or LSTM are a natural fit, so we use an LSTM.
Comparative evaluations involve running multiple simulations with different algorithms (e.g., epsilon-greedy, UCB, TS)
under varying conditions (e.g., static versus dynamic reward probabilities). Performance metrics such as cumulative
rewards, exploration-exploitation trade-offs, and convergence rates are analyzed to assess the effectiveness of TS and its
counterparts in maximizing long-term rewards.


8 Innovations and Advantages of the TS Algorithm


One notable innovation is the integration of a SequenceAI class, which combines traditional bandit algorithms with
machine learning techniques. Specifically, the SequenceAI class utilizes a neural network model to predict arm choices
based on labeled data generated from the bandit environment. This hybrid approach aims to enhance decision-making
accuracy by leveraging historical data to inform future actions.
Empirical results from comparative simulations demonstrate TS’s superior performance in balancing exploration and
exploitation dynamics. In scenarios where reward probabilities are dynamic or initially unknown, TS consistently
achieves higher cumulative rewards compared to epsilon-greedy and UCB methods. This effectiveness stems from TS’s
ability to quickly adapt to changing environments by updating beliefs based on observed outcomes, thereby minimizing
regret over time.

Figure 1: TS

Furthermore, the integration of machine learning models within the bandit framework illustrates a promising direction
for future research. By combining reinforcement learning techniques with predictive modeling, the SequenceAI class
exemplifies how AI systems can evolve to handle complex decision-making tasks more effectively. This innovation
opens avenues for developing hybrid approaches that leverage both statistical inference and machine learning to optimize
decision strategies in uncertain environments.

9 Results and Analysis


In this section, we present the results and analysis of our simulations using various algorithms under different parameter
settings. We will focus on six distinct plots generated from our experiments, each depicting the win rates of the
algorithms as a function of their respective parameters. The key insights and patterns observed from these plots will
also be discussed in detail.

9.1 Theta-Greedy without Pretraining

Plot 1 illustrates the performance of the Theta-Greedy algorithm without pretraining across different values of ϵ. The
x-axis represents the ϵ parameter ranging from 0 to 0.9, while the y-axis shows the win rate after 100,000 steps. We
predict that the win rate will peak at an intermediate value of ϵ due to an optimal balance between exploration and
exploitation. Extremely low or high values of ϵ will likely lead to suboptimal performance due to excessive exploitation
or exploration, respectively.
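A sweep of this kind could be organized as in the sketch below, reusing the hypothetical ThetaGreedyAgent and run_simulation sketches from earlier sections. Note that run_simulation reports mean reward per step; the report's win-rate metric is not defined precisely and may instead be computed from head-to-head comparisons between agents.

```python
import numpy as np

epsilons = np.arange(0.0, 1.0, 0.1)        # 0.0, 0.1, ..., 0.9
true_probs = [0.3, 0.5, 0.7, 0.4, 0.6]     # example arm probabilities (assumed)
results = []
for eps in epsilons:
    agent = ThetaGreedyAgent(n_arms=len(true_probs), epsilon=eps)
    results.append(run_simulation(agent, true_probs, n_steps=100_000))
```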


Figure 2: Theta-Greedy without pretraining

9.2 Theta-Greedy with Pretraining

Plot 2 displays the win rate of the Theta-Greedy algorithm with pretraining, varying ϵ from 0 to 0.9. Similar to Plot 1,
the win rate is expected to peak at intermediate ϵ values. However, we anticipate that the win rates will generally be
lower than those in Plot 1, as pretraining may lead to overfitting on the initial data, reducing the algorithm’s adaptability
to new information.

Figure 3: Theta-Greedy with pretraining

9.3 Beta without Pretraining

Plot 3 shows the performance of the Beta algorithm without pretraining. Here, the x-axis represents the parameter c
calculated as 1 + 0.8 × ϵ, ranging from 1 to 1.72. The y-axis denotes the win rate after 100,000 steps. The win rate is
expected to exhibit a peak at intermediate values of c, reflecting an optimal balance between confidence and exploration.
Lower performance is anticipated at extreme values due to either excessive certainty or insufficient exploration.

Figure 4: Beta without pretraining

9.4 Beta with Pretraining

Plot 4 illustrates the win rate of the Beta algorithm with pretraining, with the c parameter varying from 1 to 1.72. We
predict that the win rate will peak at intermediate c values. However, similar to the Theta-Greedy results, we expect the
win rates to be generally lower than those in Plot 3. This decrease is likely due to the potential for overfitting caused by
pretraining, which might limit the algorithm’s ability to adapt effectively to new data.

Figure 5: Beta with pretraining


9.5 Thompson Sampling with Pretraining

Plot 5 depicts the performance of the Thompson Sampling (TS) algorithm with pretraining. The x-axis represents
the a and b parameters, calculated as 2ϵ + 0.01 and 1 − 2ϵ − 0.01, respectively. The y-axis shows the win rate after
100,000 steps. The win rate is expected to peak at intermediate values of ϵ, indicating an optimal trade-off between
exploration and exploitation. Pretraining is anticipated to provide a beneficial starting point, leading to improved win
rates compared to non-pretrained scenarios.

Figure 6: TS with pretraining

9.6 Thompson Sampling without Pretraining

Plot 6 displays the win rate of the Thompson Sampling algorithm without pretraining. The parameters a and b vary as
in Plot 5. We predict a peak in win rates at intermediate values of ϵ. Despite the absence of pretraining, TS is expected
to adapt well to the environment, potentially achieving win rates comparable to or slightly lower than those in Plot 5
due to the lack of an initial informed prior.


Figure 7: TS without pretraining

9.7 Analysis and Explanation

From the results, several key insights can be drawn:

1. Optimal Parameter Ranges: For all algorithms, the win rates peak at intermediate parameter values, highlighting the importance of balancing exploration and exploitation. Extreme parameter settings tend to degrade performance due to either excessive risk-taking or overly conservative strategies.
2. Impact of Pretraining: Pretraining generally results in lower win rates for the Theta-Greedy and Beta algorithms. This phenomenon can be attributed to overfitting, where the initial training data biases the algorithm, reducing its flexibility in adapting to new data. In contrast, pretraining appears beneficial for the Thompson Sampling algorithm, providing a solid initial prior that aids in faster convergence and improved performance.
3. Algorithm Sensitivity: The Theta-Greedy and Beta algorithms exhibit significant sensitivity to their respective parameters (ϵ and c). Small changes in these parameters can lead to substantial differences in win rates, underscoring the necessity of careful parameter tuning. Thompson Sampling, with its Bayesian approach, demonstrates robustness to parameter variations, maintaining relatively stable performance across a range of settings.
In summary, our analysis underscores the critical role of parameter optimization and pretraining in multi-armed bandit
problems. While pretraining can enhance performance for certain algorithms like Thompson Sampling, it may hinder
others due to overfitting. Balancing exploration and exploitation through optimal parameter settings remains essential
for maximizing cumulative rewards in dynamic environments.

10 Conclusions
In conclusion, this report has provided a comprehensive exploration of the multi-armed bandit problem and Thompson
Sampling algorithm, focusing on their theoretical underpinnings, practical implementations, and empirical evaluations.
Through comparative results, TS emerges as a robust strategy for balancing exploration and exploitation dynamics in
uncertain environments.
Key insights include TS’s ability to leverage Bayesian inference to maintain and update probabilistic beliefs about arm
reward probabilities, adapting its strategy based on accumulated evidence. Comparative evaluations highlight TS’s
superior performance over traditional algorithms like epsilon-greedy and UCB, particularly in scenarios with dynamic
or unknown reward distributions.
This study provides a comparative analysis of multi-armed bandit algorithms, highlighting the effectiveness of Thompson
Sampling and the potential of novel approaches like Graph AI and Sequence AI. Key findings emphasize the importance
of algorithm-specific parameter tuning and adaptive strategies for balancing exploration and exploitation. Future work
involves further optimization of Graph AI and Sequence AI, exploring their applications in more complex and dynamic
environments.
Looking forward, the integration of machine learning models within bandit algorithms represents a promising direction
for enhancing decision-making accuracy and adaptability. By combining statistical techniques with predictive modeling,
future research can further advance the capabilities of AI systems in addressing complex decision problems across
various domains.

11 External Resources and Tools Used in the Project


11.1 NumPy

NumPy, a fundamental package for scientific computing in Python, provided essential support for array operations,
mathematical functions, and random number generation. It significantly streamlined matrix manipulations and statistical
computations required for initializing bandit arms, updating reward distributions, and performing algorithm evaluations.

11.2 Matplotlib

Matplotlib, a popular plotting library in Python, was employed to visualize simulation results, including cumulative
rewards, convergence rates, and algorithm comparisons. Matplotlib’s comprehensive plotting functionalities allowed for
clear and informative visualization of data trends, aiding in the analysis and interpretation of algorithm performance.

11.3 Graph AI and Sequence AI Libraries

The project explored innovative approaches like Graph AI and Sequence AI, leveraging specialized libraries and frame-
works tailored for multi-armed bandit problems. These libraries incorporated advanced algorithms and methodologies,
such as graph traversal techniques and neural network architectures, to enhance decision-making capabilities in dynamic
and complex environments.

11.4 PyTorch

For implementations involving Sequence AI and neural network-based approaches, PyTorch provided a robust frame-
work for developing and training deep learning models. Its flexibility and computational efficiency supported the
integration of advanced machine learning techniques into the bandit simulation, enhancing predictive accuracy and
adaptive learning capabilities.

