
Inter-Slice Resource Allocation in Next-Generation Wireless Networks: Distributed Q-Learning Approach
Mostafa Naseri, Hojjat Navidan, Ingrid Moerman, Eli De Poorter, and Adnan Shahid
IDLab, Department of Information Technology at Ghent University - imec, Ghent, Belgium
{firstname.lastname}@ugent.be

Abstract—As we embrace the new era of mobile networks, a pressing need has emerged for dynamic, intelligent resource allocation capable of adapting to diverse network conditions. This study presents a novel distributed Q-learning approach for inter-slice physical resource block (PRB) allocation, emphasizing the significance of collaborative learning through information exchange among different cells. This distributed approach enables the model to effectively manage traffic scenarios not experienced during training. The proposed approach strives to optimize PRB utilization and ensures a balanced resource distribution among slices while consistently upholding service level agreements (SLAs). Our collaborative learning approach proves its efficacy by preventing unacceptable SLA violations, which occurred in 4% and 8% of the cases in unseen traffic scenarios for cells learning in isolation. These findings accentuate the potential of shared learning in advancing resource management strategies for future mobile networks, particularly in tackling unforeseen situations effectively.

Index Terms—Distributed Learning, Reinforcement Learning, Q-Learning, Service Level Agreement, Inter-Slice Resource Allocation

This work was executed within the imec.ICON 5GECO project, a research project bringing together academic research partners and industry partners. The 5GECO project was co-financed by imec and received support from Flanders Innovation & Entrepreneurship (project nr. HBC.2021.0673).

I. INTRODUCTION

The exponential growth in mobile data traffic, coupled with emerging applications and services, places exceptional demands on the mobile network infrastructure. The ubiquity of smart devices and the Internet of Things (IoT) requires future mobile networks to deliver ultra-reliable low-latency communication (URLLC), enhanced mobile broadband (eMBB), and massive machine-type communications (mMTC) [1]. This diverse demand landscape necessitates an evolution beyond conventional resource allocation strategies, compelling a shift towards more intelligent, adaptable, and efficient methodologies. Traditional methods, which often rely on precise mathematical models and centralized execution, struggle to accommodate the complexity and variation inherent in these multifaceted requirements. The mathematical modelling of such problems for a closed-form solution proves impracticable due to the diverse and dynamic nature of network demands. Furthermore, centralized execution of these traditional approaches introduces formidable computational complexity, resulting in scalability issues as network size and complexity grow. Given these constraints, reinforcement learning (RL), specifically Q-learning, emerges as a promising solution for effective resource management in the anticipated mobile network traffic scenarios [2], [3].

A. Efficiency Imperative of Resource Allocation in Future Mobile Networks

Efficient resource allocation, especially in the context of Physical Resource Block (PRB) allocation, serves as a pivotal performance metric in the evolving landscape of mobile networks. This core element of inter-slice management has garnered notable academic attention due to its consequential role in the effectiveness of network operations [4].

In the scope of network slicing, various virtual networks are instantiated on a shared physical infrastructure, each accommodating a wide array of use cases, service types, and performance needs. It is critical to judiciously allocate PRBs among these slices to meet service level agreement (SLA) requirements while ensuring optimal resource utilization. This dual objective poses a complex problem, as it necessitates addressing conflicting goals simultaneously [5].

Potential issues arising from ineffective PRB allocation include resource wastage, subpar network performance, SLA violations, and service quality degradation [6]. Adding to this complexity, the scalability and adaptability of mobile networks demand flexible allocation mechanisms that respond to evolving service demands. Thus, it is imperative for future mobile networks to develop proficient resource allocation strategies that balance diverse requirements and limited resources effectively. Consequently, this area warrants further research and development to optimize network efficiency.

B. Q-Learning in Mobile Networks

Q-learning, an off-policy RL algorithm, has emerged as a potent tool in the realm of resource allocation, predominantly due to its ability to learn and optimize policies over time based on interactions with its environment [1]. This RL algorithm presents a promising solution for effectively managing resources without the need for complete knowledge of the network dynamics. This section explores the role and potential advantages of Q-learning in resource allocation for future mobile networks, focusing on its simplicity, adaptability, and capacity for continuous improvement.
Despite the significant promise of Q-learning, several challenges impede its adoption in real-world mobile networks, including the exploration-exploitation trade-off, the curse of dimensionality, and the sensitivity to hyperparameters. Conversely, several factors provide unique opportunities for its implementation, including increasingly dynamic network environments, the development of more sophisticated learning algorithms, and the growing computational capacities of mobile devices and network infrastructure.

Among other works, [7] and [8] present innovative approaches to resource management in network slicing. The former introduces an LSTM-A2C algorithm that leverages deep reinforcement learning to efficiently allocate resources across network slices, taking into account user mobility. The latter proposes a hierarchical resource management framework that uses deep reinforcement learning for admission control and resource adjustment in a dynamic and uncertain environment, demonstrating the effectiveness of AI-based customized slicing. However, neither study addresses the challenge of unseen traffic scenarios. In contrast, our research specifically focuses on these unseen traffic scenarios, evaluating inter-slice resource management using traffic scenarios not seen during the training of one specific Q-learning agent.

This work aims to address the challenge of efficient resource allocation in 5G and beyond network slicing. We introduce a Q-learning-based approach that formulates network KPIs into state-action spaces and uses them to construct rewards, thereby aligning resource allocation more closely with the requirements of the different network slices. Furthermore, we explore the concept of information exchange via Q-table sharing across network cells, demonstrating its potential to enhance performance, particularly in cells with limited exposure to high traffic scenarios during training.

II. SYSTEM MODEL AND PROPOSED METHOD

A. System Model

The system model utilized in this study is based on a network composed of mobile cells, each featuring two slices: eMBB and URLLC. This setting is representative of advanced mobile networks, thus offering an appropriate context for exploring resource allocation strategies. Each slice accommodates two user equipments (UEs). The simulator employed in this study operates on a bandwidth of 200 MHz, reflecting the high-frequency characteristics of 5G and beyond mobile networks. Inter-slice resource allocation is governed by our proposed Q-learning agent, which distributes physical resource blocks (PRBs) among the slices in each cell. Within the individual slices, resource allocation to UEs is managed by a Round-Robin (RR) scheduler, which allocates the PRBs determined by the inter-slice scheduler among the UEs within each slice. It should be noted that the packet arrival rate of each UE varies randomly during both the training and testing phases. This stochastic nature of traffic mimics real-world network conditions, contributing to the robustness of our study and the practical applicability of our findings.

Given a set of states S representing the different conditions of the network, a set of actions A that the agent can take, i.e., different ways to allocate resources among slices, a state transition probability P(s'|s, a) defining the probability of transitioning from the current state s to the next state s' given action a, and a reward function r(s, a, s') providing the immediate reward for performing action a in state s and moving to state s', we want to find an optimal policy π* that maximizes the expected cumulative reward over time, defined as [9]:

\pi^* = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t} \gamma^{t} \, r(s_t, a_t, s'_t) \;\middle|\; \pi \right]    (1)

where π is a policy dictating what action to take in each state, γ is the discount factor controlling the importance of future rewards, and t indexes time steps.

The SLA satisfaction can be integrated into the reward function, where actions leading to higher SLA satisfaction yield higher rewards. This formulation allows us to utilize Q-learning to solve the optimization problem by iteratively updating the Q-values for state-action pairs and progressively refining the policy towards the optimal one [10].

B. Conceptual Framework

This subsection presents the proposed Q-table based reinforcement learning algorithm for inter-slice resource allocation, encompassing three major components: the digitized KPI states, the actions, and the reward function.

Digitized States (s): The states represent the crucial performance metrics of the mobile network slices, which are discretized to facilitate the learning process and match the discrete nature of tabular Q-learning. In the proposed system, the key performance indicators (KPIs) signify vital metrics characterizing the performance of the mobile network. These KPIs include the number of waiting packets in the buffer and the throughput (both per UE and per network slice). The process of discretization involves mapping these continuous KPIs into a finite set of discrete levels, aiding the reinforcement learning process by converting the problem into a more manageable state space. Each KPI is digitized into one of N predefined levels, enabling the Q-learning algorithm to handle the complex dynamics of the mobile network efficiently and effectively. Moreover, we use an additional KPI indicating the proximity to an illegal number of PRBs, i.e., a state where the number of PRBs requested exceeds the total available PRBs or falls below the minimum per slice.

Action (a): Actions constitute the different resource allocation strategies that the RL agent can employ in each state. The actions in our system correspond to the manipulation of PRBs within each slice. Each action comprises either taking, giving, or maintaining the current number of PRBs in a slice. Specifically, we consider three possible actions per slice: -4, 0, and +4. A negative action signifies taking PRBs from a slice, while a positive action denotes allocating additional PRBs to a slice. Action 0 corresponds to not altering the current allocation of PRBs for the slice. As a result, an action a is represented by a vector of length Nslc, where Nslc is the number of slices, and A is the set of all possible action vectors.
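To make the state and action definitions above concrete, the following minimal Python sketch shows one way the KPI digitization and the per-slice action set could be realized. The KPI value ranges, the number of levels, and the function names are illustrative assumptions, not values taken from the paper.

import itertools
import numpy as np

N_LEVELS = 4                      # assumed number of discrete levels per KPI
N_SLC = 2                         # two slices: eMBB and URLLC
PER_SLICE_ACTIONS = (-4, 0, 4)    # take, keep, or add 4 PRBs for a slice

def digitize_kpi(value, low, high, n_levels=N_LEVELS):
    """Map a continuous KPI reading onto one of n_levels discrete bins."""
    edges = np.linspace(low, high, n_levels + 1)[1:-1]
    return int(np.digitize(value, edges))

def build_state(throughput, buffered_pkts, near_illegal_prb):
    """Digitized state: per-slice KPI levels plus the illegal-PRB proximity flag."""
    tp_lvl = tuple(digitize_kpi(t, 0.0, 100e6) for t in throughput)    # bit/s, assumed range
    buf_lvl = tuple(digitize_kpi(b, 0, 200) for b in buffered_pkts)    # packets, assumed range
    return tp_lvl + buf_lvl + (int(near_illegal_prb),)

# The joint action space A: every combination of per-slice PRB adjustments.
# For example, (-4, 4) takes 4 PRBs from slice 0 (eMBB) and gives 4 PRBs to slice 1 (URLLC).
ACTION_SPACE = list(itertools.product(PER_SLICE_ACTIONS, repeat=N_SLC))

With two slices this yields |A| = 3^2 = 9 joint actions, matching the three per-slice choices described above.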
Piecewise Reward Function (r): The reward function guides the RL agent's learning, providing feedback based on the network performance resulting from the actions taken. Prior to defining the total reward r, we define a slice-specific reward rs based on the specific requirements of each slice. In this work, we use throughput as the slice reward for eMBB and buffer occupancy for URLLC. Buffer occupancy is used in the slice reward for URLLC due to its critical role in ensuring low latency and high reliability, key requirements for URLLC applications. High buffer occupancy can increase latency and decrease reliability, making it an essential metric to monitor and optimize in URLLC scenarios [11]. We also normalize the slice reward to the [0, 2] range such that rs = 1 is the threshold of SLA satisfaction for that slice, rs < 1 is the SLA violation range, and rs > 1 is the SLA satisfaction range. The total reward is constructed as a piecewise function to reflect the different priorities and implications of various KPI levels.

The reward function is designed as a piecewise function with different ranges and gradient terms to represent the varying priorities and implications of different network conditions. Specifically, the reward function considers factors such as rs and the PRB utilization u. For each slice, u takes a value in [0, 1] and reports what ratio of the allocated PRBs is assigned to UEs in that slice.

r = \begin{cases}
-11 + \dfrac{\min(a)}{\max(\max(A))} & \text{if Flag} = 1 \\
-11 - \dfrac{\max(\text{Assigned PRB})}{N_{PRBs}} & \text{else if Flag} = 2 \\
-10 + 0.5\,\min(r_s) & \text{else if } \min(r_s) < 1.25 \\
-10 + 2\,\min(u) + 8\,\dfrac{\sum_{i=1}^{N_{slc}} u[i]}{N_{slc}} & \text{else if } \min(r_s) < 1.75 \\
70\,\min(u) + 30\,\dfrac{\sum_{i=1}^{N_{slc}} u[i]}{N_{slc}} & \text{else}
\end{cases}    (2)

Flag = 1 denotes the case where the proposed action violates the minimum number of required PRBs for one of the slices. In this situation, the reward function penalizes the system but provides a small positive gradient based on the normalized minimum number of assigned PRBs.

In the second condition (Flag = 2), the reward discourages the system from getting close to an illegal PRB state, with a penalty that increases with the maximum number of assigned PRBs, in order to avoid using more PRBs when the allocation is already close to demanding more PRBs than are available.

In the third condition (min(rs) < 1.25), the reward function incentivizes the system to enhance the performance of the slice with the poorest performance. It is important to note that we introduce a 25% buffer when responding to SLA violations compared to the actual SLA violation threshold. In the fourth condition (min(rs) < 1.75), the system slightly encourages efficiency. When min(rs) > 1.75, the reward for improving PRB utilization in the slice with the poorest u consists of two terms: the minimum utilization and the average utilization, as shown in Eq. (2). This dual-term reward enables the system to enhance the efficiency of other slices even when improving the slice with the lowest performance is not feasible, preventing the agent from becoming stagnant.
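The piecewise reward of Eq. (2) translates almost line by line into code. The sketch below is a direct transcription of the five cases; the argument names and the convention that a single flag argument encodes the two illegal-PRB conditions are assumptions made for illustration.

def total_reward(flag, action, action_space, prb_assigned, n_prbs, r_s, u):
    """Piecewise total reward of Eq. (2).

    flag: 1 if the action violates the per-slice PRB minimum, 2 if it
          approaches the illegal-PRB region, 0 otherwise.
    action: per-slice PRB adjustments chosen this step, e.g. (-4, 4).
    r_s:  normalized slice rewards in [0, 2] (1 = SLA threshold).
    u:    per-slice PRB utilization values in [0, 1].
    """
    max_action = max(max(a) for a in action_space)   # max(max(A)), equal to 4 here
    avg_u = sum(u) / len(u)                          # average utilization over slices

    if flag == 1:
        return -11 + min(action) / max_action
    elif flag == 2:
        return -11 - max(prb_assigned) / n_prbs
    elif min(r_s) < 1.25:
        return -10 + 0.5 * min(r_s)
    elif min(r_s) < 1.75:
        return -10 + 2 * min(u) + 8 * avg_u
    else:
        return 70 * min(u) + 30 * avg_u

The large constant offsets between the cases are what create the inter-range gaps discussed later: the agent must first leave the penalized ranges before fine-tuning utilization inside the last range.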
C. Simulation Environment and Implementation Details

This section elucidates the simulation environment utilized to test and validate the proposed method.

Cell Capacity Simulator: In this work, we use the Py5cheSim cell capacity simulator [12]. The simulator serves as a virtual environment that mimics the operation of a mobile network, generating essential network parameters such as throughput and the number of packets in the buffer. For the testing and validation of the proposed Q-learning approach, we have modified Py5cheSim to support different traffic scenarios along with user-defined inter-slice resource allocation algorithms. This simulator was chosen due to its unique capabilities and flexibility in simulating network slicing at the access network level, a pivotal feature for our proposed method.

A key aspect of Py5cheSim is its ability to support dynamic inter-slice PRB allocation. This capability allows the PRBs allocated to different slices to change dynamically according to the decisions made by the inter-slice scheduler, thereby closely mimicking the real-world operation of a 5G network. Additionally, this dynamic allocation can be configured with variable time granularity, providing further flexibility to align the simulation environment with the specific requirements and conditions of the study.

Implementation Details: The implementation of the Q-learning algorithm involves several critical steps, including the initialization of the Q-table, the decay of the exploration factor ε, and the update of Q-values after each action. Q-values, stored in a Q-table with dimensions aligned to the digitized state and action spaces, serve as estimates of the expected cumulative rewards an agent can achieve by executing particular actions in various states.

To balance the exploration of unvisited state-action pairs and the exploitation of current knowledge, we use an ε-greedy strategy, where ε represents the probability of the agent taking a random action. Initially, ε is set to 1, encouraging maximum exploration. As the agent gains more knowledge about the environment, ε decays at a rate of 0.999 after each episode, thereby gradually increasing the emphasis on exploiting the learned values.

The Q-values are updated after each action according to the observed state s and received reward r. This update is performed using the Q-learning update rule [3]:

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

In this equation, Q(st, at) is the current Q-value of taking action at in state st, rt+1 is the reward received after performing the action, α is the learning rate, γ is the discount factor representing the importance of future rewards, and maxa Q(st+1, a) is the maximum Q-value attainable in the next state st+1. This rule allows the agent to iteratively refine its Q-values, effectively guiding it towards actions that maximize the cumulative future reward.
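The ε-greedy policy and the update rule above condense into a few lines of Python. In this sketch, α = 0.1 and γ = 0.05 follow the values reported in Section III-A and the 0.999 decay follows the description above; the dictionary-backed Q-table is an illustrative simplification of the multi-dimensional table described in the text, not the paper's implementation.

import random
from collections import defaultdict

class TabularQAgent:
    """Minimal tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, actions, alpha=0.1, gamma=0.05, eps=1.0, eps_decay=0.999):
        self.q = defaultdict(float)        # Q[(state, action)] -> value, 0 by default
        self.actions = actions             # e.g. ACTION_SPACE from the earlier sketch
        self.alpha, self.gamma = alpha, gamma
        self.eps, self.eps_decay = eps, eps_decay

    def select_action(self, state):
        # Explore with probability eps, otherwise exploit the current Q-table.
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q-learning update: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

    def end_episode(self):
        # Decay exploration once per episode, as described above.
        self.eps *= self.eps_decay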
D. Intra-Slice Resource Allocation

We utilize an RR scheduler for intra-slice resource allocation. This ensures a fair allocation strategy, sequentially assigning resources to users and thereby maintaining a balanced distribution. The count of available resources for each slice, managed by the inter-slice resource allocation algorithm, serves as an input to the RR scheduler. This hierarchical approach fosters efficient and equitable resource allocation, meeting diverse user requirements within the slice.

Algorithm 1 Inter-slice Resource Allocation with Information Exchange
Require: Initial Q-tables for all cells
 1: for each episode do
 2:   for each time step within episode do
 3:     for each cell do
 4:       Choose action according to policy derived from Q-table
 5:       Execute action, observe new state and reward
 6:       Update Q-table using observed reward and new state
 7:     end for
 8:     if time step % 10 = 0 then
 9:       Share Q-tables among cells
10:       Each cell averages the received Q-tables and updates its own
11:     end if
12:   end for
13: end for
Ensure: Refined Q-tables for all cells
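Algorithm 1 can be rendered compactly in Python. The sketch below reuses the TabularQAgent from the previous sketch and assumes a simple per-cell environment object exposing reset() and step(); these names are chosen for illustration and are not part of the Py5cheSim API. The 10-step sharing period follows Algorithm 1, and cells that learn in isolation would simply be excluded from the list passed to average_q_tables.

from collections import defaultdict

def average_q_tables(agents):
    """Each participating cell replaces its Q-table with the element-wise average."""
    keys = set().union(*(agent.q.keys() for agent in agents))
    averaged = {k: sum(agent.q[k] for agent in agents) / len(agents) for k in keys}
    for agent in agents:
        agent.q = defaultdict(float, averaged)

def train_distributed(envs, agents, n_episodes, steps_per_episode, share_every=10):
    """Distributed training loop of Algorithm 1 with periodic Q-table exchange."""
    for _ in range(n_episodes):
        states = [env.reset() for env in envs]                 # one environment per cell
        for t in range(steps_per_episode):
            for i, (env, agent) in enumerate(zip(envs, agents)):
                action = agent.select_action(states[i])
                next_state, reward = env.step(action)          # assumed environment API
                agent.update(states[i], action, reward, next_state)
                states[i] = next_state
            if t % share_every == 0:
                average_q_tables(agents)                       # share and average Q-tables
        for agent in agents:
            agent.end_episode()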
III. PERFORMANCE EVALUATION

A. Evaluation Metrics and Methods

This section presents the evaluation metrics and their significance in assessing the performance of the proposed approach. Fig. 1 provides a visual representation of the agent's learning progression. It can be observed that convergence is achieved after approximately 3000 episodes. The training process for the agent was conducted with a learning rate (α) of 0.1 and a discount factor (γ) of 0.05, highlighting the effectiveness of these parameters in facilitating efficient learning and convergence. Our simulation unfolds over episodes lasting six seconds each, with both the measurement interval and the action granularity set at 10 ms. Adopting an online training approach, we promptly update the Q-values following each action. In the scope of this study, we primarily focus on buffer occupancy, throughput, and PRB utilization as KPIs. We have elected throughput as the paramount metric for eMBB slices due to their emphasis on high data rates. Conversely, for URLLC slices, we primarily consider buffer occupancy as the crucial KPI, owing to the importance of ultra-reliable and low-latency communication in their associated applications.

Fig. 1: Training progression of the Q-learning agent: The figure depicts the average reward per episode and the corresponding exponential moving average. The varying background color illustrates the changing value of the exploitation probability at different stages of the learning process. β is a parameter that controls the weight given to the past values.

We benchmark our Q-learning-based inter-slice resource allocation agent against the widely used RR scheduler, noted for its fairness and simplicity, similar to [13]. To validate our approach, we generate varied traffic conditions using random packet arrivals, controlled by a packet arrival rate. Replication of our results is enabled by providing the same seed values used in our experiments. We ensure robustness by evaluating our agent under low and high network traffic load scenarios, mimicking a range of real-world network conditions.

B. Comparison between Q-Learning and RR Inter-Slice Schedulers

This section includes graphical representations and discussion of the performance comparison between the proposed learning scheduler and the RR inter-slice scheduler, demonstrating the PRB-saving capability of the learning scheduler.

Fig. 2 presents the empirical cumulative distribution function (CDF) of the percentage of unallocated PRBs for both RR and our proposed Q-learning method, under high and low traffic loads. The RR curve consistently resides above the Q-learning curves, implying a lower percentage of unallocated PRBs for RR. Notably, as network traffic escalates, Q-learning exhibits increased resource utilization to maintain optimal network performance. This illustrates the adaptive nature of our Q-learning approach, adjusting resource allocation dynamically according to network conditions.

Fig. 2: This figure displays the empirical CDF of the percentage of unallocated PRBs in both high- and low-traffic scenarios. The RR curve resides above the Q-learning curves, signifying fewer unallocated PRBs in comparison.

C. Satisfaction of SLAs

In this study, we designed our reward structure, as described by (2), to prioritize the SLA satisfaction of each slice. The individual slice reward, rs, is calculated based on the KPIs pertinent to that slice. Here, we employ a sigmoid function to normalize KPIs into a range of [0, 2], wherein an rs value equal to or greater than 1 indicates the satisfaction of the SLA for a given slice. The sigmoid function is defined as \sigma(x; a, b) = \frac{1}{1 + e^{-(x-b)/a}}, where a represents the shape parameter controlling the steepness of the sigmoid curve, and b represents the position parameter determining the horizontal shift of the curve.
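As a small illustration, the mapping of a raw KPI to the slice reward rs could look as follows. Since σ itself ranges over (0, 1), this sketch assumes the [0, 2] range is obtained by scaling the sigmoid by 2, and the shape and position parameters are placeholders rather than values from the paper; for KPIs where lower is better, such as URLLC buffer occupancy, the sign of the shape parameter would be flipped so that better performance still maps to a higher reward.

import math

def sigmoid(x, a, b):
    """sigma(x; a, b) = 1 / (1 + exp(-(x - b) / a))."""
    return 1.0 / (1.0 + math.exp(-(x - b) / a))

def slice_reward(kpi, a, b):
    """Assumed mapping of a raw KPI to r_s in (0, 2); r_s = 1 at kpi = b (SLA threshold)."""
    return 2.0 * sigmoid(kpi, a, b)

# Example with illustrative numbers only: an eMBB slice whose SLA throughput target is 50 Mbps.
r_s_embb = slice_reward(kpi=62e6, a=10e6, b=50e6)   # roughly 1.5, i.e. SLA satisfied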
The design of our piecewise reward function significantly guides the agent's behavior. A large inter-range gap incentivizes the agent to transition between ranges, while potentially neglecting optimization within the current range. Conversely, a small inter-range gap may lead to the agent focusing more on intra-range optimization, potentially at the expense of moving to more beneficial ranges.

Similarly, the intra-range amplitude, the variation in reward within a single range, affects the agent's behavior. A larger amplitude encourages the agent to optimize its actions within the current range, while a smaller amplitude might motivate the agent to transition to other ranges, sometimes at the cost of optimal performance within the current range.

In essence, careful tuning of the reward function's properties is required to ensure a balance between intra- and inter-range optimization, thereby guiding the agent to the most advantageous behaviors. As illustrated in Fig. 3, under low-traffic conditions, both methods effectively allocate resources, with Q-learning demonstrating an edge in PRB utilization efficiency.

As traffic load increases, the advantages of Q-learning become more pronounced. While the RR approach allocates resources in a static and cyclical pattern, regardless of the traffic load, Q-learning adapts to the higher demand by dynamically allocating PRBs. It reduces the percentage of unallocated PRBs, making better use of available resources to maintain network performance under pressure. This robustness to varying traffic loads suggests the suitability of Q-learning in dynamic and diverse network environments, underlining its potential for optimizing resource allocation in advanced mobile networks.

Fig. 3: Empirical cumulative distribution functions (eCDFs) for slice rewards of eMBB and URLLC under the Q-learning and RR scheduling algorithms, for (a) low traffic load and (b) high traffic load. The vertical line at slice reward = 1 represents the SLA satisfaction threshold. The eCDFs are calculated using 60 thousand measurements in 30 episodes.

IV. Q-TABLE SHARING

A. Significance and Mechanism of Q-table Sharing

The value of information sharing during the learning phase is illustrated in this section. Specifically, we consider a setting wherein multiple cells, namely Cell 2 through Cell 5, engage in a periodic exchange of Q-tables while training. The global Q-table is constructed using an averaging approach, incorporating information from all contributing cells [14].

Conversely, Cell 1 is deployed as a control cell that pursues Q-learning in isolation, without any interaction with the shared global model. This design provides a basis for comparing the effectiveness of information sharing versus individual learning. Throughout the training phase, only Cell 5 experiences high traffic conditions. Upon completion of the training, we assess the performance of the Q-learning agents across unseen high-load traffic scenarios.

Compared to centralized PRB allocation, Q-table sharing reduces communication overhead as it is performed only periodically. Moreover, this decentralized method avoids the computational complexity of calculating PRB allocations globally in a centralized manner, offering a more scalable and practical solution for large network environments.
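The SLA-violation percentages and eCDF curves reported in the following figures can be obtained directly from logged slice-reward samples. The sketch below is a generic illustration of that computation, not code from the paper; the threshold of 1 corresponds to the SLA satisfaction boundary defined for rs.

import numpy as np

def ecdf(samples):
    """Return the sorted sample values and the corresponding empirical CDF levels."""
    x = np.sort(np.asarray(samples))
    f = np.arange(1, len(x) + 1) / len(x)
    return x, f

def sla_violation_rate(slice_rewards, threshold=1.0):
    """Fraction of measurement intervals in which r_s fell below the SLA threshold."""
    r = np.asarray(slice_rewards)
    return float(np.mean(r < threshold))

# Example: over 60,000 logged measurements per slice, a result of 0.04 (4%) would
# correspond to the eMBB violation rate reported for the isolated cell in Fig. 4.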
Fig. 4 offers a vivid illustration of the advantages conferred by Q-table sharing. While the SLAs are violated approximately 4% and 8% of the time for the eMBB and URLLC slices, respectively, in Cell 1, the cell devoid of any information sharing, a remarkable improvement is observed in the case of Cell 2, which participates in the Q-table exchange despite not encountering high traffic during training. Thus, the benefits of collaborative learning and information sharing in efficiently managing network resources under diverse load conditions are underscored.

Fig. 4: Empirical CDF of slice rewards for eMBB and URLLC in Cell 1 (isolated learning) and Cell 2 (part of distributed learning). The figure underscores the impact of information sharing, revealing that Cell 1 experiences SLA violations in 4% and 8% of the cases for the eMBB and URLLC slices, respectively, highlighting the enhanced performance achieved through distributed learning.

V. CONCLUSION

In this research, we have underscored the potential of Q-learning for inter-slice resource management in emerging wireless networks, particularly focusing on cross-cell Q-table exchange. Our analysis shows how this shared learning technique can improve network performance under high-traffic conditions, even when individual cells have not been exposed to such scenarios during training.

The value of such information sharing holds significant implications for future mobile networks, demonstrating the potential to enhance resource utilization and uphold service level agreements under diverse network conditions. We anticipate that these insights will prompt further exploration in leveraging advanced reinforcement learning techniques, and particularly shared learning mechanisms, in the field of wireless networking. Ultimately, our research underscores the pivotal role of shared intelligence in shaping the future of efficient and adaptive mobile network management.

Potential future work could explore an assortment of advanced reinforcement learning techniques, such as deep reinforcement learning or multi-agent reinforcement learning, to address more complex and realistic network scenarios. Additionally, efficient ways of encoding and sharing Q-tables to further reduce the overhead of information exchange could be investigated. Further research may also examine the possibility of integrating the proposed mechanism with network slicing orchestration strategies in real time, incorporating varying traffic patterns and user mobility. Lastly, considering the stochastic nature of wireless networks, future work could also look into the application of online learning algorithms for robust and adaptive resource management.

REFERENCES

[1] P. Wei, K. Guo, Y. Li, J. Wang, W. Feng, S. Jin, N. Ge, and Y.-C. Liang, "Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence," IEEE Access, vol. 10, pp. 65156–65192, 2022.
[2] Bei Yang, Yiqian Xu, Xiaoming She, Jianchi Zhu, Fengsheng Wei, Peng Cheri, and Jianxiu Wang, "Reinforcement Learning Based Dynamic Resource Allocation for Massive MTC in Sliced Mobile Networks," in 2022 IEEE 14th International Conference on Advanced Infocomm Technology (ICAIT), pp. 298–303, 2022.
[3] Yi Shi, Yalin E. Sagduyu, and Tugba Erpek, "Reinforcement Learning for Dynamic Resource Optimization in 5G Radio Access Network Slicing," in 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pp. 1–6, 2020, doi: 10.1109/CAMAD50429.2020.9209299.
[4] Fadoua Debbabi, Rihab Jmal, Lamia Chaari Fourati, and Rui Luis Aguiar, "An Overview of Interslice and Intraslice Resource Allocation in B5G Telecommunication Networks," IEEE Transactions on Network and Service Management, vol. 19, no. 4, pp. 5120–5132, 2022.
[5] Joao Francisco Nunes Pinheiro, Chia-Yu Chang, Tom Collins, Eric Smekens, Revaz Berozashvili, Adnan Shahid, Danny De Vleeschauwer, Paola Soto, Ingrid Moerman, Johann Marquez-Barja, Jens Buysse, and Miguel Camelo, "5GECO: A Cross-domain Intelligent Neutral Host Architecture for 5G and Beyond," in INFOCOM 2023 NG-OPERA: Next-generation Open and Programmable Radio Access Networks, New York, NY, USA, pp. 1–6, 2023.
[6] Xuezhi Zeng, Saurabh Garg, Mutaz Barika, Sanat Bista, Deepak Puthal, Albert Y. Zomaya, and Rajiv Ranjan, "Detection of SLA Violation for Big Data Analytics Applications in Cloud," IEEE Transactions on Computers, vol. 70, no. 5, pp. 746–758, 2021.
[7] Rongpeng Li, Chujie Wang, Zhifeng Zhao, Rongbin Guo, and Honggang Zhang, "The LSTM-Based Advantage Actor-Critic Learning for Resource Management in Network Slicing With User Mobility," IEEE Communications Letters, vol. 24, no. 9, pp. 2005–2009, 2020.
[8] Wanqing Guan, Haijun Zhang, and Victor C. M. Leung, "Customized Slicing for 6G: Enforcing Artificial Intelligence on Resource Management," IEEE Network, vol. 35, no. 5, pp. 264–271, 2021.
[9] Yuxiu Hua, Rongpeng Li, Zhifeng Zhao, Xianfu Chen, and Honggang Zhang, "GAN-Powered Deep Distributional Reinforcement Learning for Resource Management in Network Slicing," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 334–349, 2020.
[10] Xiaolan Liu, Jiadong Yu, Zhiyong Feng, and Yue Gao, "Multi-agent reinforcement learning for resource allocation in IoT networks with edge computing," China Communications, vol. 17, no. 9, pp. 220–236, 2020.
[11] Ali A. Esswie, Klaus I. Pedersen, and Preben E. Mogensen, "Semi-Static Radio Frame Configuration for URLLC Deployments in 5G Macro TDD Networks," in 2020 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, 2020.
[12] G. Pereyra, L. Inglés, C. Rattaro, and P. Belzarena, "An open source multi-slice cell capacity framework," CLEI Electronic Journal, vol. 25, no. 2, pp. 2–1, 2022.
[13] Doan Perdana, Aji Nur Sanyoto, and Yoseph Gustommy Bisono, "Performance evaluation and comparison of scheduling algorithms on 5G networks using network simulator," International Journal of Computers Communications & Control, vol. 14, no. 4, pp. 530–539, 2019.
[14] Bilal H. Abed-Alguni, David J. Paul, Stephan K. Chalup, and Frans A. Henskens, "A comparison study of cooperative Q-learning algorithms for independent learners," Int. J. Artif. Intell., vol. 14, no. 1, pp. 71–93, 2016.
